Google and Factual Struggle to Define the Data Marketplace

This week, Google and Factual announced features that will extend and enrich their existing big data offerings. Along with O’Reilly’s Strata Conference earlier this month (and GigaOM’s Structure Big Data conference next month), the recent news suggests that the big data marketplace is growing increasingly complex as startups and established IT companies struggle to control its future.

In 2007, Google acquired Gapminder’s Trendalyzer software. That software, now Google Public Data Explorer, was designed to make visualizing complex data easier, and included access to a finite set of “useful” official data from institutions such as the World Bank. Earlier this week, Google added the capability to upload new datasets for visualization. Google, of course, has a knack for taking existing capabilities and pushing them to large new audiences.

Perhaps more significantly, Google also this week released a new format for encoding and describing data sets. The Dataset Publishing Language (DSPL) enables data owners to describe structure within their data (“continents” contain “countries”), and takes some rudimentary steps toward encouraging the linking of structures and concepts between data sets.

Factual, meanwhile, is concerned with gathering specific types of data from multiple sources, amalgamating them, and then making the whole available to third-party applications via APIs. The company is probably best known for its location-based data (used to power Facebook Places in the UK and Japan, for example), and this week’s announcement adds richness and breadth to that data set.

As data volumes continue to grow, new and existing companies scramble to understand and service a new set of requirements. Perhaps the easiest requirements to understand are those around sharing (or selling) collected data and finding data shared (or sold) by third parties. To borrow from the loose taxonomy proposed in a Strata Conference presentation by Stratus Security CEO Pete Soderling and BuzzData co-founder Pete Forde, solutions in this space look a lot like online mail order catalogs. They gather short descriptions for large numbers of products (the data sets) and make them straightforward to acquire from a single source. Examples include Infochimps and Microsoft’s Dallas, now renamed the Windows Azure Marketplace DataMarket. The big Open Government Data initiatives in the U.S., the UK and elsewhere around the world also fall into this category, although the open licensing of their data means that it will be found sitting behind almost every data marketplace across the web.

Also relatively straightforward is the requirement to visualize and manipulate data, and Google, Many Eyes, Impure and Timetric are some of the most visible examples of this type. Visualization is typically the point of these applications, rather than an afterthought.

Matters become far more complex when you want to start combining different data sets, even within a single data marketplace. Typically, these services are not designed for it, and there is insufficient metadata to enable sensible combinations. For example, “height” of buildings in one data set combined with “height” of, say, trees or mountains in another is a recipe for disaster if one is measured in feet and the other in meters. Without knowledge of the units used, the newly combined data set is worthless and, possibly, dangerously misleading. Factual is already doing some of the work to tidy data that it collects, but Google’s DSPL is an interesting example of encouraging data owners to make these things explicit themselves.
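The units problem can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dataset layout (a "unit" field alongside the rows), the function name and the sample values are all hypothetical, and none of it reflects how DSPL or Factual actually represent data. The point is simply that a merge can be made to fail loudly when unit metadata is missing, rather than silently mixing feet with meters.

```python
# Conversion factors to meters for the units this sketch knows about.
TO_METERS = {"m": 1.0, "ft": 0.3048}


def combine_heights(*datasets):
    """Merge datasets into one list of (name, height_in_meters) pairs.

    Each dataset is a dict with a "unit" key and a "rows" list of
    (name, height) pairs. This layout is an assumption of the sketch.
    """
    combined = []
    for ds in datasets:
        unit = ds.get("unit")
        if unit not in TO_METERS:
            # Refuse to guess: without units the combined data is meaningless.
            raise ValueError(
                f"dataset {ds.get('title')!r} has no usable unit metadata"
            )
        factor = TO_METERS[unit]
        combined.extend((name, height * factor) for name, height in ds["rows"])
    return combined


# Hypothetical sample data: one set in feet, one in meters.
buildings = {"title": "buildings", "unit": "ft", "rows": [("Tower A", 1000.0)]}
mountains = {"title": "mountains", "unit": "m", "rows": [("Peak B", 2500.0)]}

merged = combine_heights(buildings, mountains)
```

Here the building height is converted to meters before the two sets are joined, while a dataset that omits its unit raises an error instead of contaminating the result.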

If data marketplaces are to realize their full potential, they need to demonstrate the power of combining data drawn from different sources: internal enterprise systems, Open Government repositories and the pages of the catalogs. With Google clearly approaching the problem from one end, and a number of rather stealthy escapees from the world of the Semantic Web coming at it from the other, it will be fascinating to see which is first to deliver a solution that is both compelling and usable.

Question of the week

Would you rely upon data sourced from one of these data marketplaces?