Data might be the new oil, but a lot of us just need gasoline

One of the biggest tropes in the era of big data is that data is the new oil — it’s very valuable to the companies that have it, but only after it has been mined and processed. The analogy makes some sense, but it ignores the fact that people and companies don’t have the means to collect the data they need or the ability to process it once they have it. A lot of us just need gasoline.

Which is why I was excited to see the new Data for Everyone initiative that crowdsourcing startup CrowdFlower released on Wednesday. It’s a library of interesting and free datasets that have been gathered by CrowdFlower’s users over the years and verified by the company’s crowdsourced labor force. Topics range from Twitter sentiment on various subjects to a collection of labeled medical images.

Data for Everyone is far from comprehensive or from being any sort of one-stop shop for data democratization, but it is a good approach to a problem that lots of folks have been trying to solve for years: giving people interested in analyzing valuable data access to that data in a meaningful way. Unfortunately, early attempts at data marketplaces such as Infochimps and Quandl, and even earlier incarnations of the federal service, often included poorly formatted data or suffered from a dearth of interesting datasets.

An example of what's available in Data for Everyone.

It’s often said that data analysts spend 85 percent of their time formatting data and only 15 percent of it actually analyzing data — a situation that is simply untenable for people whose jobs don’t revolve around data, even as tools for data analysis continue to improve. All the Tableau software or Watson Analytics or DataHero or PowerBI services in the world don’t do a whole lot to help mortals analyze data when it’s riddled with errors or formatted so sloppily it takes a day just to get it ready to upload.

Hopefully, we’ll start to see more high-quality data markets pop up, as well as better tools for collecting data from services such as Twitter. They don’t necessarily need to be so easy a 10-year-old can use them, but they do need to be easy enough that someone with basic programming or analytic skills can get up and running without quitting their day job. Data for Everyone looks like one such effort, as does the new Wolfram Data Drop, also announced on Wednesday.

Because while it’s getting a lot easier for large companies and professional data scientists to collect their data and analyze it for purposes ranging from business intelligence to training robotic brains — topics we’ll be discussing at our Structure Data conference later this month — the little guy, strapped for time and resources, still needs more help.

A massive database now translates news in 65 languages in real time

I have written quite a bit about GDELT (the Global Database of Events, Languages and Tone) over the past year, because I think it’s a great example of the type of ambitious project only made possible by the advent of cloud computing and big data systems. In a nutshell, it’s a database of more than 250 million socioeconomic and geopolitical events and their metadata dating back to 1979, all stored (now) in Google’s cloud and available to analyze for free via Google BigQuery or custom-built applications.

On Thursday, version 2.0 of GDELT was unveiled, complete with a slew of new features — faster updates, sentiment analysis, images, a more-expansive knowledge graph and, most importantly, real-time translation across 65 different languages. That’s 98.4 percent of the non-English content GDELT monitors. Because you can’t really have a global database, or expect to get a full picture of what’s happening around the world, if you’re limited to English language sources or exceedingly long turnaround times for translated content.

For a quick recap of GDELT, you can read the story linked to above, as well as our coverage of project creator Kalev Leetaru’s analyses of the Arab Spring, the Ukrainian crisis and the Ebola outbreak. For a deeper understanding of the project and its creator (who also helped measure the “Twitter heartbeat” and uploaded millions of images from the Internet Archive’s digital book collection to Flickr), check out our Structure Show podcast interview with Leetaru from August. He’ll also be presenting on GDELT and his future plans at our Structure Data conference next month.


A time-series analysis of the Arab Spring compared with similar periods since 1979.

Leetaru explains GDELT 2.0’s translation system in some detail in a blog post, but even at a high level the methods it uses to achieve near real-time speed are interesting. It works sort of like buffering does on Netflix:

“GDELT’s translation system must be able to provide at least basic translation of 100% of monitored material every 15 minutes, coping with sudden massive surges in volume without ever requiring more time than the 15 minute window. This ‘streaming’ translation is very similar to streaming compression, in which the system must dynamically modulate the quality of its output to meet time constraints: during periods with relatively little content, maximal translation accuracy can be achieved, with accuracy linearly degraded as needed to cope with increases in volume in order to ensure that translation always finishes within the 15 minute window. In this way GDELT operates more similarly to an interpreter than a translator. This has not been a focal point of current machine translation research and required a highly iterative processing pipeline that breaks the translation process into quality stages and prioritizes the highest quality material, accepting that lower-quality material may have a lower-quality translation to stay within the available time window.”
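To make the trade-off in that quote concrete, here is a minimal Python sketch of a time-budgeted batch. It is not GDELT’s actual pipeline: the three quality tiers, their per-document costs, the `Doc` class, the `source_quality` score and the `plan_batch` function are all hypothetical names and numbers chosen for illustration, and it uses discrete tiers rather than the linear degradation Leetaru describes. The idea it shows is simply that every document gets a basic translation first, and whatever time is left in the 15-minute window is spent upgrading the best material.

```python
from dataclasses import dataclass

# The 15-minute window from the quote, in milliseconds.
WINDOW_MS = 15 * 60 * 1000

# Hypothetical quality tiers, best first: (name, cost per document in ms).
TIERS = [("full", 2000), ("reduced", 500), ("basic", 100)]


@dataclass
class Doc:
    doc_id: str
    source_quality: float  # hypothetical score; higher means better material


def plan_batch(docs, budget_ms=WINDOW_MS, tiers=TIERS):
    """Pick a translation tier per document so the whole batch fits the budget."""
    basic_name, basic_cost = tiers[-1]
    # Everything gets at least the cheapest tier, so 100% of material is covered.
    plan = {doc.doc_id: basic_name for doc in docs}
    spent = basic_cost * len(docs)
    if spent > budget_ms:
        raise ValueError("even basic translation does not fit the window")
    # Spend leftover time on upgrades, highest-quality material first.
    for doc in sorted(docs, key=lambda d: d.source_quality, reverse=True):
        for name, cost in tiers:
            extra = cost - basic_cost
            if spent + extra <= budget_ms:
                plan[doc.doc_id] = name
                spent += extra
                break
    return plan, spent


docs = [Doc("a", 0.9), Doc("b", 0.5), Doc("c", 0.1)]
quiet_plan, quiet_spent = plan_batch(docs)                  # plenty of time
surge_plan, surge_spent = plan_batch(docs, budget_ms=2700)  # tight budget
```

With the full window available, all three documents land in the top tier. When the budget shrinks, only the best document keeps the top tier and the rest fall back, which is the interpreter-style behavior the quote describes.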

In addition, Leetaru wrote:

“Machine translation systems . . . do not ordinarily have knowledge of the user or use case their translation is intended for and thus can only produce a single ‘best’ translation that is a reasonable approximation of the source material for general use. . . . Using the equivalent of a dynamic language model, GDELT essentially iterates over all possible translations of a given sentence, weighting them both by traditional linguistic fidelity scores and by a secondary set of scores that evaluate how well each possible translation aligns with the specific language needed by GDELT’s Event and GKG systems.”
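The two-score weighting in that passage can be sketched in a few lines of Python. Again, this is an illustration under loud assumptions, not GDELT’s code: the `domain_score` and `pick_translation` functions, the `EVENT_TERMS` vocabulary, the fidelity numbers and the 50/50 weighting are all invented here, and the sketch scores a short list of candidates rather than iterating over all possible translations.

```python
def domain_score(text, domain_terms):
    """Fraction of words in the candidate that appear in the domain vocabulary."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(word in domain_terms for word in words) / len(words)


def pick_translation(candidates, domain_terms, domain_weight=0.5):
    """Choose the best candidate from (text, fidelity) pairs, fidelity in [0, 1]."""
    def combined(candidate):
        text, fidelity = candidate
        alignment = domain_score(text, domain_terms)
        # Blend the general-purpose score with the domain-specific one.
        return (1 - domain_weight) * fidelity + domain_weight * alignment

    return max(candidates, key=combined)[0]


# Hypothetical vocabulary that a downstream event coder might look for.
EVENT_TERMS = {"negotiations", "protest", "attack"}
candidates = [
    ("officials met to talk", 0.9),         # reads better in general
    ("officials held negotiations", 0.8),   # uses the event vocabulary
]
general_pick = pick_translation(candidates, EVENT_TERMS, domain_weight=0.0)
event_pick = pick_translation(candidates, EVENT_TERMS)
```

With the domain weight at zero, the more fluent first candidate wins, which is the single “best” translation a general-purpose system would return. With the domain score blended in, the second candidate wins because it uses wording the downstream system can act on.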

It will be interesting to see whether and how usage of GDELT picks up with the broader, and richer, scope of content it now covers. With an increasingly complex international situation that runs the gamut from climate change to terrorism, it seems like world leaders, policy experts and even business leaders could use all the information they can get about what’s connected to what, who’s connected to whom and how this all might play out.


Open data progress is slow, warns Web Foundation

Accessible open data about government spending and services remains a pipe dream across most of the world, an 86-country survey by the World Wide Web Foundation has found.

The second edition of the Open Data Barometer, which came out on Tuesday, showed that fewer than 8 percent of surveyed countries publish datasets on things like government budgets, spending and contracts, and on the ownership of companies, in bulk machine-readable formats and under open re-use licenses.

This is particularly disappointing as both the G7 and G20 groups of countries have said they will try to create more governmental transparency by providing open data that anyone can crunch and build new businesses upon. Globally, the report states that “the trend is towards steady, but not outstanding, growth in open data readiness and implementation.”

According to web inventor and Foundation founder Tim Berners-Lee:

“The G7 and G20 blazed a trail when they recognised open data as a crucial tool to strengthen transparency and fight corruption. Now they need to keep their promises to make critical areas like government spending and contracts open by default. The unfair practice of charging citizens to access public information collected with their tax resources must cease.”

The G7 (which was the G8 before Russia left last year) signed a charter in 2013 in which the advanced economies said they would be open “by default”, and would publish key datasets in that year.

Now, out of those nations, only the U.K. has an open company register and only the U.K. and Canada publish land ownership data in open formats and under open licenses. Only the U.K. and the U.S. publish detailed open data on government spending, and only the U.S., Canada and France publish open data on national environment statistics. Open mapping data is only published in the U.K., the U.S. and Germany.

As you can no doubt tell, the U.K. is the global leader in this field, followed by the U.S., then Sweden, then France and New Zealand in tied fourth place – France is improving rapidly, having been in 10th place in 2013. G7 members Japan and Italy languish in 19th and 22nd place respectively, publishing almost no key datasets as open data except for Japan’s crime statistics. (Incidentally, the University of Chicago’s Jens Ludwig will be giving an interesting talk about tackling crime with data at our upcoming Structure Data conference in March.)

Of those in the “emerging and advancing” cluster of countries, Spain and Chile (up 10 places on 2013) are on top of the pile with rankings of 13th and 15th place respectively. The worst performer out of all the surveyed countries was Burma.

Google is shutting down its Freebase knowledge base

Google announced Tuesday that the company is shutting down Freebase, the crowdsourced knowledge base it acquired in 2010 when it bought Metaweb. Freebase is a popular source of information about topics — more than 46 million of them — that can be searched, queried like a database and used to provide information to applications.

According to the Google+ post announcing Freebase’s fate, Google will begin exporting the project’s information to the Wikidata project by the end of March, and Freebase will be retired by June 30, 2015. Information from Freebase helped feed Google’s fast-growing Knowledge Graph, and Freebase developer APIs will be replaced by a set of Knowledge Graph-powered ones.

“We believe strongly in a robust community-driven effort to collect and curate structured knowledge about the world, but we now think we can serve that goal best by supporting Wikidata — they’re growing fast, have an active community, and are better-suited to lead an open collaborative knowledge base,” the post reads.

The real value of knowledge bases like Wikidata isn’t just the information (which is already available via Wikipedia in many cases) but the structured format it takes and, in the case of Google’s Knowledge Graph, the semantic nature of it. Efforts to build smarter search engines, AI systems and even robots need places where their systems can go to learn more about the words they’re seeing or the objects they’re encountering, and they need it in a format they can read.

The tricky business of acting on live data before it’s too late

We’re generating more data than ever and analyzing a lot more of it, too. But when it comes to responding quickly to potential public health crises or other situations, we need more data, more analysis and more people paying attention to it all.