Google on Monday released the latest in a string of text datasets designed to make it easier for people outside its hallowed walls to build applications that can make sense of all the words surrounding them.
As explained in a blog post, the company analyzed the [company]New York Times[/company] Annotated Corpus — a collection of millions of articles spanning 20 years, tagged for properties such as people, places and things mentioned — and created a dataset that ranks the salience (or relative importance) of every name mentioned in each one of those articles.
[pullquote person="Dr. Olivier Lichtarge" attribution="Dr. Olivier Lichtarge, Baylor College of Medicine"]"A computer certainly may not reason as well as a scientist but the little it can, logically and objectively, may contribute greatly when applied to our entire body of knowledge."[/pullquote]
Essentially, the goal of the dataset is to give researchers a baseline sense of which entities matter within a particular piece of content, one that can then be supplemented with background data sources that provide even more information. The number of times a person or company is mentioned in an article can be a strong signal of importance, especially when compared against how often that word is usually mentioned (one of the early methods for ranking search results). But a more telling measure of importance would also draw on existing knowledge of broader concepts to capture important words that don't stand out by volume alone.
For example, in an article about NBA coach Becky Hammon, the blog post’s authors explain:
“‘Basketball’ is more than a string of characters; it is a reference to something in the real world which we already know quite a bit about.
“Background information about entities ought to help us decide which of them are most salient. After all, an article’s author assumes her readers have some general understanding of the world, and probably a bit about sports too. Using background knowledge, we might be able to infer that the WNBA is a salient entity in the Becky Hammon article even though it only appears once….
“Features like mention count and document positioning give reasonable salience predictions. But because they only describe what’s explicitly in the document, we expect a system that uses background information to expose what’s implicit could give better results.”
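To make the surface features the post names concrete, here is a toy sketch that scores entities using only mention count and first-mention position. This is a hypothetical illustration of those two signals, not Google's actual salience model, and the entity list is invented for the example.

```python
from collections import Counter

def salience_ranking(doc_entities):
    """Rank entities by two surface features: how often they are
    mentioned and how early they first appear. A toy illustration,
    not the system described in the blog post."""
    counts = Counter(doc_entities)
    first_pos = {}
    for i, ent in enumerate(doc_entities):
        first_pos.setdefault(ent, i)  # remember earliest mention
    n = len(doc_entities)
    scores = {}
    for ent, c in counts.items():
        # More mentions and an earlier first mention both raise the score.
        scores[ent] = c / n + (1 - first_pos[ent] / n)
    return sorted(scores, key=scores.get, reverse=True)

mentions = ["Becky Hammon", "Spurs", "basketball", "Becky Hammon",
            "NBA", "Becky Hammon", "WNBA"]
print(salience_ranking(mentions)[0])  # "Becky Hammon" ranks first
```

Notice that "WNBA," mentioned once and late, lands near the bottom of this ranking; that is exactly the gap the authors hope background knowledge can close.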
This type of work is important, because although there’s a lot of talk about advances in artificial intelligence, the truth is that we have a long way to go before machines can match human capabilities. As the post’s authors also note, “Reading comes pretty easily to people — we can quickly identify the places or things or people most central to a piece of text. But how might we teach a machine to perform this same task?”
The end results of accomplishing this mission (and accomplishing it well) will include better search results, sure, but possibly better science and medicine, as well. A team of researchers from [company]IBM[/company] and [company]Baylor College of Medicine[/company] just published a paper as part of the KDD 2014 conference that details work they did to analyze more than 70,000 scholarly articles about a particular protein using IBM’s Watson system. The program they created, called KnIT, analyzed the relationships between the target protein and others based (this is a simplified explanation) on how often and closely they appear in the articles.
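The co-occurrence idea behind that simplified explanation can be sketched in a few lines: count how often each candidate protein appears in the same document as the target, and rank by that count. This is a stand-in for the kind of signal described, not KnIT's actual algorithm, and the protein names and documents below are invented for illustration.

```python
from collections import Counter

def cooccurrence_ranking(target, documents):
    """Rank candidates by how many documents they share with the target.
    A simplified stand-in for KnIT-style literature mining."""
    pair_counts = Counter()
    for doc in documents:
        proteins = set(doc)
        if target in proteins:
            for other in proteins - {target}:
                pair_counts[other] += 1
    return pair_counts.most_common()

# Hypothetical per-article protein mentions.
docs = [
    ["p53", "MDM2", "ATM"],
    ["p53", "MDM2"],
    ["p53", "CHK2"],
    ["ATM", "CHK2"],
]
print(cooccurrence_ranking("p53", docs))  # MDM2 shares the most documents
```

A real system would weight proximity within each article rather than bare document overlap, but the ranking step is the same shape.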
The system predicted seven out of nine proteins that have since been determined as important in the realm of tumor suppression.
In a separate paper published as part of KDD 2014, [company]Allen Institute for Artificial Intelligence[/company] researchers detailed a question-answering system designed to read natural language questions and derive answers by scouring public knowledge bases such as Freebase. Oren Etzioni, the Allen Institute’s executive director, used a Monday morning keynote at the conference to talk about that research as well as the institute’s flagship, Project Aristo, which aims to build a system that can reason over what it reads at the same level as a fourth-grader (to begin with).
He also discussed a project called Semantic Scholar, which is similar in aim to the IBM-Baylor one, except that it wants to enable semantic search over scholarly papers so researchers don’t need to nail their keywords in order to find what they want over an ever-growing body of work. Going back to the WNBA example in Google’s work, one could imagine searching for papers about the league and missing one that only used the exact term sparingly — if at all — but is nonetheless very relevant.
Of course, these are just the latest examples of many new approaches to language understanding, including recent deep learning projects coming out of places such as [company]Google[/company], [company]Stanford[/company] and DARPA, targeting use cases such as automatically detecting the meanings of words, sentiment analysis and anomaly detection.
One of the Baylor researchers, quoted in a press release about its work with IBM, explained the promise of all this work to build systems that can start to understand what text is actually about: “A computer certainly may not reason as well as a scientist but the little it can, logically and objectively, may contribute greatly when applied to our entire body of knowledge.”
In short, computers can read a lot, and fast, and, programmed correctly, they can help us find things we might never have the time to find ourselves.
Correction: The IBM research was in conjunction with Baylor College of Medicine, not Baylor University as originally written.