Although their approaches differ (full papers are available here for Stanford and here for Google), both groups essentially combined deep convolutional neural networks — the type of deep learning models responsible for the huge advances in computer vision accuracy over the past few years — with recurrent neural networks that excel at text analysis and natural language processing. Recurrent neural networks have been responsible for some of the significant improvements in language understanding recently, including the machine translation that powers Microsoft’s Skype Translate and Google’s word2vec libraries.
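To make the pairing concrete, here is a minimal toy sketch in plain Python of the general encoder-decoder idea: a convolutional stage condenses an image into a feature vector, and a recurrent stage turns that vector into words one at a time. This is not the Stanford or Google implementation. The vocabulary, the layer sizes, the function names (`encode_image`, `generate_caption`) and the random, untrained weights are all assumptions made for illustration, so the words it emits are meaningless. Only the data flow is the point.

```python
import math
import random

# Hypothetical toy vocabulary; real systems learn tens of thousands of words.
VOCAB = ["<start>", "<end>", "a", "cat", "dog", "sits", "on", "grass"]


def conv2d_valid(image, kernel):
    """Slide one kernel over a 2-D image (no padding, stride 1), then ReLU."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            s = sum(image[i + di][j + dj] * kernel[di][dj]
                    for di in range(kh) for dj in range(kw))
            row.append(max(0.0, s))
        out.append(row)
    return out


def encode_image(image, kernels):
    """Toy 'CNN': one conv layer plus global average pooling per kernel."""
    features = []
    for kernel in kernels:
        cells = [v for row in conv2d_valid(image, kernel) for v in row]
        features.append(sum(cells) / len(cells))
    return features


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def generate_caption(image, seed=0, hidden=6, max_len=8):
    """Encode the image, then greedily decode words with a simple RNN."""
    rng = random.Random(seed)  # untrained, random weights (assumption)
    kernels = [rand_matrix(rng, 3, 3) for _ in range(4)]
    w_img = rand_matrix(rng, hidden, len(kernels))
    w_hh = rand_matrix(rng, hidden, hidden)
    w_emb = rand_matrix(rng, hidden, len(VOCAB))  # one column per word
    w_out = rand_matrix(rng, len(VOCAB), hidden)

    # The image features initialise the recurrent state.
    h = [math.tanh(x) for x in matvec(w_img, encode_image(image, kernels))]
    token = VOCAB.index("<start>")
    words = []
    for _ in range(max_len):
        emb = [w_emb[r][token] for r in range(hidden)]
        h = [math.tanh(a + b) for a, b in zip(matvec(w_hh, h), emb)]
        scores = matvec(w_out, h)
        # Greedy choice over every word except "<start>".
        token = max(range(1, len(VOCAB)), key=scores.__getitem__)
        if VOCAB[token] == "<end>":
            break
        words.append(VOCAB[token])
    return words


# A made-up 6x6 grayscale "image" with values between 0 and 1.
image = [[((i * j) % 5) / 4 for j in range(6)] for i in range(6)]
print(generate_caption(image))
```

In a real system both halves would be trained together on images paired with human-written captions, and the convolutional network would typically be far deeper and pretrained on an object-recognition task.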
In a comment on a Hacker News post linking to a New York Times story about the research out of Google and Stanford, one of the authors of the Stanford paper points to similar research also coming out of Baidu and UCLA, the University of Toronto, and the University of California, Berkeley.
(Coincidentally, University of Toronto researcher and Google Distinguished Scholar Geoff Hinton was asked in a recent Reddit Ask Me Anything session, which we recapped here, about how deep learning models might account for various elements and objects present in a single image. The closing lines in his response: “I guess we should just train [a recurrent neural network] to output a caption so that it can tell us what it thinks is there. Then maybe the philosophers and cognitive scientists will stop telling us what our nets cannot do.”)
The research is potentially very promising from an application perspective. Models that can accurately assess the entirety of a scene rather than just picking out individual objects will deliver more-accurate image search results and content libraries. As it gets more accurate, this type of approach could help robotics systems in fields from driverless cars to law enforcement make better, more context-aware decisions about how to act.
In May, we covered a conceptually similar, but technically different, research project in which Allen Institute for Artificial Intelligence researchers developed a system they said can “learn everything about anything” by cross-referencing common phrases about a certain thing (“jumping horse,” “quarter horse” and “horse racing,” for example) with images labeled using the same terms.
Beyond the immediate applications of the technology, research like this is also important as a means of demonstrating how intelligent today’s AI systems are and what might be possible in the future. Some prominent voices in the field, including Oren Etzioni, executive director of the aforementioned Allen Institute, have taken public potshots at deep learning as mere classification while touting their own work to build AI systems that possess real knowledge. Others are pushing for an alternative version of the famous Turing test in which computers must correctly answer questions containing ambiguous pronoun usage.
The work out of Google and Stanford points toward a future in which multiple approaches to AI might be combined to create systems capable of some very impressive feats. For example, whereas a straight-up object-recognition system might be able to recognize a cat and a goldfish in an image, these new hybrid systems could presumably determine the scene is actually a cat reaching into a goldfish bowl. Tied to a knowledge base that understands the predator-prey relationship between the two animals, the same system might be able to predict the goldfish will soon be eaten.
It wouldn’t yet be like having the AI agents in movies such as Her or 2001: A Space Odyssey around to interact with, or anything near having a human around. But AI systems that could be not only useful, but also helpful, would be a pretty big deal.
Update: This post was updated at 8:18 a.m. to include references to additional research, as well as the New York Times’ reporting on the work from Google and Stanford.