Forget for a minute the smartphone-like capabilities of a technology like Google Glass and think only about the video-recording capabilities. Think about a society full of people wearing miniature, always-on cameras shooting videos as prolifically as they currently snap photos. Think about all the potentially useful, entertaining or perhaps heart-warming content all those videos will contain, perhaps taking place only on the periphery of where the camera is focused.
And then think of it all going to waste because there’s no realistic way to search through it all without metatagging the heck out of every video. Untold hours of video holding untold amounts of information, all amounting to little more than a gigantic box full of VHS home movies gathering dust in your parents’ attic.
If video is ever going to become as useful or as widely watched as text and still images, we’re going to need some way to find what we need from it or a way to let us view it as quickly as possible while still getting the gist of what’s going on. But don’t fear, computer scientists are here — and they have access to all sorts of machine learning and deep learning techniques that could help video live up to its potential.
Take, for example, the researchers from the University of Texas who have developed a machine-learning-based method for “summarizing” what they call “egocentric” videos (i.e., those taken by someone wearing a miniature camera) based on the most important objects in any given frame. Essentially, they’re able to score the significance of objects based on factors such as their location in the frame and whether the camera-wearer touched them. Once they’ve figured out what’s important, they can figure out which frames out of potentially many thousands are most indicative of the video’s theme.
Then comes the fun part: the researchers’ system is able to piece together selected frames based on which earlier shots influenced the later ones, effectively creating a summary of the longer video. Right now, the team notes in its paper, the method works best for videos with a single clear theme (such as cooking soup or buying ice cream, they suggest) and with objects from which to infer importance. However, what makes the approach unique is not just its highlighting of important objects, but also its identification of “subshots” that help denote transitions from one event to another.
Going forward, the researchers think their technique can apply to other areas, as well, including videos that are more motion-centric than object-centric.
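To make the idea of scoring objects and picking representative frames a bit more concrete, here is a minimal sketch in Python. It is not the UT team's method: their system is machine-learning-based, while this toy version swaps in a fixed, hand-written formula. The `DetectedObject` fields, the `touch_bonus` weight and the function names are all hypothetical, chosen only to mirror the two cues mentioned above (where an object sits in the frame and whether the camera-wearer touched it).

```python
from dataclasses import dataclass


@dataclass
class DetectedObject:
    # Hypothetical detection record; coordinates are normalized to 0..1.
    label: str
    center_x: float
    center_y: float
    touched: bool


def importance(obj, touch_bonus=0.5):
    """Toy importance score: centered and touched objects score higher.

    The weights here are illustrative assumptions, not learned values.
    """
    # Distance from the frame center (0.5, 0.5).
    dist = ((obj.center_x - 0.5) ** 2 + (obj.center_y - 0.5) ** 2) ** 0.5
    # The farthest point (a corner) is sqrt(0.5) away, so this maps to 0..1.
    centrality = 1.0 - dist / (0.5 ** 0.5)
    return centrality + (touch_bonus if obj.touched else 0.0)


def frame_score(objects):
    """Score a frame by its single most important object (0.0 if empty)."""
    return max((importance(o) for o in objects), default=0.0)


def pick_keyframes(frames, k):
    """Return the indices of the k highest-scoring frames, in time order."""
    ranked = sorted(
        range(len(frames)),
        key=lambda i: frame_score(frames[i]),
        reverse=True,
    )
    return sorted(ranked[:k])
```

Given a list of frames, each holding its detected objects, `pick_keyframes(frames, 2)` would return the positions of the two frames with the most prominent objects. A real summarizer would also need to detect objects in the first place and group frames into events, which this sketch leaves out.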
As impressive as it is, though, the UT researchers’ work isn’t the only attempt to put machine learning to work on video content. In fact, anyone who has been following Google’s extensive research into neural networks and deep learning probably sees a strong connection between its efforts and those of the UT team — and they probably recognize why this is so important to Google.
Its users are producing and posting incredible amounts of video to YouTube, and they’ll be producing even more once Glass and other wearable technologies become mainstream. And, ultimately, all of Google’s business boils down to being able to accurately search and surface content so more eyes end up on the ads displayed alongside it. As cool as it is that Google’s neural networks can recognize cats and human faces in YouTube videos, the business implication is that people could search videos based on what’s going on in them rather than how they’re described.
Elliot Turner, founder and CEO of deep learning startup AlchemyAPI, says this capability will be available faster than we might think. Both his company and Google have already developed systems for accurately classifying objects within images (AlchemyAPI will be unveiling its product next week), and as the UT research highlights, videos are ultimately little more than collections of images snapped in rapid succession.
Actually, Turner added, “video is a little easier than still images” because the temporal nature of the frames gives neural networks more data from which to predict what they’re seeing. Already, he noted, members of his research team have created a Google Glass app that sends live video from Glass to a server and then returns near real-time predictions of what appears in the frames. Running on GPUs, they’re able to process 66 frames of full-resolution video per second, or one frame roughly every 15 milliseconds on average.
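One simple way to see why extra frames help is to average a classifier's guesses over a short run of frames, so that a single noisy frame gets outvoted. The sketch below does exactly that and nothing more. It is an illustration of plain averaging, not a description of how AlchemyAPI's or Google's systems work, and the per-frame probabilities are assumed to come from some image classifier that is not shown.

```python
def average_predictions(frame_probs):
    """Average per-frame class probabilities over a window of frames.

    frame_probs is a list of dicts mapping a label to a probability,
    one dict per frame (assumed output of an image classifier).
    """
    if not frame_probs:
        raise ValueError("need at least one frame")
    totals = {}
    for probs in frame_probs:
        for label, p in probs.items():
            totals[label] = totals.get(label, 0.0) + p
    n = len(frame_probs)
    return {label: total / n for label, total in totals.items()}


def predict(frame_probs):
    """Return the label with the highest averaged probability."""
    avg = average_predictions(frame_probs)
    return max(avg, key=avg.get)
```

If three consecutive frames score "cat" at 0.6, 0.3 and 0.8, the middle frame on its own would favor another label, but the average still comes out in favor of "cat".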
But there’s no reason we have to stop at search, or even at creating summaries of videos as the UT team did. Turner says that “word embedding” (the process that Google’s new open source word2vec tool uses to represent words as numeric vectors) can also be applied to video. That means just as deep learning can help us identify similar words and concepts from textual data, it could also automatically identify similar videos based on what the neural networks are able to identify within the frames.
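As a rough illustration of what "similar videos" could mean in practice, suppose each video has already been boiled down to a list of numbers (an embedding). Comparing two videos then becomes a matter of comparing two vectors, and cosine similarity is one common way to do that. The sketch below assumes the embeddings already exist; how to produce them is the hard part and is not shown. The function names and the tiny three-number vectors are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def most_similar(query, library):
    """Return the name of the library video closest to the query vector.

    library maps a video name to its (assumed, precomputed) embedding.
    """
    return max(library, key=lambda name: cosine_similarity(query, library[name]))
```

With a library holding one cooking video and one skateboarding video, a new clip whose embedding points in nearly the same direction as the cooking video's would be matched to it, regardless of how either video was titled or tagged.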
It’s not too difficult to think of possible use cases, from surfacing funny YouTube videos or first-person cooking demonstrations, to automatically identifying child pornography or police abuse. Based on what he’s already seen from the 35,000 developers using AlchemyAPI’s flagship text-analytics service, Turner says we should prepare for “all sorts of really cool applications that, honestly, none of us are even thinking about now.”