Microsoft is building fast, low-power neural networks with FPGAs

Microsoft on Monday released a white paper explaining a current effort to run convolutional neural networks — the deep learning technique responsible for record-setting computer vision algorithms — on FPGAs rather than GPUs.

Microsoft claims that new FPGA designs provide greatly improved processing speed over earlier versions while consuming a fraction of the power of GPUs. This type of work could represent a big shift in deep learning if it catches on, because for the past few years the field has been largely centered around GPUs as the computing architecture of choice.

If there’s a major caveat to Microsoft’s efforts, it might have to do with performance. While Microsoft’s research shows FPGAs consuming about one-tenth the power of high-end GPUs (25W compared with 235W), GPUs still process images at a much higher rate. Nvidia’s Tesla K40 GPU can do between 500 and 824 images per second on one popular benchmark dataset, the white paper claims, while Microsoft predicts its preferred FPGA chip — the Altera Arria 10 — will be able to process about 233 images per second on the same dataset.

However, the paper’s authors note that performance per processor is relative because a multi-FPGA cluster could match a single GPU while still consuming much less power: “In the future, we anticipate further significant gains when mapping our design to newer FPGAs . . . and when combining a large number of FPGAs together to parallelize both evaluation and training.”
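To make the efficiency argument concrete, here is a rough back-of-the-envelope sketch using only the figures quoted above. It rests on two assumptions that the white paper does not guarantee: that the 25W figure also applies to the projected Arria 10 throughput, and that throughput scales linearly as FPGAs are added. The function names are illustrative.

```python
import math

# Figures quoted in the article (from Microsoft's white paper).
FPGA_WATTS = 25                  # approximate FPGA power draw
GPU_WATTS = 235                  # high-end GPU power draw
FPGA_IMAGES_PER_SEC = 233        # projected Arria 10 throughput
GPU_IMAGES_PER_SEC = (500, 824)  # reported Tesla K40 range


def images_per_sec_per_watt(images_per_sec, watts):
    """Throughput per watt of power drawn."""
    return images_per_sec / watts


def fpgas_to_match(gpu_images_per_sec):
    """FPGAs needed to reach a GPU's throughput, ASSUMING linear scaling."""
    return math.ceil(gpu_images_per_sec / FPGA_IMAGES_PER_SEC)


if __name__ == "__main__":
    fpga_eff = images_per_sec_per_watt(FPGA_IMAGES_PER_SEC, FPGA_WATTS)
    print(f"FPGA: {fpga_eff:.2f} images/s per watt")
    for rate in GPU_IMAGES_PER_SEC:
        gpu_eff = images_per_sec_per_watt(rate, GPU_WATTS)
        n = fpgas_to_match(rate)
        print(f"GPU at {rate} images/s: {gpu_eff:.2f} images/s per watt; "
              f"{n} FPGAs ({n * FPGA_WATTS}W) would match it")
```

Under those assumptions, four FPGAs drawing about 100W in total would match the K40's top rate of 824 images per second, against 235W for the GPU.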

In a Microsoft Research blog post, processor architect Doug Burger wrote, “We expect great performance and efficiency gains from scaling our [convolutional neural network] engine to Arria 10, conservatively estimated at a throughput increase of 70% with comparable energy used.”


This is not Microsoft’s first rodeo when it comes to deploying FPGAs within its data centers, and in fact the new work is a corollary of an earlier project. Last summer, the company detailed a research project called Catapult in which it was able to improve the speed and performance of Bing’s search-ranking algorithms by adding FPGA co-processors to each server in a rack. The company intends to port production Bing workloads onto the Catapult architecture later this year.

There have also been other attempts to port deep learning algorithms onto FPGAs, including one by State University of New York at Stony Brook professors and another by Chinese search giant Baidu. Ironically, Baidu Chief Scientist and deep learning expert Andrew Ng is a big proponent of GPUs, and the company claims a massive GPU-based deep learning system as well as a GPU-based supercomputer designed for computer vision. But this needn’t be an either/or situation: companies could still use GPUs to maximize performance while training their models, and then port them to FPGAs for production workloads.

Expect to hear more about the future of deep learning architectures and applications at Gigaom’s Structure Data conference March 18 and 19 in New York, which features experts from Facebook, Microsoft and elsewhere. Our Structure Intelligence conference, September 22-23 in San Francisco, will dive even deeper into deep learning, as well as the broader field of artificial intelligence algorithms and applications.

Why deep learning is at least inspired by biology, if not the brain

As deep learning continues gathering steam among researchers, entrepreneurs and the press, there’s a loud-and-getting-louder debate about whether its algorithms actually operate like the human brain does.

The comparison might not make much of a difference to developers who just want to build applications that can identify objects or predict the next word you’ll text, but it does make a difference. Researchers leery of another “AI winter” or trying to refute worries of a forthcoming artificial superintelligence worry that the brain analogy is setting people up for disappointment, if not undue stress. When people hear “brain,” they think about machines that can think like us.

On this week’s Structure Show podcast, we dove into the issue with Ahna Girshick, an accomplished neuroscientist, visual artist and senior data scientist at deep learning startup Enlitic. Girshick’s colleague, Enlitic Founder and CEO (and former Kaggle chief scientist) Jeremy Howard, also joined us for what turned out to be a rather insightful discussion.


Below are some of the highlights, focused on how Girshick and Howard view the brain analogy. (They take a different tack than Google researcher Greg Corrado, who recently called the analogy “officially overhyped.”) But we also talk at length about deep learning, in general, and how Enlitic is using it to analyze medical images and hopefully help overcome a global shortage of doctors.

If you’re interested in hearing more from Girshick, Enlitic and deep learning, come to our Structure Data conference next month, where she’ll be accepting a startup award and joining me on stage for an in-depth talk about how artificial intelligence can improve the health care system. If you want two full days of all AI, all the time, start making plans for our Structure Intelligence conference in September.

Ahna Girshick, Enlitic's senior data scientist.


Natural patterns at work in deep learning systems

“It’s true, deep learning was inspired by how the human brain works,” Girshick said on the Structure Show, “but it’s definitely very different.”

Just like with our vision systems, deep learning systems for computer vision process stuff in layers, if you will. They start with edges and then get more abstract with each layer, focusing on faces or perhaps whole objects, she explained. “That said, our brain has many different types of neurons,” she added. “Everywhere we look in the brain we see diversity. In these artificial networks, every node is trying to basically do the same thing.”

This is why our brains are able to navigate a dynamic world and do many things, while deep learning systems are usually focused on one task with a clear objective. Still, Girshick said, “From a computer vision standpoint, you can learn so much by looking at the brain that why not.”

She explained some of these connections by discussing a research project she worked on at NYU:

“We were interested in, kind of, the statistics of the world around us, the visual world around us. And what that means is basically the patterns in the visual world around us. If you were to take a bunch of photos of the world and run some statistics on them, you’ll find some patterns, things like more horizontals than verticals. . . . And then we look inside the brain and we see, ‘Gee, wow, there’s all these neurons that are sensitive to edges and there’s more of them that are sensitive to horizontals than verticals!’ And then we measured . . . the behavioral response in a type of psychology experiment and we see, ‘Gee, people are biased to perceive things as more horizontal or more vertical than they actually are!’”

Asked if computer vision has been such a big focus of deep learning research so far because of those biological parallels, or because that’s what companies such as Google and Facebook have the most need for, Girshick suggested it’s a bit of both. “It’s the same in the neuroscience department at a university,” she said. “The reason that people focus on vision is because a third of our cortex is devoted to vision; it’s a major chunk of our brain. . . . It’s also something that’s easier for us to think about, because we see it.”


Jeremy Howard (left) at Structure: Data 2012.

Howard noted that the team at Enlitic keeps finding more connections between Girshick’s research and the cutting edge of deep learning, and suggested that attempts to distance the two fields are sometimes insincere. “I think it’s kind of fashionable for people to say how deep learning is just math and these people who are saying ‘brain-like’ are crazy, but the truth is … it absolutely is inspired by the brain,” he said. “It’s a massive simplification, but we keep on finding more and more inspirations.”

The issue probably won’t be resolved any time soon — in part because it’s so easy for journalists and others to take the easy way out when explaining deep learning — but Girshick offered a solution.

“Maybe they should say ‘inspired by biology’ instead of ‘inspired by the brain,'” she said. “. . . Yes, deep learning is kind of amazing and very flexible compared to other generations of algorithms, but it’s not like the intelligent system I was studying when I studied the brain — at all.”

Microsoft says its new computer vision system can outperform humans

Microsoft researchers claim in a recently published paper that they have developed the first computer system capable of outperforming humans on a popular benchmark. While it’s estimated that humans can classify images in the ImageNet dataset with an error rate of 5.1 percent, Microsoft’s team said its deep-learning-based system achieved an error rate of only 4.94 percent.

Their paper was published less than a month after Baidu published a paper touting its record-setting system, which it claimed achieved an error rate of 5.98 percent using a homemade supercomputing architecture. The best performance in the actual ImageNet competition so far belongs to a team of Google researchers, who in 2014 built a deep learning system with a 6.66 percent error rate.


A set of images that the Microsoft system classified correctly. “GT” means ground truth; below are the top five predictions of the deep learning system.

“To our knowledge, our result is the first published instance of surpassing humans on this visual recognition challenge,” the paper states. “On the negative side, our algorithm still makes mistakes in cases that are not difficult for humans, especially for those requiring context understanding or high-level knowledge…

“While our algorithm produces a superior result on this particular dataset, this does not indicate that machine vision outperforms human vision on object recognition in general . . . Nevertheless, we believe our results show the tremendous potential of machine algorithms to match human-level performance for many visual recognition tasks.”
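For readers wondering how an error rate like 4.94 percent is scored, here is a minimal sketch. It assumes the figures quoted in this post are top-5 error rates, which is the standard metric for the ImageNet classification task: a prediction counts as correct if the true label appears anywhere among the system's five best guesses. The function name and the toy labels are made up for illustration and are not taken from the paper.

```python
def top5_error(predictions, ground_truth):
    """Fraction of images whose true label is missing from the top five guesses.

    predictions: list of ranked label lists (best guess first), one per image.
    ground_truth: list of true labels, one per image.
    """
    if len(predictions) != len(ground_truth):
        raise ValueError("need one ground-truth label per prediction list")
    misses = sum(
        1 for ranked, truth in zip(predictions, ground_truth)
        if truth not in ranked[:5]
    )
    return misses / len(ground_truth)


# Toy data: hypothetical labels, not taken from the paper.
preds = [
    ["sheep", "cow", "goat", "dog", "horse"],
    ["cup", "bowl", "plate", "spoon", "vase"],
    ["lake", "river", "sea", "pond", "beach"],
    ["cat", "dog", "fox", "wolf", "lynx"],
]
truth = ["cow", "cup", "mountain", "cat"]
print(top5_error(preds, truth))  # 0.25: one miss out of four images
```

Note that an image can count as a miss even when every guess names something really in the scene, if none of the five matches the single label the dataset assigns.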

A set of images where the deep learning system didn't match the given label, although it did correctly classify objects in the scene.


One of the Microsoft researchers, Jian Sun, explains the difference in plainer English in a Microsoft blog post: “Humans have no trouble distinguishing between a sheep and a cow. But computers are not perfect with these simple tasks. However, when it comes to distinguishing between different breeds of sheep, this is where computers outperform humans. The computer can be trained to look at the detail, texture, shape and context of the image and see distinctions that can’t be observed by humans.”

If you’re interested in learning how deep learning works, why it’s such a hot area right now and how it’s being applied commercially, think about attending our Structure Data conference, which takes place March 18 and 19 in New York. Speakers include deep learning and machine learning experts from Facebook, Yahoo, Microsoft, Spotify, Hampton Creek, Stanford and NASA, as well as startups Blue River Technology, Enlitic, MetaMind and TeraDeep.

We’ll dive even deeper into artificial intelligence at our Structure Intelligence conference (Sept. 22 and 23 in San Francisco), where early confirmed speakers come from Baidu, Microsoft, Numenta and NASA.

PhotoTime is a deep learning application for the rest of us

A Sunnyvale, California, startup called Orbeus has developed what could be the best application yet for letting everyday consumers benefit from advances in deep learning. It’s called PhotoTime and, yes, it’s yet another photo-tagging app. But it looks really promising and, more importantly, it isn’t focused on business uses like so many other recent deep-learning-based services, nor has it been acquired and dissolved into Dropbox or Twitter or Pinterest or Yahoo.

Deep learning, to anyone unfamiliar with the term, is essentially a term for a class of artificial intelligence algorithms that excel at learning the latent features of the data they analyze. The more data that deep learning systems have to train on, the better they perform. The field has made big strides in recent years, largely with regard to machine-perception workloads such as computer vision, speech recognition and language understanding.

(If you want to get a crash course in what deep learning is and why web companies are investing billions of dollars into it, come to Structure Data in March and watch my interview with Rob Fergus of Facebook Artificial Intelligence Research, as well as several other sessions.)


The Orbeus team. L to R: Yuxin Wu, Yi Li, Wei Xia and Meng Wang.

I am admittedly late to the game in writing about PhotoTime (it was released in November) because, well, I don’t often write about mobile apps. The people who follow this space for a living, though, also seemed impressed with it when they reviewed it back then. Orbeus, the company behind PhotoTime, launched in 2012 and its first product is a computer vision API called ReKognition. According to CEO Yi Li, it has already raised nearly $5 million in venture capital.

But I ran into the Orbeus team at a recent deep learning conference and was impressed with what they were demonstrating. As an app for tagging and searching photos, it appears very rich. It tags smartphone photos using dozens of different categories, including place, date, object and scene. It also recognizes faces — either by connecting to your social networks and matching contacts with people in the photos, or by building collections of photos including the same face and letting users label them manually.

You might search your smartphone, for example, for pictures of flowers you snapped in San Diego, or for pictures of John Smith at a wedding in Las Vegas in October 2013. I can’t vouch for its accuracy personally because the PhotoTime app for Android isn’t yet available, but I’ll give it the benefit of the doubt.


More impressive than the tagging features, though, and the thing that could really set it apart from other deep-learning-powered photo-tagging applications (including well-heeled ones such as Google+, Facebook and Flickr), is that PhotoTime actually indexes the album locally on users’ phones. Images are sent to the cloud and run through Orbeus’s deep learning models, and then the metadata is sent back to your phone so you can search existing photos even without a network connection.
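Orbeus has not published how the on-device index works, so the following is only a sketch of the general idea: once tags come back from the cloud, the phone can keep a small inverted index (tag to photo IDs) and answer searches with no network access. The class name, method names and sample tags are all hypothetical.

```python
from collections import defaultdict


class LocalPhotoIndex:
    """On-device inverted index: tag -> set of photo IDs.

    Hypothetical structure for illustration; not Orbeus's actual design.
    """

    def __init__(self):
        self._photos_by_tag = defaultdict(set)

    def add_metadata(self, photo_id, tags):
        """Store the tags that a (hypothetical) cloud model returned for a photo."""
        for tag in tags:
            self._photos_by_tag[tag.lower()].add(photo_id)

    def search(self, *tags):
        """Return photo IDs matching every tag, using only local data."""
        if not tags:
            return set()
        matches = [self._photos_by_tag.get(tag.lower(), set()) for tag in tags]
        return set.intersection(*matches)


index = LocalPhotoIndex()
index.add_metadata("IMG_001", ["flower", "San Diego", "2014-05"])
index.add_metadata("IMG_002", ["wedding", "Las Vegas", "2013-10", "John Smith"])
index.add_metadata("IMG_003", ["flower", "Las Vegas", "2013-10"])

print(sorted(index.search("flower", "san diego")))  # ['IMG_001']
```

The expensive step (running the deep learning models) happens once in the cloud; every later search is a cheap set intersection on the phone.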

The company does have a fair amount of experience in the deep learning field, with several members, including research scientist Wei Xia, winning a couple of categories at last year’s ImageNet object-recognition competition as part of a team from the National University of Singapore. Xia told me that while PhotoTime’s application servers run largely on Amazon Web Services, the company’s deep learning system resides on a homemade, liquid-cooled GPU cluster in the company’s headquarters.

Here’s what that looks like.

The Orbeus GPU cluster.


As I’ve written before, though, tagging photos is only part of the ideal photo-app experience, and there’s still work to do there no matter how nicely the product functions. I’m still waiting for some photo application to perfect the curated photo album, something Disney Research is working on using another machine learning approach.

And while accuracy continues to improve for recognizing objects and faces, researchers are already hard at work applying deep learning to everything from recognizing the positions of our bodies to the sentiment implied by our photos.

TeraDeep wants to bring deep learning to your dumb devices

Open the closet of any gadget geek or computer nerd, and you’re likely to find a lot of skeletons. Stacked deep in a cardboard box or Tupperware tub, there they are: The remains of webcams, routers, phones and other devices deemed too obsolete to keep using and left to rot, metaphorically speaking, until they eventually find their way to a Best Buy recycling bin.

However, an under-the-radar startup called TeraDeep has developed a way to revive at least a few of those old devices by giving them the power of deep learning. The company has built a module that it calls the CAMCUE, which runs on an ARM-based processor and is designed to plug into other gear and run deep neural network algorithms on the inputs they send through. It could turn an old webcam into something with the smart features of a Dropcam, if not smarter.

“You can basically turn our little device into anything you want,” said TeraDeep co-founder and CTO Eugenio Culurciello during a recent interview. That potential is why the company won a Structure Data award as one of the most-promising startups to launch in 2014, and will be presenting at our Structure Data conference in March.

Didier Lacroix (left) and Eugenio Culurciello (right)


But before TeraDeep can start transforming the world’s dumb gear into smart gear, the company needs to grow, and by a lot. It’s headquartered in San Mateo, California, and is the brainchild of Culurciello, who moonlights as an associate professor of engineering at Purdue University in Indiana. It has 10 employees, only three of whom are full-time. It has a prototype of the CAMCUE, but isn’t ready to start mass-producing the modules and getting them into developers’ hands.

I recently saw a prototype of it at a deep learning conference in San Francisco, and was impressed by how well it worked, albeit in a simple use case. Culurciello hooked the CAMCUE up to a webcam and to a laptop, and as he panned the camera, the display on the computer screen would flag the presence of a human when I was in the shot.

“As long as you look human-like, it’s going to detect you,” he said.

The prototype system can be set to detect a number of objects, including iPhones, which it was able to do when the phone was held vertically.


The webcam setup on a conference table.

TeraDeep also has developed a web application, software libraries and a cloud platform that Culurciello said should make it fairly easy for power users and application developers, initially, and then perhaps everyday consumers to train TeraDeep-powered devices to do what they want them to do. It could be “as easy as uploading a bunch of images,” he said.

“You don’t need to be a programmer to make these things do magic,” TeraDeep CEO Didier Lacroix added.

But Culurciello and Lacroix have bigger plans for the company’s technology — which is the culmination of several years of work by Culurciello to develop specialized hardware for neural network algorithms — than just turning old webcams into smarter webcams. They’d like the company to become a platform player in the emerging artificial intelligence market, selling embedded hardware and software to fulfill the needs of hobbyists and large-scale device manufacturers alike.

A TeraDeep module, up close.


It already has a few of the pieces in place. Aside from the CAMCUE module, which Lacroix said will soon shrink to about the surface area of a credit card, the company has also tuned its core technology (called nn-x, or neural network accelerator) to run on existing smartphone platforms. This means developers could build mobile apps that do computer vision at high speed and low power without relying on GPUs.

TeraDeep has also worked in system-on-a-chip design for partners that might want to embed more computing power into their devices. Think drones, cars and refrigerators, or smart-home gadgets a la the Amazon Echo and Jibo that rely heavily on voice recognition.

Lacroix said all the possibilities, and the interest it has received from folks who’ve seen and heard about the technology, are great, but noted that it might lead such a small company to suffer from a lack of focus or perhaps option paralysis.

“It’s overwhelming. We are a small company, and people get very excited,” he said. “… We cannot do everything. That’s a challenge for us.”

New to deep learning? Here are 4 easy lessons from Google

Google employs some of the world’s smartest researchers in deep learning and artificial intelligence, so it’s not a bad idea to listen to what they have to say about the space. One of those researchers, senior research scientist Greg Corrado, spoke at RE:WORK’s Deep Learning Summit on Thursday in San Francisco and gave some advice on when, why and how to use deep learning.

His talk was pragmatic and potentially very useful for folks who have heard about deep learning and how great it is — well, at computer vision, language understanding and speech recognition, at least — and are now wondering whether they should try using it for something. The TL;DR version is “maybe,” but here’s a little more nuanced advice from Corrado’s talk.

(And, of course, if you want to learn even more about deep learning, you can attend Gigaom’s Structure Data conference in March and our inaugural Structure Intelligence conference in September. You can also watch the presentations from our Future of AI meetup, which was held in late 2014.)

1. It’s not always necessary, even if it would work

Probably the most-useful piece of advice Corrado gave is that deep learning isn’t necessarily the best approach to solving a problem, even if it would offer the best results. Presently, it’s computationally expensive (in all meanings of the word), it often requires a lot of data (more on that later) and probably requires some in-house expertise if you’re building systems yourself.

So while deep learning might ultimately work well on pattern-recognition tasks on structured data — fraud detection, stock-market prediction or analyzing sales pipelines, for example — Corrado said it’s easier to justify in the areas where it’s already widely used. “In machine perception, deep learning is so much better than the second-best approach that it’s hard to argue with,” he explained, while the gap between deep learning and other options is not so great in other applications.

That being said, I found myself in multiple conversations at the event centered around the opportunity to soup up existing enterprise software markets with deep learning and met a few startups trying to do it. In an on-stage interview I did with Baidu’s Andrew Ng (who worked alongside Corrado on the Google Brain project) earlier in the day, he noted how deep learning is currently powering some ad serving at Baidu and suggested that data center operations (something Google is actually exploring) might be a good fit.

Greg Corrado


2. You don’t have to be Google to do it

Even when companies do decide to take on deep learning work, they don’t need to aim for systems as big as those at Google or Facebook or Baidu, Corrado said. “The answer is definitely not,” he reiterated. “. . . You only need an engine big enough for the rocket fuel available.”

The rocket analogy is a reference to something Ng said in our interview, explaining the tight relationship between systems design and data volume in deep learning environments. Corrado explained that Google needs a huge system because it’s working with huge volumes of data and needs to be able to move quickly as its research evolves. But if you know what you want to do or don’t have major time constraints, he said, smaller systems could work just fine.

For getting started, he added later, a desktop computer could actually work provided it has a sufficiently capable GPU.

3. But you probably need a lot of data

However, Corrado cautioned, it’s no joke that training deep learning models really does take a lot of data. Ideally as much as you can get your hands on. If he’s advising executives on when they should consider deep learning, it pretty much comes down to (a) whether they’re trying to solve a machine perception problem and/or (b) whether they have “a mountain of data.”

If they don’t have a mountain of data, he might suggest they get one. At least 100 trainable observations per feature you want to train is a good start, he said, adding that it’s possible to waste months of effort trying to optimize a model when the problem would have been solved a lot quicker if you had just spent some time gathering training data early on.
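As a quick illustration of that rule of thumb, the sketch below turns "100 observations per feature" into a minimum dataset size. It assumes the rule is applied as a simple multiple of the feature count, which is one reading of Corrado's remark and a floor rather than a guarantee; the function names are invented for this example.

```python
MIN_OBSERVATIONS_PER_FEATURE = 100  # Corrado's suggested starting point


def minimum_training_examples(num_features):
    """Rough lower bound on training-set size under the rule of thumb."""
    return num_features * MIN_OBSERVATIONS_PER_FEATURE


def has_enough_data(num_examples, num_features):
    """True if the dataset meets the rule-of-thumb minimum."""
    return num_examples >= minimum_training_examples(num_features)


print(minimum_training_examples(50))  # 5000
print(has_enough_data(2_000, 50))     # False
```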

Corrado said he views his job not as building intelligent computers (artificial intelligence) or building computers that can learn (machine learning), but as building computers that can learn to be intelligent. And, he said, “You have to have a lot of data in order for that to work.”

Source: Google

Training a system that can do this takes a lot of data.

4. It’s not really based on the brain

Corrado received his Ph.D. in neuroscience and worked on IBM’s SyNAPSE neurosynaptic chip before coming to Google, and says he feels confident in saying that deep learning is only loosely based on how the brain works. And that’s based on what little we know about the brain to begin with.

Earlier in the day, Ng said about the same thing. To drive the point home, he noted that while many researchers believe we learn in an unsupervised manner, most production deep learning models today are still trained in a supervised manner. That is, they analyze lots of labeled images, speech samples or whatever in order to learn what each one is.

And comparisons to the brain, while easier than nuanced explanations, tend to lead to overinflated connotations about what deep learning is or might be capable of. “This analogy,” Corrado said, “is now officially overhyped.”

Update: This post was updated on Feb. 2 to correct a statement about Corrado’s tenure at Google. He was with the company before Andrew Ng and the Google Brain project, and was not recruited by Ng to work on it, as originally reported.

Baidu built a supercomputer for deep learning

Chinese search engine company Baidu says it has built the world’s most-accurate computer vision system, dubbed Deep Image, which runs on a supercomputer optimized for deep learning algorithms. Baidu claims a 5.98 percent error rate on the ImageNet object classification benchmark; a team from Google won the 2014 ImageNet competition with a 6.66 percent error rate.

In experiments, humans achieved an estimated error rate of 5.1 percent on the ImageNet dataset.

The star of Deep Image is almost certainly the supercomputer, called Minwa, which Baidu built to house the system. Deep learning researchers have long (well, for the past few years) used GPUs in order to handle the computational intensity of training their models. In fact, the Deep Image research paper cites a study showing that 12 GPUs in a 3-machine cluster can rival the performance of the 1,000-node CPU cluster behind the famous Google Brain project, on which Baidu Chief Scientist Andrew Ng worked.


But no one has yet built a system like this dedicated to the task of computer vision using deep learning. Here’s how paper author Ren Wu, a distinguished scientist at the Baidu Institute of Deep Learning, describes its specifications:

[blockquote person=”” attribution=””]It is comprised of 36 server nodes, each with 2 six-core Intel Xeon E5-2620 processors. Each server contains 4 Nvidia Tesla K40m GPUs and one FDR InfiniBand (56Gb/s) which is a high-performance low-latency interconnection and supports RDMA. The peak single precision floating point performance of each GPU is 4.29TFlops and each GPU has 12GB of memory.

… In total, Minwa has 6.9TB host memory, 1.7TB device memory, and about 0.6 [petaflops] theoretical single precision peak performance.[/blockquote]
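The totals in that description can be checked against the per-node figures with a few lines of arithmetic. This sketch assumes decimal units (1TB = 1,000GB, 1 petaflops = 1,000 teraflops); the variable names are mine, and the per-node host memory is derived from the quoted total rather than stated in the paper excerpt.

```python
NODES = 36
GPUS_PER_NODE = 4
TFLOPS_PER_GPU = 4.29   # peak single precision, per the quoted spec
GPU_MEMORY_GB = 12
HOST_MEMORY_TB = 6.9    # total, per the quoted spec

total_gpus = NODES * GPUS_PER_NODE
peak_petaflops = total_gpus * TFLOPS_PER_GPU / 1000
device_memory_tb = total_gpus * GPU_MEMORY_GB / 1000
host_memory_per_node_gb = HOST_MEMORY_TB * 1000 / NODES  # derived, not quoted

print(f"GPUs: {total_gpus}")
print(f"Peak: {peak_petaflops:.2f} petaflops")
print(f"Device memory: {device_memory_tb:.2f} TB")
print(f"Host memory per node: {host_memory_per_node_gb:.0f} GB")
```

The 144 GPUs work out to roughly 0.62 petaflops and 1.73TB of device memory, consistent with the rounded 0.6 petaflops and 1.7TB that Wu cites.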

Sheer performance aside, Baidu built Minwa to help overcome problems associated with the types of algorithms on which Deep Image was trained. “Given the properties of stochastic gradient descent algorithms, it is desired to have very high bandwidth and ultra low latency interconnects to minimize the communication costs, which is needed for the distributed version of the algorithm,” the authors wrote.

A sample of the effects Baidu used to augment images.


Having such a powerful system also allowed the researchers to work with different, and arguably better, training data than most other deep learning projects. Rather than using the 256 x 256-pixel images commonly used, Baidu used higher-resolution images (512 x 512 pixels) and augmented them with various effects such as color-casting, vignetting and lens distortion. The goal was to let the system take in more features of smaller objects and to learn what objects look like without being thrown off by editing choices, lighting situations or other extraneous factors.
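To give a feel for what such augmentations do, here is a simplified sketch of two of the effects named above, color-casting and vignetting, written in plain Python on a tiny image of RGB tuples. It is not Baidu's implementation: the formulas, parameter names and the 3 x 3 test image are assumptions chosen for readability, and lens distortion is left out.

```python
def color_cast(image, channel, strength):
    """Shift one RGB channel (0=R, 1=G, 2=B) by `strength`, clamped to 0..255."""
    return [
        [
            tuple(
                min(255, max(0, value + strength)) if c == channel else value
                for c, value in enumerate(pixel)
            )
            for pixel in row
        ]
        for row in image
    ]


def vignette(image, falloff):
    """Darken pixels in proportion to their squared distance from the center.

    falloff=0 leaves the image unchanged; falloff=1 makes the corners black.
    """
    height, width = len(image), len(image[0])
    cy, cx = (height - 1) / 2, (width - 1) / 2
    max_dist_sq = cy ** 2 + cx ** 2 or 1.0  # avoid dividing by zero on 1x1 images
    out = []
    for y, row in enumerate(image):
        new_row = []
        for x, pixel in enumerate(row):
            dist_sq = (y - cy) ** 2 + (x - cx) ** 2
            scale = 1.0 - falloff * dist_sq / max_dist_sq
            new_row.append(tuple(int(round(v * scale)) for v in pixel))
        out.append(new_row)
    return out


# A tiny 3x3 mid-gray image stands in for a 512 x 512 training image.
gray = [[(128, 128, 128)] * 3 for _ in range(3)]
augmented = vignette(color_cast(gray, channel=0, strength=40), falloff=0.5)
print(augmented[1][1])  # (168, 128, 128): center keeps the red cast, no darkening
print(augmented[0][0])  # (84, 64, 64): corner is darkened by the vignette
```

Training on many randomly varied copies like these is what lets a model learn that a red-tinted, dark-cornered photo of a dog is still a dog.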

Baidu is investing heavily in deep learning, and Deep Image follows up a speech-recognition system called Deep Speech that the company made public in December. As executives there have noted before, including Ng at our recent Future of AI event in September, the company already sees a relatively high percentage of voice and image searches and expects that number to increase. The better its products can perform with real-world data (research datasets tend to be fairly optimal), the better the user experience will be.


However, Baidu is far from the only company (especially on the web) investing significant resources into deep learning and getting impressive results. Google, which still holds the ImageNet record in the actual competition, is probably the company most associated with deep learning and this week unveiled new Google Translate features that likely utilize the technology. Microsoft and Facebook also have very well-respected deep learning researchers and continue to do cutting-edge research in the space while releasing products that use that research.

Yahoo, Twitter, Dropbox and other companies also have deep learning and computer vision teams in place.

Our Structure Data conference, which takes place in March, will include deep learning and machine learning experts from many organizations, including Facebook, Yahoo, NASA, IBM, Enlitic and Spotify.


Machine learning will eventually solve your JPEG problem

I take a lot of photos on my smartphone. So many, in fact, that my wife calls me Cellphone Ansel Adams. I can’t imagine how many more digital photos we’d have cluttering up our hard drives and cloud drives if I ever learned how to really use the DSLR.

So I get excited when I read and write about all the advances in computer vision, whether they’re the result of deep learning or some other technique, and all the photo-related acquisitions in that space (Google, Yahoo, Pinterest, Dropbox and Twitter have all bought computer vision startups). I’m well aware there are much wider-ranging and important implications, from better image-search online to disease detection — and we’ll discuss them all at our Structure Data conference in March — but I personally love being able to search through my photos by keyword even though I haven’t tagged them (we’ll probably discuss that at Structure Data, too).

A sample of the results when I search my Google+ photos for "lake."


I love that Google+ can detect a good photo, or series of photos, and then spice it up with some Auto-Awesome.


Depending on the service you use to manage photos, there has never been a better time to take too many of them.

If there’s one area that has lagged, though, it’s the creation of curated photo albums. Sometimes Google makes them for me and, although I like it in theory (especially for sharing an experience in a neatly packaged way), they’re usually not that good. It will be an album titled “Trip to New York and Jersey City,” for example, and will indeed include a handful of photos I took in New York, just usually not the ones I would have selected.

Although I’m not about to go through my thousands of photos (or even dozens of photos the day after a trip) and create albums, I’ll gladly let a service do it for me. But it’s only if the albums are good that I’ll do something beyond glance at them. Usually, I love getting the alert that an album is ready, and then get over the excitement really quickly.

So I was interested to read a new study by Disney Research discussing how its researchers have developed an algorithm that creates photo albums based on more factors than just time and geography, or even whether photos are “good.” The full paper goes into a lot more detail about how they trained the system (sorry, no deep learning) but this description from a press release about it sums up the results nicely:

[blockquote person=”” attribution=””]To create a computerized system capable of creating a compelling visual story, the researchers built a model that could create albums based on variety of photo features, including the presence or absence of faces and their spatial layout; overall scene textures and colors; and the esthetic quality of each image.

Their model also incorporated learned rules for how albums are assembled, such as preferences for certain types of photos to be placed at the beginning, in the middle and at the end of albums. An album about a Disney World visit, for instance, might begin with a family photo in front of Cinderella’s castle or with Mickey Mouse. Photos in the middle might pair a wide shot with a close-up, or vice versa. Exclusionary rules, such as avoiding the use the same type of photo more than once, were also learned and incorporated.[/blockquote]
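To make the rules in that description a little more concrete, here is a toy sketch in Python. It is not the researchers' method: their model learned its rules from data, while this version hard-codes two of them (an opening preference and the no-repeats exclusion rule). The `Photo` fields, the `OPENERS` set, the `build_album` function and the sample file names are all hypothetical, invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Photo:
    name: str
    kind: str       # hypothetical label such as "family", "wide" or "closeup"
    quality: float  # hypothetical aesthetic score between 0 and 1


# Hand-written stand-in for a learned rule: kinds preferred as the first photo.
OPENERS = {"family"}


def build_album(photos, size=3):
    """Greedy pick: best opener first, then the best photo of each unused kind."""
    ranked = sorted(photos, key=lambda p: p.quality, reverse=True)
    album, used_kinds = [], set()

    # Opening rule: start with the highest-quality opener, if one exists.
    for p in ranked:
        if p.kind in OPENERS:
            album.append(p)
            used_kinds.add(p.kind)
            break

    # Exclusion rule: never reuse a kind that is already in the album.
    for p in ranked:
        if len(album) >= size:
            break
        if p.kind not in used_kinds:
            album.append(p)
            used_kinds.add(p.kind)
    return album


photos = [
    Photo("castle_family.jpg", "family", 0.80),
    Photo("castle_wide.jpg", "wide", 0.95),
    Photo("castle_wide_2.jpg", "wide", 0.90),
    Photo("mickey_closeup.jpg", "closeup", 0.70),
    Photo("finger_over_lens.jpg", "closeup", 0.10),
]
album = build_album(photos)
print([p.name for p in album])
```

With this sample data, the family shot opens the album even though a wide shot scores higher, and the second wide shot and the finger-over-the-lens close-up are both left out.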


It’s just research and surely isn’t perfect, but it feels like a step in the right direction. It could make sharing photos so much easier and more enjoyable for everyone involved. There’s no doubt the folks at Google, Yahoo and elsewhere are already working on similar things so they can roll them out across services such as Flickr and Google+.

Remember physical slide shows with projectors? The same rules still apply: Your aunt and your friends don’t want to skip through 5 pictures of your finger over the lens, marvel at the beauty of the same rock formation shot from 23 slightly different angles, or laugh at that sign that’s only funny if you were there. They want a handful of pictures of you looking nice in front of famous landmarks or pretty sunsets. Probably on their phone while waiting in line at the checkout.

I don’t always have the self-control or editorial sense to deliver that experience. I’ll be happy if an algorithm can do it for me.

AI is coming to IoT, and not all the brains will be in the cloud

Smart devices, appliances and the internet of things are dominating International CES this week, but we’re probably just getting a small taste of what’s to come — not only in quantity, but also in capabilities. As consumers get used to buying so-called smart devices, they’re eventually going to expect them to actually be smart. They might even expect them to be smart all the time.

So far, this type of expectation has been kind of problematic for devices and apps trying to perform feats of artificial intelligence such as computer vision. The current status quo of offloading processing to the cloud, which is the preferred method of web companies like Google, Microsoft and Baidu (for Android speech recognition, for example), works well enough computationally but can lag in terms of latency and reliability.

Running those types of algorithms locally hasn’t been too feasible historically because they’re often computationally intensive (especially in the training process), and low-power smartphone and embedded processors haven’t been up to the task. But the times, they are a-changin’.


Take, for example, the new mobile GPU, called the Tegra X1, that Nvidia announced over the weekend. Its teraflop of computing performance is impressive, but less so than what the company hopes it will be used for. It’s the foundation of the company’s new DRIVE PX automotive computer (pictured above), which Nvidia claims will allow cars to spot available parking spaces, park themselves and pick up drivers like a valet, and be able to distinguish between the various types of vehicles a car might encounter while on the road.

These capabilities “draw heavily on recent developments in computer vision and deep learning,” according to an Nvidia press release.

Indeed, Nvidia spotted a potential goldmine in the machine learning space a while ago as research teams began setting record after record in computer-vision competitions by training deep learning networks on GPU-powered systems. It has been releasing development kits and software libraries ever since to make it as easy as possible to embed its GPUs and program deep learning systems that can run on them.

There’s a decent-enough business selling GPUs to the webscale companies driving deep learning research (Baidu actually claims to have the biggest and best GPU-powered deep learning infrastructure around), but that’s nothing compared with the potential of being able to put a GPU in every car, smartphone and robot manufactured over the next decade.


Nvidia is not the only company hoping to capitalize on the smart-device gold rush. IBM has built a low-power neurosynaptic (i.e., modeled after the brain) chip called SyNAPSE that’s designed specifically for machine learning tasks such as object recognition and that consumes less than a tenth of a watt of power. Qualcomm has built a similar learning chip called Zeroth that it hopes to embed within the next generation of devices. (The folks responsible for building both will be speaking at our Structure Data conference this March in New York.)

A startup called TeraDeep says it’s working on deep learning algorithms that can run on traditional ARM and other mobile processor platforms. I’ve seen other demos of deep learning algorithms running on smartphones; one was created by Jetpac co-founder Pete Warden (whose company was acquired by Google in August) and the other was an early version of technology from a stealth-mode startup called Perceptio (the founders of which Re/code profiled in this piece). TeraDeep, however, hopes to take things a step further by releasing a line of deep learning modules that can be embedded directly into other connected devices, as well.


Among the benefits that companies such as Google hope to derive from quantum computing is the ability to develop quantum machine learning algorithms that can run on mobile phones — and presumably other connected devices — while consuming very little power.

Don’t get me wrong, though: cloud computing will still play a big role for consumers as AI makes its way further into the internet of things. The cloud will still process data for applications that analyze aggregate user data, and it will still provide the computing brains for stuff that’s too small and cheap to justify any meaningful type of chip. But soon, it seems, we at least won’t have to do without all the smarts of our devices just because we’re without an internet connection.

A startup wants to quantify video content using computer vision

Computer vision has seen some major advances over the past couple of years, and a New York-based startup called Dextro wants to take the field to a new level by making it easier to quantify what the computers are seeing. Founded in 2012 by a pair of Ivy League graduates, the company is building an object-recognition platform that it says excels on busy images and lets users query their videos using an API a la other unstructured datasets.

The idea behind Dextro, according to co-founder David Luan, is to evolve computer vision services beyond tagging and into something more useful. He characterizes the difference between Dextro and most other computer vision startups (MetaMind, AlchemyAPI and Clarifai, for example) in terms of categorization versus statistics. Tagging photos automatically is great for image search and bringing order to stockpiles of unlabeled pictures, “but we found that most of the value and most of the interest … is when people know what they’re trying to get out of it,” he said.

Dextro has created an API that lets users query their images and, now, videos for specific categories of objects and receive results as JSON records. This way, they can analyze, visualize or otherwise use that data just like they might do with records containing usage metrics for websites or mobile apps. People might want to ask, for example, how many of their images contain certain objects, at what time within a video certain objects tend to appear, or what themes are the most present among their content libraries.

“You have a question about your data,” he said, “let’s help you answer it.”
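As a rough illustration of the kind of question Luan describes, the Python sketch below counts how many videos contain each object and when an object first appears. Everything about the data is an assumption: the record fields (`video`, `label`, `seconds`, `score`), the `summarize` helper and the sample values are hypothetical and are not Dextro’s actual response schema.

```python
import json
from collections import Counter, defaultdict

# Hypothetical detection records; the field names are assumptions made for
# illustration, not Dextro's actual response format.
RAW = """
[
  {"video": "install.mp4", "label": "toilet", "seconds": 4.0,  "score": 0.97},
  {"video": "install.mp4", "label": "toilet", "seconds": 31.5, "score": 0.97},
  {"video": "install.mp4", "label": "bed",    "seconds": 58.0, "score": 0.71},
  {"video": "tour.mp4",    "label": "bed",    "seconds": 12.0, "score": 0.88},
  {"video": "tour.mp4",    "label": "pistol", "seconds": 20.0, "score": 0.12}
]
"""


def summarize(records, min_score=0.5):
    """Count videos per label and record each label's earliest time per video."""
    videos_per_label = defaultdict(set)
    first_seen = {}
    for r in records:
        if r["score"] < min_score:
            continue  # ignore low-confidence detections
        videos_per_label[r["label"]].add(r["video"])
        key = (r["video"], r["label"])
        first_seen[key] = min(first_seen.get(key, r["seconds"]), r["seconds"])
    counts = Counter({label: len(v) for label, v in videos_per_label.items()})
    return counts, first_seen


records = json.loads(RAW)
counts, first_seen = summarize(records)
print(counts.most_common())                    # labels ranked by number of videos
print(first_seen[("install.mp4", "toilet")])   # earliest toilet sighting, in seconds
```

Because the results are plain records, the same aggregation could just as easily be done in a spreadsheet or a SQL database, which is the point of treating video like any other dataset.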

I used Dextro's video demo to search a YouTube video (about installing a toilet) for toilets, beds and pistols.


Aside from the ability to query image and video data, Dextro is trying to differentiate itself by training its vision models to detect objects and themes within chaotic scenes (not nicely focused, single-subject, or what Luan calls “iconic,” shots) and by analyzing videos as they are. “There’s so much information about your video that you lose by chopping it up into frames,” Luan said.

Turns out there really is a bed in it, too.


He’s quick to note that although Dextro uses deep learning as part of its secret sauce, it’s not a deep learning company.

In fact, focusing on a narrow set of technologies or use cases is just the opposite of what he and co-founder Sanchit Arora hope the company will become. Luan already tried that in 2011 when he left Yale, accepted a Thiel Fellowship (he completed his bachelor’s degree at Yale in 2013), and took a first stab at the company as a computer vision and manipulation platform for robots. The name Dextro is a play on “dextrous manipulation.”

Although he and Arora both have lots of experience in robotics, Luan said the present incarnation of Dextro (which has raised $1.56 million in seed funding from a group of investors that includes Yale, Two Sigma Ventures and KBS+ Ventures) aims to be a general-purpose platform. Robots could eventually be a great form factor for the type of platform the company is building, but that market isn’t big enough just yet and there’s so much video being generated elsewhere.

David Luan (second from left) speaking at a Yale event.


And like most machine learning systems, the more that Dextro’s system sees, the smarter it gets. Luan thinks computer vision platforms will ultimately be a winner-take-all space, with the company analyzing the most and best content having the most-accurate models. “We want to power all the cameras and visual datasets out there,” he said.

That’s a lofty, and perhaps unrealistic, goal, but it’s indicative of the excitement surrounding the fields that companies like Dextro are playing in. One of the themes of our upcoming Structure Data conference is the convergence of artificial intelligence, robotics, analytics, and business that’s happening right now and changing how people think about their data. As computers get better at reading and analyzing data such as pictures, video and text, the onus falls on innovative users to figure out how to take advantage of it.