Microsoft is building fast, low-power neural networks with FPGAs

Microsoft on Monday released a white paper explaining a current effort to run convolutional neural networks — the deep learning technique responsible for record-setting computer vision algorithms — on FPGAs rather than GPUs.

Microsoft claims that new FPGA designs provide greatly improved processing speed over earlier versions while consuming a fraction of the power of GPUs. This type of work could represent a big shift in deep learning if it catches on, because for the past few years the field has been largely centered around GPUs as the computing architecture of choice.

If there’s a major caveat to Microsoft’s efforts, it might have to do with performance. While Microsoft’s research shows FPGAs consuming about one-tenth the power of high-end GPUs (25W compared with 235W), GPUs still process images at a much higher rate. Nvidia’s Tesla K40 GPU can do between 500 and 824 images per second on one popular benchmark dataset, the white paper claims, while Microsoft predicts its preferred FPGA chip — the Altera Arria 10 — will be able to process about 233 images per second on the same dataset.

However, the paper’s authors note that performance per processor is relative because a multi-FPGA cluster could match a single GPU while still consuming much less power: “In the future, we anticipate further significant gains when mapping our design to newer FPGAs . . . and when combining a large number of FPGAs together to parallelize both evaluation and training.”

In a Microsoft Research blog post, processor architect Doug Burger wrote, “We expect great performance and efficiency gains from scaling our [convolutional neural network] engine to Arria 10, conservatively estimated at a throughput increase of 70% with comparable energy used.”
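
To make the efficiency argument concrete, here is a rough back-of-envelope calculation in Python. It simply pairs the power and throughput figures cited above; the images-per-watt and cluster-size numbers are my own arithmetic, not figures from the white paper.

```python
# Back-of-envelope comparison using only the figures cited in the article.
# Rough arithmetic; real throughput depends on the model, batch size and benchmark.
import math

K40_WATTS = 235            # Nvidia Tesla K40 power draw
K40_IMG_PER_SEC = 824      # upper end of the 500-824 images/sec range

FPGA_WATTS = 25            # the FPGA power figure cited above
FPGA_IMG_PER_SEC = 233     # Microsoft's predicted Arria 10 throughput

gpu_eff = K40_IMG_PER_SEC / K40_WATTS      # ~3.5 images/sec per watt
fpga_eff = FPGA_IMG_PER_SEC / FPGA_WATTS   # ~9.3 images/sec per watt

# How many FPGAs would it take to match the single GPU's peak throughput,
# and what would that small cluster draw?
fpgas_needed = math.ceil(K40_IMG_PER_SEC / FPGA_IMG_PER_SEC)   # 4 boards
cluster_watts = fpgas_needed * FPGA_WATTS                      # 100 W vs. 235 W

print(f"GPU:  {gpu_eff:.1f} images/sec per watt")
print(f"FPGA: {fpga_eff:.1f} images/sec per watt")
print(f"{fpgas_needed} FPGAs roughly match one K40 at {cluster_watts} W")
```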

This is not Microsoft’s first rodeo when it comes to deploying FPGAs within its data centers; in fact, the work is an outgrowth of an earlier project. Last summer, the company detailed a research project called Catapult in which it was able to improve the speed and performance of Bing’s search-ranking algorithms by adding FPGA co-processors to each server in a rack. The company intends to port production Bing workloads onto the Catapult architecture later this year.

There have also been other attempts to port deep learning algorithms onto FPGAs, including one by State University of New York at Stony Brook professors and another by Chinese search giant Baidu. Ironically, Baidu Chief Scientist and deep learning expert Andrew Ng is a big proponent of GPUs, and the company has built a massive GPU-based deep learning system as well as a GPU-based supercomputer designed for computer vision. But this needn’t be an either/or situation: companies could still use GPUs to maximize performance while training their models, and then port them to FPGAs for production workloads.
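
To picture that train-on-GPU, serve-on-FPGA workflow, here is a minimal, purely illustrative sketch: a tiny model is trained in floating point (the stage GPUs would accelerate), and its frozen weights are then handed to a stand-in serving runtime. The FpgaRuntime class is a hypothetical placeholder rather than any real vendor API, and the toy logistic-regression model stands in for a full convolutional network.

```python
import numpy as np

# Stage 1: "train" a toy model in floating point (the work GPUs would accelerate).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))                      # toy training data
y = (X @ rng.normal(size=16) > 0).astype(float)      # toy labels

w = np.zeros(16)
for _ in range(200):                                  # plain gradient descent
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    w -= 0.1 * X.T @ (p - y) / len(y)

# Stage 2: port the frozen weights to a serving accelerator.
class FpgaRuntime:
    """Hypothetical stand-in for a toolchain that loads a trained model onto an FPGA."""
    def __init__(self, weights):
        self.weights = weights                        # in reality: compile for the board
    def predict(self, x):
        return 1.0 / (1.0 + np.exp(-(x @ self.weights))) > 0.5

accelerator = FpgaRuntime(w)                          # the "port to FPGA" step
print(accelerator.predict(X[:5]))                     # serve queries on the accelerator
```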

Expect to hear more about the future of deep learning architectures and applications at Gigaom’s Structure Data conference March 18 and 19 in New York, which features experts from Facebook, Microsoft and elsewhere. Our Structure Intelligence conference, September 22-23 in San Francisco, will dive even deeper into deep learning, as well as the broader field of artificial intelligence algorithms and applications.

TeraDeep wants to bring deep learning to your dumb devices

Open the closet of any gadget geek or computer nerd, and you’re likely to find a lot of skeletons. Stacked deep in a cardboard box or Tupperware tub, there they are: The remains of webcams, routers, phones and other devices deemed too obsolete to keep using and left to rot, metaphorically speaking, until they eventually find their way to a Best Buy recycling bin.

However, an under-the-radar startup called TeraDeep has developed a way to revive at least a few of those old devices by giving them the power of deep learning. The company has built a module that it calls the CAMCUE, which runs on an ARM-based processor and is designed to plug into other gear and run deep neural network algorithms on the inputs they send through. It could turn an old webcam into something with the smart features of a Dropcam, if not smarter.

“You can basically turn our little device into anything you want,” said TeraDeep co-founder and CTO Eugenio Culurciello during a recent interview. That potential is why the company won a Structure Data award as one of the most-promising startups to launch in 2014, and will be presenting at our Structure Data conference in March.

Didier Lacroix (left) and Eugenio Culurciello (right)

But before TeraDeep can start transforming the world’s dumb gear into smart gear, the company needs to grow — a lot. It’s headquartered in San Mateo, California, and is the brainchild of Culurciello, who moonlights as an associate professor of engineering at Purdue University in Indiana. It has 10 employees, only three of whom are full-time. It has a prototype of the CAMCUE, but isn’t ready to start mass-producing the modules and getting them into developers’ hands.

I recently saw a prototype at a deep learning conference in San Francisco, and was impressed by how well it worked, albeit in a simple use case. Culurciello hooked the CAMCUE up to a webcam and to a laptop, and as he panned the camera, the display on the laptop screen would flag the presence of a human whenever I was in the shot.

“As long as you look human-like, it’s going to detect you,” he said.

The prototype system can be set to detect a number of objects, including iPhones, which it recognized when the phone was held vertically.

The webcam setup on a conference table.

TeraDeep has also developed a web application, software libraries and a cloud platform that Culurciello said should make it fairly easy for power users and application developers at first, and perhaps everyday consumers later on, to train TeraDeep-powered devices to do what they want. It could be “as easy as uploading a bunch of images,” he said.

“You don’t need to be a programmer to make these things do magic,” TeraDeep CEO Didier Lacroix added.

But Culurciello and Lacroix have bigger plans for the company’s technology — which is the culmination of several years of work by Culurciello to develop specialized hardware for neural network algorithms — than just turning old webcams into smarter webcams. They’d like the company to become a platform player in the emerging artificial intelligence market, selling embedded hardware and software to fulfill the needs of hobbyists and large-scale device manufacturers alike.

A TeraDeep module, up close.

It already has a few of the pieces in place. Aside from the CAMCUE module, which Lacroix said will soon shrink to about the surface area of a credit card, the company has also tuned its core technology (called nn-x, or neural network accelerator) to run on existing smartphone platforms. This means developers could build mobile apps that do computer vision at high speed and low power without relying on GPUs.

TeraDeep has also worked on system-on-a-chip designs for partners that might want to embed more computing power into their devices. Think drones, cars and refrigerators, or smart-home gadgets a la the Amazon Echo and Jibo that rely heavily on voice recognition.

Lacroix said all the possibilities, and the interest the company has received from folks who’ve seen and heard about the technology, are great, but he noted that they might lead such a small company to suffer from a lack of focus or option paralysis.

“It’s overwhelming. We are a small company, and people get very excited,” he said. “… We cannot do everything. That’s a challenge for us.”

Baidu is trying to speed up image search using FPGAs

Chinese search engine Baidu is trying to speed up the performance of its deep learning models for image search using field-programmable gate arrays, or FPGAs, made by Altera. Baidu has been experimenting with FPGAs for a while (including with gear from Altera rival Xilinx) as a means of boosting performance on its convolutional neural networks without having to go whole hog down the GPU route. FPGAs are likely most applicable in production data centers, where they can be paired with existing CPUs to serve queries, while GPUs can still power much of the behind-the-scenes training of deep learning models.