Baidu built a supercomputer for deep learning

Chinese search engine company Baidu says it has built the world’s most-accurate computer vision system, dubbed Deep Image, which runs on a supercomputer optimized for deep learning algorithms. Baidu claims a 5.98 percent error rate on the ImageNet object classification benchmark; a team from Google won the 2014 ImageNet competition with a 6.66 percent error rate.

In experiments, humans achieved an estimated error rate of 5.1 percent on the ImageNet dataset.

The star of Deep Image is almost certainly the supercomputer, called Minwa, which Baidu built to house the system. Deep learning researchers have long (well, for the past few years) used GPUs to handle the computational intensity of training their models. In fact, the Deep Image research paper cites a study showing that 12 GPUs in a 3-machine cluster can rival the performance of the 1,000-node CPU cluster behind the famous Google Brain project, on which Baidu Chief Scientist Andrew Ng worked.


But no one has yet built a system like this dedicated to the task of computer vision using deep learning. Here’s how paper author Ren Wu, a distinguished scientist at the Baidu Institute of Deep Learning, describes its specifications:

[blockquote person=”” attribution=””]It is comprised of 36 server nodes, each with 2 six-core Intel Xeon E5-2620 processors. Each server contains 4 Nvidia Tesla K40m GPUs and one FDR InfiniBand (56Gb/s) which is a high-performance low-latency interconnection and supports RDMA. The peak single precision floating point performance of each GPU is 4.29TFlops and each GPU has 12GB of memory.

… In total, Minwa has 6.9TB host memory, 1.7TB device memory, and about 0.6 [petaflops] theoretical single precision peak performance.[/blockquote]

Sheer performance aside, Baidu built Minwa to help overcome problems associated with the types of algorithms on which Deep Image was trained. “Given the properties of stochastic gradient descent algorithms, it is desired to have very high bandwidth and ultra low latency interconnects to minimize the communication costs, which is needed for the distributed version of the algorithm,” the authors wrote.
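The communication cost the authors describe comes from workers exchanging gradients on every training step. A minimal NumPy sketch of synchronous data-parallel SGD, one common distributed scheme, shows where that exchange happens. This is purely illustrative and is not Baidu's actual implementation; the function and variable names are made up for the example.

```python
import numpy as np

def sgd_step_data_parallel(w, shards, grad_fn, lr=0.01):
    """One synchronous data-parallel SGD step.

    Each worker computes a gradient on its own data shard; the gradients
    are then averaged (the communication step whose cost the interconnect
    must keep low) before a single shared update is applied.
    """
    grads = [grad_fn(w, shard) for shard in shards]  # runs in parallel in practice
    avg_grad = np.mean(grads, axis=0)                # the all-reduce / averaging step
    return w - lr * avg_grad

# Toy example: least-squares gradients on synthetic data shards.
def lsq_grad(w, shard):
    X, y = shard
    return 2 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0])
shards = []
for _ in range(4):                      # 4 simulated workers
    X = rng.normal(size=(32, 2))
    shards.append((X, X @ w_true))

w = np.zeros(2)
for _ in range(200):
    w = sgd_step_data_parallel(w, shards, lsq_grad, lr=0.05)
```

Because every step waits on the gradient average, the interconnect's bandwidth and latency sit directly on the critical path, which is the point the Baidu authors are making about Minwa's InfiniBand fabric.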

A sample of the effects Baidu used to augment images.

Having such a powerful system also allowed the researchers to work with different, and arguably better, training data than most other deep learning projects. Rather than using the 256 x 256-pixel images commonly used, Baidu used higher-resolution images (512 x 512 pixels) and augmented them with various effects such as color-casting, vignetting and lens distortion. The goal was to let the system take in more features of smaller objects and to learn what objects look like without being thrown off by editing choices, lighting situations or other extraneous factors.
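For illustration, two of the named effects, color-casting and vignetting, can be sketched in a few lines of NumPy. This is a toy version under assumed parameters; the paper does not describe Baidu's augmentation pipeline at this level of detail.

```python
import numpy as np

def color_cast(img, shift):
    """Add a per-channel color shift (values clipped to the 0-255 range)."""
    return np.clip(img.astype(np.int16) + np.asarray(shift), 0, 255).astype(np.uint8)

def vignette(img, strength=0.5):
    """Darken pixels toward the corners with a radial falloff mask."""
    h, w = img.shape[:2]
    ys, xs = np.ogrid[:h, :w]
    cy, cx = (h - 1) / 2, (w - 1) / 2
    r = np.sqrt((ys - cy) ** 2 + (xs - cx) ** 2)
    mask = 1.0 - strength * (r / r.max()) ** 2   # 1.0 at center, 1-strength at corners
    return (img * mask[..., None]).astype(np.uint8)

# A flat gray 512 x 512 test image, matching the resolution cited in the article.
img = np.full((512, 512, 3), 128, dtype=np.uint8)
augmented = vignette(color_cast(img, (20, -10, 0)))
```

Training on many such variants of each image is what lets the model learn that a vignetted or color-shifted dog is still a dog.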

Baidu is investing heavily in deep learning, and Deep Image follows up a speech-recognition system called Deep Speech that the company made public in December. As executives there have noted before, including Ng at our recent Future of AI event in September, the company already sees a relatively high percentage of voice and image searches and expects that number to increase. The better its products can perform with real-world data (research datasets tend to be fairly optimal), the better the user experience will be.


However, Baidu is far from the only company — especially on the web — investing significant resources in deep learning and getting impressive results. Google, which still holds the ImageNet record in the actual competition, is probably the company most associated with deep learning and this week unveiled new Google Translate features that likely use the technology. Microsoft and Facebook also have very well-respected deep learning researchers and continue to do cutting-edge research in the space while releasing products that use that research.

Yahoo, Twitter, Dropbox and other companies also have deep learning and computer vision teams in place.

Our Structure Data conference, which takes place in March, will include deep learning and machine learning experts from many organizations, including Facebook, Yahoo, NASA, IBM, Enlitic and Spotify.


Baidu claims deep learning breakthrough with Deep Speech

Chinese search engine giant Baidu says it has developed a speech recognition system, called Deep Speech, the likes of which have never been seen, especially in noisy environments. In restaurant settings and other loud places where other commercial speech recognition systems fail, the deep learning model proved accurate nearly 81 percent of the time.

That might not sound too great, but consider the alternative: commercial speech-recognition APIs against which Deep Speech was tested, including those for [company]Microsoft[/company] Bing, [company]Google[/company] and Wit.AI, topped out at nearly 65 percent accuracy in noisy environments. Those results probably underestimate the difference in accuracy, said [company]Baidu[/company] Chief Scientist Andrew Ng, who worked on Deep Speech along with colleagues at the company’s artificial intelligence lab in Palo Alto, California, because his team could only compare accuracy where the other systems all returned results rather than empty strings.


Ng said that while the research is still just research for now, Baidu is definitely considering integrating it into its speech-recognition software for smartphones and connected devices such as Baidu Eye. The company is also working on an Amazon Echo-like home appliance called CoolBox, and even a smart bike.

“Some of the applications we already know about would be much more awesome if speech worked in noisy environments,” Ng said.

Deep Speech also outperformed, by about 9 percent, top academic speech-recognition models on a popular dataset called Hub5’00. The system is based on a type of recurrent neural network, an architecture often used for speech recognition and text analysis. Ng credits much of the success to Baidu’s massive GPU-based deep learning infrastructure, as well as to the novel way the team built up a training set of 100,000 hours of speech data to train the system for noisy situations.

Baidu gathered about 7,000 hours of data on people speaking conversationally, and then synthesized a total of roughly 100,000 hours by fusing those files with files containing background noise. The noise came from restaurants, televisions, cafeterias, and the insides of cars and trains. By contrast, the Hub5’00 dataset includes a total of 2,300 hours.
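The fusing step amounts to scaling a noise clip to a target signal-to-noise ratio and adding it to the clean recording. A minimal sketch of that idea, assuming audio as NumPy sample arrays (the paper's exact mixing procedure may differ, and the function name here is invented for illustration):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Overlay background noise on clean speech at a target SNR in decibels.

    The noise clip is tiled or truncated to the speech length, then scaled
    so the speech-to-noise power ratio matches snr_db.
    """
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(1)
speech = np.sin(2 * np.pi * 220 * np.linspace(0, 1, 16000))  # 1 s stand-in "speech" tone
noise = rng.normal(size=8000)                                 # a shorter noise clip
noisy = mix_at_snr(speech, noise, snr_db=10.0)
```

Mixing each clean recording against many noise sources at varied levels is how roughly 7,000 hours of conversation can be stretched into about 100,000 hours of noisy training data.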

“This is a vast amount of data,” said Ng. “… Most systems wouldn’t know what to do with that much speech data.”

Another big improvement, he said, came from using an end-to-end deep learning model on that huge dataset rather than using a standard, and computationally expensive, type of acoustic model. Traditional approaches will break recognition down into multiple steps, including one called speaker adaption, Ng explained, but “we just feed our algorithm a lot of data” and rely on it to learn everything it needs to. Accuracy aside, the Baidu approach also resulted in a dramatically reduced code base, he added.
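As a rough illustration of what "end-to-end" means here: a single recurrent network maps acoustic feature frames directly to per-frame character probabilities, with no separate acoustic-model or speaker-adaptation stages in between. The toy forward pass below is only a sketch with made-up dimensions; Deep Speech's real architecture and training procedure are far more involved.

```python
import numpy as np

def rnn_char_probs(features, Wx, Wh, Wy, bh, by):
    """Forward pass of a minimal vanilla RNN: feature frames in,
    per-frame character probability distributions out."""
    h = np.zeros(Wh.shape[0])
    probs = []
    for x in features:                      # one acoustic frame per time step
        h = np.tanh(Wx @ x + Wh @ h + bh)   # recurrent hidden state
        logits = Wy @ h + by
        e = np.exp(logits - logits.max())   # numerically stable softmax
        probs.append(e / e.sum())
    return np.array(probs)

rng = np.random.default_rng(0)
n_feat, n_hidden, n_chars, T = 13, 16, 29, 50   # illustrative sizes only
params = (rng.normal(scale=0.1, size=(n_hidden, n_feat)),
          rng.normal(scale=0.1, size=(n_hidden, n_hidden)),
          rng.normal(scale=0.1, size=(n_chars, n_hidden)),
          np.zeros(n_hidden), np.zeros(n_chars))
frames = rng.normal(size=(T, n_feat))
P = rnn_char_probs(frames, *params)
```

The appeal Ng describes is that everything between audio features and characters is learned from data inside one model, which is also why the code base shrinks: the hand-built intermediate stages simply go away.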

You can hear Ng talk more about Baidu’s work in deep learning in this Gigaom Future of AI talk embedded below. That event also included a talk from Google speech recognition engineer Johan Schalkwyk. Deep learning will also play a prominent role at our upcoming Structure Data conference, where speakers from [company]Facebook[/company], [company]Yahoo[/company] and elsewhere will discuss how they do it and how it impacts their businesses.

[youtube=http://www.youtube.com/watch?v=MPiUQrJ9tGc&w=640&h=360]

Baidu is trying to speed up image search using FPGAs

Chinese search engine Baidu is trying to speed up its deep learning models for image search using field programmable gate arrays, or FPGAs, made by Altera. Baidu has been experimenting with FPGAs for a while (including with Altera rival Xilinx’s gear) as a means of boosting performance on its convolutional neural networks without having to go whole hog down the GPU route. FPGAs are likely most applicable in production data centers, where they can be paired with existing CPUs to serve queries, while GPUs can still power much of the behind-the-scenes training of deep learning models.

Viki brings its video catalog to China’s Baidu

Viki, the international streaming video service that recently got acquired by Japan’s Rakuten, just took a big step into the Chinese market: Viki is supplying Baidu with movies and TV shows from a variety of countries including South Korea, Japan, the U.K. and the U.S., which will be available with ads on Baidu’s site as well as through the company’s mobile apps. U.S. fare will come from the Cartoon Network and other Turner channels. Viki CEO and co-founder Razmig Hovaghimian told me recently that global expansion is a big part of his plans for 2014.