Chinese search engine company Baidu says it has built the world’s most-accurate computer vision system, dubbed Deep Image, which runs on a supercomputer optimized for deep learning algorithms. Baidu claims a 5.98 percent error rate on the ImageNet object classification benchmark; a team from Google won the 2014 ImageNet competition with a 6.66 percent error rate.
In experiments, humans achieved an estimated error rate of 5.1 percent on the ImageNet dataset.
The star of Deep Image is almost certainly the supercomputer, called Minwa, which Baidu built to house the system. Deep learning researchers have long (well, for the past few years) used GPUs in order to handle the computational intensity of training their models. In fact, the Deep Image research paper cites a study showing that 12 GPUs in a 3-machine cluster can rival the performance of the performance of the 1,000-node CPU cluster behind the famous Google Brain project, on which Baidu Chief Scientist Andrew Ng worked.
But no one has yet built a system like this dedicated to the task of computer vision using deep learning. Here’s how paper author Ren Wu, a distinguished scientist at the Baidu Institute of Deep Learning, describes its specifications:
[blockquote person=”” attribution=””]It is comprised of 36 server nodes, each with 2 six-core Intel Xeon E5-2620 processors. Each sever contains 4 Nvidia Tesla K40m GPUs and one FDR InfiniBand (56Gb/s) which is a high-performance low-latency interconnection and supports RDMA. The peak single precision floating point performance of each GPU is 4.29TFlops and each GPU has 12GB of memory.
… In total, Minwa has 6.9TB host memory, 1.7TB device memory, and about 0.6 [petaflops] theoretical single precision peak performance.[/blockquote]
Sheer performance aside, Baidu built Minwa to help overcome problems associated with the types of algorithms on which Deep Image was trained. “Given the properties of stochastic gradient decent algorithms, it is desired to have very high bandwidth and ultra low latency interconnects to minimize the communication costs, which is needed for the distributed version of the algorithm,” the authors wrote.
Having such a powerful system also allowed the researchers to work with different, and arguably better, training data than most other deep learning projects. Rather than using the 256 x 256-pixel images commonly used, Baidu used higher-resolution images (512 x 512 pixels) and augmented them with various effects such as color-casting, vignetting and lens distortion. The goal was to let the system take in more features of smaller objects and to learn what objects look like without being thrown off by editing choices, lighting situations or other extraneous factors.
Baidu is investing heavily in deep learning, and Deep Image follows up a speech-recognition system called Deep Speech that the company made public in December. As executives there have noted before, including Ng at our recent Future of AI event in September, the company already sees a relatively high percentage of voice and image searches and expects that number to increase. The better its products can perform with real-world data (research datasets tend to be fairly optimal), the better the user experience will be.
However, Baidu do is far from the only company — especially on the web — investing significant resources into deep learning and getting impressive results. Google, which still holds the ImageNet record in the actual competition, is probably the company most associated with deep learning and this week unveiled new Google Translate features that likely utilize the technology. Microsoft and Facebook also have very well-respected deep learning researchers and continue to do cutting-edge research in the space while releasing products that use that research.
Our Structure Data conference, which takes place in March, will include deep learning and machine learning experts from many organizations, including Facebook, Yahoo, NASA, IBM, Enlitic and Spotify.