Baidu claims deep learning breakthrough with Deep Speech

Chinese search engine giant Baidu says it has developed a speech recognition system, called Deep Speech, the likes of which have never been seen, especially in noisy environments. In restaurants and other loud settings where commercial speech recognition systems tend to fail, the deep learning model proved accurate nearly 81 percent of the time.

That might not sound too great, but consider the alternative: commercial speech-recognition APIs against which Deep Speech was tested, including those for Microsoft Bing, Google and Wit.AI, topped out at nearly 65 percent accuracy in noisy environments. Those results probably underestimate the difference in accuracy, said Baidu Chief Scientist Andrew Ng, who worked on Deep Speech along with colleagues at the company’s artificial intelligence lab in Palo Alto, California, because his team could only compare accuracy where the other systems all returned results rather than empty strings.
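To see why that restriction likely understates the gap, here is a minimal sketch of scoring several systems only on the utterances where every one of them returned a non-empty transcript. It is an illustration, not Baidu's evaluation code: the function names, the dictionary layout and the use of word-level edit distance as the accuracy measure are all assumptions.

```python
def word_errors(reference, hypothesis):
    """Word-level edit distance between a reference and a hypothesis transcript."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # word missing from hypothesis
                           cur[j - 1] + 1,           # extra word in hypothesis
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def compare_on_common(references, outputs):
    """Score every system on the utterances all systems transcribed.

    references: {utterance_id: reference text}            (hypothetical layout)
    outputs:    {system_name: {utterance_id: transcript}} (hypothetical layout)
    """
    # Keep only utterances for which no system returned an empty string.
    common = [u for u in references
              if all(outputs[s].get(u, "").strip() for s in outputs)]
    if not common:
        return common, {}
    total_words = sum(len(references[u].split()) for u in common)
    scores = {}
    for s in outputs:
        errors = sum(word_errors(references[u], outputs[s][u]) for u in common)
        scores[s] = 1 - errors / total_words  # word accuracy on the shared subset
    return common, scores
```

Any utterance on which a system gives up and returns nothing drops out of the comparison for every system, so the hardest clips, where a noise-robust model might shine most, are never counted.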


Ng said that while the research is still just research for now, Baidu is definitely considering integrating it into its speech-recognition software for smartphones and connected devices such as Baidu Eye. The company is also working on an Amazon Echo-like home appliance called CoolBox, and even a smart bike.

“Some of the applications we already know about would be much more awesome if speech worked in noisy environments,” Ng said.

Deep Speech also outperformed, by about 9 percent, top academic speech-recognition models on a popular dataset called Hub5’00. The system is based on a recurrent neural network, a type of model often used for speech recognition and text analysis. Ng credits much of the success to Baidu’s massive GPU-based deep learning infrastructure, as well as to the novel way the team built up a training set of 100,000 hours of speech data for training the system on noisy situations.

Baidu gathered about 7,000 hours of data on people speaking conversationally, and then synthesized a total of roughly 100,000 hours by fusing those files with files containing background noise. That was noise from a restaurant, a television, a cafeteria, and the inside of a car and a train. By contrast, the Hub5’00 dataset includes a total of 2,300 hours.
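Going from 7,000 to roughly 100,000 hours implies on the order of 14 noisy variants of each recording. The sketch below shows one common way to do that kind of fusing: overlaying a noise track on a clean utterance at a chosen signal-to-noise ratio. It is only an illustration under assumptions. The article does not say how Baidu mixed its files, and the function names, the SNR values and the number of copies here are hypothetical.

```python
import math
import random


def rms(samples):
    """Root-mean-square amplitude of a sequence of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_with_noise(speech, noise, snr_db):
    """Overlay a noise track on a speech track at a target SNR in decibels.

    Assumes both inputs are non-empty lists of floats and the noise is not silent.
    """
    # Loop the noise so that it covers the whole utterance.
    repeats = len(speech) // len(noise) + 1
    noise = (noise * repeats)[:len(speech)]
    # Scale the noise so that speech RMS / noise RMS matches the requested SNR.
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_rms / rms(noise)
    return [s + gain * n for s, n in zip(speech, noise)]


def augment(utterances, noise_tracks, snr_choices=(0, 5, 10, 15), copies=3, seed=0):
    """Produce `copies` noisy variants of every clean utterance."""
    rng = random.Random(seed)
    noisy = []
    for speech in utterances:
        for _ in range(copies):
            noise = rng.choice(noise_tracks)
            noisy.append(mix_with_noise(speech, noise, rng.choice(snr_choices)))
    return noisy
```

Because each clean recording can be paired with any noise track at any level, a modest amount of transcribed speech stretches into a far larger training set without any new transcription work.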

“This is a vast amount of data,” said Ng. “… Most systems wouldn’t know what to do with that much speech data.”

Another big improvement, he said, came from using an end-to-end deep learning model on that huge dataset rather than a standard, and computationally expensive, type of acoustic model. Traditional approaches break recognition down into multiple steps, including one called speaker adaptation, Ng explained, but “we just feed our algorithm a lot of data” and rely on it to learn everything it needs to. Accuracy aside, the Baidu approach also resulted in a dramatically reduced code base, he added.

You can hear Ng talk more about Baidu’s work in deep learning in this Gigaom Future of AI talk embedded below. That event also included a talk from Google speech recognition engineer Johan Schalkwyk. Deep learning will also play a prominent role at our upcoming Structure Data conference, where speakers from Facebook, Yahoo and elsewhere will discuss how they do it and how it impacts their businesses.


IBM bringing its skin-cancer computer vision system to hospitals

IBM says it has developed a machine learning system that identified images of skin cancer with better than 95 percent accuracy in experiments, and it’s now teaming up with doctors to see how it can help them do the same. On Wednesday, the company announced a partnership with Memorial Sloan Kettering, one of IBM’s early partners on its Watson system, to research how the computer vision technology might be applied in medical settings.

According to one study, cited in the IBM infographic below, diagnostic accuracy for skin cancer today is estimated at between 75 percent and 84 percent even with computer assistance. If IBM’s research results hold up in the real world, they would constitute a significant improvement.

As noted above, the skin cancer research is not IBM’s first foray into applying machine learning and artificial intelligence techniques (which it prefers to call cognitive computing) in the health care setting. In fact, the company announced earlier this week a partnership with the Department of Veterans Affairs to investigate the utility of the IBM Watson system for analyzing medical records.

And IBM is certainly not the first institution to think about how advances in computer vision could be used to diagnose disease. Two startups, Enlitic and Butterfly Network, recently launched with the goal of improving diagnostics using deep learning algorithms, and the application of machine learning to medical imagery has been, and continues to be, the subject of numerous academic studies.

We will be discussing the state of the art in machine learning, and computer vision specifically, at our Structure Data conference in March with speakers from IBM, Facebook, Yahoo, Stanford and Qualcomm, among others.

What we read about deep learning is just the tip of the iceberg

The artificial intelligence technique known as deep learning is white hot right now, as we have noted numerous times before. It’s powering many of the advances in computer vision, voice recognition and text analysis at companies including Google, Facebook, Microsoft and Baidu, and has been the technological foundation of many startups (some of which were acquired before even releasing a product). As far as machine learning goes, these public successes receive a lot of media attention.

But they’re only the public face of a field that appears to be growing like mad beneath the surface. So much research is happening at places that are not large web companies, and even most of the large web companies’ work goes unreported. Big breakthroughs and ImageNet records get the attention, but there’s progress being made all the time.

Just recently, for example, Google’s DeepMind team reported on initial efforts to build algorithm-creating systems that it calls “Neural Turing Machines”; Facebook showed off a “generic” 3D feature for analyzing videos; and Microsoft researchers concluded that quantum computing could prove a boon for certain types of deep learning algorithms.

We’ll talk more about some of these efforts at our Structure Data conference in March, where speakers include a senior researcher from Facebook’s AI lab, as well as prominent AI and robotics researchers from labs at Stanford and MIT.


But anyone who really needs to know what’s happening in deep learning was probably at the Neural Information Processing Systems, or NIPS, conference that happened last week in Montreal, Quebec. It’s a long-running conference that’s increasingly dominated by deep learning. Of the 411 papers accepted to this year’s conference, 46 of them included the word “deep” among their 100 most-used words (according to a topic model by Stanford Ph.D. student Andrej Karpathy). That doubles last year’s number of 23, which itself was 65 percent more than the 15 in 2012.
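As a rough illustration of the kind of count behind those figures, the sketch below checks whether a given word appears among a paper's most-used words and tallies how many papers qualify. It is not Karpathy's topic model: the tokenization, the lack of stop-word filtering and the function names are assumptions made for the example.

```python
import re
from collections import Counter


def has_word_in_top(text, word="deep", top_n=100):
    """Return True if `word` is among the `top_n` most frequent words in `text`."""
    words = re.findall(r"[a-z]+", text.lower())
    top = {w for w, _ in Counter(words).most_common(top_n)}
    return word in top


def count_papers(papers, word="deep", top_n=100):
    """Count how many paper texts have `word` among their most-used words."""
    return sum(has_word_in_top(p, word, top_n) for p in papers)
```

Run over the full text of each accepted paper, a tally like this gives a crude but repeatable measure of how much of a conference is devoted to one topic from year to year.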

At the separate deep learning workshop co-located with the NIPS conference, the number of poster presentations this year shot up to 47 from last year’s 28. While some of the bigger research breakthroughs presented at NIPS have already been written about (e.g., the combination of two types of neural networks to automatically produce image captions, research on which Karpathy worked), other potentially important work goes largely unnoticed by the general public.

Yoshua Bengio — a University of Montreal researcher well known in deep learning circles, who has so far resisted the glamour of corporate research labs — and his team appear very busy. Bengio is listed as a coauthor on five of this year’s NIPS papers, and another seven at the workshop, but his name doesn’t often come up in stories about Skype Translate or Facebook trying to check users posting drunken photos.


In a recent TEDx talk, Enlitic CEO Jeremy Howard talked about advances in translation and medical imaging that have flown largely under the radar, and also showed off how software like the stuff his company is building could help doctors train computers to classify medical images in just minutes.

The point here is not just to say, “Wow! Look how much research is happening.” Nor is it to warn of an impending AI takeover of humanity. It’s just a heads-up that there’s a lot going on underneath the surface that goes largely underreported by the press, but of which certain types of people should try to keep abreast nonetheless.

Lawmakers, national security agents, ethicists and economists (Howard touches on the economy in that TEDx talk and elaborates in a recent Reddit Ask Me Anything session) need to be aware of what’s happening and what’s possible if our foundational institutions are going to be prepared for the effects of machine intelligence, however it’s defined. (In another field of AI research, Paul Allen is pumping money into projects that are trying to give computers actual knowledge.)

Some example results of Stanford's system. Source: Andrej Karpathy and Li Fei-Fei / Stanford

Results of Karpathy’s research on image captions. Beyond automating that process, imagine combing through millions of unlabeled images to learn about what’s happening in them.

But CEOs, product designers and other business types also need to be aware. We’re seeing a glut of companies claiming they can analyze the heck out of images, text and databases, and others delivering capabilities such as voice interaction and voice search as a service. Even research firm IDC is predicting video, audio and image analytics “will at least triple in 2015 and emerge as the key driver for [big data and analytics] technology investment.”

Smart companies investing in these technologies will see deep learning as much more than a way to automatically tag images for search or analyze sentiment. They’ll see it as a way to learn a whole lot more about their businesses and the customers buying their products.

In deep learning, especially, we’re talking about a field where operational systems exist, techniques are being democratized rapidly and research appears to be increasing exponentially. It’s not just a computer science project anymore; two and a half years later, jokes about Google’s cat-recognizing computers already seem dated.

With $8M and star team, MetaMind does deep learning for enterprise

A Palo Alto startup called MetaMind launched on Friday promising to help enterprises use deep learning to analyze their images, text and other data. The company has raised $8 million from Khosla Ventures and Marc Benioff, and Khosla operating partner and CTO Sven Strohband is its co-founder and CEO. He’s joined by co-founder and CTO Richard Socher, a frequently published researcher, and a small team of other data scientists.

Natural language processing expert Chris Manning of Stanford and Yoshua Bengio of the University of Montreal, considered one of the handful of deep learning masters, are MetaMind’s advisers.

Rather than trying to help companies deploy and train their own deep neural networks and artificial intelligence systems, as some other startups are doing, MetaMind is providing simple interfaces for predetermined tasks. Strohband thinks a lot of users will ultimately care less about the technology underneath and more about what it can do for them.

“I think people, in the end, are trying to solve a problem,” he said.

Sven Strohband (second from left) at Structure Data 2014.

Right now, there are several tools (what the company calls “smart modules”) for computer vision, including image classification, localization and segmentation, as well as for language. The latter, where much of Socher’s research has focused, includes modules for text classification, sentiment analysis and question-answering, among other things. (MetaMind incorporates a faster, more accurate version of the etcML text-analysis service that Socher helped create while pursuing a Ph.D. at Stanford.)

During a briefing on MetaMind, Socher demonstrated a capability that merges language and vision and that’s similar, inversely, to a spate of recent work from Google, Stanford and elsewhere around automatically generating detailed captions for images. When he typed in phrases such as “birds on water” or “horse with bald man,” the application surfaced pictures fitting those descriptions and even clustered them based on how similar they are.

Testing out MetaMind's sentiment analysis for Twitter.

Socher and Strohband claim MetaMind’s accuracy in language and vision tasks is comparable to, if not better than, previous systems that have won competitions in those fields. Where applicable, the company’s website shows these comparisons.

MetaMind is also working on modules for reasoning over databases, claiming the ability to automatically fill in missing values and predict column headings. Demo versions of several of these features are available on the company’s website, including a couple that let users import their own text or images and train their own classifiers. Socher calls this “drag-and-drop deep learning.”

The bare image-training interface.

On the surface, the MetaMind service seems similar to those of a couple of other deep-learning-based startups, including computer-vision specialist Clarifai but especially AlchemyAPI, which is rapidly expanding its collection of services. If there’s a big difference on the product side right now, it’s that AlchemyAPI has been around for years and has a fairly standard API-based cloud service, and a business model that seems to work for it.

After being trained on five pictures of chocolate chip cookies and five pictures of oatmeal raisin cookies, I tested it on this one.

MetaMind is only four months old, but Strohband said the company plans to keep expanding its capabilities and become a general-purpose artificial intelligence platform. It intends to make money by licensing its modules to enterprise users along with commercial support. However, it does offer some free tools and an API in order to get the technology in front of a lot of users to gin up excitement and learn from what they’re doing.

“Making these tools so easy to use will open up a lot of interesting use cases,” Socher said.

Asked about the prospect of acquiring skilled researchers and engineers in a field where hiring is notoriously difficult (and in a geography, Palo Alto, where companies like Google and Facebook are stockpiling AI experts), Socher suggested it’s not quite as hard as it might seem. Companies like MetaMind just need to look a little outside the box.

“If [someone is] incredibly good at applied math programming … I can teach that person a lot about deep learning in a very short amount of time,” he said.

He thinks another important element, if MetaMind is to be successful, will be for him to continue doing his own research so the company can develop its own techniques and remain on the cutting edge. That’s increasingly difficult in the world of deep learning and neural network research, where large companies are spending hundreds of millions of dollars, universities are doubling down and new papers are published seemingly daily.

“If you rest a little on your laurels here,” Strohband said, “this field moves so fast [you’ll get left behind].”

Deep learning might help you get an ultrasound at Walgreens

A new startup called Butterfly Network, from genomic-technology pioneer Jonathan Rothberg, hopes to improve the world of medical imaging using advanced chip technologies, tablet devices and deep learning. Rothberg explains how and why deep learning is key to the company’s plans.

On Reddit, Geoff Hinton talks Google and future of deep learning

University of Toronto researcher and part-time Google distinguished researcher Geoff Hinton is responsible for many recent advances in deep learning, and many advances in neural network research over the past few decades. Here are some highlights of a recent Reddit AMA with Hinton.