Wit.AI co-founder Alex Lebrun has just one small dream: He wants to make artificially intelligent personalities available to every device we own. But even artificial intelligence systems have to crawl before they can walk.
Any plan to create a system like Joaquin Phoenix’s love interest in the movie Her “has to be grounded in reality,” Lebrun says. Right now, reality is not smooth dialog with a smartphone-based avatar that understands us and the world around us. A more realistic scenario for the next couple of years is a TV that turns down its volume when you ask it to, because your speech has been converted into machine-readable JSON.
“If you want to teach something to an AI,” Lebrun explained, “it first has to understand simple voice commands.” (In a recent Ask Me Anything on Reddit, Facebook AI director Yann LeCun also weighed in on Her, writing that “Something like the intelligent agent in ‘Her’ is totally out of reach of current technology.”)
Scaling speech recognition means embracing machine learning
Lebrun is no neophyte when it comes to interactions between humans and computers. He (along with fellow Wit employee Laurent Landowski) previously founded an online customer service company called VirtuOz that Nuance bought in 2013. So he’s trying not to get ahead of himself, and that goes for the technology as well as the business model. VirtuOz provided a virtual customer service agent for websites, and because the vocabulary at Symantec, for example, doesn’t mean a whole lot at Nestle, each deployment cost the customer about $100,000 and took months of hard-coding so the system knew everything it needed to know for that individual company.
In trying to take this type of technology mainstream, neither the cost nor the static language model would work. That’s why Lebrun, Landowski and co-founder and CTO Willy Blandin decided to do things differently with Wit. It’s delivered as a free API that developers can use to build voice-command capabilities into their connected devices. Because it’s a cloud-based service, Wit can use machine learning to expand its knowledge base with each developer who adds commands to the system, rather than forcing everyone to hardwire their own sets of words and actions.
Currently, Wit has signed up about 3,500 developers, mostly in the world of connected devices and the internet of things. It was working with Nest before the Google acquisition (it’s not anymore) and is working with SmartThings and various devices with which its connected-home hub interacts. Ideally, someone sitting in his lounge chair will be able to say “Turn the temperature to 75 degrees” or “Set an alarm for 8:15 a.m.,” and the appropriate device will recognize the command, send it to the Wit servers for processing, and then perform the command when it receives its JSON instructions.
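To make that last step concrete, here is a minimal Python sketch of what a device might do once its JSON instructions arrive. It is an illustration under stated assumptions, not Wit’s actual response format: the field names `intent` and `entities`, the `Thermostat` class and the `dispatch` function are all hypothetical, and the network call to the Wit servers is left out entirely.

```python
import json

# Hypothetical response body. The article says only that the device receives
# "JSON instructions"; the field names below are assumptions for illustration.
SAMPLE_RESPONSE = '{"intent": "set_temperature", "entities": {"degrees": 75}}'


class Thermostat:
    """Stand-in for a connected device (hypothetical)."""

    def __init__(self):
        self.target_degrees = None

    def set_temperature(self, degrees):
        # Record the requested setting and report what was done.
        self.target_degrees = degrees
        return f"temperature set to {degrees}"


def dispatch(device, raw_json):
    """Parse a JSON instruction and call the matching device method."""
    message = json.loads(raw_json)
    # Map each known command name to the device method that carries it out.
    handlers = {"set_temperature": device.set_temperature}
    handler = handlers.get(message.get("intent"))
    if handler is None:
        return None  # unrecognized command: do nothing
    return handler(**message.get("entities", {}))


thermostat = Thermostat()
print(dispatch(thermostat, SAMPLE_RESPONSE))
```

The point of the sketch is the division of labor the article describes: the hard part (turning speech into structured data) happens on the server, so the device itself only needs a small lookup from command names to actions.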
Because it’s dealing with a relatively small vocabulary (there are only so many connected devices right now and, therefore, only so many commands), Wit is able to hit accuracy rates of up to 95 percent in some situations. “By connecting all these dots, we have a very good coverage of language,” Lebrun explained.
Provided it sticks to a discernible set of rules, he added, “If you invent your own language … it will work.”
To Siri and beyond
If voice commands are the first step toward a fully realized AI experience, the next step is a version of Apple’s Siri that works better (Microsoft would argue its Cortana virtual assistant fits this bill) and that’s omnipresent. Talking to a phone isn’t a particularly intuitive experience (using the keyboard is probably easier, Lebrun suggested) but engaging in some sort of dialog with other devices or appliances might feel a lot more natural. Lebrun thinks it will be at least three years before Wit’s API can enable full dialog between people and their devices.
Getting to something closer to what’s presented in Her? He thinks that’s 10 or 20 years away, maybe more. Lebrun applauds the current advances in fields such as deep learning, but, he said, “It’s still just 1 percent of AI.”
There are technical hurdles (adequately connecting various APIs or other systems for speech, language and vision, for example) as well as the difficulty of teaching systems to perceive things beyond pattern recognition. They’ll need to be able to figure out that tables or lamps, for example, don’t always look like tables or lamps. They’ll need to go beyond mere recognition and begin to understand what it means for people or objects to move from Point A to Point B, and to predict the future based on all the sensory experiences they’ve already ingested.
You know, like humans do.
Humans might need to make a few adjustments, too — including recognizing that no matter how good an AI system is or what they’ve been promised, it’s still a machine. Lebrun said about 25 percent of people who interacted with a VirtuOz agent thought it was a person — and that technology was relatively rudimentary. Many people felt obliged to type “Thank you” at the end of a chat session; about 15 percent tried to go off-topic and pick up “female” agents.
Amid all this talk about artificial intelligence, that last bit of info might actually be a comforting thought to some people. The more things change, the more they do, indeed, stay the same.