Voices in AI – Episode 38: A Conversation with Carolina Galleguillos

In this episode Byron and Carolina discuss computer vision, machine learning, biology and more.
Byron Reese: This is Voices in AI brought to you by Gigaom, I’m Byron Reese. Today our guest is Carolina Galleguillos. She’s an expert in machine learning and computer vision. She did her undergrad work in Chile and has a master’s and PhD in Computer Science from UC San Diego. She’s presently a machine learning engineer at Thumbtack. Welcome to the show.
Carolina Galleguillos: Thank you. Thank you for having me.
So, let’s start at the very beginning with definitions. What exactly is “artificial” about artificial intelligence?
Well, I read somewhere that artificial intelligence is basically trying to make machines think, which is very “sci-fi,” I think, but what I’m trying to say here is we’re trying to automate a lot of different tasks that humans do. We have done that before, in the Industrial Revolution, but now we’re trying to do it with computers and with interfaces that look more human-like. We also have robots with computers inside. I think that’s more of the artificial part. As for the intelligence, we’ll see how intelligent these machines become in time.
Alan Turing asked the question, “Can a machine think?” Do you think a machine can think, or will a machine be able to think?
I think we’re really far from that. The brain is a really, really complex thing. I think that we can approximate the thinking of a machine to be able to follow certain rules, or learn patterns that seem more like common sense, but at the end of the day, it won’t think autonomously, I think. We’re really far from that.
I want to get into computer vision here in just a minute.
But I’m really fascinated by this, because that’s a pretty low bar. If you say it’s using machines to do things people do, then a calculator is an artificial intelligence in that view. Would you agree with that?
Well, not really, because a calculator is just executing commands.
But isn’t that what a computer program does?
Yeah, it does. But I would say that in machine learning, you don’t need to program those rules. The program will infer the rules by seeing data. So you’re not explicitly writing down the rules in that program and that’s what makes it different from a calculator.
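The distinction Carolina draws here, a program that infers its rules from data rather than having them written down, can be sketched with a tiny learner. This is an illustrative example, not anything discussed in the episode: the toy "cat vs. dog" measurements and the nearest-centroid method are assumptions chosen for brevity.

```python
# A minimal sketch of rule inference from data: nobody writes
# "small and light -> cat"; the rule emerges from labeled examples.

def train_nearest_centroid(examples):
    """Learn one centroid (average feature vector) per label."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc]
            for label, acc in sums.items()}

def predict(centroids, features):
    """Assign the label whose centroid is closest (squared distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(features, c))
    return min(centroids, key=lambda label: dist(centroids[label]))

# Hypothetical data: [weight_kg, height_dm] pairs with labels.
data = [([4.0, 1.0], "cat"), ([5.0, 2.0], "cat"),
        ([30.0, 25.0], "dog"), ([28.0, 20.0], "dog")]
model = train_nearest_centroid(data)
print(predict(model, [6.0, 1.5]))  # prints "cat"
```

A calculator, by contrast, would need the classification rule spelled out explicitly; here the program only ever sees examples.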
Humans do something really kind of cool. You show a human an object, like a little statue of some figure, and then you show them a hundred pictures, and they can tell what that figure is—even if it’s upside down, if it’s underwater, if the lighting changes, if they only see half of it. We’re really far away from being able to do that with a machine, correct?
Well, it depends; I think it always depends. We can do very well now in certain conditions, but we are far from—I’m not saying super far—doing it when you don’t have all the information, I would say.
How do humans do that? Is it that we’re really good at transfer learning or what do you think we’re doing?
Well, yes, transfer learning, but also a lot of it is about context. I think that the brain is able to store so many different connections—millions and millions of connections, it has so much experience—and that information goes into recognizing objects. It’s very implicit. A person cannot recognize something they’ve never seen before, but if that person has context about what it should be, they would be able to find it. So I think that’s the main point.
If I took you into a museum gallery, and there was a giant wall with two hundred paintings on it—they’re all well-known paintings, they’re all realistic and all of that—and I hang one of them upside down, a human notices that pretty quickly. But a computer doesn’t. A computer uses the same kind of laborious algorithm to try to figure out which painting is upside down, but a human just spots it right away. What do you think is going on there?
I think that what’s going on is probably the fact that we have context about what we usually face. We usually see paintings that are straight, that point up, so we are really quick to identify when things are not the way we expect them to be. A computer doesn’t have that knowledge, so it starts from a clean slate.
What would giving them that kind of context look like? I mean, if I just said, “Here’s 100,000 paintings and they’re all right side up. Now, quick, glance at that wall and tell me which one’s upside down,” the computer wouldn’t necessarily be able to do it right away, would it? What kind of context do we have that they don’t—what paintings look like right side up, or what reality looks like right side up? What do you think?
Well, if there are objects in that painting, the computer will probably also be able to say that it’s upside down. Now, if it’s a very modern piece, I don’t think a human could figure out if it’s upside down or not either. I think that’s the key to the problem. If it’s basically a bunch of colors, I wouldn’t be able to say it’s upside down. But if it is the painting of a lady, the face of a woman, I would be very quick to spot that the painting is upside down. And I think a computer could do that too, because you can train a computer to identify faces. When that face is upside down, it would be able to say that, too.
It’s interesting, because if you were an artist who drew fantastic landscapes of science-fiction worlds, and you showed people different ones, somebody could point at one and say that’s not very realistic, but that one is. But in reality, of course, they’re alien planets. It’s because we have a really deep knowledge about things, like the shapes of biological forms and the effects of gravity—just this really intuitive sense of what “looks right” and what doesn’t. What are the steps you go through to get a computer to have that kind of natural understanding of reality?
That’s a good question. I think, as part of recognizing objects—let’s say that’s our main task—we try to also give more information about how these objects are presented in reality. So, you can have algorithms that encode the spatial information of objects—usually you’re going to find the sky above and the grass down below, and usually you won’t find a car on top of a building, and all that. So you can actually train an algorithm that surfaces those patterns, and then, when you show it something that is different, it’s going to make those assumptions, and one of the outcomes is that it might not recognize objects correctly, because those objects are not in the context that the algorithm was trained on.
And do you think that’s what humans are doing, that we just have this knowledge base? Because I could totally imagine somebody looking at these alien landscape paintings and saying, “That one doesn’t look right,” and then they say, “Well, why doesn’t it look right?” and it’s like, “I don’t know. It just doesn’t look right.” Is it that there’s some much deeper level that humans are able to understand things, that wouldn’t necessarily be able to be explicitly programmed, or is that not the case?
I think there’s a belief in machine learning, and now especially with deep learning, that if you have enough data (say, millions and millions of examples), those patterns will surface. You don’t have to explicitly put them there. And then those images—let’s say we’re doing computer vision—will encode those rules; like, the relative sizes of a car and a person are mostly going to stay the same, even though you see them at different distances.
I know that as humans, because we have so much experience and information, we can make those claims when we see something that seems odd. At the same time, we can have algorithms that—if you have enough data to get those patterns surfaced—could also be able to spot that. I think that it’s happening more and more in areas like medicine, when you want to find cancer. So they’re trying to leverage those algorithms to be able to detect those anomalies.
How much do you think about biology, and how humans recognize things, when you’re considering how to build a system that recognizes things? Are they close analogs, or is it just trivia that we both happen to recognize things, and we’re going to do it in such radically different ways that there’s not really much we can learn from the brain?
This is a very hot topic, I’d say, in the community. There’s definitely a lot of machine learning that is inspired by the brain, or by biology. And so, people are trying to build architectures that simulate the way the brain works, or how the eyes process information. I think they do that in order to understand how the brain works, so that they can then go the other way around and create algorithms that emulate the brain, because I think that would be extremely hard to do directly.
When I build machine learning systems, either computer vision or just generic machine learning systems, I usually am not inspired by biology, because I’m usually trying to focus on very specific tasks. If I were to be inspired by the brain, I would have to take a lot of different things into account in my algorithm, which sometimes just needs to do something very smart but very focused, whereas the brain actually tries to take into account a lot of different inputs. So that’s how I usually approach the work I do.
Humans have a lot of cognitive biases, ways in which our brains don’t quite work; they appear to be bugs. For instance, we over-see patterns; I guess we overfit. You can look up at a cloud and see a dog. And the thesis goes that, long ago, it was far better to mistake a rock for a bear and run away than to mistake the bear for a rock and get eaten.
Do you think that when we build computer systems that can recognize objects, are they going to have our cognitive biases because we’re coding them? Or, are they going to have their own ones that we can’t really predict? Or, will they be kind of free of bias because they’re just trained off the data?
I think it depends. Basically, I think it depends on how you are going to build that system. If you do it by being inspired by the brain, you might actually put your own bias into it, because you might say, well, this is a rock and this is a bear, and bears and rocks show up together on certain occasions. Now, if you let the data speak for itself, by showing examples to the algorithms, then the machine, or the computer, will just make its own judgment about that, without any bias. You can always bias the data as well, but that’s a different problem. Let’s say we take all the images in the world where all the objects appear; then we usually will pick up very general patterns, and if rocks usually look like bears, then the machine might make those mistakes pretty easily.
I guess the challenge is that every photograph is an editorial decision of what to photograph and so every photograph reflects a human’s bias. So even if you had a hundred million photos, you’re still instantiating some kind of bias. 
So, people have this ability…we change focus. You look at something, and then a bear walks in the room, and you’re like, “Oh, my gosh! A bear walked in the room!” and then somebody yells, “Fire! Fire!” and you turn over to see where the fire is. So we’re always switching from thing to thing, and that seems to be a feature associated with our consciousness, that it allows us to switch. Does the fact that the computer is not embodied, it doesn’t have a form, and it doesn’t have consciousness, is that an inherent limitation to what it’s going to be able to see and recognize?
Yes. I think so. I mean, once again, if the computer doesn’t have any extra sensors, it wouldn’t even realize what’s going on, apart from the task that it’s actually executing. But let’s say that computer has a camera, it also has a tactile device, and many other things, then you’re starting to enable a little bit more context to that computer, or that program. I mean, if those events occur once in a while, then it would be able to react, or say something about it.
If you think about it, photographs that we use generally are the visible spectrum of light that humans are able to see, but that’s just a tiny fraction. Are there pattern recognition or image recognition efforts underway that are using a full spectrum of light and color? So they show infrared, ultraviolet…
Yes. Definitely. Yes.
Can you give an example of that? I find that fascinating.
Well, a very good example is self-driving cars. They have infrared cameras. Those could potentially give you an idea that, “There is a body there; there is something there that is not animated,” so you don’t hit it when you’re driving. So, definitely, it’s not just photographs. For MRIs, and all medical imaging, basically, you use all the information you can get.
Our audience is familiar, broadly, with machine learning, but can you talk a little more specifically about how that works? Conceptually, it’s, “Here’s a million photos of a cat; learn what a cat looks like, and try to find the cat in these photos.” But peel that onion back one layer. How does that actually work, especially in a neural net situation?
Yeah, you basically tell the computer, “there’s a cat in here.” For every single image, you’ll say, “there’s a cat in here.” Sometimes you even label the contours of the cat, or maybe just a rectangle around it, to differentiate the actual foreground from the background. What the computer is going to do at the first level is very low-level operations, which means it’s going to start finding edges, connected components, all at a very low, granular level. So it starts finding patterns at that level, basically; that’s the first stage. And the deeper the network is—let’s say it’s a neural network—the higher-level these patterns get; they build up more and more. So the representation of a cat starts from the very low level, then you start getting things like paws, and ears, and eyes, until you actually get a full cat at the end. That’s what the layers of this neural network encode. So when you have a new picture where there’s not a cat—maybe there is a person—it’s going to try to find those patterns. And it’s going to be able to say, “Well, in this area of this image, there’s no cat, because I don’t see all those patterns coming up the way I see them when a cat is present.”
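The first stage Carolina describes, low-level edge finding, can be sketched as a convolution with a small filter. The 4×4 image and the Sobel-style kernel below are illustrative assumptions; in a real network, the kernel values are learned from the labeled images rather than hand-written.

```python
# A sketch of the "finding edges" stage of a vision network:
# slide a small kernel over the image and record its response.

def convolve2d(image, kernel):
    """Valid-mode 2D convolution (no padding); returns a smaller map."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(len(image) - kh + 1):
        row = []
        for c in range(len(image[0]) - kw + 1):
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out

# A vertical-edge kernel: responds where brightness jumps left-to-right.
vertical_edge = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]

# Dark left half, bright right half: the edge runs down the middle.
image = [[0, 0, 10, 10],
         [0, 0, 10, 10],
         [0, 0, 10, 10],
         [0, 0, 10, 10]]

feature_map = convolve2d(image, vertical_edge)
print(feature_map)  # -> [[30, 30], [30, 30]]
```

Deeper layers then apply the same operation to these feature maps, which is how responses to edges compose into responses to paws, ears, and eventually whole cats.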
And what are the inherent limitations of that approach, or is that the be-all and end-all of image recognition? Give it enough images, it will figure out how to recognize them almost flawlessly?
There are always limitations. There are objects that are easier to recognize than others. Now we have made amazing progress, even recognizing different types of dogs and different types of cats, which is amazing, but there are always constraints: lighting conditions, the quality of the image, things like that. You know, some dogs can look like cats, so we can’t do anything about that. We always have constraints. I think that algorithms are not perfect, but depending on what we’re trying to use them for, they can get very accurate.
The same techniques are used not just for training on images, but for making credit decisions or hiring decisions or identifying illnesses—it’s all basically the same approach, correct?
What do you think of the efforts being considered in Europe that dictate that you have a right to know why an algorithm suggested what it did? How do you reconcile that? For instance, you have denied a person’s mortgage application, and that person says, “Why?” and you say, “The neural net said so.” And, of course, that person wants to know, “Well, why did it say so?” And it’s like, “Well, that’s a pretty hard question to answer.” How do you solve that, or how do you balance that? Because as we get better with neural nets, they’re only going to get more obfuscated and convoluted and nuanced, right?
I think that with harder problems, like, say, computer vision, it’s really hard to say what triggers a certain outcome. But luckily, you can still come up with algorithms that are simpler to train, but also simpler to interpret, so you can figure out the main features that are triggering certain outcomes. And then you’ll be able to say to that person, if you pay your credit cards, your score will improve and we’ll be able to give you a mortgage.
I think that’s the trade-off, right? I think it’s always task-dependent. There is a lot of hype around deep learning and neural networks. Sometimes you just need somewhat simpler algorithms. They are still very accurate, but they can actually give you insights about your predictions, and about the data you are looking at, and then you can actually build a better product. If your aim is to be extremely complete, or to solve a task that is very difficult, then you’re going to have to deal with the fact that there are a lot of things you won’t know about the data, and about why the outcome of that algorithm came about.
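The interpretable alternative described above can be sketched as a tiny logistic regression: unlike a deep net, its per-feature weights say *why* an application scored the way it did. The two feature names and the four applicants are invented for illustration; a real credit model would use many more features and a vetted training procedure.

```python
import math

def train_logreg(rows, labels, lr=0.5, epochs=2000):
    """Stochastic gradient descent on logistic loss; returns (weights, bias)."""
    n = len(rows[0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            # Sigmoid of the linear score gives the approval probability.
            p = 1 / (1 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
            err = p - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

# Hypothetical applicants: [on_time_payment_rate, debt_ratio]; 1 = approved.
rows = [[0.9, 0.2], [0.8, 0.3], [0.3, 0.8], [0.2, 0.9]]
labels = [1, 1, 0, 0]
w, b = train_logreg(rows, labels)

# A positive weight helps approval, a negative one hurts it: an explanation
# you can actually give the applicant ("pay on time and your odds improve").
for name, weight in zip(["on_time_payment_rate", "debt_ratio"], w):
    print(f"{name}: {weight:+.2f}")
```

This is the trade-off in miniature: the model is less expressive than a deep network, but every prediction decomposes into named, signed contributions.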
Pedro Domingos wrote a book called The Master Algorithm where he said there are five different tribes, where he kind of divides all of that up. You have your symbolists, and you have your Bayesians, and so forth; and he posits that there must exist a master algorithm, a single general-purpose algorithm that would be able to solve a wide range of problems, in theory, all problems that are solvable that way. Do you think such a thing exists?
I don’t think it exists now. Given the fact that deep learning has been extremely useful across different types of tasks—going from computer vision to even, like, music or signal processing, and things like that—there might be an algorithm that can help with a lot of different tasks, like a master algorithm, if you want to call it that. But it will always be modified in some way to fit the actual problem that you have. Because these algorithms are very complex, sometimes you actually need to know why the outcome is the outcome you’re getting. So, yes, I think that algorithm might exist at some point. I don’t think it exists now. Deep learning, maybe, is one of the frameworks—because it’s not an algorithm but more like a framework, or an architecture—that is helping us make accurate predictions in different areas. But at the end of the day, we want to know why, because it will affect a lot of different people.
One argument for the existence of such a thing—and that it may not be very much code—is human DNA, which is, of course, the instruction set to build a general intelligence. And the part of the DNA that makes you different from creatures that aren’t intelligent is very tiny. And so the argument goes that somehow a very little bit of code gives you the instructions to build us, and we’re intelligent, so there may be an analog in the computer world: just a small amount of code that can build something as versatile as a human. What do you think of that analogy?
Yeah, that’s mind blowing. That would be really cool if that happened, but at the same time, very scary. I never really thought about that before.
Do you think we’re going to build an artificial general intelligence? Will we build a computer as smart and versatile as a human?
This is a very personal answer: humans are social beings, and the only way I can see this happening is if we’re alone and we need something like a human to be with us. Hopefully, we’re very far from that future, but in the actual present, I don’t think that’s something we aim to do.
I think it’s also more about, like, figuring out humanity itself: understanding why we came to be the way we are, why people are violent, why people are peaceful, why people are happy, or why people are sad. And maybe the best way of understanding that is basically reconstructing a human brain, and maybe extending that brain to have arms and become a robot. But I don’t think that would be the actual goal. It’s more a way to understand humanity.
I also don’t think it would be a way of executing tasks. We always see in sci-fi movies that robots do things that humans don’t want to do, but they wouldn’t need to be humanoids. I was looking to buy a Roomba yesterday, and that could possibly be a robot that cleans, doing something that I don’t want to do, but I don’t consider it artificial intelligence, or a smart being. So I think it is in some way possible, but I don’t think the end goal is to build something like a human.
Certainly not for a lot of things, but in some parts of the world, that is a real, widespread goal. The idea being that in places where you have aging populations, and you have lonely people, you want robot companions that have faces you recognize, that display emotions and can listen to your stories, chuckle at your jokes, recognize your jokes, all of that. What are your thoughts on that? In those cases, some people are trying to build artificial humans, aren’t they?
But they won’t be complete humans, right? They will be machines that are very good at solving certain tasks: recognizing your voice, recognizing that you’re telling a joke, or being able to say things that make you feel better. But I don’t think that they are artificial humans, because that’s a very complex thing. That robot that is helping a senior person, so that person isn’t alone, won’t be able to do the other, much more complex tasks that a human can do.
I think it’s all about being very specific, to solve very specific tasks. And I think robots in Japan are doing that. I mean, we have smart assistants, right? And they are very good at understanding what you’re trying to say, so they can execute a command, but I don’t think of them as another “human” that is trying to understand me, or actually know who I am.
I don’t know if you saw that Tom Hanks movie from years ago called Cast Away, but his only companion is this volleyball that he named Wilson, and then there’s a point where Wilson is floating off and he’s like, “Wilson!” And he risks his life to save Wilson. And then you look at how attached people get to their pets and their animals. And so, you can imagine, if you just extend that line on the graph, how people might feel towards robots that really do look and act human. There’s no doubt that people will develop strong emotions for them.
Yes, I agree with that.
So, it’s interesting, you’re talking about these digital systems, and some vendors choose to name them. Apple has Siri, Amazon has Alexa, Microsoft has Cortana, but Google, interestingly, doesn’t personify theirs. It’s the Google Assistant. Why do you think, and not necessarily those specific four cases, but why do you think sometimes they’re personified and sometimes they aren’t? What does that say about us, or them, or how we want to interact with them, or something like that?
That is very interesting, because sometimes when I have my son with me I’ll ask Alexa to play some music. Having a name makes it feel like it’s part of your family, and probably my son will wonder who this Alexa person is that I’m always asking to play 80s pop music. But it definitely makes you feel that it’s not awkward, that your interaction with the machine is smooth and it’s just not an execution, right? It’s part of your environment.
I think that’s what these companies are going for when they put a real name—it’s not a very common person’s name, but still a name that you could say. Alexa, probably, has a female voice because that’s, sort of, the gender that they’re aiming to represent. With respect to Google, I think maybe they want to see it in a more task-driven way. I don’t know. It could be many things.
I think I read that Alexa may have come from—in addition to having the hard “x,” which makes it sound distinctive—an oblique reference to The Library of Alexandria, way back in ancient times. 
Whenever my alarm goes off, I’ll say, “Alexa, set me a five-minute timer”—which, luckily, it didn’t hear just now—but when the timer goes off, I go, “Alexa, stop,” and it feels rude to me. Like, I don’t talk to people that way, and, therefore, it’s jarring to me. So, in a way, I prefer not having it personified, because then I don’t have that conflict. But from what you just said about your child, they may not grow up having any of those sorts of mixed feelings about these things. What do you think?
Yeah, I think that’s true. Sometimes, I feel the same way you do when you say, “stop,” and it feels like a very commanding way to speak. For the next generation, you have iPads, and computers are almost an old thing; it’s all about new interfaces. It’s definitely going to shape the way that people communicate with machines and products. It’s very hard for me to know how that’s going to be, but it’s going to be very natural—the way that interactions will happen with websites, products, gadgets, things like that.
I think that the fact that Google is still the “Google Assistant” also has to do with the fact that when you’re in a conversation, people don’t say Google a lot, right? So then you won’t trigger those devices to be listening all the time, which is another problem. But yeah, it’s very interesting. I always think about how the next generation is going to behave or how the experience is going to be for them, growing up with these devices.
The funny thing is, of course, because Google has become a verb, you could imagine a future in a hundred years or two hundred years when the company no longer exists, but we still use the word, and people are like, “I wonder why we say ‘google’? Where did that come from?” 
This is, in a sense, a personal question, but do you think a computer could ever feel anything? So, for example, you could put a sensor on a computer that detects temperature, and you can program the computer to play a wav file of a person screaming if it ever hits five hundred degrees, but that’s a different thing than actually feeling pain. Do you think it’s possible for a machine to feel, or is that something that’s purely biological, or purely related to life, or some other aspect of us?
I think that for a human to be able to feel something, that aspect of humanity, is such a complex thing. We know from biology that it’s mostly our nerves perceiving pain. They’re perceiving things and then sending that signal to the brain, and the brain is trying to interpret that information into something.
If you want to be very analytical about it, then you could possibly have a computer that feels pain, like you said: something that gives input to the computer, which goes through the processor, and the processor will infer a rule and say, “this is pain.” But I don’t think it can do it in the way that we, as humans, perceive it. It’s such a complex thing.
But in the end, we have a self. We have a self that experiences the world.
Can a computer have a self and can a computer, therefore experience the world, as opposed to just coldly sense it?
I think it’s really hard, unless you can build a computer with cells and things that are more common to a human, which would be a really interesting thing. Personally, I don’t think that is possible, because even pain, like we’re talking about, is very different for everyone, because it’s mostly shaped by experience, right? A computer can store a lot of information, but there’s much more to it than that signal; the way we interpret that data is what makes humans so interesting.
Humans, we have brains, and our brains do all the things, but we also have a theory of something called a “mind,” which, you know, “are you out of your mind?” And I guess we think of the mind as all the stuff we don’t really understand how just a bunch of neurons can do, like creativity, emotions, and all of that. In that movie, I, Robot, when Spooner, the Will Smith character, is talking to Sonny, the robot, he says, “Can you paint a painting? Can you write a symphony?” And of course, Sonny says, “Well, can you?” But the point being that all of those things are things we associate with the “mind.” Do you think computers will ever be able to write beautiful symphonies, and bestselling novels, and blockbuster movies, and all of that? And if so, is machine learning a path to that? How would you ever get enough movie scripts, or even books, or even stories to train it?
That’s interesting. I actually read that there is a movie whose script was written by a machine learning algorithm, and they made a movie out of it. Now, is it good? I don’t know. So, it’s definitely possible. I think that, per se, computers cannot be creative. In my experience, they’re basically looking at patterns of things that people find funny, or exciting, or that make them feel things.
You can say, “This song is very pleasing because it’s very slow and romantic and relaxing,” and then a computer could just take all those songs that are tagged that way and come up with a new song that has those specific patterns, that make that song relaxing, or pleasing, right? And you could say, “Yes, they are being creative,” because they created something new from something old, from patterns and previous examples. So, in that case, it’s happening already, or a lot of people are trying to make it happen.
Now, you could also argue that artists are the same way. They have their idols, and they somehow are going to try to take those things they like from their heroes, and incorporate them in their own work, and then they become creative, and they have their own creations, or their own art. A computer can actually do the same process.
I think humans are able to capture even more than a computer could ever capture, because a human is doing something for other humans, so they can actually understand the things that move people, or make people feel sad or happy. Computers can also just catch the patterns that, for certain people and certain data, produce those emotions, but they will never feel those emotions like humans do.
There’s a lot of fear wrapped up in artificial intelligence and machine learning with regards to automation. There are three broad beliefs. One is that we’re going to soon enter a period where there are people without enough education or training to do certain jobs, and you’re going to have kind of a permanent Great Depression of twenty to twenty-five percent unemployment. Another group of people believes that eventually machines can do everything a person can do, and we’re all out of work. And then there’s a group of people who say, look, every time we get new technology, even things as fundamental as electricity and steam, or replacing animals with machines, unemployment never goes up. People just use these new technologies to increase their productivity, and therefore their standard of living. Which of those three camps, or a fourth one, do you find yourself sympathetic to?
I think definitely the third one. I agree. A really good example: my dad studied technical drafting for architecture, and then computer programs came along that did that, and he didn’t have a job. He did it by hand, and then computers could do it easily. But then he decided that he was really good at sales, and that’s where his career started to develop. You know, you need to be personable, you need to be able to talk to people, engage them, sell them things, right?
I think that, in general, we are going to make people develop new skills they never thought they had. We are going to make them more efficient. For example, at Thumbtack, we’re empowering professionals to do the things that they’re really good at, and, for me personally, that means using machine learning to optimize processes so they can stay focused on the things they love doing.
I don’t really like the fact that people say that AI, or machine learning, will take people’s jobs. I think we have to see it as a new wave of optimized processes that will actually give us more time to spend with our families or develop skills that we always thought would be interesting, or actually things that we love to do and we can make a job out of it. We can support our families by doing the things that we love, instead of being stuck in an office doing things that are super automatic, that you don’t put your heart, or even your mind to it. Let’s leave that to machines to automate it, and let’s just do something that makes our life better.
You mentioned Thumbtack, where you head up machine learning. Can you tell us a little bit about that? What excited you about that? What the mission of the company is, for people who aren’t familiar with it, and where you’re at in your life cycle?
So, Thumbtack is a marketplace where people like you and I can go and find a pro that’s going to do the right project for you. What’s really exciting is the fact that you don’t have to go to a listing, and call different places, and ask them “Are you interested?” When you go to Thumbtack, you put a request and only the pros, which are super qualified, that are interested will contact you back with a quote, and with information to tell you, “I’m ready to help you to get your project done.” And that’s it.
It’s amazing that it’s 2017, and finding a plumber to fix your toilet, or even a DJ because you’re getting married, all those things are so hard. And what I really like about working at Thumbtack is that we are making that super easy for our customers. And we are empowering pros to be good at what they do, to not have to worry about putting out flyers or putting up a website and spending all that time on marketing, and to instead spend that time helping people with their projects and building their business.
It’s such a complex problem, but at the same time, it has such a good outcome for everyone, which is one of the things that attracted me. And also the fact that we’re a startup, and startups are always a hard road, because we’re trying to disrupt a market that’s been untouched forever. I think that’s a super challenging problem as well, and being part of that is super exciting.
It’s true. It wasn’t that long ago when you needed a plumber, and you opened up the Yellow Pages, and you just saw how they were able to put a number of As in front of their names, AAA Plumbing, AAAA Plumbing, but that was how we figured things out. 
So, tell me a kind of machine learning challenge, a real day-to-day one that you have to deal with; what data do you use to solve what problem in that as you outlined it?
There are many different things. For some things, like automating some tasks that can make our team more productive, machine learning helps you to do that. For example, making sure that they can curate content. We get a lot of photos and reviews and things like that from our customers, and also content from our professionals, and we want to make sure that we’re showing all the things that are good for our customers, or surface information that is very relevant for them, when they’re looking to hire a professional.
There are also things, like, using information on our marketplace to enhance the experience of our users when they come to Thumbtack, and be able to recommend them another category, like, say they put a request for a DJ, and maybe if they are having a party they might also want a cleaning person the next day, right? Things like that. So, machine learning has always helped there to be able to use a lot of the data that we get from our marketplace, and make our product better.
All right. We’re nearing the end. I do have two questions for you. Do you enjoy any science fiction—like books or movies or any of that—and, if so, is there anything you’ve seen that you look at and think, yes, I could see the future unfolding that way, yes, that could really happen?
Yes, I definitely like science fiction. Foundation is one of the books that I really like.
Of course. That’s been one that’s resisted being able to be made into a movie, although I hear there’s one in the works for it, but that’s such a big project.
Yeah, I enjoy any type of science fiction, in general. I think it’s so interesting how humans see the future, right? It’s so creative. At the same time, I don’t particularly agree with any of those movies, and things like that. There are a lot of movies in Hollywood, too, where computers or robots become bad and they kill people.
I don’t think that’s the future we’ll see with machine learning. I think that we’ll be able to disrupt a lot of areas, and the one I’m most excited about is medicine, because that can really change the game in humanity by being able to accurately diagnose people with very few resources. In so many places in the world where there are no doctors, to be able to take a picture, or send a sample of something and having algorithms that can help doctors to get to that diagnosis quickly; that’s going to change the way that the world is today.
Gene Roddenberry, the creator of Star Trek, said, “In the future, there would be no hunger, and there would be no greed, and all the children would know how to read.” What do you think of that? Or, a broader question, because you are in the vanguard of this technology, you’re building these technologies that everybody reads about.  Are you optimistic about the future? How do you think it’s all going to turn out?
Actually, it feels like a renaissance in some ways. After a renaissance, some big shift in culture, there are always new creative things happening. In the past, there were painters who revolutionized art by coming up with new ways of being creative, of painting. So, my view of the future is that, yes, a lot of the basic needs of humans might be satisfied, which is great. Mortality probably is going to be very low. But also there is the opportunity for us to have enough time to be creative again, and think about new ways of living. Because we have that foundation, people will be able to think long-term and be more wild about new ideas. I think that’s mostly how I see it.
That’s a great place to end it. I want to thank you so much for taking the time. It was a fascinating hour. Have a good day.
Sure. Thank you. Thank you for having me.
Byron explores issues around artificial intelligence and conscious computers in his upcoming book The Fourth Age, to be published in April by Atria, an imprint of Simon & Schuster. Pre-order a copy here.

Computers Are Opening Their Eyes — and They’re Already Better at Seeing Than We Are

For the past several decades we’ve been teaching computers to understand the visual world. And like everything in artificial intelligence these days, computer vision is making rapid strides. So much so that it’s starting to beat us at ‘name that object.’
Every year the ImageNet project runs a competition testing the current capability of computers to identify objects in photographs. And in 2015, they hit a milestone…
Microsoft reported a 4.94% error rate for its vision system, compared with a 5.1% error rate for humans.
While that doesn’t quite give computers the ability to do everything that human vision can (yet), it does mean that computer vision is ready for prime time. In fact, computer vision is very good — and lightning fast — at narrow tasks. Tasks like:

  • Social listening: Track buzz about your brand and products in images posted to social media
  • Visual auditing: Remotely monitor for damage, defects or regulatory compliance in a fleet of trucks, planes, or windmills
  • Insurance: Quickly process claims by instantly classifying new submissions into different categories
  • Manufacturing: Ensure components are being positioned correctly on an assembly line
  • Social commerce: Use an image of a food dish to find out which restaurant serves it, or use a travel photo to find vacation suggestions based on similar experiences, or find similar homes for sale
  • Retail: Find stores with similar clothes in stock or on sale, or use a travel image to find retail suggestions in that area

This is a game-changer for business. An A.I.-powered tool that can digitize the visual world can add value to a wide range of business processes — from marketing to security to fleet management.

Unlocking Data in Visual Content

So here’s a step-by-step guide for building a powerful image recognition service — powered by IBM Watson — and capable of facial recognition, age estimation, object identification, etc.
The application wrapped around this service (originally developed by IBM’s Watson Developer Cloud) is preconfigured to identify objects, faces, text, scenes and other contexts in images.

A quick example.

And by the way… here’s what Watson found in our featured image above:

Classes / Score:
  • bass (musical instrument)
  • musical instrument
  • orange color

Type Hierarchy:
  • /device/bass (musical instrument)

Faces / Score:
  • age 18 – 24

Not too shabby. Watson correctly identified the image as a person with a guitar. It also found the face, which is pretty impressive. But it was unsure about the guitarist’s age and gender.
Personally, I would guess our guitarist is a woman based on the longer hair and fingernails. And no doubt Watson will be able to pick up those subtle clues as well in the near future.

Note: The “Score” is a numerical representation (0-1) of how confident the system is in a particular classification. The higher the number, the higher the confidence.

A.I.-Powered Vision

Using an artificial intelligence platform to instantly translate the things we see into written common language is like having an army of experts continuously reviewing and describing your images.
Allowing you to quickly — and accurately — organize visual information. Turning piles of images — or video frames — into useful data for your business. Data that can then be acted upon, shared or stored.
What will you learn from your visual data?
Let’s find out…

If you’d like to preview the source code, here’s our fork of the application on GitHub.

The End Result

The steps in this guide will create an application similar to the following…

You can also preview a live version of this application. The major features are:

  • Object determination — Classifies things in the image
  • Text extraction — Extracts text displayed in the image
  • Face detection — Detects human faces, including an estimation of age & gender
  • Celebrity identifier — Names the person if your image includes a public figure (when a face is found)

And this is just the beginning: this application can be extended in many different ways — it’s only limited by your imagination.

How it works.

Here’s a quick diagram of the major components…

The application uses just one cloud-based service from IBM Watson: Visual Recognition.

Note: Most of the following steps can be accomplished through command line or point-and-click. To keep it as visual as possible, this guide focuses on point-and-click. But the source code also includes command line scripts if that’s your preference.

What You’ll Need

Before we create the service instance and application container, let’s get the system requirements knocked out.

Download the source repository.

To start, go ahead and download the source files.
Note: You’ll need a git client installed on your computer for this step.
Simply move to the directory you want to use for this demo and run the following commands in a terminal…

# Download source repository
git clone https://github.com/10xNation/ibm-watson-visual-recognition.git
cd ibm-watson-visual-recognition

At this point, you can keep the terminal window open and set it aside for now…we’ll need it in a later step.

Name the application.

Right away, let’s nail down a name for your new image recognition app.

...
  # Application name
  - name: xxxxxxxxxxxxxxx
...

Replace xxxxxxxxxxxxxxx in the manifest.yml file with a globally unique name for your instance of the application.
The name you choose will be used to create the application’s URL — eg. http://visual-recognition-12345.mybluemix.net/.

Create a Bluemix account.

Go to the Bluemix Dashboard page (Bluemix is IBM’s cloud platform).

If you don’t already have one, create a Bluemix account by clicking on the “Sign Up” button and completing the registration process.

Install Cloud Foundry.

A few of the steps in this guide require a command line session, so you’ll need to install the Cloud Foundry CLI tool. This toolkit allows you to interact more easily with Bluemix.

Open a terminal session with Bluemix.

Once the Cloud Foundry CLI tool is installed, you’ll be able to log into Bluemix through the terminal.

# Log into Bluemix
cf api https://api.ng.bluemix.net
cf login -u YOUR_BLUEMIX_ID -p YOUR_BLUEMIX_PASSWORD

Replace YOUR_BLUEMIX_ID and YOUR_BLUEMIX_PASSWORD with the respective username and password you created above.

Step 1: Create the Application Container

Go to the Bluemix Dashboard page.

Then on the next page, click on the “Create App” button to add a new application.

In this demo, we’ll be using a Node application, so click on “SDK for Node.js.”

Then fill out the information required, using the application name you chose in What You’ll Need — and hit “Create.”

Set the application memory.

Before we move on, let’s give the application a little more memory to work with.

Click on your application.

Then click on the “plus” sign for “MB MEMORY PER INSTANCE” — set it to 512 MB — and hit “Save.”

Step 2: Create the Visual Recognition Instance

To set up your Visual Recognition service, jump back to the Bluemix Dashboard page.

Click on your application again.

And that should take you to the Overview tab for your application. And since this is a brand new application, you should see a “Create new” button in the Connections widget — click that button.

You should now see a long list of services. Click “Watson” in the Categories filter and then click on “Visual Recognition” to create an instance of that service.

Go ahead and choose a Service Name that makes sense for you — eg. Visual Recognition-Demo. For this demo, the “Free” Pricing Plan will do just fine. And by default, you should see your application’s name listed in the “Connected to” field.
Click the “Create” button when ready. And enter the Name and Pricing Plan you chose into the manifest.yml file…

...
  # Visual Recognition
  Visual Recognition-Demo:
    label: watson_vision_combined
    plan: free
...
- services:
  - Visual Recognition-Demo
...

If needed, replace both instances of Visual Recognition-Demo with your Service Name and free with your chosen Pricing Plan.
Feel free to “Restage” your application when prompted.

Enter service credentials.

After your Visual Recognition instance is created, click on the respective “View credentials” button.

And that will pop up a modal with your details.

Copy/paste your API key into the respective portion of your .env file.

# Environment variables
VISUAL_RECOGNITION_API_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Replace xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx with the key listed for api_key.
Your Visual Recognition service is now ready. So let’s fire this thing up!

Step 3: Launch It

To bring the application to life, simply run the following command — making sure the terminal is still in the repository directory and logged into Bluemix…

cf push

This command will upload all the needed files, configure the settings — and start the application.
Note: You can use the same cf push command to update the same application after it’s originally published.

Take a look.

After the application has started, you’ll be able to open it in your browser at the respective URL.

The page should look something like this…

Play around with it and get a feel for the functionality.
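If you’d rather skip the browser, the same classification the app performs can be exercised directly against the Visual Recognition API. Here’s a minimal sketch that just assembles the request URL — the API key and image URL are placeholders, and the version date mirrors the one used later in this guide:

```shell
# Assemble a v3 classify request (API key and image URL are placeholders)
API_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
IMAGE_URL="https://example.com/photo.jpg"
REQUEST="https://gateway-a.watsonplatform.net/visual-recognition/api/v3/classify?api_key=${API_KEY}&url=${IMAGE_URL}&version=2017-05-06"
echo "$REQUEST"
# curl -X GET "$REQUEST"   # uncomment to send it; the response is JSON with classes and scores
```

The JSON that comes back carries the same classes and scores you see rendered in the app’s interface.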

Custom classifier.

The application also supports a custom classifier, which allows you to customize the type of objects the system can identify within your images.
To check it out, click on the “Train” button.

The “Free” pricing plan only supports one custom classifier. So if you want to test multiple versions, you’ll need to delete the previous one. And you can do that by deleting and recreating the Visual Recognition service — step #2 above. Or you can modify the existing service using the following command…

Note: You’ll need the curl command installed for this.

# Get classifier ID
curl -X GET "https://gateway-a.watsonplatform.net/visual-recognition/api/v3/classifiers/?api_key=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx&version=2017-05-06"

# Remove existing custom classifier
curl -X DELETE "https://gateway-a.watsonplatform.net/visual-recognition/api/v3/classifiers/xxxxxxxxxxxx?api_key=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx&version=2017-05-06"

Replace 2017-05-06 with the date you created the classifier, xxxxxxxxxxxx with the classifier_id returned from the first command, and both instances of xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx with the Visual Recognition service API key you retrieved in step #2.
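For completeness, creating a brand-new custom classifier is a POST to the same classifiers endpoint, with zipped example images attached. A hedged sketch: the zip file names, classifier name and API key below are all placeholders.

```shell
# Create a custom classifier from zipped example images (all values are placeholders)
API_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
CREATE_URL="https://gateway-a.watsonplatform.net/visual-recognition/api/v3/classifiers?api_key=${API_KEY}&version=2017-05-06"
echo "POST ${CREATE_URL}"
# curl -X POST \
#   -F "good_positive_examples=@good-examples.zip" \
#   -F "negative_examples=@bad-examples.zip" \
#   -F "name=my-classifier" \
#   "$CREATE_URL"
```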


If you’re having any problems with the application, be sure to check out the logs…

Just click on the “Logs” tab within your application page.
And that’s pretty much the end of the road. You’re now a computer vision pro!

Take it to the Next Level

Feel like you’re ready to give your applications and devices the power of sight? The sky’s the limit for how and where you apply this technology.
And under the current pricing, you can classify 250 images/day for free. So there’s no reason not to jump right in.
You can dig deeper into the Visual Recognition service at the Watson Developer documentation.

Why you can’t program intelligent robots, but you can train them

If it feels like we’re in the midst of a robot renaissance right now, perhaps it’s because we are. There is a new crop of robots under development that we’ll soon be able to buy and install in our factories or interact with in our homes. And while they might look like robots past on the outside, their brains are actually much different.

Today’s robots aren’t rigid automatons built by a manufacturer solely to perform a single task faster and cheaper than humans and, ideally, without much input from them. Rather, today’s robots can be remarkably adaptable machines that not only learn from their experiences, but can even be designed to work hand in hand with human colleagues. Commercially available (or soon to be) technologies such as Jibo, Baxter and Amazon Echo are three well-known examples of what’s now possible, but they’re also just the beginning.

Different technological advances have spurred the development of smarter robots depending on where you look, although they all boil down to training. “It’s not that difficult to build the body of the robot,” said Eugene Izhikevich, founder and CEO of robotics startup Brain Corporation, “but the reason we don’t have that many robots in our homes taking care of us is it’s very difficult to program the robots.”

Essentially, we want robots that can perform more than one function, or perform one function very well. And it’s difficult to program a robot to do multiple things, or at least the things that users might want, and it’s especially difficult to program it to do those things in different settings. My house is different from your house, my factory is different from your factory.

A collection of RoboBrain concepts.


“The ability to handle variations is what enables these robots to go out into the world and actually be useful,” said Ashutosh Saxena, a Stanford University visiting professor and head of the RoboBrain project. (Saxena will be presenting on this topic at Gigaom’s Structure Data conference March 18 and 19 in New York, along with Julie Shah of MIT’s Interactive Robotics Group. Our Structure Intelligence conference, which focuses on the cutting edge in artificial intelligence, takes place in September in San Francisco.)

That’s where training comes into play. In some cases, particularly projects residing within universities and research centers, the internet has arguably been a driving force behind advances in creating robots that learn. That’s the case with RoboBrain, a collaboration among Stanford, Cornell and a few other universities that crawls the web with the goal of building a web-accessible knowledge graph for robots. RoboBrain’s researchers aren’t building robots, but rather a database of sorts (technically, more of a representation of concepts — what an egg looks like, how to make coffee or how to speak to humans, for example) that contains information robots might need in order to function within a home, factory or elsewhere.

RoboBrain encompasses a handful of different projects addressing different contexts and different types of knowledge, and the web provides an endless store of pictures, YouTube videos and other content that can teach RoboBrain what’s what and what’s possible. The “brain” is trained with examples of things it should recognize and tasks it should understand, as well as with reinforcement in the form of thumbs up and down when it posits a fact it has learned.

For example, one of its flagship projects, which Saxena started at Cornell, is called Tell Me Dave. In that project, researchers and crowdsourced helpers across the web train a robot to perform certain tasks by walking it through the necessary steps for tasks such as cooking ramen noodles.  In order for it to complete a task, the robot needs to know quite a bit: what each object it sees in the kitchen is, what functions it performs, how it operates and at which step it’s used in any given process. In the real world, it would need to be able to surface this knowledge upon, presumably, a user request spoken in natural language — “Make me ramen noodles.”

The Tell Me Dave workflow.


Multiply that by any number of tasks someone might actually want a robot to perform, and it’s easy to see why RoboBrain exists. Tell Me Dave can only learn so much, but because it’s accessing that collective knowledge base or “brain,” it should theoretically know things it hasn’t specifically trained on. Maybe how to paint a wall, for example, or that it should give human beings in the same room at least 18 inches of clearance.

There are now plenty of other examples of robots learning by example, often in lab environments or, in the case of some recent DARPA research using the aforementioned Baxter robot, watching YouTube videos about cooking (pictured above).

Advances in deep learning — the artificial intelligence technique du jour for machine-perception tasks such as computer vision, speech recognition and language understanding — also stand to expedite the training of robots. Deep learning algorithms trained on publicly available images, video and other media content can help robots recognize the objects they’re seeing or the words they’re hearing; Saxena said RoboBrain uses deep learning to train robots on proper techniques for moving and grasping objects.

The Brain Corporation platform.


However, there’s a different school of thought that says robots needn’t necessarily be as smart as RoboBrain wants to make them, so long as they can at least be trained to know right from wrong. That’s what Izhikevich and his aforementioned startup, Brain Corporation, are out to prove. It has built a specialized hardware and software platform, based on the idea of spiking neurons, that Izhikevich says can go inside any robot and “you can train your robot on different behaviors like you can train an animal.”

That is to say, for example, that a vacuum robot powered by the company’s operating system (called BrainOS) won’t be able to recognize that a cat is a cat, but it will be able to learn from its training that that object — whatever it is — is something to avoid while vacuuming. Conceivably, as long as they’re trained well enough on what’s normal in a given situation or what’s off limits, BrainOS-powered robots could be trained to follow certain objects or detect new objects or do lots of other things.

If there’s one big challenge to the notion of training robots versus just programming them, it’s that consumers or companies that use the robots will probably have to do a little work themselves. Izhikevich noted that the easiest model might be for BrainOS robots to be trained in the lab, and then have that knowledge turned into code that’s preinstalled in commercial versions. But if users want to personalize robots for their specific environments or uses, they’re probably going to have to train it.

Part of the training process with Canary. The next step is telling the camera what it’s seeing.


As the internet of things and smart devices, in general, catch on, consumers are already getting used to the idea — sometimes begrudgingly. Even when it’s something as simple as pressing a few buttons in an app, like training a Nest thermostat or a Canary security camera, training our devices can get tiresome. Even those of us who understand how the algorithms work can get annoyed.

“For most applications, I don’t think consumers want to do anything,” Izhikevich said. “You want to press the ‘on’ button and the robot does everything autonomously.”

But maybe three years from now, by which time Izhikevich predicts robots powered by Brain Corporation’s platform will be commercially available, consumers will have accepted one inherent tradeoff in this new era of artificial intelligence — that smart machines are, to use Izhikevich’s comparison, kind of like animals. Specifically, dogs: They can all bark and lick, but turning them into seeing eye dogs or K-9 cops, much less Lassie, is going to take a little work.

Microsoft is building fast, low-power neural networks with FPGAs

Microsoft on Monday released a white paper explaining a current effort to run convolutional neural networks — the deep learning technique responsible for record-setting computer vision algorithms — on FPGAs rather than GPUs.

Microsoft claims that new FPGA designs provide greatly improved processing speed over earlier versions while consuming a fraction of the power of GPUs. This type of work could represent a big shift in deep learning if it catches on, because for the past few years the field has been largely centered around GPUs as the computing architecture of choice.

If there’s a major caveat to Microsoft’s efforts, it might have to do with performance. While Microsoft’s research shows FPGAs consuming about one-tenth the power of high-end GPUs (25W compared with 235W), GPUs still process images at a much higher rate. Nvidia’s Tesla K40 GPU can do between 500 and 824 images per second on one popular benchmark dataset, the white paper claims, while Microsoft predicts its preferred FPGA chip — the Altera Arria 10 — will be able to process about 233 images per second on the same dataset.
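Those numbers are easier to compare on a performance-per-watt basis. A quick back-of-the-envelope calculation from the figures above, using the GPU’s best case:

```shell
# Images per second per watt, from the throughput and power figures quoted above
awk 'BEGIN {
  gpu  = 824 / 235   # Tesla K40: up to 824 img/s at 235 W
  fpga = 233 / 25    # Altera Arria 10 (predicted): 233 img/s at 25 W
  printf "GPU: %.1f img/s/W  FPGA: %.1f img/s/W\n", gpu, fpga
}'
# prints: GPU: 3.5 img/s/W  FPGA: 9.3 img/s/W
```

So even against the GPU’s best case, the FPGA comes out nearly three times more efficient per watt, which is the trade Microsoft is betting on.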

However, the paper’s authors note that performance per processor is relative because a multi-FPGA cluster could match a single GPU while still consuming much less power: “In the future, we anticipate further significant gains when mapping our design to newer FPGAs . . . and when combining a large number of FPGAs together to parallelize both evaluation and training.”

In a Microsoft Research blog post, processor architect Doug Burger wrote, “We expect great performance and efficiency gains from scaling our [convolutional neural network] engine to Arria 10, conservatively estimated at a throughput increase of 70% with comparable energy used.”


This is not Microsoft’s first rodeo when it comes to deploying FPGAs within its data centers, and in fact is a corollary of an earlier project. Last summer, the company detailed a research project called Catapult in which it was able to improve the speed and performance of Bing’s search-ranking algorithms by adding FPGA co-processors to each server in a rack. The company intends to port production Bing workloads onto the Catapult architecture later this year.

There have also been other attempts to port deep learning algorithms onto FPGAs, including one by State University of New York at Stony Brook professors and another by Chinese search giant Baidu. Ironically, Baidu chief scientist and deep learning expert Andrew Ng is a big proponent of GPUs, and the company claims a massive GPU-based deep learning system as well as a GPU-based supercomputer designed for computer vision. But this needn’t be an either/or situation: companies could still use GPUs to maximize performance while training their models, and then port them to FPGAs for production workloads.

Expect to hear more about the future of deep learning architectures and applications at Gigaom’s Structure Data conference March 18 and 19 in New York, which features experts from Facebook, Microsoft and elsewhere. Our Structure Intelligence conference, September 22-23 in San Francisco, will dive even deeper into deep learning, as well as the broader field of artificial intelligence algorithms and applications.

Why deep learning is at least inspired by biology, if not the brain

As deep learning continues gathering steam among researchers, entrepreneurs and the press, there’s a loud-and-getting-louder debate about whether its algorithms actually operate like the human brain does.

The comparison might not make much of a difference to developers who just want to build applications that can identify objects or predict the next word you’ll text, but it does make a difference. Researchers leery of another “AI winter” or trying to refute worries of a forthcoming artificial superintelligence worry that the brain analogy is setting people up for disappointment, if not undue stress. When people hear “brain,” they think about machines that can think like us.

On this week’s Structure Show podcast, we dove into the issue with Ahna Girshick, an accomplished neuroscientist, visual artist and senior data scientist at deep learning startup Enlitic. Girshick’s colleague, Enlitic founder and CEO (and former Kaggle chief scientist) Jeremy Howard, also joined us for what turned out to be a rather insightful discussion.

[soundcloud url=”https://api.soundcloud.com/tracks/190680894″ params=”secret_token=s-lutIw&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false” width=”100%” height=”166″ iframe=”true” /]

Download This Episode

Subscribe in iTunes

The Structure Show RSS Feed

Below are some of the highlights, focused on how Girshick and Howard view the brain analogy. (They take a different tack than Google researcher Greg Corrado, who recently called the analogy “officially overhyped.”) But we also talk at length about deep learning in general, and how Enlitic is using it to analyze medical images and hopefully help overcome a global shortage of doctors.

If you’re interested in hearing more from Girshick, Enlitic and deep learning, come to our Structure Data conference next month, where she’ll be accepting a startup award and joining me on stage for an in-depth talk about how artificial intelligence can improve the health care system. If you want two full days of all AI, all the time, start making plans for our Structure Intelligence conference in September.

Ahna Girshick, Enlitic's senior data scientist.


Natural patterns at work in deep learning systems

“It’s true, deep learning was inspired by how the human brain works,” Girshick said on the Structure Show, “but it’s definitely very different.”

Just like with our vision systems, deep learning systems for computer vision process stuff in layers, if you will. They start with edges and then get more abstract with each layer, focusing on faces or perhaps whole objects, she explained. “That said, our brain has many different types of neurons,” she added. “Everywhere we look in the brain we see diversity. In these artificial networks, every node is trying to basically do the same thing.”

This is why our brains are able to navigate a dynamic world and do many things, while deep learning systems are usually focused on one task with a clear objective. Still, Girshick said, “From a computer vision standpoint, you can learn so much by looking at the brain that why not.”

She explained some of these connections by discussing a research project she worked on at NYU:

“We were interested in, kind of, the statistics of the world around us, the visual world around us. And what that means is basically the patterns in the visual world around us. If you were to take a bunch of photos of the world and run some statistics on them, you’ll find some patterns — things like more horizontals than verticals. . . . And then we look inside the brain and we see, ‘Gee, wow, there’s all these neurons that are sensitive to edges and there’s more of them that are sensitive to horizontals than verticals!’ And then we measured . . . the behavioral response in a type of psychology experiment and we see, ‘Gee, people are biased to perceive things as more horizontal or more vertical than they actually are!'”
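Girshick’s point about image statistics is easy to check numerically: compare horizontal versus vertical edge energy via finite differences. This is a minimal sketch on a synthetic “landscape” image; the actual studies ran such statistics over large corpora of natural photographs.

```python
import numpy as np

def orientation_energy(image):
    """Compare horizontal vs. vertical structure via finite differences.

    Strong intensity changes *down the rows* (gy) indicate horizontal
    edges, and changes *across the columns* (gx) indicate vertical ones.
    """
    gy = np.diff(image, axis=0)  # change down the rows
    gx = np.diff(image, axis=1)  # change across the columns
    horizontal_edges = np.sum(gy[:, :-1] ** 2)
    vertical_edges = np.sum(gx[:-1, :] ** 2)
    return horizontal_edges, vertical_edges

# Toy "landscape": a horizon line splits sky from ground, so horizontal
# structure dominates, echoing the bias found in natural photos.
scene = np.zeros((50, 50))
scene[25:, :] = 1.0

h, v = orientation_energy(scene)
print(h > v)  # True: the horizon contributes only horizontal edge energy
```

Run over real photo collections, the same measurement shows the horizontal/vertical bias Girshick describes finding both in the image statistics and in the brain’s population of edge-sensitive neurons.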

Asked if computer vision has been such a big focus of deep learning research so far because of those biological parallels, or because that’s what companies such as Google and Facebook have the most need for, Girshick suggested it’s a bit of both. “It’s the same in the neuroscience department at a university,” she said. “The reason that people focus on vision is because a third of our cortex is devoted to vision — it’s a major chunk of our brain. . . . It’s also something that’s easier for us to think about, because we see it.”


Jeremy Howard (left) at Structure: Data 2012.

Howard noted that the team at Enlitic keeps finding more connections between Girshick’s research and the cutting edge of deep learning, and suggested that attempts to distance the two fields are sometimes insincere. “I think it’s kind of fashionable for people to say how deep learning is just math and these people who are saying ‘brain-like’ are crazy, but the truth is … it absolutely is inspired by the brain,” he said. “It’s a massive simplification, but we keep on finding more and more inspirations.”

The issue probably won’t be resolved any time soon — in part because it’s so easy for journalists and others to take the easy way out when explaining deep learning — but Girshick offered a solution.

“Maybe they should say ‘inspired by biology’ instead of ‘inspired by the brain,'” she said. “. . . Yes, deep learning is kind of amazing and very flexible compared to other generations of algorithms, but it’s not like the intelligent system I was studying when I studied the brain — at all.”

Microsoft says its new computer vision system can outperform humans

Microsoft researchers claim in a recently published paper that they have developed the first computer system capable of outperforming humans on a popular benchmark. While it’s estimated that humans can classify images in the ImageNet dataset with an error rate of 5.1 percent, Microsoft’s team said its deep-learning-based system achieved an error rate of only 4.94 percent.

Their paper was published less than a month after Baidu published a paper touting its record-setting system, which it claimed achieved an error rate of 5.98 percent using a homemade supercomputing architecture. The best performance in the actual ImageNet competition so far belongs to a team of Google researchers, who in 2014 built a deep learning system with a 6.66 percent error rate.
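All of the figures quoted here are top-5 error rates: a prediction counts as correct if the true label appears anywhere in the model’s five highest-scored guesses. A minimal illustration of the metric, using made-up labels and rankings:

```python
def top5_error(predictions, truths):
    """Fraction of examples whose true label is absent from the top-5 list.

    `predictions` is a list of ranked-label lists (best guess first);
    `truths` is the matching list of ground-truth labels.
    """
    misses = sum(1 for ranked, truth in zip(predictions, truths)
                 if truth not in ranked[:5])
    return misses / len(truths)

# Hypothetical ranked outputs for four images.
ranked_guesses = [
    ["letter opener", "can opener", "corkscrew", "hammer", "screwdriver"],
    ["sheep", "goat", "ram", "ox", "cow"],
    ["forklift", "tractor", "crane", "truck", "plow"],
    ["tabby", "tiger cat", "lynx", "Egyptian cat", "Persian cat"],
]
ground_truth = ["letter opener", "cow", "forklift", "Siamese cat"]

print(top5_error(ranked_guesses, ground_truth))  # 0.25: one miss in four
```

On this metric, the estimated human error rate on ImageNet is 5.1 percent, which is the bar Microsoft’s 4.94 percent result clears.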


A set of images that the Microsoft system classified correctly. “GT” means ground truth; below are the top five predictions of the deep learning system.

“To our knowledge, our result is the first published instance of surpassing humans on this visual recognition challenge,” the paper states. “On the negative side, our algorithm still makes mistakes in cases that are not difficult for humans, especially for those requiring context understanding or high-level knowledge…

“While our algorithm produces a superior result on this particular dataset, this does not indicate that machine vision outperforms human vision on object recognition in general . . . Nevertheless, we believe our results show the tremendous potential of machine algorithms to match human-level performance for many visual recognition tasks.”

A set of images where the deep learning system didn't match the given label, although it did correctly classify objects in the scene.


One of the Microsoft researchers, Jian Sun, explains the difference in plainer English in a Microsoft blog post: “Humans have no trouble distinguishing between a sheep and a cow. But computers are not perfect with these simple tasks. However, when it comes to distinguishing between different breeds of sheep, this is where computers outperform humans. The computer can be trained to look at the detail, texture, shape and context of the image and see distinctions that can’t be observed by humans.”

If you’re interested in learning how deep learning works, why it’s such a hot area right now and how it’s being applied commercially, think about attending our Structure Data conference, which takes place March 18 and 19 in New York. Speakers include deep learning and machine learning experts from Facebook, Yahoo, Microsoft, Spotify, Hampton Creek, Stanford and NASA, as well as startups Blue River Technology, Enlitic, MetaMind and TeraDeep.

We’ll dive even deeper into artificial intelligence at our Structure Intelligence conference (Sept. 22 and 23 in San Francisco), where early confirmed speakers come from Baidu, Microsoft, Numenta and NASA.

Industrial IoT startup Sight Machine raises $5M, expands to robots

Sight Machine, a startup trying to simplify the collection and analysis of industrial data, has raised a $5 million venture capital round from Mercury Fund, Michigan eLab, Huron River Ventures, Orfin Ventures and Funders Club, as well as its existing investors. The company originally focused on computer vision and letting users easily analyze images from assembly-line cameras, but has expanded its platform to include data from sensors, robots, and other industrial instruments and systems. Sight Machine was co-founded by Nathan Oostendorp, who also co-founded tech news site Slashdot.

PhotoTime is a deep learning application for the rest of us

A Sunnyvale, California, startup called Orbeus has developed what could be the best application yet for letting everyday consumers benefit from advances in deep learning. It’s called PhotoTime and, yes, it’s yet another photo-tagging app. But it looks really promising and, more importantly, it isn’t focused on business uses like so many other recent deep-learning-based services, nor has it been acquired and dissolved into Dropbox or Twitter or Pinterest or Yahoo.

Deep learning, for anyone unfamiliar with the term, is essentially a class of artificial intelligence algorithms that excel at learning the latent features of the data they analyze. The more data that deep learning systems have to train on, the better they perform. The field has made big strides in recent years, largely with regard to machine-perception workloads such as computer vision, speech recognition and language understanding.

(If you want to get a crash course in what deep learning is and why web companies are investing billions of dollars into it, come to Structure Data in March and watch my interview with Rob Fergus of Facebook Artificial Intelligence Research, as well as several other sessions.)


The Orbeus team. L to R: Yuxin Wu, Yi Li, Wei Xia and Meng Wang.

I am admittedly late to the game in writing about PhotoTime (it was released in November) because, well, I don’t often write about mobile apps. The people who follow this space for a living, though, also seemed impressed with it when they reviewed it back then. Orbeus, the company behind PhotoTime, launched in 2012 and its first product is a computer vision API called ReKognition. According to CEO Yi Li, it has already raised nearly $5 million in venture capital.

But I ran into the Orbeus team at a recent deep learning conference and was impressed with what they were demonstrating. As an app for tagging and searching photos, it appears very rich. It tags smartphone photos using dozens of different categories, including place, date, object and scene. It also recognizes faces — either by connecting to your social networks and matching contacts with people in the photos, or by building collections of photos including the same face and letting users label them manually.

You might search your smartphone, for example, for pictures of flowers you snapped in San Diego, or for pictures of John Smith at a wedding in Las Vegas in October 2013. I can’t vouch for its accuracy personally because the PhotoTime app for Android isn’t yet available, but I’ll give it the benefit of the doubt.


More impressive than the tagging features, though — and the thing that could really set it apart from other deep-learning-powered photo-tagging applications, including well-heeled ones such as Google+, Facebook and Flickr — is that PhotoTime actually indexes the album locally on users’ phones. Images are sent to the cloud, run through Orbeus’s deep learning models, and then the metadata is sent back to your phone so you can search existing photos even without a network connection.
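That split — heavy model inference in the cloud, lightweight tag metadata cached on the phone — can be sketched as a tiny local index. The tag names and method names below are hypothetical; PhotoTime’s actual schema isn’t public.

```python
from collections import defaultdict

class LocalPhotoIndex:
    """Offline-searchable index of tags returned by a cloud vision service."""

    def __init__(self):
        self._by_tag = defaultdict(set)  # tag -> set of photo ids

    def ingest(self, photo_id, tags):
        """Store metadata sent back from the cloud; the pixels stay remote."""
        for tag in tags:
            self._by_tag[tag.lower()].add(photo_id)

    def search(self, *tags):
        """Intersect tag sets so multi-term queries work with no network."""
        sets = [self._by_tag[t.lower()] for t in tags]
        return set.intersection(*sets) if sets else set()

index = LocalPhotoIndex()
index.ingest("IMG_001", ["flower", "San Diego", "outdoor"])
index.ingest("IMG_002", ["wedding", "Las Vegas", "John Smith"])
index.ingest("IMG_003", ["flower", "indoor"])

print(index.search("flower", "San Diego"))  # {'IMG_001'}
```

Because only the metadata lives on the phone, a query like “flowers in San Diego” is a cheap set intersection rather than a round trip to a server.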

The company does have a fair amount of experience in the deep learning field, with several members, including research scientist Wei Xia, winning a couple of categories at last year’s ImageNet object-recognition competition as part of a team from the National University of Singapore. Xia told me that while PhotoTime’s application servers run largely on Amazon Web Services, the company’s deep learning system resides on a homemade, liquid-cooled GPU cluster in the company’s headquarters.

Here’s what that looks like.

The Orbeus GPU cluster.


As I’ve written before, though, tagging photos is only part of the ideal photo-app experience, and there’s still work to do there no matter how well the product functions. I’m still waiting for some photo application to perfect the curated photo album, something Disney Research is working on using another machine learning approach.

And while accuracy continues to improve for recognizing objects and faces, researchers are already hard at work applying deep learning to everything from recognizing the positions of our bodies to the sentiment implied by our photos.

TeraDeep wants to bring deep learning to your dumb devices

Open the closet of any gadget geek or computer nerd, and you’re likely to find a lot of skeletons. Stacked deep in a cardboard box or Tupperware tub, there they are: The remains of webcams, routers, phones and other devices deemed too obsolete to keep using and left to rot, metaphorically speaking, until they eventually find their way to a Best Buy recycling bin.

However, an under-the-radar startup called TeraDeep has developed a way to revive at least a few of those old devices by giving them the power of deep learning. The company has built a module that it calls the CAMCUE, which runs on an ARM-based processor and is designed to plug into other gear and run deep neural network algorithms on the inputs they send through. It could turn an old webcam into something with the smart features of a Dropcam, if not smarter.

“You can basically turn our little device into anything you want,” said TeraDeep co-founder and CTO Eugenio Culurciello during a recent interview. That potential is why the company won a Structure Data award as one of the most promising startups to launch in 2014, and will be presenting at our Structure Data conference in March.

Didier Lacroix (left) and Eugenio Culurciello (right)


But before TeraDeep can start transforming the world’s dumb gear into smart gear, the company needs to grow — a lot. It’s headquartered in San Mateo, California, and is the brainchild of Culurciello, who moonlights as an associate professor of engineering at Purdue University in Indiana. It has 10 employees, only three of whom are full-time. It has a prototype of the CAMCUE, but isn’t ready to start mass-producing the modules and getting them into developers’ hands.

I recently saw a prototype of it at a deep learning conference in San Francisco, and was impressed by how well it worked, albeit in a simple use case. Culurciello hooked the CAMCUE up to a webcam and to a laptop, and as he panned the camera, the display on the computer screen would alert the presence of a human when I was in the shot.

“As long as you look human-like, it’s going to detect you,” he said.

The prototype system can be set to detect a number of objects, including iPhones, which it was able to do when the phone was held vertically.


The webcam setup on a conference table.

TeraDeep also has developed a web application, software libraries and a cloud platform that Culurciello said should make it fairly easy for power users and application developers, initially, and then perhaps everyday consumers to train TeraDeep-powered devices to do what they want them to do. It could be “as easy as uploading a bunch of images,” he said.

“You don’t need to be a programmer to make these things do magic,” TeraDeep CEO Didier Lacroix added.

But Culurciello and Lacroix have bigger plans for the company’s technology — which is the culmination of several years of work by Culurciello to develop specialized hardware for neural network algorithms — than just turning old webcams into smarter webcams. They’d like the company to become a platform player in the emerging artificial intelligence market, selling embedded hardware and software to fulfill the needs of hobbyists and large-scale device manufacturers alike.

A TeraDeep module, up close.


It already has a few of the pieces in place. Aside from the CAMCUE module, which Lacroix said will soon shrink to about the surface area of a credit card, the company has also tuned its core technology (called nn-x, or neural network accelerator) to run on existing smartphone platforms. This means developers could build mobile apps that do computer vision at high speed and low power without relying on GPUs.

TeraDeep has also worked in system-on-a-chip design for partners that might want to embed more computing power into their devices. Think drones, cars and refrigerators, or smart-home gadgets a la the Amazon Echo and Jibo that rely heavily on voice recognition.

Lacroix said all the possibilities, and the interest it has received from folks who’ve seen and heard about the technology, are great, but noted that it might lead such a small company to suffer from a lack of focus or perhaps option paralysis.

“It’s overwhelming. We are a small company, and people get very excited,” he said. “… We cannot do everything. That’s a challenge for us.”

Move over Emeril: Robot learns how to prep food from YouTube

These days, you can learn just about anything from YouTube videos — from how to tie a knot to the best way to open a wine bottle with a shoe. And it isn’t just humans who are benefitting. University of Maryland researchers have programmed a robot to learn basic cooking skills from YouTube videos, a feat that could eventually be expanded into other skills like equipment repair.

The researchers worked with Baxter, a robot built by Rethink Robotics. Baxter is popular for its safety around humans and ease of use; it can be programmed quickly just by moving its arms, and it is smart enough to adapt to a changing work area.

But Baxter’s default software isn’t smart enough to understand a video, let alone recognize the correct measuring cup or ingredient. With the Maryland team’s help, Baxter was able to watch YouTube videos and learn what types of objects to recognize, catalog directions by picking out action verbs and observe which type of grasp would be most effective for holding each tool. Baxter then repeated the steps in the video without any input from its human operators.
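As described, the Maryland pipeline reduces each video to (action, object, grasp) triples that the robot can then replay. The toy sketch below illustrates that reduction with keyword matching over invented vocabulary; the actual system uses trained visual classifiers on the video frames, not text parsing.

```python
# Toy grammar mapping recognized action verbs to the grasp each one needs.
ACTION_GRASPS = {
    "pour": "power grasp",      # whole-hand hold on a container
    "stir": "precision grasp",  # fingertip hold on a utensil
    "cut": "power grasp",
}
KNOWN_OBJECTS = {"bowl", "cup", "spoon", "knife", "tomato"}

def parse_step(step_description):
    """Extract an (action, object, grasp) triple from one observed step."""
    words = step_description.lower().split()
    action = next((w for w in words if w in ACTION_GRASPS), None)
    obj = next((w for w in words if w in KNOWN_OBJECTS), None)
    if action and obj:
        return (action, obj, ACTION_GRASPS[action])
    return None  # step not understood; a real system would skip or ask

video_steps = [
    "Pour the milk into the bowl",
    "Stir gently with a spoon",
]
plan = [t for line in video_steps if (t := parse_step(line)) is not None]
print(plan)
```

The key idea the sketch preserves is the decomposition: recognize the object, pick out the action verb, and choose the grasp associated with that action, yielding a step-by-step plan the robot executes without further human input.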

University of Maryland computer scientist Yiannis Aloimonos (center) observes as Baxter measures ingredients.


“This system allows robots to continuously build on previous learning—such as types of objects and grasps associated with them—which could have a huge impact on teaching and training,” DARPA program manager Reza Ghanadan said in a release. “Instead of the long and expensive process of programming code to teach robots to do tasks, this research opens the potential for robots to learn much faster, at much lower cost and, to the extent they are authorized to do so, share that knowledge with other robots.”

DARPA, which funded the project, has other applications in mind like military repairs and logistics. Now that Baxter is cooking, there is no word on when the robot will learn to do the dishes.