Voices in AI – Episode 11: A Conversation with Gregory Piatetsky-Shapiro

[voices_in_ai_byline]
In this episode, Byron and Gregory talk about consciousness, jobs, data science, and transfer learning.
[podcast_player name="Episode 11: A Conversation with Gregory Piatetsky-Shapiro" artist="Byron Reese" album="Voices in AI" url="https://voicesinai.s3.amazonaws.com/2017-10-16-(00-43-05)-gregory-piatestsky.mp3" cover_art_url="https://voicesinai.com/wp-content/uploads/2017/10/voices-headshot-card-3.jpg"]
[voices_in_ai_link_back]
Byron Reese: This is “Voices in AI”, brought to you by Gigaom. I’m Byron Reese. Today our guest is Gregory Piatetsky. He’s a leading voice in Business Analytics, Data Mining, and Data Science. Twenty years ago, he founded and continues to operate a site called KDnuggets about knowledge discovery. It’s dedicated to the various topics he’s interested in. Many people think it’s a must-read resource. It has over 400,000 regular monthly readers. He holds an MS and a PhD in computer science from NYU. 
Welcome to the show.
Gregory Piatetsky: Thank you, Byron. Glad to be with you.
I always like to start off with definitions, because in a way we’re in such a nascent field in the grand scheme of things that people don’t necessarily start off agreeing on what terms mean. How do you define artificial intelligence?
Artificial intelligence is really machines doing things that people think require intelligence, and by that definition the goalposts of artificial intelligence are constantly moving. It was considered very intelligent to play checkers back in the 1950s, and then a program mastered it. The next boundary was playing chess, and then computers mastered that. Then people thought playing Go would be incredibly difficult, or driving cars. Artificial intelligence in general is the field that tries to develop intelligent machines. And what is intelligence? I’m sure we will discuss it, but it’s usually in the eye of the beholder.
Well, you’re right. I think a lot of the problem with the term artificial intelligence is that there is no consensus definition of what intelligence is. So, are you saying if we’re constantly moving the goalposts, it sounds like you’re saying we don’t have systems today that are intelligent.
No, no. On the contrary, we have lots of systems today that would have been considered amazingly intelligent 20 or even 10 years ago. And the progress is such that I think it’s very likely that those systems will exceed our intelligence in many areas, you know, maybe not everywhere, but in many narrow, defined areas they’ve already exceeded our intelligence. We have many systems that are somewhat useful. We don’t have any systems that are fully intelligent, possessing what is a new term now, AGI, Artificial General Intelligence. Those systems are still ahead of us, in the future.
Well, let’s talk about that. Let’s talk about an AGI. We have a set of techniques that we use to build the weak or narrow AI we use today. Do you think that achieving an AGI is just a matter of continuing to evolve those techniques: faster chips, better algorithms, bigger datasets, and all of that? Or do you think that an AGI really is qualitatively a different thing?
I think AGI is qualitatively a different thing, but I think that it is not only achievable but also inevitable. Humans also can be considered as biological machines, so unless there is something magical that we possess that we cannot transfer to machines, I think it’s quite possible that the smartest people can develop some of the smartest algorithms, and machines can eventually achieve AGI. And I’m sure it will require additional breakthroughs. Just like deep learning was a major breakthrough that contributed to significant advances in state of the art, I think we will see several such great breakthroughs before AGI is achieved.
So if you read the press about it and you look at people’s predictions on when we might get an AGI, they range, in my experience, from 5 to 500 years, which is a pretty telling fact alone that it’s that kind of range. Do you care to even throw in a dart in that general area? Like do you think you’ll live to see it or not?
Well, my specialty as a data scientist is making predictions, and I know when we don’t have enough information. I think nobody really knows. And I have no basis on which to make a prediction. I hope it’s not 5 years, and I think our experience as a society shows that we have no idea how to make predictions for 100 years from now. It’s very instructive to find so-called futurology articles, things that were written 50 years ago about what would happen in 50 years, and see how naive those people were 50 years ago. I don’t think we will be very successful in predicting 50 years out. I have no idea how long it will take, but I think it will be more than 5 years.
So some people think that what makes us intelligent, or an indispensable part of our intelligence, is our consciousness. Do you think a machine would need to achieve consciousness in order to be an AGI?
We don’t know what consciousness is. I think machine intelligence will be very different from human intelligence, just like airplane flight is very different from a bird’s, you know. Both airplanes and birds fly, and their flight is governed by the same laws of aerodynamics and physics, but they use very different principles. Airplane flight does not copy bird flight; it is inspired by it. I think in the same way, we’re likely to see that machine intelligence doesn’t copy human intelligence, or human consciousness. “What exactly is consciousness?” is more a question for philosophers, but probably it involves some form of self-awareness. And we can certainly see that machines and robots can develop self-awareness. And you know, self-driving cars already need to do some of that. They need to know exactly where they’re located. They need to predict what will happen: if they do something, what will other cars do? They have a form of what is called a model of the mind, a mirror intelligence. One interesting anecdote on this topic is that when Google first started its self-driving car experiments, the car couldn’t cross an intersection because it was always yielding to other cars. It was following the rules as they were written, but not the rules as people actually execute them. And so it was stuck at that intersection, supposedly for an hour or so. Then the engineers adjusted the algorithm so it would better predict what people will do and what it will do, and it’s now able to negotiate intersections. It has some form of self-awareness. I think other robots and machine intelligences will develop some form of self-awareness, and whether it will be called consciousness or not will be for our descendants to discuss.
Well, I think that there is an agreed-upon definition of consciousness. I mean, you’re right that nobody knows how it comes about, but it’s qualia, it’s experiencing things. It’s, if you’ve ever had that sensation when you’re driving and you kind of space out, and all of a sudden, two miles later, you kind of snap to and think, “Oh my gosh, I’ve got no recollection of how I got here.” That time you were driving, that’s intelligence without consciousness. And then when you kind of snap to, and all of a sudden you’re aware, you’re experiencing the world again. Do you think a computer can actually experience something? Because wouldn’t it need to experience the world in order to really be intelligent?
Well, computers, if they have sensors, actually already experience the world. The self-driving car is experiencing the world through its radar and LIDAR and various other sensors, so they do experience it, and they do have sensors. I think it’s not useful to debate computer consciousness, because it’s like the question of, you know, how many angels can fit on the head of a pin. I think what we can discuss is what they can or cannot do. How they experience it is more a question for philosophers.
So a lot of people are worried – you know all of this, of course – and there are two big buckets of worry about artificial intelligence. The first one is that it’s going to take human jobs and we’re going to have mass unemployment, and any number of dystopian movies play that scenario out. And then other people say no, every technology that’s come along, even disruptive ones like electricity, and mechanical power replacing animal power and all of that, was merely turned around and used by humans to increase their productivity, and that’s how you get increases in standard of living. On that question, where do you come down?
I’m much more worried than I am optimistic. I’m optimistic that technology will progress. What I’m concerned about is that it will lead to increasing inequality and an increasingly unequal distribution of wealth and benefits. In Massachusetts, there used to be many toll collectors. Toll collector is not a very sophisticated job, but recently those jobs were eliminated. And the machines that eliminated them didn’t require full intelligence, basically just an RFID sensor. So we already see many jobs being eliminated by a simpler form of automation. And what society will do about it is not clear. I think the previous disruptions had much longer timespans. But now, when people like these toll collectors are laid off, they don’t have enough time to retrain themselves to become, let’s say, computer programmers or doctors. What I’d like to do about it, I’m not sure. But I like a proposal by Andrew Ng, of Stanford and Coursera. He proposed a modified version of basic income: people who are unemployed and cannot find jobs get some form of basic income, not just to sit around, but with a requirement to learn new and useful skills. So maybe that would be a possible solution.
So do you really think that when you look back across time – you know, the United States, I can only speak to that, went from generating 5% of its energy with steam to 80% in just 22 years. Electrification happened electrifyingly fast. The minute we had engines there was wholesale replacement of the animals; they were just so much more efficient. Isn’t it actually the case that when these disruptive technologies come along, they are so empowering that they are actually adopted incredibly quickly? And again, just talking about the US, unemployment for 230 years has been between 5% and 9%, other than during the Great Depression; in all the other times, it never bumped. When these highly disruptive technologies came along, they didn’t cause unemployment generally to go up, and they happened quickly, and they eliminated an enormous number of positions. Why do you think this one is different?
The main reason I think it is different is because it is qualitatively different. Previously, the machines that came along, like those driven by steam and electricity, would eliminate some of the manual work, and people could climb up the pyramid of skills to do more sophisticated work. But nowadays, artificial intelligence sort of captures this pyramid of skills, and it now competes with people on cognitive skills. And it can eventually climb to the top of the pyramid, so there will be nowhere left to climb to exceed it. And once you generate one general intelligence, it’s very easy to copy it. So you would have a very large number, let’s say, of intelligent robots that will do a very large number of things. They will compete with people to do other things. It’s just very hard to retrain, let’s say, a coal miner to become, let’s say, a producer of YouTube videos.
Well that isn’t really how it ever happens, is it? I mean, that’s kind of a rigged set-up, isn’t it? What matters is, can everybody do a job a little bit harder than they have? Because the maker of YouTube videos is a film student. And then somebody else goes to film school, and then the junior college professor decides to… I mean, everybody just goes up a little bit. You never take one group of people and train them to do an incredibly radically different thing, do you?
Well, I don’t know about that exactly, but to return to your analogy: you mentioned that in the United States, for 200 years, the pattern was such. But, you know, the United States is not the only country in the world, and 200 years is a very small part of our history. If we look at several thousand years, and at what happened elsewhere in the world, we see very complex things. The unemployment rate in the Middle Ages was much higher than 5% or 10%.
Well, I think the important thing, and the reason why I used 200 years, is because that’s the period of industrialization and automation that we’ve seen. And so the argument is that artificial intelligence is going to automate jobs, so you really only need to look over the period when you’ve had other things automating jobs to ask, “What happens when you automate a lot of jobs?” I mean, by your analogy, wouldn’t the invention of the calculator have put mathematicians out of business? Or like with ATMs: an ATM in theory replaces a bank teller. And yet we have more bank tellers today than we did when the ATM was introduced, because the ATM allows banks to open more branches and hire more tellers. I mean, is it really as simple as, “Well, you’ve built this tool, now there’s a machine doing a job a human did, and now you have an unemployed human”? Is that kind of the only force at work?
Of course it’s not that simple; there are many forces at work. And there are forces that resist change, as we’ve seen from the Luddites in the early 19th century. And now there are people, for example in coal mining districts, who want to go back to coal mining. Of course, it’s not that simple. What I’m saying is we have only had a few examples of industrial revolutions, and as data scientists say, it’s very hard to generalize from a few examples. It’s true that past technologies have generated more work. It doesn’t follow that this new technology, which is different, will generate more work for all the people. It may very well be different. We cannot rely on three or four past examples to generalize about the future.
Fair enough. So let’s talk, if we can, about how you spend your days, which is in data science, what are some recent advances that you think have materially changed the job of a data scientist? Are there ones? And are there more things that you can kind of see that are about to change and begin? Like how is that job evolving as technology changes?
Yes, well, data scientists now live in the golden age of the field. There are now more powerful tools that make data science much easier, tools like Python and R. Both Python and R have very large ecosystems of tools: scikit-learn, for example, in the case of Python, or whatever Hadley Wickham comes up with in the case of R. There are tools like Spark, and various things on top of it, that allow data scientists to access very large amounts of data. It’s much easier and much faster for data scientists to build models. The danger for data scientists, again, is automation, because as those tools make the work easier and easier, soon they will make a large part of it automated. In fact, there are already companies like DataRobot and others that allow business users who are not data scientists to just plug in their data, and DataRobot or its competitors generate the results. No data scientist needed. That is already happening in many areas. For example, ads on the internet are automatically placed, and there are algorithms that make millions of decisions per second and build lots of models. Again, no human involvement, because humans just cannot build millions of models a second. There are many areas where this automation is already happening. And recently I ran a poll on KDnuggets asking when people think data science work will be automated, and the median answer was around 2025. So although this is a golden age for data scientists, I think they should enjoy it, because who knows what will happen in the next 8 to 10 years.
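[Editor’s note: the point about how little code modern tooling requires can be illustrated with a short sketch. This example, not from the conversation, uses scikit-learn’s built-in digits dataset; the particular model and split are illustrative choices.]

```python
# Train and evaluate a model in a handful of lines with scikit-learn.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)  # small built-in image dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)          # training is a single call

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {accuracy:.3f}")
```

Services like the DataRobot product Gregory mentions essentially automate loops over many such model choices, which is why he sees this layer of the work as the most exposed to automation.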
So, Mark Cuban gave a talk earlier this year in which he said the first trillionaires will be in businesses that utilize AI. But he said something very interesting, which is that if he were coming up through university again, he would study philosophy, because that’s the last thing that’s going to be automated. What would you suggest to a young person listening to this today? What do you think they should study, in the cognitive area, that is either blossoming or likely to go away?
I think what will be very much in demand is at the intersection of humanities and technology. If I were younger I would still study machine learning and databases, which is actually what I studied for my PhD 30 years ago. I probably would study more mathematics. The deep learning algorithms that are making tremendous advances are very mathematically intensive. And the other aspect, maybe the hardest to automate, is human intuition and empathy: understanding what other people need and want, and how to best connect with them. I don’t know how much that can be studied, but if philosophy or social studies or poetry is the way to it, then I would encourage young people to study it. I think we need a balanced approach, not just technology but humanities as well.
So, I’m intrigued that our DNA is – I’m going to be off here, whatever I say – I think about 740 meg; it’s on that order. But when you look at how much of it we share with, let’s say, a banana, it’s 80-something percent, and how much we share with a chimp, it’s 99%. So somewhere in that 1%, that 7 or 8 meg of code that tells how to build you, is the secret to artificial general intelligence, presumably. Is it possible that the code to do an AGI is really quite modest and simple? Not simple – you know, there are two different camps in the AGI world. One is that humans are a hack of 100 or 200 or 300 different skills, and you put them all together and that’s us. The other is – we had Pedro Domingos on the show, and he had a book called The Master Algorithm, which posits that there is an algorithm that can solve any problem, or any solvable problem, the way a human does. Where on that spectrum would you fall? And do you think there is a simple answer to an AGI?
I don’t think there is a simple answer. Actually, I’m a good friend of Pedro’s, and I moderated his webcast on his book last year. But I think that the master algorithm he looks for may exist, but it doesn’t exclude having lots of additional specialized skills. I think there is very good evidence that there is such a thing as general intelligence in humans. People, for example, may have different scores on the SAT verbal and math sections. I know that my verbal score would be much lower than my math score. But usually if you’re above average on one, you will be above average on the other. And likewise, if you’re below average on one, you will be below average on the other. People seem to have some general skills, and in addition there are a lot of specialized skills. You know, you can be a great chess player but have no idea how to play music, or vice versa. I think there are some general algorithms, and there are lots of specialized algorithms that leverage the special structure of the domain. You can think of it this way: when people were developing chess-playing programs, they initially applied some general algorithms, but then they found that they could speed up those programs by building specialized hardware that was very specific to chess. Likewise, when people start new skills they approach them generally, then they develop specialized expertise, which speeds up their work. I think it could be likewise with intelligence. There may be some general algorithm, but it would have ways to develop lots of special skills that leverage whatever is specific to particular tasks.
Broadly speaking, I guess data science relies on three things: it relies on hardware, faster and faster hardware; better and better data, more of it and labeled better; and then better and better algorithms. If you had to put those three things side by side, where are we most deficient? Like, if you could really amp one of those three things way up, what would it be?
That’s a very good question. With current algorithms, it seems that more data produces much better results than a smarter algorithm, especially if it is relevant data. For example, for image recognition there was a big quantitative jump when deep learning was trained on millions of images as opposed to thousands of images. But I think what we need for the next big advances is somewhat smarter algorithms. One big shortcoming of deep learning is, again, that it requires so much data. People seem to be able to learn from very few examples, and the algorithms that we have are not yet able to do that. In the algorithms’ defense, I have to say that when I say people can learn from very few examples, we assume those are adults who have already spent maybe 30 or 40 years training and interacting with the world. So maybe if algorithms can spend some years training and interacting with the world, they’ll acquire enough knowledge to be able to generalize to other similar examples. Yes, I think probably data, then algorithms, and then hardware. That would be my order.
So, you’re alluding to transfer learning, which is something humans seem to be able to do. Like you said, you could show a person who’s never seen an Academy Award what that little statue looks like, and then you could show them photographs of it in the dark, on its side, underwater, and they could pick it out. And what you just said is very interesting, which is: well, yeah, we only had one photo of this thing, but we had a lifetime of learning how to recognize things underwater and in different lighting and all that. What do you think about transfer learning for computers? Do you think we’re going to be able to use the datasets that are very mature, like the image one, or handwriting recognition, or speech translation, to solve completely unrelated problems? Is there some kind of meta-knowledge buried in the things we’re doing really well now that we can apply to things we don’t have good data on?
I think so, because the world itself is the best representation. Recently I read a paper that applied the negative transformation to ImageNet images, and it turns out that a deep learning system that was trained to recognize, I don’t remember exactly what it was, but let’s say cats, would not be able to recognize negatives of cats, because the negative transformation is not part of its repertoire. But that is very easy to remedy if you just add negative versions of the images to the training data. I think there is a maybe large but finite number of such transformations that humans are familiar with, like negatives and rotations and other things. And it’s quite possible that by applying such transformations to very large existing databases, we could teach those machine learning systems to achieve and exceed human levels. Because humans themselves are not perfect at recognition.
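[Editor’s note: the augmentation Gregory describes can be sketched in a few lines. This is an illustrative example, not from the paper he mentions; the array shapes are arbitrary stand-ins for an image dataset.]

```python
# Augment a training set with photographic negatives of its images.
import numpy as np

def negative(image: np.ndarray, max_value: int = 255) -> np.ndarray:
    """Return the photographic negative: each pixel v becomes max_value - v."""
    return max_value - image

# A stand-in "training set" of 4 grayscale 28x28 images.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(4, 28, 28))

# Doubling the data: originals plus their negatives.
augmented = np.concatenate([images, negative(images)], axis=0)
print(augmented.shape)
```

The same pattern applies to rotations, flips, and the other “finite number of transformations” he mentions: each one is a cheap function applied to the existing data, so no new labeling is needed.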
Earlier in this conversation, we’ve been taking human knowledge and how people do things and kind of applying that to computers. Do you think AI researchers learn much from brain science? Do they learn much from psychology? Or is it more that those are handy for telling stories or helping people understand things? Because, as with the airplanes and birds you mentioned at the very beginning, there really isn’t a lot of mapping between how humans do things and how machines do them.
Yes. By the way, the airplanes-and-birds analogy is, I think, due to Yann LeCun. And I think some AI researchers are inspired by how humans do things, and the prime example is Geoff Hinton, who is an amazing researcher, not only because of what he has achieved, but because he has an extremely good understanding of both computers and the human brain. In several talks of his that I’ve heard, and in some conversations afterwards, he suggested he uses his knowledge of how the human brain works as an inspiration for coming up with new algorithms. Again, not copying the brain, but letting it inspire the algorithms. So to answer your question, yes, I think the human brain is very relevant to understanding how intelligence could be achieved, and as Geoff Hinton says, it’s the only working example we have at the moment.
We were able to do chess in AI so easily – not so easily, obviously people worked very hard on it – because there were so many well-kept records of games to serve as training data. We can do handwriting recognition well because we have a lot of handwriting and it’s been transcribed. We do translation well because there is a lot of training data. What are some problems that would be solvable if we just had the data for them, but we don’t have it, nor do we have any good way of getting it? Like, what’s a solvable problem where really our only impediment is that we don’t have the data?
I think at the forefront of such problems is medical diagnosis, because there are many diseases where the data already exists; it’s just maybe not collected in electronic form. There is a lot of genetic information that could be collected and correlated with diseases and with which treatments work. Again, it’s not yet collected, but Google and 23andMe and many other companies are working on that. Medical radiology recently witnessed the great success of a startup called Enlitic, which was able to identify tumors using deep learning at almost the same quality as human radiologists. So I think in medicine and health care we will see big advances. And in many other areas where there is a lot of data, we can also see big advances. But the flip side of data, which we can touch on, is that people, at least in some parts of the political spectrum, are losing their connection to what is actually true. Last year’s election saw a tremendous number of fake news stories that seemed to have significant influence. So while on the one hand we’re training machines to do a better and better job of recognizing what is true, many humans are losing their ability to recognize what is true and what is happening. Just witness the denial of climate change by many people in this country.
You mention text analysis on your LinkedIn profile; I saw that that is something you evidently know a lot about. Is the problem you’re describing solvable? If you had to say the number one problem of the worldwide web is that you don’t know what to believe, you don’t know what’s true, and you don’t necessarily have a way of sorting results by truthiness, do you think that is a machine learning problem, or is it not? Is it going to require human moderation? Or is truth not a defined-enough concept on which to train 50 billion web pages?
I think the technical part can certainly be solved, from a machine learning point of view. But the worldwide web does not exist in a vacuum; it is embedded in human society, and as such it suffers from all the advantages and problems of humans. If there are human actors who find it beneficial to bend the truth and use the worldwide web to convince other people of what they want to convince them of, they will find ways to leverage the algorithms. The algorithm by itself is not a panacea as long as there are humans, with all of our good and evil intentions, around it.
But do you think it’s really solvable? Because I remember this Dilbert comic strip I saw once where Dilbert’s on a sales call, and the person he’s talking to says, “Your salesman says your product cures cancer!” And Dilbert says, “That is true.” And the guy says, “Wait a minute! It’s true that it cures cancer, or it’s true that he said that?” And so it’s like that: the statement “Your salesperson said your product cures cancer” is a true statement. But that subtlety, that nuance, that it’s-true-but-it’s-not-true aspect of it – it doesn’t feel like chess, this very clear-cut win/lose kind of situation. And I just wonder, even if everybody wanted the true results to rise to the top, could we actually do that?
Again, I think technically it is possible. Of course, nothing will work perfectly, but humans do not make perfect decisions either. For example, Facebook already has an algorithm that can identify clickbait. One of the signals is relatively simple: just look at the number of people who see a particular headline and click on the link, and then how much time they spend there, or whether they quickly return and click back. With a headline like, “Nine amazing things you can do to cure X,” if you go to that website and it’s something completely different, you quickly return; your behavior will be different than if you had gone to a website that matches the headline. And Facebook and Google and other sites can measure those signals and see which headlines are deceptive. The problem is that the ecosystem that has evolved seems to reward capturing people’s attention, and the headlines most likely to be shared are the ones that capture attention and generate emotion, either anger or something cute. We’re evolving toward an internet of partisan anger and cute kittens; those are the two extreme axes of what gets attention. I think the technical part is solvable. The problem is that, again, there are humans around it with very different motivations from yours and mine. It’s very hard to work when your enemy is using various cyber-weapons against you.
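[Editor’s note: the dwell-time signal Gregory describes can be sketched as a toy scorer. The threshold values and function names here are illustrative assumptions, not Facebook’s actual algorithm.]

```python
# Flag a headline as likely clickbait when most clicks bounce back quickly.

def bounce_rate(dwell_times_seconds, threshold_seconds=10.0):
    """Fraction of visits where the reader left almost immediately."""
    bounces = [t for t in dwell_times_seconds if t < threshold_seconds]
    return len(bounces) / len(dwell_times_seconds)

def looks_like_clickbait(dwell_times_seconds, bounce_cutoff=0.6):
    """True when the quick-return rate exceeds an (assumed) cutoff."""
    return bounce_rate(dwell_times_seconds) > bounce_cutoff

# Dwell times (seconds on page) for two hypothetical headlines.
honest_headline = [45.0, 120.0, 8.0, 90.0, 60.0]
deceptive_headline = [3.0, 2.5, 40.0, 4.0, 1.0]

print(looks_like_clickbait(honest_headline))     # False
print(looks_like_clickbait(deceptive_headline))  # True
```

A production system would combine many such behavioral signals in a trained model, but the core idea is the same: the gap between what a headline promises and how readers behave after clicking is itself measurable data.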
Do you think nutrition may be something that would be really hard as well? Because no two people – you eat however many times a day, however many different foods, and there is nobody else on the planet who eats that same combination, even for seven consecutive days or something. Do you think that nutrition is a solvable thing, or are there too many variables for there ever to be a dataset that would be able to say, “If you eat broccoli and chocolate ice cream and go to the movie at 6:15, you’ll live longer”?
I think that is certainly solvable. Again, the problem is that humans are not completely logical. That is our beauty and our problem. People know what is good for them, but sometimes they just want something else. We have our own animal instincts that are very hard to control. That’s why all diets work, just not for very long. People go on diets very frequently, find that they didn’t work, and go on them again. Yes, as far as the information goes, nutrition can be solved. But the motivation, convincing people to follow good nutrition, that is a much, much harder problem.
All right! Well, it looks like we are out of time. Would you go ahead and tell the listeners how they can keep up with you: your website, any ways they can follow you, how to get hold of you, and all of that?
Yes. Thank you, Byron. You can find me on Twitter @KDnuggets, and visit the website KDnuggets.com. It’s a magazine for data scientists and machine learning professionals. We publish only a few interesting articles a day. And I hope you can read it, or if you have something to say, contribute to it! And thank you for the interview, I enjoyed it.
Thank you very much.
Byron explores issues around artificial intelligence and conscious computers in his upcoming book The Fourth Age, to be published in April by Atria, an imprint of Simon & Schuster. Pre-order a copy here.
[voices_in_ai_link_back]

Turning data scientists into action heroes: The rise of self-service Hadoop

Mike is chief operating officer at Altiscale.
The unfortunate truth about data science professionals is that they spend a shockingly small amount of time actually exploring data. Instead, they are stuck devoting significant amounts of time to wrangling data and pouring resources into the tedious work of prepping and managing it.
While Hadoop excels at turning massive amounts of data into valuable insights, it’s also a notorious culprit for sucking up resources. In fact, these hurdles are serious bottlenecks to big data success, with research firm Gartner predicting that through 2018, 70 percent of Hadoop deployments will not meet cost savings and revenue generation objectives due to skills and integration challenges.
Whether it’s time stuck in a queue behind higher-priority jobs or time spent functioning as a Hadoop operations person — building their own clusters, accessing data sources, and running and troubleshooting jobs — data scientists are wasting time on administrative tasks. Sure, it’s necessary to do some heavy lifting to successfully perform analysis on data. But it isn’t the best use of a data scientist’s time, and it’s a drain on an organization’s resources.
That said, how can data scientists stop serving as substitute Hadoop administrators and become analytics action heroes?
Just as the business intelligence industry has moved to a more self-service model, the Hadoop industry is also moving to a self-service model. Operational challenges are moving to the background, so that data scientists are liberated to spend more time building models, exploring data, testing hypotheses, and developing new analytics.
Self-service Hadoop solutions simplify, streamline, and automate the steps needed to create a data exploration environment. Self-service is achieved when a provider (one who runs and operates a scalable, secure Hadoop environment) delivers a data science platform for the analytics team.
With a self-service environment, data scientists can focus on the data analysis, while being confident that the data and Hadoop operations are well taken care of. And these environments can be kept separate from production environments, ensuring that test data science jobs don’t interfere with a production Hadoop environment that is core to business operations, thereby reducing risk of operational mishaps.
As we see a rise in self-service Hadoop, organizations will realize the benefits of analytics action heroes and their super power contributions. Here are a few reasons why:

  • Faster understanding of trends and correlations that drive business action: Self-service tools eliminate the complex and time-consuming steps of procuring and provisioning hardware, installing and configuring Hadoop, and managing clusters in production. By automating the handling of issues that crop up in production, such as job failures, resource contention, performance optimization and infrastructure upgrades, data analytics projects run with more ease and speed.
  • Freedom to take risks with more agile data science and analytics teams: Using the latest self-service technology in the Hadoop ecosystem, organizations can gain a competitive edge not previously possible. Teams can experiment with advanced technology in a production environment, without the overhead associated with maintaining an on-premise solution. This allows data scientists to develop cutting-edge products that leverage features in the most advanced software available.
  • Increased time for Hadoop experts to focus on value-added tasks: Operational stability frees up internal resources so Hadoop experts can focus on unearthing data insights and other value-added tasks such as data modeling. Simply put, with more time spent on examining the data rather than wrangling it, organizations can uncover insights that drive business forward — and deliver on the true promise of big data.

Hadoop has unlimited potential to drive business forward. Yet, it can quickly become a drain on internal operational resources when running in production and at scale. Organizations need to devote more time on data science and not on the Hadoop infrastructure to fully realize big data’s potential — self-service tools make this a reality.

Airbnb open sources SQL tool built on Facebook’s Presto database

Apartment-sharing startup Airbnb has open sourced a tool called Airpal that the company built to give more of its employees access to the data they need for their jobs. Airpal is built atop the Presto SQL engine that Facebook created in order to speed access to data stored in Hadoop.

Airbnb built Airpal about a year ago so that employees across divisions and roles could get fast access to data rather than having to wait for a data analyst or data scientist to run a query for them. According to product manager James Mayfield, it’s designed to make it easier for novices to write SQL queries by giving them access to a visual interface, previews of the data they’re accessing, and the ability to share and reuse queries.

It sounds a little like the types of tools we often hear about inside data-driven companies like Facebook, as well as the new SQL platform from a startup called Mode.

At this point, Mayfield said, “Over a third of all the people working at Airbnb have issued a query through Airpal.” He added, “The learning curve for SQL doesn’t have to be that high.”

He shared the example of folks at Airbnb tasked with determining the effectiveness of the automated emails the company sends out when someone books a room, resets a password or takes any of a number of other actions. Data scientists used to have to dive into Hive — the SQL-like data warehouse framework for Hadoop that [company]Facebook[/company] open sourced in 2008 — to answer that type of question, which meant slow turnaround times because of human and technological factors. Now, lots of employees can access that same data via Airpal in just minutes, he said.

The Airpal user interface.


As cool as Airpal might be for Airbnb users, though, it really owes its existence to Presto. Back when everyone was using Hive for data analysis inside Hadoop — it was and continues to be widely used within web companies — only 10 to 15 people within Airbnb understood the data and could write queries using its somewhat complicated version of SQL. Because Hive is based on MapReduce, the batch-processing engine most commonly associated with Hadoop, Hive is also slow (although new improvements have increased its speed drastically).

Airbnb also used [company]Amazon[/company]’s Redshift cloud data warehouse for a while, said software engineer Andy Kramolisch, and while it was fast, it wasn’t as user-friendly as the company would have liked. It also required replicating data from Hive, meaning more work for Airbnb and more data for the company to manage. (If you want to hear more about all this Hadoop and big data stuff from leaders at [company]Google[/company], Cloudera and elsewhere, come to our Structure Data conference March 18-19 in New York.)

A couple years ago, Facebook created and then open sourced Presto as a means to solve Hive’s speed problems. It still accesses data from Hive, but is designed to deliver results at interactive speeds rather than in minutes or, depending on the query, much longer. It also uses standard ANSI SQL, which Kramolisch said is easier to learn than the Hive Query Language and its “lots of hidden gotchas.”
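None of Airbnb’s actual queries are public, but the kind of question described above (email effectiveness by campaign) comes down to a few lines of plain ANSI SQL. Here is a minimal sketch using Python’s built-in sqlite3 as a stand-in for a Presto-backed warehouse; the emails table and its columns are hypothetical:

```python
import sqlite3

# In-memory database standing in for a warehouse that speaks ANSI SQL.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE emails (
        campaign TEXT,     -- e.g. 'booking_confirmation', 'password_reset'
        sent     INTEGER,  -- 1 if the email was sent
        opened   INTEGER   -- 1 if the recipient opened it
    )
""")
conn.executemany(
    "INSERT INTO emails VALUES (?, ?, ?)",
    [("booking_confirmation", 1, 1),
     ("booking_confirmation", 1, 0),
     ("password_reset", 1, 1),
     ("password_reset", 1, 1)],
)

# Open rate per campaign: standard SQL, no dialect-specific gotchas.
rows = conn.execute("""
    SELECT campaign,
           1.0 * SUM(opened) / SUM(sent) AS open_rate
    FROM emails
    GROUP BY campaign
    ORDER BY open_rate DESC
""").fetchall()

for campaign, rate in rows:
    print(campaign, rate)
```

The point of a tool like Airpal is that a non-engineer can type the SELECT statement above into a visual editor and get the answer in minutes, instead of filing a ticket with a data scientist.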

Still, Mayfield noted, it’s not as if everyone inside Airbnb, or any company, is going to be running SQL queries using Airpal — no matter how easy the tooling gets. In those cases, he said, the company tries to provide dashboards, visualizations and other tools to help employees make sense of the data they need to understand.

“I think it would be rad if the CEO was writing SQL queries,” he said, “but …”

Data might be the new oil, but a lot of us just need gasoline

One of the biggest tropes in the era of big data is that data is the new oil — it’s very valuable to the companies that have it, but only after it has been mined and processed. The analogy makes some sense, but it ignores the fact that most people and companies have neither the means to collect the data they need nor the ability to process it once they have it. A lot of us just need gasoline.

Which is why I was excited to see the new Data for Everyone initiative that crowdsourcing startup CrowdFlower released on Wednesday. It’s a library of interesting and free datasets that have been gathered by CrowdFlower’s users over the years and verified by the company’s crowdsourced labor force. Topics range from Twitter sentiment on various subjects to a collection of labeled medical images.

Data for Everyone is far from comprehensive, and it’s no one-stop shop for data democratization, but it is a good approach to a problem that lots of folks have been trying to solve for years: giving people interested in analyzing valuable data access to that data in a meaningful way. Unfortunately, early attempts at data marketplaces such as Infochimps and Quandl, and even earlier incarnations of the federal Data.gov service, often included poorly formatted data or suffered from a dearth of interesting datasets.


An example of what’s available in Data for Everyone.

It’s often said that data analysts spend 85 percent of their time formatting data and only 15 percent of it actually analyzing data — a situation that is simply untenable for people whose jobs don’t revolve around data, even as tools for data analysis continue to improve. All the Tableau software or Watson Analytics or DataHero or PowerBI services in the world don’t do a whole lot to help mortals analyze data when it’s riddled with errors or formatted so sloppily it takes a day just to get it ready to upload.
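That 85 percent figure is mostly grunt work of this sort. Here is a small sketch, with made-up messy values, of normalizing one sloppy column before any charting tool can touch it:

```python
import csv
import io

# A made-up CSV of the sloppy kind described above: the same column
# encodes numbers three different ways, which breaks charting tools
# until someone cleans it up.
raw = """city,monthly_visitors
New York,"1,200,000"
Austin,300k
Portland,85000
"""

def parse_count(value):
    """Normalize values like '1,200,000', '300k', and '85000' to ints."""
    value = value.strip().lower().replace(",", "")
    if value.endswith("k"):
        return int(float(value[:-1]) * 1_000)
    return int(value)

cleaned = [
    (row["city"], parse_count(row["monthly_visitors"]))
    for row in csv.DictReader(io.StringIO(raw))
]
print(cleaned)  # [('New York', 1200000), ('Austin', 300000), ('Portland', 85000)]
```

Multiply this by dozens of columns and formats and the 85/15 split stops sounding like an exaggeration.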

Hopefully, we’ll start to see more high-quality data markets pop up, as well as better tools for collecting data from services such as Twitter. They don’t necessarily need to be so easy a 10-year-old can use them, but they do need to be easy enough that someone with basic programming or analytic skills can get up and running without quitting their day job. Data for Everyone looks like one, as does the new Wolfram Data Drop, also announced on Wednesday.

Because while it’s getting a lot easier for large companies and professional data scientists to collect their data and analyze it for purposes ranging from business intelligence to training robotic brains — topics we’ll be discussing at our Structure Data conference later this month — the little guy, strapped for time and resources, still needs more help.

For now, Spark looks like the future of big data

Titles can be misleading. For example, the O’Reilly Strata + Hadoop World conference took place in San Jose, California, this week but Hadoop wasn’t the star of the show. Based on the news I saw coming out of the event, it’s another Apache project — Spark — that has people excited.

There was, of course, some big Hadoop news this week. Pivotal announced it’s open sourcing its big data technology and essentially building its Hadoop business on top of the [company]Hortonworks[/company] platform. Cloudera announced it earned $100 million in 2014. Lost in the grandstanding was MapR, which announced something potentially compelling in the form of cross-data-center replication for its MapR-DB technology.

But pretty much everywhere else you looked, it was technology companies lining up to support Spark: Databricks (naturally), Intel, Altiscale, MemSQL, Qubole and ZoomData among them.

Spark isn’t inherently competitive with Hadoop — in fact, it was designed to work with Hadoop’s file system and is a major focus of every Hadoop vendor at this point — but it kind of is. Spark is known primarily as an in-memory data-processing framework that’s faster and easier than MapReduce, but it’s actually a lot more. Among the other projects included under the Spark banner are file system, machine learning, stream processing, NoSQL and interactive SQL technologies.
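The in-memory difference is easy to sketch. A MapReduce-style engine rereads its input for every job in an iterative workload, while a Spark-style engine can load the working set once and keep it cached. The toy Python model below just counts simulated disk scans; the names and numbers are illustrative, not Spark’s API:

```python
DATASET = list(range(10_000))  # stand-in for a file on HDFS
disk_reads = 0

def read_from_disk():
    """Simulate scanning the input off disk; each call is one full pass."""
    global disk_reads
    disk_reads += 1
    return DATASET

def batch_style(iterations):
    """MapReduce-style: every job rereads the input from disk."""
    total = 0
    for _ in range(iterations):
        total = sum(x % 7 for x in read_from_disk())
    return total

def cached_style(iterations):
    """Spark-style: read once, keep the working set in memory
    (roughly what caching an RDD buys you)."""
    cached = read_from_disk()
    total = 0
    for _ in range(iterations):
        total = sum(x % 7 for x in cached)
    return total

batch_style(10)
reads_batch = disk_reads          # 10 scans for 10 iterations
disk_reads = 0
cached_style(10)
reads_cached = disk_reads         # 1 scan, however many iterations
print(reads_batch, reads_cached)  # 10 1
```

For a one-pass job the two styles cost about the same; the gap opens up on iterative workloads like machine learning, which is exactly where Spark first made its name.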

The Spark platform, minus the Tachyon file system and some younger related projects.


In the near term, Hadoop will probably be what pulls Spark into the mainstream, because Hadoop is still, at the very least, a cheap, trusted big data storage platform. And with Spark still relatively immature, it’s hard to see too many companies ditching Hadoop MapReduce, Hive or Impala for their big data workloads quite yet. Wait a few years, though, and we might start seeing more tension between the two platforms, or at least an evolution in how they relate to each other.

This will be especially true if there’s a big breakthrough in RAM technology or prices drop to a level that’s more comparable to disk. Or if Databricks can convince companies they want to run their workloads in its nascent all-Spark cloud environment.

Attendees at our Structure Data conference next month in New York can ask Spark co-creator and Databricks CEO Ion Stoica all about it — what Spark is, why Spark is and where it’s headed. Coincidentally, Spark Summit East is taking place the exact same days in New York, where folks can dive into the nitty gritty of working with the platform.

There were also a few other interesting announcements this week that had nothing to do with Spark, but are worth noting here:

  • [company]Microsoft[/company] added Linux support for its HDInsight Hadoop cloud service, and Python and R programming language support for its Azure ML cloud service. The latter also now lets users deploy deep neural networks with a few clicks. For more on that, check out the podcast interview with Microsoft Corporate Vice President of Machine Learning (and Structure Data speaker) Joseph Sirosh embedded below.
  • [company]HP[/company] likes R, too. It announced a product called HP Haven Predictive Analytics that’s powered by a distributed version of R developed by HP Labs. I’ve rarely heard HP and data science in the same sentence before, but at least it’s trying.
  • [company]Oracle[/company] announced a new analytic tool for Hadoop called Big Data Discovery. It looks like a cross between Platfora and Tableau, and I imagine will be used primarily by companies that already purchase Hadoop in appliance form from Oracle. The rest will probably keep using Platfora and Tableau.
  • [company]Salesforce.com[/company] furthered its newfound business intelligence platform with a handful of features designed to make the product easier to use on mobile devices. I’m generally skeptical of Salesforce’s prospects in terms of stealing any non-Salesforce-related analytics from Tableau, Microsoft, Qlik or anyone else, but the mobile angle is compelling. The company claims more than half of user engagement with the platform is via mobile device, which its Director of Product Marketing Anna Rosenman explained to me as “a really positive testament that we have been able to replicate a consumer interaction model.”

If I missed anything else that happened this week, or if I’m way off base in my take on Hadoop and Spark, please share in the comments.

[soundcloud url=”https://api.soundcloud.com/tracks/191875439″ params=”color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false” width=”100%” height=”166″ iframe=”true” /]

The 4 things (at least) you’ll learn about at Structure Data

Gigaom’s Structure Data conference is less than a month away, kicking off March 18 in New York. There are a lot of reasons to attend — great location, great networking, free drinks — but, of course, the biggest reason is great content.

With that in mind, here are four big themes of the event and the speakers who’ll be talking about them. Some are household names in the world of big data and information technology, some are researchers on the forefront of hot new fields, and others are up-and-coming entrepreneurs with big ideas about how data can change business and the world. Structure Data is your chance to hear in person what they have to say and to ask them those questions you’ve been dying to ask.

The business of big data

Everyone has heard about Hadoop, but the business of big data infrastructure is about so much more: Spark, Kafka, the internet of things, the industrial internet, visualization, social media analysis, webscale systems, machine learning. The tools are finally in place to do some really cool things, if you know where to look for them.

Structure Data speakers leading the charge in the world of data software and services include: Ted Bailey, Dataminr; Rob Bearden, Hortonworks; Eric Brewer, Google; Ann Johnson, Interana; Jock Mackinlay, Tableau Software; Hilary Mason, Fast Forward Labs; Seth McGuire, Twitter; Neha Narkhede, Confluent; Matt Ocko, Data Collective; Andy Palmer, Tamr; Tom Reilly, Cloudera; William Ruh, GE; John Schroeder, MapR; Joseph Sirosh, Microsoft; Ion Stoica, Databricks; Matt Wood, Amazon Web Services.

Eric Brewer, vice president of infrastructure, Google


A new era of artificial intelligence

Unless you live under a rock inside a cave with spotty internet, you’ve probably heard that folks including Stephen Hawking and Elon Musk think we should be leery of artificial intelligence. Perhaps they’re right, perhaps they’re wrong. But AI is hot right now because techniques such as deep learning offer effective ways of training systems that can make sense of mountains of text, audio and visual data, and because we’re closer than ever to robots that can navigate the world around them.

Structure Data speakers on the forefront of AI and machine learning research include: Ron Brachman, Yahoo; Eugenio Culurciello, TeraDeep; Rob Fergus, Facebook; Ahna Girshick, Enlitic; Jeff Hawkins, Numenta; Anthony Lewis, Qualcomm; Gary Marcus, Geometric Intelligence; Naveen Rao, Nervana Systems; Ashutosh Saxena, Stanford University; Julie Shah, MIT; Sven Strohband, MetaMind; Davide Venturelli, NASA; Brian Whitman, Spotify.

Julie Shah, Interactive Robotics Group, MIT


Users — big users — everywhere

One of the most amazing things to watch over the past few years has been how big data tools and data science techniques have dispersed from the ivory towers of places like Google and Facebook out across every type of industry. From farming to medicine, and from media to food production, data is driving some incredible investments and innovations.

Structure Data speakers discussing how data has transformed their businesses include: Krish Dasgupta, ESPN; Don Duet, Goldman Sachs; Ky Harlin, BuzzFeed; Nancy Hersh, Opower; Steven Horng, Beth Israel Deaconess Medical Center; Ravi Hubbly, Lockheed Martin; Lee Redden, Blue River Technology; Bill Squadron, STATS; Dan Zigmond, Hampton Creek.

Ky Harlin, director of data science, BuzzFeed


Data by the people, for the people and about the people

While big data has been a boon for IT vendors and some large corporations, the benefits haven’t always been so obvious when it comes to society. Better marketing and more-addictive apps only help the businesses behind them, while the privacy risks for consumers have never been higher as companies collect more and more data from the sites we visit and devices we use. Things are starting to come around, however, as smart people are using data to tackle everything from crime to geopolitics, and officials are increasingly cognizant of regulating new industries like the internet of things in ways that maximize the consumer experience while still keeping people safe.

Structure Data speakers addressing societal impacts of data analysis include: Julie Brill, Federal Trade Commission; Paul Duan, Bayes Impact; Kalev Leetaru, GDELT; Jens Ludwig, University of Chicago Crime Lab.

Julie Brill, commissioner, Federal Trade Commission


Analytics startup Mode wants to give SQL a shiny home in the cloud

Mode, a startup that’s trying to be something like a GitHub for data scientists, has added new features to its collaboration platform that make it easier to write SQL queries and share the resulting reports. While that news in itself might not be too interesting, the company claims the new stuff represents a major upgrade for a service already popular among some well-known users.

The new features, explained here in a company blog post, really boil down to improving the workflow for data analysts. Among other things, users can now position data tables and the SQL editor as they see fit on their screens, easily edit schema and preview reports as they’re building them. A new activity feed, a la Slack or any other enterprise social platform, enables what Mode Co-founder and Chief Analyst Benn Stancil calls “implicit collaboration” — sharing reports and other work without expressly tagging individual colleagues.

A screenshot of the new report preview.


But the bigger news in all of this might be the type of traction Mode claims it’s getting. It cites Twitch.tv, TuneIn and Munchery as customers, and Stancil said Twitch does nearly all of its analytics via Mode. “Twitch almost certainly is operating at a scale near petabytes,” he said. “…It’s not like an Excel-sized thing by any means.”

The way Mode works, fundamentally, is as an overlay atop a company’s existing SQL data store. About half of the company’s users connect via their Amazon Redshift data warehouses, but systems range from MySQL to Cloudera Impala. “It works with anything that speaks SQL,” Stancil said.

I asked Stancil if Mode is acting as an alternative to [company]Tableau Software[/company] among its customers, and he said it only is to the degree that they’re both trying to simplify the process of analyzing data and creating reports. Aside from the collaboration angle, the biggest difference is the target audience, where Stancil sees Tableau targeting savvy business users while Mode is all about the data analyst.

“There are certain people,” he said, “for whom Tableau is more difficult to use than just writing SQL because you have to go through a UI that constrains what you can do.”

You can learn more about the companies building and the people using next-generation analytics tools at our Structure Data conference next month in New York. Speakers include Tableau Vice President of Analytics Jock Mackinlay, Interana CEO Ann Johnson and BuzzFeed Director of Data Science Ky Harlin.

Why data science matters and how technology makes it possible

When Hilary Mason talks about data, it’s a good idea to listen.

She was chief data scientist at Bit.ly, data scientist in residence at venture capital firm Accel Partners, and is now founder and CEO of research company Fast Forward Labs. More than that, she has been a leading voice of the data science movement over the past several years, highlighting what’s possible when you mix the right skills with a little bit of creativity.

Mason came on the Structure Show podcast this week to discuss what she’s excited about and why data science is a legitimate field. Here are some highlights from the interview, but it’s worth listening to the whole thing for her thoughts on everything from the state of the art in natural language processing to the state of data science within corporate America.

And if you want to see Mason, and a lot of other really smart folks, talk about the future of data in person, come to our Structure Data conference that takes place March 18-19 in New York.

[soundcloud url=”https://api.soundcloud.com/tracks/187259451?secret_token=s-4LM4Z” params=”color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false” width=”100%” height=”166″ iframe=”true” /]

Download This Episode

Subscribe in iTunes

The Structure Show RSS Feed

How far big data tech has come, and how fast

“Things that maybe 10 or 15 years ago we could only talk about in a theoretical sense are now commodities that we take completely for granted,” Mason said in response to a question about how the data field has evolved.

When she started at Bit.ly, she explained, the whole product was just shortened links shared across the web. That was it. So she and her colleagues had a lot of freedom rather early on to carry out data science research in an attempt to find new directions to take the company.

Hilary Mason at Structure Data 2014, with Shivon Zilis (Bloomberg Beta), Sven Strohband (Khosla Ventures) and Jalak Jobanputra (FuturePerfect Ventures).

“That was super fun, and also the first time I realized that the technology we were building and using was actually allowing us to gather more data about natural human behavior than we’ve ever, as a research community, had access to,” Mason said.

“Hadoop existed, but was still extremely hard to use at that point,” she continued. “Now it’s something where I hit a couple buttons and a cloud spins up for me and does my calculations and it’s really lovely.”

Defending data science

It was only a couple years ago that “data scientist” was deemed the sexiest job of the 21st century, but that job title and the field of data science have always been subject to a fair amount of derision. What’s more, there’s now a collection of software vendors claiming they can automate away some of the need for data scientists via their products.

Mason disagrees with the criticism and the idea that you can automate all, or even the most important parts, of a data scientist’s job:

“You have math, you have programming, and then you have what is essentially empathy, domain knowledge, and the ability to articulate things clearly. So I think the title is relevant because those three things have not been combined in one job before. And the reason we can do that today, even though none of these things is new, is just that the technology has progressed so much that it’s possible for one person to do all these things — not perfectly, but well enough.”

She continued:

“A lot of people seem to think that data science is just a process of adding up a bunch of data and looking at the results, but that’s actually not at all what the process is. To do this well, you’re really trying to understand something nuanced about the real world, you have some incredibly messy data at hand that might be able to inform you about something, and you’re trying to use mathematics to build a model that connects the two. But that understanding of what the data is really telling you is something that is still a purely human capability.”

The next big things: Deep learning, IoT and intelligent operations

As for other technologies that have Mason excited, she said deep learning is high up on the list, as are new approaches to natural language processing and understanding (those two are actually quite connected in some aspects).

“Also, being able to use AI to automate the bounds of engineering problems,” Mason said. “There are a lot of techniques we already understand pretty well that could be well applied in like operations or data center space where we haven’t seen a lot of that.”


Hilary Mason (second from right) at Structure Data 2014.

Mason thinks one of the latest data technologies on the path to commoditization is stream processing for real-time data, and Fast Forward Labs is presently investigating probabilistic approaches to stream processing. That is, giving up a little bit of accuracy in the name of speed. However, she said, it’s important to think about the right architecture for the job, especially in an era of cheaper sensors and more-powerful, lower-power processors.

“You don’t actually need that much data to go into your permanent data store, where you’re going to spend a lot of computation resources analyzing it,” Mason explained. “If you know what you’re looking for, you can build a probabilistic system that just models the thing you’re trying to model in a very efficient way. And what this also means is that you can push a lot of that computation from a cloud cluster actually onto the device itself, which I think will open up a lot of cool applications, as well.”
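Fast Forward Labs hasn’t said which techniques it is investigating, but a classic example of the trade Mason describes (a little accuracy for a lot of speed and memory) is the count-min sketch, which estimates per-item counts in a stream using a small fixed amount of memory and never undercounts, though it may overcount on hash collisions. A minimal pure-Python version:

```python
import hashlib

class CountMinSketch:
    """Approximate stream counter: fixed memory, estimates never
    undercount, and may overcount when hashes collide."""

    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        # One independent hash per row, via a per-row salt.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode(), salt=row.to_bytes(8, "little")
            ).digest()
            yield row, int.from_bytes(digest[:8], "little") % self.width

    def add(self, item):
        for row, col in self._buckets(item):
            self.table[row][col] += 1

    def estimate(self, item):
        # The least-collided row gives the tightest estimate.
        return min(self.table[row][col] for row, col in self._buckets(item))

stream = ["sensor_a"] * 500 + ["sensor_b"] * 20 + ["sensor_c"] * 3
cms = CountMinSketch()
for event in stream:
    cms.add(event)
print(cms.estimate("sensor_a"))  # at least 500, usually exactly 500
```

A structure like this is small enough to run on the device itself, which is exactly the push-computation-to-the-edge point Mason makes above.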

Microsoft buys data science specialist Revolution Analytics

Microsoft has agreed to acquire Revolution Analytics, a company built around commercial software and support for the popular R statistical computing project. The open source R project is hugely popular among data scientists and research types, and having Revolution’s R experts in-house could be a big deal for Microsoft as it tries to establish itself as the go-to place for data science software.

Among Revolution’s additions to the standard R capabilities were simplifying the use of the program and engineering it to run across big data systems such as Hadoop. Here’s how Joseph Sirosh, Microsoft’s corporate vice president for machine learning, explains what the deal means in a blog post:

As their volumes of data continually grow, organizations of all kinds around the world – financial, manufacturing, health care, retail, research – need powerful analytical models to make data-driven decisions. This requires high performance computation that is “close” to the data, and scales with the business’ needs over time. At the same time, companies need to reduce the data science and analytics skills gap inside their organizations, so more employees can use and benefit from R. This acquisition is part of our effort to address these customer needs.

. . .

This acquisition will help customers use advanced analytics within Microsoft data platforms on-premises, in hybrid cloud environments and on Microsoft Azure. By leveraging Revolution Analytics technology and services, we will empower enterprises, R developers and data scientists to more easily and cost effectively build applications and analytics solutions at scale.

Sirosh will be speaking at Gigaom’s Structure Data conference, which takes place March 18-19 in New York.

A simple example of a plot using R.


In the blog post, Sirosh also promised to continue contributing to the open source R community, as well as to continue developing Revolution’s products. He reiterated Microsoft’s renewed (or just plain new) commitment to open source software, which includes contributions to various Hadoop-related projects and support for many open source technologies on the Azure platform.

In a separate blog post, Revolution’s David Smith detailed Microsoft’s specific commitment to R, including within the Azure Machine Learning service it announced in June:

And Microsoft is a big user of R. Microsoft used R to develop the match-making capabilities of the Xbox online gaming service. It’s the tool of choice for data scientists at Microsoft, who apply machine learning to data from Bing, Azure, Office, and the Sales, Marketing and Finance departments. Microsoft supports R extensively within the Azure ML framework, including the ability to experiment and operationalize workflows consisting of R scripts in ML Studio.

When Microsoft CEO Satya Nadella went on a cloud computing road show in October, touting the scale of Microsoft’s cloud efforts, I argued that applications, not scale, would always be Microsoft’s big advantage in that space. The same holds true for the world of big data and data science.

Revolution Analytics and the R project might not be household names in most circles, and they certainly won’t be a major driver of Microsoft revenue any time soon, but they are a big deal in the world of predictive analytics and machine learning. That’s an emerging market that Microsoft wants to get in on early, while so many other vendors are still pushing yesterday’s technologies or focused on building out infrastructure to store all the data companies want so badly to analyze.

Here are the winners of the 2015 Structure Data Awards

The second-annual Structure Data Awards are here, where Gigaom picks the most interesting and most promising data startups that launched in the previous year. The winners, which range from a non-profit data science organization to a company building infrastructure for deep learning, will present during a special session at our Structure Data conference, which takes place March 18 and 19 in New York.

This year’s winners are:

Bayes Impact: A non-profit organization that emerged from Y Combinator, Bayes Impact is trying to bring data to bear on some of society’s thorniest problems. It hosts fellows, works directly with other non-profit organizations, and puts on hackathons to identify new applications for data science.

Confluent: Apache Kafka has become a popular tool for managing real-time streams of data from web sites, applications and sensors. In March, the team that created Kafka while at LinkedIn launched Confluent to help commercialize the technology.

Enlitic: Deep learning has proven its prowess in pattern recognition and computer vision, although advances often emerge from the corporate labs of large web companies. Enlitic is applying the techniques in the name of health care by trying to build deep learning models that can diagnose disease from medical images.

Interana: Based on the data-centric culture two of its founders experienced while working at Facebook and building data products there, Interana’s software is designed to open data analysis to entire companies. Aside from the user experience, the team built an entire data storage and low-latency processing stack from the ground up.

David Soloff of Premise Data (left) and his Structure Data award in 2014.

MetaMind: MetaMind is the product of years of artificial intelligence research by its founding team, including in the field of deep learning. The company’s goal is to help other organizations make the most of their text and image data, and to push the state of the art with its own research.

Nervana Systems: As deep learning took off, the team at Nervana sensed an opportunity to build systems specially designed for the unique computing requirements of neural networks. Although the company includes folks who have worked on neurosynaptic chips at places such as Qualcomm, Nervana is building a hardware-and-software platform.

Tamr: Big data is taking off within enterprises, but finding and transforming relevant datasets is still very difficult. Tamr, which was founded by former Vertica CEO Andy Palmer and database expert Michael Stonebraker, tries to simplify that work with a combination of machine learning and human data stewards.

TeraDeep: TeraDeep straddles the intersection of two very big trends, deep learning and the internet of things. The company has developed deep learning algorithms that can run on smartphone processors and FPGAs, and is building small processors that can be embedded into devices to make them intelligent.

Of course, startups are just one part of Structure Data 2015. The event also features executives from the biggest, best and most innovative companies around — including BuzzFeed, ESPN, Google and NASA — and researchers from universities and companies including Facebook, MIT, NYU and Stanford.