When Hilary Mason talks about data, it’s a good idea to listen.
She was chief data scientist at Bit.ly, data scientist in residence at venture capital firm Accel Partners, and is now founder and CEO of research company Fast Forward Labs. More than that, she has been a leading voice of the data science movement over the past several years, highlighting what’s possible when you mix the right skills with a little bit of creativity.
Mason came on the Structure Show podcast this week to discuss what she’s excited about and why data science is a legitimate field. Here are some highlights from the interview, but it’s worth listening to the whole thing for her thoughts on everything from the state of the art in natural language processing to the state of data science within corporate America.
And if you want to see Mason, and a lot of other really smart folks, talk about the future of data in person, come to our Structure Data conference that takes place March 18-19 in New York.
[soundcloud url=”https://api.soundcloud.com/tracks/187259451?secret_token=s-4LM4Z” params=”color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false” width=”100%” height=”166″ iframe=”true” /]
How far big data tech has come, and how fast
“Things that maybe 10 or 15 years ago we could only talk about in a theoretical sense are now commodities that we take completely for granted,” Mason said in response to a question about how the data field has evolved.
When she started at Bit.ly, she explained, the whole product was just shortened links shared across the web. That was it. So she and her colleagues had a lot of freedom rather early on to carry out data science research in an attempt to find new directions to take the company.
“That was super fun, and also the first time I realized that the technology we were building and using was actually allowing us to gather more data about natural human behavior than we’ve ever, as a research community, had access to,” Mason said.
“Hadoop existed, but was still extremely hard to use at that point,” she continued. “Now it’s something where I hit a couple buttons and a cloud spins up for me and does my calculations and it’s really lovely.”
Defending data science
It was only a couple years ago that “data scientist” was deemed the sexiest job of the 21st century, but that job title and the field of data science have always been subject to a fair amount of derision. What’s more, there’s now a collection of software vendors claiming they can automate away some of the need for data scientists via their products.
Mason disagrees with the criticism and the idea that you can automate all, or even the most important parts, of a data scientist’s job:
“You have math, you have programming, and then you have what is essentially empathy domain knowledge and the ability to articulate things clearly. So I think the title is relevant because those three things have not been combined in one job before. And the reason we can do that today, even though none of these things is new, is just that the technology has progressed so much that it’s possible for one person to do all these things — not perfectly, but well enough.”
“A lot of people seem to think that data science is just a process of adding up a bunch of data and looking at the results, but that’s actually not at all what the process is. To do this well, you’re really trying to understand something nuanced about the real world, you have some incredibly messy data at hand that might be able to inform you about something, and you’re trying to use mathematics to build a model that connects the two. But that understanding of what the data is really telling you is something that is still a purely human capability.”
The next big things: Deep learning, IoT and intelligent operations
As for other technologies that have Mason excited, she said deep learning is high up on the list, as are new approaches to natural language processing and understanding (those two are actually quite connected in some aspects).
“Also, being able to use AI to automate the bounds of engineering problems,” Mason said. “There are a lot of techniques we already understand pretty well that could be well applied in like operations or data center space where we haven’t seen a lot of that.”
Mason thinks one of the latest data technologies on the path to commoditization is stream processing for real-time data, and Fast Forward Labs is presently investigating probabilistic approaches to stream processing. That is, giving up a little bit of accuracy in the name of speed. However, she said, it’s important to think about the right architecture for the job, especially in an era of cheaper sensors and more-powerful, lower-power processors.
“You don’t actually need that much data to go into your permanent data store, where you’re going to spend a lot of computation resources analyzing it,” Mason explained. “If you know what you’re looking for, you can build a probabilistic system that just models the thing you’re trying to model in a very efficient way. And what this also means is that you can push a lot of that computation from a cloud cluster actually onto the device itself, which I think will open up a lot of cool applications, as well.”