Why data science matters and how technology makes it possible

When Hilary Mason talks about data, it’s a good idea to listen.

She was chief data scientist at Bit.ly, data scientist in residence at venture capital firm Accel Partners, and is now founder and CEO of research company Fast Forward Labs. More than that, she has been a leading voice of the data science movement over the past several years, highlighting what’s possible when you mix the right skills with a little bit of creativity.

Mason came on the Structure Show podcast this week to discuss what she’s excited about and why data science is a legitimate field. Here are some highlights from the interview, but it’s worth listening to the whole thing for her thoughts on everything from the state of the art in natural language processing to the state of data science within corporate America.

And if you want to see Mason, and a lot of other really smart folks, talk about the future of data in person, come to our Structure Data conference, which takes place March 18-19 in New York.

[soundcloud url=”https://api.soundcloud.com/tracks/187259451?secret_token=s-4LM4Z” params=”color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false” width=”100%” height=”166″ iframe=”true” /]

Download This Episode

Subscribe in iTunes

The Structure Show RSS Feed

How far big data tech has come, and how fast

“Things that maybe 10 or 15 years ago we could only talk about in a theoretical sense are now commodities that we take completely for granted,” Mason said in response to a question about how the data field has evolved.

When she started at Bit.ly, she explained, the whole product was just shortened links shared across the web. That was it. So she and her colleagues had a lot of freedom rather early on to carry out data science research in an attempt to find new directions to take the company.

Shivon Zilis, VC, Bloomberg Beta; Sven Strohband, Partner and CTO, Khosla Ventures; Hilary Mason, Data Scientist in Residence, Accel Partners; Jalak Jobanputra, Managing Partner, FuturePerfect Ventures.

Hilary Mason (center) at Structure Data 2014.

“That was super fun, and also the first time I realized that the technology we were building and using was actually allowing us to gather more data about natural human behavior than we’ve ever, as a research community, had access to,” Mason said.

“Hadoop existed, but was still extremely hard to use at that point,” she continued. “Now it’s something where I hit a couple buttons and a cloud spins up for me and does my calculations and it’s really lovely.”

Defending data science

It was only a couple years ago that “data scientist” was deemed the sexiest job of the 21st century, but that job title and the field of data science have always been subject to a fair amount of derision. What’s more, there’s now a collection of software vendors claiming they can automate away some of the need for data scientists via their products.

Mason disagrees with the criticism and the idea that you can automate all, or even the most important parts, of a data scientist’s job:

“You have math, you have programming, and then you have what is essentially empathy, domain knowledge and the ability to articulate things clearly. So I think the title is relevant because those three things have not been combined in one job before. And the reason we can do that today, even though none of these things is new, is just that the technology has progressed so much that it’s possible for one person to do all these things, not perfectly, but well enough.”

She continued:

“A lot of people seem to think that data science is just a process of adding up a bunch of data and looking at the results, but that’s actually not at all what the process is. To do this well, you’re really trying to understand something nuanced about the real world, you have some incredibly messy data at hand that might be able to inform you about something, and you’re trying to use mathematics to build a model that connects the two. But that understanding of what the data is really telling you is something that is still a purely human capability.”

The next big things: Deep learning, IoT and intelligent operations

As for other technologies that have Mason excited, she said deep learning is high up on the list, as are new approaches to natural language processing and understanding (those two are actually quite connected in some aspects).

“Also, being able to use AI to automate the bounds of engineering problems,” Mason said. “There are a lot of techniques we already understand pretty well that could be well applied in like operations or data center space where we haven’t seen a lot of that.”

Hilary Mason (second from right) at Structure Data 2014.

Mason thinks one of the latest data technologies on the path to commoditization is stream processing for real-time data, and Fast Forward Labs is currently investigating probabilistic approaches to stream processing, which give up a little accuracy in exchange for speed. However, she said, it’s important to think about the right architecture for the job, especially in an era of cheaper sensors and more powerful, lower-power processors.

“You don’t actually need that much data to go into your permanent data store, where you’re going to spend a lot of computation resources analyzing it,” Mason explained. “If you know what you’re looking for, you can build a probabilistic system that just models the thing you’re trying to model in a very efficient way. And what this also means is that you can push a lot of that computation from a cloud cluster actually onto the device itself, which I think will open up a lot of cool applications, as well.”
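To make that accuracy-for-speed trade concrete, here is a minimal sketch of one well-known probabilistic structure, the count-min sketch, which estimates how often each item appears in a stream while using a fixed amount of memory. It is an illustration only: Mason did not say which techniques Fast Forward Labs is studying, and the class name, the default sizes and the sample link strings are all assumptions made for this example.

```python
import hashlib


class CountMinSketch:
    """Approximate per-item counts for a stream in fixed memory.

    Estimates never undercount; hash collisions can only inflate them.
    Illustrative only, with arbitrary default sizes.
    """

    def __init__(self, width=2048, depth=4):
        self.width = width  # counters per row; wider means fewer collisions
        self.depth = depth  # number of rows, each using a differently salted hash
        self.rows = [[0] * width for _ in range(depth)]

    def _columns(self, item):
        # Pick one column per row by salting a single hash with the row index.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode("utf-8"),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row, col in self._columns(item):
            self.rows[row][col] += count

    def estimate(self, item):
        # The smallest counter is the one least inflated by collisions.
        return min(self.rows[row][col] for row, col in self._columns(item))


# Hypothetical stream of shortened links (made-up values).
sketch = CountMinSketch()
for link in ["short/a", "short/b", "short/a", "short/c", "short/a"]:
    sketch.add(link)

print(sketch.estimate("short/a"))  # never less than the true count of 3
```

Memory use here depends only on `width` and `depth`, not on how many distinct items flow through, which is the kind of property that makes it plausible to run such a model on a small device rather than in a cloud cluster.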

Hilary Mason on taking big data from theory to reality

If you’re interested in assessing how and when a given data technology (deep learning, machine intelligence, natural language generation) can move from the theoretical to commercial use, Hilary Mason may have the best job around. This week’s guest, the founder and CEO of Fast Forward Labs, talks about how the startup taps a wide array of expertise, from academic and commercial research to the open source world to “outsider art” in the realms of spam and malware, to come up with new ideas for applications.

One natural language generation (NLG) project, for example, lets a person who wants to sell her house enter the parameters (square footage, number of rooms, etc.), then step back and let the system write up the ad for that property. (As a person who makes her living from writing words, all I can say is: “ick.”)

She also has an interesting take on opportunities in the internet of things (a term she dislikes) and on why the much-maligned title of data scientist has validity. Mason is really interesting, so if you’re pressed for time, check out at least the second half of this podcast. And to hear more from her, be sure to sign up for Structure Data, where she will return to speak in March.

As for segment one, Derrick and I discuss Datapipe’s acquisition of GoGrid, the first cloud consolidation move of the new year; the long-awaited Box IPO; and an itty bit on Microsoft’s foray into augmented reality.

So get cozy and take a listen.

SHOW NOTES

Hosts: Barb Darrow and Derrick Harris.

PREVIOUS EPISODES:

On the importance of building privacy into apps and Reddit AMAs

Cheap cloud + open source = a great time for startups 

It’s all Docker containers and the cloud on the Structure Show

Mo’ money, mo’ data, mo’ cloud on the Structure Show

Why CoreOS went its own way on containers

 

 
