Sight Machine, a startup trying to simplify the collection and analysis of industrial data, has raised a $5 million venture capital round from Mercury Fund, Michigan eLab, Huron River Ventures, Orfin Ventures and Funders Club, as well as its existing investors. The company originally focused on computer vision and letting users easily analyze images from assembly-line cameras, but has expanded its platform to include data from sensors, robots, and other industrial instruments and systems. Sight Machine was co-founded by Nathan Oostendorp, who also co-founded tech news site Slashdot.
When Hilary Mason talks about data, it’s a good idea to listen.
She was chief data scientist at Bit.ly, data scientist in residence at venture capital firm Accel Partners, and is now founder and CEO of research company Fast Forward Labs. More than that, she has been a leading voice of the data science movement over the past several years, highlighting what’s possible when you mix the right skills with a little bit of creativity.
Mason came on the Structure Show podcast this week to discuss what she’s excited about and why data science is a legitimate field. Here are some highlights from the interview, but it’s worth listening to the whole thing for her thoughts on everything from the state of the art in natural language processing to the state of data science within corporate America.
And if you want to see Mason, and a lot of other really smart folks, talk about the future of data in person, come to our Structure Data conference that takes place March 18-19 in New York.
[soundcloud url="https://api.soundcloud.com/tracks/187259451?secret_token=s-4LM4Z" params="color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false" width="100%" height="166" iframe="true" /]
How far big data tech has come, and how fast
“Things that maybe 10 or 15 years ago we could only talk about in a theoretical sense are now commodities that we take completely for granted,” Mason said in response to a question about how the data field has evolved.
When she started at Bit.ly, she explained, the whole product was just shortened links shared across the web. That was it. So she and her colleagues had a lot of freedom rather early on to carry out data science research in an attempt to find new directions to take the company.
“That was super fun, and also the first time I realized that the technology we were building and using was actually allowing us to gather more data about natural human behavior than we’ve ever, as a research community, had access to,” Mason said.
“Hadoop existed, but was still extremely hard to use at that point,” she continued. “Now it’s something where I hit a couple buttons and a cloud spins up for me and does my calculations and it’s really lovely.”
Defending data science
It was only a couple years ago that “data scientist” was deemed the sexiest job of the 21st century, but that job title and the field of data science have always been subject to a fair amount of derision. What’s more, there’s now a collection of software vendors claiming they can automate away some of the need for data scientists via their products.
Mason disagrees with the criticism and the idea that you can automate all, or even the most important parts, of a data scientist’s job:
“You have math, you have programming, and then you have what is essentially empathy, domain knowledge, and the ability to articulate things clearly. So I think the title is relevant because those three things have not been combined in one job before. And the reason we can do that today, even though none of these things is new, is just that the technology has progressed so much that it’s possible for one person to do all these things — not perfectly, but well enough.”
“A lot of people seem to think that data science is just a process of adding up a bunch of data and looking at the results, but that’s actually not at all what the process is. To do this well, you’re really trying to understand something nuanced about the real world, you have some incredibly messy data at hand that might be able to inform you about something, and you’re trying to use mathematics to build a model that connects the two. But that understanding of what the data is really telling you is something that is still a purely human capability.”
The next big things: Deep learning, IoT and intelligent operations
As for other technologies that have Mason excited, she said deep learning is high up on the list, as are new approaches to natural language processing and understanding (those two are actually quite connected in some aspects).
“Also, being able to use AI to automate the bounds of engineering problems,” Mason said. “There are a lot of techniques we already understand pretty well that could be well applied in like operations or data center space where we haven’t seen a lot of that.”
Mason thinks one of the latest data technologies on the path to commoditization is stream processing for real-time data, and Fast Forward Labs is presently investigating probabilistic approaches to stream processing. That is, giving up a little bit of accuracy in the name of speed. However, she said, it’s important to think about the right architecture for the job, especially in an era of cheaper sensors and more-powerful, lower-power processors.
“You don’t actually need that much data to go into your permanent data store, where you’re going to spend a lot of computation resources analyzing it,” Mason explained. “If you know what you’re looking for, you can build a probabilistic system that just models the thing you’re trying to model in a very efficient way. And what this also means is that you can push a lot of that computation from a cloud cluster actually onto the device itself, which I think will open up a lot of cool applications, as well.”
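Mason doesn't name a specific technique, but a count-min sketch is one classic example of the trade-off she describes: it answers "how often has this item appeared in the stream?" in fixed memory, accepting small overestimates from hash collisions in exchange for speed and a footprint small enough to run on a device. The sketch below is an illustrative implementation, not anything from Fast Forward Labs; the class name and parameters are our own.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counts over a stream in fixed memory.

    Trades a little accuracy for speed and space: estimates are never
    lower than the true count, but hash collisions can inflate them.
    """

    def __init__(self, width=1000, depth=4):
        self.width = width    # columns per row; wider = fewer collisions
        self.depth = depth    # independent hash rows; deeper = tighter bound
        self.table = [[0] * width for _ in range(depth)]

    def _indexes(self, item):
        # Derive one column index per row by salting the hash with the row number.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item):
        for row, col in self._indexes(item):
            self.table[row][col] += 1

    def estimate(self, item):
        # The minimum across rows is the tightest upper bound on the true count.
        return min(self.table[row][col] for row, col in self._indexes(item))


# Hypothetical usage: count events without storing the raw stream.
cms = CountMinSketch()
for event in ["click", "click", "view", "click"]:
    cms.add(event)
approx = cms.estimate("click")  # at least 3, at most the stream length
```

The memory cost is width × depth counters regardless of how many events flow through, which is what makes this kind of structure attractive for the cheap-sensor, low-power scenario Mason sketches.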
A startup out of Las Vegas is trying to capitalize on a very difficult, and potentially very lucrative, opportunity within the internet of things. The company, called Terbine, wants to become a data broker for the world of connected devices by building a platform where companies can buy, sell and share the data their sensors are collecting.
Terbine is still very young — the company has just raised seed funding from a firm called Incapture Group — but founder and CEO David Knight has big plans. He’s looking at everything from billboards to drones, from shipping vessels to satellites, as potential sources for a massive database of information about what’s happening in the physical world. He thinks companies will pay big money to be able to monitor pedestrian traffic in key markets thousands of miles away, for example, or to identify the potential closure of shipping lanes because of an oil spill long before it’s reported.
Terbine would play the middleman in all of these transactions, collecting the data, curating and formatting it, and then managing access to it. Knight envisions a market-like approach to access, where some data might be free, but most would be priced based on how timely it is, how rare it is or how relevant it is at any given time. He’s looking at sectors such as energy, agriculture, and oil and gas — which has become much less centralized thanks to fracking — as early targets.
“I realize a lot of people are talking about the internet of things,” Knight said, “but so far most of the conversation reminds me of the early days of CB radio.” Back then lots of people had a radio, like lots of people now have sensors, but there was no place to go to connect with the most interesting people.
It’s not an insane idea — even Cisco has pitched the idea of “data infomediaries,” and IBM has suggested companies could make money by recycling data — but so far no one has really been able to pull it off. There are myriad regulatory hurdles to overcome, not to mention the technological challenges of building such an infrastructure. Terbine has already prototyped a platform for the data exchange on Amazon Web Services and is thinking about its edge-network architecture, but actually building it is another story.
There’s also the not-so-small question of how Terbine, or any company attempting to build such a platform, will get companies on board with the data-sharing plan. Many data marketplaces so far have been populated with data that’s either not too interesting or, in the case of some early government efforts, not available in usable formats.
Knight thinks companies will certainly be willing to pay for quality data, but acknowledges that bartering (give some data to get some data) might be a better method for getting them initially involved and proving there’s value in the exchange. He said Terbine also hopes to deploy its own network of sensors with strategic partners so it can ensure certain data it perceives as valuable will be available.
Being headquartered in Las Vegas might be a strategic advantage, Knight said, because of the highly interconnected SuperNAP data center (where he hopes to eventually host Terbine’s platform) and the Department of Energy’s Remote Sensing Laboratory. He’s hopeful the latter could offer a testbed for some of Terbine’s plans, and possibly some talent.
It’s a longshot to be sure, but Knight, who has previously been involved in the quest to bring the Endeavour space shuttle to the California Science Center and is also working on a high-tech virtual reality tour of the craft, says he’s game for it.
“What I really like,” he said, “is being involved with things people say can’t be done.”
A company called FarmLink has raised $40 million in equity capital to further its business of analyzing sensor data to determine how much food a field can ideally yield. It’s just the latest in a string of investments at the intersection of agriculture and data.
Big data has been a buzzword for years, but it’s a lot more than just buzz. There are now so many tools and technologies for creating, collecting and analyzing data that almost anything is possible if you know where to look.
Microsoft showed off more of its big data strategy on Tuesday in an event that touched on everything from Excel to “ambient intelligence.” If the company can execute, it has a shot at repeating its desktop success in the data era.
Ford is turning to data culled from social media to make design decisions on its new vehicles, according to data scientist Michael Cavaretta.
Even Booz Allen Hamilton has dollar signs in its eyes when it thinks about sports data. The company is getting started on a new venture to apply its data science mastery to the piles of sensor and statistical data teams are generating.
Splunk is switching CTOs, as co-founder Erik Swan is stepping down to be replaced by former Yahoo exec and Continuuity co-founder Todd Papaioannou.
A Seattle-based startup called Seeq has raised $6 million to help companies capitalize on the Industrial Internet by letting them use the streams of data their business processes are generating.