Big data is going mainstream, but there are still plenty of lessons to be learned from Silicon Valley data scientists whose businesses depend on data to survive. Although their use cases don’t always align with what more-traditional businesses are doing, they know enough about the science and technology to save big-data newcomers a lot of frustration.
I spent two days last week watching talks at the IE Group’s Big Data Innovation event, and here are five messages that really resonated with me. Hopefully, they’ll help your business, too.
1. Hadoop isn’t for everything. This should be common knowledge by now, but it bears repeating. Usama Fayyad, CTO at ChoozOn, pounded this point home when discussing how even Yahoo (s yhoo) — Hadoop’s biggest champion and Fayyad’s former employer (he was chief data officer) — learned this lesson the hard way. Yahoo was trying to do some advanced customer segmentation with Hadoop, he said, but found out it would be 50 times less expensive to do that particular workload with a more-traditional database architecture. The realization ultimately killed that project, which was resurrected as analytics startup nPario. Yahoo is now a paying nPario customer. (At Structure:Europe in October, we’ll debate the merits of Hadoop versus traditional relational databases onstage.)
2. Big data makes data science easier. I found this one of the more enlightening realizations, thanks in large part to how its messenger — Daniel Wiesenthal, chief data scientist at Sparked.com — was able to so clearly delineate between the sometimes-overlapping concepts of big data and data science. Essentially, he explained, techniques such as support vector machines and neural networks are time-tested and proven methods for “sucking every last ounce of information from your data set,” even when those data sets are small, but the techniques are very complicated, they’re difficult to interpret and they tend to break at scale.
However, big data lets data scientists use simpler modeling techniques such as decision trees and regression, while letting the volume of data account for accuracy (and statistical significance) rather than a super-complex algorithm. And, Wiesenthal noted, using general-purpose big data technologies such as Hadoop means data scientists can develop and test models faster because their infrastructure isn’t tuned to a specific algorithm or problem type, and it’s designed to perform well against large data sets.
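To make the "volume of data accounts for accuracy" idea concrete, here is a minimal sketch that is not drawn from Wiesenthal's talk. It fits a plain least-squares line to synthetic, noisy data and reports how far the estimated slope lands from the true one as the sample grows. The data generator, noise level, sample sizes, and function names are all assumptions chosen for illustration, and the sketch covers only the accuracy half of the argument, not the infrastructure half.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_slope_error(n, trials=20, true_slope=2.0, noise=5.0, seed=0):
    """Average absolute error of the fitted slope over several synthetic samples.

    All parameters are hypothetical; the data is generated, not real.
    """
    rng = random.Random(seed)
    total = 0.0
    for _trial in range(trials):
        xs = [rng.uniform(0, 10) for _ in range(n)]
        ys = [true_slope * x + 1.0 + rng.gauss(0, noise) for x in xs]
        slope, _intercept = fit_line(xs, ys)
        total += abs(slope - true_slope)
    return total / trials


if __name__ == "__main__":
    # The same simple model, fed progressively more data.
    for n in (20, 200, 2000, 20000):
        print(n, round(mean_slope_error(n), 4))
```

The model never gets more sophisticated in this sketch; only the sample size changes, and the slope estimate should tighten as it does.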
3. “Sometimes it’s more important to know what to kill.” Software-as-a-service pioneer Salesforce.com (s crm) uses its big data platform to monitor the uptake and usage of various product features, said director of product management Narayan Bharadwaj, but the goal isn’t only to predict what new features to add next. Rather, he explained, using data to determine which features aren’t doing well helps a company like Salesforce.com decide to put those resources into more-valuable features. “Sometimes it’s more important to know what to kill,” he said.
Bharadwaj didn’t address this point, but it seems a logical next step would be to analyze the characteristics of features that perform well/poorly to get a sense of what works and what doesn’t from a design perspective.
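As a rough illustration of what "knowing what to kill" might look like in practice, the sketch below computes the share of users who touched each feature and flags the ones below a cutoff. This is not Salesforce.com's method. The event format, the feature names, the 10 percent default threshold, and both function names are hypothetical.

```python
from collections import defaultdict


def adoption_rates(events, total_users):
    """Share of users who used each feature at least once.

    `events` is an iterable of (user_id, feature_name) pairs, an assumed format.
    """
    users_by_feature = defaultdict(set)
    for user_id, feature in events:
        users_by_feature[feature].add(user_id)
    return {f: len(users) / total_users for f, users in users_by_feature.items()}


def kill_candidates(events, total_users, all_features, threshold=0.1):
    """Features whose adoption rate falls below the (hypothetical) threshold.

    `all_features` is passed in so that features with zero usage are flagged too.
    """
    rates = adoption_rates(events, total_users)
    return sorted(f for f in all_features if rates.get(f, 0.0) < threshold)


if __name__ == "__main__":
    # Made-up usage log: three users open reports, one user opens chat twice.
    events = [(1, "reports"), (2, "reports"), (3, "reports"), (1, "chat"), (1, "chat")]
    features = ["reports", "chat", "export"]
    print(kill_candidates(events, total_users=4, all_features=features, threshold=0.5))
```

Counting distinct users rather than raw events keeps one heavy user from making a feature look popular, though a real analysis would likely weigh frequency and revenue as well.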
4. Context adds value. To put it another way, if users know why they’re being shown a particular piece of content or offer or recommendation, they’re more likely to check it out. As a senior data scientist at StumbleUpon explained, his company invests heavily in big data technologies and data science techniques in order to put the most-relevant web content in front of each user, but knows it’s not enough to expect those users to just trust the service’s judgment. Sparked.com’s Wiesenthal made a similar point in his talk, noting that services such as Pandora (s p) and Netflix (s nflx) are popular in part because they actually tell users something about themselves when recommending similar content.
5. Transaction data trumps search data. Mok Oh, chief scientist at PayPal (s ebay), discussed the chain of events that begins with product searches and ends with purchases, and how it becomes increasingly difficult to determine signals when you start at one end of the chain and work your way toward the other. PayPal is trying to traverse this gap, however, beginning with the transactions it processes and using the other data at its disposal (both internally and from external sources such as Facebook and Gnip) to try to figure out who its customers really are and what they really want. He argues this is easier than, say, Google (s goog) trying to track users from search through purchase — unless, of course, they actually purchase something using a tool like Google Wallet.
I think the greater lesson, though, is to make lemonade from the lemons that are your data. Assuming a company’s greatest data resource is the data it has gathered specific to its own business, a path toward big data success is to use that data as a starting point and then get creative figuring out ways to glean more insights from it.
Feature image courtesy of Shutterstock user Bruce Rolff.