Netflix is open sourcing tools for analyzing data in Hadoop

The data team at Netflix is open sourcing some of the tools it uses to analyze data stored in Hadoop. The overall open source project is called Surus, and it focuses on user-defined functions (or UDFs) that Netflix has built for Apache Hive and Pig, two higher-level frameworks that make it easier to query Hadoop data and write data-processing jobs.

The first tool Netflix has released as part of Surus is a Pig function, called ScorePMML, for scoring predictive models at scale. Within Netflix, the goal was to standardize the process of taking a model that someone has built (using R, for example) and tested on a small dataset, then running it against a much larger dataset in Hadoop and possibly rolling it out as a production model.

According to the blog post introducing Surus and ScorePMML, future releases will include tools for tasks such as pattern recognition and outlier detection. The post goes into more detail about how ScorePMML works, where it shines and what its limitations are.

Netflix, of course, has become a poster child for the benefits of data analysis (using it to inform everything from content recommendations to streaming performance) and a leader in open source technology. For more on its efforts in both of these areas, check out my November interview with Netflix Chief Product Officer Neil Hunt, and our Structure Show podcast interview with Netflix engineers Ruslan Meshenberg and Andrew Spyker (embedded below).

To learn even more about how the best and brightest companies around are using data to glean insights and build entirely new products, check out our Structure Data conference March 18 and 19 in New York. Speakers include data experts from companies such as BuzzFeed, Facebook, ESPN, Spotify and Goldman Sachs.

[soundcloud url="https://api.soundcloud.com/tracks/177729369?secret_token=s-e9ILu" params="color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false" width="100%" height="166" iframe="true" /]

In the name of accuracy, Google retools its Flu Trends model

Google has responded to criticisms over its Flu Trends tool and has reworked its predictive model to also account for data from the Centers for Disease Control and Prevention. It’s still not a replacement for actual scientific research, but it should be more accurate.

eBay buys Decide.com; co-founder Etzioni heading up AI institute

eBay has acquired Seattle-based price-prediction startup Decide.com, and the service will shut down on Sept. 30. The entire team will head over to eBay to help the e-commerce giant improve its experience through predictive modeling. The entire team except co-founder and CTO Oren Etzioni, that is: the University of Washington computer science professor, Madrona Venture Group partner and Farecast founder is heading up Paul Allen’s new Allen Institute for Artificial Intelligence.

Oren Etzioni

At Netflix, big data can affect even the littlest things

There has been a lot of talk about data after the success of “Orange is the New Black” on Netflix, but as the content competition picks up in streaming TV, it might be the little things where big data has the biggest impact.

Data doesn’t play politics — and most of it suggests Obama will win

It’s one day before the presidential election, and the results from computer models and other data analyses are in, with most experts giving President Obama a higher probability of winning than challenger Mitt Romney. That’s no lock, however: while data doesn’t lie, models sometimes do.

EMC buys big-data-plus-security startup Silver Tail

EMC is beefing up its RSA security division with the planned acquisition of big-data-focused startup Silver Tail Systems. Silver Tail’s software detects malicious activity by monitoring and analyzing site visitors’ behavior in order to determine whether it’s harmless or potentially harmful. The startup claims multiple large banking and e-commerce sites among its customers.

As I’ve explained before, big data and real-time security software are a match made in heaven because systems get smarter as they get more data to feed their predictive models. The more that security products know about what’s normal and what’s not, the more accurately they can identify threats. Specifically, Silver Tail’s suite of products can flag anomalous behavior in real time, thus letting security personnel decide how to respond, but it also can apply rules to try to mitigate suspected attacks with CAPTCHAs or other authentication tools.

Silver Tail was founded in 2008 and had raised more than $20 million in venture capital from a number of firms, including Andreessen Horowitz and Citi Ventures.