Netflix is open sourcing tools for analyzing data in Hadoop

The data team at Netflix is open sourcing some of the tools it uses to analyze data stored in Hadoop. The overall open source project is called Surus, and it focuses on user-defined functions (or UDFs) that Netflix has built for Apache Hive and Pig, two higher-level frameworks that make it easier to query Hadoop data and write data-processing jobs.

The first tool Netflix has released as part of Surus is a Pig function, called ScorePMML, for scoring predictive models at scale. Within Netflix, the goal was to standardize the process of taking a model that someone has built (using R, for example) and tested on a small dataset, then running it against a much larger dataset in Hadoop and possibly rolling it out as a production model.
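To make the idea of scoring a portable model concrete, here is a minimal Python sketch of the general pattern: parse a model description once, then apply it to each record in a dataset. It is not ScorePMML or any part of Surus. The XML fragment is a hypothetical, simplified PMML-style regression model (no namespace, a single regression table), and the field names `hours_watched` and `days_active` are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified PMML-style fragment (assumption: no XML namespace,
# one linear regression table). Real PMML files carry much more metadata.
MODEL_XML = """
<PMML>
  <RegressionModel functionName="regression">
    <RegressionTable intercept="1.5">
      <NumericPredictor name="hours_watched" coefficient="0.5"/>
      <NumericPredictor name="days_active" coefficient="2.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""


def load_model(xml_text):
    """Parse the intercept and coefficients once, before scoring any rows."""
    table = ET.fromstring(xml_text).find("./RegressionModel/RegressionTable")
    intercept = float(table.get("intercept"))
    coefficients = {
        p.get("name"): float(p.get("coefficient"))
        for p in table.findall("NumericPredictor")
    }
    return intercept, coefficients


def score(model, record):
    """Score one record: intercept plus the weighted sum of its fields."""
    intercept, coefficients = model
    return intercept + sum(c * record[name] for name, c in coefficients.items())


def score_all(model, records):
    """Lazily score any iterable of records, small sample or large dataset."""
    for record in records:
        yield score(model, record)


model = load_model(MODEL_XML)
rows = [
    {"hours_watched": 2.0, "days_active": 1.0},
    {"hours_watched": 0.0, "days_active": 3.0},
]
scores = list(score_all(model, rows))
print(scores)
```

Because the model lives in a file rather than in code, the same description can be tested on a handful of rows and then handed unchanged to a job that scores a far larger dataset.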

According to the blog post introducing Surus and ScorePMML, future releases will include tools for tasks such as pattern recognition and outlier detection. The post goes into more detail about how ScorePMML works, where it shines and what its limitations are.

Netflix, of course, has become a poster child for the benefits of data analysis, using it to inform everything from content recommendations to streaming performance, and a leader in open source technology. For more on its efforts in both of these areas, check out my November interview with Netflix Chief Product Officer Neil Hunt, and our Structure Show podcast interview with Netflix engineers Ruslan Meshenberg and Andrew Spyker.

To learn even more about how the best and brightest companies around are using data to glean insights and build entirely new products, check out our Structure Data conference March 18 and 19 in New York. Speakers include data experts from companies such as BuzzFeed, Facebook, ESPN, Spotify and Goldman Sachs.
