Big, open data: MapR on Github and Yelp’s dataset challenge

If you’re into open source, or at least open data, today is a good day. Hadoop vendor MapR has open sourced a portion of its source code on Github and Maven, while Yelp has released a sample of its data as part of a $5,000 challenge to find the most innovative use for it.
MapR’s decision to open source parts of its code is significant, but not groundbreaking. The company is releasing only its improvements to a handful of Hadoop-related Apache projects that are included in the MapR distribution of Hadoop, not the proprietary code that’s MapR’s real competitive advantage in the contentious Hadoop market. While it’s still not flying the all-open-source banner like Hortonworks is, the code release puts MapR more on par with competitor Cloudera, which bolsters its open source aspects with some proprietary software for managing Hadoop clusters.
MapR also took another step in the open source direction on Thursday, announcing a partnership with Canonical that integrates MapR’s M3 distribution with the Ubuntu Linux operating system. The two also have plans to ease the installation of MapR’s Hadoop software on OpenStack-based cloud infrastructure.
I wrote recently in relation to MapR’s $30 million VC investment that the company is in a tricky position when it comes to open source. The Hadoop ecosystem was built on open source and still values it immensely, but some customers are definitely willing to pay money for products that deliver the features they want, open source or not.
As for Yelp, well, it’s just following in the footsteps of many companies, including Netflix and everyone doing something on Kaggle (GigaOM among them), in trying to find new ways to use its data. The data set it’s releasing is from the Phoenix, Ariz., area and includes 11,537 businesses, 8,282 checkin sets, 43,873 users and 229,907 reviews. The deadline for entries is May 20, and they can be submitted in pretty much any form you can imagine.
Hopefully, for Yelp’s sake, it doesn’t step in it the way other companies, including Netflix and AOL, have when they released supposedly anonymous data sets that were later de-anonymized. Releasing data sets gives clear benefits to both the source companies and the institutions or individuals accessing the data, but privacy snafus have a way of sneaking up and eroding some of the goodwill.
Feature image courtesy of Shutterstock user Jakub Krechowicz.