Cloudera bought DataPad because data scientists need tooling, too

Cloudera has acquired a data-visualization startup called DataPad in order to bring DataPad’s employees, rather than its technology, into its fold. News of the acquisition first hit on Monday when DataPad users, myself included, received an email noting an acquisition and the service’s impending shutdown. On Tuesday, VentureBeat reported a rumor that Cloudera was the mystery buyer, which Gigaom has since confirmed.

However, more important than the acquisition of a fledgling startup (DataPad launched publicly in May with $1.7 million in seed funding) is what this seems to say about Cloudera’s plans for putting its Hadoop technologies into the hands of more users. Hadoop is often dinged as being designed with engineers in mind, thus the mad rush over the past few years to build applications on top of it, and Cloudera knows that one way to do that is to reach out to data scientists and developers in languages they understand.

Wes McKinney. Source: Wes McKinney on Twitter

DataPad co-founders Wes McKinney and Chang She are known in the data science community for having developed a Python-based data analysis library called Pandas. Python hasn’t always been a popular word in Hadoop circles (Hadoop is an infrastructure-level technology written in Java), but it’s a very popular word among the people who might want access to data stored inside Hadoop. For example, Mortar Data, a data analysis startup running atop the Amazon Elastic MapReduce service, has wrapped the MapReduce Java commands in a mixture of Pig and Python in order to attract a higher-level user base.

Cloudera is not blind to this reality. In April, the company released a Python client for its Impala SQL-on-Hadoop engine that integrates with Pandas and has a stated purpose of helping Python users work with Impala without disrupting their flows.

Cloudera’s embrace of Apache Spark as a framework for running a majority of future big data jobs speaks to this strategy, as well. Users don’t just like Spark because it’s faster than MapReduce; they also like it because it’s easy to program and supports the Java, Scala and Python languages. This will be especially beneficial for projects such as Cloudera Oryx, a set of machine learning libraries that is currently being rewritten on top of Spark and is almost certain to be adopted primarily by data science types.

None of this should be surprising considering the billions of dollars in play in the commercial Hadoop market. Cloudera, Hortonworks, MapR, Pivotal and more are all trying to win over as many users as they can for their respective flavors of Hadoop and general big data infrastructure. Spreading the cheerleading base beyond IT staff and systems architects, to include the people actually developing applications and doing data analysis within the company, is a good way to help ensure your stuff is the stuff that gets used.

Feature image courtesy of Shutterstock user fivespots.