Qubole, the startup from former Facebook engineers Ashish Thusoo and Joydeep Sen Sarma, just closed a Series A investment round for its service, which lets users run a variety of Hadoop jobs, including Hive, MapReduce and Pig, in the Amazon Web Services cloud. Hive is the data warehouse system and SQL-like language for Hadoop that Thusoo and Sen Sarma helped create while at the social-networking company. Charles River Ventures and Lightspeed Ventures led the round, which brings the company’s total venture capital investment to $7 million, including its seed round in late 2011.
Qubole launched in June 2012 and opened its platform for public consumption in December, Thusoo told me, and has processed about half a petabyte of customer data since then. Thus far, the platform’s biggest users have been in the advertising technology, e-commerce and application-development spaces. A common use case (and one detailed in a blog post by Qubole customer MediaMath) is to create pipelines that use Hadoop to process unstructured data before pushing it into relational databases such as MySQL, Vertica or Infobright for more-traditional business-intelligence applications.
However, Thusoo added, Qubole also has connectors for getting data out of certain other data stores, such as MongoDB, and is working on letting customers import data via API from services such as Omniture and Google Analytics.
Being in the cloud — especially Amazon’s cloud — could actually pay big dividends, too, and not just because it lets Qubole scale clusters automatically and lets users avoid the operational headaches of maintaining a Hadoop cluster. Companies are already using Amazon S3 to store a lot of data — more than 2 trillion objects at this point — and that’s Qubole’s choice for a storage system, as well. As companies move more of their big data workloads to the cloud, S3 serves as a cheap, easy and generic storage platform to which they can connect various services and applications.
In January, for example, Netflix detailed its cloud-based Hadoop platform that consists of numerous services but relies on Amazon S3 as the source-of-truth data store.
If there’s one big question about Qubole, though, it has to be the emergence of a rather large SQL-on-Hadoop market since the company launched. Although Hive has been an important part of the Hadoop stack over the past few years, its MapReduce foundation is beginning to show its age in terms of query speed, and the new breed of database startups pushing SQL analytics atop Hadoop is quick to point this out.
Thusoo has certainly noticed this activity, but he still sees Qubole as being in a good position. For starters, he said, the company is looking at interactive analytics projects such as Impala and Shark to see how they might integrate with the Qubole platform, and Hadoop startup Hortonworks is leading the Stinger project to drastically boost the speed of Hive itself.
Further, there’s the fact that Qubole itself has already optimized its platform to run, on average, about five times faster than Hive would normally run on Amazon Elastic MapReduce alone.
“We’re also keeping a close tab on other projects in our space,” Thusoo said. “We have a lot of options … to play with.”
This story was updated at 8:32 a.m. to clarify that Qubole can handle MapReduce and Pig jobs as well as Hive, and that its seed round came in late 2011, not late 2012.