Go ahead and deploy your Hadoop cluster in the cloud. Really. It can handle it.
Cloud computing providers and big data vendors have been working toward this moment for years, and it looks like the moment has finally come. Of all the news coming out of the Strata and Hadoop World shows taking place this week, the most compelling announcements all point in that direction. Here’s a quick recap of what was announced:
- [company]Cloudera[/company] is offering a tool called Director that makes it easier to manage a fully functional Cloudera Hadoop cluster on the Amazon Web Services cloud.
- The [company]Hortonworks[/company] Data Platform is now fully certified with Microsoft Azure, meaning users can run the same software and get the same experience as in their own data center.
- Further — presumably as a result of its longstanding partnership with Hortonworks — [company]Microsoft[/company]’s HDInsight Hadoop service now supports the Storm stream-processing framework.
- [company]Rackspace[/company] launched a big data version of its bare-metal cloud service, promising deployment of a Hadoop-Spark cluster in just a few clicks.
- Tableau announced a beta version of a connector for Amazon Elastic MapReduce (among others, including Spark SQL), which will let users pull data from the cloud service for analysis in Tableau.
That’s just product news, about just Hadoop, from just this week. The broader goings-on around the data community paint an even clearer picture of where we’re headed. All around, big, well-funded companies are putting serious effort into blurring the lines between big data platforms and cloud computing platforms:
- Apache Spark startup [company]Databricks[/company] last week released benchmark results showing how it blew away a previous record, set using Hadoop MapReduce, while running on a fraction of the machines in Amazon’s cloud. That’s important for a couple of reasons, including that Databricks has raised lots of venture capital from some of the biggest firms around on the promise that its Spark cloud service will attract lots of users.
- NoSQL vendor [company]MongoDB[/company] stopped just short of releasing its own database-as-a-service offering, opting instead to manage users’ MongoDB deployments from inside a cloud service. If users choose to deploy on AWS, though, MongoDB’s new MMS tool will automatically optimize those instances and offer to deploy them across Availability Zones in the name of resiliency.
- Hortonworks and Altiscale have been working with the Docker community to integrate its container technology with the Hadoop YARN resource-management framework. What’s more, Hortonworks has also been working with companies around integrating the Kubernetes management tool for Docker, and is now proposing the creation of an object storage system for Hadoop. This work isn’t so much about Hadoop in the cloud as much as it is about Hadoop becoming a cloud-like platform for hosting lots of disparate data applications.
- [company]Salesforce.com[/company] announced with much fanfare a new analytics service this week that is of course delivered as a cloud service. Many next-generation analytics products are also cloud-based or at least offer a cloud option, including Tableau and [company]ClearStory Data[/company].
- Companies like Google, Microsoft and IBM continue to release new machine learning features on top of their cloud platforms and inside their cloud applications, making them look like more appealing places to store data, as well.
- [company]Teradata[/company] is running a Hadoop cloud offering as well as a cloud-based offering of its flagship data warehouse system.
- [company]Oracle[/company] — Oracle! — announced a platform-as-a-service offering. It even hired Peter Magnusson, one of the key engineers behind Google App Engine, to help lead its development. Laugh if you will, but Oracle seriously suggesting its customers run their databases in the cloud (including “extreme performance” versions) says a lot about how times have changed.
And then there are the numerous startups doing some flavor of Hadoop in the cloud — [company]Mortar[/company], [company]Altiscale[/company] and [company]Qubole[/company], to name a few — and seemingly dozens of analytics startups, some of which are running some very impressive infrastructure under a sleek UI.
AWS and Google regularly release new big data services for their cloud platforms, and we’ll likely see at least one more come out of Amazon’s re:Invent show next month. By the way, every cloud provider now offers solid-state drives, and they’re the default local and persistent storage options on AWS. That’s part of the reason Spark was able to run so fast in that Databricks benchmark test.
Don’t get me wrong, we are by most accounts very far from the point where most (broadly defined) big data workloads, or probably even a significant fraction of them, are running in a cloud service. For every Netflix running large Hadoop and Cassandra clusters in the cloud, there are probably two large banks that are still experimenting with three different Hadoop software vendors. Startups such as Interana and Cask (née Continuuity) that want to target enterprise customers still face the harsh reality that their initial cloud delivery models will have to wait.
But with a few exceptions for particularly large or particularly regulated datasets, the tide is turning. If users weren’t asking for it, it’s hard to see all these companies trying so hard to make it happen.