Why opening up its Cosmos big data system would be the right move for Microsoft

There has been a rumor floating around since August (first, and subsequently, reported by Mary Jo Foley at ZDNet) that Microsoft is preparing to release its Cosmos big data system as a service on its Azure cloud platform, likely as an alternative to Hadoop. That would not only be a bold move for Microsoft, but also, probably, a smart one.

We’ll get to that shortly, but first a brief history of Cosmos. It’s Microsoft’s internal big data system, used to store and process data for applications such as Bing, Internet Explorer and email. Cosmos’ batch-computing element is called Dryad, and it’s similar to (although reportedly much more flexible than) the MapReduce framework associated with Hadoop as well as early-2000s Google, where MapReduce was invented. Cosmos also features a SQL-like query engine called SCOPE.

As of early 2011, Microsoft claimed it was storing about 62 petabytes of data in Cosmos. Around that timeframe, Microsoft was also previewing commercial versions of Cosmos/Dryad and pitching it as a better alternative to Hadoop. On the whole, reviews were positive.

A graphic showing Cosmos' place in the application architecture, circa 2011.


Around October 2011, Microsoft began investing heavily in making Hadoop run on Windows, an early (and wise) indication of the company’s move toward embracing both open source technologies and technologies that companies and developers have shown an interest in using. The Dryad work was moved to the Windows high-performance computing product line, where it eventually died. In February 2012, Microsoft went whole hog on Hadoop, announcing a partnership with Hortonworks and forthcoming products for both Windows Server and Azure.

The company has been pretty quiet about Cosmos, Dryad and the whole shebang in the meantime, but in August, ZDNet’s Foley reported on a Microsoft job posting that suggests the company is developing a version of Cosmos for external consumption. In late January, she expanded on the original report with information that Microsoft is recruiting testers for the Cosmos service, as well as citing improvements to the system’s SQL engine and new storage and computing components.

Microsoft declined to comment about any plans to release any products based on Cosmos.

Microsoft CEO Satya Nadella speaks at a Microsoft cloud event. Photo by Jonathan Vanian/Gigaom

Microsoft CEO Satya Nadella talks about open source at a Microsoft cloud event.

If Microsoft is indeed preparing a Cosmos service on Azure, it’s easy to see why. Data processing and analysis is going to be a major driver of IT spending in the coming decades, and smart companies are going to cover all their bases when it comes to serving customers’ needs. Hadoop is just the platform that every cloud provider, database vendor and analytics vendor needs to support because the community is so large and so many workloads already run on it.

But that doesn’t mean Hadoop is necessarily the best technology for every job, especially for cloud providers that want to control every aspect of a new service from the core code up to the user interface. Google’s Compute Engine platform supports Hadoop, but the company all but said “Hadoop is passé” when it rolled out its post-Hadoop Cloud Dataflow service in June. Databricks, a startup based around the Apache Spark technology, works very closely to integrate Spark with the Hadoop ecosystem but is banking its business on a cloud service that’s all about Spark.

If the Apache Storm stream-processing project were as popular as Hadoop is, perhaps Amazon Web Services would have built something around it rather than starting with its own stream-processing technologies, Kinesis and Lambda. Microsoft, in fact, is also now touting its own stream-processing engine called Trill that already underpins the company’s Azure Stream Analytics service, as well as streaming workloads for the backend systems powering Bing and Halo.

Comparing Trill to other streaming engines.



And new data services, especially among cloud providers, are also about showing off a company’s technological chops, much like their boasts about the number of data centers they own. Engineers like to work on the biggest and best systems around, and developers like to build applications on them. Much like Google has open sourced bits and pieces of its infrastructure in the form of Kubernetes and some Cloud Dataflow libraries, it won’t surprise me if Microsoft decides to open source parts of Cosmos and Trill at some point, perhaps to help drive more interest around its recently open sourced .NET development framework.

There is too much money up for grabs in the cloud computing and big data markets to leave good technology locked up inside a company’s internal towers. As Microsoft, Google and Amazon seek to grab as many cloud workloads as they can and to hire as many talented engineers as they can — in a competitive market that also includes very open-source-friendly companies such as Facebook and Netflix — expect to see a lot more openness about the stuff they’re building, as well as a lot more services based on it.