Microsoft buys data science specialist Revolution Analytics

Microsoft has agreed to acquire Revolution Analytics, a company built around commercial software and support for the popular R statistical computing project. The open source R project is hugely popular among data scientists and research types, and having Revolution’s R experts in-house could be a big deal for Microsoft as it tries to establish itself as the go-to place for data science software.

Revolution’s additions to the standard R capabilities included simplifying the use of the program and engineering it to run across big data systems such as Hadoop. Here’s how Joseph Sirosh, Microsoft’s corporate vice president for machine learning, explains what the deal means in a blog post:

As their volumes of data continually grow, organizations of all kinds around the world – financial, manufacturing, health care, retail, research – need powerful analytical models to make data-driven decisions. This requires high performance computation that is “close” to the data, and scales with the business’ needs over time. At the same time, companies need to reduce the data science and analytics skills gap inside their organizations, so more employees can use and benefit from R. This acquisition is part of our effort to address these customer needs.

. . .

This acquisition will help customers use advanced analytics within Microsoft data platforms on-premises, in hybrid cloud environments and on Microsoft Azure. By leveraging Revolution Analytics technology and services, we will empower enterprises, R developers and data scientists to more easily and cost effectively build applications and analytics solutions at scale.

Sirosh will be speaking at Gigaom’s Structure Data conference, which takes place March 18-19 in New York.

A simple example of a plot using R.
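For readers who haven’t touched R, a plot like the one captioned above takes only a few lines of base R. This is a minimal sketch on simulated data, not the data behind the original figure.

```r
# A minimal base-R scatter plot on simulated data (illustrative only).
set.seed(42)
x <- rnorm(100)
y <- 2 * x + rnorm(100)

plot(x, y,
     main = "A simple example of a plot using R",
     xlab = "x", ylab = "y",
     pch  = 19, col = "steelblue")
abline(lm(y ~ x), col = "tomato", lwd = 2)  # overlay a fitted regression line
```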

In the blog post, Sirosh also promised to continue contributing to the open source R community, as well as to continue developing Revolution’s products. He reiterated Microsoft’s renewed (or just plain new) commitment to open source software, which includes contributions to various Hadoop-related projects and support for many open source technologies on the Azure platform.

In a separate blog post, Revolution’s David Smith detailed Microsoft’s specific commitment to R, including within the Azure Machine Learning service it announced in June:

And Microsoft is a big user of R. Microsoft used R to develop the match-making capabilities of the Xbox online gaming service. It’s the tool of choice for data scientists at Microsoft, who apply machine learning to data from Bing, Azure, Office, and the Sales, Marketing and Finance departments. Microsoft supports R extensively within the Azure ML framework, including the ability to experiment and operationalize workflows consisting of R scripts in ML Studio.
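As a rough illustration of what “R scripts in ML Studio” means in practice, here is a minimal sketch of an Execute R Script module body; the input data and the derived column are hypothetical, and the port-mapping helpers are the ones ML Studio injects into the module’s environment.

```r
# Sketch of an Azure ML Studio "Execute R Script" module body.
# maml.mapInputPort/maml.mapOutputPort are provided by ML Studio itself;
# the column "value" is a hypothetical field in the upstream dataset.

dataset1 <- maml.mapInputPort(1)             # read the first input port as a data.frame

dataset1$log_value <- log1p(dataset1$value)  # any R transformation or scoring logic can go here

maml.mapOutputPort("dataset1")               # send the data.frame to the module's output port
```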

When Microsoft CEO Satya Nadella went on a cloud computing road show in October, touting the scale of Microsoft’s cloud efforts, I argued that applications, not scale, would always be Microsoft’s big advantage in that space. The same holds true for the world of big data and data science.

Revolution Analytics and the R project might not be household names in most circles, and they certainly won’t be a major driver of Microsoft revenue any time soon, but they are a big deal in the world of predictive analytics and machine learning. That’s an emerging market that Microsoft wants to get in on early, while so many other vendors are still pushing yesterday’s technologies or focused on building out infrastructure to store all the data companies want so badly to analyze.

Twitter open sources an anomaly-detection tool

Twitter has open sourced a tool that it uses to detect spikes in time-series data, in order to make sure it’s able to react to issues with its servers, applications or other systems that might affect the site’s performance. The company also uses the tool, which is an R package called BreakoutDetection, to track large upticks in engagement during big events such as the Super Bowl. According to a blog post detailing BreakoutDetection, Twitter created it in order to deal with the mass of data and relative commonality of anomalies associated with running large-scale systems.
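The package exposes a single breakout() function. The sketch below runs it on a simulated series with an obvious level shift; the parameter values are chosen for illustration, not taken from Twitter’s production settings.

```r
# install.packages("devtools"); devtools::install_github("twitter/BreakoutDetection")
library(BreakoutDetection)

# Simulated series with a level shift halfway through (illustrative data only).
set.seed(1)
series <- c(rnorm(200, mean = 0), rnorm(200, mean = 3))

res <- breakout(series,
                min.size = 24,       # minimum observations between breakouts
                method   = "multi",  # look for multiple breakouts, not just one
                beta     = 0.001,    # penalty that discourages spurious breakouts
                degree   = 1,
                plot     = TRUE)

res$loc   # indices where breakouts were detected
res$plot  # ggplot visualization of the series with breakout locations marked
```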


Google has open sourced a tool for inferring cause from correlations

Google open sourced a new package for the R statistical computing software that’s designed to help users infer whether a particular action really did cause subsequent activity. Google has been using the tool, called CausalImpact, to measure AdWords campaigns, but it has broader appeal.
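Usage follows the package’s own vignette closely: supply a response series, one or more control series, and the pre- and post-intervention periods. The data and the injected effect below are simulated for illustration.

```r
# install.packages("CausalImpact")
library(CausalImpact)

# Simulated example: a control series x, a response y that tracks it,
# and an artificial lift of +10 injected after the "intervention".
set.seed(1)
x <- 100 + arima.sim(model = list(ar = 0.999), n = 100)
y <- 1.2 * x + rnorm(100)
y[71:100] <- y[71:100] + 10
data <- cbind(y, x)          # response must be the first column

pre.period  <- c(1, 70)      # before the intervention
post.period <- c(71, 100)    # after the intervention

impact <- CausalImpact(data, pre.period, post.period)
summary(impact)  # estimated absolute and relative effect with credible intervals
plot(impact)     # observed vs. counterfactual, pointwise and cumulative effect
```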

The new databases, part two (of three): optimization for every need

In part one, we looked at the forces driving a proliferation of new database solutions, loosely ordered within an emerging Hadoop ecosystem. Examples of these specialized analytical engines include:

1)    Databases optimized for cloud scaling. In the Gigaom Research report What to know when choosing database as a service, analyst George Gilbert looks at database solutions such as VoltDB, Clustrix, and NuoDB that are bringing a SQL interface to scalable database clusters.

2)    Databases optimized for archiving. As he describes in the Gigaom Research report, How to manage big data without breaking the bank, databases such as RainStor are able to leverage up to 40-times compression of archive data to bring new cost effectiveness and accessibility to the storage of data records.

3)    Open source NoSQL databases, such as MongoDB and CouchDB. These databases, which George expects to migrate more fully to the Hadoop ecosystem, are optimized for the frequent product updating required for mobile and web environments.

4)    Graph databases, such as Neo Technology’s Neo4J, that specialize in tracking and optimizing the multipoint networks found in shipping, transportation, telecommunications, computer networks, and similar environments.

5)    The GNU project’s statistical language and environment, R. This is a preexisting language for statistical analysis that will be used for stats-oriented databases within Hadoop.

6)    Splunk, whose machine log data analysis currently provides two-way integration with Hadoop and other data environments.

7)    Microsoft’s massively parallel data warehouse and Hadapt’s implementation of SQL on Hadoop. These products provide alternative routes to Hadoop database access that combine a SQL interface with lower cost and higher performance than the traditional data warehouse.

Not all of these products are presently operative or fully functional within Hadoop, but Gilbert expects they will become options within a larger Hadoop ecosystem as the IT industry works through a period of expanding database choice and complexity under an increasingly unifying Hadoop umbrella.

In part three, we will look at how this market of largely startup and open source alternatives will mature—and be made practical for the average enterprise organization.

How 0xdata wants to help everyone become data scientists

Although it’s still a work in progress, 0xdata thinks it has the answer to the problem of doing advanced statistical analysis at scale: build on HDFS, use the widely known R programming language, and hide it all under a simple interface.
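0xdata’s engine, H2O, ships with an R interface (the h2o package), so a rough sketch of that workflow from R might look like the following. The file path, column names, and model settings are placeholders, and in production the frame would typically be read from HDFS rather than a local CSV.

```r
# install.packages("h2o")
library(h2o)

h2o.init()  # start or connect to a local H2O cluster

# A local CSV stands in for an HDFS-backed dataset in this sketch.
df <- h2o.importFile("path/to/data.csv")

# Fit a logistic regression on the cluster; "outcome", "feature1" and "feature2"
# are hypothetical columns in the imported frame.
model <- h2o.glm(x = c("feature1", "feature2"),
                 y = "outcome",
                 training_frame = df,
                 family = "binomial")

model  # print coefficients and training metrics
```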

Data for doctors: Big data meets a big business

Forget the division between structured and unstructured data. For the benefits of the big data era to reach businesses’ bottom lines or to change behaviors, companies will have to figure out how to bring the results of Hadoop analytics to HR and middle managers.

How OkCupid Demystifies Dating With Big Data

The interesting story behind OkCupid, the online dating site recently acquired by Match.com, is OkTrends, its blog that analyzes the site’s wealth of data to shed light on our love lives. But the interesting story behind OkTrends is its use of R to power those analytics.