Researchers and businesspeople around the world will soon have at their disposal a new way to quickly and easily perform massive computations over large quantities of unstructured data. The reason: a Microsoft computing tool under development called Dryad. Dryad and the associated programming model called DryadLINQ simplify the process of running complex data-analysis applications across hundreds, or even thousands, of machines running Windows HPC Server.
Originated by Microsoft Research, Dryad, DryadLINQ and the Distributed Storage Catalog (DSC) are currently available as community technology previews. DryadLINQ allows programmers to use Microsoft technologies such as Microsoft .NET and LINQ to express their algorithms. Dryad executes the algorithms across large quantities of unstructured data distributed across clusters of commodity computers. DSC provides the bottom layer to the stack. It is a storage system that ties together a large number of commodity machines to store very large (i.e., Bing-level) quantities of data.
These are commercial versions of the same technology used by the Bing search engine for large, unstructured data analysis. Together, they will enable a new class of data-intensive applications by providing a sophisticated, distributed runtime and associated programming model that will allow organizations of all types to use commodity clusters for analysis of large volumes of unstructured data.
Microsoft’s approach to big data is to provide an end-to-end solution that spans the entire process of data capture, loading, analysis, reporting and visualization. Key to this are interoperability with commonly used Microsoft IT infrastructure, such as Active Directory and System Center, and the ability to run both HPC and big data applications on the same clusters and nodes.
An obvious question is how Dryad differs from Hadoop, an open source implementation of the MapReduce system developed at Google. Chief among the differences are the following:
- A significant part of Dryad and Hadoop is the management and administration of large clusters on which applications run. This typically includes a range of activities such as setup, provisioning, deployment, resource management, performance tuning, monitoring and troubleshooting. While Hadoop has chosen to build these capabilities from scratch, Dryad has chosen to leverage the proven and tested cluster management capabilities already present in Windows HPC Server. Windows HPC Server has been deployed in production on clusters of varying sizes (tens to thousands of nodes) for several years and has a comprehensive set of end-to-end cluster management capabilities.
- Hadoop, given its roots in Web 2.0 companies with their large data volumes and tech-savvy development teams, has focused on performance and scale. Dryad, building on the performance and scale of Windows HPC Server, has in addition focused on making big data easier to use for mainstream application developers. DryadLINQ is based on the Language Integrated Query (LINQ) technology familiar to thousands of Structured Query Language (SQL) developers. Further, it is integrated with Visual Studio, which is also one of the most widely used development tools.
- Dryad and DSC are based on the widely used and mature NTFS (New Technology File System), the file system that comes standard with Windows Server. This use of NTFS is one of the technical innovations in the overall DryadLINQ, Dryad and DSC system: where Hadoop depends on its own file system, the Hadoop Distributed File System (HDFS), DryadLINQ, Dryad and DSC get the same scalability without hiding the underlying file system. Users can therefore access their data directly as NTFS files and benefit from the existing work that has gone into performance tuning NTFS. Further, no special tools are required for loading/unloading and managing the data in DSC.
- Hadoop uses the MapReduce computational model, which provides support for expressing the application logic in two simple steps: map and reduce. However, to develop more complex applications, developers have to manually string together a sequence of MapReduce steps. DryadLINQ offers a higher-level computational model in which a complex sequence of MapReduce steps can be easily expressed in a query language similar to SQL. DryadLINQ, in conjunction with Dryad, will execute the query with the optimal number of steps, taking into consideration factors such as the amount of data, the size of the cluster and how the data is distributed across the cluster.
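To make the last point concrete, here is a minimal single-machine sketch in Python of the difference between chaining MapReduce steps by hand and stating the same computation as one query-style expression. It is an illustration only: it does not use the DryadLINQ or Hadoop APIs, the `map_reduce` helper and the sample `docs` data are hypothetical, and nothing here is distributed or optimized the way Dryad would plan a real job.

```python
from collections import Counter, defaultdict

# Hypothetical sample input: each string stands in for one record of unstructured text.
docs = ["big data on big clusters", "data analysis on clusters", "big data"]


def map_reduce(records, mapper, reducer):
    """Run one MapReduce step in memory (illustrative helper, not a real framework API).

    mapper(record) yields (key, value) pairs; values are grouped by key,
    then reducer(key, values) produces one output record per key.
    """
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in groups.items()]


# Style 1: the developer strings two MapReduce steps together by hand.
# Step 1 counts how often each word occurs.
word_counts = map_reduce(
    docs,
    mapper=lambda line: [(word, 1) for word in line.split()],
    reducer=lambda word, ones: (word, sum(ones)),
)
# Step 2 regroups the output of step 1 by frequency.
chained = dict(
    map_reduce(
        word_counts,
        mapper=lambda pair: [(pair[1], pair[0])],
        reducer=lambda count, words: (count, sorted(words)),
    )
)

# Style 2: the same result stated as one higher-level expression,
# leaving the individual steps implicit.
frequencies = Counter(word for line in docs for word in line.split())
declarative = {}
for word, count in sorted(frequencies.items()):
    declarative.setdefault(count, []).append(word)

print(chained == declarative)
```

Both styles group the words by how often they appear. In the first, the developer decides how many steps there are and how they connect; in the second, that decision is left to whatever executes the expression, which is the role the article attributes to DryadLINQ and Dryad.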
DryadLINQ, Dryad and DSC are all coming to market as part of the next version of Microsoft’s Windows HPC Server product. Microsoft chose this integration point because of the existing goal of the HPC Server product to bring high-performance computing to a broader community of customers via Windows-style ease of use. Historically, cluster computing focused on compute-intensive workloads, but with DryadLINQ, Dryad and DSC, the Windows HPC Server product is bringing its emphasis on democratization and ease of use to large-scale unstructured data analytics.
The DryadLINQ, Dryad and DSC technologies are available as a CTP download today from http://blogs.technet.com/b/windowshpc/archive/2010/12/17/dryad-beta-program-starting.aspx.
As Dryad proves, Hadoop isn’t the only tool around for analyzing big data, and we’ll be discussing a large number of them at our Structure Big Data conference, which takes place next week, March 22, in New York City.
Madhu Reddy is senior product manager for Technical Computing marketing at Microsoft.