The curious case of Hadoop in HPC

SGI and Cloudera have entered into an agreement that lets SGI sell clusters preloaded with Cloudera’s Hadoop distribution and commercial support, but the most interesting part of the deal is that it presently appears focused on general-purpose deployments rather than on the high-performance-computing deployments that are SGI’s bread and butter. That would be surprising if it weren’t for an apparent trend of vendors pushing Hadoop alternatives for HPC workloads, the theory being that HPC users are more willing to put performance ahead of community.

Timothy Prickett Morgan at The Register provides a good writeup of the Cloudera-SGI news. Among his observations is that “SGI plans to focus mostly on peddling its energy-efficient Rackable machines for Hadoop clusters, with the Altix ICE clusters, which are designed for HPC supercomputing workloads, taking a backseat.” That seems like a strange decision considering that Cloudera already has a very similar reseller deal with Dell, and SGI’s main point of product differentiation is its Altix lineup, not its scale-out Rackable lineup. And, as Morgan also explains, SGI has already been building massive Hadoop clusters that incorporate Altix servers for some of its large customers.

However, those custom-built deployments put SGI among a seemingly small number of companies actually pushing Hadoop for HPC. Probably the best example of the opposite trend, toward Hadoop alternatives for HPC, is Microsoft, which last week announced plans to offer Hadoop distributions for Windows Server and Azure despite the existence of its Dryad framework. In a March post, Microsoft’s Madhu Reddy described Dryad as “enabl[ing] a new class of data-intensive applications by providing a sophisticated, distributed runtime and associated programming model that will allow organizations of all types to use commodity clusters for analysis of large volumes of unstructured data.”

After the Microsoft-Hadoop news broke, Reddy e-mailed me an update on Dryad in which he explained its new name, “LINQ To HPC,” and its HPC focus:

Because L2H is integrated with Windows HPC Server, it is optimized for analysis of Big Data in HPC scenarios (i.e., scenarios where large amounts of data are required as an input to—or output from—HPC applications and the data has to be analyzed/visualized). We are targeting both on-premises and on Windows Azure HPC Big Data scenarios.

IBM appears to be taking a non-Hadoop-focused approach to HPC, too, with its acquisition of Platform Computing last week. As I explained at the time, Platform made its name in high-performance computing within large banks and is now taking that prowess into big data with a MapReduce management product. While Platform MapReduce can support both Hadoop MapReduce and the Hadoop Distributed File System, it also supports numerous other frameworks at both the computing and storage layers. If those offer better performance than their Hadoop counterparts, it’s easy to believe that IBM’s HPC customers won’t be going with Hadoop.

There’s also LexisNexis offshoot HPCC Systems, which thinks it has a chance to do big business selling its Hadoop-alternative processing system, High-Performance Computing Cluster, to performance-sensitive customers. The software, which was designed to process huge amounts of data for intelligence and other big-time customers, is already suited to those types of workloads, CTO Armando Escalante explained to me recently, so his company’s real challenge is positioning itself as a Hadoop alternative among more-traditional web developers.

The willingness to push Hadoop alternatives (or, in SGI’s case, to do custom rather than prepackaged Hadoop builds) for HPC workloads is likely rooted in the HPC space’s history of working with its own tools, each tailored to an application’s specific performance requirements. Whereas mainstream users care about the thriving Hadoop community because it means better products and a continued stream of innovation and support, HPC users typically care about what works best. If LINQ to HPC, HPCC or any of the frameworks that IBM’s new IP supports offer a faster, better experience than Hadoop, then perhaps they’ll find a loyal contingent of users lurking within the world’s research labs and high-performance data centers.
