Why Big Data Will Need Big Gear

Hardware is often treated as a second-class citizen in discussions about big data, while software innovations such as Hadoop and the latest predictive analytics algorithms reign supreme. That won’t be the case forever.
Wednesday, GigaOM Pro published a report I wrote about the fast-growing ecosystem of Hadoop vendors, projects and users. The underlying theme of that report, and right now for big data in general, is software: the products, algorithms and languages that help companies process, analyze and act upon their mountains of information. In my experience, hardware rarely comes up in discussions about big data, and when it does, it’s generally centered on data warehouse appliances. But the omission hardly means hardware is irrelevant. In fact, big gear might become a big deal as companies look to bolster the performance of their big data systems.
Perhaps nowhere is this clearer than at the network level. As Sun Microsystems (s orcl) Co-Founder Andy Bechtolsheim pointed out in a fireside chat at Structure: Big Data, the network has become the bottleneck in many systems as server hardware has continued to become faster and more powerful. This shouldn’t be a surprising sentiment, considering that Bechtolsheim’s current company, Arista Networks, sells high-performance networking gear (including a new pair of low-latency switches just announced this week), but it’s also true. It’s why, for example, mobile-app-analytics startup Flurry recently upgraded its network infrastructure with Arista gear to handle skyrocketing network traffic, including across its growing Hadoop cluster.
When you’re moving terabytes of data across hundreds or thousands of nodes, or from one environment to another (e.g., from a Hadoop cluster to an analytic database), you can’t afford to have it lagging across the network for too long. If you’re looking to process and analyze in real time, latency needs to be as close to zero as possible.
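To put rough numbers on that, here is a back-of-envelope sketch in Python. It is an illustration only: the function name, the 5TB data size and the 1Gbps and 10Gbps link speeds are assumptions chosen for the example, not figures from any company mentioned here. It also assumes a single, fully utilized link and ignores latency, protocol overhead and contention, so real transfers would take longer.

```python
def transfer_seconds(data_terabytes: float, link_gbps: float) -> float:
    """Idealized time to push data over one link at full line rate.

    Hypothetical helper for illustration: uses decimal units
    (1 TB = 1e12 bytes, 1 Gbps = 1e9 bits per second) and ignores
    latency, protocol overhead and competing traffic.
    """
    bits_to_move = data_terabytes * 1e12 * 8
    bits_per_second = link_gbps * 1e9
    return bits_to_move / bits_per_second


if __name__ == "__main__":
    # Assumed example values: 5 TB moved over a 1 Gbps and a 10 Gbps link.
    for gbps in (1, 10):
        hours = transfer_seconds(5, gbps) / 3600
        print(f"5 TB over a {gbps} Gbps link: about {hours:.1f} hours")
```

Even in this best case, the slower link turns a bulk transfer into an overnight job, which is the kind of gap that makes the network the bottleneck.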
Storage can be critical for better big data performance, too, as Cisco (s csco) highlighted Wednesday with its new C260 M2 servers designed for OLTP and data warehouse applications. The new box can house up to a terabyte of memory and 16 solid-state or hard-disk drives (up to 9.6TB in total). Those are important factors in its target use cases, which must not only store large amounts of data, but also access it in a hurry to answer time-sensitive queries or feed applications. This is why large systems vendors such as Oracle (s orcl), IBM (s ibm) and EMC (s emc) sell massive appliances armed to the teeth with storage and memory capacity and running either transactional or analytic database software.
But Cisco’s server is unusual in that it’s, well, a server, as opposed to a full-on appliance. It doesn’t seem like too big a stretch to think a company that wants to seriously improve the performance of its Hadoop cluster would move to the C260 M2 to get the performance gains of so much memory and SSD capacity, while still being able to add to the cluster one rackmount server at a time. Further, a cluster with fewer nodes means even less latency, because there are fewer network points for data to traverse as it crosses the system.
In their current states, many Hadoop clusters and other big data systems probably are just fine in terms of compute, network and storage performance. That’s why, as I explain in my report, and as I’ve reported here before, much work is underway to improve software performance for tools like Hadoop, while still focusing on less-expensive commodity gear. But that might change as companies start to rely more on analytics to make everyday business decisions.
Much as banks have spent untold millions on finely tuned software and hardware to carry out their risk analyses and high-frequency trading applications, more-mainstream companies also will start feeling the pressure to build analytic systems that deliver results in as close to real time as possible, even for batch processes. Certainly, Arista and Cisco aren’t alone in seeing this trend, and I suspect we’ll see a lot more network and server vendors targeting big data applications beyond OLTP and analytic databases as the trend gains momentum.
Image courtesy of Flickr user jpctalbot.