CloudSigma goes all-SSD to boost HPC performance in the public cloud

Public clouds offer lots of flexibility, but not necessarily the sort of performance you need for handling big data. The Zurich-based provider CloudSigma has felt this pinch more than most, as it is a supplier to Europe’s performance-hungry science cloud, Helix Nebula, and now it says it has found the solution: going all-SSD. Well, that and rolling its own stack.
CloudSigma, which operates out of both Switzerland (Zurich) and the U.S. (Las Vegas), was one of a handful of infrastructure-as-a-service (IaaS) providers that signed up last November for SolidFire’s all-SSD storage system. The result is now here: CloudSigma has ditched all its hard-disk drives and now feels confident enough to offer a service-level agreement (SLA) for performance as well as uptime.
What’s more, despite the fact that solid-state storage costs about eight times as much as hard-disk storage, CloudSigma hasn’t changed its pricing – its SSD-based utility service costs $0.14 per GB per month, the same as the HDD-based service did. Customers can also buy the SSD storage service unbundled from CPU and RAM if they so choose.

HPC in the public cloud

According to CloudSigma COO Bernino Lind, the shift to SSD is a major help when it comes to handling high-performance computing (HPC) workloads, such as those of Helix Nebula users CERN, the European Space Agency (ESA) and the European Molecular Biology Laboratory (EMBL):

“They want to go to opex instead of capex, but the problem is there is no-one really who does public infrastructure-as-a-service which works well enough for HPC. There is contention — variable performance on compute power and, even worse, really variable performance on IOPS [Input/Output Operations Per Second]. When you have a lot of I/O operations, then you get all over the spectrum from having a couple of hundred to having 1,000 and it just goes up and down. It means that, once you run a large big data setup, you get iowaits and your entire stack normally just stops and waits.”

Lind pointed out that, while aggregated spinning-disk setups will only allow up to 10,000 IOPS, a single SSD will allow anywhere from 100,000 to 1.5 million IOPS. That headroom mitigates the I/O contention problem. “There should be a law that public IaaS shouldn’t run on magnetic disks,” he said. “The customer buys something that works sometimes and doesn’t work other times – it shouldn’t be possible to sell something that has that as a quality.”
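To put those IOPS figures in perspective, here is a rough back-of-the-envelope sketch in Python. The job size of 10 million I/O operations is a hypothetical number chosen for illustration, and 200 IOPS stands in for Lind's "a couple of hundred"; the other rates are the ones quoted above. Real workloads depend on block size, queue depth and caching, so treat this as arithmetic rather than a benchmark.

```python
def seconds_for_ops(total_ops: int, iops: float) -> float:
    """Time needed to complete total_ops I/O operations at a sustained IOPS rate."""
    if iops <= 0:
        raise ValueError("iops must be positive")
    return total_ops / iops


# Hypothetical job size, not a figure from the article.
TOTAL_OPS = 10_000_000

# IOPS rates quoted by Lind; 200 is an assumed value for "a couple of hundred".
SCENARIOS = [
    ("contended disk, bad moment (200 IOPS)", 200),
    ("contended disk, good moment (1,000 IOPS)", 1_000),
    ("aggregated spinning disks, ceiling (10,000 IOPS)", 10_000),
    ("single SSD, low end (100,000 IOPS)", 100_000),
]

if __name__ == "__main__":
    for label, iops in SCENARIOS:
        print(f"{label}: {seconds_for_ops(TOTAL_OPS, iops):,.0f} s")
```

Under these assumptions the same job takes 50,000 seconds at 200 IOPS and 100 seconds at 100,000 IOPS, which is the kind of gap that leaves a stack sitting in iowait.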
CloudSigma has also resolved another contention point around RAM, Lind claimed:

“A modern CPU can ask for a lot of data because it’s fast and efficient, so it is possible to saturate and make contention on your memory bus. That has been solved with NUMA topology, which is like a multiplexer to get access to memory banks. You get asynchronous access, which means you don’t have contention on accessing the RAM.
“However, public cloud service providers turn this off so the actual instance doesn’t have access to NUMA. We figured out a way to pass on the NUMA topology so, when you run really extensive compute jobs, you won’t hit a kind of contention when you want access to RAM. This is really important for big data workloads.”
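For readers who want to see whether a given cloud instance exposes any NUMA topology, the following is a minimal sketch that assumes a Linux guest, where the kernel lists NUMA nodes under `/sys/devices/system/node`. It says nothing about how CloudSigma passes the topology through; the function names `parse_cpulist` and `numa_nodes` are illustrative only.

```python
from pathlib import Path


def parse_cpulist(text: str) -> list[int]:
    """Expand a Linux cpulist string such as '0-3,8-11' into CPU numbers."""
    cpus: list[int] = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty list, e.g. a memory-only node
        if "-" in part:
            start, end = part.split("-")
            cpus.extend(range(int(start), int(end) + 1))
        else:
            cpus.append(int(part))
    return cpus


def numa_nodes(sysfs_root: str = "/sys/devices/system/node") -> dict[int, list[int]]:
    """Map each NUMA node ID visible to this machine to the CPUs attached to it."""
    nodes: dict[int, list[int]] = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return nodes  # not Linux, or NUMA information is not exposed
    for node_dir in root.glob("node[0-9]*"):
        cpulist = node_dir / "cpulist"
        if cpulist.is_file():
            node_id = int(node_dir.name[len("node"):])
            nodes[node_id] = parse_cpulist(cpulist.read_text())
    return nodes


if __name__ == "__main__":
    topology = numa_nodes()
    if len(topology) > 1:
        for node_id, cpus in sorted(topology.items()):
            print(f"NUMA node {node_id}: CPUs {cpus}")
    else:
        print("One NUMA node or none visible; the guest sees memory as flat.")
```

A guest that reports a single node is not necessarily running on single-socket hardware; it may simply mean the provider is not passing the host topology on.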

In-house stack

Speaking of things that public cloud providers tend to turn off, CloudSigma’s stack, which was written entirely in-house apart from the underlying KVM hypervisor, makes it possible to access all the instruction-set goodies that are built into modern processors, such as the AES encryption instructions.
Public clouds may run on a variety of physical hosts that encompass a range of CPU generations, only some of which will have certain instruction sets implemented in silicon. Providers will often turn off these instruction sets to make their platform homogeneous, but that means losing out on the performance benefits of hardware acceleration. According to Lind, CloudSigma’s stack allows a heterogeneous cloud based on allocation pools – say, one of older Intel chips and another of newer AMD Opteron 6380 chips – that customers can choose according to their performance needs.
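One way to check whether an instance actually exposes such instructions is to look at the CPU feature flags the guest kernel reports. The sketch below assumes a Linux guest with `/proc/cpuinfo`, where hardware AES support shows up as the `aes` flag; the helper names `cpu_flags` and `has_flag` are made up for this example.

```python
def cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the feature flags of the first CPU listed in /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # x86 kernels label the line "flags"; ARM kernels use "Features".
        if sep and key.strip() in ("flags", "Features"):
            return set(value.split())
    return set()


def has_flag(flag: str, cpuinfo_path: str = "/proc/cpuinfo") -> bool:
    """True if the guest kernel reports the given CPU feature flag."""
    try:
        with open(cpuinfo_path) as f:
            return flag in cpu_flags(f.read())
    except OSError:
        return False  # file missing, e.g. not running on Linux


if __name__ == "__main__":
    if has_flag("aes"):
        print("Hardware AES instructions are visible to this instance.")
    else:
        print("No 'aes' flag found; encryption may fall back to software.")
```

If the flag is missing on hardware that is known to support it, the likely explanation is the one given above: the provider has masked the feature to keep its hosts interchangeable.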
What does all this mean in practice? Lind cited the example of augmented-reality gaming outfit Ogmento, which recently used CloudSigma’s all-SSD setup to power a mobile, location-based version of a popular title. “They [said] all their I/O-heavy stuff, databases and so on, saw a x8-x12 performance increase,” he noted. “Their entire stack saw a x2-x4 performance increase. That means they need to use less compute power in order to run their system.”
With the budgetary constraints faced by European scientists these days, it’s not hard to see how that same kind of effect could make a real difference in more serious applications too.