How to deal with cloud failure: Live, learn, fix, repeat

Like it or not, sweeping software bugs are just part and parcel of operating the largest computing systems the world has ever seen. On Monday night, Amazon Web Services published a detailed post-mortem of its latest cloud outage, which struck on Friday night as massive thunderstorms knocked out power to one of the company’s East Coast data centers. However, issues with the data center’s backup generator were just a catalyst; the real damage came from a handful of latent software bugs that manifested as the system attempted to restore itself.
Although AWS is already working on fixes at all levels, this won’t be the last cloud computing outage we see, either from AWS or its competitors in the cloud provider space.
On Monday, I spoke with Geoff Arnold, an industry consultant and entrepreneur-in-residence at U.S. Venture Partners whose past includes a tenure as Distinguished Engineer at Sun Microsystems and stints building and managing distributed systems for Amazon, Huawei and, most recently, Yahoo. We spoke before AWS released the details of what caused the outage, but he suspected the real issue was more about bugs and less about a power failure. His take: “As we gain more experience [building globally distributed systems], we encounter failure domains that we haven’t hit before.”
By and large, that’s just the price of doing business in the cloud. It’s a constant cycle of living and learning from your mistakes.
The reality, he said, is that engineers know the various components of their systems are going to fail, and they design around the known fallibilities. But when you’re building some of the largest computing systems ever assembled, you’re bound to run into problems you haven’t planned for or didn’t even know existed. Bug testing against every possible problem across hundreds of thousands of servers and multiple data centers is neither easy nor, really, feasible.
Even Google, whose state-of-the-art web infrastructure lets it advertise no planned downtime for its cloud services and claim more than 99.9 percent uptime for Gmail, can’t test its massive systems to the nth degree. As Google Research Director Peter Norvig explained during a keynote in August 2011, there’s a point at which it becomes cost-prohibitive to keep testing processes, although that point varies by company. In April, Gmail suffered an outage that left up to 35 million users temporarily without access to their email.

Amazon’s first-mover disadvantage

As for AWS specifically, Arnold thinks it’s still among the most resilient cloud platforms around. To some degree, though, it’s paying the price for being the biggest and most advanced cloud available, and for having a stable of high-profile customers that make news when they go down. “[I]t’s certainly the case that Amazon does have some first-adopter disadvantages here,” Arnold said.

For example, he explained, while AWS has had to re-architect various pieces of its platform at relatively high levels of effort and expense, its smaller, often newer competitors are learning from its mistakes without having to live them firsthand. Additionally, he said, AWS is somewhat limited in its options for high availability because it tries to keep its prices low for basic services such as computing and storage. “You can throw dollars at the problem” and engineer around faults in more expensive ways, he said, but those costs will likely get passed on to customers.
And while the in-vogue thing for competitors to do when AWS crashes is to pile on with potshots, Arnold isn’t convinced that they, especially the open source alternatives, would fare any better operating at Amazon’s scale. “Frankly, none of them seems to be as robust, mature and well thought through as Amazon’s [cloud],” he said. “I think OpenStack is going to be much less stable than anything Amazon has produced.”
A big reason for this is the open source development model that accepts contributions from large ecosystems of developers and that must account for the needs of all the big names attaching their strategies to a given project. “In some sense,” Arnold said, “I think Amazon has an advantage over the open source alternatives because [it] only [has] to answer to one boss — and that’s Jeff Bezos.”

No end in sight

Given the relative youth of cloud-scale systems such as Amazon’s, the logical question is whether they’ll ever evolve to a stage where availability isn’t an issue. Arnold says the answer is “not likely,” but there is some help on the way thanks to software-defined networks.
One of the major problems with cloud platforms is that “they have large numbers of components interacting in ways that were not necessarily designed together,” he explained, making it very difficult to predict what will go wrong. For example, virtual servers are designed in isolation from storage systems, which are designed in isolation from load balancers, but in the end they’re all sewn together and running over the same network. Arnold thinks an SDN-style top-down network management approach, where everything is provisioned holistically, could help resolve this particular issue, but that’s still probably five years away.
When I asked whether there will be a time when we’ve figured out how to handle distributed systems of a particular size, Arnold replied, “I used to think so, but I’m getting more cynical now.”
One reason is that software engineers are always going to devise ingenious new ways to integrate systems, causing entirely new problems to arise. Another is that the increasingly distributed nature of systems-engineering teams, especially those developing open source code, makes it easier to introduce bugs into systems and harder to catch them (see, e.g., the leap second bug that made its way back into the Linux kernel earlier this year). “We’ll continue to surprise ourselves by introducing failures,” he said.