What Amazon and Its Customers Must Learn From Last Week’s Outage

A lot has already been written, here on GigaOM and elsewhere, about the fault that knocked out part of Amazon’s Northern Virginia data center last week. Opinion is divided: One side blames Amazon for its technical failings; the other holds the cloud giant’s customers responsible for their bad judgment. But the question of where the primary responsibility lies is a little less black and white. Let’s examine each side.

Today, Amazon released its initial assessment of what went wrong. An error in a change to the network configuration inside the Northern Virginia data center routed large volumes of primary network traffic onto the lower-capacity network reserved for Amazon’s Elastic Block Store (EBS). Under the unexpectedly high load, volumes within a single Availability Zone lost their connections to the network, to one another and, most importantly, to their redundant replicas. When connectivity was restored, all of the affected volumes simultaneously attempted to use the network to locate and resynchronize with their replicas. The resulting overload even affected EBS users in other Availability Zones, something that was not supposed to happen.

As Derrick Harris noted on Friday,

“If we think about the AWS network as a highway system, the effects of the outage were like those of a traffic accident. Not only did it result in a standstill on that road, but it also backed up traffic on the onramps and slowed down traffic on other roads as drivers looked for alternate routes. The accident is contained to one road, but the effects are felt on nearby and connected roads, too.”

In that respect, there doesn’t seem to be much room for doubt that Amazon is at least partly responsible, and that the failure should never have cascaded as far or as long as it did.

But we’re also seeing reports from Amazon customers who managed to operate relatively unscathed throughout the problem period. Customers such as Twilio, Bizo, Mashery and Engine Yard designed their systems with an understanding of both the power and the limitations of a commodity service like Amazon’s. As InfoWorld columnist David Linthicum notes, “You have to plan and create architectures that can work around the loss of major components to protect your own services.” Foursquare, Reddit, Quora and hundreds more suffered greatly because of failings in Amazon’s data center. Might their suffering have been lessened if they’d planned ahead in the way that Engine Yard or Twilio did?

Still, whatever mistakes Amazon’s customers may have made, and however much they pinched pennies to cut cloud services down to the cheapest, least fault-tolerant configuration they could get away with, the initial fault must lie with Amazon. Even poorly architected customer systems would not have been pushed to failure if Amazon’s underlying infrastructure had continued to perform as expected. Maybe some long-term good can come from the short-term pain and embarrassment.

For more thoughts on Amazon’s recent outage, and the lessons that Amazon and its customers must learn, see my latest Weekly Update for GigaOM Pro (subscription required).

Image courtesy of Flickr user hans.gerwitz