What Amazon and Its Customers Can Learn From Last Week’s Outage
Last week’s Amazon Web Services outage unleashed a torrent of speculation from technology pundits and the mainstream media, and opinion appears surprisingly divided as to where any blame should lie. Problems affecting part of one of AWS’ five global data centers began early on Thursday, and, thanks to a lack of detailed information about what was wrong or how it could be fixed, a small number of companies were still struggling days later as Amazon attempted to restore data from backups.
There doesn’t seem to be much room for doubt that Amazon is at least partly responsible. The failure should never have cascaded as far, or lasted as long, as it did. Amazon describes its Availability Zones, into which it divides each of its Regions (data centers), as “distinct locations that are engineered to be insulated from failures in other Availability Zones.” Yet this outage initially affected at least two of these zones. Information dissemination was poor, and normally vocal champions within the company went silent. At the time of writing, the Amazon Web Services Blog still doesn’t even acknowledge that there was ever an issue.
But while Amazon may be responsible for the initial failure, and for a lack of communication while it was being resolved, it’s also clear that a number of its customers had a far harder time than they needed to because of how their services were designed and operated. As Derrick Harris noted earlier this week, Twilio, SmugMug and Netflix were among the companies that emerged almost unscathed, and this wasn’t due to luck. It was the result of a philosophy of system design that understood both the power and the limitations of using a commodity service like Amazon’s. Cloud computing consultant Ben Kepes notes that “highlight has been made of the need to think beyond one zone, one data center, one region and one provider to build a robust and resilient service.” InfoWorld columnist David Linthicum agrees: “You have to plan and create architectures that can work around the loss of major components to protect your own services.” This is simply good practice in designing any IT system, which is why it is bizarre that so many companies abdicated that responsibility and left it all to Amazon. Foursquare, Reddit, Quora and hundreds more suffered greatly because of failings in Amazon’s data center. Might their suffering have been lessened if they had planned ahead as Netflix and Twilio did?
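To make the idea Kepes and Linthicum describe a little more concrete, here is a minimal sketch of the kind of failover logic involved: an application that knows about replicas of its service in more than one zone, region or provider, and routes around whichever one has gone dark. The endpoint URLs and function names are purely illustrative assumptions, not a description of how any of the companies mentioned actually built their systems.

```python
# Minimal "design for failure" sketch: probe replicas of a service in
# several regions/providers and use the first one that answers.
# All endpoints and names here are hypothetical.
import urllib.request

SERVICE_ENDPOINTS = [
    "https://us-east.example.com/health",   # primary region
    "https://us-west.example.com/health",   # replica in a second region
    "https://backup.example.org/health",    # replica with a second provider
]

def first_healthy_endpoint(endpoints, timeout=2):
    """Return the first endpoint that passes its health check, or None."""
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                if response.status == 200:
                    return url
        except OSError:
            continue  # this location is down or unreachable; try the next
    return None

endpoint = first_healthy_endpoint(SERVICE_ENDPOINTS)
if endpoint is None:
    raise RuntimeError("No replica is currently serving traffic")
```

The point of the sketch is not the few lines of code but the assumption behind them: that any single zone, region or provider can and will fail, so the application, not the infrastructure vendor, is responsible for having somewhere else to go.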
Amazon has said it is reviewing last week’s outage in order to understand what went wrong. But the company must also play a far more proactive role in teaching its customers how applications can cost-effectively take advantage of the cloud, and how to respond to outages, failures and other problems in the underlying infrastructure. Whatever mistakes Amazon’s customers may have made, and however much penny-pinching led them to cut cloud services down to the cheapest, least fault-tolerant configuration they could get away with, the initial fault must lie with Amazon. Poorly architected customer systems would not have been pushed to failure if Amazon’s underlying infrastructure had continued to perform as expected. Maybe some long-term good can come from short-term pain and embarrassment.