Amazon outages — lessons learned

Two Amazon (s amzn) outages over the past month certainly got everyone’s attention. One, in late June, was sparked by a violent thunderstorm that cut power, setting off a chain of events that put many Amazon customers offline for hours. That came just two weeks after another significant outage in the same U.S. East data center.
Newvem, a company that studies Amazon EC2 usage based on its customers’ Amazon deployments, said there were signs hours before the outage that provide helpful information on how companies can mitigate the impact of future snafus.
A look at Newvem customers, which run more than 15,000 Amazon EBS volumes in U.S. East, showed that 12 hours before the initial June 28 storm-induced outage, there was a 200 percent to 300 percent latency spike. That meant it took three times as long to load a page, according to Newvem CEO Zev Laderman. When the outage struck, more than 2,000 of those user volumes, or 15 percent of the total, became unavailable. From there, it took about 15 hours to get those users back to their usual capacity. Worse, about 7 percent of those 15,000 volumes may be gone for good, he said.
To be sure, Newvem, an Israeli startup, has a vested interest in publicizing this: It wants to sign up more customers, and showing that its service can predict problems and recommend remedies is good for business.
“One customer, a startup with a few hundred servers, put up a Web site that week … and they lost all their data because they didn’t back it up,” Laderman said.
No company should put itself in that position, and Amazon recommends actions to prevent such loss. But many Amazon customers are startups with little or no traditional IT experience, which leaves them susceptible to such issues. Cloud providers, just like non-cloud entities, rely on data centers, hardware, software and power, all of which can fail. The glory of the cloud is that workloads can be spread around to lessen risk, but even that is not foolproof, as we have learned.
Here are Newvem’s common sense recommendations to mitigate Amazon cloud risk:
1: Configure Amazon ELBs correctly. That means spreading these elastic load balancers not only across availability zones (AZs) within one data center but across geographies. Pumping data between geographies gets pricey, but it might be the cost of doing business. Some 20 percent of Newvem customers did not do this prior to the June 28 event.
2: Take snapshots of ALL volumes. This, too, is something Amazon already recommends. A service like Newvem’s will check to make sure all volumes have fresh snapshots on hand.
3: Distribute snapshots across regions, not just AZs. Again, that means spreading them across geographies. If this is done properly, the business can roll back to a recent snapshot as soon as service is restored, with minimal data loss or associated damage.
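Recommendations 2 and 3 lend themselves to an automated check. What follows is only a minimal sketch in Python, not Newvem’s method or an Amazon API: it assumes a hypothetical inventory of volumes and snapshots that has already been exported into plain records, and the 24-hour freshness window, the volume IDs and the `audit_snapshots` helper are all illustrative assumptions. It flags any volume that has no fresh snapshot, or whose fresh snapshots all sit in the volume’s home region.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Snapshot:
    """One snapshot record from a hypothetical, pre-exported inventory."""
    volume_id: str
    region: str
    taken_at: datetime


def audit_snapshots(volumes, snapshots, now, max_age=timedelta(hours=24)):
    """Return {volume_id: [problems]} for volumes that break rule 2 or 3.

    volumes maps each volume ID to its home region. The 24-hour default
    for max_age is an assumption, not a figure from Newvem or Amazon.
    """
    problems = {}
    for volume_id, home_region in volumes.items():
        # Keep only this volume's snapshots that are recent enough.
        fresh = [
            s for s in snapshots
            if s.volume_id == volume_id and now - s.taken_at <= max_age
        ]
        issues = []
        if not fresh:
            # Rule 2: every volume needs a fresh snapshot.
            issues.append("no fresh snapshot")
        elif all(s.region == home_region for s in fresh):
            # Rule 3: at least one fresh copy must live in another region.
            issues.append("no fresh snapshot outside home region")
        if issues:
            problems[volume_id] = issues
    return problems


# Illustrative data only: the volume IDs and timestamps are made up.
now = datetime(2012, 7, 1, 12, 0, tzinfo=timezone.utc)
volumes = {"vol-a": "us-east-1", "vol-b": "us-east-1", "vol-c": "us-east-1"}
snapshots = [
    Snapshot("vol-a", "us-east-1", now - timedelta(hours=2)),
    Snapshot("vol-a", "us-west-2", now - timedelta(hours=3)),
    Snapshot("vol-b", "us-east-1", now - timedelta(hours=1)),
    Snapshot("vol-c", "us-east-1", now - timedelta(days=3)),
]
report = audit_snapshots(volumes, snapshots, now)
for volume_id, issues in sorted(report.items()):
    print(volume_id, issues)
```

In this made-up inventory, vol-a passes because it has a recent copy in a second region, while vol-b and vol-c are flagged. A real check would also have to pull the inventory from the provider and decide how old is too old for each workload.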

The bigger, strategic, cloud question

While people get hysterical about Amazon outages because Amazon is the biggest and highest-profile public cloud, snafus like this happen often, just on a smaller scale. The big ones get covered, the little ones do not. That’s why cloud users need to bone up on corrective action. Laderman said that in addition to the early-warning latency spikes, Newvem customers also experienced “aftershocks” 24, 38 and 40 hours after the big event.
The bigger issue is that just as companies need to spread risk among AZs and regions, they will need to do the same thing with the clouds themselves. The pressure to evaluate multi-cloud deployments was ramping up already, but these latest failures add fuel to the fire. No software developer or IT person wants to be the one to explain to the CEO why the website or retail site is offline because of a single company’s cloud failure.
As more companies go to multi-cloud deployments, beneficiaries could include the OpenStack gang, among them Rackspace (s rax) and Hewlett-Packard (s hpq), as well as GoGrid, SoftLayer and other cloud purveyors. No one expects Amazon to stand still for that. Laderman expects the cloud giant will fight back by making it less expensive for companies to distribute their data and data snapshots across regions, in hopes of stemming defections.
 Photo courtesy of Flickr user theogeo