Reddit Goes Down, Blames Amazon, But Who’s Responsible in the End?

Updated: Content-sharing site Reddit suffered some major downtime yesterday, a situation it says was largely the result of a failure on the part of Amazon Web Services, and one that likely will compel the site to roll back its usage of the Elastic Block Store service. The specific problem was severe performance degradation for “a small subset” of EBS volumes in AWS’s US-EAST-1 region, which happened to affect the majority of the disks backing Reddit’s Postgres and Cassandra database servers. Rather than treating this as an example of why cloud computing is a bad idea, though, the real takeaway might be the importance of cloud users making sure all is well with their own deployments.

As Reddit systems administrator Jason Harvey notes in his detailed blog post on the outage, this isn’t the first time Reddit has experienced performance issues with EBS. In fact, those woes had already spurred the site to start moving its Cassandra data off EBS and onto local storage on its EC2 servers, and it is now considering doing the same for its Postgres servers. But aside from the issues with EBS, Harvey also acknowledges a couple of things Reddit could have done better:

  • “One mistake we made was using a single EBS disk to back some of our older master databases (the ones that hold links, accounts and comments). Fixing this issue has been on our todo list for quite a while, but will take time and require a scheduled site outage. This task has just moved to the top of our list.”
  • “When we started with Amazon, our code was written with the assumption that there would be one data center. We have been working towards fixing this since we moved two years ago. Unfortunately, progress has been slow in this area.”

As Harvey explains, there are approaches users can take to avoid this type of consequence, including spreading deployments across multiple Availability Zones or replicating storage across a greater number of disks. I’m not trying to absolve AWS of fault; it certainly deserves a fair amount in this instance, and in general if Reddit’s claims of chronically spotty EBS performance are true. But the reality is that only users really suffer from this type of outage. I’ve written before about one-sided cloud computing terms of service (I actually don’t think they’re that unfair), the result of which is that providers probably won’t owe customers anything more than service credits even for the most severe outages (a small token after a major issue), and they most certainly won’t owe more if customers didn’t architect their cloud infrastructure as optimally as possible to avoid such a situation. It’s not that AWS isn’t to blame, but rather that, save for a small reputational hit, Reddit’s pain is no skin off AWS’s back. (It might be worth noting that, according to a recent study by CloudHarmony, AWS actually exceeded its SLA for Amazon EC2 from late 2009 through late 2010, operating at 100 percent availability in its U.S. Availability Zones.)
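Replicating across a greater number of disks was something EBS users could do themselves at the time, typically by assembling several smaller EBS volumes into a software RAID array inside the instance. The commands below are a purely illustrative sketch, not Reddit’s actual configuration; they assume four EBS volumes have already been created and attached at the hypothetical device names /dev/sdf through /dev/sdi, and that the array will hold a Postgres data directory.

```shell
# Assemble four attached EBS volumes (hypothetical device names) into a
# RAID 10 array: data is mirrored, so a single failed or degraded volume
# doesn't take the database down, and I/O is striped across volumes.
mdadm --create /dev/md0 --level=10 --raid-devices=4 \
    /dev/sdf /dev/sdg /dev/sdh /dev/sdi

# Put a filesystem on the array and mount it where Postgres expects data.
mkfs.xfs /dev/md0
mkdir -p /var/lib/postgresql
mount /dev/md0 /var/lib/postgresql

# Persist the array definition so it reassembles on reboot.
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
```

Note that RAID within a single instance only guards against individual volume failures; it is no defense against a zone-wide EBS degradation like the one Reddit hit, which is why Harvey’s other mitigation, replicas in multiple Availability Zones, matters just as much.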

In another recent situation, online ticketing startup TicketLeap suffered a database failure in its attempt to carry out opening-day sales for Comic-Con International atop an AWS-hosted infrastructure. In that situation, however, TicketLeap bore the brunt of the blame, shared perhaps a little with some questionable MySQL code. But, as was pointed out by at least one commenter afterward, there are known ways to resolve the specific issue that TicketLeap experienced. TicketLeap, it appears, just might not have done its homework, and was saved from complete disaster by leveraging the cloud’s flexibility and scaling down its server count to a manageable level.

Both Reddit and TicketLeap would certainly acknowledge the benefits of using AWS or any other cloud provider, though. If not for the ability to rent resources, TicketLeap might have spent a small fortune scaling up for Comic-Con only to crash. Even Reddit’s Harvey wrote that:

Amazon’s Elastic Block Service is an extremely handy technology. It allows us to spin up volumes and attach them to any of our systems very quickly. It allows us to migrate data from one cluster to another very quickly. It is also considerably cheaper than getting a similar level of technology out of a SAN.

But the cloud isn’t perfect, even if it is far more reliable than many presume, which is why customers need to make sure their deployments are in order and architected as reliably as possible. Because when failures do happen, it’s the customers who end up paying the price.

Image courtesy of Flickr user Carolyn Coles.