Lessons from SlideShare: Cloud Computing Fiascos & How to Avoid Them

Cloud computing is a big deal for startups. Having essentially unlimited computing capacity available at the touch of a button opens up amazing new opportunities. The power to launch 1,000 servers on demand (and tear them down just as quickly) is indeed remarkable. But if comic books have taught us anything at all, it’s that with great power comes great responsibility!

My company, SlideShare, has been using cloud computing for almost everything we do, so we’ve made our share of blunders. Below are two of the more notable ones:

How to Lose $5000 Without Even Trying

Several months ago, we became fascinated with Hadoop. We organized a Hadoop Hackday at our office, and very quickly wrote some prototype code for calculating analytics data for SlideShare users.

Hadoop analytics is a perfect task for cloud computing. You need a bunch of computers, but you only need them once a day to crunch all the data. But as we started testing our prototype code with larger and more realistic data sets, it started taking longer and longer to complete a job.

At that point, I made the call to nearly quadruple the number of machines (from 20 to 75). This decision made sense on paper: if it’s going to take 100 computer hours to get a task done, then you might as well have 100 computers work for one hour and get the job done faster.
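The arithmetic behind that reasoning can be sketched in a few lines of Python. This is only an illustration: it assumes the work splits perfectly evenly across machines and that billing is a flat price per machine-hour, and the rate used below is a made-up number, not an actual Amazon price.

```python
def wall_clock_hours(machine_hours: float, machines: int) -> float:
    """Elapsed time if the work divides evenly across all machines."""
    if machines <= 0:
        raise ValueError("machines must be positive")
    return machine_hours / machines


def job_cost(machine_hours: float, hourly_rate: float) -> float:
    """Cost depends on total machine-hours, not on how many machines share them."""
    return machine_hours * hourly_rate


HYPOTHETICAL_RATE = 0.25  # dollars per machine-hour, invented for illustration

for machines in (20, 75, 100):
    hours = wall_clock_hours(100, machines)
    cost = job_cost(100, HYPOTHETICAL_RATE)
    print(f"{machines} machines: {hours:.2f} hours elapsed, ${cost:.2f} total")
```

Under those assumptions the bill is the same however many machines you use; only the waiting time changes. The catch is that the calculation holds only if the job actually finishes.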

A few hours after I made that decision, a major site outage occurred that distracted everyone on the engineering team. We worked through the night and the next day, and recovered from the (unrelated) crisis by Friday afternoon. We all took a well-deserved weekend off, and came in Monday morning to discover that the analytics job we’d started before the crisis was still running. Our buggy code was failing in a way we hadn’t anticipated, so throwing hardware at the problem hadn’t helped. Meanwhile, we’d run up a bill of $5000 with Amazon Web Services!

Lesson learned: if you’re going to really use the power of cloud computing, you need to constantly monitor spend and make sure it doesn’t get out of whack and break the budget, especially if you’re going to be scaling up and down dramatically. Unfortunately, Amazon Web Services doesn’t provide any alerting or charting tools that make it easy to keep track of spend; doing so is a cumbersome process of downloading CSV files, importing them into Excel, and analyzing the data. But it has to be done.
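Part of that manual process could be scripted. The following is a minimal sketch, not a description of any real billing export: the column names (date, service, cost_usd), the sample rows, and the budget figure are all assumptions made up for illustration, and a real usage report would need its own parsing.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows standing in for a downloaded usage report.
SAMPLE_CSV = (
    "date,service,cost_usd\n"
    "2000-01-01,compute,48.00\n"
    "2000-01-01,storage,12.50\n"
    "2000-01-02,compute,180.00\n"
    "2000-01-02,storage,12.75\n"
)


def daily_totals(csv_text: str) -> dict[str, float]:
    """Sum the cost column per day across all services."""
    totals: dict[str, float] = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["date"]] += float(row["cost_usd"])
    return dict(totals)


def days_over_budget(totals: dict[str, float], daily_budget: float) -> list[str]:
    """Return the days whose total spend exceeded the daily budget, in order."""
    return sorted(day for day, cost in totals.items() if cost > daily_budget)


if __name__ == "__main__":
    totals = daily_totals(SAMPLE_CSV)
    for day in days_over_budget(totals, daily_budget=100.0):
        print(f"ALERT: {day} cost ${totals[day]:.2f}, over budget")
```

Run daily, a check like this would flag a runaway job after one day rather than after a long weekend.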

Getting Sloppy With Storage

We recently noticed that our spend on storage (Amazon S3) was increasing dramatically. A few days of investigation revealed a general lack of discipline in how we were using storage. Files that could be deleted were being left in place; files for different purposes were being kept in the same directory; and there were some files whose origin we couldn’t identify, so we couldn’t tell whether they were still needed!

Amazon S3, or any cloud storage for that matter, can be thought of as a giant file system. There’s no over-arching control over what data goes where: it’s up to you to make sure you use the storage in a disciplined way. If only one person is writing the code, this is easy. However, once you have a team of people writing multiple programs, it’s easy to forget to delete something. You need to make sure you don’t waste storage, and the only way to do that is to be really specific about what data is saved where. A best practice is to put each type of resource in a separate “bucket” (Amazon’s name for a top-level directory), since that’s really the only way you can get accurate statistics about how much storage is being used for each type.
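Once each resource type has its own bucket, the accounting becomes a simple sum per bucket. Here is a minimal sketch of that bookkeeping; it assumes you already have a listing of (bucket, key, size in bytes) entries from whatever tool you use, and the bucket names and sizes below are invented for illustration.

```python
from collections import defaultdict

# Hypothetical listing: (bucket, key, size_in_bytes). Names are made up.
SAMPLE_LISTING = [
    ("example-thumbnails", "a.png", 2_000),
    ("example-uploads", "deck1.ppt", 500_000),
    ("example-thumbnails", "b.png", 3_000),
    ("example-analytics", "2000-01-01.log", 42_000),
]


def bytes_by_bucket(objects):
    """Total stored bytes per bucket, i.e. per resource type."""
    totals = defaultdict(int)
    for bucket, _key, size in objects:
        totals[bucket] += size
    return dict(totals)


if __name__ == "__main__":
    usage = bytes_by_bucket(SAMPLE_LISTING)
    # Print the largest buckets first so unexpected growth stands out.
    for bucket, total in sorted(usage.items(), key=lambda item: -item[1]):
        print(f"{bucket}: {total} bytes")
```

The point of the design is that the bucket name alone says what a file is for, so a bucket that grows unexpectedly points straight at the program responsible.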

The Spider-Man Principle

In both cases, we learned we weren’t being disciplined enough to handle the power cloud computing put at our fingertips. If we’d been on leased hardware, we would have hit hardware limits (running out of disk space). It would’ve been inconvenient, but it would have forced us to think about what we were doing, and make a conscious decision to spend more money. It’s great to have the super-power of cloud computing, but you need to be responsible if you want to use it!

Jonathan Boutelle is Co-Founder and CTO of SlideShare, a web site for presentations that relies heavily on cloud computing. Previously, Jonathan was a principal at Uzanto (a UI consulting firm) and worked as a software engineer at CommerceOne (a B2B enterprise software firm) and Advanced Visual Systems (a 3D graphics startup). You can find his presentations on cloud computing at slideshare.net/jboutelle, and his Twitter is @jboutelle. He also blogs at www.jonathanboutelle.com.

Image courtesy Flickr user adactio under creative commons.