STRUCTURE 08: Werner Vogels, Amazon CTO

“I’m the systems administrator for a small book shop in Seattle.”

“Remember this is just a very small snapshot at the beginning of a movement.”

Shows slideshow by Animoto of people at the Next Web conference in Europe (company analyzes music for mood changes and sets slideshow to that timing). What’s so special about Animoto? They have no servers. They have nothing. They have desktop apps, but they have no server infrastructure. But most of what they do is pretty intensive. They use Amazon S3 storage, Amazon EC2, SQS.

April 15/16 they launched a Facebook app. Friends would be notified of that little movie. At that point they had 25,000 customers. Then they started signing up 25,000 customers an hour. Imagine if you came up to a VC and said, “I kind of need money for 5,000 servers, I don’t know if we’re going to be popular or not.”

What this cloud computing does, one of the biggest lessons to take home today, is moving infrastructure from a capital expense to a variable cost model.

Why is Amazon in this business? Infrastructure quickly eclipsing e-commerce in size. Amazon has a history of transformation: used to be a technology consumer, now at Amazon almost no third-party technology left. At Amazon’s scale, you cannot put reliability in the hands of your providers. Used to be single application, now platform; used to be single seller, now way over a million merchants. So many technology changes over time.

Amazon was, ’til 2001, traditional web server based. Every year the hassle was how do we scale. We would get through the holiday season and in January the engineers would sit around and say how are we going to get through next year. Then Target came to Amazon, and at first we said no way, then we took the opportunity to work with Target to see how we could change Amazon from a single app to a whole platform.

So we moved to a service-oriented model, way before SOA and all those terms were popular. We hit the point where we needed to scale all those services. If there’s one way of looking at Amazon, it’s to look at it as one of the world’s largest distributed systems. So it is a challenge to keep these systems running really reliably at scale.

At one moment HP convinced us we really needed to go back to a mainframe, because that would really help us. That lasted one year. If you have all these pieces of software that need to work together they become a bottleneck in themselves. We took pieces from the business logic that sits on the top, and looked at the data that was interacting with that business logic, and put an API around it, so the only way you could interact with that data was through the business logic. And it became a platform for not only Amazon but our business partners as well.
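A minimal sketch of "put an API around the data", not taken from the talk. The `InventoryService` class, its method names, and the stock rule are all hypothetical; the only point is that callers reach the data through the business logic and never through the store itself.

```python
class InventoryService:
    """Hypothetical service that owns its data; callers never touch the store directly."""

    def __init__(self):
        self._stock = {}  # private store, standing in for the service's own database

    def add_stock(self, sku, quantity):
        if quantity <= 0:
            raise ValueError("quantity must be positive")
        self._stock[sku] = self._stock.get(sku, 0) + quantity

    def reserve(self, sku, quantity):
        """The business rule lives here: never reserve more than is available."""
        available = self._stock.get(sku, 0)
        if quantity <= 0 or quantity > available:
            return False
        self._stock[sku] = available - quantity
        return True

    def available(self, sku):
        return self._stock.get(sku, 0)


service = InventoryService()
service.add_stock("book-123", 5)
first = service.reserve("book-123", 3)   # accepted
second = service.reserve("book-123", 3)  # refused: only 2 left
```

Because every caller goes through `reserve`, a fix to the rule lands in one place for every consumer of the service.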

In general, if you hit a page at Amazon, that page will go out to somewhere between 250 and 350 services to get that data. And all those services have to run really, really reliably.
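A rough worked example of why "really, really reliably" matters at this fan-out. It assumes, purely for illustration, that a page needs every service to answer and that failures are independent; neither assumption comes from the talk, and the per-service availability figures are hypothetical.

```python
def page_availability(per_service_availability, num_services):
    """Probability that every one of num_services independent calls succeeds."""
    return per_service_availability ** num_services


# Hypothetical per-service availabilities applied to the fan-out range quoted above.
for n in (250, 350):
    for a in (0.999, 0.9999):
        result = page_availability(a, n)
        print(f"{n} services at {a:.2%} each: page succeeds {result:.1%} of the time")
```

Under these assumptions, 300 services at 99.9 percent each would leave the page working only about 74 percent of the time, which is why each individual service has to be far more reliable than the page as a whole.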

This is a one bug, one fix model. The only thing you have to do is fix it at that one place, and all of your customers will be exposed to that immediately. There is a massive elephant in the room. If you look at that one line develop – test – operate, there is a really big box missing there. From slide: Undifferentiated heavy lifting: hardware costs, software costs, maintenance, load balancing, scaling, utilization, etc.

Everyone becoming infrastructure experts. Before, 30 percent of time, energy and dollars spent on differentiated value creation. 70 percent of time, energy, and dollars on undifferentiated heavy lifting. And the way that engineers are, they love doing that stuff. Because it’s fun and it’s interesting. But each of these teams was doing that. Decided to be way more agile.

For some reason we put data centers in the same place we put trailer parks, and they attract tornadoes. (big laugh) One of the things that may surprise you is data centers fail.

At Amazon we should be able to survive complete loss of a data center without customers being affected.

Last soccer World Cup two companies launched at the same time, one didn’t have the right capacity. The thing breaks down, they don’t come back. Traffic of and, almost identical. The reason I put this slide up is to show you the spikes. About 2 to 3x the traffic you get through the normal year. What are you going to do for the rest of the year? Peak capacity management is a big issue.
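A small arithmetic sketch of the peak capacity problem, with made-up numbers: ten normal months followed by a spike of 2.5x and 3x. The load values are hypothetical and only the shape (a 2 to 3x peak) follows the talk.

```python
def average_utilization(monthly_load, provisioned_capacity):
    """Mean fraction of the provisioned capacity that is actually in use."""
    return sum(monthly_load) / (len(monthly_load) * provisioned_capacity)


# Hypothetical load in arbitrary units: ten normal months, then a holiday spike.
monthly_load = [100] * 10 + [250, 300]
peak = max(monthly_load)

# If you buy enough hardware for the peak, this is how much of it you use on average.
fixed_utilization = average_utilization(monthly_load, peak)
idle_fraction = 1 - fixed_utilization
print(f"Provisioned for peak ({peak}): {fixed_utilization:.0%} used, {idle_fraction:.0%} idle")
```

With these numbers, hardware sized for the spike sits idle for more than half of the year on average.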

Moving from a team focused on infrastructure to business logic — you move fast, you decentralize many of your servers, and then you go through a phase where you analyze what you’re doing. We needed the notion of infinite capacity.

So started virtualizing compute, messaging and storage. Important that you could get and release capacity instantly: EC2. Teams used to think that if they gave up their hardware they wouldn’t get it back, so people are very conservative around their resources. SQS connects different pieces together. Storage: S3, SimpleDB, EC2-PS (still needs product name). For us it made sense to develop S3 first because it covers a large percent of how we access data at Amazon. SimpleDB was sort of a complement, where you would store secondary keys, with the real objects in S3. About 20 percent of remaining data access was data attributes, then about 5 percent of data access left. For that we developed a persistent storage engine.
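A sketch of the "secondary keys in SimpleDB, real objects in S3" pattern described above. It uses in-memory stand-ins rather than the real services; `ObjectStore`, `AttributeIndex`, `save_photo`, and the photo data are all hypothetical names made up for illustration, not AWS APIs.

```python
class ObjectStore:
    """In-memory stand-in for a key/value blob store such as S3."""

    def __init__(self):
        self._blobs = {}

    def put(self, key, data):
        self._blobs[key] = data

    def get(self, key):
        return self._blobs[key]


class AttributeIndex:
    """In-memory stand-in for an attribute store such as SimpleDB."""

    def __init__(self):
        self._items = {}  # object key -> attribute dict

    def put_attributes(self, key, attributes):
        self._items[key] = dict(attributes)

    def query(self, name, value):
        """Return the object keys whose attribute `name` equals `value`."""
        return sorted(k for k, attrs in self._items.items() if attrs.get(name) == value)


store, index = ObjectStore(), AttributeIndex()


def save_photo(key, data, **attributes):
    store.put(key, data)                   # the real object goes to the blob store
    index.put_attributes(key, attributes)  # searchable attributes go to the index


save_photo("photos/1.jpg", b"...jpeg bytes...", owner="alice", album="seattle")
save_photo("photos/2.jpg", b"...jpeg bytes...", owner="bob", album="seattle")
save_photo("photos/3.jpg", b"...jpeg bytes...", owner="alice", album="berlin")

# Look up by a secondary key first, then fetch the objects by primary key.
alice_keys = index.query("owner", "alice")
alice_photos = [store.get(k) for k in alice_keys]
```

The blob store only ever sees primary-key gets and puts; anything that needs a lookup by attribute goes to the index first.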

Recommends “Getting Real” book by 37signals.

SmugMug: great interface for photos. The reason I am bringing them up is not necessarily the first step they took, where they looked at Amazon’s infrastructure as a replacement for what they normally do. At this moment 600 TB of pictures stored in Amazon S3. They are really adventurous in terms of seeing where they can take that company. They actually are taking sort of the next step. They are venturing into different businesses based on the availability of these services. SmugVault: store anything with their interface.

Amazon builds these services internally also as tools, not as a framework. Each team can decide to use whichever development tools they need. That means that the infrastructure services we provide need to be very generic, and teams need to be able to switch between them internally.

Two years from now, these will be the key drivers:

  • security
  • scalability
  • availability
  • performance
  • cost-effectiveness

All these things will need to become explicit for these web services.

Addressing uncertainty:

  • acquire resources on demand
  • release resources when no longer needed
  • pay for what you use
  • leverage others’ core competencies
  • turn fixed cost into variable

To be able to assign resources on demand is essential.
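The list above can be sketched as a toy billing comparison: resize every hour and pay per server-hour, versus keeping peak capacity running the whole time. Every number here (load, capacity per server, price) is hypothetical, and the function names are made up for illustration.

```python
import math


def servers_needed(requests_per_hour, capacity_per_server):
    """Smallest number of servers that covers the load (at least one)."""
    return max(1, math.ceil(requests_per_hour / capacity_per_server))


def on_demand_bill(hourly_load, capacity_per_server, price_per_server_hour):
    """Acquire and release servers every hour; pay only for server-hours used."""
    server_hours = sum(servers_needed(load, capacity_per_server) for load in hourly_load)
    return server_hours * price_per_server_hour


def fixed_bill(hourly_load, capacity_per_server, price_per_server_hour):
    """Keep enough servers for the busiest hour running the whole time."""
    peak_servers = servers_needed(max(hourly_load), capacity_per_server)
    return peak_servers * len(hourly_load) * price_per_server_hour


# Hypothetical numbers: quiet hours, a sudden spike, then quiet again.
hourly_load = [500, 500, 25_000, 25_000, 500, 500]
variable = on_demand_bill(hourly_load, capacity_per_server=1_000, price_per_server_hour=0.10)
fixed = fixed_bill(hourly_load, capacity_per_server=1_000, price_per_server_hour=0.10)
print(f"on demand: ${variable:.2f}, provisioned for peak: ${fixed:.2f}")
```

The gap between the two bills grows with how spiky the load is, which is the fixed-to-variable cost argument in miniature.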

Government services using Amazon, saving money.

I’ll be around all day, you can give me a dime, and I’ll predict the future for you. (laughter)

The only thing you need to access this wealth of service and storage is a credit card.