Google Infrastructure Czar: Cloud Gets It Done

It was nearly five years ago when I last spent time with Urs Hölzle, Google’s infrastructure czar. (His official title is SVP of operations.) It was around that time he introduced me (and several others) to many of the concepts (such as cloud and big data) that are now part of the technology sector’s vernacular. Hölzle was the company’s first VP of engineering, and he has led the development of Google’s technical infrastructure.
Hölzle’s current responsibilities include the design and operation of the servers, networks and data centers that power Google. It would be an understatement to say that he is amongst the folks who have shaped the modern web-infrastructure and cloud-related standards. When I had a chance to chat with him recently, my question was, “How do you define the cloud?”
I wanted to know, because frankly, the usage of “cloud” has been hijacked for marketing purposes, thanks to indiscriminate labeling of anything and everything on the Internet. Urs has a pretty clear and concise idea of what the cloud means to him (and Google) and what it’s good for.

Tiny Machines

For Hölzle, cloud-based computing comes into play when you have “very big computers that are basically buildings” or very small computers (such as smart phones) — and nothing in between. The small devices can get all the functionality from the big computer, which he has defined as the warehouse-scale computer.
“Here’s my Nexus S, and it can actually run many, [but] not quite all, Google apps, even though it has a tiny processor,” says Hölzle. “There may not always be photos on it, but all your photos are reachable from here. And maybe not all of your email is on it, but all your email is reachable from here.” It doesn’t matter whether the data is in Iowa, Oregon or halfway across the world from where you are.
As a person, you don’t have to worry about backups and viruses, Hölzle says. You don’t have to worry “about changing machines, reconfiguring your machine or worry about installing software.” On the cloud, there is no concept of scheduled downtime, because the cloud is supposed to work all the time. In other words, cloud-based computing helps “you just actually do the work that your company was founded for, instead of focusing on the technology behind the tools that you’re using.”

The Cloud Tone vs. Dial Tone

One of the big challenges of cloud-based computing is reliability and availability. Remember when millions of us were impacted by outages at Google’s Gmail and Skype’s voice service? There was a serious dip in productivity when those apps took a nosedive.
Perhaps, we should be asking for five nines of reliability from our cloud services? Five nines (99.999%) is a concept associated with services that have less than 5.26 minutes of downtime every year — like the dial tone on a landline phone. After all, that’s the only way to trust and rely on the cloud for all the important things we do and services we use.
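For readers who want to check the arithmetic behind that figure, here is a minimal sketch in Python. It assumes a 365-day year, and the function name is purely illustrative; nothing here comes from Google or from Hölzle.

```python
# Assumption: a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Maximum yearly downtime, in minutes, allowed at a given availability."""
    return MINUTES_PER_YEAR * (100.0 - availability_percent) / 100.0


# Compare three, four and five nines side by side.
for pct in (99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(pct)
    print(f"{pct}% -> {minutes:.2f} minutes of downtime per year")
```

Each extra nine cuts the allowed downtime by a factor of ten, which is why the step from four nines to five is so demanding.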
Hölzle believes that in the cloud-centric world, there needs to be a fine balance between convenience and reliability. He compares phone systems: Mobile phones aren’t as dependable as land lines, but they are more useful because they can be used in different locations for different types of needs. “Mobile phones are sort of overtaking landlines because they add additional functionality that is worth the annoyance,” he says. Instead of five-nines, Hölzle says that cloud-apps should aim for being always available.
The only way to get to zero outages is to try to not make any changes in your cloud app. “In our apps, we’re not actually shooting for five nines because that would lower the feature velocity,” says Hölzle. Put another way: When you make changes, problems happen. So to get to zero outages, one needs to essentially make no changes. The landline phone system didn’t have many changes, and as a result, it became a paragon of five nines.
“Whenever there is an update with new features, it introduces more risk, and that will cause more downtime, invariably, because humans are not perfect, and occasionally something is going to go wrong at a small scale, and hopefully, very, very rarely something may go wrong at a larger scale. But it does happen,” he adds.
Gmail’s most recent outage notwithstanding, it takes 30-odd Google employees to keep Gmail running, Urs says. However, there’s a much larger infrastructure team behind Gmail and other Google services. “If we had just email, then we would have to build all of that on top of building the actual email product,” he argues.
Others might disagree, but Hölzle believes Google’s common infrastructure gives it a technological and financial edge over on-premise solutions. “We’re able to avoid some of that fragmentation and build on a common infrastructure,” says Hölzle. “That’s actually one of the big advantages of the cloud.”