Is the cloud unstable and what can we do about it?

The recent major reboots of cloud-based infrastructure by Amazon and Rackspace have resurfaced the question of cloud instability. Days before the reboots, both Amazon and Rackspace noted that they were due to a vulnerability in Xen. Barb Darrow of Gigaom covered this in detail here. The notice came less than a week before the action took place, leaving many flat-footed.

Outages are not new

First, let us admit that outages (and reboots) are not unique to cloud-based infrastructure. Traditional corporate data centers face unplanned outages and regular system reboots. For Microsoft-based infrastructure, reboots may happen monthly due to security patch updates. Back in April 2011, I wrote a piece titled Amazon Outage Concerns are Overblown. Amazon had endured another outage of its Virginia data center that very day. In response, customers and observers took shots at Amazon. However, is Amazon’s outage really the problem? In the piece, I suggested that customers misunderstand the problem when they think about cloud-based infrastructure services.

Cloud expectations are misguided

As I noted in the 2011 piece, the expectations enterprise customers have of cloud-based infrastructure have not changed much. The expectation has been (and still is) that cloud-based infrastructure is just as resilient as the infrastructure within the corporate data center. The truth is very different. There are exceptions, but the majority of cloud-based infrastructure is not built for hardware resiliency. That’s by design. Service providers expect application/service resiliency to rest further up the stack when you move to cloud. That is very different from traditional application architectures found in the corporate data center, where infrastructure provides the resiliency.

Time to expect failure in the cloud

Like many of the web-scale applications using cloud-based infrastructure today, enterprise applications need to rethink their architecture. If the assumption is that infrastructure will fail, how will that impact architectural decisions? When leveraging cloud-based infrastructure services from Amazon or Rackspace, this paradigm plays out well. If you lose the infrastructure, the application keeps humming away. Take out a data center, and users are still not impacted. Are we there yet? Nowhere close. But that is the direction we must take.
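To make the idea of "expecting failure" concrete, here is a minimal sketch of resiliency handled in the application rather than the infrastructure. It is illustrative only: the names call_with_failover, failed_region and healthy_region are hypothetical stand-ins, not part of any Amazon or Rackspace API, and a real application would add timeouts, health checks and data replication.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when every replica in the list has failed."""


def call_with_failover(replicas: Sequence[Callable[[], str]]) -> str:
    """Try each replica in order and return the first successful result."""
    errors = []
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:
            # Assumption: any error is treated as an infrastructure failure,
            # so the application simply moves on to the next replica.
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} replicas failed")


# Hypothetical stand-ins for the same service running in two locations.
def failed_region() -> str:
    raise ConnectionError("region unavailable")


def healthy_region() -> str:
    return "ok from backup region"


# The first location is down, yet the caller still gets an answer.
result = call_with_failover([failed_region, healthy_region])
print(result)
```

The point of the sketch is where the decision lives: the application, not the hardware underneath it, decides what to do when a location disappears.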

Getting from here to there

Hypothetically, if an application were built with the expectation of infrastructure failure, the recent reboots would not have impacted delivery to the user. Going further, imagine if the application could withstand a full data center outage and/or a core intercontinental undersea fiber cut. If the expectation were for complete infrastructure failure, then the results would be quite different. Unfortunately, the reality is just not there…yet.

The vast majority of enterprise applications were never designed for cloud. Therefore, they need to be tweaked, re-architected or, worse, completely rewritten. There’s a real cost to do so! Just because an application could be moved to cloud does not mean the economics are there to support it. Each application needs to be evaluated individually.

Building the counterargument

Some may say that this whole argument is hogwash. So, let us take a look at the alternative. If one did build cloud-based infrastructure to be as resilient as its corporate brethren, the result would be, at a minimum, a very expensive venture. Infrastructure is expensive. Back in the 1970s, a company called Tandem Computers had a solution to this with its NonStop system. In the 1990s, the Tandem NonStop Himalaya class systems were all the rage…if you could afford them. NonStop was particularly interesting for financial services organizations that 1) could not afford the downtime and 2) had the money to afford the system. Tandem was later acquired by Compaq, which in turn was acquired by HP. NonStop is now owned by HP as part of its Integrity NonStop products. Aside from Tandem’s solutions, even with all of the infrastructure redundancy, many are still just a data center outage away from impacting an application. The bottom line is: It is impossible to build a 100% resilient infrastructure. That is true for two reasons: 1) it is cost prohibitive and 2) it becomes a statistical probability problem. For many, the value comes down to the statistical probability of an outage compared with the protections taken.
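The statistical side of that argument can be sketched with a little arithmetic. The example below assumes that redundant copies fail independently of one another, which real infrastructure rarely does (shared power, networks and software are common points of failure), and the 99% figure is purely illustrative rather than any provider's actual number.

```python
def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant systems, assuming independent failures.

    The service is down only when every copy is down at the same time.
    """
    return 1.0 - (1.0 - single) ** copies


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime in a 365-day year."""
    return (1.0 - availability) * 365 * 24


# Illustrative assumption: each copy is available 99% of the time.
for copies in (1, 2, 3):
    availability = combined_availability(0.99, copies)
    hours = downtime_hours_per_year(availability)
    print(f"{copies} copies: {availability:.6f} available, {hours:.4f} hours down per year")
```

Each extra copy reduces the expected downtime, but the result never reaches 100%, and every copy adds cost. That is the trade-off between the probability of an outage and the protections taken.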

Making the move

Over the past five years or so, companies have looked at the economics of building redundancy (and resiliency) at the infrastructure layer. The net result is a renewed focus on moving away from infrastructure resiliency and toward low-cost hardware. The thinking is: infrastructure is expensive and resiliency needs to move up the stack. The challenge is changing the paradigm of how developers of corporate applications handle application redundancy.
