Facebook: Downtime was caused by an internal boo-boo, not a hack

Facebook’s outage early on Tuesday, which also took out linked services such as Instagram and Tinder, was caused by a technical issue within the company itself rather than by external factors.

The outage affected users around the world. According to a technical note, the outage lasted about an hour, though individual users may have experienced it for up to 50 minutes, sources told me.

The Lizard Squad hacking group, which apparently hijacked the website of Malaysia Airlines over the weekend, claimed responsibility for Facebook’s downtime in a tweet. However, according to the company itself, that’s nonsense.

It said in a statement:

Earlier this evening many people had trouble accessing Facebook and Instagram. This was not the result of a third party attack but instead occurred after we introduced a change that affected our configuration systems. We moved quickly to fix the problem, and both services are back to 100 percent for everyone.

No data was compromised, the sources added.

Silent night: Facebook, Instagram go down for Snowmadgeddon

All those carefully crafted images of snowscapes will have to stay locked up for a bit, East Coasters, because Facebook and Instagram are down. (Update 11:12pm: Our brief Facebook-related national nightmare appears to be over, at least for me and Janko Roettgers on the West Coast.)

I’m not clear what any of you on the East Coast would be doing on Facebook at this hour, given how much liquor it appeared you bought as “provisions” during the day on Monday, but as of about 10:10pm PST Monday evening, Facebook appeared to be down for just about everybody. Facebook acknowledged something was wrong with an error message on its main site, and all the usual “is it down?” sites weighed in with the usual messages.


Stand firm, East Coasters: you can still distribute pictures of bros playing beer pong on 1st Avenue through your other social media channels. But this long cold winter night might just get a little longer should you be unable to document the snowfall in your neighborhoods for family members and those annoying friends from high school who live in California.

Speaking on behalf of those annoying friends from California, we’ve actually got the windows open in Oakland right now because it was a little unseasonably warm tonight. Stay strong, East Coast Facebookers: there’s always Ello (and maybe Twitter).

I’ll update this post with good jokes or more information until I get tired and have to go to bed.

10:50pm PT: Facebook and Instagram still appear to be down. As far as I can tell, civilization has not collapsed. But I did have to finally shut the windows in Oakland as the temperature dropped into the low 60s.

I liked this one:

10:55pm PT: Instagram fesses up:

10:57pm PT: We learn a lot about ourselves in these moments:

10:58pm PT: Is this a way bigger deal than we first realized?

11:02pm: It’s now very serious: Tinder is down:

11:06pm: Everything is still somehow down. But don’t worry, Yo is crushing it:

11:15pm: Facebook appears to be back. I have not investigated Tinder yet because I’m a married man.

11:52pm PT: Jay Parikh, vice president of engineering at Facebook, appears to have had a rough night:

Further update by David Meyer now that Tom has sensibly retired for the night: Turns out it was an internal mistake on Facebook’s part.

Is the cloud unstable and what can we do about it?

The recent major reboots of cloud-based infrastructure by Amazon and Rackspace have resurfaced the question of cloud instability. Days before the reboots, both Amazon and Rackspace noted that they were due to a vulnerability in Xen. Barb Darrow of Gigaom covered this in detail here. Notice came less than a week before the reboots took place, leaving many flat-footed.

Outages are not new

First, let us admit that outages (and reboots) are not unique to cloud-based infrastructure. Traditional corporate data centers face unplanned outages and regular system reboots. For Microsoft-based infrastructure, reboots may happen monthly due to security patch updates. Back in April 2011, I wrote a piece, “Amazon Outage Concerns are Overblown.” Amazon had just endured another outage of its Virginia data center that very day. In response, customers and observers took shots at Amazon. However, is Amazon’s outage really the problem? In the piece, I suggested that customers were misunderstanding the problem when thinking about cloud-based infrastructure services.

Cloud expectations are misguided

As with the piece back in 2011, the expectations of cloud-based infrastructure have not changed much for enterprise customers. The expectation has been (and still is) that cloud-based infrastructure is resilient just like that within the corporate data center. The truth is very different. There are exceptions, but the majority of cloud-based infrastructure is not built for hardware resiliency. That’s by design. The expectation by service providers is that application/service resiliency rests further up the stack when you move to cloud. That is very different from traditional application architectures found in the corporate data center, where infrastructure provides the resiliency.

Time to expect failure in the cloud

Like the builders of many web-scale applications using cloud-based infrastructure today, enterprises need to rethink their application architectures. If the assumption is that infrastructure will fail, how will that impact architectural decisions? When leveraging cloud-based infrastructure services from Amazon or Rackspace, this paradigm plays out well. If you lose the infrastructure, the application keeps humming away. Take out a data center, and users are still not impacted. Are we there yet? Nowhere close. But that is the direction we must take.
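The “expect failure” idea above can be sketched in a few lines of code. This is a minimal, hypothetical illustration (the region names and callables are invented for the example, not part of any real API): the application tries one region, assumes it may be down, and falls through to the next rather than treating the infrastructure as always available.

```python
def fetch_with_failover(fetchers):
    """Try each region's fetch callable in turn; return the first success.

    `fetchers` is an ordered list of zero-argument callables, one per
    region/data center. Each either returns a result or raises on failure.
    """
    last_error = None
    for fetch in fetchers:
        try:
            return fetch()
        except Exception as err:
            # Assume this region has failed; fall through to the next one.
            last_error = err
    raise RuntimeError("all regions unavailable") from last_error


# Hypothetical usage: primary region is down, secondary answers.
def us_east():
    raise ConnectionError("region down")

def us_west():
    return "ok"

result = fetch_with_failover([us_east, us_west])  # returns "ok"
```

The point is not the dozen lines themselves but where the logic lives: the retry/failover decision sits in the application, not in redundant hardware underneath it.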

Getting from here to there

Hypothetically, if an application were built with the expectation of infrastructure failure, the recent failures would not have impacted the delivery to the user. Going further, imagine if the application could withstand a full data center outage and/or a core intercontinental undersea fiber cut. If the expectation were for complete infrastructure failure, then the results would be quite different. Unfortunately, the reality is just not there…yet.

The vast majority of enterprise applications were never designed for cloud. Therefore, they need to be tweaked, re-architected or worse, completely rewritten. There’s a real cost to do so! Just because the application could be moved to cloud does not mean the economics are there to support it. Each application needs to be evaluated individually.

Building the counterargument

Some may say that this whole argument is hogwash. So, let us take a look at the alternative. Building cloud-based infrastructure to be as resilient as its corporate brethren would, at a minimum, be a very expensive venture. Infrastructure is expensive. Back in the 1970s, a company called Tandem Computers addressed this with its NonStop system. In the 1990s, the Tandem NonStop Himalayan-class systems were all the rage…if you could afford them. NonStop was particularly interesting for financial services organizations that 1) could not afford the downtime and 2) had the money to afford the system. Tandem was eventually acquired by Compaq, which in turn was acquired by HP; NonStop lives on today as part of HP’s Integrity NonStop products. Aside from Tandem’s solutions, even with all of the infrastructure redundancy, many are still just one data center outage away from impacting an application. The bottom line: it is impossible to build a 100 percent resilient infrastructure, both because it is cost prohibitive and because it becomes a statistical-probability problem. For many, the value comes down to weighing the statistical probability of an outage against the protections taken.

Making the move

Over the past five years or so, companies have looked at the economics to build redundancy (and resiliency) at the infrastructure layer. The net result is a renewed focus on moving away from infrastructure resiliency and toward low-cost hardware. The thinking is: infrastructure is expensive and resiliency needs to move up the stack. The challenge is changing the paradigm of how application redundancy is handled by developers of corporate applications.

The number of 9s doesn’t matter, but business metrics do

Technology organizations use percentage uptime as a key performance metric. Unfortunately, it is very technology focused at a time when business metrics are the norm. Which business metrics can IT focus on, and how can the CIO help lead the charge?
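For context on what those “9s” actually buy, the arithmetic is simple. A quick sketch (standard availability math, not from the article) converts an availability percentage into an annual downtime budget:

```python
def downtime_per_year(availability_pct):
    """Allowed downtime in minutes per year for a given availability percentage."""
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes in a non-leap year
    return minutes_per_year * (1 - availability_pct / 100)


# "Three nines" (99.9%) permits roughly 525.6 minutes of downtime a year,
# while "four nines" (99.99%) permits only about 52.6 minutes.
three_nines = downtime_per_year(99.9)
four_nines = downtime_per_year(99.99)
```

Each additional 9 cuts the downtime budget tenfold, which is precisely why chasing 9s gets expensive fast and why framing the discussion in business terms (revenue lost per minute, users affected) is often more useful than the raw percentage.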

5 strange but true concerns for keeping Google online

Google, which serves about 7 percent of the world’s overall web traffic, isn’t any ordinary company. Google Research Director Peter Norvig recently shared some of the considerations that Google takes into account when designing its infrastructure and systems to operate at Internet scale.

Use Downtime Projects to Recharge, Try New Work Habits

Many web workers are taking advantage of the holiday break to focus on personal projects that really spark their passions. Here are a few ideas to help you stay focused and motivated on personal projects, while enjoying the holidays at the same time.

Double Dutch: Netherlands KOs Brazil & Twitter

It’s all good to talk about the big-picture goals for Twitter but the company is still having problems keeping its service alive in the face of rising usage. Today, after the Netherlands upset Brazil in the World Cup, Twitter admitted to “a period of high unavailability.”

WordPress Outage Takes Us and 10.2M Blogs Out for 2 Hours

As we’re hosted on WordPress.com, we were affected by an outage of their network of blogs today that’s been attributed to a core router change. The company’s 10.2 million hosted blogs were down for 110 minutes, for a projected page view loss of 5.5 million.