Cloud restarts done, here come the postmortems from Amazon and Rackspace

With a couple of days' worth of cloud restarts in the rearview mirror, two of the biggest cloud providers are ready to talk about them, at least a bit.

To recap: last Wednesday night, Amazon started notifying customers that some of its instances would need a reboot, starting Friday. On Friday night, Rackspace followed suit, telling customers of a similar round of reboots to start Sunday. Details were scarce, but it was quickly established that an unspecified vulnerability in the Xen hypervisor was the issue. Both companies use versions of Xen in their public cloud infrastructure.

Rackspace CEO Taylor Rhodes apologized to Rackspace customers for the problems they had. Give him credit: he didn’t do the qualified sorry-if-we-inconvenienced-you thing. He outright apologized for “the downtime and inconvenience that you and others of our customers have suffered in recent days.”

Per his blog post:

This maintenance affected nearly a quarter of our 200,000-plus customers, and in the course of it, we dropped a few balls. Some of our reboots, for example, took much longer than they should. And some of our notifications were not as clear as they should have been. We are making changes to address those mistakes. And we welcome your feedback on how we can better serve you.

And Amazon Web Services evangelist Jeff Barr advised customers on how they can better weather similar issues in the future. It was the usual litany: Put instances in two or more Availability Zones. Keep an eye on your AWS console and make sure to list alternate contacts in case primary people are out. Use Trusted Advisor assessments. Open AWS Premium Support cases to get engineering assistance. Use Chaos Monkey to test various kinds of failures in a controlled (i.e., safe) environment. Oh, and use more AWS services, including Auto Scaling, to keep a set number of healthy instances running.

Reset button

Both AWS and Rackspace said they were bound by Xen security practice, which mandates that vendors report security issues to the committee first so members can work to patch problems. Per the Xen support site post:

If a vulnerability is not already public, we would like to notify significant distributors and operators of Xen so that they can prepare patched software in advance. This will help minimise the degree to which there are Xen users who are vulnerable but can’t get patches  …

Naturally, if a vulnerability is being exploited in the wild we will make immediately public release of the advisory and patch(es) and expect others to do likewise.

Amazon and Rackspace did their reboots starting late last week and into the weekend. IBM SoftLayer disclosed its reboot plans to customers and partners early Wednesday. IBM was added to the Xen Project security predisclosure list on September 29, well after the AWS and Rackspace reboots had become public.

This story was updated at 6:12 a.m. October 3 to reflect that IBM was added to the Xen list on September 29.