Two major cloud providers — Amazon Web Services and Rackspace — had to scramble this weekend to reboot a good chunk of their customers’ compute instances due to a mysterious but apparently critical Xen hypervisor issue.
As it became evident that Xen was the heart of the matter, cloud experts (and VMware marketing people) quickly weighed in to say that live migration capability, which allows virtual machines (VMs) to be moved between physical machines without being shut down, would have mitigated much if not all of this trauma. With live migration, the cloud provider could move the guest VM onto another host and then reboot the old host. But enabling that feature is easier said than done, or everyone would do it.
As of now, however, only VMware and Google offer live migration. Google built its live migration and related maintenance capabilities from the ground up. VMware can move VMs between servers running in vSphere environments and will add that capability to vCloud Air over time. It’s somewhat easier for VMware to do this because it controls both sides of the transaction. Amazon and Rackspace public clouds rely on a legacy (read: old) hypervisor in Xen and use older cloud orchestration technology, which means they have to do the requisite patching, said Gigaom Analyst MSV Janakiram.
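For readers who want to picture the move-then-reboot workflow described above, here is a minimal sketch in Python. It is a toy model, not any provider's actual tooling: the `Host` class, the `drain_and_reboot` function and the slot-based capacity check are all hypothetical names and assumptions made for illustration. It also leaves out the hard parts of real live migration, such as copying a running guest's memory and keeping its storage and network connections intact.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical physical machine that runs a fixed number of VMs."""
    name: str
    capacity: int                       # assumed: capacity counted in VM slots
    vms: list = field(default_factory=list)
    reboots: int = 0

    def free_slots(self) -> int:
        return self.capacity - len(self.vms)


def drain_and_reboot(host: Host, others: list) -> bool:
    """Move every VM off `host` onto the other hosts, then reboot `host`.

    Returns False and changes nothing if the other hosts do not have
    enough spare slots to absorb all of the VMs.
    """
    spare = sum(h.free_slots() for h in others)
    if spare < len(host.vms):
        return False
    for vm in list(host.vms):
        # Pick whichever host currently has the most free slots.
        target = max(others, key=Host.free_slots)
        host.vms.remove(vm)
        target.vms.append(vm)
    host.reboots += 1   # the host is empty, so no guest sees the reboot
    return True


# Example: two hosts, one of which needs a patch and a reboot.
host_a = Host("host-a", capacity=4, vms=["vm1", "vm2"])
host_b = Host("host-b", capacity=4, vms=["vm3"])
patched = drain_and_reboot(host_a, [host_b])
```

Even in this toy version, the drain only succeeds when the rest of the fleet has spare room, which hints at one reason the feature is harder to offer than it sounds.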
AWS could be working on live migration, perhaps for an AWS re:Invent reveal in November. But if it is, we wouldn’t expect the company to say so, and it did not. Asked about any plans for live migration, a spokeswoman said via email that the company continuously maintains its cloud to avoid problems but that, in any case, live migration per se is no silver bullet. She wrote:
Even in this case, we were able to do the vast majority of the maintenance without any customer impact. There will sometimes be cases where the specifics require that we do reboots regardless of the various maintenance capabilities we have — and we have many. There is no single capability, including live migration, that can guarantee zero customer impact. In this case, there was no way for us to avoid rebooting less than 10% of the fleet. Thus far, on the current maintenance activity, we have completed most of the instance reboots and have seen little customer impact.
(The emphasis is mine.)
Rackspace did not respond to requests for comment for this story, but I suspect we have not heard the last of this episode or of the need for more seamless workload transfers within a customer’s cloud of choice. And perhaps (dare we hope?) even between clouds from multiple vendors.