Google gets chatty about live migration while AWS stays mum

On Monday, Amazon wanted us to know that its staff worked day and night to avert planned reboots of cloud instances and updated a blog post to flag that information. But it didn’t provide any specifics on how these live updates were implemented.

Did Amazon use live migration, a process in which the guest OS is moved to a new, safe host? Or did it use hot patching, in which dynamic kernel updates are applied without screwing around with the underlying system?

Who knows? Because Amazon Web Services ain’t saying. Speculation is that it used live migration — even though AWS proponents last fall insisted that live migration per se would not have prevented the Xen-related reboots it launched at that time.

But where AWS remains quiet, Google, which wants to challenge AWS for public cloud workloads, was only too glad to blog about its live migration capabilities launched last year. Live migration, it claimed on Tuesday, prevented a meltdown during the Heartbleed vulnerability hullabaloo in April.

Google’s post is replete with charts and graphs and eight-by-ten glossies. Kidding about the last part but there are lots of diagrams.

A betting person might wager that Google is trying to tweak Amazon on this front by oversharing. You have to credit Google’s moxie here, and its aspirations for live migration remain large. Per the Google Cloud Platform blog:

The goal of live migration is to keep hardware and software updated across all our data centers without restarting customers’ VMs. Many of these maintenance events are disruptive. They require us to reboot the host machine, which, in the absence of transparent maintenance, would mean impacting customers’ VMs.

But Google still has a long row to hoe. Last fall, when Google started deprecating an older cloud data center zone in Europe and launched a new one, there was no evidence of live migration. Customers were told to make disk snapshots and use them to relaunch new VMs in the new zone.

As reported then, Google live migration moves working VMs between physical hosts within zones but not between zones. Google promised changes there too, starting in late January 2015, but there appears to be nothing new on that front as yet.

So let the cloud games continue.


Amazon hones its cloud update process

Remember that planned Xen-related reboot Amazon Web Services warned about last week? Well, things went better than planned, according to an updated blog post Monday.

The company said it was able to perform live updates on 99.9 percent of the affected instances, avoiding the need for a reboot altogether. Last Thursday, Amazon had said that it would need to reboot about 10 percent of total AWS instances to address a Xen security issue.
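To put those two percentages together, here is a quick back-of-the-envelope calculation. It treats "about 10 percent" as the affected share of the fleet and assumes the 99.9 percent figure applies evenly across it; the function name is purely illustrative.

```python
def reboot_share(affected_fraction, live_updated_fraction):
    """Fraction of the whole fleet that still needs a reboot."""
    return affected_fraction * (1.0 - live_updated_fraction)


# Figures from the AWS statements: ~10% affected, 99.9% of those live-updated.
share = reboot_share(0.10, 0.999)
print(f"{share:.4%} of all instances still need a reboot")
```

Under those assumptions, roughly one instance in 10,000 across the whole fleet would still have faced a reboot.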

The ability of AWS to perform updates without shutting down and bringing back up compute instances comes as very good news to cloud users. And that’s true whether the technology used was a live migration, hot patching or maybe something else. The net result was the same: workloads were not interrupted.

The Xen-related security issue also affected Rackspace, Linode and IBM SoftLayer, all of which said they’re doing their own fixes before March 10 when more information is released about the vulnerability.

Add IBM cloud to the list of reboots to come

The latest Xen hypervisor vulnerabilities are forcing IBM to reboot some customers’ cloud instances between now and March 10. The vendor sent out an alert to affected IBM SoftLayer customers on Friday, the same day Linode alerted its customers.

As reported, Amazon Web Services and Rackspace already posted news about the updates on Thursday night.

Per an IBM notice sent to customers, the company said it was “in the process of scheduling maintenance for patching and rebooting a portion of services that host portal-provisioned virtual server instances, virtual servers hosted on these servers will be offline during the patching and rebooting process.”

As with the other alerts, the maintenance will happen before March 10, when more details of the underlying Xen vulnerability will be disclosed. IBM promised more information when it becomes available and said it was working to minimize service disruptions.

Xen security issue prompts Amazon, Rackspace cloud reboots

Amazon Web Services and Rackspace are warning their customers of upcoming reboots they’re taking to address a new Xen hypervisor security issue.

In a premium support bulletin issued Thursday night, Amazon said fewer than 10 percent of all EC2 instances will require work but the affected instances must be updated by March 10. Rackspace also notified customers Thursday night of the issue, which will affect a portion of its First and Next Generation Cloud Servers. Later on Friday, Linode also warned users of an upcoming Xen-related reboot.

If you’re sensing a little bit of deja vu, it’s because the major cloud players were forced to reboot a bunch of their customers in September due to a Xen hypervisor issue, although the reason for the updates was not disclosed at first. Last time out, AWS also said 10 percent of its EC2 instances were affected.

Cloud vendors impacted by these security issues tread a tricky path. They have to address the vulnerability as fast as possible before the details of the flaw are made public, which can lead to a bit of a fire drill. In this case, more information about the flaw will be disclosed March 10.

In September, Amazon was first out of the chute with notifications, followed by Rackspace; IBM SoftLayer made its disclosures the following week.

Note: This story was updated at 3:49 p.m. PST to note that Linode is also performing system updates.

For cloud players, hot patching may be hotter than live migration

Late last year, the world got a good look at the challenges and pain associated with kernel maintenance required by cloud providers. A security vulnerability in the Xen hypervisor required immediate and unprecedented infrastructure updates and the following “great cloud reboot” impacted a huge swath of cloud providers — including portions of both Amazon Web Services and Rackspace — and the hundreds of thousands of customers running on that infrastructure.

Reaction was swift on Twitter and in blogs, suggesting these providers were poorly prepared for such an update. Critics suggested that they instead should have utilized a technology known as Live Migration to avoid having to reboot individual workloads on hypervisors.

Unfortunately, the operational reality is that live migration wouldn’t have saved the day. Live migration is an attractive feature — it appears to solve all kinds of administrative woes. But in a scenario where there’s a major security vulnerability like a hypervisor breakout, live migration can’t physically overcome the challenge of data gravity to avoid the system-wide reboots we experienced.

As we continue to move more workloads to cloud infrastructure, cloud operators need to find a solution. Fortunately, one already exists: hot patching. Let’s start with some definitions.

Live migration is when a virtual machine from one physical host is moved to another physical machine using virtual memory streaming, thus avoiding a reboot. Assuming there are no hiccups in that process, the end user should experience no downtime and, at worst, a slight pause in the workload as a result.
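As a rough illustration of why the pause is usually slight, here is a toy simulation of iterative "pre-copy" memory streaming. Pre-copy is one common approach and is an assumption here; the article does not say which scheme any provider uses. The function name, parameters and numbers are hypothetical.

```python
def precopy_rounds(total_pages, pages_per_sec, dirty_pages_per_sec,
                   stop_threshold, max_rounds=30):
    """Simulate pre-copy rounds; return (rounds, pages_left_for_final_pause).

    Each round streams the outstanding pages while the guest keeps running
    and dirtying memory. Once the outstanding set is small enough, the VM is
    paused briefly and the remainder is copied.
    """
    to_send = total_pages
    rounds = 0
    while to_send > stop_threshold and rounds < max_rounds:
        # Pages dirtied by the guest while this round was on the wire
        # (integer arithmetic: round duration = to_send / pages_per_sec).
        dirtied = to_send * dirty_pages_per_sec // pages_per_sec
        to_send = min(total_pages, dirtied)
        rounds += 1
    return rounds, to_send


# Hypothetical guest: 1M pages, link moves 100k pages/s, guest dirties 10k pages/s.
rounds, left = precopy_rounds(1_000_000, 100_000, 10_000, stop_threshold=1_000)
print(rounds, left)
```

In this sketch the outstanding set shrinks tenfold each round, so only a small remainder has to be copied while the VM is paused. If the guest dirties memory as fast as the link can carry it, the loop never converges and simply gives up after `max_rounds`, which is one of the "hiccups" the definition above alludes to.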

Kernel hot patching is the practice of applying dynamic kernel updates without rebooting the underlying system. Like live migration, this process shouldn’t impact the end user as it happens; however, because the patch is changing running code in the kernel, there’s a potential risk of system instability.
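A loose user-space analogy may help. The snippet below swaps the body of a running Python function in place, so existing references pick up the fix without a restart. It is only an analogy: real kernel hot patching works at a much lower level, and the function names here are made up for illustration.

```python
def check_length(data, limit):
    # "Vulnerable" version: an off-by-one lets one extra byte through.
    return len(data) <= limit + 1


def check_length_fixed(data, limit):
    # Corrected version carrying the patch.
    return len(data) <= limit


# A long-lived reference held elsewhere in the running program.
handler = check_length

before = handler(b"abcde", 4)   # the bug accepts 5 bytes against a limit of 4

# "Hot patch": replace the code object behind the existing function.
check_length.__code__ = check_length_fixed.__code__

after = handler(b"abcde", 4)    # the same reference now runs the fixed code
print(before, after)
```

The caller never re-imports or restarts anything, which is the appeal. The risk is also visible: the swap happens underneath code that may be mid-flight, which is why instability is a concern when the same idea is applied to a kernel.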

The success of VMware’s vMotion has popularized the notion of live migration, but vMotion performs best when the virtualization environment is operating from a storage area network (SAN). Because the data resides on the SAN itself and never needs to be copied over the wire, live migration with vMotion and a SAN is a data-light process. This is why SAN-backed vMotion can happen so quickly.

… but it’s also a boondoggle for cloud operators at scale

The problem is that few cloud providers (there are exceptions to this rule) run local virtual machines from a SAN. Doing so has a number of drawbacks, including centralizing performance bottlenecks and increasing the blast radius in the event of an outage. Instead, the majority of cloud providers deploy a virtual machine with storage sitting on-chassis alongside the compute host running the actual virtual machine.

The use of live migration falls short for cloud providers for several reasons:

  • The weight of data (data is heavy)
  • The speed at which data can be moved
  • A limited capacity within the cloud fleet (i.e. servers)
  • The necessity of “leap-frogging” (moving data from host to host in succession to avoid exceeding capacity in any one host)
  • The time that successful “leap-frogging” would demand
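To make those constraints concrete, here is a back-of-the-envelope estimate of how long a fleet-wide evacuation might take when local disks have to travel with the VMs. Every number is hypothetical, and the model is deliberately optimistic: it assumes ideal line rate, ignores memory dirtying and protocol overhead, and assumes each "leap-frog" wave can only move as many hosts as there are empty spares.

```python
import math


def host_evacuation_hours(data_tb, link_gbps):
    """Hours to copy one host's local data at ideal line rate."""
    bits = data_tb * 8e12            # 1 TB = 1e12 bytes = 8e12 bits
    seconds = bits / (link_gbps * 1e9)
    return seconds / 3600


def fleet_hours(hosts, spare_hosts, data_tb, link_gbps):
    """Hours to patch the whole fleet by leap-frogging through spare hosts."""
    waves = math.ceil(hosts / spare_hosts)   # each wave fills the spares once
    return waves * host_evacuation_hours(data_tb, link_gbps)


# Hypothetical fleet: 1,000 hosts, 50 spares, 2 TB per host, 10 Gbps links.
per_host = host_evacuation_hours(2, 10)
total = fleet_hours(1000, 50, 2, 10)
print(f"{per_host:.2f} h per host, {total:.1f} h for the fleet")
```

Even with these generous assumptions, a single host takes the better part of half an hour and the fleet needs 20 successive waves. Less spare capacity or more data per host stretches the timeline further, which is the data gravity problem in miniature.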

Those factors aside, live migration can be a complementary tool for cloud providers. The practice can be effective when deployed to fix a single machine, re-balance capacity or perform general maintenance.

So if live migration can’t deliver what cloud operations need, what is the alternative?

Solving cloud operators’ problems with hot patching

Kernel hot patching lets a provider patch security vulnerabilities in real time on running hosts without the need to move data or virtual machines off the system. Of course, hot patching isn’t a perfect solution, either. There are no open-source options available today — and that alone limits the accessibility of the technology.

Oracle acquired kSplice in 2011 and initially shut down the service but subsequently reintroduced it. KernelCare is another commercial option, but the reality is that companies rely on either in-house engineering to craft and implement these patches, or rely on external providers and the whims of their business models. Additionally, those commercial offerings must support the specific kernels a provider uses (for example, KernelCare doesn’t support Ubuntu Linux).

At the end of the day, when it comes to the cloud, there’s no one right answer to the live migration vs. hot patching debate. The acquisitions of kSplice and of GridCentric (a live migration company focused on KVM that was bought by Google) affirm this belief. As both methodologies have merit, I believe both will get continued investment and interest. And don’t be surprised if you hear the big cloud operators talking more about hot-patching technology in the coming year.

Jesse Proudman is founder and CTO of Blue Box Group.

The week of the big cloud reboots

The week in cloud: Amazon Web Services and Rackspace both acknowledged that they needed to re-start a big chunk of their public cloud infrastructure due to a non-disclosed Xen issue.