For cloud players, hot patching may be hotter than live migration

Late last year, the world got a good look at the challenges and pain associated with kernel maintenance required by cloud providers. A security vulnerability in the Xen hypervisor required immediate and unprecedented infrastructure updates and the following “great cloud reboot” impacted a huge swath of cloud providers — including portions of both Amazon Web Services and Rackspace — and the hundreds of thousands of customers running on that infrastructure.

Reaction was swift on Twitter and in blogs, suggesting these providers were poorly prepared for such an update. Critics suggested that they instead should have used a technology known as live migration to avoid having to reboot individual workloads on hypervisors.

Unfortunately, the operational reality is that live migration wouldn’t have saved the day. Live migration is an attractive feature — it appears to solve all kinds of administrative woes. But in a scenario where there’s a major security vulnerability like a hypervisor breakout, live migration can’t physically overcome the challenge of data gravity to avoid the system-wide reboots we experienced.

As we continue to move more workloads to cloud infrastructure, cloud operators need to find a solution. Fortunately, one already exists: hot patching. Let’s start with some definitions.

Live migration moves a running virtual machine from one physical host to another by streaming its memory across the network, thus avoiding a reboot. Assuming there are no hiccups in that process, the end user should experience no downtime and, at worst, a slight pause in the workload as a result.
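To make the "slight pause" concrete, here is a minimal sketch of memory streaming, assuming the widely used iterative pre-copy approach: memory is copied while the virtual machine keeps running, pages dirtied in the meantime are resent, and the machine is paused only for the final small remainder. The function name, parameters and figures are hypothetical and purely illustrative; real hypervisors differ in the details.

```python
def precopy_rounds(memory_mb, dirty_rate_mb_s, link_mb_s,
                   stop_threshold_mb=64, max_rounds=30):
    """Simulate iterative pre-copy of a VM's memory (illustrative model).

    Returns (rounds, total_mb_sent, final_pause_seconds).
    """
    to_send = float(memory_mb)
    total_sent = 0.0
    for round_no in range(1, max_rounds + 1):
        seconds = to_send / link_mb_s          # time this round spends on the wire
        total_sent += to_send
        # Memory written by the guest during the round must be sent again.
        dirtied = min(float(memory_mb), dirty_rate_mb_s * seconds)
        if dirtied <= stop_threshold_mb:
            # Small enough: pause the VM and copy the remainder in one go.
            pause = dirtied / link_mb_s
            return round_no, total_sent + dirtied, pause
        to_send = dirtied
    # Never converged (guest dirties memory as fast as it is copied):
    # force a stop-and-copy of whatever is left.
    pause = to_send / link_mb_s
    return max_rounds, total_sent + to_send, pause


# Hypothetical 8 GB guest, dirtying 100 MB/s, over a 1,000 MB/s link.
rounds, sent_mb, pause_s = precopy_rounds(8192, 100, 1000)
```

In this model the pause stays short only while the guest dirties memory more slowly than the link can carry it. Note that the sketch covers memory only; it says nothing about moving disk data.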

Kernel hot patching is the practice of applying dynamic kernel updates without rebooting the underlying system. Like live migration, this process shouldn’t impact the end user as it happens; however, because the patch is changing running code in the kernel, there’s a potential risk of system instability.

The success of VMware’s vMotion has popularized the notion of live migration, but vMotion performs best when the virtualization environment is operating from a storage area network (SAN). Because the data resides on the SAN itself and never needs to be copied over the wire, live migration with vMotion and a SAN is a data-light process. This is why SAN-backed vMotion can happen so quickly.


… but it’s also a boondoggle for cloud operators at scale

The problem is that few cloud providers, with some exceptions, run virtual machines from a SAN. Doing so has a number of drawbacks, including centralizing performance bottlenecks and increasing the blast radius in the event of an outage. Instead, most cloud providers keep a virtual machine’s storage on-chassis, on the same compute host that runs it.

The use of live migration falls short for cloud providers for several reasons:

  • The weight of data (data is heavy)
  • The speed at which data can be moved
  • A limited capacity within the cloud fleet (i.e. servers)
  • The necessity of “leap-frogging” (moving data from host to host in succession to avoid exceeding capacity in any one host)
  • The time that successful “leap-frogging” would demand
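A rough back-of-the-envelope sketch shows how these factors compound. It assumes every host holds the same amount of local data, each copy gets a dedicated link, and hosts are drained in waves limited by the number of empty spare hosts (the leap-frogging described above). All names and figures are hypothetical, not measurements from any provider.

```python
import math


def evacuation_estimate(hosts, data_per_host_gb, link_gbit_per_s,
                        spare_hosts, reboot_minutes=0.0):
    """Estimate the time to patch a fleet by leap-frogging (illustrative model).

    Returns (waves, copy_hours_per_host, total_hours).
    """
    if spare_hosts < 1:
        raise ValueError("need at least one empty host to migrate into")
    gb_per_s = link_gbit_per_s / 8              # gigabits -> gigabytes per second
    copy_hours = data_per_host_gb / gb_per_s / 3600
    # Each wave drains as many hosts as there are empty spares; the drained
    # hosts are rebooted and become the spares for the next wave.
    waves = math.ceil(hosts / spare_hosts)
    total_hours = waves * (copy_hours + reboot_minutes / 60)
    return waves, copy_hours, total_hours


# Hypothetical fleet: 1,000 hosts with 2 TB of local data each,
# 10 Gbit/s per copy, 50 spare hosts, 10-minute reboots.
waves, copy_hours, total_hours = evacuation_estimate(
    hosts=1000, data_per_host_gb=2000, link_gbit_per_s=10,
    spare_hosts=50, reboot_minutes=10)
```

With these made-up numbers, each host takes about 27 minutes to copy and the fleet needs 20 waves, or roughly 12 hours, even under the optimistic assumption that the network is never a shared bottleneck.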

Those factors aside, live migration can be a complementary tool for cloud providers. The practice can be effective when deployed to fix a single machine, re-balance capacity or perform general maintenance.

So if live migration can’t deliver what cloud operations need, what is the alternative?

Solving cloud operators’ problems with hot patching

Kernel hot patching lets a provider patch security vulnerabilities in real time on running hosts without the need to move data or virtual machines off the system. Of course, hot patching isn’t a perfect solution, either. There are no open-source options available today — and that alone limits the accessibility of the technology.

Oracle acquired Ksplice in 2011 and initially shut down the service but subsequently reintroduced it. KernelCare is another commercial option, but the reality is that companies must rely either on in-house engineering to craft and implement these patches or on external providers and the whims of their business models. Additionally, those commercial offerings must support the specific kernels a provider uses (for example, KernelCare doesn’t support Ubuntu Linux).

At the end of the day, when it comes to the cloud, there’s no one right answer to the live migration vs. hot patching debate. The acquisitions of Ksplice and of Gridcentric (a live migration company focused on KVM that was bought by Google) affirm this belief. As both methodologies have merit, I believe both will get continued investment and interest. And don’t be surprised if you hear the big cloud operators talking more about hot-patching technology in the coming year.

Jesse Proudman is founder and CTO of Blue Box Group.