Whoopsie: Windows Azure stumbles again

Microsoft Windows Azure had a bad day Wednesday with compute capability impacted worldwide throughout the day, as first reported by The Register. Compute service was partially disrupted across nearly all regions,  according to the Azure status page,

At 2:35 a.m. UTC (7:35 p.m. PDT) the company said:

We are experiencing an issue with Compute in North Central US, South Central US, North Europe, Southeast Asia, West Europe, East Asia, East US and West US. We are actively investigating this issue and assessing its impact to our customers. Further updates will be published to keep you apprised of the impact. We apologize for any inconvenience this causes our customers.

A flurry of updates followed throughout the day and as of 10:45 a.m. UTC (3:45 a.m. PDT) on Thursday, the status report declared that the problem had been addressed and the company was “remediating” affected services.

Microsoft said the issue impacted the swap deployment feature of service management. As explained in PC World: Azure offers both a staging environment to let users test their systems, and a production environment separated by the virtual IP addresses used to access them. Swap deployment operations are used to turn the staging environment into the production environment.

Update: What this meant was that while existing applications continued to work, new code could not be deployed to production until the swap deployment issue could be resolved.

Still, there’s no way to paint this in a good light. If compute services fail across regions, even companies that design workloads to withstand snafus in one region, will be impacted. Folks point fingers at Amazon when its U.S. East facility has glitches, but AWS has not fallen down like this across multiple regions to my knowledge (correct me in comments if I’m mistaken.)


Other than its status page updates, Microsoft has not commented on the glitch — typically it takes some time for the company to issue explanations on its Windows Azure Blog.

This is the second major Microsoft cloud faux pas in the last year: Windows Azure storage was laid low by a so-called “leap year bug” in an expired security certificate in February.

Update: A Microsoft spokeswoman just emailed this update:

“As of 10:45AM PST, the partial interruption affecting Windows Azure Compute has been resolved. Running applications and compute functionality was unaffected throughout the interruption. Only the Swap Deployment operations were impacted for a small number of customers.  As a precaution, we advised customers to delay Swap Deployment operations until the issue was resolved.  All services to impacted accounts have been restored. For more information, please visit the Windows Azure Service Dashboard.”

Note: This story was updated at 7:51 a.m. PDT to reflect that existing applications continued to run but the issue prevented new code from being put into production; and at 9:16 a.m. PDT to characterize the reference to an earlier failure: and again at 10:36 a.m. PDT with a statement from Microsoft.