Delivering a Platform as a Service isn’t easy, and figuring out how to handle things when they go wrong marks a huge leap in maturity for a company. When you’re small, panicking and then having one or two of the few people who built the software fix the problem may work, but as the company grows you’re going to have to figure out how to react when things fail. Luckily, Mark Imbriaco, the director of operations at Heroku, shared some of the steps his company takes when a PaaS fumbles.
When disaster strikes, such as the 67-hour Amazon outage in April, Heroku follows a set of policies designed to get the service back up quickly and to reduce stress on the engineers dealing with the problem. Because the company requires all its engineers, even software engineers, to take time on call, Heroku believes that a notification page must arrive with a link telling the on-call engineer what to do. From that point, the problem should take less than five minutes to solve. Otherwise, the primary on-call person must call in a person who is acting as the “incident commander.”
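The escalation rule described above can be sketched in a few lines of Python. This is only an illustration of the policy as reported, not Heroku’s actual tooling: the `Page` class, the `next_action` function and the example runbook URL are all hypothetical names invented here.

```python
from dataclasses import dataclass

# The five-minute limit mentioned in the talk, expressed in seconds.
FIX_BUDGET_SECONDS = 5 * 60


@dataclass
class Page:
    """A notification sent to the on-call engineer (hypothetical structure)."""
    alert: str
    runbook_url: str  # every page carries a link telling the engineer what to do


def next_action(page: Page, seconds_elapsed: float, resolved: bool) -> str:
    """Return what the primary on-call engineer should do next."""
    if resolved:
        return "close"
    if seconds_elapsed < FIX_BUDGET_SECONDS:
        # Still inside the five-minute window: keep following the linked steps.
        return f"follow runbook: {page.runbook_url}"
    # Out of time: hand the incident to the incident commander.
    return "escalate to incident commander"


# Example with an assumed, made-up runbook link.
page = Page("instance unresponsive", "https://example.invalid/runbooks/restart")
print(next_action(page, 90, resolved=False))
print(next_action(page, 360, resolved=False))
```

The point of the sketch is that the decision is mechanical: the on-call engineer never has to judge whether to escalate, only whether the linked steps worked within the time limit.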
In most cases the pages, which arrive about two or three times during a 24-hour on-call period, require the engineer to take down the problematic instance and restart it. To keep engineers happy (in addition to sharing the on-call burden among the 20 people on the engineering staff), Imbriaco said Heroku relentlessly focuses on eliminating false positive notifications so on-call engineers won’t suffer from “pager fatigue.”
What surprised me was how actively the audience participated in the discussion about how Heroku does things, as opposed to how their own companies handle staffing, processes and whatnot around outages. Someone mentioned his company has people on call for a week as opposed to 24 hours. In another talk at the Surge conference, Adam Jacob, president and co-founder of Opscode, pushed for greater involvement from software developers whose code might be the cause of a problem. Audience members at both talks were amazed that these men would give software developers the ability to mess around on production servers.
The discussions made clear that if people are going to follow Google CIO Ben Fried’s advice and become generalists, then the culture inside engineering organizations will have to change. That may mean that, as at Heroku, all engineers have to take on-call duties, with almost-automated rules in place so someone can handle a failure in a few minutes or less. If that isn’t enough, the problem gets escalated to someone who knows more.
This gives everyone a stake in creating the best code running on the best systems, and it also allows ideas and understanding to flow across an organization. Folks may not want to call it devops, but whatever they call it, building applications that will scale requires new ways of thinking and responding.