Handling Incidents and Outages

David Mytton, CEO at Server Density, a London-based company that provides server monitoring, shared with the devopsdays Amsterdam 2015 crowd how they handle incidents and outages. Mytton breaks down the incident response process into preparation, response and post-mortem. The process is grounded in a key set of principles: frequent public updates, exhaustive logging of response activities, team effort and effective escalation.

Preparation

The first step to a balanced incident response process is to prepare for incidents and outages.

Outside working hours, Server Density's on-call rotation has a primary and a secondary engineer. The first level rotates weekly among all engineers, but the second level rotates only among ops engineers. The logic is that if the first-level engineer cannot solve the issue, then they probably need the help of an ops engineer. During working hours, ops engineers are the first to handle alerts. Whenever there's an out-of-hours incident, the first-level engineer gets 24 hours off call, to prevent incidents due to fatigue and to minimize social and health issues.
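The two-level rotation described above can be sketched in a few lines of Python. This is only an illustration, not Server Density's actual tooling: the function and class names are hypothetical, and the rule that the secondary must be a different person from the primary is an assumption added for the case where an ops engineer is also in the first-level pool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OnCallPair:
    primary: str
    secondary: str


def on_call_for_week(week, engineers, ops_engineers):
    """Return the out-of-hours primary and secondary for a week index.

    The primary rotates weekly among all engineers; the secondary
    rotates only among ops engineers.
    """
    if not engineers or not ops_engineers:
        raise ValueError("both rotation pools must be non-empty")

    primary = engineers[week % len(engineers)]

    # Assumption (not from the talk): never pair an engineer with themselves.
    candidates = [name for name in ops_engineers if name != primary]
    if not candidates:
        raise ValueError("no ops engineer available as secondary")

    secondary = candidates[week % len(candidates)]
    return OnCallPair(primary, secondary)


if __name__ == "__main__":
    everyone = ["ana", "ben", "cho", "dev"]  # hypothetical names
    ops = ["cho", "dev"]
    for week in range(4):
        print(week, on_call_for_week(week, everyone, ops))
```

A real schedule would also need to account for holidays, swaps and the 24 hours off call after an incident, which this sketch leaves out.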

Mytton stresses the importance of documentation. A primary source is the configuration management code, as it helps to understand recent changes to the infrastructure setup. But written documentation that is accessible, searchable and up to date is also necessary. The most important document is the incident response guide, a step-by-step checklist to run through whenever the on-call team gets an alert. Mytton takes inspiration from the aviation industry, which uses checklists extensively, even for routine activities, as a way to prevent and solve incidents.

Preparation also means expecting the unexpected. Is your documentation hosted on the same infrastructure as your product? What happens if that infrastructure goes down? What happens if your office loses its Internet connection? What happens if your primary user support infrastructure fails during a service outage? Mytton mentioned the case of an organization that lost its data center (a truck rammed through the building) and then client calls overwhelmed its support service.

Finally, on-call engineers must know key info such as team and vendor contacts.

Incident Response

When an incident triggers an alert, on-call engineers follow a well-defined procedure. Mytton argues that well-defined procedures are essential for effective and fast resolution of any incident. It is also the only way to scale on-call activities to new engineers as the company grows.

First, they open the incident response checklist. Second, they log onto the Ops War Room. Again, this step is inspired by the aviation industry. The goal is to get the engineers fully concentrated on the incident, without any external distractions. Third, they open an issue in JIRA. This issue tracks all the engineers' activities related to the incident. It acts as an information radiator, and it also helps the post-mortem that will be held later. Only then do the engineers begin to investigate the incident.
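The ordering of these steps, with investigation deliberately coming last, can be modelled as a checklist that refuses to skip ahead. The sketch below is a minimal illustration under that reading; the step labels paraphrase the article, and the class and method names are hypothetical rather than anything Server Density described.

```python
# Step labels paraphrase the article; everything else is illustrative.
STEPS = (
    "open the incident response checklist",
    "join the Ops War Room",
    "open a JIRA issue",
    "investigate the incident",
)


class IncidentChecklist:
    """Tracks progress through an ordered list of response steps."""

    def __init__(self, steps=STEPS):
        self._steps = tuple(steps)
        self._done = 0  # number of steps completed so far

    @property
    def next_step(self):
        """The next step to perform, or None when all are complete."""
        if self._done < len(self._steps):
            return self._steps[self._done]
        return None

    @property
    def finished(self):
        return self._done == len(self._steps)

    def complete(self, step):
        """Mark a step as done; reject steps taken out of order."""
        if step != self.next_step:
            raise RuntimeError(
                f"expected {self.next_step!r}, got {step!r}"
            )
        self._done += 1


if __name__ == "__main__":
    checklist = IncidentChecklist()
    while not checklist.finished:
        print("doing:", checklist.next_step)
        checklist.complete(checklist.next_step)
```

Trying to call `complete("investigate the incident")` on a fresh checklist raises an error, which mirrors the rule that investigation starts only after the first three steps.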

If the incident is affecting end users, then Server Density uses its status page to keep them up to date on what's happening. They try to be as detailed as possible and to post updates every 30 minutes, even if there's nothing new to report. The goal is to keep customers informed and retain their trust. Some users may be unaware of the status page and e-mail user support instead. They all get a reply once the incident is resolved.
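The 30-minute cadence is simple enough to check mechanically, for example as a reminder for whoever is updating the status page. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and the only value taken from the talk is the 30-minute interval.

```python
from datetime import datetime, timedelta

# Interval taken from the article; the rest is illustrative.
UPDATE_INTERVAL = timedelta(minutes=30)


def next_update_due(last_update, interval=UPDATE_INTERVAL):
    """Return the time by which the next status update should be posted."""
    return last_update + interval


def update_overdue(now, last_update, interval=UPDATE_INTERVAL):
    """Return True once the next status update is due or late."""
    return now >= next_update_due(last_update, interval)


if __name__ == "__main__":
    last = datetime(2015, 6, 25, 10, 0)  # hypothetical timestamp
    print("next update due at", next_update_due(last))
    print("overdue at 10:45?", update_overdue(datetime(2015, 6, 25, 10, 45), last))
```

The check is based on the time of the last posted update rather than the start of the incident, so an early update simply moves the next deadline forward.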

Post-mortem

The final step is to conduct a post-mortem, one or two days after the incident. The goal is to do it neither too soon, when things are still in flux, nor too late, when memories start to fade. The JIRA issue, with all the logged activities, is a great help in telling the story of what happened. Mytton says that it's important to know your audience to determine how much technical detail is appropriate.

It's important to answer three questions. What failed? Why did it fail? How is it going to be fixed? Mytton offered an example post-mortem covering an incident in which their data center provider had an outage because a construction crew cut a fiber optic cable.
