Handling Incidents and Outages

by João Miranda on Jun 29, 2015. Estimated reading time: 3 minutes

David Mytton, CEO at Server Density, a London-based company that provides server monitoring, shared with the devopsdays Amsterdam 2015 crowd how they handle incidents and outages. Mytton breaks down the incident response process into preparation, response and post-mortem. The process is grounded in a set of key principles: frequent public updates, exhaustive logging of response activities, team effort and effective escalation.

Preparation

The first step to a balanced incident response process is to prepare for incidents and outages.

Outside working hours, Server Density's on-call rotation has a primary and a secondary engineer. The first level rotates weekly among all engineers, but the second level rotates only among ops engineers. The logic is that if the first-level engineer cannot solve the issue, then he or she probably needs the help of an ops engineer. During working hours, ops engineers are the first to handle alerts. Whenever there is an out-of-hours incident, the first-level engineer gets 24 hours off call, to prevent incidents caused by fatigue and to minimize social and health issues.
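The rotation described above can be sketched in a few lines of Python. This is only an illustration, not Server Density's tooling: the function names, the `epoch` parameter and the rule that an ops engineer who is first level in a given week is skipped for second level are all assumptions made for the example.

```python
from datetime import date, datetime, timedelta


def on_call_pair(week_start, all_engineers, ops_engineers, epoch):
    """Return (first_level, second_level) for the week starting at week_start.

    epoch is the (hypothetical) start date of the very first rotation week.
    """
    week = (week_start - epoch).days // 7
    # First level rotates weekly among all engineers.
    first = all_engineers[week % len(all_engineers)]
    # Second level rotates only among ops engineers.
    # Assumption: whoever is first level this week is skipped here.
    candidates = [e for e in ops_engineers if e != first]
    if not candidates:
        raise ValueError("no ops engineer available for second level")
    second = candidates[week % len(candidates)]
    return first, second


def off_call_until(incident_resolved_at):
    """After an out-of-hours incident, the responder is off call for 24 hours."""
    return incident_resolved_at + timedelta(hours=24)
```

For a team of `["ana", "ben", "cat", "dan"]` where `cat` and `dan` are the ops engineers, week zero pairs `ana` with `cat`. In the week where `cat` is first level, `dan` becomes second level.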

Mytton stresses the importance of documentation. A primary source is the configuration management code, as it helps to understand recent changes to the infrastructure setup. But written documentation that is accessible, searchable and up to date is also necessary. The most important document is the incident response guide, a step-by-step checklist to run through whenever the on-call team gets an alert. Mytton takes inspiration from the aviation industry, which uses many checklists, even for routine activities, as a way to prevent and solve incidents.

Preparation also means expecting the unexpected. Is your documentation hosted on the same infrastructure as your product? What happens if that infrastructure goes down? What happens if your office loses its Internet connection? What happens if your primary user support infrastructure fails during a service outage? Mytton mentioned the case of an organization that lost its data center (a truck rammed through the building) and then client calls overwhelmed its support service.

Finally, on-call engineers must know key info such as team and vendor contacts.

Incident Response

When an incident triggers an alert, on-call engineers follow a well-defined procedure. Mytton argues that well-defined procedures are essential for effective and fast resolution of any incident. They are also the only way to scale on-call activities to new engineers as the company grows.

First, they open the incident response checklist. Second, they log onto the Ops War Room. Again, this step is inspired by the aviation industry. The goal is to get the engineers fully concentrated on the incident, without any external distractions. Third, they open an issue in JIRA. This issue tracks all the engineers' activities related to the incident. It acts as an information radiator, and it also helps the post-mortem that will be held later. Only then do the engineers begin to investigate the incident.
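The ordering matters here: investigation starts only after the three preparatory steps. A minimal sketch of that gate follows, assuming hypothetical step names and an in-memory log that stands in for the tracking issue. It does not talk to JIRA or any other real system.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical identifiers for the three preparatory steps, in order.
PREP_STEPS = ("open_checklist", "join_war_room", "open_tracking_issue")


@dataclass
class Incident:
    title: str
    done: list = field(default_factory=list)
    log: list = field(default_factory=list)  # stand-in for the tracking issue

    def record(self, note):
        """Append a timestamped note, mirroring the activity log in the issue."""
        self.log.append((datetime.now(timezone.utc), note))

    def complete(self, step):
        """Mark the next preparatory step as done; reject out-of-order steps."""
        if len(self.done) == len(PREP_STEPS):
            raise ValueError("all preparation steps are already done")
        expected = PREP_STEPS[len(self.done)]
        if step != expected:
            raise ValueError(f"expected {expected!r}, got {step!r}")
        self.done.append(step)
        self.record(f"completed {step}")

    def can_investigate(self):
        """Investigation begins only once every preparatory step is done."""
        return tuple(self.done) == PREP_STEPS
```

Every completed step is logged with a timestamp, so the same record that gates the investigation can later feed the post-mortem.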

If the incident is affecting end users, then Server Density uses its status page to keep them up to date on what's happening. They try to be as detailed as possible and to post updates every 30 minutes even if there's nothing new to report. The goal is to keep customers informed and retain their trust. Some of its users may be unaware of the status page and proceed to e-mail user support. They all get a reply when the incident is solved.
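The 30-minute cadence, including the "nothing new to report" case, could be expressed roughly as below. The function names and the wording of the filler message are assumptions for illustration. Posting to an actual status page is left out.

```python
from datetime import datetime, timedelta

# The article mentions an update every 30 minutes during an incident.
UPDATE_INTERVAL = timedelta(minutes=30)


def next_update_due(last_update_at):
    """Time by which the next status page update should be posted."""
    return last_update_at + UPDATE_INTERVAL


def status_message(now, last_update_at, news=None):
    """Return the text to post now, or None if no post is due yet.

    Real news is posted immediately; otherwise a "no change" update is
    posted once the interval has elapsed.
    """
    if news:
        return news
    if now >= next_update_due(last_update_at):
        return "No change since the last update; we are still investigating."
    return None
```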

Post-mortem

The final step is to conduct a post-mortem, one or two days after the incident. The goal is to hold it neither too soon, when things are still in flux, nor too late, when memories start to fade. The JIRA issue, with all the logged activities, is a great help in telling the story of what happened, with the appropriate technical detail. Mytton says that it's important to know your audience to determine how much technical detail is appropriate.

It's important to answer three questions. What failed? Why did it fail? How is it going to be fixed? Mytton offered an example post-mortem, covering an incident in which their data center provider had an outage because a construction crew cut a fiber optic cable.

Response to incidents and outages article by Vernon Johnson

Great examples of handling incidents and outages and very relevant for today's dynamic environments. Understanding the needs of the business and marrying them with the objectives of IT is important. I also believe you can bring different types of analytics (descriptive, diagnostic, predictive, and prescriptive) into the picture to help stave off incidents and outages by answering questions such as 'What do we have and what is it doing? Why is it doing this? When will it break and why (very important)? What should we do to prevent it?'

Ultimately, we're trying to provide a consistently pleasant experience for our customers. That planning takes IT and other business units to collaborate which you referenced above.
