Involving Engineers in Incident Management: QCon London Q&A

Learning from past incidents can increase engineers' confidence in handling live incidents and convince them to join the on-call team. Samuel Parkinson spoke at QCon London 2020 about how we can benefit from past incidents and encourage engineers to get involved in incident management.

When engineers were invited to join the out-of-hours support team, there wasn’t much interest initially. To convince engineers to join, Parkinson said, the team started running incident workshops: paper-based incident simulations that use old incidents as their material.

Parkinson stated that the company’s previous incidents are a treasure trove for preparing for what’s to happen:

We keep a record of all of our incidents at the FT and we review them regularly. There are always new things to learn even if they’ve happened in the past. And new people provide new eyes on those previous incidents and things that we didn’t know at the time.

One clear benefit of having engineers involved in incident management is that incidents get resolved more quickly. Having the engineers involved means there’s a constant line of communication, and increased rapport, between the operations team and the engineers delivering the systems, Parkinson explained.

There’s more buy-in from the feature teams to follow up on the actions produced in an incident report if the engineers are a part of resolving the incident, Parkinson said. The long-term fixes also tend to be more automated and less documentation- and process-based, freeing up the time of the operations team and the engineers.

InfoQ interviewed Samuel Parkinson, principal engineer at the Financial Times, about involving engineers in incident management.

InfoQ: How did you get involved in incident management?

Samuel Parkinson: I first got involved while working as a software engineer for Graze (an online snack delivery company); it was a small engineering team and I ended up working very closely with the infrastructure team there. I remember one of the first big incidents I was involved in there: we’d released a new product and a new part of the website, and we found that the load caused by filtering products was too much for our EC2 instances! I ended up copying and pasting a code fix onto each instance in tandem with the CTO, with a 3...2...1 count to save each change. Nerve-racking!

Since then, the rush has kept me coming back to helping resolve incidents at Graze and now throughout my time at the Financial Times. I’m pleased to say I’ve since moved on from copying and pasting fixes into production.

My interest in the coordination and communication of incident management really started at the FT, where the number of people and systems involved really does create more difficulties. Ensuring that we get this right as a team, and seeing a great incident management culture continue to develop, is extremely motivating.

InfoQ: How did engineers respond when they were asked to take up incident management work?

Parkinson: When we asked our engineers to join the out-of-hours support team, it was always optional. As you might expect, there wasn’t much interest initially.

When we asked people why, the feedback we got highlighted that there were more than enough engineers potentially interested in joining the support team, but the majority first wanted to improve their confidence in supporting production incidents.

(That’s seven people we convinced to join the team! These are the results from a survey we sent out to our engineers asking them to sign up to the planned incident workshops.)

We addressed the confidence issue by improving our documentation and running workshops. These areas became our primary method for making the team sustainable, getting us to a place where our engineers were willing to join the team.

InfoQ: What have you done to enable continuous learning from incidents?

Parkinson: One thing we did recently was move our incident reports into GitHub issues. Having all our incidents documented in one place has allowed our engineers to explore the full history, with each person able to add their own insights. The extra metadata available through labels and being able to search incidents is a real step up from what we used to do, which was committing notes into a repository.

The main thing we started in 2019, however, was to run incident workshops: paper-based incident simulations using old incidents. This lightweight method of getting people used to handling incidents has also been a great way of getting new eyes on old problems.

A recent workshop came up with a brilliant solution to address an incident: a part of the site was overwhelmed and the load was starting to have an impact on the rest of the site. At the time I think we simply waited it out, but from the workshop it was suggested we could have scaled down that part of the website to 0, in effect turning it off. We discussed it at length and now have a shared understanding that we could apply the same action to similar incidents yet to happen.

InfoQ: How can engineers put on the incident management hat? What benefits can it bring them?

Parkinson: I see it as an amazing opportunity for engineers to really hone their problem-solving skills and equally tackle communication in complex situations.

I found it really difficult when getting into incident management as an engineer to put aside that engineering instinct to find and solve the problem, and instead focus first on understanding the impact on users and start with communication. Following that, start looking at what actions can be taken to improve the situation for our users, but unless it’s necessary, avoid having to fix it with code! The incident workshops reinforce this approach to managing an incident, and really promote putting the user’s experience first.

Having a closer connection to how your code is running in production is an amazing feedback loop. Ideally it helps to inform your engineering in the future, nudging us to think more about the not-so-perfect conditions and situations your code has to deal with in production.
