


Improving Incident Management through Role Assignments and Game Days


John Arundel, principal consultant at Bitfield Consulting, shared his thoughts on how to ensure incidents are handled smoothly and quickly. He suggests assigning a specific role to each team member responding to the incident. Red team versus blue team exercises can also help ensure the team is prepared to respond accurately and quickly.

Within the incident response team, the incident commander has the most critical role. They are responsible for running the team's designated incident response process. Arundel notes that:

The key thing is to have one person in charge. You need a decision maker. Often, this will be the team lead, but over time, you should make sure to give everybody a turn in that chair.

The next role that Arundel recommends is the communicator. The communicator's job is to provide status updates both internally and externally. This includes updating management, project managers, and the impacted clients. Supporting the communicator is the records person, whose responsibility is to document everything as it happens, including taking notes, capturing screenshots, and collecting log data and metrics for future analysis. The final role that Arundel recommends is the researcher, who is responsible for hunting down answers to questions as they come up during the incident response process.
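To make the four roles and the idea of giving everybody a turn concrete, here is a minimal sketch in Python. It is not taken from Arundel's material: the `Role` enum, the `assign_roles` function, and the round-robin rotation by incident number are all assumptions made for illustration.

```python
from enum import Enum


class Role(Enum):
    # The four roles described above.
    COMMANDER = "incident commander"
    COMMUNICATOR = "communicator"
    RECORDS = "records person"
    RESEARCHER = "researcher"


def assign_roles(team, incident_number):
    """Assign one role per person, rotating the starting point each incident.

    Hypothetical scheme: the team list is rotated by the incident number,
    so a different person becomes incident commander each time.
    """
    if len(team) < len(Role):
        raise ValueError("need at least one person per role")
    offset = incident_number % len(team)
    rotated = team[offset:] + team[:offset]
    # zip stops after the last role; extra team members stay unassigned.
    return dict(zip(Role, rotated))


if __name__ == "__main__":
    team = ["ana", "ben", "cho", "dev"]  # hypothetical names
    for number in range(2):
        roles = assign_roles(team, number)
        print(number, {role.value: person for role, person in roles.items()})
```

With a four-person team, each person holds each role once every four incidents. A real rotation would also need to account for availability and experience, which this sketch ignores.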

This matches closely with how Netflix runs incidents, as seen in the recent open sourcing of their incident management tool Dispatch. Dispatch can automatically assign an incident commander based on the type, priority, or description of the incident. Dispatch can also facilitate communications by sending notifications on a set cadence, removing the need for a human to remember to send them out.
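The general idea of rule-based commander assignment and cadence-driven updates can be sketched as follows. This is not Dispatch's API or data model: the `Incident` class, the rule table, the names in it, and the cadence values are all hypothetical and only illustrate the concept.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    incident_type: str
    priority: str
    opened_at: datetime


# Hypothetical rules: (type, priority) -> incident commander.
COMMANDER_RULES = {
    ("security", "high"): "alice",
    ("security", "low"): "bob",
    ("outage", "high"): "carol",
}
DEFAULT_COMMANDER = "team-lead"

# Hypothetical cadence: how often a status update is due per priority.
UPDATE_CADENCE = {
    "high": timedelta(minutes=30),
    "low": timedelta(hours=4),
}
DEFAULT_CADENCE = timedelta(hours=1)


def assign_commander(incident):
    """Pick a commander from the rule table, falling back to a default."""
    key = (incident.incident_type, incident.priority)
    return COMMANDER_RULES.get(key, DEFAULT_COMMANDER)


def next_update_due(incident, last_update=None):
    """Return when the next status update should go out."""
    base = last_update or incident.opened_at
    return base + UPDATE_CADENCE.get(incident.priority, DEFAULT_CADENCE)


if __name__ == "__main__":
    incident = Incident("outage", "high", datetime(2020, 3, 1, 12, 0))
    print(assign_commander(incident))
    print(next_update_due(incident))
```

A scheduler would call `next_update_due` after each notification, so nobody on the response team has to track the clock by hand.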

As the team becomes better at resolving incidents and mitigating the issues that led to them, they may need other ways to ensure they are prepared. As Arundel states, "The more reliable your systems, the less frequently real incidents happen, so the more you need to practice them." This is where he recommends using red team versus blue team exercises. This concept, which originates in military exercises and is heavily used in information security, has one internal team, the red team, take on the role of "attacker". Their job is to create an incident that the blue team needs to respond to. This is similar to the concept of game days, in which a failure is simulated within the environment to allow for testing systems, processes, and team responses.

Adrian Cockcroft, VP cloud architecture strategy at AWS, shares this sentiment and believes that a "learning organization, disaster recovery testing, game days, and chaos engineering tools are all important components of a continuously resilient system."

Arundel shares some tips for teams looking to host their first game day: "Keep it short and simple the first time round. Put together a basic plan for what you’re going to do: this is the first draft of your incident handling procedure." As the team becomes more practiced, he recommends starting to assign the various roles. For the first attempts at practice incidents, he advises keeping the exercise to around one hour in length. Finally, he feels that moving the debrief to the day after will provide a better experience as the team will have had time to reflect on their actions and learnings.

Eugene Wu, director of customer experience at Gremlin, shares a number of the same tips. He also stresses the importance of clearly identifying up front the purpose of the game day and which scenarios are going to be tested. This makes it easier to identify the correct individuals to involve, on both the execution and the response sides. He also suggests scoping out the test cases to better define the perceived impact and extent of the potential blast radius. Finally, he recommends having a clear exit strategy in case the experiment needs to be aborted quickly.
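The planning tips above can be captured in a simple checklist structure. The sketch below is one possible shape, not a Gremlin feature or a format proposed by Wu or Arundel: the `GameDayPlan` class, its field names, and the error-rate abort threshold are assumptions chosen for illustration. The 60-minute default reflects Arundel's advice to keep early exercises to around one hour.

```python
from dataclasses import dataclass


@dataclass
class GameDayPlan:
    purpose: str
    scenarios: list
    executors: list        # people running the experiment
    responders: list       # people responding to the simulated incident
    blast_radius: str      # expected extent of the impact
    max_error_rate: float  # hypothetical exit condition, 0.0 to 1.0
    duration_minutes: int = 60

    def problems(self):
        """Return a list of gaps in the plan; an empty list means ready."""
        found = []
        if not self.purpose.strip():
            found.append("purpose is not stated")
        if not self.scenarios:
            found.append("no scenarios listed")
        if not self.executors:
            found.append("nobody assigned to execution")
        if not self.responders:
            found.append("nobody assigned to response")
        if not self.blast_radius.strip():
            found.append("blast radius is not scoped")
        return found

    def should_stop(self, observed_error_rate, elapsed_minutes):
        """Exit strategy: stop on excess impact or when time runs out."""
        return (
            observed_error_rate > self.max_error_rate
            or elapsed_minutes >= self.duration_minutes
        )


if __name__ == "__main__":
    plan = GameDayPlan(
        purpose="Verify failover of the primary database",
        scenarios=["primary database becomes unreachable"],
        executors=["ana"],
        responders=["ben", "cho"],
        blast_radius="staging environment only",
        max_error_rate=0.05,
    )
    print(plan.problems())
    print(plan.should_stop(observed_error_rate=0.10, elapsed_minutes=15))
```

Checking `problems()` before the exercise and `should_stop()` during it keeps the purpose, participants, blast radius, and exit strategy explicit rather than assumed.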
