Incident Management During Remote Work

Michael Fisher, a technology enthusiast and group product manager at OpsRamp, recently blogged about how IT operations and DevOps teams can take a problem-first approach towards the incident management process. This appears to be a trend within the software development industry, as Dr Laura Maguire and Nora Jones have also recently written about related challenges in their learningfromincidents.io article "Learning from Adaptations to Coronavirus".

As the world reacts to COVID-19, organizations are asking their network operations center (NOC), site reliability engineering (SRE), IT, support, and software engineering teams to work remotely. Looking at this scenario, Maguire and Jones argued that "staying in the loop and maintaining context on decisions and changes is more challenging." Fisher explained that teams need to be proactive in managing the availability and health of enterprise services.

As a starting point for the problem-first incident management approach, Fisher suggested using the Utilization, Saturation, and Errors (USE) method. Developed by Brendan Gregg, an industry expert in computing performance and cloud computing, the USE method focuses on the areas that matter most when solving common performance issues. It calls for building a checklist that examines each resource for utilization, saturation, and errors. The USE method is intended to be used early in a performance investigation to identify systemic bottlenecks. This, in turn, means setting up alerts on key metrics so that monitoring can detect an incident early.
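The checklist idea can be illustrated with a minimal Python sketch. It is not taken from Fisher's post or from Gregg's own material: the `ResourceSample` type, the `use_checklist` function, and the 0.8 utilization threshold are all hypothetical choices made for illustration, and the sample values would in practice come from whatever monitoring system a team already runs.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    """One observation of a resource (hypothetical structure for illustration)."""
    name: str
    utilization: float  # fraction of time the resource was busy, 0.0 to 1.0
    saturation: float   # amount of queued or waiting work, e.g. run-queue length
    errors: int         # count of error events in the sampling window


def use_checklist(samples, util_threshold=0.8):
    """Walk every resource and flag errors, saturation, and high utilization.

    The threshold is an assumed default, not a recommended value.
    """
    findings = []
    for s in samples:
        if s.errors > 0:
            findings.append((s.name, "errors", s.errors))
        if s.saturation > 0:
            findings.append((s.name, "saturation", s.saturation))
        if s.utilization >= util_threshold:
            findings.append((s.name, "utilization", s.utilization))
    return findings


samples = [
    ResourceSample("cpu", utilization=0.95, saturation=3, errors=0),
    ResourceSample("disk", utilization=0.20, saturation=0, errors=2),
    ResourceSample("network", utilization=0.10, saturation=0, errors=0),
]
findings = use_checklist(samples)
for resource, check, value in findings:
    print(f"{resource}: {check} = {value}")
```

Each flagged entry marks a resource worth investigating first; a resource with no findings, such as the network in this sample data, can be ruled out early.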

Fisher asserted that the current reactive approach of looking at the metrics and working backward to detect the problem needs to be "flipped". DevOps combines people, actions, and tools to create workflows and responses, so it makes sense to coordinate across all departments. While there is no one-size-fits-all solution for incident management, there are already many established practices that help maintain a stable infrastructure.

The incident response team needs high levels of coordination, communication, and responsiveness. With the sudden shift to remote IT work, it becomes important to stay connected through multiple modes of communication. Fisher suggested using tools like Slack, Zenduty, and Squadcast, so that teams can reach the right person at the right time and communicate effectively. Maguire and Jones also stated that new tools are being adopted to try to keep people in the loop, and that people will need to adapt their practices accordingly:

"Create shared visual frames of reference as much as possible (virtual whiteboards, trellos, google docs, murals) that can be easily shared and jointly worked on."

Pointing to exhaustion and burnout caused by long working hours, Fisher urged managers to be aware of morale and to try to keep it high across their teams. For example, PwC initiated the Quarantine Days project, which brought together teams around the world to work on and share doodles in a virtual chat room. There are also chatbots like Pez.ai’s Expert, which can have conversations with team members about safety protocols, work schedules, payroll processes, and similar topics. This helps make information about these topics available at all times.

Referring to scaling requirements, Fisher stressed the role of automation. Given the time-sensitive and critical nature of incident response, automation can help reduce the time taken to mitigate an incident. Numerous free tools are available to triage and investigate incidents, thereby automating incident response tasks.
