Incident Response Content on InfoQ
-
Moving Past Simple Incident Metrics: Courtney Nash on the VOID
The Verica Open Incident Database (VOID) is assembling publicly available software-related incident reports. InfoQ talks with Courtney Nash about their recent findings, including how MTT* metrics may not be beneficial, the average time to incident resolution, and the importance of studying near-miss reports.
-
Building an Effective Incident Management Process
A good incident management framework can help organizations manage the chaos of an outage more effectively, leading to shorter incident durations and tighter feedback loops. This article introduces the components necessary for a healthy incident management process.
-
The Hows and Whys of Effective Production-Readiness Reviews
At QCon Plus November 2021, Nora Jones, CEO and founder of Jeli, talked about how to build production readiness reviews (PRRs) with an emphasis on context and psychological safety. Her talk focused on the particulars of a PRR process that relate to incidents.
-
Analyzing Incident Data across Organizations: Courtney Nash on the VOID
The Verica Open Incident Database (VOID) is assembling publicly available software-related incident reports. InfoQ talks with Courtney Nash on their recent findings, including how MTT* metrics may not be beneficial, the average time to incident resolution, and the importance of studying near-miss reports.
-
DevOps and Cloud InfoQ Trends Report – June 2022
This article summarizes how we see the "cloud computing and DevOps" space in 2022, which focuses on fundamental infrastructure and operational patterns, the realization of patterns in technology frameworks, and the design processes and skills that a software architect or engineer must cultivate.
-
How to Best Use MTT* Metrics to Optimize Your Incident Response
Selecting the correct MTT* metric to improve your incident response is important. If the wrong metric is chosen, the improvements may get lost in the noise of a multivariable equation. This article reviews the various MTT* metrics available and discusses the best scenarios for selecting each one.
-
DevOps and Cloud InfoQ Trends Report - July 2021
This article summarizes how we see the "cloud computing and DevOps" space in 2021, which focuses on fundamental infrastructure and operational patterns, the realization of patterns in technology frameworks, and the design processes and skills that a software architect or engineer must cultivate.
-
Designing & Managing for Resilience
The fourth article in a series on how software companies adapted and continue to adapt to enhance their resilience explores the strategies used by engineering leaders to help create the conditions for sustained resilience. It provides stories, examples, and strategies towards designing an organizational structure to support resilient performance and managing for resilience.
-
Piercing the Fog: Observability Tools from the Future
Gaining visibility into distributed systems and how they are performing is challenging. Despite all the observability tools available for site reliability, debugging remains incredibly difficult, and many SREs would agree that their debugging processes have only marginally improved. This article explores how observability for troubleshooting could be done from the user’s point of view.
-
Adaptive Frontline Incident Response: Human-Centered Incident Management
The third article in a series on how software companies adapted and continue to adapt to enhance their resilience zeros in on the sources that comprise most of your company’s adaptive resources: your frontline responders. In this article, we draw on our experiences as incident commanders with Twilio to share our reflections on what it means to cultivate resilient people.
-
Learning from Incidents
Jessica DeVita (Netflix) and Nick Stenning (Microsoft) have been working on improving how software teams learn from incidents in production. In this article, they share some of what they’ve learned from the research community in this area, and offer some advice on the practical application of this work.
-
Shifting Modes: Creating a Program to Support Sustained Resilience
The second article in a series on how software companies adapted and continue to adapt to enhance their resilience explores how organizations can shift to a Learn & Adapt safety mode and compares the traits of an organization that is well poised to sustain this mode shift. This shift will not only make them safer but will also give them a competitive advantage.