Incident Response Content on InfoQ
-
The Time it Wasn't DNS
Sean Klein explains how Azure handles massive outages using modern incident analysis. Moving past the "Five Whys," he shares how systemic factors—not operator error—caused the 2023 global WAN outage.
-
The Human Toll of Incidents & Ways to Mitigate it
Kyle Lexmond discusses the human side of major system failures. He shares psychological insights and architectural tactics for surviving high-pressure incident rooms.
-
The Ironies of A^2 I^2
J. Paul Reed explains the "ironies of automation" and AI in incident response. He discusses how reliance on AI can erode manual skills and camouflage system failures during high-stakes outages.
-
AI-Powered SRE for Autonomous Incident Response
The presenters discuss how AI-enhanced SRE platforms connect signals from logs, metrics, traces, and historical incidents to enable autonomous decisions during incident response.
-
Week-Long Outage: Lifelong Lessons
Molly Struve shares a "murder mystery" outage story from a massive Elasticsearch upgrade. She explains why you need a rollback plan, how to check biases, and why leadership support is a stabilizer.
-
An Incident Story: Tips for How Staff+ Engineers Can Impact Incidents
Erin Doyle discusses her experience with a critical three-day incident, how she missed a key opportunity to help prevent it, and how to build a culture that prevents similar situations.
-
The Incident Lifecycle: How a Culture of Resilience Can Help You Accomplish Your Goals
Vanessa Huerta Granda describes how to apply resilience throughout the incident lifecycle in order to turn incidents into opportunities, looking at real-life examples.
-
Two Years of Incidents at Six Different Companies: How a Culture of Resilience Can Help You Accomplish Your Goals
Vanessa Huerta Granda looks at real-life examples of companies she has worked with that chose to invest in improving their incident programs and have seen that investment pay dividends.
-
Comparing Apples and Volkswagens: the Problem with Aggregate Incident Metrics
Courtney Nash presents data from the Verica Open Incident Database (VOID) to demonstrate how aggregate incident metrics such as MTTR aren't representative of systems' resilience.
-
How Did It Make Sense at the Time? Understanding Incidents as They Occurred, Not as They are Remembered
Jacob Scott explores the basics of failure in complex systems, the theory and practice of how it made sense at the time, and actions to take.
-
Rethinking Reliability: What You Can (and Can't) Learn from Incidents
Courtney Nash discusses research collected from the VOID, challenging standard industry practices for incident response and analysis, like tracking MTTR and using RCA methodology.
-
Incidents, PRRs, and Psychological Safety
Nora Jones discusses the context around PRRs and provides takeaways on how one can improve production reliability.