Incident Response Content on InfoQ
-
AI-Powered SRE for Autonomous Incident Response
The presenters discuss incident response and how AI-enhanced SRE platforms connect signals from logs, metrics, traces, and historical incidents to enable autonomous decisions.
-
Week-Long Outage: Lifelong Lessons
Molly Struve shares a "murder mystery" outage story from a massive Elasticsearch upgrade. She explains why you need a rollback plan, how to check biases, and why leadership support is a stabilizer.
-
An Incident Story: Tips for How Staff+ Engineers Can Impact Incidents
Erin Doyle discusses her experience with a critical three-day incident, the key opportunity she missed to help prevent it, and how to build a culture that prevents similar situations.
-
The Incident Lifecycle: How a Culture of Resilience Can Help You Accomplish Your Goals
Vanessa Huerta Granda describes how to apply resilience throughout the incident lifecycle in order to turn incidents into opportunities, looking at real-life examples.
-
Two Years of Incidents at Six Different Companies: How a Culture of Resilience Can Help You Accomplish Your Goals
Vanessa Huerta Granda looks at real-life examples of companies she has worked with that chose to invest in improving their incident programs and have seen that investment pay dividends.
-
Comparing Apples and Volkswagens: the Problem with Aggregate Incident Metrics
Courtney Nash presents data from the Verica Open Incident Database (VOID) to demonstrate that aggregate incident metrics such as MTTR aren't representative of systems' resilience.
-
How Did It Make Sense at the Time? Understanding Incidents as They Occurred, Not as They are Remembered
Jacob Scott explores the basics of failure in complex systems, the theory and practice of how it made sense at the time, and actions to take.
-
Rethinking Reliability: What You Can (and Can't) Learn from Incidents
Courtney Nash discusses research collected from the VOID, challenging standard industry practices for incident response and analysis, such as tracking MTTR and using RCA methodology.
-
Incidents, PRRs, and Psychological Safety
Nora Jones discusses the context around PRRs and provides takeaways on how one can improve production reliability.
-
Incident Analysis: Your Organization's Secret Weapon
Nora Jones discusses how to move faster and focus on the things that matter by using incident analysis.
-
More More More! Why the Most Resilient Companies Want More Incidents
John Egan discusses how companies of any scale can improve their understandability by lowering their barriers to incident reporting and simplifying their processes for documenting postmortems.
-
Lessons from Incident Management and Postmortems at Atlassian
Jim Severino shares what worked (and didn't work) in incident management and postmortems at Atlassian.