Incident Response Content on InfoQ

Presentations

DevOps

The Time it Wasn't DNS

Sean Klein explains how Azure handles massive outages using modern incident analysis. Moving past the "Five Whys," he shares how systemic factors—not operator error—caused the 2023 global WAN outage.

Sean Klein
on Jun 23, 2026

43:59
DevOps

The Human Toll of Incidents & Ways to Mitigate it

Kyle Lexmond discusses the human side of major system failures. He shares psychological insights and architectural tactics for surviving high-pressure incident rooms.

Kyle Lexmond
on Jun 02, 2026

51:40
DevOps

The Ironies of A^2 I^2

J. Paul Reed explains the "ironies of automation" and AI in incident response. He discusses how reliance on AI can erode manual skills and camouflage system failures during high-stakes outages.

J. Paul Reed
on May 21, 2026

45:16
DevOps

AI-Powered SRE for Autonomous Incident Response

The presenters discuss how AI-enhanced SRE platforms connect signals from logs, metrics, traces, and historical incidents to enable autonomous decisions during incident response.

Rohit Dhawan Pavan Madduri Alina Astapovich Goutham Rao Renato Losio
on Apr 28, 2026

01:00:24
DevOps

Week-Long Outage: Lifelong Lessons

Molly Struve shares a "murder mystery" outage story from a massive Elasticsearch upgrade. She explains why you need a rollback plan, how to check biases, and why leadership support is a stabilizer.

Molly Struve
on Apr 28, 2026

49:32
Culture & Methods

An Incident Story: Tips for How Staff+ Engineers Can Impact Incidents

Erin Doyle discusses a critical three-day incident, the key opportunity she missed to help prevent it, and how to build a culture that prevents similar situations.

Erin Doyle
on Jul 17, 2024

48:16
Culture & Methods

The Incident Lifecycle: How a Culture of Resilience Can Help You Accomplish Your Goals

Vanessa Huerta Granda describes how to apply resilience throughout the incident lifecycle in order to turn incidents into opportunities, looking at real-life examples.

Vanessa Huerta Granda
on Jun 26, 2024

45:22
Culture & Methods

Two Years of Incidents at Six Different Companies: How a Culture of Resilience Can Help You Accomplish Your Goals

Vanessa Huerta Granda looks at real-life examples of companies she has worked with that chose to invest in improving their incident programs and have seen that investment pay dividends.

Vanessa Huerta Granda
on Mar 14, 2024

45:01
Culture & Methods

Comparing Apples and Volkswagens: the Problem with Aggregate Incident Metrics

Courtney Nash presents data from the Verica Open Incident Database (VOID) to demonstrate that aggregate incident metrics such as MTTR aren't representative of systems' resilience.

Courtney Nash
on Jan 18, 2024

38:16
DevOps

How Did It Make Sense at the Time? Understanding Incidents as They Occurred, Not as They are Remembered

Jacob Scott explores the basics of failure in complex systems, the theory and practice of how it made sense at the time, and actions to take.

Jacob Scott
on Sep 14, 2023

38:15
DevOps

Rethinking Reliability: What You Can (and Can't) Learn from Incidents

Courtney Nash discusses research collected from the VOID, challenging standard industry practices for incident response and analysis, like tracking MTTR and using RCA methodology.

Courtney Nash
on Aug 31, 2023

38:24
Culture & Methods

Incidents, PRRs, and Psychological Safety

Nora Jones discusses the context around PRRs and provides takeaways on how one can improve production reliability.

Nora Jones
on Jul 22, 2022

39:10
