Incident Response Content on InfoQ
-
An Incident Story: Tips for How Staff+ Engineers Can Impact Incidents
Erin Doyle discusses a critical three-day incident, the key opportunity she missed to help prevent it, and how to build a culture that prevents similar situations.
-
The Incident Lifecycle: How a Culture of Resilience Can Help You Accomplish Your Goals
Vanessa Huerta Granda describes how to apply resilience throughout the incident lifecycle in order to turn incidents into opportunities, looking at real-life examples.
-
Two Years of Incidents at Six Different Companies: How a Culture of Resilience Can Help You Accomplish Your Goals
Vanessa Huerta Granda looks at real-life examples of companies she has worked with that chose to invest in improving their incident programs and have seen the investment pay dividends.
-
Comparing Apples and Volkswagens: the Problem with Aggregate Incident Metrics
Courtney Nash presents data from the Verica Open Incident Database (VOID) to demonstrate how aggregate incident metrics such as MTTR aren't representative of systems' resilience.
-
How Did It Make Sense at the Time? Understanding Incidents as They Occurred, Not as They are Remembered
Jacob Scott explores the basics of failure in complex systems, the theory and practice of how it made sense at the time, and actions to take.
-
Rethinking Reliability: What You Can (and Can't) Learn from Incidents
Courtney Nash discusses research collected from the VOID, challenging standard industry practices for incident response and analysis, like tracking MTTR and using RCA methodology.
-
Incidents, PRRs, and Psychological Safety
Nora Jones discusses the context around PRRs and provides takeaways on how one can improve production reliability.
-
Incident Analysis: Your Organization's Secret Weapon
Nora Jones discusses how to move faster and focus on the things that matter by using incident analysis.
-
More More More! Why the Most Resilient Companies Want More Incidents
John Egan discusses how companies of any scale can improve their understandability by lowering their barriers to incident reporting and simplifying their processes for documenting postmortems.
-
Lessons from Incident Management and Postmortems at Atlassian
Jim Severino shares what worked (and didn't work) in incident management and postmortems at Atlassian.
-
Preparing for the Unexpected
Samuel Parkinson talks about how the Financial Times manages incidents and what they are doing to make it a sustainable process.
-
How Many Is Too Much? Exploring Costs of Coordination During Outages
Laura Maguire shows how resilient performance is directly tied to coordination, and examines problematic elements of an Incident Command System, using case study examples.