Failure Content on InfoQ
-
How Did It Make Sense at the Time? Understanding Incidents as They Occurred, Not as They are Remembered
Jacob Scott explores the basics of failure in complex systems, the theory and practice of how it made sense at the time, and actions to take.
-
Managing the Risk of Cascading Failure
Laura Nolan discusses some of the mechanisms that cause cascading failures, what can be done to reduce the risk, and what to do if there is a cascading failure situation.
-
Culturing Resiliency with Data: a Taxonomy of Outages
Ranjib Dey gives an overview of how outages at Uber over the past few years were categorized by root cause type.
-
Failing over without Falling over
Adrian Cockcroft shows how to use System Theoretic Process Analysis (STPA), as advocated by Professor Nancy Leveson’s team at MIT, to analyze failover hazards.
-
#FAIL
Kevlin Henney keynotes on some of the failures that people had in various projects and the lessons to be learned from them.
-
Rules in Agile Transformation: 80/20 and “Not Everybody Likes to Dance”
Zbigniew Piecuch discusses why some teams do not manage to master Agile.
-
What Breaks Our Systems: A Taxonomy of Black Swans
Laura Nolan talks about Black Swan events - unforeseen, unanticipated, and catastrophic incidents - that may happen in production and can take the system down.
-
How Did Things Go Right? Learning More from Incidents
Ryan Kitchens describes more rewarding ways to approach incident investigation without overly focusing on failure prevention.
-
How Condé Nast Succeeds by Building a Culture that Embraces Failure
Crystal Hirschorn talks about what her teams learned and adapted for their platforms at Condé Nast while building a culture that embraces failure through Chaos Engineering practices.
-
Building Resilient Serverless Systems
John Chapin explains how to use serverless technologies and an infrastructure-as-code approach to architect, build, and operate large-scale systems that are resilient to vendor failures.
-
An Engineer's Guide to a Good Night's Sleep
Nicky Wrightson gives some practical insight into how to handle failure in today's more complex distributed microservice systems.
-
Towards Specifications of Robustness - the Things That Programs do _not_ do
Sophia Drossopoulou discusses "holistic specifications", an extension of traditional program specifications that support the expression of robustness properties through spatial and temporal features.