
Chaos and Resilience Engineering: Mental Models, Tools and Experiments

In a recent InfoQ podcast, Nora Jones, co-founder and CEO at Jeli, explored the differences between chaos engineering and resilience engineering. She also provided advice for planning and running effective chaos experiments, and discussed how to learn effectively from incidents.

Jones began the conversation by summarising the key takeaways from the recent QCon London track she hosted, Chaos and Resilience: Architecting for Success. The chaos engineering and resilience engineering fields, although inextricably linked, are often incorrectly conflated. One of the primary goals of this QCon track was to help attendees “understand the relationship between those two worlds, and also to provide real world examples.”

Diving deeper into the potential source of confusion surrounding the two topics, Jones quoted Sidney Dekker’s definition of resilience engineering:

Resilience engineering is about identifying and then enhancing the positive capabilities of people in organizations that allow them to adapt effectively and safely under varying circumstances. Resilience is not about reducing negatives or errors.

She continued by stating that the resilience engineering framework and methodology certainly influenced chaos engineering and the principles behind this domain. However, one of the current challenges is that the discipline of chaos engineering often overly focuses on tooling and the execution of experiments.

Tooling is obviously important, but it must be viewed within the larger context of usability and of helping teams build mental models of the underlying systems. Engineers who create these tools often overlook the value of user experience (UX), or lack the relevant skills in user research and design. This can lead to a tool going unused, or to counter-intuitive outcomes being observed.

I had interviewed someone on a team, we're [both looking at the chaos experimentation tool] together, doing some basic user research, and I just have them use it. And there's this label on the form that's, "How much SPS do you want to impact?" And SPS is Netflix's heartbeat metric. It's the key metric. If SPS is too high or too low, people get paged, it's a big thing in the organization. And so apparently seeing that on a form is very scary to people.

Jones also stated that the work before and after running a chaos experiment is as important as the experiment itself. However, planning, creating effective hypotheses, and analysing and disseminating the results are often under-resourced.
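As a rough illustration of this point, a chaos experiment can be modelled so that the hypothesis, the steady-state check, and the recording of observations are explicit parts of the run rather than afterthoughts. The sketch below is hypothetical and not tied to any particular chaos engineering tool; all names and functions are illustrative.

```python
# Hypothetical sketch: a chaos experiment where planning artefacts
# (hypothesis, steady-state check) and analysis artefacts (observations)
# are first-class, not just the fault injection itself.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class ChaosExperiment:
    hypothesis: str                    # what we expect to hold during the fault
    steady_state: Callable[[], bool]   # measurable check that the system is healthy
    inject_fault: Callable[[], None]   # the disruption itself
    rollback: Callable[[], None]       # always restore the system afterwards
    observations: List[str] = field(default_factory=list)

    def run(self) -> bool:
        # Abort if the system is not healthy before we start.
        if not self.steady_state():
            self.observations.append("steady state not met; experiment aborted")
            return False
        try:
            self.inject_fault()
            holds = self.steady_state()
            self.observations.append(
                "hypothesis held" if holds else "hypothesis refuted"
            )
            return holds
        finally:
            self.rollback()  # clean up even if the check raised

# Usage with stubbed callables; a real experiment would query live metrics.
state = {"healthy": True}
exp = ChaosExperiment(
    hypothesis="Killing one cache node does not break reads",
    steady_state=lambda: state["healthy"],
    inject_fault=lambda: None,  # stub: no-op fault injection
    rollback=lambda: None,
)
print(exp.run())  # → True (hypothesis held under the stubbed fault)
```

The point of the sketch is that the observations list feeds the post-experiment analysis and dissemination that Jones argues is so often under-resourced.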

We all work in socio-technical systems, and it is important to take the time to understand both the social and the technical aspects. Developing empathy for, and working alongside, the teams you are trying to influence is essential, as is continually working to build correct “mental models” of a system.

Incident analysis can be a catalyst to help you understand more about your system. The Learning from Incidents website, alongside books such as Sidney Dekker’s The Field Guide to Understanding Human Error and Scott Snook’s Friendly Fire, can provide excellent background information to these topics.

For a further exploration into all of the topics covered in the podcast, Casey Rosenthal and Nora Jones have recently co-authored an O’Reilly book, titled “Chaos Engineering: System Resiliency in Practice”.

The podcast audio, including show notes and a full transcript, can be found in the accompanying article, "Nora Jones on Resilience Engineering, Mental Models, and Learning from Incidents".
