Continuous Learning as a Tool for Adaptation


Key Takeaways

  • Engineering leadership should emphasize learning over action items to maximize the value of time spent in post-incident activities.
  • When space and time are given to learning after an incident takes place, action items become more likely to be completed, more collaborative, and more productive.
  • Performance improvement is best achieved by increasing and disseminating insights, not just reducing errors.
  • Asking questions after incidents using techniques like Cognitive Interviewing can increase insights and allow post-incident meetings to be more worthwhile.
  • Developing a learning organization is a competitive advantage, and leaders who study, encourage, and understand what it means to develop this kind of organization are well-positioned to achieve stronger collaboration and team dynamics under pressure.

To most software organizations, Covid-19 represents a fundamental surprise: a dramatic surprise that challenges basic assumptions and forces a revision of one’s beliefs (Lanir, 1986).

While many view this surprise as an outlier event to be endured, this series uses the lens of Resilience Engineering to explore how software companies adapted (and continue to adapt), enhancing their resilience. By emphasizing strategies to sustain the capacity to adapt, this collection of articles seeks to more broadly inform how organizations cope with unexpected events. Drawing from the resilience literature and using case studies from their own organizations, engineers and engineering managers from across the industry will explore what resilience has meant to them and their organizations, and share the lessons they’ve taken away. 

The first article starts by laying a foundation for thinking about organizational resilience, followed by a second article that looks at how to sustain resilience with a case study of how one service provider operating at scale has introduced programs to support learning and continual adaptation. Next, continuing with case examples, we will explore how to support frontline adaptation through socio-technical systems analysis before shifting gears to look more broadly at designing and resourcing for resilience. The capstone article will reflect on the themes generated across the series and provide guidance on lessons learned from sustained resilience.

 

The Covid Resilience series has highlighted and focused on distilling sources of organizational resilience through learning from unexpected events. In this capstone article, I will explore key themes from each article in the series with a special view on the practicality of organizational resilience for building companies that adapt to surprises and continue to thrive despite uncertainty and unexpected events. Additionally, I will provide practical guidance and recommendations to engineering leadership on how to create this investment, based upon what I’ve seen in the field during my experiences at Netflix, Slack, Jet.com, and now as CEO and founder of a company (Jeli.io) that is focused on helping organizations uncover their resilience through learning from incidents. In 2019, I founded the Learning From Incidents community in software engineering to create a space that brings together engineers interested in resilience engineering to share stories, initiatives, struggles, and successes in coping with complexity and surprise in their organizations. Through these experiences, I’ve worked both as a frontline responder during incidents and in a consultative capacity across a wide variety of organizations, and much of this article is a synthesis of themes I’ve seen throughout the industry over the years. While the four articles before this piece focused on the how and the why of enhancing organizational resilience through incidents for individual contributor audiences, I aim to gear this piece towards the what and speak directly to leadership of technical organizations to provide specific recommendations.

Looking at your organization’s ability to adapt and function under uncertain conditions and events can tell you a great deal about how your organization actually works versus how you think it works. This makes it especially important for leadership in technical organizations to provide the time and the space for that examination. In this article, we will focus on just how that time and space can be given and how it should be leveraged.

Leadership Call to Action

Like many leaders in software engineering, I was first an SRE, then a line manager, and now an executive leader. Throughout this progression, I’ve understood the need to have a pulse on what’s going on, where time is being spent, and where things are difficult, in order to shift organizational focus in positive directions that enable progress and upward momentum. However, I’ve noticed that the further up the organizational chain you are, the harder it is to keep a handle on what is actually going on, where the gaps are, what is leading to disruption, and what organization-related issues are costing. Yet, it is your job to enable an environment where you can easily see and surface these things, so they can be aided, enhanced, or fixed.

And this tension between needing to know about operational realities and facing increasing difficulty in knowing about operational realities impacts leaders’ ability to help cultivate and support resilient teams.

As article 4, Designing & Managing for Resilience, notes:

When leaders don’t have a front row seat to the events unfolding, the fidelity of the information received goes down. They can end up with filtered versions of what is getting discussed amongst their responders, leaving them less well calibrated to the capacities and needs of their team both during the incident, but also in making more strategic goals down the road. [1]

Ultimately, as leaders, we want to ensure things are improving and that we are enhancing and supporting the capabilities of the most important part of our system: the people. By continuing to invest in these capabilities, we enable the people in our organization’s system to grow and operate together in ways that allow this resilience to be sustained, even as people and pieces of technology inevitably enter and exit the system. [2]

It’s part of our duty as leaders in these organizations to create an environment where we are constantly enhancing the capabilities of the system, and allowing time for our people to build the capability to understand it, especially as it grows more complex over time. Which it inevitably will. We can’t prevent complexity. What we can do is acknowledge it is there and allow space for our people to understand it.

Over the last 18 months I’ve worked closely with companies that are attempting to create a more learning-centric organization, and I’ve compiled a series of common questions asked by leaders as they seek to move away from “fixing” a team whose engineers feel as though they are continuously fighting fires, towards supporting a team that can adapt to the inevitable complex surprises they’ll face.

Our company has always emphasized identifying & completing Action Items to support performance improvement. How do I show what's actually getting improved when focusing on learning?

Taking a ‘Learning from Incidents’ approach can be a challenging shift for many organizations. It’s a different way of understanding, and addressing, organizational problems. Ironically, resistance to this shift is typically a signal that the approach is needed to generate momentum in continual improvement. Taking a learning-focused approach to incidents is central to Resilience Engineering and has its roots in the successes of many high-demand work environments.

Gary Klein, in particular, is a research psychologist known for studying experts and expertise in their natural environments (e.g. firefighters at a fire). Learning more about his work, and how he’s done it, has enabled me to explore ways to surface the expertise of engineers in the various organizations I’ve worked in, and subsequently distill that knowledge to others in the organization, building up more expertise and ultimately improving the performance of the system. In his book, Seeing What Others Don’t, he notes that performance improvement in organizations is the result of a combination of error reduction plus insight generation. The implication is that we can’t actually improve the performance of our systems unless we also focus on generating and disseminating insights throughout the organization. [3]

[Figure: performance improvement = error reduction + insight generation. Source: Seeing What Others Don’t, Klein]

This equation makes sense the more you think about it! So why, as an industry, do we focus more on the error reduction portion? Why are we constantly focusing on the number of incidents, MTTR, MTTD, and open action items? Those metrics all target error reduction without providing any information on the insights generated. It’s no surprise that capturing those things doesn’t actually direct our attention to areas of the system that need to be improved, but instead keeps us “fixed” in a fixing and fire-fighting mindset.

We focus on the error-reduction metrics above because they’re simply much easier to measure. [4] It’s difficult to measure insights generated, and it’s much more difficult to measure insights disseminated. It’s also difficult to actually do both of those things in an effective way.
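
To make the asymmetry concrete, here is a minimal sketch, assuming hypothetical incident records and field names (not any particular tracker’s schema), of how little it takes to compute MTTD and MTTR from timestamps you already collect:

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records; real ones would come from your incident tracker.
incidents = [
    {"started": datetime(2021, 3, 1, 9, 0),
     "detected": datetime(2021, 3, 1, 9, 12),
     "resolved": datetime(2021, 3, 1, 10, 5)},
    {"started": datetime(2021, 3, 8, 22, 30),
     "detected": datetime(2021, 3, 8, 22, 33),
     "resolved": datetime(2021, 3, 9, 0, 10)},
]

# MTTD: mean time from start to detection; MTTR: mean time from start to resolution.
mttd = mean((i["detected"] - i["started"]).total_seconds() / 60 for i in incidents)
mttr = mean((i["resolved"] - i["started"]).total_seconds() / 60 for i in incidents)

print(f"MTTD: {mttd:.0f} minutes, MTTR: {mttr:.0f} minutes")
```

Nothing in those two numbers tells you what was learned, who learned it, or whether that knowledge reached anyone beyond the responders -- the half of the equation the dashboards leave out.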

This is not to say that fixing things isn’t important. Of course performance can’t improve if you don’t look at both sides of the equation, but it doesn’t help us to measure open action items and attribute a team’s success to their completion rate. Completed action items aren’t always a signal that things are getting better, and incomplete action items aren’t always a signal that people aren’t working hard. They are only part of the story and part of the equation.

Completing action items is typically regarded as a standard measure of organizational improvement. I want to offer a challenge to leadership, though: how do you know your action items are actually good, worthy of being completed, and focused on things you should be doing, rather than just a reaction to the last “bad thing that happened”? After an incident, there is a tendency in software organizations to rush to find answers, specifically when an answer is owed to a customer about exactly what happened. That rush and pressure only increase when the incident in question has a higher impact on the customer, or on the bottom line.

As a leader, there are tangible ways to support the shift from error reduction to insight generation. A leadership role is uniquely positioned to change the culture around where teams focus their attention: on generating action items, or on sharing knowledge and learning.

For example, the kinds of questions you ask when attending incident reviews show your team that you are interested in your operations as they are, not as they should be. This subtle shift helps you understand how the event unfolded as it did and why it made sense for people to respond as they did. This is a deeper form of inquiry that aims to see the bigger picture around how the incident was handled, and it allows for discussions that begin to surface the kinds of details that are critical for supporting resilient performance.

Similarly, being interested and curious about how your company supports people new to a system, and how they approached the problem in front of them, shows you are invested in your engineers being well oriented to handle the problems they’ll inevitably face.

Lastly (and perhaps contentiously), separate the action items from the incident review -- people need “soak time” to come up with things that are actually valuable. Items generated very close to the date of the actual incident (just for the sake of coming up with action items) are not likely to be of high value and are usually done to “check a box” that action items got recorded. The box-checking action items are typically the ones that do not get completed. Or, when they do, they are not activities that are meaningful to preventing future incidents or to handling new issues.

Over time, focusing on better individual incident reviews will lead to more performance improvement. The added benefit is that your team begins to trust that they can identify problematic systemic organizational issues or knowledge gaps without being called out for it.

Ok, I'm interested in switching to learning instead of fixing. What will this mean for how I lead my engineers?

I’d encourage you to take a look at the typical tenure in your organization and at what point people leave, ultimately taking their expertise with them. What happens when that expertise leaves your organization? There will be a knowledge gap [5], and incidents will increase in complexity due to over-reliance on a particular person and their expertise, rather than on distilling their knowledge to those around them. I don’t mean that the expert in question should be the one focusing on leveling up those around them; research shows that experts are fundamentally not great at understanding their own expertise and explaining it to others [6]. So why do we put so much pressure on experts to train others when they may not be able to do it? Fortunately, there is a way around this -- by investing in folks in your organization leveling up their cognitive interviewing skills, you can unearth this expertise in a way that doesn’t put all the stress on the expert, because let’s face it: they’re busy!

Preparing for unforeseen events is one way to enhance expertise. [7] Certainly, we can’t prepare for every unforeseen event. However, if we understand how we reacted to past unforeseen events, it strengthens our capability to do better in the future.

Chaos Engineering is a discipline that emerged from Netflix in 2011 as a way of preparing the system for the aforementioned unforeseen events. The idea behind it was to inject change into the system in a safe way, allowing operators to test the ability of the system (including themselves) to adapt under changing conditions that would eventually occur in production anyway. In article 1, Meeting the challenges of disrupted operations, Laura Maguire highlights some of the ways Incident Analysis and Chaos Engineering are inextricably linked and can be used to enhance each other in organizations.

Change is a constant. The answer to safer operations is not to attempt to stop change; it’s to design better practices and instrument your systems with better tooling and monitoring, so your engineers are well calibrated and equipped to adapt safely when required. [8]

However, focusing too much on unforeseen conditions (Chaos Engineering) can distract our focus from capitalizing on events that have already occurred. As Alex Elman notes in article 2, Shifting Modes: Creating a Program to Support Sustained Resilience, “no one could’ve predicted the pandemic” [9]. Is it useful for organizations to simulate events like all of us suddenly needing to go remote at the exact same time? No. By focusing on where we need to learn more, we learn where our weaknesses lie, and what we’re actually really good at.

Through my experience implementing Chaos Engineering in various organizations, I’ve seen a common problem of focusing on it too early (read: without investing time in understanding where our problem areas are and what we are actually doing well at), and being gravely disappointed in the return on the investment. There have been substantial advancements in Chaos Engineering tooling over the years [10] as the practice has emerged and strengthened. However, the focus of this tooling is often on being able to safely inject failure while mitigating the blast radius of the injection. The advancement in this tooling has been quite remarkable; however, the focus on the safety bounds of failure injection has detracted from the primary benefit of Chaos Engineering: refining mental models and improving our system’s ability to react.
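
As a conceptual sketch only -- not the API of any particular chaos tool, with names like BLAST_RADIUS and maybe_inject_latency invented for illustration -- a failure-injection experiment typically pairs the injected fault with an explicit limit on how much traffic it can touch and an operator-controlled abort switch:

```python
import random
import time

# Hypothetical knobs; real chaos tooling exposes equivalents under its own names.
BLAST_RADIUS = 0.01          # fraction of requests eligible for fault injection
EXTRA_LATENCY_SECONDS = 2.0  # the injected fault: added latency
ABORT = False                # flipped by an operator or an automated guardrail

def maybe_inject_latency(handler):
    """Wrap a request handler so a small sample of requests is delayed."""
    def wrapped(request):
        if not ABORT and random.random() < BLAST_RADIUS:
            time.sleep(EXTRA_LATENCY_SECONDS)
        return handler(request)
    return wrapped

@maybe_inject_latency
def handle(request):
    return f"handled {request}"

print(handle("checkout-123"))
```

The sampling rate and the abort switch are the safety machinery; the learning only happens afterwards, when the team compares what they expected to happen with what actually happened.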

Incident Analysis allows us to capitalize on unforeseen events that have already occurred. There is no reason to let a good incident go to waste when you’ve already made the investment. If after an incident you find action items around “training”, “runbooks outdated”, or “lack of …”, you are missing out on the value a well-run incident review can provide. Put differently, adding these as action items on your postmortems means they might get done, yes -- but who does them? Typically, this work falls on The Expert: someone who is costly and scarce within the organization. Understanding who your experts are, and just how often you rely on them when they’re not “supposed” to be there (i.e. they are not on-call), can point to islands of knowledge in your organization and areas where you need to enhance expertise.

In article 2, Alex Elman notes that the following happens when a company is doing well (revenue-wise): new employees will be hired and trained to account for the increase in workload, unforeseen threats will act upon the system, and inevitably, there will be more incidents. It’s a great thing when the demand for our product increases; it is a signal that our businesses are growing and successful. However, as this happens, the definition of what makes an incident (as noted in article 3, Adaptive Frontline Incident Response: Human-Centered Incident Management [11]) often adapts and changes too. If I went to 5 different engineers in your company at this moment and asked each of them to define what makes an incident at your company (without pulling up a document outlining severity levels), how many different answers would I receive? Possibly 6!

The likelihood that your own engineers would respond in a similarly varied fashion is an indication that being able to continually refine mental models is a core competency of a resilient organization. As such, efforts to improve learning need to target those whose alignment is most critical: the end user (the engineer).

How do you make investing in learning relevant to your end users? 

It’s not an inconsequential thing to ask leaders to create the conditions for end users to be able to learn in real time, but it’s crucial for making a difference in the value your team extracts from incident analysis.

As Elman notes in article 2, a poorly run post-incident meeting can be very expensive, especially if it is an hour long and feels like a presentation of the events unfolding rather than another point of data collection, with no focus on collaboration about what happened and how it happened. These meetings only feel expensive when the facilitator is presenting instead of facilitating. When this happens, it can feel demoralizing, process-heavy, and a waste of time. I completely understand why organizations want to spend less time on these meetings based upon how they are sometimes run today; run that way, they are a poor investment of your team’s valuable time. However, the ROI of a good one is undeniable.

As an industry, and in how we are trained as software engineers, there is a gap in our training -- how to ask the right questions after an incident occurs, and what to focus on in a review so that we capitalize on the right action items and level up expertise in the right places.

If you’re not taking the time to learn from these incidents, the conditions that led to them in the first place are going to continue to be present, and lead to more incidents.

Running post-incident reviews in an effective way (a way that enables the most return on investment) can feel uncomfortable at first, mostly because we’re not used to asking these kinds of questions. 

New ways to ask questions that get impactful answers

There is a tendency for organizations to over-focus on incident response instead of investing in learning where some of the issues lie, which can often lead to disappointing results. Cognitive questioning is a technique that can be used to focus on fruitful areas after an event has occurred, to uncover expertise in the system and point to why exactly it made sense for a person to do the thing they did. We want to ask these types of questions because if it made sense for one person to do something that didn’t make sense in hindsight, it’s probably going to make sense for someone else to do the same thing. By shifting the focus to understanding, instead of chalking it up to “needing more training”, we allow the person to explain in a safe way what exactly was going through their head at the time.

Cognitive questioning can point to new topics to continue exploring: relevant ongoing projects, related incidents from the past, past experiences in the organization. Surfacing this kind of detail about the ways in which everyday operations influenced incident handling can ultimately lead to better action items.

As a leader, encourage your investigators or postmortem facilitators to bring together multiple, diverse perspectives on the incident. While those involved in the incident are a good starting point, you’ll amplify the learning when you ask them to include other SREs or engineers who may need to deal with that component or part of the system in the future, impacted stakeholders, customer service agents, and other supporting roles not traditionally included in a review. By asking your team to think broadly about who is ‘involved’ in the incident, you extend the learning as more roles become aware of how the system operates, where the expertise lies, and how to work effectively together under pressure.

You can support these efforts by working with peer leaders to give your teams time to sit in on each others’ incident reviews, sharing knowledge and enhancing future coordinative work. Giving engineers time to spend on developing their knowledge sends a signal that it is important and provides a forum for knowledge exchange to take place.

As article 4 notes:

“To enhance transparency and well calibrated flows of information across roles in an organization, there must be mutual trust and sufficient psychological safety to share information without fear of reprisal. Engineering leaders can move beyond merely paying lip service to the concept of learning by enabling two other core functions for their teams - by maintaining slack and creating lightweight feedback loops.”

As you move toward a learning-centric organization, you will see teams effectively have difficult discussions “in public”. This can be a signal that your efforts are making sustained change that will support their ability to adapt to future surprises more fully than a laundry list of action items can.

How do you know your team/organization is learning?

Ultimately, leaders want to measure progress and performance improvement (not just error reduction), and ensure the focus does not deviate too far from what provides your business revenue.

So how do you know? In short, you ask your engineers. 

There are two simple metrics you can use when moving towards more learning-oriented reviews. Send out an anonymous survey that asks:

  • Did you feel this meeting was a good use of your time?
  • Do you feel more confident in your knowledge about the different pieces of the system that interacted to enable this incident to occur?

These metrics recognize that what is central to fast, effective incident resolution is having capable, competent engineering teams. Asking these questions empowers engineers to invest their time in their own development because their capacity to cope with future problems is directly related to helping your organization improve its incident handling. 
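
If it helps to make the tracking concrete, here is a minimal sketch (the survey format and response values are hypothetical) of turning those two yes/no questions into a trend you can watch review over review:

```python
from collections import Counter

# Hypothetical anonymous responses collected after one incident review.
# Each tuple answers: (good use of time?, more confident in system knowledge?)
responses = [(True, True), (True, False), (True, True), (False, True)]

def percent_yes(answers):
    counts = Counter(answers)
    return 100 * counts[True] / len(answers)

time_score = percent_yes([r[0] for r in responses])
confidence_score = percent_yes([r[1] for r in responses])
print(f"Good use of time: {time_score:.0f}% | Increased confidence: {confidence_score:.0f}%")
# Tracked review over review, a rising trend is a signal the meetings are worth the time.
```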

The resilient performance, learning and leadership connection

Throughout this article, I’ve answered the most common questions I’ve encountered in my work as an advocate and change agent to draw the connection between resilient performance, learning and leadership. Your engineers are already adapting to the surprises they face - sometimes gracefully, sometimes poorly. I’ve outlined some steps for you as a leader in an organization faced with increasing demands for reliability and resilience to support these adaptations while still focusing on the kind of performance your company needs to stay competitive. 

In closing, I’ll leave you with some action items to help your journey. 

As we discussed in this article, action items are important -- however, they need soak time in order to provide value. Here are some directions you can start with to enable insight generation and dissemination, and to know that it is actually working:

  • Take a look at your post-incident process. Is it focused on learning, or is it more about checking boxes?
  • Take note of whether people are reading incident reports or merely filing them.
  • Look for normative language in your incident reviews and post-incident meetings.
  • Delete action items from your ticket board that haven’t been dealt with in 5 months; if they’re important enough, they’ll pop up again (a rough filtering sketch follows this list).
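
The five-month cut in that last item can be as blunt as filtering open items by age. A minimal sketch, assuming a hypothetical export of action items from your ticket tracker (field names invented for illustration):

```python
from datetime import datetime, timedelta

# Hypothetical export of open post-incident action items from your ticket tracker.
open_action_items = [
    {"key": "AI-101", "summary": "Update runbook for cache failover", "created": datetime(2021, 1, 4)},
    {"key": "AI-142", "summary": "Add alert on queue depth", "created": datetime(2021, 6, 20)},
]

cutoff = datetime.now() - timedelta(days=150)  # roughly 5 months
stale = [item for item in open_action_items if item["created"] < cutoff]

for item in stale:
    # Candidates for closing; if they matter, they will resurface in a future review.
    print(item["key"], "-", item["summary"])
```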

Developing a learning organization is a competitive advantage, and leaders who study, encourage, and understand what it means to develop this kind of organization are well-positioned to achieve stronger collaboration and team dynamics under pressure. This means that incidents might not be so detrimental; they might just be another kind of event that your people and business can cope with and adapt to.

About the Author

Nora Jones is the CEO and founder of Jeli.io, and a former engineer with a passion for people and reliable software, as well as the intersection between those two worlds. She truly believes that safety is pivotal to software development today. She co-wrote two O’Reilly books on Chaos Engineering and how a product’s availability can be improved through intentional failure experimentation.

She has also shared her experiences helping organizations large and small reach crucial availability goals, and in November 2017 she keynoted at AWS re:Invent to share these experiences with an audience of ~40,000 people, helping kick off the Chaos Engineering movement we see today. Since then she has keynoted at several other conferences throughout the world, highlighting her work on topics such as Resilience Engineering, Chaos Engineering, Incident Management, Site Reliability, and more from her work at Netflix, Slack, and Jet.com.

References

  1. Maguire, L. (2020). Designing & managing for resilience. InfoQ.
  2. Woods, D. D. (2018). The theory of graceful extensibility: basic rules that govern adaptive systems. Environment Systems and Decisions, 38(4), 433-457.
  3. Klein, G. (2013). Seeing what others don't: The remarkable ways we gain insights. Public Affairs.
  4. Wears, R. L. (2008). The error of counting “errors”. Annals of Emergency Medicine, 52(5), 502-503.
  5. Johannesen, L., Sarter, N., Cook, R., Dekker, S., & Woods, D. D. (2012). Behind Human Error. Ashgate Publishing.
  6. Klein, G. A. (2017). Sources of power: How people make decisions. MIT Press.
  7. Weick, K. E., & Sutcliffe, K. M. (2001). Managing the Unexpected (Vol. 9). San Francisco: Jossey-Bass.
  8. Maguire, L. (2020). Meeting the challenges of disrupted operations. InfoQ.
  9. Elman, A. (2020). Shifting Modes: Creating a Program to Support Sustained Resilience. InfoQ.
  10. Rosenthal, C., & Jones, N. (2020). Chaos Engineering. O'Reilly Media.
  11. Rupee, E. & MacDonald, R. (2020). Adaptive Frontline Incident Response: Human-Centered Incident Management. InfoQ.
