Designing & Managing for Resilience

Key Takeaways

  • To extend organizational resilience capabilities, leaders should think about developing strong networks, not just strong teams
  • Resilient performance is both designed into the system of work and managed for across the engineering leaders’ own network
  • Engaging multiple, diverse perspectives that span layers and roles in the organization enhances real-time adaptability
  • For cross-functional perspectives to effectively collaborate, there must be well-established common ground across roles 
  • Leaders who want to support resilient performance are relentless in their pursuit of learning

To most software organizations, Covid-19 represents a fundamental surprise - a dramatic surprise that challenges basic assumptions and forces a revision of one’s beliefs (Lanir, 1986).

While many view this surprise as an outlier event to be endured, this series uses the lens of Resilience Engineering to explore how software companies adapted (and continue to adapt), enhancing their resilience. By emphasizing strategies to sustain the capacity to adapt, this collection of articles seeks to more broadly inform how organizations cope with unexpected events. Drawing from the resilience literature and using case studies from their own organizations, engineers and engineering managers from across the industry will explore what resilience has meant to them and their organizations, and share the lessons they’ve taken away.

The first article starts by laying a foundation for thinking about organizational resilience, followed by a second article that looks at how to sustain resilience with a case study of how one service provider operating at scale has introduced programs to support learning and continual adaptation. Next, continuing with case examples, we will explore how to support frontline adaptation through socio-technical systems analysis before shifting gears to look more broadly at designing and resourcing for resilience. The capstone article will reflect on the themes generated across the series and provide guidance on lessons learned from sustained resilience.


Given the scale, complexity, and speed at which modern IT systems operate, surprises are an inevitable part of managing digital infrastructure. Ongoing innovation, changes in company priorities and the introduction of new technology into the stack mean that engineers who work on continuously available services are in a constant state of learning and adapting. Because of this, well-calibrated leaders make continuous investments in supporting their teams to safely adapt under conditions of uncertainty and time pressure.

Several studies (Allspaw, 2015 [1]; Grayson, 2018 [2]; Maguire, 2020 [3]) have closely examined how software engineers respond to surprising service outages. While the authors may have stopped short of explicitly calling the practices they observed resilience techniques, those practices represent classes of strategies used by engineers to sustain resilience. Less studied, however, are the strategies used by engineering leaders to help create the conditions for sustained resilience. This article begins to do that.

For this article, I had far-ranging conversations with five engineering leaders who work across four organizations of varying sizes and stages - from a securities exchange that launched just last year to one of the world’s most recognizable blue chip companies. Each leader possesses deep technical expertise accumulated from years spent as an individual contributor. The interviews were centered around a core series of questions aimed at eliciting stories, examples and strategies of their approaches towards two aspects of their role: 1) designing an organizational structure to support resilient performance (such as how teams should be structured or how to support coordination with non-engineering business functions) and 2) managing for resilience (the leader’s role in helping engineering teams prepare for, and cope with, surprise events).

The discussions converged into three key propositions for engineering leadership to support resilient performance in their organizations.

Proposition #1: For resilient organizations, think in terms of networks, not just teams.

Proposition #2: Resilient networks depend on active and ongoing grounding across different levels of the organization.

Proposition #3: Resilience depends on learning.

Of course, context matters greatly in the approaches to supporting resilience. What follows is not intended to be a prescription for leaders, but rather a set of thought-provoking propositions intended to prompt consideration of how an organization’s current practices and structures may be impeding or enhancing the adaptive capacity of its teams.

Introducing the leaders

These leaders were chosen to provide reference points across a variety of organizations - at different ages, stages and services - to explore common patterns across contexts.  

Crystal Hirschorn has been thinking about and experimenting with how to support her team’s resiliency for the last 13 years. Previously, she was VP of engineering at Condé Nast, overseeing its global engineering operations. Prior to that she was a principal engineer, so she has extensive knowledge and experience as an individual contributor, which deeply informs how she manages her teams. In her current role, she is the director of engineering overseeing infrastructure, SRE and AppSec at Snyk, a developer-first security start-up.

Zoran Perkov knows a thing or two about adapting to changing events. He is the CEO of the Long Term Stock Exchange (LTSE) and led a cross-functional team that built and launched a national securities exchange for long-term investors during a pandemic.  Zoran has presented internationally at the Resilience Engineering Association symposium on digital services resilience. 

Joe Kondel, VP of engineering for LTSE, has an impressive pedigree of his own, with over 20 years of experience building teams and the complex systems they are responsible for. He’s been involved with helping NASA return the Space Shuttle to flight, people watch Game of Thrones premieres on HBO GO, NASDAQ reliably operate three US equity exchanges, IEX build and run its equity exchange, and Carta launch its CartaX ATS.

Ariel Wei is a manager at Unity Technologies overseeing US-based SRE and infrastructure teams. She was previously a lead network engineer at Salesforce and the product owner of its Global Network Operations team. At Unity, she works with a highly collaborative team of management peers tasked with helping the company refine its Site Reliability Engineering program.

David Leigh has held a multitude of roles within one of the world’s largest multinational software companies and now works as a distinguished engineer with IBM’s office of the CIO. He was instrumental in the formation of the Whitewater team (now Toolbox) which brought modern DevOps practices and tooling to IBM.  Leigh was part of the founding core of the SNAFU Catcher Consortium for digital resilience.

Proposition #1: For resilient organizations, think in terms of networks, not just teams.

Contrary to what might be expected, engineering leaders concerned with resilient performance were not focused solely on their own teams’ abilities. Instead, they saw their teams as part of a broader network within the organization whose success is linked to the performance (or struggles) of peer teams. An engineering team can never be entirely buffered from the pressures, constraints, aspirations and initiatives of others. Its ability to meet its own goals relies on being able to work effectively across organizational boundaries. Therefore, a leader focused on resilience thinks in terms of an organizational network that can share additional sources of adaptive capacity during periods of high workload.

As mentioned, this theme presented itself in two different ways - designing and managing for resilience.  

Designing for strong networks

Leaders have a unique opportunity to structure interactions across different levels of the organization to enhance resilience.  

At LTSE, development work is carried out by small cross-functional groups led by a Directly Responsible Individual (DRI) who owns the task to completion. A core function of the DRI is to engage others from across the organization to discuss events and issues around the tasks they manage, keeping other relevant disciplines up to speed on the current work and its trajectory. Working across roles and functions within the business pulls multiple, diverse perspectives together to help inform rapid, real-time decision-making while managing the kinds of multi-faceted risks faced when launching a new securities exchange.

Unity’s structure for their site reliability team is also a network model. A centralized SRE group and the teams they aim to support are mutually dependent on one another. An on-call SRE team is dependent on individual dev teams to produce and maintain runbooks for their services. In return, the dev team ostensibly benefits from lowered call volume and can redirect attention to maintenance and feature development.  In theory, this is sharing adaptive capacity across teams to best serve the customer. But to effectively collaborate across team boundaries, dev teams need to regularly share important context about their services (such as recent or upcoming changes or higher than usual user volume) and join in the response for escalating incidents when needed. This requires well-established common ground around the practices that will be used and a willingness from both parties to continually engage in these cooperative efforts. This point will be elaborated further in Proposition #2. 

Cooperation also featured prominently in the discussions around coordinating cross-boundary activity within and across organizations. A 2009 study from a team of researchers at Microsoft found that what can “impact engineers most are not directly technical issues, such as code and APIs, but rather coordination issues.” [4]

Hirschorn recognizes that coordination amongst teams begins with coordination amongst engineering leaders - managers, directors, and the C-suite. However, many organizations still subscribe to a formal hierarchical organizational chart, which raises the threat of individual business units siloing and working out of sync with one another. This creates brittleness at the boundaries between teams: information about highly interdependent systems can be slow to reach, or restricted from reaching, the parties who need it to adjust their own operations, and teams can be reluctant to help one another.

Instead, engineering leaders can think laterally and dynamically about how the organization functions rather than only about how it is organized. They can do this by continually spending time working to understand other parts of the business and strengthening their own network. This means building strong relationships with the people who work in parallel or overlapping business units, specifically looking for opportunities to gain deeper insight into their peers’ organizations, including: what services they run, what problems they might face, where their teams’ strengths lie, who they report to and how that reporting structure works, and what kinds of goals they have set for themselves.

They do this because they think in terms of reciprocal benefits - enhancing collective knowledge about one another’s activities and priorities - which works to serve both teams. By asking their counterparts, “How can my group help you? What are you missing that we can provide?” and by ensuring others know about the activities of their own engineers, these leaders are building a stable basis for common ground.

By maintaining a strong sense of their peers’ goals, priorities and work underway, and by finding opportunities to ensure smooth coordination for both parties, leaders are better able to proactively recognize when new initiatives or needed changes are going to impact others. As Hirschorn says, “When the stakes get high, this matters.” During a high-pressure outage where there is widespread impact, she notes that without a strong foundation to lean on, interactions with other directors or engineering managers “can get fraught and transactional. Tensions can get really high and make things worse.” The prior investment serves to reduce external pressure on the team and can provide an additional level of adaptive capacity to keep outages short and reduce the stress on individual engineers.

Leigh notes that many can misinterpret the Agile principles for teams to be “self directed and focused, owning their own outcomes and do that with minimal dependencies on others. If you take that too far you’ll lose all the sharing across teams that is needed to successfully be adaptive.” The system as a whole performs better when other engineering teams lend their skills, their access to information and their attentional resources during times of overload, instead of creating more pressure by demanding updates, escalating the issue to higher levels of management, or otherwise adding burden. Lending help also increases the likelihood of reciprocity - of “repaying” the team for their help - in future incidents, which sustains adaptive capacity sharing across the organization.

Managing strong networks 

The concept of shared capacity and reciprocity within an organization is more complex than simply directing teams to work together. Many organizations do have cross-functional work teams or attempt to break down organizational silos by rotating executives throughout the business. However, organizations are defined by reporting structures, functional units or product teams, each with its own goals and objectives. In addition, an engineering leader is tasked with setting direction, vision and priorities for their teams for a given quarter or phase of the business lifecycle, which may put them at a different tempo than their counterparts. Systemic and difficult problems that span organizational boundaries can be emergent or continuously changing as different teams make attempts to mitigate the problems within their own scope of authority. This can make it difficult to coordinate clear goals and objectives with peers for inter-organizational initiatives.

Therefore, a function of the resilient leader is to advocate for capacity sharing and reciprocity as part of their team’s goals and priorities. This means ensuring leaders of other divisions buy in to the idea of resourcing teams adequately to enable their people to participate in network-strengthening activities.

This goes back to the earlier point about laying foundations with peer leaders. Hirschorn notes that it takes “some robust conversations with colleagues about the need to dedicate time to this. It's not something their engineers can just do on the side, in their free time. We need to actually give them an allowance of time against this.” This is a provocative statement to make when teams are already under production pressure to ship new features and requires substantial social capital to advocate for these focused, but loosely structured, collaborative initiatives. 

However, building this into the work can alleviate the ever-encroaching production pressure on this capacity. As previously mentioned, LTSE has hardwired capacity sharing across their growing organization. Engineers “work in public,” encouraging high visibility across the organization to promote transparency across multiple stakeholder groups. “Continuous sharing enables multiple perspectives” to work collaboratively, bringing a wider lens to each aspect of development. And to sustain this, LTSE has made adjustments to how work is managed. To keep teams current on one another’s activities - and therefore more quickly able to respond to requests for help - Kondel has his team break work into the smallest units possible. This keeps teams from working on wildly divergent activities - where the cognitive cost of context switching is high - making them more quickly able to redirect to help others and remain open to resource sharing.


At IBM’s massive scale, maintaining interconnectedness can be a challenge. However, failing to do so is a potential source of brittleness that can result in siloing, turf wars and working at cross-purposes. Therefore, IBM has designed for healthy inter-organizational networks by building them into its incentive and promotion structures. Advancing into the company’s senior technical staff member or distinguished engineer roles is possible only by demonstrating that you have developed a strong inter-organizational network.

In this way, Leigh’s role as a distinguished engineer is a strategic executive position that is intrinsically about being the bridge between teams, business units, and work activities, to smooth cross-functional coordination and see the bigger picture of work happening throughout the business. “A common role for managers to play in IBM is to get their teams unblocked and it’s usually through our personal networks,” Leigh notes. During service outages, teams will “explore all the avenues they can but then ask their manager to reach out to their network because their manager usually knows somebody who can help”.

Managing resiliently then is, in part, providing new “vantage points” that can be useful for knowing where potentially useful shared capacity exists within the organization - which teams have complementary skill sets, hidden interdependencies, or similar objectives that could be amplified through collaboration. Engineering leaders can similarly empower others to take different “vantage points” from which to assess the organization, its goals and priorities, and the opportunities that are arising or fading with the shifting conditions by creating roles that aim to “cross-pollinate” by sharing ideas across the organization [5].

One way to do this, of course, is to create opportunities for your engineers to participate in large-scale gamedays, tabletop exercises and chaos experiments that include roles from across the organization. These structured interactions give teams a chance to learn about other parts of the system, others’ capabilities and what tradeoffs to make under pressure - all knowledge needed to help work efficiently during incidents.

A second, less formal method is to provide opportunities for engineers to collaborate in guilds and passion projects - focusing on the organizational problems they may have a particular aptitude for or interest in. Membership in these groups is emergent: because people self-select in, the groups tend to draw a more diverse set of engineers together. An engineering leader still needs to provide some structure, such as helping groups of engineers build the business case to contribute resources to working on a particular set of issues or practices.

Investments here often reap unanticipated benefits - they develop a greater knowledge of the skills available across the network, so engineers know who to call when a particularly finicky part of the system goes down. The connections born from common interests and ongoing collaboration can pay dividends to the organization when those relationships are able to work well under the pressure of a high-profile outage. However, as we will explore in the next section, there are specific nuances of how teams work together that enhance how effectively they do so.

Proposition #2: Resilient networks depend on active and ongoing grounding across different levels of the organization

Designing and managing to create opportunities for multiple, diverse perspectives to collaborate is an important first step to systematically create strongly interconnected networks. It’s well known that enabling cross-functional interactions can increase the flow of information across the organization and allow individuals and teams to think more comprehensively about others’ perspectives and implications when assessing decisions and changes to their own work. 

However, all interacting parties - dev teams, customer support, security, management, etc. - need a baseline for coordination and collaboration. In other words, they need common ground. Common ground is the set of shared assumptions, beliefs and knowledge between parties engaged in joint activity. However,

“Common ground is not a state of having the same knowledge, data, and goals. Rather, common ground refers to a process of communicating, testing, updating, tailoring, and repairing mutual understandings (cf. Brennan, 1998). Moreover, the degree of quality of common ground demanded by the parties can vary due to the particulars of people, circumstances, and their current objectives.” [6]

Creating the conditions for grounding 

To support resilience and adaptive capacity, leaders must explicitly design for the kinds of interactions that can help establish and maintain common ground for new or changing teams or ad hoc groups working together. In part, this is where Zoran and Kondel’s commitment to transparency and “working in public” factors in: by designing work so that it is observable to others in the organization, they are better able to maintain common ground about what is being worked on and at what pace, so that others can adjust their own work to be better prepared for any needed coordination. This observability, coupled with the commitment to sustaining common ground, requires people who are working jointly to invest effort in monitoring each other’s activities.

In a non-collocated world, the affordances of a shared workspace that enable monitoring of shared common ground - being able to see what someone is looking at on their screen or noticing when they are away from their computer - are missing. LTSE has adapted to the fully distributed world the global pandemic created by ensuring there are current and shared artifacts, such as dashboards, and overlapping access to key information sources available to each engineer simultaneously. They’ve also experimented with analogs such as open video feeds, screen sharing and shared real-time monitoring tools in an attempt to provide a proxy for re-establishing common ground.

Continuous grounding of the engineering team, like that at LTSE, is not possible with Unity’s SRE team, which works with a wide variety of development teams. In their incident response model, SREs are paged into an event and begin to manage the incident using available runbooks. Common ground is therefore established in advance, through previously compiled materials. However, when these fail to address the problems faced, the dev team is engaged and the two teams must then work together - underscoring the need for rapid grounding.

Enacting this process in real time, under time pressure and ambiguity, can be challenging. While the team has enacted a variety of techniques to support real-time grounding (spinning up Zoom channels, using incident chatbots and ensuring shared accessibility to needed information), they’ve also recognized the need to design for additional opportunities for maintaining common ground - namely, during post-incident learning processes discussed in proposition #3. 

Managing the flow of information

A central aspect of the leader’s role in grounding across levels of the organization comes in managing the flow of information.  

Leigh describes a company-wide critical initiative that required rapid adaptation from his team and stretched their organizational capabilities to reconfigure and redirect resources to the critical goal. It crystallized for him that his role as a senior leader is to support continuous grounding by facilitating the flow of information. 

In his role, he was privy to a lot of information about the state of change, its velocity and the myriad of things in flight surrounding the effort across business units. For his team, the current state was less clear as the requirements kept changing. Under conditions of change, organizations often restrict the flow of information to avoid sending conflicting messages when things are still being worked out. However, when information about upcoming changes is throttled, this has an unintended effect of disabling your engineers’ ability to anticipate future possibilities and prepare for them - to be adaptive. 

Hirschorn echoed this sentiment, describing her role as being one of “perspective bringing”. By this she means helping her engineers make decisions about tradeoffs and manage risks by grounding those discussions with perspectives from impacted stakeholders (including other senior leaders). When teams are well grounded, it has the added benefit of making it clear who to reach out to first when a situation is unfolding.

To support resilient performance across a network, engineering leaders must work to support active and ongoing grounding across various roles and levels of the organization. 

Proposition #3: Resilience depends on learning 

The final proposition that emerged from the research relates to the organization’s ability to integrate new understandings, knowledge or beliefs into their existing models and practices as conditions change or new information becomes available. Learning is about sustaining fitness for the conditions as they change, adjusting practices to deal with surprises as they occur. 

Central to this has been the concept of a blameless post mortem - one that enables individuals to recount the actual events as they happened “without fear of punishment or retribution” [7]. This proposition extends the concept of learning to include two other interdependent aspects that more readily support resilient performance. They are: 1) maintaining slack, and 2) enabling lightweight, real-time feedback loops. To design and manage for resilience, engineering leaders must consider these additional factors.

Shifting the focus from blame to learning

Despite it being a core priority for her, Hirschorn knows how difficult it can be to cultivate an emphasis on learning. “There is a lot of investment required [to set up the conditions for learning] and you have to keep on top and improve... those processes”. She can see that her teams have learned when their post-incident focus is not just technical in nature, but steps beyond what the tooling did to focus on understanding decisions and actions in context.

It’s similar at LTSE: “We don’t believe in human error” says Zoran. “What that construct is and means to the team is: I speak the truth and won’t be faulted. Instead, we want to understand how to move beyond and grow from this, how to evolve.” This is essential to improving the flow of information both across engineers and up to leadership - enabling leaders to be aware of problems that may arise, act sooner if needed and have greater context for the problem as it unfolds. At LTSE, the ultimate metric of how well they are embedding this within their engineering teams is when someone reaches out in a direct message and says “I think I did this” because it frees up so much valuable information that can reduce the amount of uncertainty around a problem. 

This transparency, seemingly simple, is incredibly valuable to resilient performance and one of the most challenging to cultivate within an organization. A developer’s willingness to openly share a perceived or actual mistake is informed by past experiences with blame and being unfairly held accountable for things outside their control.  

Hirschorn has experienced first-hand how fear of being blamed can hinder the flow of information she needs to manage resiliently. Despite her years of experience as a principal engineer firefighting alongside other engineers during incidents, in the early days of her tenure in a new organization, when her team doesn’t yet know or trust her intentions, she often finds herself shut out from the communication channels where incident response is taking place. “I say, ‘Oh, I just want to be a fly on the wall, I’m not going to ask anything’. But that gets hard when you're more senior, because your presence creates more pressure for people who work for you.”

And while she understands that, she also laments that losing this important context has follow-on effects in her ability to anticipate future needs. 

When leaders don’t have a front row seat to the events unfolding, the fidelity of the information received goes down. They can end up with filtered versions of what is getting discussed amongst their responders, leaving them less well calibrated to the capacities and needs of their team, both during the incident and when setting more strategic goals down the road.

In other words, when information is censored, it limits a leader’s own ability to learn about the ways in which pressures, constraints and difficulties faced influence the performance of their team and makes it challenging to assess the ways to best support the engineering organization more strategically.  

To enhance transparency and well calibrated flows of information across roles in an organization, there must be mutual trust and sufficient psychological safety to share information without fear of reprisal. Engineering leaders can move beyond merely paying lip service to the concept of learning by enabling two other core functions for their teams - by maintaining slack and creating lightweight feedback loops. 

Maintaining slack 

Leigh claims that a significant aspect of empowering his team to be adaptive is to preserve time for learning. To him, resilience isn’t possible without some “slack” [8] in the system. Over the years, he has recognized that while being focused on feature development, system maintenance or grooming the backlog supports healthy technical systems, it is insufficient for strong socio-technical systems. Instead, teams need capacity to participate in activities that help them develop deeper expertise, to share that expertise with their peers and to make novel connections to drive innovation.

This is not an easy balance for leaders, as Hirschorn noted in her comments in proposition #1 on advocating with her peers for creating time for teams to work collaboratively. However, she has seen the power of how this enables resilience in several ways. First, teams that have opportunities for learning and sharing that learning have a social “gravity” that draws other resources in. When a team is given some slack to cultivate new knowledge and skill sets, other engineers quickly see the value they create and want to similarly develop new skills. People begin to rely on one another to share knowledge, further enhancing the network effect by extending cross-functional capacity.

Create lightweight, real time feedback loops 

A second characteristic of the focus on learning is the ability to experiment by creating feedback loops. Several leaders described a rigorous approach to trying new things within their teams. 

“We live by pivoting and perseverance” Zoran emphasizes. “If you are off the mark, then pivot. We don’t produce widgets at the end of a cycle or an experiment; instead, we produce learnings. Even if you discover how you are building something will fall over once you get more data points into it, that’s valuable.”

Hirschorn couples this ethos for experimenting with a strong dedication to piloting to generate short-cycle learning. She encourages teams to start small to prove out a concept and be able to back up their instincts with results of the measurements they implemented as part of their experiments. 

For example, one of her teams noticed that substantial knowledge gaps arose when people left the organization. So the team proposed an experiment to see who was committing most often to which repos, which were split up by service. Scraping this GitHub data produced correlations of who paired with whom the most - some of which revealed surprising connections between engineers. These patterns showed engineers’ affinities for the types of scopes, technologies and systems they look after. Because the initiative made the network visible, it became easier to create wider pairings of engineers across different teams to distribute learning.
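As a rough illustration of this kind of experiment, the sketch below counts commits per author per repo and then counts how many repos each pair of engineers has both committed to. It is not the team's actual tooling: the commit records, the engineer and repo names, and both function names are hypothetical, and "shared repos" is only an assumed proxy for who works with whom. In practice the records would come from an export of the organization's commit history.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical commit records as (author, repo) pairs; real data would be
# exported from the organization's version control history.
commits = [
    ("ana", "billing-service"),
    ("ana", "billing-service"),
    ("ben", "billing-service"),
    ("ben", "search-service"),
    ("chi", "search-service"),
    ("ana", "search-service"),
]


def commits_per_repo(records):
    """Return {repo: Counter({author: commit_count})}."""
    counts = defaultdict(Counter)
    for author, repo in records:
        counts[repo][author] += 1
    return counts


def shared_repo_pairs(records):
    """Count, for each pair of authors, how many repos both have committed to.

    This is an assumed stand-in for "who paired with whom"; it only shows
    that two people touch the same services, not that they worked together.
    """
    authors_by_repo = defaultdict(set)
    for author, repo in records:
        authors_by_repo[repo].add(author)

    pairs = Counter()
    for authors in authors_by_repo.values():
        # Sort so each pair is always recorded in the same order.
        for a, b in combinations(sorted(authors), 2):
            pairs[(a, b)] += 1
    return pairs


if __name__ == "__main__":
    for repo, authors in commits_per_repo(commits).items():
        print(repo, authors.most_common())
    print(shared_repo_pairs(commits).most_common())
```

Even a simple count like this can show which services depend on a single frequent committer and which engineers already overlap, which is the kind of visibility the team used to propose pairings across teams.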

Setting up small-scale experiments that quickly generate feedback is a core practice for informing learning within an engineering organization.


Engineering leaders have a unique ability to influence resilient performance and their team’s ability to adapt to surprising events. This can be achieved through design, by creating conditions and interactions that allow for faster adaptation, and through management, by supporting practices that reinforce the benefits of that design.

Specifically, leaders can create opportunities for teams to work cross-functionally and organically, using a network-based approach that includes perspectives across multiple levels and roles. In addition, they can encourage interactions that support continuous grounding, including using their position to broaden the perspectives considered in decision-making. Lastly, a strong emphasis on learning, including blameless post mortems, experimentation and short-cycle feedback loops, can embed the sharing of knowledge and the surfacing of relevant insights to aid real-time adaptation to changing conditions.

What’s clear from the discussions with these leaders is that, like subscription fees for tools your company can’t live without, investments in resilient performance are ongoing. And, because of the substantial impact they have on an organization’s ability to cope with surprising events, they need to be made before you need them. Engineering leaders’ contribution to resilience in the face of surprising events is to create the conditions for learning and adaptation to occur.


  1. Allspaw, J. (2015). Trade-Offs under Pressure: Heuristics and Observations of Teams Resolving Internet Service Outages.
  2. Grayson, M. R. (2018). Approaching Overload: Diagnosis and Response to Anomalies in Complex and Automated Production Software Systems [Masters in Integrated Systems Engineering]. The Ohio State University.
  3. Maguire, L. M. D. (2020). Controlling the Costs of Coordination in Large-scale Distributed Software Systems (Doctoral dissertation, The Ohio State University).
  4. Poile, C., Begel, A., Nagappan, N., Redmond, W. A., & Layman, L. (2009). Coordination in Large-Scale Software Development: Helpful and Unhelpful Behaviors.
  5. Maguire, L. & Jones, N. (2020). Learning from Adaptations to Coronavirus. Learning from Incidents blog.
  6. Klein, G., Feltovich, P. J., Bradshaw, J. M., & Woods, D. D. (2005). Common ground and coordination in joint activity. Organizational simulation, 53, 139-184.
  7. Allspaw, J., Evans, M., & Schauenberg, D. (2016). Debriefing facilitation guide.

About the Author

Laura Maguire leads the research program at, where she studies software engineers keeping distributed, continuous deployment systems reliably functioning, and helps to translate those findings into a product that is advancing the state of the art of incident management in the software industry. Maguire has a Master’s degree in Human Factors & Systems Safety, a PhD in Integrated Systems Engineering from the Ohio State University, and extensive experience working in industrial safety & risk management.
