
Crafting a Resilient Culture: Or, How to Survive an Accidental Mid-Day Production Incident

Key Takeaways

  • Automation and orchestration can be great, but make sure you understand what’s happening under the hood and what to do if your automation goes awry.
  • If people aren’t following processes, you’ll gain more by digging into why not than by trying to enforce something that isn’t working.
  • Build incident response and remediation work into your planning budgets to make sure you have enough capacity within your teams.
  • Look for ways to create a culture where people feel safe enough to ask for help and have the time to help out others.
  • Creating a psychologically safe environment is key to maintaining a learning environment.

Nearly everyone who has worked with computers for long enough has a story of when they accidentally broke something in production. For most operations teams I’ve worked on, breaking production was a rite of passage, the joke being that only once someone had done so were they a "true" member of the team. We trade stories of our disasters and near misses at conferences or team social events, and over time these stories often become part of a team or organization’s institutional body of knowledge. By learning from our mistakes, we can avoid repeating them.

The Apache SNAFU

One of my favorite production horror stories is the time when, while I was working at Etsy, I accidentally upgraded Apache on every single server that was running it. This incident, which won Etsy’s internal Three-Armed Sweater award (given to the engineer who inadvertently causes the most interesting or most learning-filled incident), was presented at QCon New York in 2018 and also documented as part of the Stella Report, but I’ll tell the short version of the story here as well.

As part of my work on an internal server provisioning tool, I needed to provision a server to test some changes I’d made. When I tried, the initial Chef run on the new server failed at the Apache install step, since the Apache version pinned in the Chef code was older than the version available from the local yum mirror. This was a known provisioning failure mode, as the yum servers were configured to pull down the latest available package versions while versions in Chef were pinned for stability. The accepted way to fix this sort of issue was to bump the pinned version in Chef. At the time, Chef was configured such that this type of change was expected to impact only newly provisioned servers - servers that already had Apache installed would not be upgraded.
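
To make the mechanism concrete, here is a minimal sketch of what pinning a package version in a Chef recipe can look like. The package name and version string are illustrative, not Etsy’s actual cookbook code.

    # Illustrative sketch of a pinned package in a Chef recipe.
    # Provisioning fails if this exact version is no longer available on the
    # local yum mirror; the accepted fix was to bump the pin.
    package 'httpd' do
      version '2.4.6-88'   # hypothetical pinned version
      action :install
    end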

To do a bit of due diligence, I checked the release notes for the new Apache version (it was a point release, and nothing in the changelog looked interesting), installed it manually on a test server, and verified that it installed and started correctly. Since I had done this same thing just weeks before without issue, I then rolled out the change to both the staging and production Chef environments at once. To everyone’s surprise, however, Chef not only upgraded the Apache package everywhere, but after the upgrade Apache didn’t start up properly, meaning everything that depended on it (such as the entire Etsy.com site) was in a bad state.

The happy ending to this story is that several people from a few different teams all quickly jumped in to help. We figured out that running Chef a second time caused Apache to restart correctly, and while the site had become very slow during the process, it never technically went down. The whole incident probably didn’t last longer than 20-30 minutes before everything was back to normal.

Resilience and Robustness

What allowed the accidental Apache upgrade to be resolved so quickly and with minimal impact to end users was resilience. Resilience is not the same as robustness. Robustness is the ability to handle scenarios that have been anticipated and planned for in advance. Resilience, on the other hand, describes the ability to respond to unanticipated scenarios or failure cases. While software can be robust, only humans can be truly resilient.

More information on resilience engineering, which is beyond the scope of this article, can be found in "Resilience is a Verb" and "Four concepts for resilience and the implications for the future of resilience engineering", both by David Woods, one of the founders of the resilience engineering field and a member of the SNAFU Catchers group that studied the Apache SNAFU described here. For the purposes of this article, we will consider his description of resilience as "the capabilities a system needs to respond to inevitable surprises", focusing on the human aspects of those capabilities as they apply to modern software engineering culture.

Culture Design and Designable Surfaces

In any engineering organization, there are many different aspects of how people work and interact that can be described as that organization’s "culture". The devops movement has talked a great deal about the importance of culture, with cultural changes being discussed as a key part of a "devops transformation". A clear understanding of what culture means is necessary to create and maintain an effective engineering environment.

Culture is the collection of social scripts, lore, behavioral norms, biases and values, incentive structures, and processes that describe the behaviors of a distinct group or social environment. Devops as a cultural movement is a series of trends that shift elements of this collection - for example, the trend towards social scripts that encourage more collaboration between development and operations teams. Not all cultures are created equal. This is most often seen when comparing how an organization describes its values or practices in theory with how its culture plays out in practice. Culture is about what an organization actually does, not just what it claims to value.

With appropriate care, culture can be changed. There is no one-size-fits-all pattern that can describe what a "devops culture" or a "resilient culture" is, let alone how to get there. Each organization will have its own collection of values, norms, processes, and so forth that it has decided upon, as well as its own unique set of problems and challenges. Instead of trying to describe how to change a specific culture, this article will look at the concrete things that can be directly changed in order to impact one or more aspects of culture. We refer to these as designable surfaces.

Designing for a Resilient Organization

The rest of this article will take a closer look at the lessons learned from the Apache SNAFU and, along the way, describe the designable surfaces that can be used to create a resilient organization. Without resilience, a surprise such as "accidentally upgrading a critical software package throughout multiple data centers" could have had a much more negative and long-lasting impact. It was people’s ability to respond to that surprise that demonstrated the organization’s resilient culture.

Automation and Process

One of the first observations that came up in the post-mortem review for this incident was the role that automation had played in it. There were a couple of noteworthy ways in which automated systems behaved unexpectedly during this incident:

  • Chef, the configuration management tool in use, behaved unexpectedly, upgrading a piece of software in a way that hadn’t happened in the past
  • The software in question, Apache, did not restart properly after the upgrade, which also hadn’t happened in recent memory (especially not for a point release)
  • The site stayed up thanks to a small number of servers where Chef was not running correctly and so never received the upgrade
  • The local yum mirror containing a newer package version than the one pinned in Chef wasn’t itself surprising, as that yum configuration was a deliberate choice, but it was a contributing factor in the incident timeline

There were technical discussions to be had (and they were had) about preventing this sort of issue from recurring. In fact, there were conversations around the behavior of the local yum mirror and whether it would be better practice to keep a couple of older package versions around for some period of time; whether Chef should be configured with more time between runs, or whether tooling should be created that would allow users to "kill" a committed Chef change that hadn’t yet been deployed to all servers; and what, if anything, should be done about the servers where Chef wasn’t running correctly (even though that had a positive outcome in this one case). But beyond the implementation details of automation tools is how people use and interact with them, which is ultimately about process.

While mentioning ‘process’ is frowned upon in some organizations, processes are an important designable surface. They provide patterns and scripts describing how people are expected to act in certain situations, which increases predictability and codifies best practices.

One notable process in this situation was the one for testing and deploying Chef changes. I noted in the first section above that I deployed the bumped Apache version to both staging and production at once. A more thorough deploy process was available - there were a couple of different mechanisms for testing Chef changes, including separate Chef environments, cookbook versioning, and Chef safelisting. However, it was commonly understood that these steps were often annoying and time-consuming, with engineers accidentally stepping on each other’s toes when trying to test changes, so people were encouraged to use their best judgement when deploying.
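
As an illustration of the more thorough path, here is a rough sketch of how separate Chef environments with cookbook version constraints can stage a change before it reaches production; the cookbook name and version numbers are hypothetical, not Etsy’s actual configuration.

    # environments/staging.rb (hypothetical) - picks up the new cookbook release first
    name 'staging'
    description 'Staging servers'
    cookbook 'apache', '= 1.3.0'   # release containing the bumped Apache pin

    # environments/production.rb (hypothetical) - stays on the known-good release
    # until the change has been verified in staging
    name 'production'
    description 'Production servers'
    cookbook 'apache', '= 1.2.9'

With constraints like these, promoting a change becomes an explicit step (updating the production constraint) rather than something that happens everywhere at once.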

A naive approach would be to start requiring everyone to follow all of the possible testing steps on every deploy. But if you find yourself with processes that people are avoiding or circumventing, that is a prime opportunity to reconsider those designable surfaces. Why are people working around the processes? What outcomes are the processes and people involved trying to work towards? Enforcing processes arbitrarily as a way of trying to increase robustness often ends up making systems more fragile, as people will find ways to work around processes that frustrate their goals. In this example, a more holistic look at the Chef testing and deploy processes, even going so far as to do usability testing and interviews with a variety of Chef users within the organization, could go a long way towards finding testing processes that people are more likely to use.

Adaptive Capacity, Budgeting, and Buy-In

Being able to react quickly is a crucial part of incident response. There were technical aspects of the monitoring and alerting systems that allowed for a quick response in this instance - Graphite graphs showing failed Chef runs throughout the infrastructure and Nagios alerts on aggregate Apache status across various clusters of hosts, both feeding into PagerDuty, to name a few. However, the technical ability to respond becomes much less important if an organization doesn’t also have the human or social ability to respond quickly as well.

When I first rolled out the Apache change and noticed that it was not a no-op as expected, the first steps I took involved interacting with other people, not computers. I let the primary on-call person know that he was about to get paged for a bunch of Apache-related things and that I was aware of the issue, and then jumped into the infrastructure-wide Slack channel to explain what was going on. The immediate response I got was people asking how they could help. People who had the ability to interrupt what they’d been working on jumped in to help troubleshoot. Within minutes, people from different teams with various areas of domain expertise were looking at things like the Chef deploy, whether that deploy could be stopped or rolled back, the Apache release notes, our Apache configurations, and the overall health and responsiveness of the site. It was with so many helping eyes and hands that we were able to collectively get Apache up and running again within 20 minutes or so.

This pattern - of people asking for help when they needed it and other people jumping in to assist - worked because of a culture that deliberately built up this kind of adaptive capacity. This can be done in part with budgeting. When divisions, teams, and individuals go through their planning and goal-setting processes, they should make sure to budget for on-call and incident response. There is no one-size-fits-all number for how much slack to budget, as each on-call rotation or setup is different in terms of its capacity and how many incidents it has, but it’s important to pay attention to how much time gets spent on incident response and remediation items over time.

Also crucial to getting this to work is getting buy-in for these processes and patterns. If not everyone involved in a process or activity is on board with it, it’s more likely to fail. For example, if only one team built enough slack into their schedule to help respond to production incidents, they might find themselves feeling overburdened with on-call responsibilities, which might make them look less productive compared to the rest of the department despite their valuable contributions. Or if management didn’t accept the importance of that work, people might find themselves having to choose between meeting their deadlines and helping their colleagues. Getting buy-in across the board is necessary for cultural changes to succeed.

Dependencies and Transparency

One fun surprise that happened during the Apache SNAFU was realizing just how many of the internal tools used at the time relied on Apache. Graphite graphs and the internal dashboard tool? Nagios? The web interface to the Deployinator deploy tool? The web server hosting internal documentation, including on-call run books? All of these needed Apache. This incident quickly illustrated quite a few dependencies that didn’t usually show themselves. Dependencies are common in modern complex systems, and troubleshooting them can require lots of different domain experts even when half your internal tools aren’t on fire.

Because so many of the tools that usually helped with troubleshooting during incidents were out of commission during this one, everyone had to communicate a bit more during the incident itself. Luckily, this was already common practice due to a culture of transparency. People were already comfortable sharing information on what they were doing ("I’m going to re-run Chef to see if that changes anything"), asking for verification ("The site seems slow but still up for me, can anyone else confirm?"), or admitting when they needed help ("I’m out of ideas with the Apache config - do we want to ping someone from the web team for another pair of eyes?").

Compare this to a less transparent culture, where people might not feel comfortable sharing their thought processes, mistakes, or questions for fear of "looking bad" or being ridiculed. If I had kept my knowledge of the breaking changes to myself, or if different teams of people hadn’t shared what they figured out during the troubleshooting process, the incident could have gone on for much longer, with people working individually and possibly against each other rather than collaborating on finding a solution.

Inter-Team Relationships and Social Scripts

In any sufficiently complex system or organization, knowledge of the system as a whole will be spread out over multiple teams and people, meaning that responding to incidents can require input from various sources. Making sure that people know who to go to for information or help outside of their own domain, and how to get help most effectively, is critical. During the Apache SNAFU, my being able to find people who knew more about our specific Apache configurations, for example, was key to resolving it as quickly as we did.

Building relationships between different teams or groups becomes more important the more complex an organization is. It can also get harder for these relationships to develop organically. Rather than relying on people getting to know each other around a water cooler, organizations can develop social scripts designed to build these relationships. At Etsy, a system of "pluses" (imaginary internet points handed out in Slack) was used to thank people for asking and answering questions in public channels, which encouraged people to ask for and share information publicly rather than privately. While imaginary internet points won’t work for everyone, look for ways to encourage people to interact and share knowledge.

Another group of social scripts and related tools helped people get to know and interact with people on different teams throughout the organization. During the onboarding process, engineers could "bootcamp" on different teams to build up inter-team relationships; once a year after starting, they could do a longer "rotation" onto another engineering team for the same reasons. Several opt-in groups designed to pair up people who didn’t know each other for a quick chat (similar to Donut.ai) also existed. Social scripts for taking advantage of these opportunities were an accepted part of the culture, so participation was not just allowed but encouraged. It is important here again to get buy-in throughout the organization, so people don’t end up with different opportunities based on what team or manager they happen to have.

Learning, Psychological Safety, and Incentive Structures

Finally, a resilient organization requires a dedication to learning. It is not sufficient to respond quickly and adapt to incidents in the moment. Learning has to happen to make sure that the same incidents don’t repeat themselves. Learning requires psychological safety - the ability for people to feel comfortable taking risks, making mistakes, and admitting they don’t know things, without fear of being ridiculed or punished. It was psychological safety that allowed Etsy’s Three-Armed Sweater award to feel like an actual award rather than a public shaming.

Compare that to a blameful environment, where whoever is determined to be "responsible" for "causing" an incident is likely to be fired or otherwise punished. When that sort of culture persists, it can lead to people trying to "cover their tracks" as they work or refusing to talk about changes they’ve made even when directly asked. A culture of fear is one that prevents real learning.

While there is a lot that goes into psychological safety in the workplace, one way to design for a culture of learning and blamelessness is to look at the incentive structures within your organization. In addition to getting a cool new sweater for a year and some imaginary internet points, I’ve gotten to give talks and write articles about this incident, as have other people who were involved. Getting to build a public profile through conferences and such can encourage people to think about incidents differently, focusing on what they and others can learn from them. If your skills matrices for promotions include things like community contributions, post-mortem facilitation, or incident writeups, that can also provide an incentive for people to take part in learning-focused activities. The behaviors that get rewarded and promoted within your organization will have a great deal of impact on its culture.

Conclusion

There are numerous ways a culture can be resilient. When looking at incidents, it is just as important to investigate the things that went right - factors that made an incident easier to troubleshoot, or helped it to be resolved more quickly - as it is to look at things that went wrong. By starting to see patterns in what went right, organizations can start to design for those patterns, using concrete designable surfaces such as processes, social scripts, and incentive structures to build and maintain the type of resilient culture they want to have.

About the Author

Ryn Daniels is a software infrastructure engineer at HashiCorp with a long history of working on infrastructure and operations engineering, focusing on infrastructure operability and observability, sustainable on-call practices, and the design of effective and empathetic engineering cultures. They are the co-author of O’Reilly’s Effective DevOps and have spoken at numerous industry conferences on devops engineering and culture topics.
