
Growing Resilience: Serving Half a Billion Users Monthly at Condé Nast


Summary

Crystal Hirschorn outlines how Condé Nast practices Chaos Engineering, where this fits within the already established testing and verification ecosystem, and what emergent practices and tools are on the horizon. Last but not least, she covers how to build up an organization’s true superpower: Human Resilience.

Bio

Crystal Hirschorn is VP Engineering, Global Strategy & Operations at Condé Nast which is best known for its portfolio of global brands Vogue, Wired, Vanity Fair, The New Yorker. She oversees a globally distributed engineering organization and leads the technical strategy for building unified technology platforms deployed across the globe to meet the demands of more than 450 million monthly users.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Hirschorn: I'm going to talk about both some of the chaos practices that we use at Condé Nast, which have been running for more than two years now, and also incident management. Also, about a probably often overlooked but even more important aspect, which is growing the resilience of your organization in the face of adversity and failure. I'm a VP of Engineering at Condé Nast. I've been there for just over three years. I look after their global strategy and operations for technology and engineering. You'll see in a second a picture of what we do. You might be like, who's Condé Nast? We are a portfolio company. We publish more than 30 different publications around the world, both in print and in digital. Here are some well-known brands that you might recognize. We have Vogue, GQ, WIRED, Vanity Fair, Glamour, The New Yorker, Bon Appétit. The list goes on. There are more than 30 brands in our portfolio. This is wild. I think it gives you a sense of our global footprint as a company, and what we have to try and serve around the world. When I say more than half a billion customers, I'm like, is that requests or customers? I can't imagine one-sixteenth of the world is actually looking at our publications every month, but, apparently so. This is a bit old, actually; we're now in about 40 countries around the world. This is just giving you a taste of what publications run where. What underpins this is a Kubernetes platform that we've been building for the last two-and-a-half years. That platform itself runs about 10 clusters at the minute. We're running in five geographically distributed regions: Japan, China, Frankfurt, Ireland, and US-East-1. We run on AWS.

Outline

I'll just go into what we're going to cover. I'll talk about what we mean by complexity and what we mean by resilience; I think these are two not always well-understood terms when we're talking about software. I'll talk about how we've embedded some of the chaos practices from chaos engineering, and some of the tooling that we've used at Condé Nast, so that you can see how you might want to do this for your own organization. Incident analysis as well. I'll also use a real-life case study of an incident that happened at Condé Nast only a few months ago, and go through how that played out, and some of the cues and signals that you might see during that incident, which are often overlooked. Often, we focus on very technical aspects and come up with very technical actions, but we need to go beyond that as an industry. Then, finally, stepping beyond the tooling and thinking about how you grow human resilience within your organization. Because we work in dynamic socio-technical systems, we need to think about both the technology that we build, and also the humans that have to operate it. How can we make that easier for them? How are we evolving our practices? What do we see? How are we evolving that internally, in my own company, and what are we seeing in the industry as well?

I don't know if you saw the talk by Luke Blaney at the "Financial Times?" He's a principal engineer. I love the start of his talk, because he goes, this is our tech stack. No, wait a second. No, it's like this. Actually, it's like this. I feel the same thing can be said for Condé Nast. We're a globally distributed organization. We have 12 international physical operations. The tech stack that runs in each one of those locations is different. We do have one global platform that we've now managed to roll out, which was no small feat. This is only a very small part of the technologies that we run today. In terms of application stack, because we're mainly talking about web applications, it's Node, React, and JavaScript, and we use Fastly mainly as our CDN provider. We also operate in really difficult places like Russia and China, so we've had to use some very specific technologies, and be a little bit more agnostic in order to interoperate.

A lot of people think about legacy as a bad thing. I don't know if I see it the same way. We will always have technical debt; the first line of code that you write is going to exist for longer than you can ever expect. I see it more like legacy is the cost of success. It means that your company didn't fail really early on. Over time, that's going to accrue and you will have to manage it in some way.

Complexity and Resilience

I just want to pick a really apt quote from one of my favorite books, which is called "Drift into Failure" by Sidney Dekker. The quote is, "Complexity doesn't allow us to think in linear, unidirectional terms along which progress or regress could be plotted." He is a part-time professional pilot, so I think he's got the deep requisite knowledge to talk about complexity and resilience. What he's saying here is, you can't really predetermine whether something has had a good or a bad outcome. It's hindsight. Therefore, we end up with lots of hindsight bias in our industry. You see a timeline of events, and you're like, of course, this is why this happened. Of course, that's why that person made that mistake. Given the way that we build applications, and the distributed and very dynamic nature of them, both in terms of the humans and the technical systems that we build, that's not true. If you think about it more deeply for more than a couple of minutes, you realize that it's not a linear thing.

Another thing I've been getting into a lot recently is systems thinking. I'm going to have a couple of quotes from a book by Donella Meadows; she wrote "Thinking in Systems: A Primer." She talks about thinking more holistically: what are systems? Where do we draw the boundaries? What are the actual mechanisms and feedback loops? She talks about stocks and flows. Where do we get bottlenecks? How can we make things more efficient? How can we leverage the best mechanisms to make our systems more efficient and successful? These are just a couple of screen grabs here. This one-stock system is actually an example from one of the first few chapters. Obviously, you can imagine it gets a lot more complex than that. It's really interesting. It's a book that I'd really recommend, because it's very accessible. She doesn't talk in very technical terms. She tries to use everyday things. Here she's talking about a heating system, or she might be talking about how even a coffee machine works, so that it's really accessible to any audience.
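To make the stocks-and-flows idea concrete, here is a minimal sketch (not from the talk) of the kind of one-stock system Meadows describes, using her heating example: the stock is the room temperature, the inflow is heat from the radiator, the outflow is heat lost to the outside, and a balancing feedback loop adjusts the inflow based on the gap between the stock and its goal.

```typescript
// Minimal one-stock system in the spirit of Meadows' thermostat example.
// Stock: room temperature. Inflow: heating. Outflow: leakage to the outside.
// The balancing feedback loop closes the gap between the stock and its goal.

const goal = 20;          // target temperature (°C)
const outside = 5;        // outside temperature (°C)
const leakRate = 0.1;     // fraction of the indoor/outdoor gap lost per step
const heaterGain = 0.3;   // how aggressively the heater responds to the gap

let roomTemp = 10;        // the stock's initial level

for (let step = 0; step < 20; step++) {
  const inflow = heaterGain * Math.max(goal - roomTemp, 0); // heating feedback
  const outflow = leakRate * (roomTemp - outside);          // loss to outside
  roomTemp += inflow - outflow;                             // the stock integrates its flows
  console.log(`t=${step} temperature=${roomTemp.toFixed(2)}`);
}
```

The point is not the numbers but the shape: a stock, two flows, and a feedback loop, which is exactly the lens she then applies to much messier systems.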

Also, there are posts around the internet. This is a person that I really like, Will Larson (lethain). He talks a lot on his blog about systems thinking, and you can see how these concepts then apply to the work that we do day-to-day, and what feedback loops and mechanisms are there. Here's one of my favorite quotes from the book. It says, "Because resilience may not be obvious without a whole system view, people often sacrifice resilience for stability, or productivity, or for some other more immediately recognized system properties." This is something we'll be diving into a little bit more later: how can we have more of a whole-system view, so that we don't over-optimize for the wrong goals when we're developing software?

I want to talk a little bit about chaos engineering, and what we've been doing at Condé Nast. We have been practicing chaos engineering in some form for more than two years. Before that, I was a Principal Engineer at the BBC. We also did some chaos engineering there, and I took some of that practice with me into Condé Nast. I just want to do some myth busting: chaos practices won't completely eradicate operational surprises. A lot of the reason why we did it was to try and build resilience in people, basically, and get them more adjusted and used to having to deal with things in the moment of surprise, which is often a very stressful time. Also, being able to think very critically and logically in that moment, rather than panicking. Here's another one from the Primer book. The bad news, or the good news (again, is it bad news, is it good news?), is that even if you understand all these system characteristics, you may be surprised less often, but you will still be surprised.

Tooling and Architecture Considerations

What tooling and architecture considerations did we look at? This is from a couple of years ago; I ripped it off a talk by Adrian Cockcroft, who's one of the VPs of architecture at AWS. He tries to get people to think about, what's your stack? Where might you be applying different types of tooling, or even processes, in terms of actually experimenting effectively across your stack? One thing that we use quite heavily here is Gremlin. It is more of an infrastructure-level experimentation tool, though it also does some application-level fault injection. Apart from that, we don't use a lot of tools, and I'll go into why later. I think as an industry, we can focus too much on tools and forget about the actual craft of devising experiments. Also, we generally have access to the CLI; we can probably break stuff pretty easily. Here are some takeaways if you want to go and have a look yourself. Also, just think about how we have evolved our architectures over the last couple of years: how we've gone from physical machines to virtual machines, containerization, and serverless. You might want to consider this, look at the context of your own organizational technology estate, and work out where best to apply this tooling to what context.
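As an illustration of what application-level fault injection can look like, here is a hand-rolled sketch, not how Gremlin itself is wired in: a small Express middleware that adds latency or errors to a configurable slice of requests. The environment variable names are made up for the example.

```typescript
import express from "express";

// Hand-rolled fault injection sketch: delay or fail a configurable percentage
// of requests. Tools like Gremlin do this (and much more) at the
// infrastructure and application level; this just shows the basic idea.
const app = express();

const FAULT_RATE = Number(process.env.FAULT_RATE ?? 0.05);            // 5% of requests
const EXTRA_LATENCY_MS = Number(process.env.EXTRA_LATENCY_MS ?? 500);

app.use(async (_req, res, next) => {
  if (Math.random() < FAULT_RATE) {
    if (Math.random() < 0.5) {
      // Half of the injected faults are extra latency...
      await new Promise((resolve) => setTimeout(resolve, EXTRA_LATENCY_MS));
    } else {
      // ...and the other half are hard failures.
      res.status(503).send("injected fault");
      return;
    }
  }
  next();
});

app.get("/", (_req, res) => res.send("ok"));
app.listen(3000);
```

Even a sketch like this forces the useful questions: which requests are in scope, what percentage, and how would you observe the effect?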

There are some notable tools. I've put a link here to one of the posts that actually lists all these tools. One of the dominant players is Gremlin. That's the one that we use. One of the key reasons why we use it is that it's very accessible for product engineers, because it provides a very friendly UI. It also offers an API and a CLI, which they can access quite easily. If you feel that your engineering organization is more experienced, and can do these things without a UI, then go for it.

One thing that we're looking to introduce at work, though it's not finished yet, is a service mesh. We have about 10 Kubernetes clusters running all across the globe, and we need a better way to manage that communication across multiple regions, multiple clusters, pods, everything. We're looking at using Istio at work at the minute. There's also a new tool that's come out, called Chaos Mesh, by PingCAP; I don't know how many of you practice chaos engineering, but this is a takeaway to go and look at. It's not something that we're using yet because we haven't fully implemented our service mesh, but it's something that we probably will end up using.

Also, give some consideration to whether or not you're using serverless. I think the patterns there probably require different thinking, and also different tooling, in terms of how you want to apply chaos testing. A really great person to follow, if you want to know a lot about chaos engineering for serverless, is Gunnar Grosch. I've put a link here to him. He does lots of demos, he's built a lot of tooling in this community, and he does a lot of talking on this as well.

Devising Your First Game Day

How do you devise your first game day? How many of you here are actually running game days, or have run one? I didn't expect so few people to raise their hand. It's not that scary. It's not that onerous. The way that we started, just to make it seem less scary, was that I thought, let's get loads of people in a room. That's what we did. We got lots of people from product management, scrum masters and delivery management, our internal function called IT service operations, and engineering. Engineering, of course, comprises lots of specialisms, including cloud infrastructure, SRE, and product engineering. Those were the main ones. We got about 20 of us in a room.

We devised some scenarios. We called it TheoryCraft. This was our very first scenario. We ran this in early 2018. It says: an image has gone missing on the Vogue UK front page. Every time a customer loads the page, instead of the image, they see a blank box. This may have caused an increase in 500s, which has been reported by the load balancer and triggered a reload. Customer complaints, editors, and engineers are all sure to notice in the next few minutes. The on-call agent has just received the following page: "Vogue UK 500s per minute. See rocket runbooks." What happens next? That's all the information we gave them. Before that, we had devised how we wanted to run incident management in our company, and we tried to get people to follow this process. We even devised decision trees to try and get people to understand how it flows in a time of crisis. It was really interesting, for a lot of reasons.

I said, what should we do? Somebody said, why is it coming in like this? Why didn't they just get pinged first? Why is it that the customers are coming through and complaining first? Then somebody else in the room said, they should get fired. Here I was, so smug, thinking I had built this amazing, blameless culture, and this was literally the first thing that someone said. I was actually shocked. I went quiet at that point for a few minutes, and other people took over. Actually, no, the first thing I said was, "No, nobody's getting fired. This isn't the way we're going to do this." It's a blameless culture, which, if you look at blameless culture, is basically about creating psychological safety, but also giving people accountability as well. You have to balance both things. It's not that people are fully unaccountable for breaking software, but you also want to give them some safety. Then we went through this, and the interesting thing that I found was that people were jumping off the process. Step one was good. Then suddenly they were on step four. Then suddenly they were on step seven. Then they jumped back to step two. It was like, who's the incident commander? It was just really unclear. There's so much tremendous value in just doing this. You don't need to go and instrument anything. You don't need to go and set up anything. This is how we started and this is how we ran it many times. Until people internalize the process, I'm not really sure what value it brings to add lots of tooling on top; I'm not really sure what we'd be solving. Then we had other scenarios that we ran through as well, more realistic scenarios like an etcd de-sync. We had some intended, expected outcomes. I can't remember if we shared those with the group or not. We were seeing whether what we thought the troubleshooting path would be matched what they actually went through in that workshop.
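The scenarios themselves don't need any tooling; a shared written structure is enough. A hypothetical template (the field names and values below are illustrative, not the format Condé Nast uses) might look like this:

```typescript
// Illustrative game-day scenario template. The value is in writing down the
// expected signals and troubleshooting path up front, then comparing them
// with what the group actually does in the room.
interface GameDayScenario {
  title: string;
  prompt: string;             // what the participants are told
  expectedSignals: string[];  // alerts, graphs, customer reports we expect to surface
  expectedPath: string[];     // the troubleshooting steps we predict
  roles: string[];            // e.g. incident commander, comms, scribe
}

const vogueImageOutage: GameDayScenario = {
  title: "Vogue UK front page image missing",
  prompt:
    "Customers see a blank box instead of an image, 500s are rising, and the " +
    "on-call engineer has just been paged: 'Vogue UK 500s per minute'. What happens next?",
  expectedSignals: ["load balancer 500s", "customer complaints", "editor reports"],
  expectedPath: [
    "acknowledge the page",
    "declare an incident and name an incident commander",
    "check the rocket runbooks",
    "work out whether the CDN or the origin is failing",
  ],
  roles: ["incident commander", "communications lead", "scribe"],
};

console.log(JSON.stringify(vogueImageOutage, null, 2));
```

Comparing the expected path with what actually happened in the room is where the learning is.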

Recent Game Days (December 2019 + January 2020)

We've run some more recent game days as well. I know the pain of trying to even get buy-in to do these things, because we actually went through a bit of a break ourselves. Just a bit of context about Condé Nast: we went through a merger last year. We have a U.S. company and an international company, which merged early last year, and so this went on a bit of a break while that happened. Luckily, we are now running them pretty much every week. This is one of the ones that we started running quite early this year. Departures is our internal CI/CD platform. It's a full-stack thing. It's got a UI on top and APIs underneath that can hook into things like Jenkins, or CircleCI, or Concourse; it can hook into several CI/CD workflows underneath. It makes it really easy for all product engineers, who don't necessarily have deep infrastructure knowledge, nor should they always at a certain scale, to deploy really easily and see the feedback from that as well. The game day was: given a failure here, how can they follow runbooks? What other artifacts might they look at? We also have in-hours and out-of-hours processes, so how would people follow these processes differently if it was in-hours or out-of-hours? We could see that people were actually getting a lot better. At the same time, we were also trying to spread this culture into our U.S. company, which hadn't been doing this practice to date. This has become a global practice for us now; it's almost a mandated thing that all engineering has to do. Here are some examples of the actual conditions of the experiment. We killed the cluster scheduler, which is called Sabre. The expectation would be that our Departures CI/CD platform wouldn't show any available builds, and it would block any deployments from happening until that's fixed again. Because we've set up the conditions in advance, we can now see if those expectations match what happens when we actually run the experiment. To be honest, we can go in and kill that ourselves; we don't necessarily need a tool like Gremlin to do that.
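Writing the expectation down as a checkable hypothesis is what turns "kill the scheduler" into an experiment. Here is a hedged sketch of that idea; the Departures URL, endpoint, and response shape are assumptions invented purely for illustration.

```typescript
// Sketch: verify the hypothesis "if the cluster scheduler (Sabre) is down,
// Departures shows no available builds". The URL and response shape are
// invented for illustration; fault injection itself happens out of band.
const DEPARTURES_API = process.env.DEPARTURES_API ?? "https://departures.example.internal";

async function availableBuilds(): Promise<number> {
  const res = await fetch(`${DEPARTURES_API}/api/builds?status=available`);
  if (!res.ok) throw new Error(`Departures returned ${res.status}`);
  const body = (await res.json()) as { builds: unknown[] };
  return body.builds.length;
}

async function verifyHypothesis(): Promise<void> {
  // 1. Record the steady state before the fault is injected.
  const before = await availableBuilds();
  console.log(`available builds before experiment: ${before}`);

  // 2. (Kill the scheduler out of band, by hand or with a chaos tool.)

  // 3. Check the expectation while the fault is active.
  const during = await availableBuilds();
  console.log(`hypothesis "no available builds during outage": ${during === 0}`);
}

verifyHypothesis().catch(console.error);
```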

Growing the Complexity of Your Experiments - End to End Tests

Our systems are complex and there are a lot of types of testing, so how can we try and run these experiments in a more interwoven way, particularly when we're serving a lot of requests to users? Those requests have to go through multiple layers, different types of caching, request tracing; there are different parts in the stack in which things can fail. We wanted to look at our service map. We still use Datadog quite a lot, and it gives you dynamic service mapping. This is a snapshot from early last year of part of the microservices that we have within our ecosystem. It's about trying to understand what the actual connections and coupling between the different services are, and trying to create experiments off the back of that. That's something that I would really encourage as well. You have to start simple, but the experiments need to grow in complexity over time, just like your systems. Then, this is actually taken from a really great blog post by Cindy Sridharan. She also writes a lot of really amazing articles around request tracing, and getting people to think about the multiple different paths a request can take. There are happy paths, and there's a thing called sad paths as well; make sure that you think about your failure cases, or edge cases, and try to capture that in a request-response system.

That other diagram might look quite similar to this one, which is actually an AWS X-Ray service dependency map. You can probably get this from different tools. If you don't have this available, this is a place to start thinking about how to devise experiments, because this is essentially the customer journey that you're devising experiments around.
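One way to make the sad paths Sridharan describes visible is to record them explicitly on your traces. Condé Nast uses Datadog APM for this; the sketch below uses the OpenTelemetry JavaScript API instead, purely as a generic illustration, and the service and URL names are made up.

```typescript
import { trace, SpanStatusCode } from "@opentelemetry/api";

// Generic illustration: record both the happy and the sad path on a span, so
// failure cases show up in request traces rather than disappearing into averages.
const tracer = trace.getTracer("recommendations-client");

async function fetchRecommendations(userId: string): Promise<string[]> {
  return tracer.startActiveSpan("fetch-recommendations", async (span) => {
    try {
      const res = await fetch(`https://recs.example.internal/users/${userId}`);
      if (!res.ok) throw new Error(`upstream returned ${res.status}`);
      span.setStatus({ code: SpanStatusCode.OK });
      return (await res.json()) as string[];
    } catch (err) {
      // Sad path: record the failure and degrade gracefully to an empty list
      // instead of failing the whole page.
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      return [];
    } finally {
      span.end();
    }
  });
}
```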

Is your organization ready? This is a great tweet from Charity Majors. She does a lot of talking on this subject. Her company, Honeycomb, is really interesting. We don't use it ourselves, but we've been looking at how you move towards tracking raw events instead of doing log aggregation, using things like she mentions here, high cardinality, to index things and to be able to find trends and variances a lot more quickly. She says your org is ready for chaos engineering when you're able to budget your reliability work instead of performing it reactively, and when you have enough observability, which she points out is not the same as monitoring, to reliably identify the chaos you have injected. I think this is a really important point: don't inject chaos if you literally can't observe what the fuck just happened, because otherwise it's just chaos. There's no point in doing that. Think about that. We didn't have an SRE practice until only about nine months ago. It took quite a long time to get that off the ground. The key thing for me was trying to find a really solid lead engineer to come in and actually spread and build that practice in the organization. One thing that has allowed us to do is put this within their remit: the expectation is that this team gives us the capacity to do reliability work in a proactive rather than reactive way.
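The high-cardinality point is worth making concrete. Instead of a counter per status code, you emit one wide, structured event per request carrying the fields you will want to slice by later. This is a generic sketch, not Honeycomb's SDK and not Condé Nast's actual instrumentation; the field names are illustrative.

```typescript
// Sketch: one wide event per request, with high-cardinality fields (brand,
// route, user, pod) that let you slice and find variances after the fact.
// Where the event is shipped is an implementation detail.
interface RequestEvent {
  timestamp: string;
  brand: string;          // e.g. "vogue-uk": high cardinality across 30+ brands
  route: string;
  userId?: string;
  pod: string;
  region: string;
  statusCode: number;
  durationMs: number;
  cacheStatus: "HIT" | "MISS" | "PASS";
}

function emit(event: RequestEvent): void {
  // A structured JSON line per request is enough to start with.
  process.stdout.write(JSON.stringify(event) + "\n");
}

emit({
  timestamp: new Date().toISOString(),
  brand: "vogue-uk",
  route: "/fashion/article/:slug",
  userId: "u-18273",
  pod: "rocket-7d9f5-abcde",
  region: "eu-west-1",
  statusCode: 503,
  durationMs: 1240,
  cacheStatus: "MISS",
});
```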

A Story of a Web Platform Outage and the Importance of Observability

Now I'm going to take you through an actual outage that we had. We have a multi-tenanted web application. It serves multiple sites, so sometimes when things go wrong, it can take down several sites at once, which is what happened in this case. This is a view of what happened in Slack. I like to silently observe what happens, and see how people react. I like the human side of it, but also: where are they looking? What artifacts are they using? What are they saying to each other? How are they sensing out what to do next? Here's our cloud platforms channel at work. Somebody says, "I'm seeing DNS errors coming from rocket," rocket being our web application. Then somebody is saying, "I'm seeing some 503s here. It looks like I can't see the same thing happening in Datadog. Is it the CDN, Fastly? Where's the problem?" They're trying to find where we think this is stemming from. Then more people start to get involved and they're looking at it: "I found some artifacts." There are actually some graphs here that we have available to us to try and determine what's going on. Now we have an 8.5% error rate coming through, 13% including the 404s that are happening at the same time. That's fairly high. Something's definitely going wrong.

Then I think the interesting point here, at the end, is: we're continuing to investigate, but without an owner, or someone with domain knowledge to make the call, I'm not comfortable making this a P1 yet. Engineers are continuing to investigate the outage. That was probably somebody from IT service operations saying, I don't know what to do next; it seems like I don't know who to contact in order to make a judgment call about what we do. It's bad. I annotated this as well. Here's an initial punt at a hypothesis: a lack of instrumentation in the app. We use Datadog APM quite heavily, but there are other types of instrumentation we put in there as well. This is a way of enriching the context available to the humans that are looking at this: here's a graph. I think this is great. Also, don't overwhelm people with graphs. People love graphs. It's insane. Even at Condé Nast, and I'm sure it's not just one company where this happens, I went into Datadog recently and was like, "There are 50 graphs on this one view. I don't really know what I'm meant to be looking at here." Some of them made sense. Some of them I was like, "I don't even know why they're on here." I put pressure on them and was like, can we clean this up, please? If you really need them, put them in a different view. It doesn't need to all be here in one place. In a time of crisis, people don't have enough capacity to look at all that information. It makes sense.

It keeps going. We eventually get to a point where we create what we call a dedicated incident room, because we don't want to clog up the cloud platforms room, which is a shared public channel across the company that even non-technical people join, and there's other ongoing stuff that people want to ask about on that channel. Then you can see that people are starting to move across to that channel, and more things are shared. We've created a separate incident channel. I thought this was quite interesting; it wasn't an automated thing. Somebody has put, "note for postmortem: APM for rocket is not instrumenting React Router routes." I thought it was cool that somebody actually put that in there so that we could capture it later in our postmortems, or incident analysis. They're thinking ahead: we want to capture this, and we probably want to create an action off the back of it. That's good, because it means they've been trained over time to do this.

We can start to see something's going on. We have some socket errors here; maybe this is starting to create some cascading failures for us within our web application stack. Somebody probably needs to go and have a look at this as well. We also have other things that run in our environments. One thing I'll talk about later is that we use shadow traffic in order to mitigate the risk of deploying experiments, essentially. They're shutting down all these non-essential things in both non-prod and prod environments. Engineers are drawn to incidents like moths to a flame. Suddenly there are four people in there, and this guy is meant to be on holiday. He's got the beach icon. He's going in anyway. He's going to go and have a look, and probably start getting involved as well. It's like, just enjoy your beach holiday, it's just websites. Then after all of this time, and I like this as well, somebody offered yet another hypothesis, because they were given more context and had more time to think about what could be going wrong. This is really cool. Then we have some things that come back as more of an automated set of data that we can also capture in our postmortems. There are more people joining: we've got seven more people still joining this channel. I think in the end, we ended up having 30 or 40 people in this incident channel. I was like, "Why? It makes things worse." I think that Laura talked about this in her talk. It's really critical. How many is too many? It's a good question, and one that I would like to know more about, so hopefully I can speak to Laura about this later. I often see as well that people wanting to help can, in itself, sometimes create more chaos. This is also how I feel in a lot of these incidents.

Artifacts for Determining When Things Go Wrong

I'm just going to run through a couple of graphs. These are some of the artifacts that we can use to determine where things are going wrong, though maybe they're not so interesting to you, because you'll have your own. The question is, what's most important to look at at any given moment when you have an incident? We use Traefik quite a bit; it's a reverse proxy for Kubernetes, and we were seeing some things there as well. I'm not trying to do a Datadog sales pitch, but they give you flame graphs as well, for mapping your request lifecycle and seeing where things take the longest to complete.

There are so many types of testing. Think about it holistically: how does this fit into the way that you do testing at your company? Also, try not to be scared of testing in production. I'm a big fan of this, because I see a lot of reluctance in this space. I'm going to talk about a couple of the techniques that we use at Condé Nast to try and mitigate that risk and the fear that people have around this.

Failure Mitigation Techniques

This is a syslog configuration like the one we use for our traffic; this particular example is one I found on the internet. We use Fastly, which exposes syslog directly, so we can hook straight into that, capture our traffic, and replay it in another environment. We take production traffic and replay it in another environment, which is a like-for-like copy of production. For us, this gives a fair amount of certainty about whether or not things are going to go wrong before we actually have them running in production. Look at things like traffic replay; it's becoming more of a common pattern.
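A minimal version of traffic replay is just reading logged requests and re-issuing them against a like-for-like environment. The sketch below assumes a newline-delimited JSON log of requests, which is the kind of thing a real-time logging pipeline can produce; the field names and staging host are illustrative, not Condé Nast's actual setup.

```typescript
import { createReadStream } from "node:fs";
import { createInterface } from "node:readline";

// Sketch: replay logged production requests against a like-for-like staging
// environment. Assumes a newline-delimited JSON log with method/path/headers.
const STAGING = process.env.STAGING_HOST ?? "https://staging.example.internal";

interface LoggedRequest {
  method: string;
  path: string;
  headers: Record<string, string>;
}

async function replay(logFile: string): Promise<void> {
  const lines = createInterface({ input: createReadStream(logFile) });
  for await (const line of lines) {
    if (!line.trim()) continue;
    const req = JSON.parse(line) as LoggedRequest;
    const res = await fetch(`${STAGING}${req.path}`, {
      method: req.method,
      headers: req.headers,
    });
    console.log(`${req.method} ${req.path} -> ${res.status}`);
  }
}

replay("requests.ndjson").catch(console.error);
```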

Another thing that we do is use different deployment strategies. We do a lot of canarying at work, and we use feature flags really heavily. It may be somewhat easier when you work at a company like Condé Nast: it's a publishing company, things are heavily cacheable, and there's not a high frequency of change. We use feature flags in terms of running chaos experiments, and we also run about 200 experiments a day on product features that we ship: we'll switch on this feature toggle for 10% of traffic on X website for X feature. If you already have that capability in your system, then these concepts won't be that unusual. Think about how you do that as well. Canarying is a really great way of mitigating fear and risk around breaking things. Colin Breck gave the talk around Tesla yesterday, which was amazing, and I took one of his quotes from his blog post series. I said to him, if anything, I came away feeling really sold on functional programming, which I wasn't expecting because of the title. It was very well articulated in terms of concepts. Even with a Kubernetes ecosystem or infrastructure ecosystem, there's a lot that could go wrong. There are a lot of dynamic parts to it, a lot of message passing, things happening at different times, not linear and not very deterministic. Think about this as well: things can go wrong in different parts.
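The percentage-based toggles mentioned above are simple to reason about: hash a stable identifier into a bucket and compare it with the rollout percentage, so the same user consistently sees the same variant. This is a generic sketch, not the flagging system Condé Nast actually uses.

```typescript
import { createHash } from "node:crypto";

// Sketch of a deterministic percentage rollout: the same user always lands in
// the same bucket, so "10% of traffic on site X for feature Y" stays stable
// across requests instead of flickering.
function isEnabled(feature: string, userId: string, rolloutPercent: number): boolean {
  const digest = createHash("sha256").update(`${feature}:${userId}`).digest();
  const bucket = digest.readUInt32BE(0) % 100; // 0..99
  return bucket < rolloutPercent;
}

// Example: enable a hypothetical new gallery component for 10% of users.
console.log(isEnabled("new-gallery", "user-12345", 10));
```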

Steps Beyond: Building Human Resilience

We talked a lot about how to get started with chaos and some ideas around that. This is about building human resilience in an organization, which I actually feel is one of the most important parts. Another quote here from Thinking in Systems is, "We are too fascinated by the events that systems generate. We pay too little attention to their history," history being key, "and we are insufficiently skilled at seeing historical clues to the structures from which behaviors and events flow." People will often ask, why did that happen? How did that happen? Often, they're looking for a short-term clue or a set of clues. It really takes historical context to get away from hindsight bias and from making short-term judgments. Think about this quite heavily.

What Next: An Alternative Approach to RCAs

I love this post as well, which is about how human work actually gets done, and it offers an alternative to root cause analysis, which I think is a little bit bullshit. Don't worry, everyone at work still uses it; I'm trying to beat this out of them, but they still use it. It describes different ways to think about how humans work. There's work as you imagine it: you have your own mental model of what it's going to be, and maybe you can even play it out in your own head. There's work as prescribed by your company, or your manager, or from other places. There's work as you disclose it to others, the things that you've done: it could be a diagram that you've drawn and want to share, or how you describe the solution that you built. Then there's the work as it's actually done, how you really did it. The blog post goes into a lot of depth about why this is important, because it's like a Venn diagram: there's some overlap, but there are some distinct differences. The work that you imagine in your head isn't going to be 100% fully shared with everyone. The work that you did is never going to be 100% shared with all the humans that you work with. Their mental models will not always align with yours. This is very important in terms of incident management, and wanting to do chaos engineering or building up resilience. Part of the reason why we ran all these experiments, these game days, and this theorizing around different things, was to try and align mental models as best as we could. Our systems are dynamic, so they will keep changing, and that information gets out of date very quickly.

Actions: What Can We Turn Into Hypotheses/Experiments

We do postmortems and incident analysis all the time. I'm very hot on this, and I make sure that they do it. How can we actually turn those actions into hypotheses and experiments that we can run? It's all very well creating actions; they get put in Confluence, which is where best intentions go to die, essentially, in my mind. Everybody knows that. How can we take this and turn it into something tangible and actually useful, that maybe will uncover more things, and maybe will create more actions, because we like creating work for ourselves? It's really important. One thing I also found from reviewing a lot of the actions was that they're very technical and very specific. It felt very root-causey: "This particular part did this. Therefore, we're going to scale this up. It's going to fix everything." I was thinking, we probably need to step beyond that again. It's fine to have technical actions in there, because we have technical systems.

Something that's often overlooked would be an action like this: gaps were identified in architectural knowledge, so get somebody to do a rotation. We actually had some product engineers do rotations through both our cloud platforms teams and our SRE teams specifically for this reason. Ultimately, SRE teams and cloud platforms teams still produce products, even if they're technical ones, and how might product engineers experience them as users? It's also about bringing that knowledge back into the product engineering teams, and making sure that knowledge is shared more widely around the organization. On incident management flows, I would say, keep repeating them as much as you can. When you have new people, they also need to be onboarded in some way; I'll talk about how we onboard people as well. Too many graphs, too much information, information overload, I don't know where to start: the cognitive load on the people is too high. How do we turn that into a more realistic set of information for the people that are on call?

On Call University

We have something now called On Call University where I work. It is mandated, so all engineers must join it. I think it's great. We only just started it two weeks ago, and we've now run it twice. When somebody joins our organization, they don't go on call straight away. We tell them in the interview, obviously, that this is an expectation, but we want to train them up a bit. It takes some people longer than others. Sometimes it can be a personality thing: how comfortable are you with things that can be quite challenging? It can also be how much experience you have. It's a mandated thing: within six months, we want to make sure that people get onto the on-call radar. It's also about providing ownership around alerts and incidents, because I think that's something that a lot of companies, ourselves included, have historically done quite badly. Especially as people join teams, move teams, and organizations shift, you need to make sure that you have consistent ownership there. Then it's knowing what's expected of you on call. Of course, you get stickers if you do this, which is amazing.

SRE Weekly Newsletters

We've also started sending out a weekly SRE newsletter. It's good having a team established again; we're lucky to have that. Organizational change is really important too. It's all great if engineering is doing some great work, but we want to share with the rest of the company why this is important, what we're doing to track this stuff over time, and the health of our services and our applications. It's internal marketing, selling if you want, to say this work is really important. It's not wasted. It's not some science project that we dreamed up. We have different things in there, and this has grown. We've been running the newsletter for about five months and we've iterated on it quite a bit. The very first thing I said to the SRE team was, can we please clean up our legacy alerts? Because we got to the point where people were just ignoring alerts when they were on call, because it was fatiguing; we had too many alerts. I loved the way that they did this: in red are all the legacy alerts, and here are the new SRE alerts, what we fundamentally feel is important in terms of a customer-related goal, not just some ping that you get because there's a 500 or 400 error. This has evolved over time. We can follow the trends over time and see if we're doing better or worse. Sometimes it just depends; we might launch a new product, and that might have an implication there too. Then we can also track risks in terms of how high a priority they are for the organization.
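The shift from "a ping because there's a 500" to a customer-related goal is essentially the shift to SLO-style alerting, where you alert on how fast the error budget is being burned rather than on individual errors. A toy sketch of the arithmetic, with made-up numbers:

```typescript
// Toy error-budget arithmetic behind SLO-style alerts: page on budget burn,
// not on individual 5xx responses. All numbers are invented.
const slo = 0.999;                // 99.9% of requests should succeed
const windowRequests = 2_000_000; // requests so far in the 30-day window
const windowErrors = 1_400;       // failed requests in the same window

const errorBudget = (1 - slo) * windowRequests;       // allowed failures: 2,000
const budgetRemaining = errorBudget - windowErrors;
const budgetBurnedPct = (windowErrors / errorBudget) * 100;

console.log(`error budget: ${errorBudget} requests`);
console.log(`burned: ${budgetBurnedPct.toFixed(1)}%, remaining: ${budgetRemaining}`);

// A simple policy: only page a human when most of the budget is gone.
if (budgetBurnedPct > 75) {
  console.log("page on-call: error budget nearly exhausted");
}
```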

Dynamic Artifacts

I've also started thinking about dynamic artifacts. Having been an engineer myself for a very long time, almost 20 years, one thing that I see time and again is that we create artifacts, but they get stale straight away. Because of the dynamic nature of our socio-technical systems, it's sometimes difficult to know how interactions work. Who knows the most about what? Who's got the most domain expertise? These are a couple of things we're looking at at the minute at Condé Nast. We have something called Ardoq, which is more like a dynamic way of outlining your technical architecture. It has an API, which is hackable, so you can actually change things more dynamically there. It gets us away from using Confluence too heavily to track our technical estate, which is also good for onboarding new engineers, and it makes sure that that information is always really fresh. Then we did some static analysis late last year, working out through GitHub how different teams commit to different repositories, who commits the most, and who pair programs with whom the most, trying to work out relationships and create a more dynamic maintainership model. I think that Nora might have a little bit more to say on this, because when she was talking to me yesterday, she was saying we might think that just because somebody is committing a lot of code to one repository, they're the expert. That's not always the case. For us, it's a good stepping stone.
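The static analysis she describes can start very small: pull recent commits per repository from the GitHub API and count authors, as a first approximation of who works where. The repository names below are placeholders, and, as noted above, commit counts are only a rough proxy for expertise.

```typescript
// Sketch: a first-pass "who commits where" map using the GitHub REST API
// (GET /repos/{owner}/{repo}/commits). Repository names are placeholders.
const repos = ["example-org/web-frontend", "example-org/cloud-platform"];

async function commitCounts(repo: string): Promise<Map<string, number>> {
  const res = await fetch(`https://api.github.com/repos/${repo}/commits?per_page=100`, {
    headers: process.env.GITHUB_TOKEN
      ? { Authorization: `Bearer ${process.env.GITHUB_TOKEN}` }
      : {},
  });
  if (!res.ok) throw new Error(`GitHub returned ${res.status} for ${repo}`);
  const commits = (await res.json()) as Array<{ commit: { author: { name: string } } }>;

  const counts = new Map<string, number>();
  for (const c of commits) {
    const name = c.commit.author?.name ?? "unknown";
    counts.set(name, (counts.get(name) ?? 0) + 1);
  }
  return counts;
}

async function main(): Promise<void> {
  for (const repo of repos) {
    console.log(repo, Object.fromEntries(await commitCounts(repo)));
  }
}

main().catch(console.error);
```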

Recap

To recap: complexity and resilience, trying to talk through what we mean by these, resilience more importantly. How do we use chaos engineering at Condé Nast? Hopefully it makes you feel a little bit more able, and makes it feel less scary, as something you can actually start doing at your own company today. We went through a case study: how did that happen? What were the cues, the signals, the different artifacts that people were linking to? What should you be paying attention to when this happens in your own company? Then also human resilience: how to grow that within your own company, and how to advocate for it too. And finally, thinking about how we're evolving our own practices internally, moving towards being more dynamic, making sure that things stay really fresh, and making sure that everybody is on the same page, through On Call University and sending out weekly newsletters.

Questions and Answers

Participant 1: When you're doing your game day analysis, is it just working on the incident that you're talking about or does it include the retrospectives and postmortems? Do you just say, here's what you should do next time? Do you capture all that data as well?

Hirschorn: We do. We try to look at historical information to inform the next set of game days; it won't just be the last thing that happened. Within a postmortem we can take up to 20 actions, and then we might prioritize those a little bit. It's all very well saying, we're going to fix X, Y, Z; it's more interesting to think about how we can turn that into an experiment that we can actually run through. We do look back at historical postmortems to try and draw those things out.

Participant 1: Does that also influence the next game day: how do we do the game day better?

Hirschorn: Absolutely. It does capture how we could actually improve together as a team at troubleshooting an incident. That's what I was mentioning before as well: capturing not just very technical actions, but also more human-focused actions and process actions.

Participant 2: Do you also consider any cost implications around spending time versus the downtime of an application or a website?

Hirschorn: Sorry, downtime in what way?

Participant 2: Specifically, if you're trying to do game days on production environments, might that have an implication of downtime?

Hirschorn: Running an experiment around that?

Participant 2: Running an experiment on production means downtime for a customer, or somebody?

Hirschorn: We have done that. I think some really advanced companies do things like deploying experiments through their pipelines. We don't do that, but we have run experiments on production. When you have mechanisms like feature flagging, or being able to route traffic to a certain percentage, it really helps. They call it minimizing the blast radius, essentially, in terms of who's actually impacted as a customer. You have to get to that point, really, if you want to get more robust with your chaos testing. You need to test in production.

 


 

Recorded at:

Sep 18, 2020

