
How Condé Nast Succeeds by Building a Culture that Embraces Failure



Crystal Hirschorn shares lessons learned from building a culture that embraces failure through Chaos Engineering practices as a daily routine, and what her teams have learned and adapted for their platforms at Condé Nast International, which currently serves in excess of 220 million unique users every month across the globe.


Crystal Hirschorn is Director of Engineering and Cloud Platforms at Condé Nast, which is known for its portfolio of brands: Vogue, GQ, Vanity Fair and Wired. Previously, she led the online technical strategy for many BBC News election events, including the last general election, which served more than 65 million requests in a 24-hour period, with traffic peaking at 3.2 million concurrent requests.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Hirschorn: I'm the director of engineering and cloud platforms at a company called Condé Nast, better known for their portfolio of magazines, which I'll come on to. I oversee the whole engineering function there, so it's all types of engineering: data engineering, software engineering, platforms, SRE, the whole lot.

I'm going to start off with this example. I don't know if any of you are fans here of "Choose Your Own Adventure," also known as "Fighting Fantasy" novels. I was a big fan; I'm an '80s child, so I read a lot of these stories growing up. I think the reason why I use this example is it summarizes how we feel sometimes about our systems and our code: we think we know what's going on, but a lot of times we don't. There are so many paths through the code that we often don't know about. I try to find real-world analogies to try and explain how we think about our systems.

Like I said, it's a big company. We're better known for our publications; we have "Vogue," "GQ," "WIRED." Yay, you've got a tech brand in there. "Vanity Fair," "Glamour," the list goes on. We are a global company, so we have a lot of presence here in Europe. We also have an office here in London, which is the headquarters that I work for. We have presence in harder regulatory environments, like Russia and China. We're finding it quite a challenge to get our platforms rolled out there; it's quite an interesting challenge. As you can see from this diagram here, we have different brands in different markets as well, so not every country has the same brands. Also, the way we have to build our platforms and our systems is to be multi-tenanted and cloud-agnostic as well.

This might give some people a bit of an epileptic fit, so sorry about that, but this is a good illustration: we have 11 wholly-owned markets, but we operate in 28 countries at the minute; those are our licensee markets as well. Beyond getting our platforms rolled out for our wholly-owned markets, the vision is to use our platforms in all markets. By the time we get to that point, I reckon it'll be 30-plus countries where we're rolling this out.

Before I start on my talk, I want to go into my first war story, which is quite a recent event, and it was a complete company-wide security incident, we call it Shannon Gate. It caused quite a stir, particularly in New York because that's where it happened. What happened was this lady, her son had won some "Golf Digest" contest, and she was trying to work out how to claim this reward. She called in and she was getting really frustrated. She called 20 people in the company, she couldn't get through to anybody, so she just started hitting random numbers on the automated telephony system and got through to some archaic bits of the IVR that left a voicemail for the whole company. She spammed the whole company. Of course, everybody thought it was hilarious because "Hello, my name is Shannon Butcher, I'm calling about a 1099 form," which I've no idea what that is, apparently, it's an IRS form. She's "I'm calling for my son."

Then the whole company wakes up on Monday morning and people start going, "Did somebody get a voicemail from Shannon Butcher?" Then it was, "Yes, I got that voicemail." Then you can imagine the kind of engineers in Slack, it went crazy basically. Lots of memes started popping up. She's really famous in the company, I don't think she knows it. Yes, it's quite funny, even somebody said, "I didn't get a voicemail from Shannon. Does that mean I'm being let go?" Which I thought was quite funny, especially about Condé Nast, what it's going through at the minute. Yes, somebody said "Hello, is it me you're looking for? This is 1099 form."

I'm just going to play you a clip, somebody remixed the actual voicemail. That's how popular it got. Yes, I love it. It's just geeks gone mad, basically.

Anyway, people were so excited about this that they kept calling her, by the way. This was cybersecurity; that channel never gets much attention, nobody cares about it that much until something like this happens, and they were, "Wow, this is the most popular we've ever been." Shannon eventually just told somebody, "Can people at the company please stop calling me? I've literally spent the last few nights kept awake by Condé Nast employees calling me in the middle of the night." Yes, poor Shannon.

What are we going to talk about? The point of that story is that things will fail in weird ways that we could never imagine. I think it's good to share these stories with you all because that's the way that we learn. We don't brush things under the rug; these things will happen. Sometimes it'll happen in a fairly isolated way, sometimes at a massive scale like this. Luckily, she was non-malicious, but a hack like that could have been pretty bad.

What we're going to cover today is the DevOps culture that I've created at Condé Nast. I've been there just over two years; I was really early in the company as employee number 15. I had three software engineers to start with. I'll come on to the size of the team as it is today; it's about 60. We'll talk about the data inference and the data collection points that I've used to understand how teams collaborate, the actual data, through GitHub and various other tools, and how our system architectures have evolved.

I'm going to talk about adoption and sponsorship, because now, being in a more leadership role, I understand the dynamics at play at the more senior levels of an organization and what you can do. Whether you're really senior in your company or just an engineer on the team, there are different ways that you can influence people, and it's about the psychology behind that. I'll talk about resilience engineering; it's a very vast subject, is what I would say. To preface that, I'll talk a little bit about it, but then I'm going to move on to Chaos engineering and how we are starting to use that at Condé Nast. That partners closely with observability, because breaking things is not really good if you can't tell what your systems are doing.

Where We Started: Anti-pattern

What about us? We do 20 to 50 deploys to prod per day. I got one of my engineers to collect some data on this, so this is fairly accurate. I have about 60 software engineers in London, 20 IT ops. I don't look after IT; it has been an interesting challenge to try and bring the two things together. I have a counterpart who looks after technology services, and together we run the whole technology group. We have 13 product development teams based here just in London, and then we have 12 engineering teams distributed around the world. There are 11 wholly-owned country markets, and now we have merged in with the U.S. It used to be our sister company, but about three months ago we merged. Then obviously dev and ops: traditionally they've operated in silos, so how do you make them operate more collaboratively?

Where do we start? There's this really great website that I would recommend called DevOps Topologies. They're coming out with a book soon; I think Matthew Skelton is one of the main people that developed all these different patterns and anti-patterns. It's really good to understand where you are today, and to try and find some artifacts like this. They're never going to be perfect is what I would say, because, within any organization, what your actual culture is is pretty nuanced, but I felt this summarized it quite well, especially for me to try and articulate that to the rest of the business.

This anti-pattern is dev thinking, almost a little bit naively, that they can take on all the operational work. Then ops operating in a silo, so you get the fence-throwing situation. Then you get a real us-and-them culture happening as well: they broke this, they did that. How do we stop that? Because that's just BS, basically; your customer doesn't really care about your internal politics.

This is our aspirational model, is what I'd say. We're still trying to get there, but we have made a lot of progress. I spent a lot of time soul searching around, "Do I want SRE as a function within our organization?" It probably makes sense for us because of our scale, and the fact that we have to operate around the globe. I encourage my teams not to rely on SRE as the only bridge into operations. Ops and dev work well together, so they pair together and come up with processes together. We're on call together; all my engineers are on call, by the way. It's not mandated, but it's highly incentivized. If anybody wants to know what I did around that, come and ask me afterwards. I try and make it as well compensated as possible, but people still don't want to be woken up at 3 a.m. no matter how much you pay them. I'll talk about that later. That's where we're getting to today, having this kind of model.

Our Service Map

I'm going to talk about some of the data that we collected around, what do our systems look like? This is a pretty static service map of most of the systems. We're in a more service-oriented architecture at the minute, and we're migrating towards microservices. That has happened in some cases but not in all cases; I just want to be really honest about that. That's fine, a lot of you will be on that journey, if that's what you want to call it, as well. We leverage a lot of really cool tooling; Datadog, we use a lot. I don't work for the company, but I will sing their praises because I think that their tooling is getting more and more sophisticated, and it's giving us more of the knowledge about our systems that we need without having to build it ourselves. Some real pragmatism, much more in the buy-versus-build camp like Sarah: you spend your development effort where it differentiates your business, not on things that already exist. It might be quite cool for engineers to build them, but you have to resist that urge a little bit.

I'm just going to quickly show you what these dynamic service maps give you. It's pretty cool: it shows you how your different services are connected, the average request time, latency, all sorts of things like that, and the error rates that you're having per service as well. You can really deep dive into these things. It also surfaces things like, "Where are the dependencies in my system that I didn't know about?" Which is really cool. It's a dynamic thing, so it's constantly being updated.

Now I want to talk about some of the stuff that we did. This is a book called "Your Code as a Crime Scene," and it talks about how you can use forensic techniques to do static analysis on your code and try and work out different ways of looking at it. You can see the dynamics between people and teams. I'll show you some screencasts of things that we've drilled down into. Where's the biggest area of change? Where's the biggest area of complexity? Where's the highest rate of churn as well? It's about making more informed decisions about architectural approaches and how you want to evolve your systems over time, and where the coupling and dependencies are that you perhaps didn't know about.

That book, by the way, I'd say it depends. If you're working on a monolith, you might find it really valuable, but if you're already in a microservices approach, it's questionable how much you'll learn from it. He came out with some GitHub tooling, which is all open source.

This is something we're looking to do at Condé Nast right now, which is following more of an inner source and maintainership model. We set up to do product development, and we were doing MVPs with teams based around certain products, but we want to build a much more flexible model, especially if you're moving towards a microservices approach, since your architecture somewhat informs that. It's about mission-led objectives within an organization and people being able to move between those missions. Another thing that Sarah said yesterday is people will leave, people will change teams; it doesn't stay static. How can we know who the experts are, where our domain expertise is, and who those people are within the organization, so that we don't lose that over time? We can create a map from that.

One of the things we looked at was cyclomatic complexity, which is essentially, "How many logical execution paths are there in my code? Where is it most complex, and where is it least complex?" The reason why it's interesting to have this is to try and work out, "Where should we be thinking about refactors? Where can we make the overall architecture less complex for our needs, or where do we need to maybe split things off into separate services?" I highly recommend going and getting these tools, which are all open source; I think it's called Code Maat. There's a hosted version called CodeScene as well, if you don't want to run the open source tooling, but it's all there. It gives you the graphs as well.
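To make the "logical execution paths" idea concrete, here is a minimal sketch of a cyclomatic-complexity counter using Python's standard-library `ast` module. This is a rough, common approximation (start at 1, add 1 per branch point), not the Code Maat or CodeScene implementation, and the sample function is invented for illustration.

```python
import ast

# Node types that open an extra execution path; each adds 1 to the
# complexity of the enclosing function (a common rough approximation).
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                ast.BoolOp, ast.IfExp)

def cyclomatic_complexity(source: str) -> dict[str, int]:
    """Return an approximate cyclomatic complexity per function."""
    tree = ast.parse(source)
    scores = {}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # 1 for the straight-line path, plus 1 per branch point inside.
            score = 1 + sum(isinstance(n, BRANCH_NODES)
                            for n in ast.walk(node))
            scores[node.name] = score
    return scores

# Hypothetical snippet: one `if`, one `for`, one nested `if` -> 1 + 3 = 4.
example = """
def classify(x):
    if x < 0:
        return "negative"
    for _ in range(3):
        if x > 10:
            return "big"
    return "small"
"""
print(cyclomatic_complexity(example))  # {'classify': 4}
```

Run over a whole repository, the per-function scores give you exactly the kind of "where is it most complex" heat map described above.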

We also looked at knowledge maps between teams: how do teams interact amongst themselves? You can see here, you can look at different parts. You can look at who the developers are, who the teams are. Where are the hotspots in the codebase? Who is reviewing, who's pull requesting the most, and who's committing code to certain repos the most as well? I think it's really interesting to see the team dynamics, because that's not something you would necessarily know when you're running a team of 60 people. This is really hard data that you can use to go, "Ok, yes, these two people should be the maintainers on this specific repo." You can build your map out like that.
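A knowledge map like this can be sketched from plain `git log` output. The snippet below is a minimal illustration, assuming the log was produced with `git log --pretty=format:'--%an' --name-only`; the author names and file path are hypothetical, and real tooling like Code Maat does far more.

```python
from collections import defaultdict

def knowledge_map(log_text: str) -> dict[str, dict[str, int]]:
    """Build {file: {author: commit_count}} from git log output produced
    with: git log --pretty=format:'--%an' --name-only
    Each commit starts with a '--Author' line, followed by file paths."""
    counts: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
    author = None
    for line in log_text.splitlines():
        line = line.strip()
        if line.startswith("--"):
            author = line[2:]          # new commit: remember its author
        elif line and author:
            counts[line][author] += 1  # file touched by that author
    return {f: dict(a) for f, a in counts.items()}

def suggest_maintainers(counts: dict[str, dict[str, int]], top_n: int = 2):
    """Pick the top committers per file as candidate maintainers."""
    return {f: sorted(a, key=a.get, reverse=True)[:top_n]
            for f, a in counts.items()}

# Hypothetical log: alice touched the file twice, bob once.
sample = """--alice
api/render.js
--bob
api/render.js
--alice
api/render.js
"""
km = knowledge_map(sample)
print(suggest_maintainers(km))  # {'api/render.js': ['alice', 'bob']}
```

The per-file top committers are exactly the "these two people should be the maintainers on this repo" signal mentioned above.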

Then there was one specifically around developers as well. The previous one was more around teams, how they commit code, and where the highest rate of churn is; then there's one more around specific developers. That was quite interesting as well, because you could see there were certain dynamics between different developers: some developers would help out other developers a lot, and some would not interact at all. How do we make sure that we're encouraging that pairing and that knowledge sharing all the time?

Then the last thing we looked at was systems evolution: how have our systems changed over time? What are we looking at here? Does anybody know what this is? A burndown chart, the worst thing, such an anti-pattern of project management. We don't use them, by the way; terrible. That brings me on to my next topic. I love "Star Trek: Discovery," I was watching it and then Lorca, he's an evil bastard, but he said this quote, "Universal laws are for lackeys, but context is king." I was, "Wow, so inspirational, I'm going to put that in my slides." No, seriously, I do genuinely believe this as well.

That's why collecting these kind of artifacts that we have like this, especially dynamic ones that we can produce, are really beneficial about understanding our systems. I'll come back to that point in a little bit.

Adoption and Sponsorship

I'm going to talk a little bit about adoption and sponsorship. That's the hard part really. I think a lot of times people are, "But where do I get started? How can I get the company to buy into this?" What I would say is, it's a really long road. I'm here talking to you and I started this journey two years ago. I was in a pretty privileged position because I was the director of engineering, I was staff number 15, and I could set up this culture in the way that I wanted right from the start. It's much harder when you join an organization that's already entrenched in a certain type of culture; I've been there in my engineering journey before too.

Firstly, what are we adopting? A lot of engineers - I've been guilty of this - they focus on the tooling and not the actual cultural part, which is the hard part, making humans change and take on a different mindset, basically. I think this is a good illustration, "Still working on that tooling and nothing really has changed much."

I love "Adventure Time" as well. If you're into it, you might know this meme. It's, "Why not both?" Yes, you don't need to choose, you can have both. You can have tooling and you can have culture, it's beautiful.

There's a really great talk. Gremlin did a Chaos conference, and there was a guy called Kriss Rochefolle, who's quite big in the Chaos engineering scene. He gave this really interesting talk about psychology and how to get sponsorship. Chaos, as a word, sounds quite scary; if you start saying that to stakeholders, they might panic and think, "Oh, is that something we definitely want to do?" I think that's why we're reframing Chaos as resilience engineering: it sounds much more friendly. I think it's about framing the right kind of questions. This could be to execs or somebody in your management chain. It could be to other stakeholders in operations if you're not part of operations, or in development if you're part of operations. Why does leadership need to care? Why do I need to care?

I know every day there's 100 and something priorities that I could be spending my time on. That will change day by day, week by week, but I need to know an elevator pitch: why do I need to care? If you can get them onside, which I'll come on to the dynamics of, then that will give you a lot of progress. What data can we show? What tactics can we use? Gremlin also do some really cool stuff around cost of downtime; Gremlin is a Chaos company. They produce a lot of APIs and hosted services where you can basically do Chaos as a service, but they talk a lot about cost of downtime. I think it's important to produce those reports for your own company. It's interesting because there are articles already out there about how much Walmart will lose every single second that it's down.

Also, it's not just about the revenue cost; it's also about the trust that you have with your customers. If your service goes down more than a couple of times, you start to lose your audience. Building up loyalty again is, from my perspective, probably the harder part.

Case studies, what are other organizations doing? I go out and speak to other media and publishing companies all the time to try and work out "How are they doing it? How can I use that as evidence of the way that we should be following this as well?"

Then build on metaphors that they understand. We talk about technical debt, and I think that's fairly well understood in companies, but there's operational debt as well. What's the cost of not dealing with that over time? It needs to be iterative, meaning you deal with it all the time.

Then I'm going to talk about the sociodynamics distribution. This was in Kriss Rochefolle's presentation, which I would also recommend going and watching; it's referenced down at the bottom of the slide. It talks about this influence model that they use in psychology. You can have people who are zealots; they don't need any convincing, they think your ideas are great. Then you have a fairly big body of influencers; they're usually on side with you. They usually agree, but if they're not part of it, you can also lose them, and they can become the moaners and the opponents. There are people that will cause schisms as well, so tread carefully there. There are three key groups that he says are worth investing your effort into trying to influence.

I think you need to have a real think about who your stakeholders are and where they might fit within a framework like this. You might not always be able to guess, but I think it is worth using a model to try and frame that. Your waverers, they call them fence sitters: they are not that invested, they could go either way. They could become your influencers, but you need to spend time with them to try and convince them. There's another group that you probably need to invest even more time in, called the passives: they just frankly don't care. They don't know how it affects them at all in their day-to-day work or their organization.

The group that needs the most convincing are the moaners. Some moaners could have been influencers at one point, but if you weren't engaging them through the process of trying to make this change, they'll often fall into this camp, because they were, "Well, I supported you and now I definitely don't feel part of this process. I feel like you're just inflicting this on me." I think this group, they could be brought back to being influencers. I think that that's the key thing, if you bring them back around, you can build them up to be your influencers again. It's trying to build up that influence as much as possible throughout the organization. That's at all levels as well.

I've already mentioned this tactic; this is referencing Gremlin's article about the cost of downtime. How much revenue is it costing you per second? Execs love hard numbers, and things like this can be really useful.
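The arithmetic behind such a report is simple enough to sketch. This is a back-of-the-envelope calculation, assuming revenue is spread evenly over the year (a crude but exec-friendly simplification); the revenue and availability figures are entirely hypothetical.

```python
def downtime_cost(annual_revenue: float, availability: float,
                  seconds_per_year: int = 365 * 24 * 3600) -> dict:
    """Rough cost-of-downtime figures, assuming revenue is spread
    evenly over the year."""
    revenue_per_second = annual_revenue / seconds_per_year
    downtime_seconds = (1 - availability) * seconds_per_year
    return {
        "revenue_per_second": revenue_per_second,
        "downtime_seconds_per_year": downtime_seconds,
        "revenue_at_risk_per_year": revenue_per_second * downtime_seconds,
    }

# Hypothetical numbers: $100M annual digital revenue at 99.9% availability.
figures = downtime_cost(100_000_000, 0.999)
print(f"${figures['revenue_per_second']:.2f} per second")
print(f"${figures['revenue_at_risk_per_year']:,.0f} at risk per year")
```

Even this toy model makes the pitch concrete: at 99.9% availability, that hypothetical business tolerates roughly 8.7 hours of downtime a year, with about $100,000 of evenly-spread revenue at risk.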

Resilience Engineering

Resilience engineering, what is it? It's the intrinsic ability of a system or organization to adjust its functioning prior to, during, or following changes, disturbances, and opportunities so that it can sustain required operations under both expected and unexpected conditions. I think the key thing here is "both expected and unexpected conditions." This guy, Erik Hollnagel, is one of the most influential people in resilience engineering, or resilience as a topic. He's been writing about this for 10-plus years at least, and he's done some great work, both white papers and books. I'd recommend looking him up.

They created, I can't remember exactly who, but a group of them, probably including Erik Hollnagel, created a set of organizational factors for resilience. These were the top ones. I talked about commitment, so senior and exec-level commitment. Something that John has talked a lot about in blog posts, which was pretty pivotal to my thinking early on, was just cultures and blameless postmortems. It's referenced here as well, so you can go and read that. He wrote it all the way back in 2012; it's quite an old article, but it's still really relevant. I feel like a lot of companies aren't doing this yet: being a company that can learn, and having a growth mindset.

The reason I was telling you about Shannon Gate is that we didn't sweep it under the carpet and pretend it never happened, because then we ultimately don't learn anything from it. What ends up happening otherwise is you end up blaming people, and you don't learn how to fix it.

Transparency about risks and the awareness of them, being prepared as much as possible. We talk about failure; sometimes it's unexpected, and you can only prepare for it so much, but you should be prepared. Also really important are flexibility and capacity.

One of the most problematic things that we find when we're doing postmortem reviews and post-incident analysis is certain types of bias. The two that really catch us a lot are hindsight and outcome biases. I try to put a lot of references in my talk just so that people can look at the slides later and go and read this stuff, because it's really interesting, but it's too much to talk about in this short time. It's about how we look backwards over time. I don't know if any of you saw the movie "Sully," which was about the landing on the Hudson. The veracity of how they showed the courtroom situation has been challenged, but I think it's really good. I would encourage you to watch it, because it shows how we can look back in time and think that something was so simple: how could they not have done X, Y, Z? Then it starts to focus the blame on people again: why did they take that action? It should have been really obvious. What they did was put some pilots in a flight simulator and say, "Ok, see what your options would have been at that time." Of course, they were, "Oh yes, he definitely would have been able to make it to this other small airport, he would have had enough time." Then he challenged that and said, "But they knew the outcome already, so they were able to go straight to the conclusion." In the moment, he's, "I was trying to make the best judgment that I could." It takes time to react, and there are other judgments that humans can make. I'd say systems are limited; we talk about automating systems and stuff, but humans are really good at being able to fix things and to work out in the moment what the best steps are.

It was really interesting because then he made them go back and say, "Ok, we'll test this again. You've got the black box recording, so you can see roughly when I was taking certain measures, what we were trying within the cabin or the cockpit. Try that out and see what happens." They found out that yes, the best option was to land on the river. There's no way he would have made it to the airport; he would have hit loads of buildings on the way. I think this is really interesting.

A company can become too complacent when things are going well; we think past success is a measure for future success, and that's not true. Things are changing all the time, with constraints that are out of our control. When we think about it from a software perspective, we have lots of third-party vendors now that are changing stuff, and we're constantly changing our systems and our code. How could that ever hold? We're judging on a past state, and the current state is not that, so be aware of that as well. One thing that Google does which I've found really cool is simulated outages. If something hasn't failed for a while, they make it fail, because first of all they don't want customers becoming too complacent on their services, but also they want to test what happens in certain scenarios.
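The "if it hasn't failed for a while, make it fail" idea can be sketched as a simple scheduling check. This is not how Google actually implements it; the service names and the 90-day quiet threshold below are invented for illustration.

```python
from datetime import datetime, timedelta

def services_due_for_outage(last_failure: dict[str, datetime],
                            now: datetime,
                            max_quiet: timedelta = timedelta(days=90)):
    """Return the services that have been failure-free for so long that a
    deliberate, scheduled outage is worth running, so consumers never come
    to assume the dependency is always up."""
    return sorted(name for name, when in last_failure.items()
                  if now - when > max_quiet)

# Hypothetical failure history.
now = datetime(2019, 6, 1)
history = {
    "image-resizer": datetime(2019, 5, 20),  # failed recently, leave alone
    "legacy-auth": datetime(2018, 11, 1),    # suspiciously quiet
    "content-api": datetime(2019, 1, 15),    # also overdue
}
print(services_due_for_outage(history, now))  # ['content-api', 'legacy-auth']
```

A check like this, run periodically, turns "we haven't had an outage in ages" from a comfort into a prompt for the next planned failure.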

I know I'm trying not to talk too much about what I know John is going to be talking about in his next talk, but this is a Venn diagram of work as perceived versus work as done. This is something else to bear in mind. We have work that we imagine, and in our minds we're already working out how we would do this work as prescribed. Sometimes that's coming down from a project manager, it could be something else. There's work that we disclose, but then there's work that we've actually done. As you can see, there's not a lot of overlap here between what's been imagined and prescribed and what's been done. People's mental models of how a system looks are often very different. Be careful about this as well, perceived versus actual. I've got a key point about this in a minute.

I'm not going to read this out. I love this book, by the way, so I'd urge you to go and buy copies of it. I probably took about 1,000 quotes out of it; choosing my favorite was really hard. It talks about how we can become too complacent, believing that past success ensures our future success. Particularly within an organization, people have this belief, and it's almost a shock to the system when something breaks or there's a big outage, because we're, "How could that have happened? We had risk under control. We knew exactly what we were doing."

Post-accident attribution to root cause is fundamentally wrong. Because overt failures require multiple faults, there's no isolated cause of an incident. I've got a bit of a bugbear, which is that we talk about root cause quite a bit still as an industry, but our systems generally don't fail because of one reason, it's usually lots of reasons. It's both human and technology, it's not this particular service failed and therefore it had this cascading failure.

The book is really good as well. It talks about the sharp end and the blunt end of the spectrum. There are things that are outside of our control; the way that governments set regulations can put constraints on our systems, and from my point of view that's particularly true operating in Russia and China. There are a lot of challenges there. The practitioners at the sharp end have to deal with a lot of that as well, all of that complexity. It's not just about building systems; it's also about all the constraints that the blunt end applies to your systems. That could also be management just not understanding what you're doing.

Evaluations based on such reasoning as root cause do not reflect a technical understanding of the nature of failure, but rather the social and cultural need to blame specific, localized forces or events for outcomes. This is about the need to blame, and trying to get away from the need to blame, because that doesn't produce any kind of learning.

Chaos Engineering

Sure, you've all been waiting for this: Chaos engineering. To preface this, you can't just jump into Chaos engineering; that's not something we did from day one. It was a dream of mine for a while because I used to work at the BBC. We did a bit of that in a couple of teams. I was part of their platform team, and we did that there; then I moved on to BBC News and we did that there as well, so I came in with the ambition to always do this. I knew that we had to reach some sort of maturity first in order to do this.

I really like this quote, "The cost of failure is education," if you let it be. Getting to a blameless culture, the whole point is to educate yourselves and to build more robust systems and more resilience into your practice as well.

There's a guy in the industry called Casey Rosenthal. He was at Netflix, I believe on their platform team. He helped write the book about Chaos engineering, and this is a quote that he uses to try and describe it. It's about building systems that discover the chaos already inherent in the system (I've got a bit of a problem with the word "chaos" in there). It's the facilitation of experiments to uncover systemic weaknesses. Systemic weaknesses we often don't know about; we have this mental model of how our systems work, but this is trying to interrogate that, to work out whether or not our mental model is true, and to update our mental models as well.

Sarah talked about this as well: in order to understand where to start your experiments, you need to understand your steady state and use a bit of scientific rigor in what you're doing. Follow this kind of process of building a hypothesis, running the experiment, and looking at the results; this ties really well into the observability bit that I'm going to talk about next. What do you learn from that as well? Another thing that I see a lot of companies do is they get the results and they're, "Great, we'll write this report," and it just goes into Confluence to die, basically. That's not great; you need to choose at least five things that you want to fix from it.
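The hypothesis / experiment / results loop can be sketched as a tiny harness. This is a minimal illustration, not a real chaos tool: the steady-state thresholds, the fault-injection hooks, and the simulated system below are all invented for the example.

```python
def steady_state(error_rate: float, p99_latency_ms: float) -> bool:
    """Our hypothesis: under normal conditions, errors stay below 1% and
    p99 latency below 500 ms (thresholds made up for illustration)."""
    return error_rate < 0.01 and p99_latency_ms < 500

def run_experiment(measure, inject_fault, remove_fault) -> str:
    """Minimal experiment loop: verify steady state, inject the fault,
    measure again, always clean up, and report what we learned."""
    if not steady_state(*measure()):
        return "aborted: system not in steady state, nothing to learn safely"
    inject_fault()
    try:
        held = steady_state(*measure())
    finally:
        remove_fault()  # always roll back, even if measurement blew up
    return "hypothesis held" if held else "weakness found: steady state broken"

# Simulated system: fault injection bumps the error rate.
state = {"faulty": False}
def measure():
    return (0.05 if state["faulty"] else 0.002, 120.0)
def inject():
    state["faulty"] = True
def remove():
    state["faulty"] = False

print(run_experiment(measure, inject, remove))
# weakness found: steady state broken
```

The "weakness found" result is the valuable output: it is the input to the "choose at least five things to fix" step, rather than a report left to die in Confluence.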

The way that we started at Condé Nast, we didn't start with tooling. The first thing we did was ask, "Ok, what do we roughly want this process to look like? How are we going to get ops involved?" Liz made a really good point yesterday: where are your stakeholders? We involved all the people in product development. That includes project managers, product managers, delivery managers, finance, and editorial as well, and we tried to get them in a room, at least the first couple of times, to understand why they were doing this. They got really excited, by the way, because we don't often involve them, but also they were, "Oh, this is really cool, you're finding weaknesses in your system." They got really involved, and they also helped us create some of these scenarios as well.

You have to think, in a company like ours, which is a lot like the "FT", it's a publishing company, it's editorial, and we build a lot of tools for them. Also, they might say, "Well, have you tried this scenario?" Same for commercial as well, because we have to run campaigns everywhere around the world, and a lot of our money still comes from digital advertising. A lot of the things that we did were role play. We did a lot of role play to start with, also just to see if we could follow a process.

One thing that was really interesting was, in our first game day, we had an incident commander, and he was trying to run through the process that we'd set up. Then we had a couple of engineers who were on call, so it was escalating through the path. The first thing that somebody said, an SRE no less, was, "Well, why did that person have those permissions to production? Somebody should fire them." I was, "Wow, this isn't the culture I want to create." I was, "No, stop. Nobody is getting fired for anything. Blameless culture." It was really interesting to go through that and highlight these things. It's not about blaming people, it's about trying to understand, "How did we create the conditions that led to these failures?"

We ran through that a lot until we got really adept; it was almost second nature, running through the process. An incident commander basically runs the whole thing end to end when a scenario is happening. They give other people roles to carry out in order to get back to a recovery point, essentially. They usually write the incident analysis and postmortems as well. Are you doing this yet? Because this is a big step. Even getting to this point means that you have to have the kind of culture where you're working really heavily with your stakeholders, but also with operations.

There are some low-effort wins. It might have been Tammy Butow, she's a principal engineer at Gremlin, and I think this might be her diagram, where she says, "What? Just use a whiteboard, just write some stuff down on a whiteboard. Maybe we'll just try this. Here's roughly the systems diagram that we think we have, and we'll take this replica part of our database out and see what happens." Just start there, it'll teach you a lot.

We also use some of this for onboarding for on-call; I know there are mixed opinions about that. A lot of engineers are very scared about things breaking, and that's for good reason. A lot of people in my team come from software engineering backgrounds, in fact, all of them do, so not all of them are used to being on a rota, basically, to being called out. We wanted to try and teach them within hours how to get used to that, how to get used to things breaking. I caveat that by saying that we can only contrive certain breakages, but at least they get used to the pressure of having to fix something in the moment, or at least bringing it back to a recovery point, and then we can go back to bed, which is what Nicky talked about. Definitely a big fan of that: "No, it's low severity, just go back to bed."

There's also a great talk from Adrian Cockcroft, I think is his name; he's one of the VPs of architecture at AWS. This is a great diagram, by the way: he talks about different tooling you can use at different layers of your stack. It goes all the way up to people as well, using game days, which is what I talked about, but also using the Simian Army. That came out of Netflix; I think they've recently updated Chaos Monkey to version two, because for a long time it had gone quite stale, relying on the open source community to keep it updated. I think they've put some more effort into it.

There's also things like chaostoolkit, which is what Russ Miles was talking about earlier; that's from the company that he owns. There's a lot of APIs there around how to run experiments, following the chain shown before: the hypothesis, being able to set your steady state. It's also got drivers for things like Kubernetes, which is something that we use heavily where I work too.
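To make that concrete, a chaostoolkit experiment is declared as a JSON file with a steady-state hypothesis and a method; the sketch below uses the chaostoolkit-kubernetes driver, but the title, URL, and label selector are hypothetical.

```json
{
  "title": "Front end stays up when a replica pod is lost",
  "description": "Hypothesis: losing one pod does not break the site.",
  "steady-state-hypothesis": {
    "title": "Site responds with 200",
    "probes": [
      {
        "type": "probe",
        "name": "site-must-respond",
        "tolerance": 200,
        "provider": {
          "type": "http",
          "url": "https://example.com/healthcheck"
        }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "terminate-one-pod",
      "provider": {
        "type": "python",
        "module": "chaosk8s.pod.actions",
        "func": "terminate_pods",
        "arguments": { "label_selector": "app=frontend", "qty": 1 }
      }
    }
  ]
}
```

The toolkit verifies the steady-state hypothesis, runs the method, then verifies the hypothesis again, which is exactly the loop described above.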

Then there's things like ChAP. I'm not sure if that's open source; I've tried to find it on GitHub but couldn't. I think Netflix are coming out with some great articles about some of the tooling that they've built, but not necessarily open sourcing it like they used to, which is a shame.

One thing that we started using where I work is Gremlin. They provide lots of APIs around breaking different layers, including infrastructure and networking. They even have something now which I find quite cool, because it's something I've been caught out by many times in my career: certificate expiry. This happens all the time, and it can cause big problems. They also now have application-level fault injection. That's really cool; for us it's a bit of a shame because they don't have it for Node yet, and we use a JavaScript stack, but they do have it for a lot of other languages like Python, Java, and Go. I'd say definitely check out Gremlin, by the way; it's getting really quite advanced.

This is from Tammy Butow; she talks about what you can do at different layers, essentially. Talking about people and practices, but where can you automate your experiments? There are ways to do this now at all these layers: the application level, the platform level, and the infrastructure layer as well.

This is just a basic screengrab of the Gremlin UI. You can see there are different types of outages that you can cause in your system: inbound HTTP traffic, and the stuff around Kubernetes as well; it's got really good integration with Kubernetes. Our stack is heavily based on Kubernetes, so there are ways to take down your clusters, take down specific nodes, and see how your system should recover from that. Taking down entire regions is something you can do as well, as is doing things like gray outages, like request saturation.

Something that I mentioned is that the things of the past are still relevant today. People probably cringe when I say the word ITIL, but I think it's still relevant. One thing that we did as a starting point was to whiteboard a very basic DR (disaster recovery) plan. Looking at one of our critical systems: what recovery times do we want to expect? What are the threats, and the severity of those threats? How do we prevent them? It's a bit hard to do that all the time. What's our response strategy? That could be anything from the way that we recover systems to the way that we communicate to people around the business, and to our customers too.
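A whiteboard-level DR entry of the kind described might capture something like the sketch below; every name and value here is hypothetical, just illustrating the shape (threats, severity, prevention, response, and recovery objectives).

```yaml
# Sketch of one whiteboard DR plan entry; all values are hypothetical.
system: cms-api
criticality: high
threats:
  - name: region outage
    severity: high
    prevention: multi-AZ deployment, regularly tested failover
    response: promote standby region, update DNS
rto: 30m   # recovery time objective: how long until service is restored
rpo: 5m    # recovery point objective: how much data loss is acceptable
comms: status page, plus the editorial and commercial channels
```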


Last topic: observability, the partner to the chaos side. Observability is a lot of things: it's logging, tracing, metrics, monitoring. I think tracing is the poor cousin of observability; we still have a long way to go as an industry to get tracing right, and I think tracing is one of the most key aspects. The quote here, I think it was Nicky who talked about Cindy Sridharan; she's amazing, by the way, anything about observability or just systems engineering, she's so fantastic. Go away and follow her on Twitter. She wrote this really interesting post about "What is tracing? What do you expect to find from it?" It's about following these execution paths through our code, trying to work out, "What's the latency? Where are the dependencies that we didn't know about?" As you can see here, it's a bit like that "Choose Your Own Adventure"; it's not that dissimilar. You might find execution paths you didn't even know existed and you're, "Why is it even doing that?" You can go away and investigate these things.
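To make the idea concrete, here is a toy sketch of what a trace records: spans with names, parent links, and durations, forming the execution path through services. This is not a real tracing library; the span names are invented.

```javascript
// Toy illustration of a trace: each span records a name, its parent,
// and how long it took. Real tracers also propagate IDs across services.
const spans = [];

function span(name, parent, fn) {
  const s = { name, parent, start: Date.now() };
  const result = fn();                 // run the traced work
  s.durationMs = Date.now() - s.start; // where did the time go?
  spans.push(s);
  return result;
}

// A request that fans out through two "services":
span("GET /healthcheck", null, () =>
  span("graphql.query", "GET /healthcheck", () =>
    span("db.lookup", "graphql.query", () => "ok")
  )
);

// The parent links reveal the execution path, including hops
// you might not have known about.
console.log(spans.map(s => `${s.parent || "root"} -> ${s.name}`));
```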

Going back to Datadog, they have really nice graphs that we can utilize; there are hundreds of them. There's a point about that as well, information overload, but you can follow things through really nicely now. We're looking here at the network errors for our Kubernetes clusters: what are the network-level errors? We can follow that through; in front of that we have something called Traefik, not exactly sure how it's pronounced, but it's basically a reverse proxy or ingress controller for Kubernetes. We can start to see, "Well, hold on a second, this looks a bit weird, this looks like we might have a memory leak," which is quite a common problem that we find, both within these kinds of services and within Node. We get a lot of memory leaks through Node, and they're a lot harder to debug and interrogate than in other languages like Java. Datadog is really clever for this.

We can also see things like the p95: what's the average? What's the max? Following this through, like an investigation. Then one of the most recent additions they put in Datadog was the flame graph. This is more around tracing that you can almost get out of the box: what does the request life cycle look like? Was it hitting a service I didn't know about? How long was that service taking to respond before passing the request off to the next part? Here you can see we're looking at the inbound request on our health check, and it's going through our GraphQL layer. I was thinking, "Why is it doing that?" It's interesting to look at things and ask, "Why is that happening?"
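Those dashboard numbers come straight from raw request timings. As a sketch, here is how p95, average, and max fall out of a latency sample; note how a single slow outlier shows up in the max but can hide below the p95.

```javascript
// Compute the latency numbers a dashboard shows (p95, average, max)
// from raw per-request timings in milliseconds.
function latencyStats(samplesMs) {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const p95Index = Math.ceil(sorted.length * 0.95) - 1;
  return {
    p95: sorted[p95Index],
    avg: sorted.reduce((sum, x) => sum + x, 0) / sorted.length,
    max: sorted[sorted.length - 1],
  };
}

// 20 requests: 19 fast ones and one slow outlier. The outlier drags the
// average up and dominates the max, but the p95 still reads as "fast".
const samples = [...Array(19).fill(20), 900];
console.log(latencyStats(samples)); // { p95: 20, avg: 64, max: 900 }
```

This is why looking at a single percentile can be misleading; you want the p95, the average, and the max side by side, which is what the Datadog view gives you.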

The last point: I think Charity Majors has been quoted a lot here at this conference, but she's really insightful. She said, "I've begun to see the inexorable sprawl of alerts, monitoring checks, and dashboards as a deep well of technical debt." I'll give you an example. Back when I was at BBC News, I was working on some of their really high-profile elections projects. We would get 60 million people coming to our site in a 5-hour window; sometimes we'd have millions of people hitting the site at the same time. We did well because it's a publishing platform, and caches and CDNs are great for that, but the thing was, I created some really basic graphs of the things I felt were the most important to be looking at.

Then another team, which was writing their own Go microservices, basically produced these graphs that looked pretty cool, but I was, "I don't know where to look." The management came down and they were, "Yes, these graphs are really amazing." Then when the events started unfolding, I saw their smiles start to droop because they were, "I don't know what to look at. I don't know how to make sense of this." Eventually one of them told me afterwards, "At first I thought your graphs were a bit basic." I was even using things like HTML frames to throw them in there; I was, "I don't care." He said they were the best ones because there were only four things to look at, just four things: the server saturation, how many HTTP requests we were getting per minute, and what the error rate was. It was really easy for him to understand, so be careful with that too.
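The "just four things" idea boils down to deriving a few key numbers from raw data instead of charting everything. Here is a sketch; the log entry shape is invented for illustration.

```javascript
// Derive two of the "basic graph" numbers (requests per minute and
// error rate) from raw access-log entries. The entry format is made up.
function summarize(entries, windowMinutes) {
  const errors = entries.filter(e => e.status >= 500).length;
  return {
    requestsPerMinute: entries.length / windowMinutes,
    errorRatePercent: (100 * errors) / entries.length,
  };
}

// Four requests over a two-minute window, one of them a server error.
const log = [
  { path: "/", status: 200 },
  { path: "/results", status: 200 },
  { path: "/results", status: 503 },
  { path: "/", status: 200 },
];
console.log(summarize(log, 2)); // { requestsPerMinute: 2, errorRatePercent: 25 }
```

Two numbers a human can read at a glance under pressure beat a wall of clever charts nobody can interpret during an incident.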


Recap: what did we talk about? DevOps culture, so the topologies, the values, the behaviors; gathering data on your current practices, which is about the perceived versus the actual, using hard data, because we might think the team dynamics are different from what's really going on; and getting adoption and sponsorship. We talked a little bit about what resilience engineering is; this is a massive topic, so I would encourage you to go and read about it. Chaos engineering: how do you start? What is it? What are some of the tools out there that you can use today? Also the importance of pairing observability with chaos engineering, because without the observability part there's not really much reason to do it.




Recorded at:

Aug 04, 2019