
When Everything Goes Wrong


Summary

Colin Humphreys takes a look at just how bad life can get, and what we can learn, when our plan isn't reality, our team isn't a team, and our users are furious.

Bio

Colin Humphreys is CTO for Cloud at Pivotal. He is responsible for the big picture strategy and roadmap for Pivotal's cloud platform offerings. He joined Pivotal from its acquisition of CloudCredo, where he was co-founder and CEO. He led the installation of the first SLA-driven production Cloud Foundry deployment, and organises the London PaaS User Group.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Humphreys: I'd love to talk to all of you about the experiences you've had when things have gone wrong. I'm going to start off telling you a brief little story about how I ended up standing on a stage here before you. I was having a chat about maybe three months ago with one of the conference organizers. She said to me, we're thinking about organizing this track called, when things go wrong. I said, "I love those tracks. I really do." Don't you love a sense of honesty, when people stand up in front of you that aren't talking about this amazing new tech they've invented, they're talking to you about how the amazing tech they invented went catastrophically wrong in production, and all the things that really happen in our lives. I said, I really love those tracks. Things go wrong all the time. It's the nature of us as humans that things go wrong, and then we learn and we adapt to things going wrong. This is all completely normal.

Pivotal

At Pivotal, we help our customers with forming balanced teams that own their outcomes, so they can learn, iterate, and adapt towards better. We help them with small batches and fast feedback, so that if things do go wrong, they learn quickly about where they go wrong, and they can then improve. We help them build resilient systems, so when unpredictable things happen, the system can adapt. It can flex. It can cope with the change. In saying all of these things, I suddenly remembered there was a time, a long time ago when none of those things were in place when I was working on a particular project. I came out in a cold sweat. I started shaking and thinking about this one project. I said, "There was this time," and I gave a few brief thoughts about this really bad set of events that had happened. Then I thought nothing of it for a couple of months. Then the organizers reached out to me and said, would you like to talk on this track? I said, no, because it was that bad. I don't really want to relive it. I think what happened in those two months was that they'd reached out to as many people as they possibly could and said, do you have anything worse than this? Everyone had said, "No, we don't. That sounds terrible." Thus I stand here before you.

Monopoly

This project has just had its 10-year anniversary. You may be thinking, why am I talking about something that's 10 years old? It's that for the entire period between then and now it has been too soon and too raw to talk about it. I invite you to come back in time with me, to 2009 when this project took place. Who has heard of Monopoly? This is the total sample size of the number of people who are willing to raise their hands when asked to do so, because you've all heard of Monopoly. That is your sample size of 100%.

Monopoly looks a bit like this. What happens is, you move around the board, and you buy property, and you build on that property. Other players move around the board. If they land on your property, they have to pay you rent. If you land on their property, you have to pay them rent. Thus the game continues until everyone has either gone bankrupt or one player has established a Monopoly. Everyone get the basic concept of Monopoly. The problem with Monopoly is that it really hasn't changed very much since it was invented over a century ago. The original board looked like this. It looks actually very similar to the current board. There's a free parking down there. There's properties. It's actually really not changed very much. It was invented, ironically enough, by an American anti-monopolist to show the problems with monopolies. If you have the challenge while you're trying to sell more games of Monopoly, really, you've got two choices. Your first choice is to innovate with the concept of Monopoly, maybe introduce some other ideas into it to try and change it and make it better. Your second choice is to raise awareness of the game itself and get people thinking, "I haven't heard about Monopoly for a while. Maybe I'll go and buy an edition of Monopoly."

Inventing a New Version of Monopoly

Let's explore those two choices. For the first one, we're going to think about, how can I invent a new version of Monopoly? To do that, I'm going to present to you my top five, what were they thinking editions of Monopoly? Proving conclusively that no one has a Monopoly on stupidity. At number five, we have this Sun-Maid raisins version of Monopoly. There is no link at all between raisins and Monopoly, thus this board exists. Number four, we have the QVC shopping version of Monopoly. Actually, I quite like this one because it very nicely integrates two themes that are based on capitalism, monopoly and shopping, so those two come together quite nicely. At number three, we have joint number three, the Bass Fishing edition of Monopoly, and the Monopoly Horse Lovers' edition. Neither of these obviously have anything to do with Monopoly, but I like to call it the surf and turf Monopoly. At number two, we have the One Direction version of Monopoly. This only makes sense if your one direction is, do not pass Go, do not collect £200, and go straight to jail. At number one, for the most ridiculous version of Monopoly, we have the Swindon edition. What's interesting about Swindon is it has more roundabouts per capita than any other place on earth. What's not interesting about Swindon is Swindon.

Raising Awareness of the Game

We know, new mashup, new interesting versions of Monopoly, they're not going to sell it, so we need to try and raise awareness of the core game itself. What do you do if you want to raise awareness of a concept? You speak to an advertising agency. Who knows what this is? This is "Mad Men." This is full of really powerful people, smartly dressed, doing amazing things. I used to work in an ad agency and it is nothing like this. This is absolutely nothing like it. It's far more like this. This is "Nathan Barley." This was a comedy. Nathan Barley described himself as a self-facilitating media node. I think he's a lot like a lot of people that work in advertising agencies. What happens is, effectively, there's two groups in ad agencies, there's the creatives who come up with the ideas, and there's the people that have to make those ideas happen. I worked in that second group, after actually making the ideas become a reality.

The creatives in the ad agency, they would come to us, and they would say, we've got this idea. Could you please make this real? To help you understand this dynamic, I'll give you a couple of examples of things that happened. The first one, we were working closely with Volkswagen. They're one of our customers. The creatives had this idea. We were launching a new edition of a car. They said to us, could you please make a hologrammatic car come out of the monitor and hand it to the user so they can have a look at it? We had to explain why that wasn't possible.

The second idea that they brought to us was for a well-known bleach brand. The idea here was that they would have a campaign called cleaning up the web. They wanted us as the tech team, to go on to the dark web, to hack into servers hosting naughty images. To download those images, to cover them, any of the naughty bits, with the logo of the bleach brand. Then to re-upload them so that anybody surfing for those kinds of images would not see the bad content and instead would see the bleach brand cleaning up the web. Actually, at the time, we didn't even know where to start on the things that are wrong with that. The two that I can remember saying, firstly, those servers are largely operated by the mafia, and they will come for us and kill us if we do that. Secondly, what is it you're saying about the customer base of the bleach brand that makes you think they're surfing on those servers?

Monopoly City Streets

This ideas factory within the ad agency, they came up with the idea for this thing, this project I'm going to take you through. Because in reality, this is all about what happens when an ad agency meets Monopoly. That project looked like this. This is the Wikipedia page for Monopoly City Streets. This is actually a fairly straightforward idea for an online game. Imagine that we want to take the entire world, every single person, I want to allow them to play an online version of Monopoly that has every street in the world as part of the board. All of us playing that game, using OpenStreetMap data, and Google Maps for the visualization, and you can buy any street on Earth. You can build on any street on Earth. You and anyone else playing the game will move around those streets, and you will pay rent wherever you land, and people will pay rent to you if they land on your streets. Sounds like quite a cool idea, doesn't it? You can buy any street in the world. You can build on any street in the world. The whole thing is available. Effectively, what I'm describing here, is a massively multiplayer version of Monopoly for the whole world to play. As you can work out, because many of you are technologists here, this is a concurrent, transactional, financial system. All those things should be huge red flags, to all of you, in terms of thinking about actually delivering this.

This was going to be a three-month project for which we would run this game. The sting in the tail here was that the prize if you won it, if you got the highest score, was that you would have your rent or mortgage paid for a year, up to £2000 a month. The prize effectively is £24,000. This is before Brexit when the pound was worth something. This is £24,000, and it's free to play. Which is why if we zoom in just a little bit, we'll see this line here. The game had more than 5 million accounts at the time of its end. Hence me saying ouch. That leads you, if you scroll down to this, which says, since its launch on September 9, 2009, the game has had severe web server problems due to the huge number of people trying to access the website and create accounts at once. Then basically says, it's been a terrible experience for everybody involved.

I'm going to take you through a brief timeline of the issues that led to this Wikipedia page. Let's play a bit of blame game because we all love blame game. Firstly, how much money do you think was allocated for actually running this game, a game of this scale? Effectively, when we were costing this up, we thought, it's going to be a lot, not so much zero. We actually ended up borrowing £30,000 from the marketing budget for the game itself in order to run it. This game had a £150,000 budget just to advertise that it existed, and yet zero had been allocated for actually running it. The reason we had a zero budget was because the project manager believed that when a site is live, it is alive and would not go offline, unless somebody killed it. Why do we need to pay money because we don't want to kill it? It's just live. Once we had a conversation about that, the £30,000 was found. This didn't seem too bad because we were predicted to get 20,000 users over the duration of the game in a nice even spread of load. We all know the internet loves evenly spreading load across nice long time periods. We cobbled together two API servers to run the game. We had two database servers, active-passive. We ran some basic tests, and we thought, we're good to go with this.

Launch Day

We get to launch day. We're thinking we're getting 20,000 users over the course of 3 months. Hold that in mind. We come to launch day. What we have is we have a nice publicity page for this site, running on a content delivery network, so users are going to the site and they're seeing, it's just about to launch. We've driven huge numbers of users to the site. We've got all this marketing and advertising out there pushing people to the site. Project manager says, this is all going well, everything's great. It looks good. Game looks good. Let's go live. I flick the switch to go from the CDN and make the site live. About five seconds later, I try and go and have a look at the game. I can't get to anything. Everyone says, we just can't get to anything. I say, that's strange. I wonder what's happened. This had happened. Everything was offline. I tried to log into one of the API servers over SSH, I can't get on. I try again to log in. I try for about five minutes to get in, I finally get in to one of the API servers to log in. The first thing I do is run top, because that's what you do to see what's happening with the server, it has a load average of 480.

What happens is, the load average effectively describes the number of things that are runnable on the servers, the number of processes that want to actually execute on the server. Then you have effectively the number of cores or processors that can run things on the server. What you don't want to have happen is your load average number to go above the number of processors or cores, because that means more things want to run that can't be run, and that will go into a very nasty queuing effect, like building up and building up. These servers had 2 cores, and a load average of 480. The only reason they had a load average as low as 480 was because they didn't have enough power left to compute their own load average properly. Everything's falling over.
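To make the load average comparison concrete, here is a minimal Python sketch. The function names are illustrative and not from the talk; it simply divides the load average by the core count, following the simplified rule of thumb described above (on Linux the load average also counts tasks waiting on I/O, so treat the ratio as a rough signal rather than an exact measure).

```python
import os


def load_per_core(load_average: float, cores: int) -> float:
    """Return the load average divided by the number of cores.

    A value above 1.0 suggests more work wants to run than the
    machine can execute at once, so a queue is building up.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_average / cores


def is_overloaded(load_average: float, cores: int) -> bool:
    """Apply the rule of thumb from the talk: load above core count is bad."""
    return load_per_core(load_average, cores) > 1.0


def current_host_overloaded() -> bool:
    """Check the live one-minute load average (Unix-like systems only)."""
    one_minute, _five, _fifteen = os.getloadavg()
    return is_overloaded(one_minute, os.cpu_count() or 1)


# Figures quoted in the talk: 2 cores and a load average of 480.
launch_day_ratio = load_per_core(480, 2)
```

With the talk's figures, each core had roughly 240 runnable tasks competing for it, which is why even an SSH login took minutes.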

Obviously, I'm a member of the operations team. I'm trying to run this system. I call up the lead developer, to ask, what are we going to do here? To have that conversation we need to have. I have to call up the lead developer, because they're sat in a different building to me. Why would you have your development people and your operations people sat in the same building? They never need to talk. I tried to call this person up. Unfortunately, the project manager answered the phone. The developer didn't answer their phone. I said, can I speak to Paul, please? Because obviously, we've got dramatic problems here. We need to fix this. Can I have a chat with Paul to work out what's going on? Project manager said to me, Paul can't come to the phone right now, he's gone for a long run, and he won't be coming back. He was out. Why had he gone for a long run? Why was this such a catastrophic thing?

This is, I believe, the front page at the time of Marketing Magazine's website. This is the most influential publication in the marketing and advertising world, "Hasbro's Monopoly play in meltdown, as 1.7 million people fail to access game." From our stats, we could track 1.7 million people had tried to play it, which is slightly more than 20,000. This carries on. This is PC Magazine, recognize the branding, a fairly big publication, "1.7 million people try to access Monopoly City Streets." I like the strapline here. "Attracting 1.7 million people to a site in a month isn't bad. Luring them all there on a single day, superlative, but failing to add the server capacity to allow them to actually view the page, as the kids say these days: FAIL." I like this rich simplicity. I just like, "Monopoly City Streets, EPIC FAIL," with Epic fail all in caps. That sums it up.
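As a rough back-of-the-envelope check on how far the forecast was from reality, the arithmetic can be sketched in Python. This assumes "three months" means 90 days and takes the PC Magazine quote at face value that the 1.7 million arrived on a single day; both are simplifications, not figures from the project itself.

```python
FORECAST_USERS = 20_000    # predicted total over the whole campaign
CAMPAIGN_DAYS = 90         # assumption: three months taken as 90 days
ACTUAL_USERS = 1_700_000   # people who tried to play, per the press coverage

# Forecast arrivals per day if load really were spread evenly.
forecast_per_day = FORECAST_USERS / CAMPAIGN_DAYS

# How many times larger the actual audience was than the total forecast.
total_ratio = ACTUAL_USERS / FORECAST_USERS

# How many times larger launch-day demand was than a forecast "average day".
launch_day_ratio = ACTUAL_USERS / forecast_per_day
```

Under these assumptions the audience was 85 times the total forecast, and launch-day demand was several thousand times what an evenly spread day would have looked like.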

We have a conversation with the customer who's paying us to build the game, because this is bad. We secure more budget because they're saying, loads of people are trying to play this game so you can have some more budget to make it run. I'm like, "Yes." I go into superhero mode, as you do at times of trouble. I start up 400 API servers, because the cloud is fun, isn't it? You can start lots of servers. I start up 400 API servers and I configure them. I start up, with the help of the team, 64 bare metal database servers. These were 32 cores each, 128 gig of RAM each. I put a custom version of MySQL on each of them that I wrote during this reconfiguration period. I hacked MySQL to remove all of the concurrency and consistency safety to make it faster. We had 512 4-gigabyte Memcached buckets running across these servers. This poses a logical problem here. If you've gone from having a single database server to suddenly having 64 database servers, how would you get your application to talk to 64 database servers over which your data is now partitioned? This is a fairly tough challenge. You've got to imagine that you've got 1.7 million people trying to play this game while you're doing this, who are failing to play it. How do you help all of those 400 API servers talk to the 64 database servers? How do you make this happen? The simultaneously right and wrong answer is you write a sharded object relational mapping layer in PHP, while this is all going on. I'm not a PHP developer but I tried to do this. I thought to myself, how hard can it be?
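The talk does not describe how the sharded mapping layer chose a database server, so the following is only a sketch of one common approach: hash a key and take it modulo the number of shards. The function names, the choice of player ID as the shard key, and the host naming scheme are all hypothetical.

```python
import hashlib

NUM_SHARDS = 64  # the talk's 64 database servers


def shard_for(player_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a player ID to a shard index in a stable, evenly spread way."""
    # A cryptographic hash is used only for its uniform spread; unlike
    # Python's built-in hash(), it gives the same answer in every process.
    digest = hashlib.sha256(player_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def database_host_for(player_id: str) -> str:
    """Return a hypothetical host name for the shard holding this player."""
    return f"db-{shard_for(player_id):02d}.example.internal"
```

Every API server that runs the same function agrees on where a player's data lives without any coordination. The trade-offs are that changing the shard count remaps most keys, and that any operation touching two players, such as one paying rent to another, may now span two servers that share no transaction.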

The PHPrinciple of Most Surprise

We're going to play a game to find out exactly how hard it can be. Even if you're not a developer, or you don't know any PHP, like me, this should be fairly straightforward. Because we're going to try and help you all go into the mindset here, think you've been awake for over 24 hours. You've got to make this work. You're panicked. You're by yourself. You're going to make this happen. You're going to make this game work. You don't know PHP. That's going to be ok, because programming languages follow the principle of least surprise. Everything is going to be ok. We're going to play this little game here, it's called the PHPrinciple of most surprise. I'm going to show you some PHP. The thing I want you to do for me, is to tell me whether or not you think the statement that's going to be on the screen here evaluates to true or false. Because the reason we need to know this, if you're going to write something, some logic in a programming language, a fundamental construct for any of the logic you employ, is going to have to be if true, and if false, do something else. You have to understand the concepts of true or false. They have to be unsurprising to you. If you won't be able to do that, you won't be able to write a sharded object relational mapping layer, for example. Remember here, no sleep, existing purely on caffeine. You've got to get this thing to work. It's going to be straightforward. PHP, like any language is unsurprising.

Very simple to start with. False, who thinks this is true? Who thinks this is false? It's obviously false, because languages follow the principle of least surprise. No language is a trap, is it? Next up, "false" as a string. Who thinks false as a string evaluates to true? Who thinks false as a string evaluates to false? It's true. This is unsurprising. Some of you look a little bit surprised. That's ok because PHP is right. The number 0. Who thinks this is going to be true? Who thinks this is going to be false? This is false. It's obvious. It's 0. "0" as a string. Who thinks this is going to be true? Who thinks this is going to be false? It's false. Zero without the quotes was false. This was obviously going to be false. PHP is right. "0.0" as a string. Who thinks this is true? Who thinks this is false? Remember, "0" as a string was false, therefore, this is obviously going to be true. This is unsurprising, I think.

An array with zero elements. Who thinks this is going to be true? Who thinks it's going to be false? It's false. An object with zero member variables? True? False? It depends on the version of PHP you're using. This is completely unsurprising. This is exactly how it should be. Thank you for joining me on that journey of what it feels like to write some PHP in a production environment with a couple of million people trying to play your game.
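The answers in the game follow PHP's documented rules for converting a value to a boolean. As a summary, here is a small Python model of those rules as they apply from PHP 5 onwards. It is an approximation using Python stand-ins (a list or dict for a PHP array, `None` for `NULL`), not PHP itself, and `php_truthy` is a made-up name.

```python
def php_truthy(value) -> bool:
    """Approximate PHP 5+ boolean conversion using Python stand-in values."""
    # NULL and boolean false are false.
    if value is None or value is False:
        return False
    # Integer 0 and float 0.0 are false; every other number is true.
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return value != 0
    # Only the empty string and the exact string "0" are false,
    # which is why "0.0" and "false" both come out true.
    if isinstance(value, str):
        return value not in ("", "0")
    # An array with zero elements is false.
    if isinstance(value, (list, dict)):
        return len(value) > 0
    # Everything else is true, including an object with no member
    # variables (PHP 4 treated such an object as false).
    return True


answers = {
    "false": php_truthy(False),
    '"false"': php_truthy("false"),
    "0": php_truthy(0),
    '"0"': php_truthy("0"),
    '"0.0"': php_truthy("0.0"),
    "empty array": php_truthy([]),
    "empty object": php_truthy(object()),
}
```

The surprising cases all come from the string rule: the comparison is against the literal text `"0"`, not against the string's numeric value.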

Liftoff

We finally get this working. We finally get the 400 API servers talking to the 64 database servers. We get the whole thing up. We have liftoff. I realize some of you may have lost your mind in this PHP journey that we've been on. We've now got the big online game of Monopoly working, our 2 million users are using this. This is about 48 hours after we attempted to launch. We have a working game, people can play it, but then comes the return of MySQL. In fact, fairly early in the game's development, I had to have a chat with the developers and I said, you need to take care of transactions in the application layer. Then they'd Googled MySQL, and they'd seen that MySQL gives them the transaction guarantees they need, so they just let MySQL take care of it. That was true when we had one server. Then we had 64 servers and a hacked version of MySQL, and that was no longer true. Money was just coming and going at random in the game. We'd completely removed any consistency guarantee. I'm sure many of you have heard of ACID. Money was just absolutely all over the place. In fact, once we got this to a better state, so we still had good performance but we had some consistency, we had to reset the entire game's money back to zero, because it was just chaos in terms of what was happening. About a week in, we had to reset everything, which is not a great experience for the users.
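To illustrate the guarantee that was lost, here is a minimal sketch of a rent payment as a single-database transaction. It uses Python's built-in `sqlite3` module as a stand-in for one database server; the table, player names, and functions are invented for illustration and are not the game's code.

```python
import sqlite3


def open_bank() -> sqlite3.Connection:
    """Create an in-memory database with two players and 100 units of money."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        " player TEXT PRIMARY KEY,"
        " balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)",
        [("tenant", 100), ("landlord", 0)],
    )
    conn.commit()
    return conn


def pay_rent(conn: sqlite3.Connection, payer: str, payee: str, amount: int) -> bool:
    """Move money between two accounts atomically; return False if it fails."""
    try:
        # The connection context manager commits if the block succeeds
        # and rolls back every statement in it if an exception is raised.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE player = ?",
                (amount, payee),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE player = ?",
                (amount, payer),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit was undone too.
        return False


def total_money(conn: sqlite3.Connection) -> int:
    """Sum all balances; a consistent system keeps this constant."""
    return conn.execute("SELECT SUM(balance) FROM accounts").fetchone()[0]
```

On one server, the failed payment leaves the total unchanged because the credit and the debit succeed or fail together. Once the payer and payee live on different shards, no single database transaction covers both updates, which is why the speaker had asked for transactions to be handled in the application layer.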

Working Long Shifts

During this launch process, I spent 54 hours in the office without leaving, without sleep, working straight through this. Has anyone ever worked a longer shift? People brought me food. People brought me drinks. People were just taking care of me. I just sat there and plugged my way through it. This was to go from trying to launch through to the game actually being playable. Over the course of the three months, I averaged over 100 hours a week, every week for those 3 months. It nearly killed me. There were so many bugs, so many data issues. It was really bad. No one else was really involved in terms of anything past the code being written by the developers and committed. I was taking it from there and trying to make everything work. I was a broken person. I was terrible to work with throughout that entire period. The CEO of the organization I was working for, he'd had a sabbatical for two months of this three-month period. He came back for the third month. We'd made quite a lot of money because we'd bought all this extra infrastructure and made margin on it, so he bought himself a new Porsche. That was great.

Retrospective

Let's recap here, just in case the PHP, again, caused you to lose your mind. We were building a massively multiplayer online game version of Monopoly to sell more copies of Monopoly. It had 1.7 million players, and a lot of people tried to play it, and it failed. It was an epic fail. The question I like to ask myself is, what can we learn? What can we learn from this? What conclusions can we draw, so it doesn't happen again?

What I want to run through with you here is a retrospective to think about what we can learn and how we can do better next time. At Pivotal, we hypothesize that, if you took away everything that we've built over the 30 or so years we've been running as a company, if you took away all of our practices, all of our organizational knowledge, we could rebuild everything from scratch with just the retrospective. Because in theory, if you are continuously experimenting, continuously iterating, learning and adapting, no matter how wrong you start, you will end up going in the right direction. I think stupid is doing the same thing twice and expecting different outcomes. As long as you are learning and you're adapting over time, you're going to be ok. I once worked with a customer who had spent £30 million on their Agile transformation. They removed the retrospective, because they said, if we spend £30 million on an Agile transformation, no one's changing anything.

Key Takeaways - Balanced Team

I want to try and help you think about some things you can take away from this, so you don't fall into the same traps. The first one of those is the concept of a balanced team. A balanced team in which the team owns the outcomes and you feel a joint shared sense of responsibility. I think we could have avoided a lot of the problems I've spoken about. For example, myself, the project manager, the developers, we were all separated by organizational boundaries. I had this one-off conversation with the project manager. I called the developer to have this. We weren't constantly communicating. We weren't working as a team. You'll often find, if you have these organizational boundaries, that the sum of the parts does not equal a coherent whole. If I'd spoken to the project manager earlier, for example, I could have said, "We don't have enough budget to actually put a resilient system in place here." We could have done a lot better as a team. I was just fixing someone else's PHP, which as we've all experienced now, is a hellish nightmare. I'm asking you to form a team, if you're not already in a team. Communicate with your team. Spend time with them. Learn about how you work together. As a team, own your outcomes. Take responsibility for what you're delivering.

Saying that, I learned on this project that heroism is not equal to success. I worked amazingly hard. I'd like to think I built some amazing technologies over the course of this project. It didn't actually make the project successful. It was an epic failure. I was not a team player on this. I went for the superhero, "I'm going to do it all myself, 54 hours, all this stuff." It didn't result in success. Actually, this was really bad for everyone else that was around me in terms of work because I was so grumpy and so terrible to work with for this period. I spoke to no one else apart from issuing occasional threats. I was a really bad person. Don't be me. Rather than reaching out for help and support, I pushed other people away. Don't be me. Build that team. Rely on that team. Support each other. Don't be the one-off superhero and push everyone else away, because it doesn't breed success.

Fast Feedback

Next point I want to highlight, fast feedback. We should have tried with a small batch of users first, got fast feedback, realized what was going on with the game, and incrementally added in more functionality and more users to build up our sense of confidence. We should have delivered progressively, because as we learned, this big batch launch equals big problems. We went from some basic tests, I can click through the game, to 1.7 million people. That's a dangerous thing to do, and a huge risk.

When you're building up that risk over time, so we build the game, we add some more to the game, we add some more functionality, we do things, we built up this huge batch of work. We didn't show it to anyone. We had this holding page, "It's going to be amazing," making all these promises, and then we let everyone in in one go, and it was this huge batch. We talk about a big bang release, it was a literal bang as everything went offline. Instead of that we should have been incrementally delivering, reducing that risk, not building up the batch size. We should have learned early and then built up gradually from there. This is why you see things like beta tests. Why we should have had maybe an invite-only mode for a while. We should have gradually ramped things up and understood just the size and the shape of the demand and been better prepared. Whenever you see your organization, your team building up huge batches of work, "I'm going to ship them with a big bang Tada to the world." Do talk to them about this project. Say to them, that is a huge amount of risk, what can we do to derisk? What can we do to learn in a safer environment for us and the work we're doing?
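One way an invite-only mode and a gradual ramp-up might look in code is a small admission gate in front of sign-up. This is a sketch under assumptions, not what the project did: the function names and the idea of bucketing users by a hash of their ID are illustrative.

```python
import hashlib


def rollout_bucket(user_id: str) -> int:
    """Place a user in a stable bucket from 0 to 99 based on their ID."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def is_admitted(
    user_id: str,
    rollout_percent: int,
    invited: frozenset = frozenset(),
) -> bool:
    """Decide whether a user may enter the game right now.

    Invited users always get in. Everyone else gets in once the
    rollout percentage has been raised past their bucket.
    """
    if user_id in invited:
        return True
    return rollout_bucket(user_id) < rollout_percent
```

Starting with `rollout_percent=0` and a short invite list gives the invite-only phase. Raising the percentage in steps then admits more users, and because buckets are stable, nobody who was admitted at 10% is locked out at 50%. Everyone else can keep seeing the holding page while the team watches how the system behaves at each step.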

Resilient Systems

Last one, a bit more of a technical perspective here. I think these things are intertwined, because your people problems always impact your tech and vice versa. Build resilient systems. Build systems in which effectively emergency is normal, in which things are constantly failing, and they can cope with failure. They can cope with high load. They have fallback built into them. I'm not asking you to over-engineer. Given that we were building a game that was going to have a very large prize, was free to enter, and asked the entire world to play Monopoly against each other, it was quite reasonable to expect we might get a lot of load. We didn't prepare for that. The whole thing fell over as its failure mode. I learned about things like circuit breakers, where you have systems that can remain resilient, even if their dependencies are falling over. Designing for failure, chaos engineering. I learned about various kinds of ideas on this project. The silver lining here is that I learned some really tough lessons on this project that were really helpful to me in future years.
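For readers unfamiliar with the circuit breaker idea mentioned above, here is a minimal Python sketch of the general pattern. It is not taken from the project or from any particular library; the class name, thresholds, and injectable clock are illustrative choices.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is refused because the dependency is presumed down."""


class CircuitBreaker:
    """Stop calling a failing dependency for a while instead of piling on."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so tests can control time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: fail fast without touching the dependency.
                raise CircuitOpenError("dependency recently failed")
            # Cool-down has elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each request to a dependency, for example `breaker.call(fetch_balance, player_id)` where `fetch_balance` is a hypothetical function, and catch `CircuitOpenError` to serve a fallback. Failing fast keeps the caller responsive and gives the struggling dependency room to recover rather than adding to its queue.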

An example of that. This is the diagram for the donations platform that ran Comic Relief for about six or seven years. I was able to design and architect this to a very high degree of resilience, based on some of the catastrophic things that happened on the Monopoly game. This system never lost a donation in all its years of use for Comic Relief. Sometimes you have to go through those tough times, and have those mistakes and things have to go wrong, so that you can learn a better way of working. Because I would say to you all that it's ok to fail. It's ok for things to go wrong, so long as you learn. Please take these three things away with you: balanced teams, fast feedback, resilient systems.

Questions and Answers

Participant 1: Why did they actually cancel the Monopoly in the end, because it sounds like that was potentially a success story?

Humphreys: I think a couple of things. Firstly, all of that infrastructure was expensive to run. They did sell a lot of copies of Monopoly on the back of this, but it was an expensive game to run because it had so much infrastructure powering it, was the main thing. A slightly tangential success story. I think there's an open source version of this game, you can run and play with friends, that's available. If you go to the Wikipedia page now, people have taken it, reinvented it, used the OpenStreetMap data, and you can go and have a play. I think you have to host your own server, but it's fairly straightforward to do that. The reason why they stopped the project was because it had so many players on it, and they weren't monetizing it, so it was just costing them a large amount of money. Yes, they were getting publicity out of it, but quite a lot of that publicity was negative publicity, unfortunately. Maybe my fault. It was just a three-month project to drum up some interest and then finish it.

Participant 2: You talk about the stuff that you learned as a result of this, did the agency learn anything?

Humphreys: I'm not sure. I hope. Agencies have really struggled with the transition from print and TV to digital. I think this was one of those tough learning lessons for them. They've definitely gone far more digital and become more digitally aware now. This was early days. I wasn't joking about the project manager who thought, the website's alive, I can see it, why would you stop it? Why do you need a hosting budget? I realize this may seem strange to us now, but these were the ideas, "If I can see it, everyone can see it. It's online," was the idea. I think they have learned, but again, larger organizations generally take more time to learn these lessons, so I think they had a few more challenges later.

 


Recorded at:

Dec 17, 2021
