Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ


Choose your language

InfoQ Homepage Presentations Capacity Planning for Crypto Mania

Capacity Planning for Crypto Mania



Jordan Sitkin and Luke Demi talk about how Coinbase had to deal with the cryptocurrency spikes of 2017 and how the engineers are applying lessons from these experiences to create new tools and techniques for capacity planning to prepare for future waves of cryptocurrency enthusiasm.


Luke Demi is a Software Engineer on the Reliability Team at Coinbase. Jordan Sitkin is a Software Engineer at Coinbase.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Demi: Let's get right into it. If you tried to visit Coinbase in May of 2017, you probably saw a page that looks something like this. That was whether you're trying to check your balance, buy new cryptocurrency, or just get your money off our site, but prices were spiking. People were suddenly really interested in buying cryptocurrency, and our systems were crumbling around us, kind of like QCon's website, am I right? We've been joking, like we can't wait to see the war story from the guys in charge of that.

But anyway, cool. So yes, this is bad for a lot of reasons, but primarily because if you're a customer of Coinbase at this time, you're worried. You're worried that your money's gone, you're worried that Coinbase just got hacked, what's going on? We began to see stories like these come in. Things like- this is terrible read, but “Coinbase goes down with Bitcoin and Ethereum.” Well, pretty true. But they hurt. We've been heads down for years, trying to build this ecosystem, and trying to be the people who are leading this charge. These Reddit comments, particularly, I don't know, were kind of humorous, but it kind of caught the mood of the day. I'm going to read one: "Funny thing is Coinbase supposedly uses AWS, so capacity isn't an issue unless they don't use autoscaling or are too cheap to pay for more AWS resources or just lazy."

This is a picture of us in a room one day. This was a particularly interesting day because we were down for the entire day, like eight hours, of us just sitting in this room with couches, trying to figure out how to get the site back up. This is the day that New York Times published an article saying that Coinbase was down and weren't able to handle the load. This is a major inflection point for us as a company, is this realization that if we didn't get our act together, Coinbase wasn't going to survive.

Sitkin: We call this talk Capacity Planning for Crypto Mania. It's going to be about how we got our groove back, so to speak, how we turned that story around from that dark day Luke just introduced last year. We're going to present some of the things that we learned while scaling Coinbase to what ended up being a 20 x increase in traffic over the course of just a few short months. We made some mistakes, we learned some lessons, and we had the privilege of working on some fun challenges while we did it. During this time, of course, our tools and systems had to mature very quickly. We're going to walk through the story of Coinbase starting from that day, we're going to call out some insights and learnings as we go. And then we're going to get into what we're doing now to be better prepared for the future.

To get some quick intros out of the way, I'm Jordan, this is Luke. We're on the team called the Reliability Team at Coinbase. To be clear, we're just two of the many folks who are involved in the wild ride of last year, and also of maintaining the reliability of our systems now and into the future.

Buy and Sell Digital Currency

As many of you indicated, you're already familiar with Coinbase. What most people know of Coinbase is this app, which we like to be the easiest and most trusted place to buy, sell and manage your digital currency. But we're actually a lot more than just this app now. We're a small constellation of other brands and services around the Coinbase name. And to give a very high-level overview of what our tech stack looks like, the application we're going to be talking about mostly here is a monolithic Rails app. It's backed by a MongoDB database, and we deploy and manage our infrastructure on AWS.

In this talk, we hope to convince you of two main things. First, the title of the talk is about capacity planning, but we're really going to be spending most of our time talking about load testing. And that's because we feel that load testing, otherwise known as capacity testing, stress testing, volume testing, is the most important tool you have when working towards capacity planning. What I mean by volume testing or all these other different versions of the same term, is really being able to simulate and study realistic failures that might happen in production before they occur, so you can understand them.

The second thing we hope to convince you of is that perfect parity with production in your load testing environment is not necessarily a requirement to get good results from your load testing system. One of the ways we're going to break this down is by introducing this concept we call the capacity cycle. This is going to be a theme that we return to throughout the talk so I want to introduce it now, we're going to go into more detail about what it means, but we're going to be sharing a lot of our experiences through this lens.

Backend RPM

Demi: So let's jump back a little bit and give you a little more context on that story. So just to show you what our traffic patterns looked like before things got crazy, they were pretty casual, is what I'd say. This is our backend traffic to that Rails service. And you can see that traffic is going up, traffic is going down, but it's staying within a pretty tight bound. Drawing a red line here for no particular reason other than to show that it's bad up there, we were nowhere near what this red line represented.

At this point in time, a red line would have meant people really became interested in cryptocurrency, profitability, these type of things, they weren't the type of things we were worried about. We didn't think we're going to hit this red line. However, that red line ended up being a disaster. This is a picture of our traffic. This is to be clear, up until around July of 2017. This is showing our traffic and showing the ways that we went right past that red line. Obviously on the back of this is Ethereum and Bitcoin enthusiasm.

We had sustained days above that red line. We had days where starting at 3 a.m. Pacific all the way until like midnight Pacific we were over 100,000 requests per minute, because people just kept trying to get on to buy stuff. We'd be down but it didn't matter, people just wanted to keep coming back. The best way we could represent this was this explosion, people flying everywhere. Also put The New York Times article there, we're just trying to really represent that this was a time when things were not going well.

Web Service Time Breakdown

And the reason things weren't going well, is that real quickly, this is what it looked like when things were going right. So this is New Relic, very simplified, but it's showing our Ruby time and MongoDB time. You can see that Ruby time is obviously a big chunk of what we're doing inside of our application with this tiny little frosting of MongoDB time. This is what it usually looked like on our site when you visited. 80% of the time it's been Ruby, a little bit MongoDB, and there's some other services we weren't counting here. But when we were experiencing these issues, when the site went down, it looked like this. This was confusing for a whole bunch of different reasons. First of all, Ruby and MongoDB are spiking together. You can see that for some reason they're going in lockstep, obviously, they're up a lot. I mean, you look at the chart on the left, that's four seconds of response time just to get the Ruby and Mongo. You can imagine nobody's getting through at that point.

This is a really confusing chart. But what made it even more confusing is at that point in time, this was the only chart we had. We took this chart at face value. We would sit there and we'd be like, “Well, obviously Mongo is slow, but so is Ruby.” Okay, so we were like, "Can we prove otherwise?" And people would be like, "Dude the chart says this." We didn't have the tooling we needed to get deeper into what was happening. The reality is that this chart was wrong. But it led us down some dramatic rabbit holes during this period of time. We spent days doing ridiculous things. We would be down and the best idea in the room would be something like, "What if we adjust the kernel parameters?" But that made sense at the time. It was like, yes, maybe the kernel is the problem here. There were a bunch of rabbit holes.

The Wakeup Call

To illustrate in terms of that cycle Jordan brought up, this is what was happening at that time. We were having a nice consistent load test every morning or all day. When we went to analyze it, we went in the complete wrong direction. The instrument and improve part of the cycle, what you're supposed to be doing is you're supposed to be adding instrumentation to help you understand what's happening during your load test. At this point in time, we weren't doing that, we were just staring at the same instrumentation again, and thinking of outlandish ideas.

To get out of this issue, to solve some of these scaling problems, we did what everybody loves to do best and blame it on the database and upgraded versions. We upgraded every single version of MongoDB to get ourselves up to the latest. If you're familiar with MongoDB, we were on a really bad version to start, so there were some massive improvements. We did things like splitting apart our clusters. For example, we have a user's collection and a transactions collection. They were on the same host. Split them apart. They have a lot more room to grow, and this got us out of the situation. However, we began to realize that if we wanted to be able to withstand more load, we needed a better approach. Because at this point, it felt like traffic was going to keep going. I mean spoiler, it did.

But the major lesson here, little bit obvious, but it's really important to drive home. Good instrumentation will surface your problems. But bad instrumentation will obscure them, it'll confuse you, it'll make you look stupid on a stage one day. So let's jump forward a little bit.

Sitkin: Yes. Getting back to our timeline, let's say we're in the middle of 2017 now. Going back to this traffic graph we've been looking at, you can see that after this breakthrough and this panic and rapid improvement that we had to go through to survive this, traffic was higher, but sort of relatively stable again. We've sort of reached a new baseline level of traffic. We're basically handling it at this point. You can see that there's some variation here, but we're basically at a new sort of regime of traffic.

At this point, the load test that we were getting from our users wasn't causing our site to fail in the same way anymore. We weren't jumping into this cycle that we only came to understand later. As a result, we sort of spun our wheels a lot, optimizing what we had, but we weren't being creative enough in stressing our system to create this next failure pattern. So as a result, we're still a little bit uneasy and worried about the future. To illustrate this in terms of cycle, this ended up being what we affectionately call our YOLO load testing phase, where you can see we have sort of a week version of our load test step here.

We realized that we wanted to artificially get back this load test, so we can get back into our rapid improvement cycle. So just did the simplest thing that we could imagine. We just basically ran some synthetic load tests, using some off the shelf tools. We tested against our development environment. We weren't really sure where best to start, so we just jumped in and started working.

The Pillars of Realistic Load Testing

Before I go into a little bit more detail about exactly what we did, this is a good time to introduce the concept that we're calling the pillars of realistic load testing. This is a way of breaking down your load testing strategy across three categories as a better way to assess how realistic you might be getting. An important point in this is that each of these three categories is something you can address separately, while increasing the realism of your load test. You don't necessarily need to increase them all together, or they don't even necessarily have the same importance between each other.

The first one we call data; this is the shape of the data in your database. This is things like how many rows of each type do you have in your test environment? Are the relations between these records realistic compared to what you have in production? The next one is traffic. And this is the shape of the traffic that's coming out of your load testing system. You might ask yourself, how many requests are coming out? What's the rate? To which endpoints are these requests hitting? And does the distribution of these match the type of traffic we actually see in production? The third category is the most straightforward one of all, it's just does the physical systems that make up your load testing environment match what you actually run in production? This is how many servers of what type you're running in your load testing environment? How are they interconnected? Does the networking layer look the same?

In this context, jumping back to what we actually did at this point, we started with an off-the-shelf open source tool called Locust. We felt that this was the easiest way for us to get started performing distributed load tests. We knew that we'd be able to get a pretty good amount of throughput out of this. If you're not familiar with Locust, it's a tool that lets you write simulated user flows in Python, and then play them back across many nodes to perform a distributed load test. The one main control you have in this is turning up the number of users that you have in your pool of simulated users, and then how quickly they get added to that pool.

A little more detail. This is what a basic architecture diagram might look like of Locust. You have one master node, which you interact with to control the test, and then it controls a fleet of slave nodes. Each one of those slave nodes is responsible for maintaining a little pool of users that's running through these simulated user flows that you've written in Python.

At this point, this was our first attempt. It ended being little more than a toy for us. It wasn't really delivering the types of results we needed. But we didn't really have the perspective to fully understand why it wasn't delivering, we just weren't really confident in the results it was giving us. But despite some of these problems, it did help us better understand what a good load testing system might look and feel like.

To put this in perspective of these pillars that we've just introduced, we knew that the data and the systems didn't match production. That was sort of by design. We knew that we were going in with just sort of a naive first pass approach. But the real problem was that we didn't understand in which ways the traffic coming out of our system wasn't realistic. We hadn't set up a process for increasing the realism or even just creating this initial user flows. We didn't really do this from first principles. As a result, the system never became an important tool in our toolbox because we didn't trust it. And coincidentally, this also quickly became irrelevant.

The Holy Grail

Demi: Because in late 2017, which is really like October 2017, cryptocurrency became actually mainstream. We call this the our mom knows the company we work for now, the texts from people in high school phase. We went to the top of the App Store, and the news articles began to look something like this: "Bitcoin mania," "Cryptocurrency mania," "Should you buy cryptocurrency?" "Should you invest all of your money in cryptocurrency?"

Our traffic looked something like this. That red line is that arbitrary line I drew for scale, and for even more scale, there's a massive jump that brought us to our knees before. During this period, it wasn't so bad. We definitely had an outage here or there, depends on how closely you followed, whether I'm telling the truth or not. But the reason we're able to survive is because we had this here, we call it the Holy Grail of the capacity cycle. To walk through this, basically what we realized, we got in this rhythm, we'd wake up one day 4:30 a.m. right when people on the East Coast are like, "Should I buy a bitcoin?" They began to hammer us. We'd spend the morning just keeping stuff online. We'd get in at lunch, we'd analyze the results, what was taking us down? What would definitely take us down tomorrow?

For example, let's say, our user’s collection in Mongo, which had now been broken out into a new cluster, we had no further room to vertical scale. What can we do? So you spend the evening trying to improve. You add instrumentation to add clarity, you improve by potentially adding a new feature. Then the next day, you get load tested again. This cycle would continue. Every day, we'd wake up, sip our coffee, “Okay, here comes a load test,” come in the afternoon, analyze what happened, and improve an instrument.

Some of the major things we did during this time for that user's cluster. One weekend we added an entire caching layer, using Memcached, the identity cache, we call it. This dropped load on the user's cluster by 90%. That bought us enough headroom for at least a few more weeks. It allowed us to move on to the next day and solve the next problem. This flow was what got us through that issue. And it leads to this major lesson here, that faster feedback when you're solving problems means faster progress. Every time you go through this cycle, it's a chance to make an improvement. The more times you go through this cycle, the more times you can make improvement.

Sitkin: Let's get back to our timeline. We're in 2018 now, and we've just gone through this harrowing, very tiring experience of being in this cycle, getting this rapid feedback loop and rapid improvement, making a lot of progress on our systems, to a time where we're not hitting all-time highs every single day. We start to hit them less frequently. You might have thought we'd be relieved to be experiencing this. And yes, trust me, it was nice to get a few more hours of sleep during these days.

But there was an underlying sense of anxiety by having this wonderful load test taken away from us every morning. Which is we didn't have this real catalyzing idea of exactly what the single most important thing to work on was, in terms of increasing our capacity anymore. In terms of the cycle, again, basically, this is when our load testing step kind of dropped out from under us. It was a little bit scary, because we knew that there were still some lingering performance problems left over, obviously, still new features get added, new things are happening.

In fact, now we're focused a lot on launching new currencies. Every time we do that, we see these slightly different interesting spikes in traffic, different traffic patterns that we want to prepare for. So we felt the need at this point, to get back this realism, to get back this load testing step. We needed to be able to see realistic load in our test environment.

Capture & Playback of Production Traffic

Building on what we experienced with our YOLO synthetic load testing approach was that we weren't really confident in the traffic that was coming out of our synthetic system. There was this lingering question that came up a couple of times while we were working on this, which was, why can't we just use real production data in our load tests? It seems like a natural question to ask, it seems pretty interesting. This way, the way we created failures, in our load tests would just naturally match the kind of failures we'd see in production because we're actually just using actual user behavior.

Further, we felt pretty sure that MongoDB was the most sensitive piece of our stack right now. It's the most shared resource, and it's the thing that has historically taken us down. While in general, it might not be a good idea to test certain things in isolation, I want to use another concept here to justify why we feel that doing isolation testing in something like database is useful. Neil Gunther has this concept of a Universal Scalability Law. This is one of many ways of describing how different system topologies scale. We've got two lines on this simple graph here; the dotted line representing how a system with no shared resources might scale. It has a perfectly linear scaling story because as the throughput increases, the load has a perfectly linear relation to it. There's no reason for the relationship between those two things to change as load goes up, there's no shared resources in the system.

But in contrast, we have this red line, which represents how a system with shared resources might scale, because there's contention for these shared resources, such as a database. As the load increases, the throughput is going to decline based on how much contention around that shared resource there is. The shape of this curve is going to change based on the actual design of the system. But the point here is that you get diminishing returns on these resources as you grow, and then you even start to backslide a little bit, as load increases even more. This is all a way of saying that it's almost always the load on the database that takes you down, it's precisely for this reason that it's a shared resource. Most of the time, you have a pretty clear scaling story for things like app nodes that are stateless that aren't shared resources themselves, but it's usually load on the database that's problematic.

Mongoreplay Cannon

So to this end, we set out designing a capture playback story around load testing our database. We ended up building something that we call Mongoreplay Cannon. This is made up of two main parts; we've got the Mongoreplay capture process, which lives alongside our app nodes, and our background workers. It's created an extra socket on the database driver that is listening for Mongo's wire protocol traffic, and storing it. And then we have another process, which later on when we're ready to perform a load test, is able to interact and process these capture files, synthesize them into a single one, and then play it back at multiples of real volume into a test environment which is generally a clone of our production databases.

This was a big win, this worked out really well, for us, when we're talking about testing the database in isolation. We use this to good effect to do things like right-sizing our production clusters very confidently, knowing that we can adjust different tuneables on the database change parameters, and know exactly how it might respond in the real world. Because again, we're playing back the same load on the database that our application actually generates. We're able to create these very realistic failures in our test environment. So of course, the next thing we thought about, if this is working out really great, we're testing our database in isolation, what about testing the rest of our system using the same strategy?

Demi: Obviously, as you can see, we were able to test MongoDB in isolation. So literally, between Rails and Mongo, we were able to capture that traffic and replay it. That was the closest production as we could get. However, what we realized is that we had a lot more boundaries in our system and we had to come up with a way to be able to test this in a way that matched the success of our Mongoreplay story. This is all in an effort to make sure that we are not going down in an opportune moment. If you're not testing the relationships and boundaries in a load test, it can tell you a lot about that single isolated system. But what if a new regression occurs between a slightly lesser prominent system or any other shared resource in your environment? So we knew we needed to try to find a different approach.

Traffic, Data & Systems

Let's walk back through those three tiers or three pillars of realism, starting with traffic. These are kind of ranked in terms of how effective we see them as being. For example, all the way at the bottom, the most unrealistic or basic way of testing, would be to use this naive single user flow. So that means just test what happens when a user hits like three or four pages, just kind of like if you're running A/B or something like that.

The second, and maybe a little bit more realistic approach, might be to do something like synthetic traffic generation based on real user flows. For example, maybe you have a tool that is synthetically generating traffic, but it's looking through your log data to figure out how to play that back against a new environment. And finally, the Holy Grail is this idea of capture playback. You're taking the exact real production traffic, and you're just pointing it right back at something else, and seeing how it responds.

Quick aside, capture playback sounds awesome. The one major issue we ran into, especially with regards to traffic, is that recording post bodies is extremely difficult for us. We store a lot of sensitive information in post bodies and replaying that back against a non-production environment, or even a production environment, has a whole host of issues. Additionally, a lot of IDs that are used in subsequent requests are not generated deterministically. So when you're trying to replay traffic, it's very difficult to match up those requests. What you end up needing to do is rewrite them in some way in order for that to work, which detracts from realism.

Then talking back about data. The easiest way is just to siege your database the same way you siege your development environment so you create them. Just create some basic users, plug them into your load testing framework and get going. The next best option is synthetically generate the data, but do it in a more realistic way. Actually go through and see what's the layout of your users, do you have a lot of users with a lot of accounts, what are the relationships between these users, and create those synthetically in your environment.

Finally, the top two here, kind of this idea of scrubbing data. So you can take a production database, and you put it in a less production environment or another production environment. But it's important to remove obviously, that important customer data that we're trying to protect. Obviously, the best possible way to get a realistic load test is to test in production actually, on that data.

And finally, talking about systems, this kind of makes sense. But if you're testing on a simple development environment, that's not going to give you realistic results. If your clusters aren't sized correctly, they're going to deliver unrealistic results. Then there's creating a production parity environment. Maybe you have this idea of a codification framework that's creating all your production resources, just point it at another AWS account and run it against that. That's pretty realistic. But obviously, testing in production is the ultimate tactic. You're getting the exact production environment you have with all the quirks and knots and blah, blah, blah.

However, we began to think about this a little bit and realized the capture playback approach is tough, synthetic approach is tough, but we're a blockchain company. Why can't we find a blockchain solution to this problem? So we decided to take a bold new approach. So we're happy to announce today, launching on Coinbase is Load Testing Coin. Load Testing Coin is an ERC 20 compatible coin that allows you to tell your users to come load test you. I'm just kidding. I don't want to get anybody too excited. We had to do it, don't go buying load testing coin in the hallway, it's fake. But really, wouldn't that be cool though? If you could just tell your users to come … not the coin, the coin is stupid. Actually, would you buy it?

Sitkin: I might think about it, yes.

Demi: We'll pump it. Sorry, this is transcribed, right? Will you cut that part? So the point is the ideal solution here, tell your real users to come to your site and do the real things. Just like happened to us, we want to create that crypto mania on demand, but it'd be cool if we could incentivize people to come in. Unfortunately, this idea got shot down. But the reality is, what we decided to do is we decided to go back to that original strategy we had with Locust. We decided to see, how much can we improve this strategy based on these tiers? How can we walk up these tiers and improve this strategy into something that could be realistic for us to use on a day-to-day basis?

The Capacity Cycle in Practice

Sitkin: So as Luke said, we're going back to what we had previously called our YOLO testing strategy. But we're starting over from this new starting point of having a better understanding of what we need out of our load testing system. So in this section, I'm going to walk through what it might feel like or what it feels like for us to apply this capacity cycle in our actual day-to-day work, in terms of taking a load system, a load testing system that's at a pretty rudimentary state, and then using our goes around the cycle to increase its realism. And then eventually, to find some interesting scaling problems with our system.

The setup for this example here, we're going to get started as quickly as we can. We scoped it down to just creating seed data using spec factories. The baseline level of realism we can get with data. Then for traffic, we're just going to be recreating a script that's just a single simplified version of a single user flow, just the most basic thing that could possibly work. And then systems is going to be based on a dev environment that we already run. It's a handful of nodes behind a load balancer. There's no background workers, and then just a single instance each of our three main backing databases. So very straightforward, very simplified, very pared down.

Let's get started. We do our first load test, we get ourselves up to about 400 requests per second. The first thing that jumps out is that CPU on Redis is fully utilized. We're at maximum. So naturally, we look a little bit deeper into how the system is laid out. It becomes pretty clear that the problem here is that we isolate workloads in production by running several different Redis clusters and then our test environment, we're just running this one that's serving everything. So this is a quick go around the cycle, it's very clear that the easiest thing we can do here is make our systems more realistic. So we're going to improve that tier by breaking out our Redis clusters and our test environment to match what we run in production.

All right, so again, let's go through the cycle one more time. This time, we end up failing at the same point. Again, looking at the CPU stats across our Redis clusters, we see that one of them now is pegged at 100% CPU. It turns out that we just actually forgot to disable a tracing tool that's quite computationally expensive that we don't use in production. So again, we have a good chance to improve the realism in our systems category in our load testing environment, just by disabling that tool. So that was another quick feedback cycle.

All right, let's run the test again. This time we get a little bit farther, we get to 850 requests per second. Looking at these metrics across our system, something different was happening this time. We looked at the Redis stats, they look okay. We looked at our Mongo stats, those looked okay. So we zoomed back out, we looked at just an end-to-end trace of the requests between what we actually see in production on this endpoint, and what we see in our load testing system, and everything looks pretty much the same. But clearly, we fell over sooner than we'd expect. But then it turns out, we noticed that we're maxing out CPU on our app nodes. Basically, Ruby time is killing us.

Here, we have another chance to improve the realism in our systems by matching the ratio between app nodes and backing services that we actually run in production. The easiest thing to do is increase the number of app nodes we run in our test environment because that gets us closer parity with production. So we can quickly make that change and once again, go around the cycle.

The fourth time around, we get farther, we get to 1,500 requests per second. This time, the first thing that jumps out is CPU load on our MongoDB clusters is too high. It turns out that we use this great tool called mloginfo if any of you are heavy MongoDB users. This is a really great tool for parsing MongoDB server logs and giving you a ranked breakdown of specifically which query shapes are expensive and slow in your system. And this pointed us to a single expensive query that we're not used to seeing ranked very high on this report. That sort of pointed us to the fact that there's probably an important discrepancy in the shape of our data between our load testing system and our production environment. And so here, we realized that we weren't properly doing a good reset of our database state in between each of our load testing runs. This pointed us towards the fact that we needed to increase the realism of our data category, like fixing this reset script.

We do another test run, now with our improved load testing environment. We fall over at 1,400 requests per second, pretty similar range. But this time, the CPU stats again on MongoDB are doing okay. But looking at the traces, we saw that actually, the responses from Memcached were slow. That's definitely something we're not used to seeing. So we're sort of digging around a little bit, looking at hardware level stats. The first thing that jumped out is we're trying to run a production level load against a single T2 micro, which is a tiny instance size running Memcached. So obviously, that fell over pretty quickly.

Again, an easy way of increasing realism there is just to bump up the instance size on that MongoDB node. That gets us another chance to go around the cycle. This time, we actually do pretty well; getting up to 1,700 requests per second on this user flow, which is actually a particularly expensive flow in terms of resources. We don't normally hit this kind of load in production very often at all. And hitting 1,750 requests per second in a test environment was pretty encouraging; this is a good result.

CPU stats on Mongo look pretty good, like our hardware stats across our stack are looking pretty healthy. That actually gave us pause for a moment because we wouldn't naturally expect to be able to serve this kind of load if our user flow here was actually exercising it in a very realistic way. So this pointed us to the fact that there's probably discrepancy in the traffic category here between what we see in production, and what we see in our load testing environment. Digging in a little bit farther, we noticed that was definitely true. We were only exercising a fraction of the endpoints that we really need to hit in order to reproduce this user behavior in our app. We're able to improve realism in the traffic category by doing a better job of investigating and writing our scripts to better recreate what we see in the logs when users are actually exercising this flow.

The last time around the cycle, we get up to about 700 requests per second this time. That's probably a little bit more realistic, based on what we might expect to see. Analysis shows hardware stats across most of our stack particularly Mongo is okay, we're sort of used to seeing failures in Mongo, so this is often the first place we look at hardware stats at our database layer. However, again, CPU on our web nodes is maxed out. Now, last time we ran into this problem, we felt comfortable doubling the number of app nodes in our test environment, because that got us to a place where we were closer in parity with our production environment. But we weren't going to do that this time because now the ratio between the number of nodes we run in our load test, and what we've got in production was accurate. So it would be the wrong thing to do, it would detract from our realism by adding more nodes at this point.

But when we actually dug into some profiling to see what our app nodes were doing CPU-wise, what was actually occupying all this time, we uncovered something interesting. So this is an issue that we had sort of been aware of. But we noticed that this one particular performance problem was the bottleneck that took us down in this test. We had some event tracking that was doing some inefficient processing. So here, this was the result we were hoping for. We actually uncovered a bottleneck that was convincingly realistic, something that probably would have taken us down in production when we reach this level of load. The right thing to do here was actually apply the fix, we improved our systems. And now we can jump back into the cycle and get some more insight from it.

I'm going to leave this example here. But I hope by walking through this real-world example that I've given you an idea of how powerful this quick feedback loop can be, in terms of both improving the system under test, but the test system itself. Each iteration of the cycle is a chance to improve one of those things. Before too long, you're going to get to a place where you're actually uncovering important issues.

Also an important thing to note about these issues is, you're probably aware of many potential performance issues that you might have in your system. But the real value of this is that it forces one of those issues to pop to the top as the most prominent one, the thing that's most likely to take you down, and it sort of makes your work planning very easy. The single most important thing to work on, that's actually going to increase your capacity the next level.

So now that we're here, we've got some bigger plans looking into the future. But I want to leave this section with one lesson, which is that we're all pursuing realism in our load testing environments. But the point is that the journey, as you increase the realism with each of the steps, is just as important as the end goal because you're going to learn a lot of things along the way.

The Future

Demi: That takes us up to the present day. So what we're using this load for now, is preparing for new currency launches. This is more of a Luke and Jordan tool right now. We're able to use this and we're able to find out and ensure that every time we launch a new currency that we do anything that's going to cause a spike in traffic. We can test the flows that we know are going to be prominent during those, and ensure that there are no bottlenecks that we've yet to uncover that exist before they happen.

We're currently in the process of increasing the realism of these tests. Building automated ways to understand the most recent traffic patterns and apply those to how we're actually running our tests is at the top of our list right now, as well as continuing to improve the way we use data. What we would like to get to soon is this idea that our traffic is based on exactly real traffic, and that our data is based on real scrub user data. Ideally, our systems are the same as well.

The cycle continues to guide us. We're very excited about this capacity cycle. We're preaching it to everyone internally; some are more convinced than others. But what we're really trying to accomplish internally is to make this idea stick. We don't want this to just be a Luke and Jordan tool. Because a Luke and Jordan tool doesn't live past us. What if we get distracted when we're working on something else? What if the tool starts to fall apart? It becomes useless.

What we're trying to do is create a repeatable ongoing element to how we actually ship code to production. We want to create this essentially fast feedback loop for everyone. So when somebody pushes a new change to production, they should know, did this affect the way that performance in production works? So some of the ways that we're approaching this are potentially just part of the build process in a squeezed test environment, like a very consistent environment, how many requests were you able to handle of our most important flows? These are things that we're focused on right now.


But before we go, I want to really run through our lessons real fast again. The first, good instrumentation will surface your problems, and bad instrumentation will obscure them and confuse you. I think this applies to almost everything, especially if you're someone who's in production frequently digging into issues. If you don't have the right instrumentation, add the right instrumentation. If you don't understand your instrumentation, get curious. Don't just assume that what you see is what actually exists.

Faster feedback means faster progress. This applies to everything, but it especially applied to our ability to survive, and somewhat thrive during these periods of crazy traffic growth. It was only because we're able to use that fast feedback to guide the actions we took. And finally, we're getting great load testing results using simplified load testing environments, and increasing realism. So when you're pursuing load testing, the journey can be as valuable as the destination. Don't be afraid to dive right in and immediately start to get results. We've had issues in the past- and I've seen this fail countless times, where what you try to do first is hit that realism, and you try to make things exactly the same as production, but in that process, you're losing the learnings along the way.

This is the end of the presentation. We ended up learning a lot of really valuable lessons, and a lot of hard lessons around capacity planning. We're definitely hoping this talk helps you avoid repeating some of the mistakes we made. These are our Twitter handles. But we're really interested in hearing how you guys are doing load testing on the field. Hopefully, this inspires some interesting conversations. And yes, please, reach out on Twitter catch up to us some other time in the conference.

Questions & Answers

Participant 1: Thank you for the talk. It was very insightful, and it really reflects some of the activities that are going on in my team right now. I'm curious to learn from your experience and apply the same back to my team. One of the questions that I have is, where do you trigger these tasks from? And how far do you stretch into, say, for example, if the load test should go 5 times, 10 times, 50 times off the production load, and that's where you would be satisfied with, “Okay, this is the current design and what it can support”?

Sitkin: Locust offers a web UI that lets us actually kick off the test. So that's one of the nice things about using this open source tool. To answer your question about how we decide how far to ramp up the test, up until now, we basically just set it to a kind of a ridiculous level, and we're looking for the point at which performance starts to degrade, and the service breaks down. So at this point, we're really doing stress testing. We're looking for when the system stops working, and then just measuring what that level is, and continually increasing it. We sort of have like a back of the napkin idea of where we'd like to be, which is a rough multiple of what took us down during the last major run that we're shooting for. But a lot of it is just awareness around where specific user flows are going to start to die.

Participant 1: One more follow up. How do you work with the dependent services, because that is not in your control? And if those are the bottlenecks, what is your advice to work around those, or even work with those teams to get things resolved?

Sitkin: Yes, so this is where I'd say a little bit of the art of load testing comes in. One of the nice things that we experienced about starting very simple is if it's too difficult to get started, just stub it out for now. Then as you gain confidence in the realism of this small part of your flow, then you can start to add those dependencies in one by one. And each of those are going to have their own idiosyncrasies about the best way to stub them or simulate load on them.

For example, as a cryptocurrency company, a lot of things we do end up touching are blockchain. And of course, we have to get creative about stubbing out those things. There's only a certain amount of realism we could ever achieve with something like that. So yes, I find it really hard to give a universal answer to that kind of question. But I think going back to the flow, it's all about addressing them one by one. Get that fast feedback loop and only add them one at a time. Don't shoot for the full realism all at once. Because the insight you're going to get just by adding that one change might be really useful.

Participant 2: Thank you for the great talk. My question is, as a fast-growing company, as you guys are launching new products and features very quickly, do you see that in the production environment, the shape of the data and the pattern of the request change frequently? And if so, how do you keep your load testing environment up to date with those changes?

Demi: Yes, so especially in terms of traffic, it's one of the major benefits of using capture playback. You're guaranteed take a capture, run it again, you're guaranteed that the traffic is going to be up-to-date. But the way that we're doing that now is by a script that runs through sections of data and we pull out flows that exist in there. So there are certain stateful flows that exist. So you can't just like play the request back in order, you need to pull them out.

What we're able to do is generate a shape of data and then apply that to Locust. That part is actually easier. The shape of the data is actually extremely hard. That's something we haven't solved at all. We know what the shape of the data looks like, we're actually able to use analytic tools that our data team uses. They're able to guide us on like, our biggest user ever has, let's say, 10,000 different transactions, right? So we can kind of build a distribution there, but keeping it up-to-date, we haven't solved that problem at all. It's a hard one. But ideally, what we can do is interface with those tools and kind of adjust that as we clear and as we update the test.

Participant 3: Great talk. I was wondering - because generating load tests, that's quite expensive, it’s basically like attacking, you're trying to get your site down - I was wondering if you find any methodology to maybe calculate a theoretical but practical limit? So based on some example, to extrapolate and find where will be the next bottlenecks based on that?

Sitkin: I kind of touched on this a tiny bit with this concept of the Universal Scaling Theory. Those are the kinds of things that we do keep in mind when practicing a little bit more of this artful approach to load testing. And that helps us to sort of form a hypothesis about what might possibly break. But I think the nice thing we discovered about getting into the cycle is that we have a quick chance to test these hypotheses. It also is sort of sharpening our performance theory approach of seeing how these things actually relate to real-world stuff. But to more directly answer your question, we haven't done a lot of that sort of more like academic approach. We've been a lot more hands-on focused on heuristics that are coming out of our system right now. Would you say anything different about that?

Demi: It is expensive. What we've had to do now is build some really complicated techniques to not get our door knocked on by finance. It kind of involves how you can scale. Fortunately, AWS makes this really straightforward. The way databases can scale, the way application services can scale, we have a script of bring it up, bring it down. And that's helped.

But what we love to do, and definitely when we get to this idea of squeeze testing, every single pull request or commit that comes into our major projects, that's what we're going to try to do. What we can do as a start, I guess, I haven't really talked about this, but take that same data that we currently have in the squeeze testing environment, and do it on a full production environment, see how much they handle. Then you can kind of extrapolate there, and that would be a potentially cool way to handle that. What do you think about that idea?

Sitkin: Nice idea. Luke.

Demi: It's hard doing a talk together. We've got to bounce every idea off each other. I'm glad you like that.


See more presentations with transcripts


Recorded at:

Mar 17, 2019