Transcript
Frank Yu: My name is Frank. I am Director of Engineering at Coinbase, specializing in exchange platforms. Today we're going to talk about building exchanges. Over the years we've done plenty of builds: from 0 to 1, scaling 1 to 100, optimizing 1,000 to 2,000. This talk represents our current thinking on the matter. Don't hold me to any of this stuff; it may change as soon as we find out new things. What's an exchange? We're a place where you can submit an order to buy something, and we'll let everyone know about the price you want, and we'll let you know when your order gets filled. Fairly simple. Think of exchanges as financial infrastructure. Everyone who participates in finance and markets comes to us to get the up-to-date price of things and also to print trades, to assign prices when folks want to trade something. To a lot of folks, we're a trusted third party. A lot of people might think of this when they think of an exchange: a bunch of numbers next to some short name for a thing, a little stock ticker, whatnot.
Over the years, when I think about building and running exchanges, I think about this: the thing goes really fast, and if it goes wrong, then by the time PagerDuty rings, it's over. Pack up, go home. Maybe you won't have a home to return to, maybe nobody will. The potential losses we risk are multiple orders of magnitude more than any revenue we might get from any transaction, which might be something like 0.00001%. That's happened in my career. We focus so much on reliability that people in financial markets effectively outsource their reliability to the exchange, because they figure, if the exchange goes down, we're toast anyway. What has happened is that not only are we keeping state and making sure everything is correct for ourselves, we're also doing that, basically, for all of our clients.
Exchanges: Must-Haves
As financial infrastructure, exchanges have to be correct, obviously, but they also have to be fair. We've got to provide participants with equal opportunity to get performance. We can't advantage some people over others, because when you do that, other participants in the market are going to stop trading: it becomes prohibitively expensive to keep trading when you're at a disadvantage. Additionally, we've got to be available. The last thing you want is you buy some asset, it starts to go down, and when you're trying to sell the thing to manage your risk and cut your losses, we're down and can't handle your orders. This is one way in which our technical outages might result in your financial ruin.
Another thing we've got to handle: regulators are super interested in exchanges. They want to know exactly what the state of the entire market was on some arbitrary microsecond tick, up to some number of years in the past, and that has happened. We have served requests like that: at this exact microsecond, on some day back in 2020, this was exactly the state of the market. We've got to be able to recover state to exactly that point and then show people what actually happened on a tick-by-tick basis. We've got to be persistent for anything we've sent out and confirmed.
The other thing we've got to worry about: if we're financial infrastructure, when people think of infrastructure, they want something that's consistent. They don't want to get surprised by their infrastructure. For us, that means making sure our P99s are just as flat as our P50s, which are just as flat as our averages, and we put in a lot of effort to really stamp down those tails, because people literally build our variance and delays into their own models. Participants do that because if we randomly pause for a second, that generally results in them getting a bad trade and losing material amounts of money.
How Do We Build This?
We've talked about the non-functional requirements. It's super daunting. How do we build this? I think we got it backwards. We talked about non-functional, but let's talk functional. The first thing you do when you build an exchange is you test it. The last four or five times I've built an exchange, what I do is open up the IDE and build a black-box harness of user API actions, user stories, to specify the behavior that I want. We're going to do the same thing here, except this harness isn't in code, it's in boxes and whatnot. Here's a good way to think about it: the input of a system, what the system is keeping track of, and then what it sends out to everybody.
At the beginning, I start my exchange. Nobody has submitted anything in the market. No one wants to buy anything, no one wants to sell anything. Somebody's got to be first. What we do on an exchange is boil down a lot of requirements into simple concepts. One that I'd like to highlight for you is price-time priority. Three words, pages of functional spec. In general, it's super convenient to have some concepts that describe the universe of user interactions on the system. A lot of exchanges use this one, and it's super easy to specify complex behavior from those three words.
Let's talk about what that looks like. At the beginning, let's say we've got a participant, Mark M. He's first. He submits an order. He's down to buy BTC at 100. Don't ask me what the units are, but it's 100. What will happen is that we'll handle that API call. We'll go ahead and convert his buy order into something called a bid. It represents the resting state of the active orders in the system. Then we'll send it out to everybody. We'll tell everybody, somebody, not Mark M, but somebody is down to buy BTC at 100.
In this way, everyone knows that if anyone wants to sell BTC, they'll get 100, but only one. There's only one BTC on the market. Another little bit of terminology. We talked about buy and sell. I think everyone knows about buy and sell. We convert buy and sell into bid and ask to represent what the public sees. I use ask here. I think a lot of folks use offer. That's probably more common. Ask is three characters, and so it's nicely symmetrical. We're using ask here. When Mark goes and sells BTC, let's say Mark just has a ton. Mark wants to make money by buying low and selling high at scale. Mark is totally down to sell BTC at 101. Now he won't match with himself because he's not trading with himself. That sell order is going to become the top part of the ask. Then we're going to go ahead and send that to everybody. Now, if you want to buy Bitcoin, it's 101.
If you want to sell Bitcoin, you get 100. This is what is called the top of book: the best prices on the exchange. Then, Mark has a lot of BTC. He might go ahead and say, if you want to buy two more from me, I'm going to need a better deal; if you're going to sell a lot of Bitcoin to me, I'm going to want a cheaper price. Mark goes ahead and sends a buy order for two Bitcoin at 99 and a sell order for two at 102. If you notice, bids are sorted with the best price first, from the highest price going down, and asks are sorted with the best price first, from the lowest price going up. Higher bids are better in general. This is the market. Mark M is not the only participant in the market. Let's say Alice is down to buy two Bitcoin, so she goes ahead and sends an order for two Bitcoin at 100. Now here's where we explain what price-time priority is.
If an order comes in, whose order gets filled first, Mark's or Alice's? What we like to do is say: Alice came in after, so she should go behind, but her order was better than Mark's second order, so she's going to slot in between, behind Mark's bid at 100 and ahead of his bid at 99. Then we'll go ahead and broadcast out that there's another person who wants to buy two Bitcoin at 100. All this is done, and still no trades. Nobody's traded, because nobody has submitted a buy and a sell with overlapping prices. We call it a cross when that happens. What happens when somebody actually goes and crosses? Bob is like, I'm down to sell two Bitcoin at 99. That sell order is going to overlap with the buy orders, and we should make a trade. When two orders love each other very much, they make trades and annihilate.
All three of those orders on the bid side would satisfy this order. How does it work? We start with the best price, and within a price, we start with the earliest order. Mark's order at T0, which was also at the best price, gets to fill first. Bob fills that order: one of the BTC he wanted to sell goes to Mark, and then the other one sells to Alice's order and not Mark's other order, because Alice had the better price. All of this is to incentivize people to give better prices. Alice's order gets updated, and we fix the display. Then, what price did Bob get?
In general, Bob, because he's coming in second, is going to get the better price. We want to make it safe for Bob to say, I'm down to sell as low as I can, because it's safe: he doesn't have to worry about the price changing and catching him at a worse price. Bob's going to get the price improvement. The second person in a cross generally gets the price improvement. This is what we're going to send out to the public: two Bitcoin traded at 100. Bob's happy; he was down to sell at 99, but he got 100. Mark M and Alice are happy, because they got their orders filled. This is the basic layer of functional requirements. There are a lot of interactions here. You can envision this big test script with lots of user interactions and state to expect. It's pretty complicated.
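To make price-time priority concrete, here is a minimal sketch of a matching loop in Java. It illustrates the concept only, not Coinbase's actual engine; the class names are hypothetical, and details like self-trade prevention are omitted.

```java
import java.util.ArrayDeque;
import java.util.Comparator;
import java.util.Deque;
import java.util.TreeMap;

// Hypothetical sketch of price-time priority matching, not Coinbase's engine.
final class OrderBook {
    static final class Order {
        final boolean isBid; final long price; long qty;
        Order(boolean isBid, long price, long qty) {
            this.isBid = isBid; this.price = price; this.qty = qty;
        }
    }

    // Price level -> FIFO queue of resting orders: best price first,
    // earliest order first within a level. That is price-time priority.
    private final TreeMap<Long, Deque<Order>> bids = new TreeMap<>(Comparator.reverseOrder());
    private final TreeMap<Long, Deque<Order>> asks = new TreeMap<>();

    void submit(Order taker) {
        TreeMap<Long, Deque<Order>> opposite = taker.isBid ? asks : bids;
        while (taker.qty > 0 && !opposite.isEmpty()) {
            var best = opposite.firstEntry();
            boolean crosses = taker.isBid ? taker.price >= best.getKey()
                                          : taker.price <= best.getKey();
            if (!crosses) break;                    // no overlap, no trade
            Order maker = best.getValue().peekFirst();
            long fill = Math.min(taker.qty, maker.qty);
            // The trade prints at the resting (maker) price, so the
            // second order in the cross gets the price improvement.
            System.out.printf("TRADE %d @ %d%n", fill, best.getKey());
            taker.qty -= fill; maker.qty -= fill;
            if (maker.qty == 0) best.getValue().pollFirst();
            if (best.getValue().isEmpty()) opposite.pollFirstEntry();
        }
        if (taker.qty > 0) {                        // remainder rests in the book
            (taker.isBid ? bids : asks)
                .computeIfAbsent(taker.price, p -> new ArrayDeque<>()).addLast(taker);
        }
    }
}
```

Replaying the story above, Bob's sell of two at 99 fills Mark's one at 100 first (earliest order at the best price), then Alice's, and both trades print at 100.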
Scaling
How do we get this into production, and how do we scale it? What does scaling mean? Obviously, it means throughput: we've got to be able to handle lots of orders per second. Scaling also means we need to be able to change the behavior fairly quickly. I don't want to wait three months between deployments; I want to be able to change stuff quickly. How do I change stuff quickly when there's this complicated bit of logic? Let's focus on the core, and let's keep the core as simple as possible. Let's imagine different ways we could solve that test case. We could say, if I want to handle a lot of different orders, why don't I assign different CPUs to handle orders? Then each of them can go find matches, and so on. That way maybe I can take advantage of parallelism. Could work, for sure. But then, good luck reconstructing exactly what the state of the market was when someone asks you to go back 5 years.
If you let things be concurrent, that's where you lose determinism. What we talked about, actions 1, 2, 3, 4, 5, all those actions really should result in the same output every single time. If the same inputs result in the same outputs, what we get is determinism. So I just handle all those orders in order, as if I were clicking through those buttons; I put it all in one thread. One thread, what does that mean? That means I have one program running on one CPU core, not reordering anything, just handling orders and doing stuff. Obviously, there's superscalar execution and all that underneath, but we don't have to think about any of that. The other thing to note when you're single-threaded is that your scalability is now directly tied to your performance.
If the thing is faster, you get to handle more orders per second. In some sense, this is like the best unification of KPIs or metrics or whatnot. It frees you up to optimize the hot path, and immediately you get to do more business. It makes it super simple to decide what to do to scale up your business. Another cool thing is you get benefits in terms of availability too. If it fits on one box, which it does, because it's just one thread running on one CPU, and boxes have many CPUs these days. If it fits on one box, you can run copies. Why would you want to run copies? There's a couple of different ways to make sure that your behavior gets persisted and it's durable. You could write it to disk. What if that disk fails? Maybe you have some replicas on the disk, or you run in RAID or something like that. That does work, but I'll show you why that's slow. What we can do is we can rely on another way to get durability within our SLAs, which is consensus.
If you've got a really fast consensus implementation, then you can handle orders as they come in, only deal with persisting those inputs, and your business logic can run concurrently with the durability process. This is really different from something like your standard web server that writes to Postgres and comes back. Normally what happens is your web server receives an API call, it has to do the work, and then it sends its state over Postgres' network protocol to Postgres. Postgres has to do some computation, compute the result, write some write-ahead log stuff, and also send the acknowledgment back to you. All of that is blocking: your main process needs to handle it, your Postgres needs to handle it, and then it comes back to you. You've got to wait for the round-trip, and only then can you go back to clients. It's super slow. There's tons of jitter. With Raft, what we can do is: no Postgres, no database in the hot path, because we've got Raft.
If we didn't have Raft, we'd have to put this whole database in the hot path, and that would be scary. What if we just didn't persist to Postgres and didn't do a bunch of SQL? Here's what would happen. Without Raft, if I'm running a service, my matching engine, it's going super fast. It's handling things. It's not even waiting for a database, it's just responding. If that just dies, the hardware dies, or the whole thing dies, you lose data. You give up on persistence, people start complaining, and you're not happy. If you run copies of your compute, if you've replicated it out with Raft, then what can happen instead is: I'm going to make sure that at least two out of my three, or three out of my five machines have gotten that request message. Before I process it, let's make sure that most of my cluster has it. That's called quorum.
Then, when the hardware dies, I'll switch my main writer to one of the replicas and continue processing messages. What will happen is you'll probably get a 500. However, you will not have acknowledged anything that you've forgotten. That's really important. You don't lose data. Yes, maybe you fail a request or two, but you don't lose data. Then, when a hardware failure comes, you figure out how to restart and re-inflate the cluster. In the cloud, definitely run with five. We've definitely been very happy that we ran with five, and not three. Because if you run with three, you can only suffer the failure of one machine before you lose that replication. If you run with five, two machines have to die for you to be scared; you can suffer the loss of two, and three machines have to die for you to be broken.
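As a rough picture of "make sure most of my cluster has it before I process it", here is a toy quorum tracker for a five-node cluster. Real Raft adds terms, leader election, and log repair; this sketch assumes all of that away.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Toy illustration of "quorum before apply" for a Raft-style input log.
// Real Raft adds terms, elections, and log repair; this skips all of that.
final class QuorumTracker {
    private static final int CLUSTER_SIZE = 5;               // run with five, not three
    private static final int QUORUM = CLUSTER_SIZE / 2 + 1;  // 3 of 5

    private final ConcurrentHashMap<Long, AtomicInteger> acks = new ConcurrentHashMap<>();

    // Leader appends entry `seq` locally, then replicates it to followers.
    void onAppend(long seq) {
        acks.put(seq, new AtomicInteger(1)); // the leader's own copy counts
    }

    // Called as each follower confirms it has durably appended `seq`.
    boolean onFollowerAck(long seq) {
        // Once a majority holds the entry, a single (or with five nodes,
        // even a double) machine failure can no longer lose it, so the
        // matching engine may process it and acknowledge the client.
        return acks.get(seq).incrementAndGet() >= QUORUM;
    }
}
```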
Another thing this enables, and this is really important for the scaling horizontally part, is if you're running in a cluster like this, you can do rolling deployments. You can deploy your code without downtime that impacts your persistence SLA. What does that mean? The X on the service basically looks like shutting the thing down and restarting it. I could be shutting it down due to a machine failure, or I could be shutting it down due to my software development life cycle. This effectively unifies your resiliency with your deployment. It makes the unexpected normal. You can think of this as like, ok, let's shut this down, start a new version up. There's a new leader.
Then I can shut another follower down, and then I'll restart the leader last. That's how you can do blue-green deployments and have effectively no downtime on your exchange. This allows us to push updates to production a lot faster than traditional exchanges. Traditional exchanges maybe update their core tech once a quarter or so. We can get out there week by week, or even faster. In general, there's a reason not to just stream changes immediately: on an exchange, you can't really feature flag. I can't give one user the thing that makes everything better and not give it to everyone else. All the deploys are big bang, but if I'm not taking outages, I can go ahead and stream them in as soon as I have confidence. Why do rolling deploys matter? They mean our markets can actually operate without long stretches of downtime, no maintenance windows, none of that. This is really important in the crypto world, because things go wild at any point in time.
The last thing you want is over the weekend, your Bitcoin has gone way down and you can't sell because the markets are closed. The world does not stop turning at 4 p.m. Eastern. Things continue happening. Furthermore, when you have these long downtimes, you create weird financial discontinuities. Users who are trying to use the exchange aren't able to submit orders and do stuff, and then they have to rely on some weird side channels or whatever to manage their own financial risk. It adds complication and adds financial risk for everyone participating on the market. We have to get to 24/7. It's just so funny, not being 24/7 seems like such an anachronism in backend computing.
If you're not dealing with trading, presumably your systems already run 24/7. Any sort of website, plenty of things with much less engineering investment, run 24/7. Sure, it's great not to be 24/7. It's great for my work-life balance. It's incredibly easy when you can shut everything down, verify everything's on the same version and all of your data has been replicated, and then start everything back up. But you've effectively pushed the cost of your own engineering savings onto the customer.
System Layout
What does this actually look like? I claim that we've got a system with a Raft cluster, and I can deploy to it. How does it actually look? The Coinbase International Exchange runs in a region in Tokyo, and the matching engine is set up as a Raft cluster of 5, not 3. Then we also run a couple of things that arbitrate access to the exchange. These are your API gateways; I think a lot of systems have API gateways. API requests come in through messages. They could also come in through REST calls. Basically, the idea is that your gateway is a fairly simple thing that is stateless with respect to business logic, but maybe it does rate limiting and other things to keep stuff stable and correct. What we want with these gateways is dumb, fat, fast pipes into our matching engine. Fairly simple. Those processes just run; they don't have any state. I'm going to change up some colors on you. We talked about how we have some input messages; I'm going to color those blue. We have output messages; I'm going to color those red.
If you think about it, the input log plus the live state should completely determine the output. We can do some fun things with that. Let's say Alice submits an order to buy two Bitcoin futures contracts at some number. What's going to happen is our gateway receives the message over TCP or whatever, converts that message into an internal representation, one that's super fast and simple, and then sends it on to the matching engine. Let's say that Alice, with that order, traded with two people. What comes out is a mouthful. We've got to tell Alice that her order was successfully submitted. We've got to tell Alice she traded, and tell Bob that he traded. We've got to tell Alice that she traded again, and tell Charlie that he traded. That's what happens if Alice's order hits two other people's orders. It's a big mouthful. What we try to do is send this stuff back to users as fast as possible, asynchronously. They're not polling us with get-my-trades requests. They just have an open connection, and we're sending them events. Everything in this system is asynchronous and message-passing based.
One thing to note, though, is that the blue thing that caused it is much smaller than the big mouthful of events that come out. What we found out when we replicate our events cross-region, for long-term storage or edge-computing access or anything like that, is that it's pretty expensive. That's how they get you with the cross-VPC stuff. What if, instead of doing that, since my system is deterministic, I just replicate the request, have the request percolate to my downstream systems that care about the output, and then rerun my matching logic there? Now I can send that mouthful of stuff over IPC, at no network cost. That's something that's pretty powerful. It allows us to really control our network egress costs. It also means that what's being transported is our request stream and not the output. I'll give you another example. What if that input was SQL and the red stuff we were sending out was the write-ahead log? Someone submits a request to update blah, blah, blah where ID equals anything: one request comes in and may create thousands or tens of thousands of write-ahead log entries. If your system is deterministic, I'll just replicate that really annoying query instead of suddenly having a spike in my change data capture. Then let's send the change data capture, all that junk, over IPC to stuff on the same box. That's what's happening here.
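Here is a sketch of what that "replicate the input, recompute the output" shape might look like, assuming the engine is deterministic; the Engine, Request, and Event types are hypothetical stand-ins.

```java
import java.util.List;
import java.util.function.Consumer;

// Sketch of "replicate the inputs, recompute the outputs", which only works
// because the engine is deterministic. Types are hypothetical stand-ins.
final class ReplayConsumer {
    interface Request {}
    interface Event {}
    interface Engine { List<Event> apply(Request r); } // same logic as production

    private final Engine engine;          // identical, deterministic core logic
    private final Consumer<Event> local;  // downstream consumer on the same box

    ReplayConsumer(Engine engine, Consumer<Event> local) {
        this.engine = engine; this.local = local;
    }

    // Only the small blue request crosses the (expensive) cross-region link.
    // The big red mouthful of events is regenerated here and fanned out
    // over IPC, never over network egress.
    void onReplicatedRequest(Request r) {
        for (Event e : engine.apply(r)) {
            local.accept(e);
        }
    }
}
```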
I don't have to stop here. I can keep going. I can run copies of my matching engine, my logic, local to the gateways that talk to clients. Why would I want to do that? It allows us to handle queries in a way that does not perturb the writer. The purple arrows are basically "get some data, send some data back". Notice that those boxes are operating slightly latent, but they're not at all far behind. If you were doing an RTT to the master, you would double your latencies. Instead, we have these strongly consistent replicas that are streaming updates, and they can serve your order request basically in microseconds. It doesn't even hit disk, it just hits memory. I can also do edge computing. Say we have a bunch of non-latency-sensitive analytics queries or back-office queries; I can go receive those in another region.
A common way you would handle this is to cache your state in Redis or something, so it doesn't hit the database. With this approach, if your logic is deterministic, you don't need an arbitrary cache. You can just rerun your core logic and serve those requests directly from your core logic, with speed. Here's another very nice property of this architecture: your DR is done by construction. Let's say something wipes out all of the beige whatnots, all those boxes go down. You cut the replication link, you promote the core logic over there, and suddenly you're DR'd, probably as fast as you can get the approval to do it. It is possible to do it in an automated manner, but people have got to be sure before they do a disaster recovery. This is really important. I'll talk more about disaster recovery and what it means for architecture later.
How to Make the Hot Path Fast
This is the system. There are lots of boxes and arrows, but, fundamentally, the scalability of it is again tied to the performance of the hot path, because everything is handled in just a single thread. The faster your single thread runs, the more orders you can handle. How do we do that? How do we make this stuff fast? A simple way to make stuff fast is to just get rid of stuff. Don't do things that you don't feel are mission-critical. The most important thing you can do to make things fast is to avoid blocking the hot path. You want to minimize any reason for your processor to just be waiting on some action. How do we do that? Let's really be rigorous about what that hot path is.
In our world, the hot path is just the order going into the gateway over there, going to the cluster of core logic, and then back out. That's the hot path. Nothing else really matters. You can do whatever you want with the other stuff, but let's really make sure that hot path is quick. What that means is that the cluster, those boxes, should be really close to each other, because we want to be sure that at least two out of three of those boxes have the data. What I'm telling you is, don't bother putting those boxes in different AZs. Don't bother, because for scalability and responsiveness reasons, when you send something to the leader and go back out through a follower across an AZ, suddenly your RTT is milliseconds. It's already over by the time you're sending messages across AZs just to do your main transactions.
In my mind, you want all of the boxes in the hot path to be as close to each other as possible, in cluster placement and whatever, because that's what matters. You're already running five, with a bunch of redundancy; if you get a full, total outage of the entire region, do your DR. The other reason this is ok is that the folks who care about performance are going to be right next to you. Anything that impacts you is going to impact all of your clients, so when everything is down, you pick up your clients, your customers, and you do your DR hand in hand, by doing exercises with them.
If they won't do your exercises, they really don't care. What I'm saying is, don't bother with multi-AZ; locate all of your stuff for highly contingent transactional systems together, to keep it simple and keep your response time good. Because the alternative is, if you're sending messages across AZs and taking 3 to 4 milliseconds to respond, it's effectively an outage on every message. I'd rather have an outage every few trillion messages than an outage every message. Really think about that when you decide to take your Kafka or whatever and spread it all over the world before you handle a message. Don't block the hot path. Make it fast.
Data Ingress and Egress
Next, let's talk about the data that's coming in and going out. Often, you want to support customers who have very unique names, super long names, super short names, all of that, but your transactional system does not need any of that information. Avoid arbitrarily nested, arbitrary-length stuff, because what happens is, at the worst time, someone's going to give you a blob of bytes and you'll wonder what the heck happened.
The other thing is that CPUs are actually really great at advancing down the bytes of a message sequentially. For our internal representation of messages, we use Simple Binary Encoding. It's basically just a bunch of bytes next to each other. The CPU knows that 6 bytes after the front of the message is the message type, so it's super quick to decide if I even care about this message. The next 8 bytes are the ID of the thing I want to trade. The next 8 bytes are the price. The next 8 bytes are the quantity. You can fit a lot of stuff into 64 bits. Don't even bother with UUIDs, just use Snowflake IDs: you can fit a bunch of globally unique IDs that are sortable by timestamp in 64 bits. This is much better than having deeply nested protobufs where, to even figure out if you should handle a message, you've got to unmarshal it and build up a whole palace of Legos that represents your highly nested message, before you can even tell what the thing is. Please, at least for your message types, just use a simple offset-based format.
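Here is a rough Java illustration of offset-based decoding and a Snowflake-style ID. The field offsets are made up for the example; real SBE generates flyweight accessors like these from a message schema.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustration only: the offsets are hypothetical, and real SBE generates
// flyweight accessors like these from a message schema.
final class OrderMessageFlyweight {
    private ByteBuffer buf;

    OrderMessageFlyweight wrap(ByteBuffer buffer) {
        this.buf = buffer.order(ByteOrder.LITTLE_ENDIAN);
        return this;
    }

    // One fixed-offset read tells you whether you even care about this
    // message: no unmarshalling, no object graph, no allocation.
    short messageType() { return buf.getShort(6); }

    long instrumentId() { return buf.getLong(8);  }
    long price()        { return buf.getLong(16); }
    long quantity()     { return buf.getLong(24); }

    // A Snowflake-style ID packs a globally unique, timestamp-sortable ID
    // into 64 bits: here, 41 bits of millis, 10 of node ID, 12 of sequence.
    static long snowflake(long millis, long nodeId, long sequence) {
        // caller keeps nodeId < 1024 and sequence < 4096
        return (millis << 22) | (nodeId << 12) | sequence;
    }
}
```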
Then you can put a lot of services on that log and they won't mess with each other. The really nice thing you can do, if you have little messages laid out byte after byte, is get a facsimile of structs on the JVM, which is what we use. The JVM, like, what? Why the JVM? Java is slow. If the JVM runs a garbage collector, or is running a concurrent one that lowers your throughput, aren't you already done? Garbage collectors on JVMs pause for milliseconds at a time. How are you building these super-responsive systems on the JVM? Your garbage collector won't run if you never call new.
What we're saying is, provision all of your stuff ahead of time. You should not be allocating new memory on an order-by-order basis. Have buffers of data. We do things like put SBE structs in an off-heap map, so that we don't need to go back and forth and deal with spooky action at a distance from the garbage collector. Garbage collection is great. We can have totally fluffy test code and totally fluffy web code and totally fluffy analytics code that's not latency sensitive, but in your hot path, don't allocate. Even if you've got a concurrent collector, allocation can often be incredibly computationally intensive. While you're over here allocating memory and page faulting, you're still pausing all the orders queued behind.
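"Never call new on the hot path" usually means preallocating everything at startup. A minimal sketch of that idea, with illustrative types and sizes:

```java
// Sketch of preallocating at startup so the hot path never calls `new`.
// Capacity and fields are illustrative, not Coinbase's actual values.
final class OrderPool {
    static final class OrderState {
        long id, price, qty;
        boolean isBid;
        void clear() { id = 0; price = 0; qty = 0; isBid = false; }
    }

    private final OrderState[] free;
    private int top;

    OrderPool(int capacity) {
        free = new OrderState[capacity];
        for (int i = 0; i < capacity; i++) {
            free[i] = new OrderState();   // all allocation happens up front
        }
        top = capacity;
    }

    // Hot path: hand out a recycled object, zero allocation, zero GC pressure.
    OrderState acquire() {
        if (top == 0) {
            throw new IllegalStateException("pool exhausted: size it for peak load");
        }
        return free[--top];
    }

    // When the order leaves the book, wipe its state and put it back.
    void release(OrderState s) {
        s.clear();
        free[top++] = s;
    }
}
```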
If you're going to run a single thread, excise those extra news. The other thing we want to do is get rid of the operating system. It's like the government: get out of my body. The operating system will randomly decide to switch your CPU core, your compute, to something else that may not be business relevant, at any time, if you let it. Instead, take this nice single-threaded thing, configure your operating system to pin it to a core, and then move all the interrupts off of that core, so that that thread is just a while loop.
If there's new stuff, do the new stuff. If not, keep spinning. Don't give up your compute if you care about responsiveness in a single-threaded system. Pin your thread to a CPU. I've told you to avoid arbitrary-length data; I'm also asking you to avoid arbitrary-length compute. You should be very suspicious of for loops that iterate over an unbounded structure, because that is how you get pathological P99.999s. We can use great patterns from the database world. I don't want to scan through an entire table; I'd much rather have indexed access into my data. You can do the same in simple memory-bound code as on a database. Use HashMaps, don't loop over everything. Use stuff that's simple and can't just randomly explode. If you do have really big operations that you've got to do, break them up. That's pagination, a common database best practice, and it applies to low-latency responsive systems as well. Do a chunk of work, send it back out, handle other requests, and then do the rest. Do continuations instead.
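Putting the last few points together, here is a sketch of the pinned, busy-spinning event loop with bounded work per iteration. The pinning itself happens outside the JVM (for example, `taskset -c` plus isolated cores and IRQ affinity on Linux), and a production system would likely use a single-producer ring buffer such as the LMAX Disruptor rather than this queue.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Sketch of a pinned, busy-spinning event loop. Pinning happens outside the
// JVM (e.g. `taskset -c 3 java ...` plus isolated cores and IRQ affinity);
// inside, the thread spins and never yields its core.
final class EngineLoop implements Runnable {
    private final Queue<Runnable> inbox = new ConcurrentLinkedQueue<>();
    private static final int MAX_BATCH = 256; // bound the work per iteration

    void offer(Runnable request) {
        inbox.offer(request);
    }

    @Override
    public void run() {
        while (true) {                 // never block, never sleep: keep the core
            Runnable r;
            int handled = 0;
            while (handled < MAX_BATCH && (r = inbox.poll()) != null) {
                r.run();               // a big job should do a chunk, then
                handled++;             // re-enqueue a continuation of itself
            }
            // inbox empty or batch limit hit: spin around and poll again
        }
    }
}
```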
Finally, get rid of chatty clients and junk data from your core. I know it's really tempting to just put everything in the core and let it handle messages from everybody. What's going to happen is you're going to have this massive thing, and then you're going to have to decompose it later, and that's going to take years. Please rate limit. Don't incentivize your users to send you economically irrelevant transactions. Don't give users an incentive to spam you, and it'll be good for everybody. The best optimization for a transaction is the one that doesn't need to happen.
What We've Seen in Production
With all of this, what we've seen is that we've been able to run a system in the cloud that spikes up to six figures of transactions per second, no issues, no pages. We're able to keep P99 response times to our customers under a millisecond. If your boxes are really close to each other, the cloud can do things pretty quickly these days. There's nothing special we did; that's the bleeding edge of the cloud catching up to what's on-premise.
Then, we can deploy code, deploy changes to our logic to allow people to trade however we want, with minimal downtime. We can do it much faster than long waterfall release cycles. My favorite is, if something's weird, I can do a full reproduction of functional issues, because everything fits on one thread. That means the entire memory set fits on one machine, which means I can run it on my laptop. If I find something weird in the market, I can go download the request log, replay it, and rerun all of that production logic in a debugger. I will literally see what each register is, all of that. That is absolutely possible. It's invaluable when we need to debug strange behavior and decide: is it probably fine and can wait for the next release, or should we go fix it immediately? It's also really good for identifying situations where you end up with pathological performance issues. Like, why did this buffer get really big? You just replay the stuff, and you can run things through a profiler too.
The other really cool thing, I'm going to belabor this because I really like the log replay stuff, is you can stream your request log to another stack and perturb that thing. You can run experiments on your production data and do pre-production testing of your rolling deployments, pre-production testing of your configuration changes, all because you put all your logic on one nice little request stream. You can take that request stream, republish those as requests to something else, detach them, and run experiments. This gives us the ability to do complicated topology changes in the cloud, which is terrifying to anybody, but we're able to literally run that with live production streaming load. That has been a superpower for a lot of us.
The Engineering Ideal of Simplicity
All of this is basically chasing the engineering ideal of simplicity. Because if you're simple, it's really easy to make that simple thing stable, because you can do a lot of different things with it. You don't have to worry about all these different variables. Just run five of them if you want it to be stable. You get performance: I can control what the operating system is doing, I've already simplified the data going in and the data going out, so it's going to be fast. Also, because everything is fast and everything's easy to test, I can deploy changes super quickly. The takeaway is, if you can remove a lot of the stuff in your architecture diagram, there are probably easy 10X opportunities in your system right now. That's really it. Simplify your systems, everybody.
Questions and Answers
Participant 1: Your input stream is persisted. I see the core logic on the gateway; is that just purely in memory? If so, what bounds do you have? How do you know that it is guaranteed to fit in memory?
Frank Yu: We don't store the output. We just have to store the input, in an append-only manner. The state, that bid and ask, has to fit in memory. If all of your stuff is fixed length, if an active order's state is about 250 bytes, you can fit a lot of active orders in a process. RAM is way cheaper than it was back in the day. The state of the system is not persisted; it's just in memory. What is persisted is the input. The input is available to the state machine as soon as it gets in. As long as you're happy that three out of your five machines have received that message, that's your SLA. Then your replication is writing the stuff to EBS, and eventually to S3, and eventually to Kafka, for the super-durable on-disk stuff. That's not in the hot path, basically.
Participant 1: It's in memory and then persisted to different data tiers.
Frank Yu: Correct. My point is that the input log goes into something that is flushing to disk just in an append-only way. You can do creative fsync orderings to make that happen.
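A minimal sketch of what that append-only input log could look like in Java. The "creative fsync orderings" are out of scope here; the point is just that the log is append-only and the flush happens off the hot path.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Minimal sketch of an append-only input log. Replication provides the
// durability SLA; flushing to disk is a background tier, not the hot path.
final class InputLog implements AutoCloseable {
    private final FileChannel channel;

    InputLog(Path file) throws IOException {
        channel = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.APPEND); // append-only: never seek, never rewrite
    }

    void append(ByteBuffer request) throws IOException {
        channel.write(request);
    }

    // Called periodically by the background tier, not per-order.
    void flush() throws IOException {
        channel.force(false); // fsync data; skip metadata flush for speed
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}
```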
Participant 2: Still on the persistence thing. You persist the state of the system later on, to a database or whatever. When you want to replay the input, you need to restore the system to the state right before that input came in. How do you identify, this is how the state was when this input came, so that we can replay it?
Frank Yu: In some sense, snapshots are an optimization. You don't actually need them if you just had everything from zero, but that's not realistic. This snapshot right here, you can take whenever you want. You can take it every hour, every 10 minutes, depending on how much you're replaying and how often you're replaying. Because you're deterministic, this snapshot, which scans your entire memory space, does not need to happen on your write leader. It can happen on a follower, out-of-band, and not impact your throughput or jitter. Periodically, take that green state, write it to a file, write some binary blob, put it in S3. You're also going to say that state represents T5. I didn't put T5 on the state, but that state is what it was at T5. You can load that state up and then start replaying from T6 on.
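In code, snapshot-plus-replay might look something like this sketch. The Engine and Request types are hypothetical, and determinism is what guarantees the replayed state matches what the leader had.

```java
import java.util.List;

// Sketch of snapshot + deterministic replay. A follower periodically writes
// (state, lastAppliedSeq) out-of-band; recovery loads the snapshot and
// replays the input log from the next sequence number. Types are hypothetical.
final class Recovery {
    interface Request { long seq(); }
    interface Engine {
        void restore(byte[] snapshotBytes); // load state as of snapshotSeq (e.g. T5)
        void apply(Request r);              // same deterministic logic as production
    }

    static void recover(Engine engine,
                        byte[] snapshotBytes, long snapshotSeq,
                        List<Request> inputLog) {
        engine.restore(snapshotBytes);
        for (Request r : inputLog) {
            if (r.seq() > snapshotSeq) {    // replay from T6 onward
                engine.apply(r);            // determinism => identical end state
            }
        }
    }
}
```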
Participant 3: So far, we talked about the state and inputs and outputs. I'm going to focus on the core logic. There will be changes to core logic as we go along. It could be bugs that we fix, or it could be changes to the core logic. How do you coordinate those rolling deployments? Because you don't want to get into a state where something is out of sync with something that you've not yet deployed.
Frank Yu: You should decouple code deploy from behavior change. Deploy your code; your new code has got to do exactly the same thing as the old code, to stay deterministic. Then what you do is send a request into the input log that enables the new behavior.
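A hedged sketch of what that decoupling could look like: the new build carries both code paths, dormant, and the enable command flows through the replicated input log, so every replica flips at the same log position once the rollout is complete. The behavior name and pricing rules here are invented for illustration.

```java
// Sketch of decoupling code deploy from behavior change. The new build
// carries both code paths, dormant; once every replica runs the new build,
// an enable command is sent through the replicated input log, so all
// replicas flip at the same log position and determinism is preserved.
// The behavior name and pricing rules are invented for illustration.
final class MatchingBehavior {
    private boolean newPricingEnabled = false; // dormant until commanded

    // The enable command is just another entry in the input log.
    void onEnableBehaviorCommand(String name) {
        if ("new-pricing".equals(name)) {
            newPricingEnabled = true;
        }
    }

    long tradePrice(long restingPrice, long midpoint) {
        return newPricingEnabled ? midpoint : restingPrice;
    }
}
```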
Participant 3: My understanding is scheduling is evil, like network hops are evil. A lot of stuff is evil. Do you vertically scale your core logic boxes? Do you just go big on number of CPUs and memory? Do you go beefy on the machines to reduce the number of machines you need to run?
Frank Yu: Yes.
Participant 3: Which is a bit of an antipattern from how Google did things. Also, CPUs are super cheap now, and so is memory.
Frank Yu: You care about clock speed, but you don't care about a lot of cores. Because you care about clock speed, you can just have a small number of really hot machines, more or less. You get the nice Moore's Law thing: you get improvements basically for free as you upgrade your CPUs. You don't need boxes with hundreds of cores. Maybe you just want 16 or 32, so that your operating system has extra cores to do other things. Really, you just have one hot CPU doing the hot work. If somehow you have a highly centralized transaction processing system that needs to go at more than millions per second, then it is possible to shard. But you can do all of this and get pretty high up in the cloud with no sharding. Then you just have a super-simple thing that can fit in the palm of your hand.
Participant 3: You said you don't like scheduling, so just spin on the CPU, like busy spins.
Frank Yu: Literally busy spin, yes.
Participant 3: Then your CPUs are pegged all the time?
Frank Yu: The one that's hot is pegged, yes.
Participant 3: When you say hot, what do you mean by that?
Frank Yu: The only thing that needs to spin is the core logic that is handling orders as fast as possible. Everything else is just doing what normal CPUs do, which is sit idle.
Participant 3: In that world, and I've seen this in things like Redpanda, how do I know if there's actually available bandwidth on that CPU? Because the CPU is spinning. How do you know when you're actually pegged? Are you looking at IOPS or something else?
Frank Yu: How do you know that you're at compute capacity? What's really cool is that you get a constant load test when you're replaying. If your replay is, say, 10 minutes behind, that thing is going as fast as it possibly can, so you always know where your red line is. I can confidently tell you, 300k, that's my red line; do some code optimizations to get higher. It's replaying as fast as it can, no network, nothing like that.