
Trust Deterministic Execution to Scale and Simplify Your Systems


Summary

Frank Yu discusses how to make mission critical business logic deterministic and fast, providing both intuitive and not-so-obvious architecture choices.

Bio

Frank Yu is an engineering leader at Coinbase, focusing on distributed low latency trading platforms. Prior to Coinbase, he served as Principal Engineer and later Director of Software Engineering at FairX, leading the design and build of what would become the Coinbase Derivatives Exchange post acquisition.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Yu: My name is Frank. I lead the engineering team that built and runs what is now the Coinbase Derivatives Exchange. We'll be talking about leveraging determinism, or trusting deterministic execution, to stabilize, scale, and simplify systems. There are two things I'd like you to come away with from this talk. First, if you've got some mission critical logic, make it deterministic; I'm going to plead that you all do. Second, if you've got deterministic logic, don't be afraid to actually replay it anywhere on your system, for efficiency and profit. We'll largely be going over principles and practices drawn directly from our experience running financial exchanges. We don't rigorously prove any of the concepts we're presenting here; any references to the canon or to prior art are there as inspiration. This is all a case study: it's n equals 1, and it's all anecdotal.

About Us and Our Problems

Let's talk about us and our problems. How many of you are familiar with what an exchange does? To go over some of the nuances: again, we built and run what is now the Coinbase Derivatives Exchange. Formerly we were a startup called FairX. We built a trading exchange licensed by the CFTC, a government agency, for folks to trade financial contracts with each other. You can submit us an order to buy or sell something for some price, and we'll let you and the market know asynchronously when you successfully trade and what price you get. Think of exchanges as financial infrastructure. Everyone who participates in finance relies on exchanges to provide real prices for the things trading on our platform, and relies on us to fairly assign trades and values for folks who want to exchange stuff. The first thing to think about for an exchange is mission critical code. A bug on our system could result in catastrophic problems, not only for ourselves, but generally for customers. The clients on our platform deposit millions of dollars of funds, millions in notional, because they trust the exchange to be correct. The potential losses we risk are multiple orders of magnitude more than any revenue we might make from a transaction. In a previous life running a Forex exchange, our clients would trust us with orders for millions of dollars, and we'd maybe be happy to make 0.00001% on it, give or take a zero. Our business logic has to be rock solid. Exchanges absolutely have to be correct.

The second thing about running a financial exchange is that many of our partners expect us to operate as close to instantaneously as possible, and as consistently as possible. If exchanges are the ground that folks stand on for financial transactions, we've got to be predictable and fast. In general, you want p99 latencies to stay under a millisecond: that's in to us, handled, and back to customers in under a millisecond, even in the worst cases. Finally, you've got to keep records for regulators, clients, and debuggers like me to be able to find out what happened on the system, going back maybe even seven years, so that folks can figure out what went wrong and why they might have lost some money.

How Can the System Evolve Safely and Efficiently While Performing?

You might ask, if risks are so high, correctness bugs are catastrophic, and performance bugs get instantly noticed, how do we actually add features to the system in a reasonable manner? We've got to make sure our core logic stays simple. In principle, you want to minimize the number of arrows and boxes in your architecture diagram as much as possible. If you can remove something, remove it; if you can avoid a dependency, avoid it. What we don't want is a bunch of services that can fail independently and partially in the middle of a financial transaction. What we want is a very hardened kernel, something that is quite well tested. In general, we want to make sure that our mission critical code lives in a monolith. The other thing we want to avoid is concurrency. We want to avoid concurrency at all costs. In the last 10 to 20 years, multi-processor systems have offered programs an interesting deal with the devil: run stuff in parallel and you get some more throughput and performance, you just have to make sure that you don't have any race conditions, and that your logic is correct for every possible interleaving of every line of code in your concurrent system. Run that code over again and you can't count on the same things happening, because you've left things up to the OS scheduler. I claim that simplicity through deterministic execution can often get you even more performance and throughput than a concurrent architecture.

Deterministic Execution

Let's describe deterministic execution. Think of a program as a state machine. You have a program counter pointing to the line of code it's executing, and each line of code modifies the state of the system. The next slide is from a talk Martin Thompson gave about event log architectures. New inputs come into a system that has some existing state and create a new state. Inputs might also come into your system and produce some output. In practice, the terminology we like to use is: inputs we think of as requests or API calls, and when they impact state, they change the state of the system; outputs we rename to events. You'll hear me say inputs and requests interchangeably, and outputs and events interchangeably. Another common pairing is commands for inputs and events for outputs. Here's what you can do. Assume you've got ordered inputs, meaning buy this, buy that, sell those, sell that: if you have a sequence of lots of inputs and you apply deterministic execution, the idea is that at the end of your ordered set of actions, you should get the same state and outputs every time. Again, in practice: if I sequence my requests and apply determinism (to save a few syllables), I'm going to get replicated state and events.
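
As a rough illustration of that idea, a minimal Java sketch of a deterministic state machine might look like the following. All names here (OrderEngine, NewOrder, OrderAccepted) are hypothetical, not the exchange's actual interfaces: the point is simply that the engine holds all of its state, takes sequenced inputs, and returns events, with no clocks, I/O, or hidden randomness inside.

```java
import java.util.List;

// Inputs are commands/requests; outputs are events.
interface Event {}

record NewOrder(long clientId, String symbol, long price, long qty) {}

final class OrderEngine {
    // All state lives inside the engine; nothing non-deterministic is read in here.
    private long nextOrderId = 1;

    // Same sequence of inputs in, same state and same events out -- every time.
    List<Event> apply(NewOrder input) {
        long orderId = nextOrderId++;
        // ... validate, match against the book, update positions ...
        return List.of(new OrderAccepted(orderId, input.clientId()));
    }

    record OrderAccepted(long orderId, long clientId) implements Event {}
}
```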

Benefits of Determinism

We'll talk a little bit about the benefits of Raft. High availability is the main reason why you might want to make use of consensus. Let's talk about what it looks like when you don't have consensus. If I have some service, then through maybe no fault of my own the machine crashes or something like that. The service is down, I'm getting paged. Not particularly happy. With Raft consensus, I have a service, I handle some messages, and I have to replicate them to a follower, to make sure that the data is on multiple nodes of my system before handling it. The benefit is that when a node eventually crashes, I can elect a new leader to handle subsequent requests, carry on, deal with the hardware failure, and try to start up a new node. There's a huge benefit, if your system basically needs to be up all the time, to running in some redundant manner. When you have a deterministic system, you can replicate your state machine, run multiple copies of it, and be resistant to failure. You might ask, if exchanges have to be really fast, what is the overhead of this message replication? It just so happens that there is some open source technology out there that replicates a Raft log through a very simple binary encoding, which gives virtually zero overhead and allows us to respond to customer requests, after reaching consensus, in less than a millisecond easily. On the cloud, I think we're in the mid-hundreds of microseconds or even lower. This is pretty great, because traditionally exchanges have needed to make sure that data is on at least two machines, because when a hard drive crashes or a system fails, you can't have a complete outage. The London Stock Exchange, the New York Stock Exchange, they cannot go down during business hours. What they've traditionally done is some sort of primary-secondary model, where I receive a message, I put it through a sequencer, and I don't do anything with that message until I've guaranteed it's replicated to a secondary and the secondary has done the same thing. That was basically the standard, the state of the art, in exchanges for quite some time, but it's not super rigorous. For something that is the bedrock of these systems, we lucked out in that a new open source technology came up for us to use.
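
The "replicate before you act" rule can be sketched like this, assuming some Raft-style replicated log such as the one Aeron Cluster provides. The ReplicatedLog and DeterministicEngine interfaces here are hypothetical stand-ins; the key point is that the business logic never sees an input until a majority of the cluster has it.

```java
import java.util.concurrent.CompletableFuture;

interface ReplicatedLog {
    // Completes once a majority of the cluster has the entry durably stored.
    CompletableFuture<Long> append(byte[] input);
}

interface DeterministicEngine {
    void applyCommitted(long sequence, byte[] input);
}

final class Leader {
    private final ReplicatedLog log;
    private final DeterministicEngine engine;

    Leader(ReplicatedLog log, DeterministicEngine engine) {
        this.log = log;
        this.engine = engine;
    }

    void onClientRequest(byte[] input) {
        // Never let the business logic see an input before consensus: if this node
        // dies, a new leader replays the same committed entries and, because the
        // engine is deterministic, lands in exactly the same state.
        log.append(input).thenAccept(seq -> engine.applyCommitted(seq, input));
    }
}
```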

What do we get with determinism? We get fast consensus. How do we get fast consensus? We get to make use of single threaded performance. When you handle a bunch of messages and you're able to pin that single thread onto a core, you'll get significantly better latency and throughput than if you're passing data between threads in a concurrent system. In the past, Java and a lot of web servers built systems where you have a pool of threads that you pass work to, and it seems super scalable. Realistically, the cost of all that context switching and data copying dominates the benefits you get from the parallelism. Run your stuff single threaded; it's actually quite fast. The other thing that's nice about having a single threaded system that's deterministic is that testing is quite straightforward. When you have a concurrent system, you have to build test cases for all interleavings of all combinations of all the code running on all your parallel threads. That's a cross product of test cases to get right and is basically intractable. If your system is literally single threaded, your test cases read very much like stories: you can do behavior driven development, TDD, and stuff like that, and it just reads straight through and it's quite nice.
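
To make the testing point concrete, a test for a single threaded deterministic engine reduces to "replay the same ordered inputs and expect the same events". This sketch reuses the hypothetical OrderEngine, NewOrder, and Event types from the earlier sketch; the contract symbol and quantities are made up.

```java
import java.util.List;

final class DeterminismTest {
    public static void main(String[] args) {
        var inputs = List.of(
                new NewOrder(1, "NANO-BTC", 20_000, 2),    // Alice buys two
                new NewOrder(2, "NANO-BTC", 20_000, -1),   // Bob sells one
                new NewOrder(3, "NANO-BTC", 20_000, -1));  // Charlie sells one

        // Same ordered inputs, same events -- there are no interleavings to reason about.
        if (!replay(inputs).equals(replay(inputs))) {
            throw new AssertionError("engine is not deterministic");
        }
    }

    static List<Event> replay(List<NewOrder> inputs) {
        var engine = new OrderEngine();
        return inputs.stream().flatMap(in -> engine.apply(in).stream()).toList();
    }
}
```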

The other great thing about single threaded determinism is you have great tools for troubleshooting. Let's say I've got a service running in production, and some weird behavior starts happening on the system. It hasn't crashed yet, but it's behaving weirdly and possibly buggy. I'm getting paged or something, I see the monitoring, I'm not very happy. Maybe it's gotten into a pathological situation. What do I do? If I'm running a deterministic system backed by a sequenced input request log, then when the weird stuff starts happening on my service, I'm going to copy that entire request log to my laptop.

I'm going to rerun that service on my laptop, and the weird behavior is guaranteed to show up, because the system is deterministic. As long as you make sure your business logic is deterministic, those pathological business logic cases will show up. To figure out what went wrong, I run it in a debugger, and being able to view all the variables and everything will hopefully help me answer the question. I'm happy. I still have this weird stuff in production that I've got to fix, but at least I've figured out the root cause and I'm well set up for any RCAs or anything like that. So, deterministic execution: operational benefits, performance, ease of testing. Sign us up.
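
A replay harness for that workflow can be very small. This sketch assumes a hypothetical length-prefixed request log file copied down from production and the DeterministicEngine interface from the earlier sketch; run it under a debugger with a breakpoint near the suspect sequence number and the pathological state reappears.

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

final class LocalReplay {
    public static void main(String[] args) throws IOException {
        DeterministicEngine engine = loadProductionEngine();
        long sequence = 0;
        // Assume a length-prefixed request log copied down from production.
        try (var in = new DataInputStream(new FileInputStream(args[0]))) {
            while (in.available() > 0) {
                byte[] request = new byte[in.readInt()];
                in.readFully(request);
                engine.applyCommitted(++sequence, request); // same inputs, same bug
            }
        }
    }

    // Stand-in for however the real engine gets constructed.
    static DeterministicEngine loadProductionEngine() {
        return (sequence, request) -> { /* production business logic runs here */ };
    }
}
```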

Monolithic Architecture

When we built an exchange from scratch about three years back, yes, we decided to jump on this low latency, open source, fast Raft cluster. We built something like this. There are three replicas of a monolith, which sounds like an oxymoron, but all of our core logic and the stuff that matters lives in it. What you're seeing on the screen is our two environments. We run the really latency sensitive stuff on bare metal, with some API gateways, or you can think of them as services that demultiplex stuff from clients and send it into our matching engine and our logic. We also send a bunch of events to the cloud for storage, analysis, and all the business intelligence stuff. This is your pretty standard hybrid cloud setup. Let's trace a message through our system. Let's say Alice wants to buy two Nano Bitcoin contracts at 20,000. Alice sends a request to our gateway, that request comes into our Raft cluster, which has all of our logic, and we go ahead and acknowledge that request and send it out to the public and to Alice. This is done inside the bare metal data center over multicast UDP, so it's pretty fast. Then, finally, we also go ahead and send out the event; folks who know event sourcing and event streaming will recognize this pattern. We send an event to a replicator, and this replicator could be something like Kafka; we use Aeron for our communication. Then the event that Alice submitted an order gets written to the database, the order gets written to the database, and non-latency sensitive or high throughput consumers will receive it later. Sounds great. Blue arrows on the screen here are requests that come in, and red arrows are events that come out. The relative size of these arrows is important.

Let's talk about what's actually in those requests. Alice sends us a request to buy two Nano Bitcoin contracts at 20,000: it's like 5 pieces of data, about 100 bytes maybe. Then, let's say Alice successfully buys those two contracts by trading with Bob and Charlie. She buys one contract from Bob and one contract from Charlie. What are the events that actually get replicated over? When Alice submits an order, we tell Alice that her order was submitted. When Alice trades with Bob, we tell Alice that she bought one Bitcoin contract, and we tell Bob he sold one Bitcoin contract. We don't tell Alice she bought one from Charlie specifically, just that she bought another one, and we tell Charlie he sold another one. One request, five pieces of data, Alice buying something, resulted in a mouthful of events coming out. We were a startup at this point, running a fairly standard setup, and we wanted to keep our burn rate low. What we were finding is that this was quite expensive. The compute in AWS was fairly reasonably priced, but the network egress and ingress was quite expensive.

What do we do? Can we optimize? What was the crux of the problem? Business logic is pretty fast, but writing and reading data is slow and expensive. In general, I think most folks' business logic is reasonably fast. The code you're writing, regardless of what it is, is fairly quick these days, but storage is a pain, and lots of roundtrips are a pain. How can the fact that our system is deterministic help here? Here's something you can try. Right now, when Alice submits a request, a mouthful of events comes out, we burn a bunch of money on the network, and we load up the network and everything. What if, instead of sending those events, I just keep replicating the request, those 5 pieces of data, about 100 bytes, all the way to my downstream services? By itself that doesn't work: you need the service to turn the request into its effects. The database writer is not going to write the raw request, because it doesn't include the context of whether it's been validated or what actually happened. So here's the bit of gymnastics we do for our system: why don't we run another instance of our monolith on our downstream systems, and then send the really fat events over IPC? This avoids all that network cost, and now I'm only replicating requests. We go from the picture where red is the big blocks of data and blue is the little requests, with lots of data on the wire, to this much nicer picture where we're just replicating the input requests. That's cool, let's try it. We hadn't heard of anyone really doing this, other than LMAX, who I think did it a while ago. We were already in production at this time, so making this topology change was a little scary conceptually, but we were reasonably confident because of our good testing infrastructure. We did end up deploying this change. For whatever it's worth, we saved something like 5x in burst network cost, and more like 3x overall. In general, why does this work? You could theoretically do this on the other side, too. Instead of running in a bare metal data center, you can totally do this in the cloud. If your main writer logic is sitting in one region and you have a lot of readers in other regions, you can replicate the Raft log, and then run little versions of your code to convert the inputs to outputs.
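
A sketch of that topology change, using the same hypothetical types as the earlier sketches: instead of shipping the fat event stream across the network, the database writer subscribes to the much smaller request log and runs its own copy of the deterministic engine to regenerate the events locally before persisting them. RequestLogSubscription and the codec are assumptions for illustration.

```java
import java.util.List;
import java.util.function.Consumer;

// Hypothetical subscription to the replicated (~100-byte) request log.
interface RequestLogSubscription {
    void poll(Consumer<byte[]> onRequest);
}

final class DatabaseWriter {
    private final OrderEngine localReplica = new OrderEngine(); // same code as the matching engine
    private final RequestLogSubscription requests;

    DatabaseWriter(RequestLogSubscription requests) {
        this.requests = requests;
    }

    void run() {
        while (true) {
            requests.poll(request -> {
                // Regenerate the fat events from the thin request locally; determinism
                // guarantees they match what the matching engine emitted upstream.
                List<Event> events = localReplica.apply(decode(request));
                events.forEach(DatabaseWriter::persist);
            });
        }
    }

    static NewOrder decode(byte[] request) {
        throw new UnsupportedOperationException("codec omitted from this sketch");
    }

    static void persist(Event event) {
        // INSERT into the orders/trades tables, publish to BI consumers, etc.
    }
}
```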

Replay Logic to Scale and Stabilize

Why does this actually work? Why is it cheaper? Why did it save us some money? One thing we realized is that inputs are often smaller than outputs. This isn't always the case, but in general your input gets contextualized by your existing business context, and then events come out; that's what handling a request means. We have the business context, so when you give us a request, we know what to do with it and we add some stuff onto it. The data that comes out is a union of our context plus your intention. That's why inputs are generally smaller. Here's the other thing: inputs can be more consistent than outputs. The size of inputs can be more consistent than the size of outputs. We always talk about 99th percentile latency and 99th percentile performance. What if we all considered our 99th percentile network load? Imagine that Alice submits us a single request to sell 1000 Bitcoin contracts at any price, whatever price, and let's say we've got a deep market where there are like 1000 people willing to buy Bitcoin. What's going to happen is a ton of outgoing events. That small request comes in, and it results in a ton of events. If we want to talk about this in database terms, imagine I go onto your Postgres database and say, UPDATE big_fat_table SET column_2 = 3. You're just going to get a lot of WAL entries. The input is actually quite small, like 5 pieces of data, about 100 bytes. Even the SQL request itself, maybe it's not that many bytes if you parse it down to tokens or whatever; I don't actually know how it's stored internally, but it's probably not that much. All the WAL, all the trade events that come out, is potentially unbounded; we have no idea what the real behavior of the system will be, depending on the state and the features we've implemented. Input sizes and rates can be validated and rejected. If you submit me too much stuff, I'm going to throttle you, and I'm not going to let you fill up my input log. Events are hard to validate and reject. You don't reject that something happened; I think that's a big thing in the whole event streaming world. How do you reject an event just because you're not happy it happened? Here's the other thing: let's say you have systems that read your events, then do more things in response to those events, and put more events into your big, gigantic enterprise message bus/Kafka system or whatever. If you have any runaway positive feedback loop, that's how you get a thundering herd. What we can do here is protect ourselves a little bit from thundering herds: if you put all your key business logic in one place, and you're only reading from the inputs, then you avoid crazy positive feedback loops and you're protected from thundering herds a little bit. Thundering herds are how you take down these gigantic autoscale, cloud scale, web scale systems.
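
The input/output asymmetry can be made concrete with a small sketch of gateway-side validation: inputs can be checked and bounced before they ever reach the input log, while outputs cannot be "un-happened". The limits below are made-up numbers, purely for illustration.

```java
// Made-up limits, purely for illustration.
final class IngressValidator {
    static final int MAX_REQUEST_BYTES = 256;
    static final int MAX_REQUESTS_PER_SECOND = 1_000;

    // Called on the gateway before a request is ever appended to the input log.
    static boolean accept(byte[] request, int clientRequestsThisSecond) {
        if (request.length > MAX_REQUEST_BYTES) return false;               // bound input size
        if (clientRequestsThisSecond > MAX_REQUESTS_PER_SECOND) return false; // bound input rate
        return true;
        // There is no equivalent lever on the output side: one accepted ~100-byte
        // order can legitimately fan out into thousands of trade events.
    }
}
```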

Summary

If you replay your logs in production by literally running more instances of your services, and just replaying the ingress log in production to do things, you scale your system by decreasing total utilization, probably by 2x or 3x, it's not that much. The really nice thing is you stabilize it, because you can now say: I have a cap on how much input data I'm sending into the system. It's basically the sum of all of the rate limits on all of my gateways, all my API servers, and all that. You have a model, whereas, generally, nobody has any idea how many Kafka events are going to be streamed depending on some volatile situation. In markets, things get crazy real quick. That's one reason to do this. Another reason is to replicate compute, because that's what this is: we replicate the ingress log, which is data, and we're also basically replicating the compute using deterministic execution. It's a spectrum, and there are tradeoffs on either end. Replicating compute can also simplify some downstream code.

Deduplication is just an optimization, it's not architecture. I think we're past DRY. That's not our architecture, but sometimes you want to deduplicate. It helps, and it's still fun. All of this is mostly optimization, so, tooling. Let's take a look at what that looks like. We've got our system at this point, and I'm only replicating my input log everywhere. It's extreme, just to show the point; there are lots of ways to move along the spectrum. Let's handle GET requests, the purple arrows. We get a request and we have to send something back immediately. The whole point of a GET is the output, basically, but it's also an input. A GET request goes into the Raft monolith: Alice wants to know what her active orders are. We handle the request, we send it to our business logic, and our business logic, which owns the state, sends the answer back. We can also serve requests from other regions, maybe through a private link, a cross connect, a VPC connector, whatever: we can handle that on our monolith and serve the GET request, the query basically, over there too. That's annoying. I don't really want to be sending too much stuff over the Great Plains, or it could be an ocean, between AWS regions. Maybe I cache. I have some sort of cache that keeps query data, and I populate that cache by sending it output events, because the cache needs to know the state of the table, not the state of the request. I have to send read events to a cache. Effectively what you're doing is materializing the business context into your cache, into your Redis, for example. Maybe you could make it a read-through cache the first time and serve from the cache later, stuff like that. What if we didn't want to do that? I literally already have more blue things attached to these gateways. The obvious thing to do, if I already have a completely deterministic state machine down there, is just to request the data from the instance of my state machine that's running locally, so I can get all the state over IPC. Or even in-process, if you want to inject an instance of your main service's library. This is better: I'm no longer impacting the writer with my reads. In general, you want to separate reads from writes in your system; that's important for scaling. Now I can get rid of my cache, and I've got this great system.

What if, at this point, I'm adding a lot more services and a lot more features, and they all have a bunch of different requests? I can get active orders, I also want to get my balances, or whatever. There are all these requests coming in. Do I want to run a little monolith on each of them? I don't have to if I put the cache back, but we already said we don't really want to send events into a cache. Instead, what do we do? Why don't we run something that actually serves the data inside the VPC, because network traffic inside the VPC is actually quite cheap. It's not bad at all. That's literally what microservices do: they send messages to each other inside the VPC. We can do that. We can replicate the request log from the main writer, the main Raft monolith on the left, and send it to our read monolith. Then all of our gateways can go and fetch data from there. This largely illustrates the spectrum between replicating data and replicating compute: you can adjust the slider to whatever is necessary. Another benefit of deduplicating code here is that normally, if you're writing your data to Redis, or maybe watching your database tables with something like change data capture, you're ultimately reimplementing your business logic again for reads, and often it's different teams reimplementing the core team's business logic. What's nice about this approach is that it's the exact same code. That same code is used up and down the stack, and it's one implementation of the logic that can be well tested and move in lockstep. That's how you get deduplication.
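
A sketch of what serving reads from a replayed replica could look like, with hypothetical types: the gateway (or a shared read monolith inside the VPC) keeps its own copy of the state, fed by the same request log, and answers queries from it directly.

```java
import java.util.List;

// Hypothetical read-only view over the locally replayed engine state.
interface OrderQueries {
    List<Long> activeOrderIds(long clientId);
}

final class GetActiveOrdersHandler {
    private final OrderQueries localReplica; // fed by replaying the same request log

    GetActiveOrdersHandler(OrderQueries localReplica) {
        this.localReplica = localReplica;
    }

    List<Long> handle(long clientId) {
        // No cross-region round trip, no Redis to keep in sync, and no read load
        // on the Raft leader: just a lookup against locally materialized state.
        return localReplica.activeOrderIds(clientId);
    }
}
```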

Challenges and Considerations

This is a weird way to run things, so there are certainly a lot of considerations you have to apply here. Here's a listicle. You have to make sure that this core code, your core logic, is very well tested, because otherwise every bug in that code replicates everywhere too. Again, this applies to financial stuff, because it really matters; the harm we might create is possibly greater than the value we're providing to our employer. If you have a bit of code that is incredibly well tested, this is a good strategy. Here's the other crazy thing: it has to stay deterministic. What does that mean in the face of software changes, like adding new features, changing behavior, or fixing a bug? It means that when there is a bug, your bug fix needs to be gated on a feature flag. That's because all of your downstream systems are eventually consistent: they're replicating that request log, but at their own speed. This is unlikely to happen, but you could have a bug on Thursday, on the left side, the main monolith. You fix it on Friday, but it's possible that some of your downstream systems that have upgraded to the new code are still replaying logs from Thursday. You can't have the fix change that behavior. You can't rewrite history. Don't do that. That's why replicating data is often a lot easier: the data won't just change its behavior on you. Old behavior has to be respected when replaying inputs, and you've got to enable new behavior with a proper request, a state change request into your writer. What this actually does is decouple the changing of behavior from deploys. A deploy should not change your behavior at all. You can actually run continuous checksums on the behavior of your system: changing the code should not change that checksum until you submit a request to change it. Then, once that request gets replicated all the way down, all subsequent requests will have the new behavior. It effectively decouples behavior change from code deploy.
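
A sketch of what gating a bug fix on a flag flipped by a sequenced request could look like. All names are hypothetical; the point is that behavior changes at a specific log position, on every replica in the same order, rather than whenever a new binary happens to be deployed.

```java
interface Command {}

record EnableFix(String fixName) implements Command {}

record PlaceOrder(long clientId, long price, long qty) implements Command {}

final class MatchingLogic {
    private boolean roundingFixEnabled = false; // old behavior by default

    void apply(Command command) {
        if (command instanceof EnableFix fix && fix.fixName().equals("rounding-fix")) {
            // Behavior changes at this log position, consistently on every replica.
            roundingFixEnabled = true;
        } else if (command instanceof PlaceOrder order) {
            if (roundingFixEnabled) {
                // corrected rounding path
            } else {
                // buggy-but-historical path, kept so Thursday's log replays identically
            }
        }
    }
}
```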

Randomness: we all need random stuff, and randomness is super useful in systems. Make sure you use a seed, so that you get deterministic pseudorandom outputs. Another issue is that when you're single threaded, handling one message blocks all the requests behind it. A tactic you can apply is to divide large chunks of work into stages, and pass each stage back into your request log to do the next thing. Some of you will realize that this is basically adding a little bit of concurrency back into the system. Your code is generally pretty fast unless you're iterating through a bunch of stuff, so you probably don't need to do this, but if you have something that would make your system unavailable because it's too busy handling one big request, consider it. For our Raft cluster, everything should fit in memory. This sounds scary: how is all of the data that serves my customers going to fit in memory? How is all the mission critical stuff going to fit in memory? In general, we use a lot of fluffy stuff: a lot of variable length fields, stuff where we don't know what the size will be. The only limit on the size of a string is the point at which it crashes your program; there are implicit limits on everything. The alternative is to use fixed length data for your mission critical code, which generally deals with numbers; all the static create, read, update, delete stuff can go live in a database somewhere. If the stuff that matters is generally numeric, or mostly numeric with maybe some UUIDs in there somewhere, it all fits. You can fit a lot of customers' data in memory if you eschew variable length stuff and eschew arrays in requests. It also simplifies your programming model a lot.
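
For the randomness point, a minimal sketch of seeding from data that is itself in the sequenced log (here, hypothetically, the input's sequence number), so every replica and every replay draws the same numbers:

```java
import java.util.SplittableRandom;

final class DeterministicRandomness {
    // The seed comes from the sequenced input, never from wall-clock or thread state.
    static long randomizedPriority(long inputSequence) {
        return new SplittableRandom(inputSequence).nextLong();
        // Avoid unseeded new Random(), System.nanoTime(), or
        // ThreadLocalRandom.current() inside the engine -- those break replay.
    }
}
```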

The corollary is that, generally, these systems run really fast. A lot of the most important financial systems in the world basically have their critical path running on one box, one thread. You can go up to millions of messages a second, easy. Of course, you've got to keep your 99s and 99.9s down, because those cases block everything else, so the harm they do is increased. You want to apply some craft when building out your mission critical system. Finally, you've got to protect that mission critical code from chatty clients. You have to rate limit on the outside. You don't even need a DDoS; one chatty client will deny service to your system. Have sensible rate limits and circuit breakers in place. Make sure you don't end up in weird feedback loops.
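
One simple way to do that rate limiting at the edge is a per-client token bucket on the gateway, sketched below with placeholder numbers; it is an illustration of the idea, not the exchange's actual limiter.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class ClientRateLimiter {
    private static final double TOKENS_PER_SECOND = 100.0; // placeholder limits
    private static final double BURST = 200.0;

    private static final class Bucket {
        double tokens = BURST;
        long lastNanos = System.nanoTime();
    }

    private final Map<Long, Bucket> buckets = new ConcurrentHashMap<>();

    // Called on the gateway before a request is forwarded to the cluster.
    boolean tryAcquire(long clientId) {
        Bucket b = buckets.computeIfAbsent(clientId, id -> new Bucket());
        synchronized (b) {
            long now = System.nanoTime();
            b.tokens = Math.min(BURST, b.tokens + (now - b.lastNanos) / 1e9 * TOKENS_PER_SECOND);
            b.lastNanos = now;
            if (b.tokens < 1.0) return false; // shed the chatty client here, not in the core
            b.tokens -= 1.0;
            return true;
        }
    }
}
```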

Conclusion

What we like to say is that simplicity gets you stability, performance, and development speed. I'm sure you've all heard about the benefits of simplicity; mentors and everybody have said, keep it simple. I'm not trying to sell simplicity. Hopefully I've made the case here that deterministic execution gives you simplicity, and gives you a bunch of these other things. Again, if you've got some important logic, make it deterministic. And if you have deterministic logic, don't be afraid to trust it and use it everywhere. It's pretty awesome.

 


 

Recorded at:

Mar 08, 2024
