InfoQ Homepage Presentations Evolution of Financial Exchange Architectures

Architecture & Design

Evolution of Financial Exchange Architectures

Bookmarks

View Presentation

Speed:

Download

53:18

Summary

Martin Thompson looks at the evolution of financial exchanges and explores what is considered state of the art today.

Bio

Martin Thompson is a Java Champion with over 2 decades of experience building complex and high-performance computing systems. He is most recently known for his work on Aeron and SBE. Previously at LMAX he was a co-founder and CTO when he created the Disruptor. He blogs at mechanical-sympathy.blogspot.com, and can be found giving training courses on performance and concurrency.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Thompson: Ten years ago, actually, at this QCon was the first time I talked publicly about LMAX and the Disruptor. Dave Farley and I gave our first talk. We had this attention-grabbing headline at that time where we said, we could do 100,000 transactions a second at under 1 millisecond of latency. That was pretty cool at that time. Most systems didn't do that, if you've seen that. Some financial exchanges were in that space. There were very few applications running at that speed. We're not just talking about the core of the system. We're talking about external requests coming in through gateways, processing it, making this thing reliable and responding back to customers in under a millisecond. We would just run about that, sometimes we were a bit worse. Sometimes we're a bit better. On average, we were around there at that time. Ten years later, is it 10 times better? Interesting. Are we now running at a million transactions per second in the financial exchange space? I'm not at LMAX anymore, so I'm not talking about that.

Generally, in the exchange space. Some of the better ones are at that level of performance. We have gone up in order of magnitude and throughput. What about latency? Have we come down from 1 millisecond? Yes, actually a lot. Are we down below 100 microseconds? The best exchanges around are well below 100 microseconds, down to a few tens of microseconds typically. I can't say exactly what speeds some have, but they're very low. There's been a lot of really interesting changes over this time. That's what I want to talk about and go through that.

Four main topics. Looking at how things have evolved, particularly how design has evolved. How do we make things resilient, fault tolerant, robust in this space? That's also really important. Then there's been some fundamental changes there. Performance has stepped forward significantly. How we deploy these things has also quite significantly changed in the last 10 years. These are all wrapped up together. They're not really distinct. They all interact with each other. There's some interesting ways they play together.

Design

Let's start off first of all with design. If you look at an exchange, they're all designed in a similar way. Typically, they're state machines. You apply a set of inputs to a given state that will bring it to a new state. This is pretty old computer science. It's been around quite a while. If you're looking at the types of state machines, there is Mealy and Moore. Most are typically Mealy State machines, where we're applying inputs. The state of your outputs or inputs to state, and we get a new state. That's how things tend to work, pretty simple. Most people don't build applications thinking like this, but most of our programs are state machines, whether you like it or not. They're probably just not particularly well defined state machines, in how they work. These systems need to be much stricter about that.

Replicated State Machines

Particularly, we want to look at replicated state machines, because we want something to be reliable. It needs to be on more than one node so we can tolerate failure. From there, we can then recover and go forward without losing customer data, and state, and things like that. To do that, we've got to order our inputs. Every state machine must get the same ordered set of inputs to come to the same result. The execution of those orders within that, as the events change the state, that change must be deterministic. There's some quite surprising things in there. Most people here don't know that if you program with a HashMap in Java, that if you iterate over it, it's not deterministic. You get surprises in some of these things. Lots of data structures are quite surprising that they're not deterministic. We need to make sure all our state changes are deterministic. We need to be able to deal with that. If we have got that, we end up with the same state and the same set of outputs across a number of replicated state machines. That gives us our reliability, our ability to tolerate failures at the end of this.

Distributed Event Log

We end up building a distributed log. Most people will take a log of the inputs, deal with that and replicate it. Then we can get the state machines to the same point. That's the bit that's formally studied and well understood. There are all ways it's played with. You probably have heard of event sourcing, where it's a little bit mushier. It's not as formally studied. We'll be looking at changes and replicating the changes. Databases tend to do a similar approach to that but they do that much more formally. I get to see many event sourcing systems that fail because people don't follow the right rigor and do it formally enough to make it work. I want to talk about replicated state machines as formally understood. Those ones can get us to nice, reliable systems.

Rich Domain Models (DDD)

The cool thing with this is once you've got the ability to process these events in a nice deterministic way, we're not dealing with databases. We're not dealing with all of this horrible stuff like ORMs and things. You have these really nice rich domain models, which is actually a great thing in finance. It was one of the cool things you get to work on. Things like matching engines, risk systems, surveillance systems, all these. They're very rich, interesting domain problems to work on. We can work on them with quite pure, very elegant models. It's a fun space to work in as well. Within these domain models, we've got lots of relationships between our objects. For this, we're going to represent these relationships with something, and that's where data structures come in. You get to do quite a lot of cool computer science, either picking data structures, and choosing the right ones, or even designing data structures from scratch. To make the best matching engines that are out there, there are not off-the-shelf data structures to do this. You end up designing data structures specifically for this purpose. It works really nice as a result. If you work in that space, you get to do quite cool things. In fact, some of the best people in the world at working on matching engines, that's pretty much all they do. They get to design these and work in a cool space. It is a computer science problem and a domain modeling problem.

Time and Timers

Within this space, this is something I've seen evolve over the last 10 years, particularly, how we deal with time. Time is a really interesting problem in this space because the time needs to be consistent across all of the machines. You can't read the system clock. That's not something that ever happens here. It shouldn't ever happen. Sometimes it does, and things go wrong. We have to catch up and deal with it in other interesting ways. We usually are timestamping events as they come into the system, maybe even timestamping them at the hardware level by putting a timestamp onto a network packet using very accurate clocks, disciplined to atomic clocks that are GPS synchronized. We can be very precise at that level, which is a nice thing to do.

What about if I want to set a timer? If I place an order into an exchange, and that order goes in. Ten minutes later, if it's not matched, it must be cancelled, or maybe even 10 seconds or even 10 milliseconds later, you may want to cancel it if it doesn't fire straight away. We have to set a timer for that, and wait for the timeout to occur. If it fails in the meantime, you're not going to fire that timer, you want to cancel that timer. If it doesn't fail you want to have that timer go off and deal with that. If you've got a replicated state machine, what do you do here? A typical approach is to pulse in timestamps. You see that a certain amount of time has gone past. Then you're reading that, you're looking up and go, are there any timers that need to expire? Got a really interesting computer science problem there. How do you manage a lot of timers and deal with it efficiently? Things like timer wheels and priority heaps. There's all sorts of cool data structures for dealing with that, well studied in the operating system space. You can draw from there.

These events going in, end up in your log. You end up with a lot of events in the log if you want very high resolution on your time. You want to have microsecond timestamps, you will end up with having a million events per second in your log, just dealing with the time. You need to deal with that in some interesting ways too. There's ways of doing that. It's actually better if the infrastructure is built to cope with that problem. I have seen people go through the different ways of doing that. It's now pretty much acknowledged that you need to do that in the infrastructure to do it well. There are solutions to do that. Just being aware of some of this.

Fairness

There's also been a big evolution in fairness in this. Anybody who's worked in finance will know, it's pretty much a rigged game. It's been rigged by the banks and large corporations for a long time. We now have got retail customers dealing, and we've got lots of new people coming into this market. Fairness is becoming something that people want to have. It is changing. We're even seeing in markets where we'll get different market makers are playing off each other. They're not playing by rules. High frequency trading shops do a lot of this. We need to work out how to make it more fair, how to make it more transparent. This has caused a big difference in the design of some of the exchanges.

Here's one example here. Ten years ago, an exchange would have looked something like this. We have multiple gateways facing off to customers. We probably had two matching engines at the back of that. One was there to take over from the other one if it failed, so we got that level of redundancy. We got these multiple gateways, people can connect to those gateways. That starts to send orders into the system. If you're a market maker, and you want to get a competitive advantage, what you can do is you could connect to every single gateway that an exchange has. Simple technique to do. You can start measuring, which is the most responsive gateway that is there. Then from that you can start putting your orders through the most responsive gateway. Makes sense. That's a pretty fair technique. You could do that and that will work. We're fine with that.

What if you want to be a bit more unpleasant? You're not just wanting to compete, you compete, but you also don't mind putting other people down. How about you work out, which is the best exchange, or which is the best gateway to connect to? You connect to that gateway. You can start sending your orders through that. What about all these other gateways? What if you start stuffing orders through those other gateways that you don't care for matching? You put them at prices off the market, so they're not good prices. They're not going to be taken. You stuff loads of traffic into them. They're now stuffed up with traffic so other people can't compete. Then you go through the gateway that you have set up which you know is fast, and you're going to send your orders through nice and quick. That's not fair. That's what started to happen. A lot of that stuff started to happen.

How do we change that? What if we only have one gateway, people can't do any nasty things then. Everybody has to go through the same gateway. This is where a number of exchanges are now moving to having a single gateway that you connect to. You connect to that single gateway, and everybody goes through it. This has got some interesting implications. We don't get the games playing. We don't get all of that extra traffic stuffing in orders through them. That's actually quite nice. That steps down the load on the system quite significantly. You got a single point of entry for all traffic going through there. It has to be really fast. We're seeing this consolidation now in the gateways the way we used to see consolidation, and matching engines, and risk systems where everything went through the same point, and we needed the traffic on there. We would go wide on gateways. We can't go wide on gateways anymore. Performance has started to really matter. As a result, we can send the traffic through that, but we have to get it right.

We'll move to this model where we have a single gateway, will be connected to a matching engine. There's also something else that's going on, we're realizing that the primary, secondary means of fault tolerance is not sufficient. We get multiple matching engines. How do we deal with loads of load especially dealing with the internet? We do have multiple gateways, but we're having gateways by classification of customer. It is perfectly acceptable under the regulation to treat an internet customer different from a co-located customer. They are a different classification of customer and they can get a different level of service. Within a classification, they should all get the same equal for a service. One gateway may be the internet. One gateway may be remote people. One gateway may be co-located people at the exchange. We can separate them out in various ways like that. That gives us a little bit of an ability to steal. We're still having to get all this traffic through. We're having to focus on the performance of that.

One of the other ways we can scale up, is we can scale up the matching engines and stuff themselves by sharding by instruments or asset class or something where we don't have the relationship between that. Some markets are interesting where there's correlation between instruments and we'll deal with liquidity across them because some things are fungible, or can be fungible via calculation across that. We get a little bit up and sometimes it's actually easy just to shard by dealing with the instruments or different assets at that point in time. You may have indexes on one. You may have FX on another. You may have equities on another. You may have some of the biggest equities in a different exchange. We deal with that separately. We're seeing this evolving layout of the infrastructure.

Migration by Asset Class

We're also seeing a move of asset classes to being exchange traded. We have a lot of over the counter trading. We have a lot of proprietary trading and broker based trading in the past, more of it's moving to exchange, because on exchange, it's fairer. It's equal. There's better ways of dealing with that. We're getting a lot more activity in this space, and an explosion over the last few years of cryptocurrencies and other crypto instruments that are being traded in this space as well. Lots of the new exchanges out there, this has started happening. One of the things I actually like to see this start to happen, is we started trading energy now. I got to work with a company doing an energy trading exchange for Europe over the last few years, where renewables has become an interesting problem. You've got wind. You've got solar. They're not predictable. You may have one part of Europe's got a lot of wind, another part of Europe doesn't have a lot of wind. We need to move that energy around. We need to cooperate. We need a means of trading that really fast on the fly. This stuff is starting to go on to exchange now as well. Really cool and interesting things happening here.

Resilience

That's where the design space is going. Where are we going for resilience? How do we make things fault tolerant, particularly? By fault tolerant, can we tolerate a fault and continue and go forward? Ten years ago, everybody in finance was doing primary, secondary, where you've got a node that's your primary node. All your traffic is going through that. It's backing up to a secondary node. In the case of the primary failing, you can go to the secondary node to be able to continue processing. Not a very good approach, when you actually look at it from the service you want to offer customers. A couple things in that. If you got two nodes, and you get a fail there. You can't automatically decide which one should become the new primary. Typically, you end up with a failure at 2:00 in the morning. Someone gets out of bed, tries to do something about this, and screws it up because they haven't had their coffee yet. I know. I've been in these situations. You want the means to be able to automatically choose which node is the ideal node to be the primary connection and go forwards.

It's actually bigger than that. You start having a conversation with a lawyer around why do we have fault tolerance. If I'm trading into a market, particularly, if I'm allowing retail customers to trade into a market because they deserve more protections than professional traders, because they're not aware of the risks to the same level. If I'm trading into the market as a retail customer, and the primary dies, and we go to the secondary. Seems fair, we get to continue. You're now running on the secondary. What if the secondary fails? You now have no data. You have no way of reputing what has gone on. A lawyer is not comfortable with that situation. If you're going to go from primary to secondary, the only point you should go to secondary is to allow people to trade out of their position so they can go flat and be safe in the market. It is not a system that you can continue operating on except to clean out.

If you go with a different approach of consensus. Let's start off the simple case of three node consensus. If you have one node that dies, you can continue trading because you still have another node beyond the primary, and you've got a primary and a backup. You're still trading safely at that point. You've also got a really nice case of you can work out which one should be the primary. Let's say we have a net split. We've got split brain going on in a network with three nodes, they couldn't communicate to each other briefly. Then maybe two of them can and one of them can't. The two of them that can, can determine that they can form a viable cluster and continue. The single node on its own can't. If you have a system whereby you've got primary and secondary, and you get a net split, you cannot work out which one automatically becomes the leader. We've got algorithms that allow us to do this. We've seen this move forward to doing this. What's interesting in finance, I'm still seeing a large number of people doing primary, secondary, even though it's known to be a flawed solution. We get loads of excuses but we don't want to spend the money on the hardware. Hardware is so cheap these days. It's expensive in many financial organizations, because the IT staff are shockingly bad and inefficient in how they run it. How many folks have asked for a test server and it's taking four, or six months, or nine months to get it, and stuff? That level of service within an organization is going to die. It's the way of the dinosaur. There's much better solutions to this now.

When you look at these techniques to do consensus, it's been around for a long time. Three of the main proponents in this, Leslie Lamport, Barbara Liskov, and Ken Birman, did the majority of this work in the 1980s. Leslie Lamport wrote a seminal paper in '84 on time versus timeout for working with distributed systems that spawn some thinking around this. Barbara Liskov has been the first one to actually come up with a consensus algorithm after that. Paxos is a bit better known, but it came out later than that. Ken Birman did a lot of work in virtual synchrony. These were all simultaneously coming out. They give us the ability to have consensus and fault tolerance properly in a market. This was worked out in the 1980s. I find it shocking today that stuff that we know and we do it really well, we're still not using it properly in 2020. People running primary, secondary in 2020, you really are a dinosaur. You need to wake up and find out there's better ways of doing some of this stuff.

What was interesting was all those approaches were hard. They're really difficult to understand. Which is probably one of the reasons why they hadn't taken off. Paxos probably the best known, but infamously difficult to implement and get right. Then the Raft paper came out in 2013, 2014. This changed how people thought about this. Because there's an interesting goal with Raft. The goal was to build a consensus algorithm that was easy to understand, not to build one that was perfect or trying to deal with all of these extra as conditions. The primary goal was to build one that was easy to understand. This changed and revolutionized the space. Having implemented a bunch of these, it's actually not that much easier to implement than Paxos or some other stuff, but that's a separate thing. A lot of the groundwork for this was actually Barbara Liskov's work in Viewstamp Replication. Raft is closest to that than some of what's gone before. It has changed how we think about stuff.

Raft Safety Guarantees

The cool thing is we get all these nice properties, whenever you implement Raft correctly. You get election safety. At 2:00 in the morning, when you get a node failure and you get another one elected, you don't have to call the support people. It just deals with it. I remember one of my customers was talking about how they'd come into work one morning, and during the night they'd had a node failure, and they just had an event in their console saying that it happened. Another node was elected and it continued on. There was no drama to it. That's nice. That's often not the case whenever these things have gone. I can remember situations back 20, 30 years ago, where in the middle of night, I'm tearing my hair out working with administrators trying to get the system to work again properly, and we're making the mistakes because we're tired. We don't want to be in that situation. We've got this. Things have gotten better as we've moved forward.

How It Works

How does some of this work? We elect a node. That is designated the leader within it. In this case, the middle one being their leader here. These are usually managed by, in rough terminology, a thing called consensus module. Something manages the consensus within the cluster to work out where we can safely process any data to go forward. We start dealing with a number of counters that are interested in this space. This is a distributed system, and distributed systems, everything has a different view of what the state of the world is. Just like all of us, we don't have a global shared memory. We all have our own local memory in our head, and we communicate with each other to reach some consensus on a shared understanding. This is how these systems work. How does this work out? The numbers here could be the number of bytes through the system, the number of messages through the system, whatever. We're dealing with a distributed log that's deterministic. It's ordered. We're going to send the same log to the module. How far have we got through that log? That's the interesting question.

One of the things that's interesting is what is known in rough terminology as the append position. Is, what have we appended to disk? Append index or append position? What have we safely got on disk in some stable storage, so that if we have a crash, we can get that back again from the perspective of that node? That's one of the things. In this case, the middle one being the leader is at position 29, whereas the others are 22 and 23. Because that could be lagging a little bit. That could have had a little pause, various things going on at different points in time. Typically, the leader will be ahead of the others because it is the one that's doing the sequencing, but that's not guaranteed. There is another index that matters and that's what's known as the commit index or commit position. That is, what is the index the cluster has reached from the perspective of consensus? What is the quorum of the cluster agreed to be, it can safely recover from? The quorum is defined as the number of items in the cluster divided by two plus one. A quorum of three is two. Everybody says, is quorum of two, one? No, it's not. This is why when you go from a two node cluster to a three node cluster, you go in lockstep. You must go in lockstep to be reliable. Divide it by two plus one gives you whatever the quorum is of something. In this case, the consensus module has agreed that quorum is 25, because it's got 29. The one over on the left has 25. It's determined it, but it hasn't maybe communicated yet to the others, that it is 25. They're running behind. These are the problems we're dealing with all the time in distributed systems. They all have slightly latent views of where the world actually is. We have to deal with that.

Then we've got, what have we processed? The service that's running on top of this infrastructure, how far has it got? There are various levels as well. We cannot let the service go beyond the commit index, the commit position of the cluster, otherwise, we can't tolerate a failure and recover our data. We're not fault tolerant if we go beyond that point. We need to get all of these things to deal with it, because we could just lose a node at any given stage. What's actually really interesting is, we're going to cloud a lot more often now. In cloud, we lose nodes way more often than we used to lose them in our typical in-house infrastructure. We're dealing with these cases. We've lost the node. We need to elect a new leader. We'll tend to elect the leader that has got the most up to date view of the world under Raft. We've got a new leader. We're talking to that from the perspective of our clients at that point. We're replicating to this case, the one on the right.

The one in the middle, if it burst into flames and died, we don't see it again. That's fine. What's actually much more difficult is zombies. What if that took a big huge GC pause, and the others thought it's now dead, but then it comes back again. It's got a different view of the world and it still thinks it's the leader. We need to be able to deal with that cleanly and deal with it well, and make sure everything resolves correctly. Because whenever you look at split brain issues and failures, way more often than not, they're due to pauses, not absolute failures. Partial failures are much more difficult to deal with. There's a lot of theory around how we deal with this properly. The core of it is that we got to use monotonic sequences and deal with it around that. Every leader gets a turn in office, and we need to track those. We know it's from a previous leadership term, not one from the current, so to be able to deal with this and move forward.

Importance of Code Quality and Model Fidelity

We've got lots of theory in this space. Some of this theory has been around since the 1980s, inspired by work even further back than that. I've implemented these systems many times, and talked to other people who have done this many times. What we've learned is, the model is important. Is it formally proven like Raft, Paxos, some of these other consensus algorithms have been formally proven? That's useful. It has value. More common than not where these things fail, is in how they're implemented. We all make mistakes. What is our fidelity with that model? Have we programmed stuff up in code to actually meet the model correctly? Even though we think we have, and we have a good understanding, we all just make mistakes. This is one of the things that really gets to trip you up. If somebody says they can knock you up a consensus module in a few months, walk away from them. It's generally known, in most operating systems, that if you develop a new file system, it's 5 to 10 years before it's stable. Don't trust it for 5 to 10 years. I'd say with consensus systems, somewhere in the region of 3 to 5 years. If somebody just puts something together, it's going to take that time to reach the right level of model fidelity and code quality to where you can really trust it, just because these are complex beasts and you got to shake the bugs out of them, even when the model is formally proven in the first place. There's lots of interesting things to think about here.

Robustness

One way of getting resilience is the fault tolerance. Robustness is another thing that helps us. We're talking about meantime between failures and meantime to failure. How often we're going to fail. Robustness is about how robust is your code to inputs that are invalid? How robust is it to dealing with things that can go wrong? We often don't give this enough thought. How well does your application handle errors? There's some great studies out there that show that by far, the most common cause of production outages in systems is unhandled errors, and unhandled exceptions in people's code. There's loads of examples of where people have researched these and found that there's exception handlers with to-do's inside them, please go fix this in production. This really matters. Change your programming style so that if you've got an entry point to a system, validate all inputs before you do any mutation. Make that as good practice. Bertrand Meyer talked about this way back in the Eiffel days, about determining the preconditions, post-conditions, and invariants before you start executing anything. Validate all of your arguments before you start to do any mutation on your objects. That way, you'll end up with a much more robust system. Always check error returns and things, and have a sensible strategy that you're going to apply when you deal with this. This will really increase the robustness of your system.

Performance

Performance was the highlight whenever we talked about some of these things 10 years ago. Got people interested in what was possible. It has changed. Things have moved on, and some quite cool stuff. One of the biggest things I think I've observed in the last 10 years is this awakening about the distribution of latency. We used to talk, 10 years ago, about averages. Some people still talk about averages today. Again, wakeup call. You're one of the dinosaurs if you're talking about averages, because averages hide all sorts of evil. You got to be looking at histograms and percentile distributions to see what's going on in your systems. Because averages are just so misleading. Throughput matters. Throughput is useful. What's your latency to given throughput as an average? It's interesting. What's really interesting is, what is the distribution on that?

Systemic and Queuing Events

There's an awful lot of systemic and queuing events in systems that we got to think about. Systemic events, what am I talking about? GC pause, NUMA node memory reclamation, a defragmentation on an SSD, various things happening, biased locking revocation in Java. There's loads of examples of where something systemic can happen in your system. That's it. You just pause for a period of time. This completely screws up your response times, and being aware of that. People are aware of these. People will hunt them down, but able to track them. It really helps.

The other thing is how you measure. Especially, when you start measuring throughput over units of time that are not applicable. You think about how many of us have all said, what is your throughput per second? Or, what's your throughput per day, or something like that? That's a classic one. The big web companies talk about that all the time. What is the number of transactions per day? You start doing the math on it. It's really not that interesting. Even per second, it's not that interesting. If you look at real world systems, you're looking at what is it like in burst scenarios? What happens in that 10 microseconds when an interesting event happens in a financial market? Or, an interesting event happens like market open, market close, nonfarm payroll release? These things is when within a number of microseconds, a huge amount of traffic goes, and if you extrapolate it over a second, this becomes huge values. This is where throughput really matters because you want to have incredibly high throughput to get through that burst really quickly so you're not queuing. Because if you're queuing, you're waiting. You're not making progress. Usually, the way you win, and for algorithms at that point is to be able to amortize. If you can batch well, you can amortize your way through stuff. You don't want to be having a linear relationship between input events and IO events to disk, or network events, or anything that's going to be a large expensive cost. You need to amortize into those to do really well. It's a hugely interesting subject to go through that.

Garbage Collection

Garbage collection is one of the systemic events. That was the big one that most people started to realize. You run on something like parallel old in Java, or most of the various garbage collectors in C#, and you get these huge pauses. CMS was probably the better choice in finance for a long time, but it still had these full GCs that were nasty, and stop-the-world. People went with one of two approaches, either a huge memory space, and you just hoped and prayed that you didn't get in an event that was nasty before the end of the trading day. At the end of the trading day, you triggered a full GC and you cope with it. Or, you just write-no-allocate in your application. Those are the major approaches people took.

Then things start to change. People discovered C4, the Azul's Zing, concurrently compacting collector, which was just revolutionary, where we could run at high allocation rates, and not get really nasty GC pauses. Part of the history of that is that was originally designed for the Vega CPU where you're looking at up to 1000 CPU cores. Things like Amdahl's law was in your face all of the time. Stop-the-world events really got to hurt you. They're designed to cope with this and have a very responsive system. That's changed things a lot. The nice thing is it's inspired other things going forward. We've got GC Shenandoah. G1, just ignore in this space because it still is not really a proper concurrent collector. That's since C4 leads the way. ZGC is very much inspired by it but running way behind. It's not generational. It's got various other issues. Shenandoah is looking nice, especially on smaller heaps. We're getting progress in this area. It's changing how we think about this. We don't have to do zero allocation, like we did in the past.

Memory Access Patterns and Data Structures

When you really care about performance in this space, you start measuring cache misses more than anything else. Within your domain model and your business logic, your performance is almost directly correlated to the number of cache misses you've experienced. I get to build these things all of the time. You start to see, what is your memory access patterns? Are you going to have the working sets fitting into your L3 cache to work really nicely as a working set? If you're going beyond that, are you going through memory linearly so the prefetchers are helping? You have to work really friendly. Data dependent loads are your enemy at this point, otherwise you don't get performance.

The fastest matching engines out there are written in C, because of the control you have over memory layout. This is one area where Java has really suffered compared to other languages. C# has moved forward. Java is eventually starting to move this way. It's not there yet, and still has quite a way to go. You don't see so much of the really high performance stuff in this space because it's purely about memory access. Unless people are putting stuff in big arrays or putting it off heap and dealing with it directly through Unsafe, because that's the only choice that you have. We're stuck in that space. Data structures is a fascinating subject in this space. You get to see the cache misses. You'd run profilers, and you can see what's going on. You can see the traffic through your cache subsystem into your main memory. You can see how your instructions are playing out against the number of cache misses that you're getting.

Binary Codecs

We've also started to get this nice move eventually toward binary codecs in this space. There still is a lot of FIX. FIX is a horrible protocol that's mixed up at so many layers and so many levels. We're processing tag value stuff in ASCII. Eventually, in finance when I start to move forward, so I was involved in SBE. There's various other things that are out there like ITCH, MITCH, SAIL, and various other protocols that have all gone binary, because it's so much more efficient than doing this in text. What's interesting is outside of finance, the rest of the world is still living in La La Land and doing JSON, and XML, and all sorts of horrible stuff. It's like to just burn the planet down and waste CPU energy, and let's process everything. It's one of these, I want to run two by two. I always keep hearing, it's human readable. It's not. Nobody can read UTF-8. I know a few people that can read ASCII but nobody can read UTF-8. We have editors that allow us to do it. It's a tooling problem is what's going on there. We need to move to have nice binary protocols. In the network space, it's a done deal. We know about it. We moved across. Even the old internet protocols, we're all moving towards binary protocols in this space. We still have people obsessed with JSON. Now we've got the god-awful thing that is YAML that's going everywhere too. Please stop this stuff. Binary is much better.

Spectre and Meltdown

We've also had the advent of Spectre and meltdown, all the speculation bugs coming with the side channel attacks and various things. What's interesting, 10 years ago, we were starting to see the free lunch spin over really starting to hit. Things really slowed down. Our CPUs were getting faster but really at a very slow rate compared to what they were before. We had this really nice curve, and then it went like this. You've seen things got a lot slower as all the mitigations started to apply. I've seen huge costs increase when things got much slower. Actually, a lot of the game now in this is working out, which patches can you not apply? Because then you get better performance. Do you know your system is save from this type of vulnerability? There's a whole really nice world consultancy where you can work out, for this processor, I don't need this particular patch to be applied. Or, actually, I know where I'm running in this environment and what can talk to it. I'm not sharing anything else on hyper-threads, or stuff. I can deal with all of that as well. It's a weird and interesting world.

Certain things have really increased in cost because of this. Things like system calls. When you make a network call or a disk call, things like that. It has gotten much more expensive. Stuff that's more subtle, like page faults, because there happened to be interrupts going into the kernel for dealing with this. You got context switches associated with that. It has a lot more cost. I've seen people reading memory map files, just get horrendously more expensive all of a sudden. These sorts of things. Also, we're paging out our applications as part of a systemic problem, and then we're faulting them back in again. This all starts to become difficult. We have to look at locking memory and dealing with all sorts of interesting problems to get around this. This whole cost space has gone up. We need to think differently. We particularly need to really amortize. If you're going to put a small amount of data in a network packet, it's incredibly wasteful. It was wasteful before, it's even more wasteful now. If you're going to go to disk, you want to fill that disk block.

We can get around some of this, making it better, by going to huge pages. Set up huge table FS, and you map your pages in there if you're dealing with memory map files, making sure your applications are all running properly. Watching out for some of the issues that come with NUMA and transparent huge pages. You actually want to make sure you've properly allocated huge pages and you've dealt with that correctly. Watch for the swapping issues that come around with that.

Advances in Hardware

Hardware has advanced quite a lot in this period, and in really interesting ways. Disks I think are totally different than they were 10 years ago. We had really nice SAS spinning disks that could write sequentially, nice and fast. That's where these logs worked out quite well. They're not good for random access. We get huge pauses, especially when we're looking at our page faults that are going on, or we're swapping stuff out, or anything there. You try to avoid it. If you did get this you got huge pauses, because with the spinning disk, you're dealing with milliseconds to go seek somewhere else on the disk. SSDs have changed this. They have really significantly dropped our latency. We've gone from milliseconds down to tens of micros, or hundreds of micros in these average cases. The really good ones are now down to tens of micros. Then we've got Intel's Optane memory that came out, which didn't quite give us the throughput advantages we thought we're going to have, but has really helped with the low and predictable latency. Although, we're still finding over again, that if we access things in random or an arbitrary pattern, we have much more cost than we deal with it as a nice linear pattern. Main memory is very nice if we access it linearly because it prefetches, disks prefetch. We start dealing with how we write down and batch nicer. Disks fail less often if you deal with them in a sequential pattern than if you deal with them in a random pattern. Still having that understanding of what's going on under there. Networking's advanced hugely as well. I think the HFT and the whole thing that's been happening in the finance space has been wonderful for networks. We've got much lower, much more predictable latency now in our network stacks and we're getting to benefit from that, especially as we've now gone much more distributed. Finance has been good at that.

CPUs have not gotten much faster. That's what's interesting. If you actually look at the summary of what's happened in the last decade is we've had minimal improvements in latency and response time from our hardware. Disks being the exception, but everything else, like memory CPU, latency is the same. Throughput has massively changed. This has flipped around how we got to think about design. We spent four decades of progress optimizing to not use as much throughput and bandwidth, and trading off latency because that wasn't a problem. Now it's flipped. Bandwidth is abundant. Latency is not getting better because we've squeezed it to the end. Generation after generation, we're getting doubling or even orders of magnitude improvements in bandwidth. We need to change how we're thinking around that.

New IO APIs

Part of that is our new APIs particularly for IO. The BSD sockets library has just run its course. It is not a good API for dealing with modern network cards, we need to go to much more asynchronous models. Things like EF_VI going to Solarflare cards, or Intel went forward with DPDK. We're starting to see things like even the Linux kernel now, because the system call overheads and getting the packet rates, io_uring, and where is it going? That is the only way to stress some of this hardware to what it's capable of. Similar on disks, we need to go asynchronous with these APIs because that way, we get around the system call overhead. You can only do so many system calls a second. When a system call is taking more than a microsecond, it means you're going to be doing less than a million of those per second, which means you can't do more than a million packets per second or IO operations per second yet the underlying hardware is capable of significantly more. We need to deal with it asynchronously to deal with that. We need to move forward that way.

Mechanical Sympathy

Mechanical sympathy is more important now than it's ever been, as we're using abstractions. If we don't understand what those abstractions are built upon, and we use them correctly, we just are wasting cycles in this. Interesting question is, does programming language matter for a lot of this in this space? I've seen exchanges and trading systems built across all sorts of languages: Java, C#, C, C++, a lot of Rust and R recently. This thing is changing. I think it's actually not so much the language, but more the culture and the design around the language.

What do I mean by that? I love Java. I think Java is useful for many things, and the JVM is a great thing. The design of some of its APIs are so hurtful to performance, they allocate too much. They have far too much data dependent loads, and not enough respect for memory layout. It starts to really hurt. C# is a bit different in some other ways. C# is a language and its control is better, but its VM and runtime, is not as good as the Java one. We get to see this. The higher end of performance is going back a lot more to C, C++, and Rust, because we have so much more control. You can build nice data structures, but it's not just that. It's getting access to these new APIs. Some are from Linux languages are just so slow to even consider where our hardware is going. We're not getting to make any advantage of that, so we end up missing out.

What's happening is we tend to get much more polyglot style systems, where you'll develop some parts of systems in one language, some parts in another language. I see, ideally, they can all work and interoperate very fast and efficiently. That's the ideal world. We get to see that quite a lot more now, with things that aren't so high throughput, aren't so response time critical, will be done in other languages that maybe have a nicer environment, and easier support, and easier access to many more developers.

Deployment

What's been changed in deployment? It's an interesting space. Continuous delivery has had a huge impact. I think that was one of the things I was very proud that we were able to push forward at LMAX, and working with Dave Farley was a big advantage there. We just wanted to be able to turn things around fast, get that fast feedback cycle. The secret to that is automation and actually looking at how you get feedback as quickly as possible. I still see people debating about, should they have a good test suite? Should they have the ability to do one click deploy? All of these things. You still get some financial organizations that can only deploy every six months. It's just insane. You just don't get the feedback. You can't react to that. We need to have that ability to work that way and progress much more quickly, because the feedback cycles is what feeds innovation.

If you can have an idea, test the idea, and find out if it's good in a very short space of time, including, find out if it's bad. You don't get as wedded to it. One of the big dangers with anything that's got a long feedback cycle is things become our pet projects, that we become wedded to an idea, because it takes us so long to push it through a system then we can't almost let the baby go. We want to see it succeed. It's much better if we can very quickly work out that something is not a viable idea. We can go a different route. For that we need to think about it. People talk about agile. I really don't care whether you do standups or not. That's all just cargo cultism. The key to agile is feedback cycles. It's all based on Little's Law. If anybody says they do agile and they don't understand queuing theory and Little's Law, they are not doing agile. That's about feedback. We have to have those fast feedback cycles to make good decisions and move ourselves forward. Think about how we can continuously deliver.

24/7 Operations

We've seen a lot more move to 24/7 operations, where we get new exchanges now that are going right around the clock. It was typical before that we would shut down at the end of the day, and we didn't need to carry state forward. This has made it very different from, how do we keep the state? We end up doing stuff like having to have snapshots of our state and recover from that. This also allows us to do rolling upgrades in live systems. This is where clustering and consensus is a better approach. You can take one node out of a cluster, upgrade it. Get it back into the cluster again. Do another one, and you roll these things. It allows us to do the hot upgrade and still be responsive. We're getting a lot more move from that. Especially, as we go to retail, retail expects to access their systems at any time. We're also dealing with things globally. We're getting a lot of change in that space. We need automation. We need good monitoring. We need lots of things in that space to be able to do that and do it well.

Flexible Scaling

We also need to scale flexibly. A test I have for any system being built today is, can you run it on your live production system and use the same deployment tools, the same setup to run the whole thing on your laptop and debug it? If you can't do that, you've failed, which has some interesting implications. Hardware, load balancers, specialized hardware, and stuff, it doesn't really have a place in some of this anymore, because it restricts what we can do. We need to be able to deploy everything to a laptop. We need to be able to take threads and run them on cores. We need to be able to collapse them down and schedule them well. We need to split things apart and maybe have them communicate via IPC, because we get to see this where we're now going from machines that had 4 or 8 cores, or whatever, in a server 10 years ago, to now having 100 cores, no problem, on a server. What's the obvious implication of that stuff that used to take up a whole rack now just takes up one machine? You want to move everything on to a machine, make things much more efficient by talking via IPC at that point. We've seen that now with going to the cloud. I'm starting to see new exchanges being built and deployed in the cloud. You need to take advantage of the hardware that's there, take advantage of the new APIs and be able to work with that, and go forward.

Wrap-up

That's the last 10 years. What's the next 10 years going to hold? If I see any current trend that's going on, I think a lot of the performance race and stuff is over. It's all about refinements. It's about quality of execution now. It's, how well does your system stand up to failures? How well can you observe and deal with it, manage it in production, run it 24/7, and face off to customers? Service is where things are really going in the future. You got to be reactive to a market that's changing so fast. If you're in a cycle where it takes six months, nine months to do something, to find out, you're just going to lose to the competition. You need to be much more reactive.

See more presentations with transcripts

Recorded at:

Jul 21, 2020

Martin Thompson

InfoQ Software Architects' Newsletter

Login with:

Don't have an InfoQ account?