Maximizing Performance and Efficiency in Financial Trading Systems through Vertical Scalability and Effective Testing



Peter Lawrey discusses achieving vertical scalability by minimizing accidental complexity and using an event-driven architecture.


Peter Lawrey is a Java champion and the CEO of Chronicle Software, driven by his passion for inspiring developers to elevate the craftsmanship of their solutions. As a seasoned software engineer, Peter strives to encourage simplicity, performance, creativity, and innovation in the software development process.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Lawrey: I'm a Java champion. I have 13,000 answers on Stack Overflow. I was the first gold medal holder for memory, file-io, and concurrency in any language. That gives you an idea of my interests. This is my first computer, nearly 40 years ago. It had 128 kilobytes of memory. In that time, I used it for many years and never found a use for the whole 128 kilobytes. Each of those 8-inch floppies had a whole 1 megabyte of storage.

Chronicle Software

As a company, our vision is to provide 80% of the common libraries and infrastructure that the trading system needs, so that the client can spend their time focusing on where's my business value. What will make a difference for my company in terms of making money? The intent is that they can spend more time on that, and therefore get more efficient use of their developers. We're used by 16 banks and exchanges, who are paying clients. We are downloaded by over 200,000 different IP addresses globally each month. We find that around 80% of banks are already using us, as an open source product, in at least one of their core trading systems.

We have a significant team of IT specialists with over 30 years' experience each. We have a highly specialized team. We've got a lot of experience in both developing and supporting trading systems. Some of our client systems we've supported for 5 to 10 years, just while we've been at Chronicle. Chronicle Queue is our flagship product that's downloaded 80,000 times a month. As I've said, it's open source so a lot of people are using it. It gives you an introduction into the low latency space in an open source product. Our key value is that we are easier to work with than many of the other alternatives. You often find that when you get into the low latency space, everything is so finely tuned, and it's not very easy to work with. Then that costs time: it takes you longer to get to market. It's harder to modify and harder to maintain. We focus very much on making sure it's still very easy to work with, very easy to develop and test, even though it's also highly performant. One of the ways we achieve that is by having a sliding scale. To start with, you have something that's reasonably performant, but very easy to work with. Then once the system is stable, you've got options for tuning even further to get down to the latencies that you need.

The offerings we have are durable messaging, which is the Chronicle Queue or Queue Enterprise. The free version is in Java, but there are C++, Python, and Rust versions. We have restarting strategies for microservices and high availability for them. One of our biggest products commercially is a FIX engine. This gives you some background as to who we are, and where I come from. I have a team of developers working for me who have over 30 years' experience themselves, as do I.


In today's talk, we're talking about maximizing efficiency and performance in trading systems. However, a lot of these things are not unique to trading systems. They're more prominent there because these things matter much more. These are techniques that are broadly useful. They're not specific to trading in general. A lot of the users are in gaming, or in transactional type systems where you need to have low latencies rather than just going for high throughput. We'll talk about why allocating objects could be up to 80 times more overhead than the cost of GCing. Why 99% of your latency might be accidental complexity rather than essential complexity. A lightweight technique for 64-bit timestamps that are unique across the cluster, so then you've got a unique ID. Reducing the latency of your source of truth by a factor of 10 or more. Lastly, a technique for generating 90% of your data driven tests, so that you're not having to create and maintain all of them yourself. A quote from Grace Hopper is, "The most dangerous phrase in the English language is, 'we have always done it that way.'" She actually led a team developing one of the early COBOL compilers. She's credited with developing the technique of linking applications after they're compiled. You go through a compile stage, and then they get linked later. She had the rank of Rear Admiral in the U.S. Navy.


How do we picture how a resource can fail to scale? One example is you can start with an office. You imagine an office with a small number of rooms. The analogy being each one of these rooms is a CPU. People need to be able to go between these rooms, in order to do their work, or to go to meetings. If you're spending too much of your time walking from one room to another, it's not only an overhead, it's also a bottleneck, because there's only so many people that can be moving from one room to another. What you want is to be spending as much time working in each one of those rooms so that that shared resource, which is the corridors and the doors to those corridors are not your constraint. If you've got each of your CPUs working independently and very efficiently, they're not contending on a shared resource. One of the problems is that a lot of tests out there are single threaded, so if you got JMH, for example, that's a single threaded test. It's essentially testing how much time does it take for one person to go from one office to another? It's the simplest test. That's a baseline. It doesn't show you how does it behave when the office is really busy. Are you doing something that works very well, when there's only one thread doing it, but doesn't work so well when you've got many threads doing it.

We have a benchmark, this was an open source benchmark, and this gives you an idea of how we structure our microservices. We have an interface for messages coming in. We have another interface for all of the messages and events that can go out. In this benchmark are 16 clients connecting simultaneously, which uses up all of the logical cores of a 32-core machine. There's a flag in there which says take a copy of this small object. It's only 44 bytes. That itself doesn't matter, it's to show how much difference it makes having even a small allocation rate per message in terms of your throughput. For a small number of connections, or threads being used, it scales very well. The latencies are only around 18 nanoseconds to allocate an object. That doesn't seem like a big deal. That's what something like JMH will show that allocating objects really is very cheap, especially if it's small. Even for a small number of threads all trying to allocate at the same time, it's still not a problem. However, as you scale up, and you're trying to use more cores in your system, you're trying to use all the cores of your system, they start to contend on your L3 cache in particular, and also your memory bus, because that's a shared resource between all of your cores. Really what you want is all of your cores to use it as little as possible. Obviously, in Java, you have very little control over this. The main thing you control is how quickly you're allocating, because allocating, it's only short-lived objects, is literally creating garbage and filling your caches with garbage and forcing out any useful data into memory. It also creates a bottleneck.

How does it scale as we have more threads allocating objects? This is a simple test, where all I'm doing is allocating, just to show that as a standalone. Up to about 4 threads, it's only risen from 15 to 18 nanoseconds, which doesn't sound like very much. However, you'll see that as I go through more threads trying to allocate objects, the time to allocate just increases. In fact, the number of objects being created per second hasn't increased, because I've hit that threshold. Then, if we look at it from another perspective, for different sized objects, you can see that even for the smallest object, it's around 150 nanoseconds. I mention that because that comes up in the other benchmark. If you look at it in terms of number of gigabytes per second, you can see that the number of gigabytes per second saturates. The object size does make a difference in terms of how many gigabytes per second you can produce. The allocation rate sits between 8 and 18 gigabytes per second for a fairly typical server. For the benchmark I mentioned earlier, we're able to achieve 67 million events per second over TCP. This is to an echo service. I'm sending a message to an echo service, and all it's doing is just sending the message back again. This includes serialization, and obviously the echo service doesn't do any real work, but it gets around 4 billion messages per minute, which is a fairly high number. You can see that if I allocate one object per message, that slows it down by around 25%. In fact, the average cost that you see across the board is roughly a little over the 150 nanoseconds that we saw for the standalone benchmark earlier.

You'll note that Java 11 seems to be a little bit better than Java 8, at least for this benchmark. We see mixed results. Sometimes it's better, sometimes it's not. I have not seen an occasion where it was worse. Certainly, going to Java 11 or higher is a good idea from a performance point of view. For comparison purposes, what was the time spent in GCs? Because this was lots of very short-lived objects, they're very easy to clean up. Only 0.3% of the time was spent on the GC. If you looked at the GC logs, you'd say those were very short pauses, and not very often, so actually my GC overhead is very low. In fact, the amount of time lost through allocating them or creating them in the first place was much higher. It's roughly 80 times higher. This is a benchmark that you can run yourself. Without going through all this analysis, is there some rule of thumb you can use? It used to be the rule of thumb was about 10 gigabytes per second. I'd say for more modern machines it's about 12 gigabytes per second. You add up the allocation rate of all of the JVMs on your machine, and this is the physical machine, so if you've got virtualization, and you've got noisy neighbors, this is not going to help you at all. It's the total allocation rate of every JVM on your machine. If you combine them, and it's around 12 gigabytes per second, then you've probably hit this threshold. Probably the simplest thing is to redeploy at least one of the apps onto another machine. If you're in a position to tune your application, reducing the allocation rate will almost certainly increase the throughput that that system can achieve.
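To apply that rule of thumb you first need an allocation figure per JVM. The following is a minimal sketch, not the benchmark from the talk: it reads the per-thread allocation counter that HotSpot exposes through `com.sun.management.ThreadMXBean` and converts bytes over elapsed time into gigabytes per second. The class name `AllocationRateSketch`, its methods, and the `sink` field are hypothetical names introduced for illustration, and the counter is assumed to be available (the method returns -1 when it is not).

```java
import java.lang.management.ManagementFactory;

public final class AllocationRateSketch {
    // Hypothetical sink: publishing each object stops the JIT from eliding the allocation.
    public static volatile Object sink;

    /** Bytes allocated by the calling thread while running the task, or -1 if unsupported. */
    public static long allocatedBytes(Runnable task) {
        java.lang.management.ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        if (!(bean instanceof com.sun.management.ThreadMXBean)) {
            return -1; // not a HotSpot-style JVM
        }
        com.sun.management.ThreadMXBean hotspot = (com.sun.management.ThreadMXBean) bean;
        if (!hotspot.isThreadAllocatedMemorySupported() || !hotspot.isThreadAllocatedMemoryEnabled()) {
            return -1;
        }
        long id = Thread.currentThread().getId();
        long before = hotspot.getThreadAllocatedBytes(id);
        task.run();
        return hotspot.getThreadAllocatedBytes(id) - before;
    }

    /** Bytes per nanosecond is numerically the same as (decimal) gigabytes per second. */
    public static double gigabytesPerSecond(long bytes, long elapsedNanos) {
        return bytes / (double) elapsedNanos;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        long bytes = allocatedBytes(() -> {
            for (int i = 0; i < 1_000_000; i++) {
                sink = new byte[44]; // small object, same payload size as in the talk
            }
        });
        long elapsed = System.nanoTime() - start;
        System.out.println("allocated bytes: " + bytes);
        System.out.println("GB/s on this thread: " + gigabytesPerSecond(bytes, elapsed));
    }
}
```

This measures one thread in one JVM. The figure to compare against the roughly 12 gigabytes per second threshold is the sum across every thread of every JVM on the physical machine.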

I've actually done that for some projects. There was a project I once did where every time we reduced the allocation per message by 10%, the throughput went up by 10%. In fact, the allocation rate actually stayed the same. We had to reduce the allocations by 40% before we even started to see a drop in allocation rate, because all that happened was every time we reran it after each optimization, the throughput went up instead. The allocation rate was just the saturation point there, stopping this going any further. On this laptop, you see a very similar behavior where, whether I allocate or not for a small number of threads doesn't really matter. Then as I try to use more of the machine, the allocations become the bottleneck, and I cannot drive the throughput any further or utilize more threads, because I'm not actually getting an advantage by having more threads, because that's not the restriction. If I lower my allocation rate, then I can get increased throughput by using more threads.

Accidental Complexity

Another quote, "Perfection is not achieved when there's nothing more to add, but rather when there's nothing to take away." This is Antoine de Saint-Exupery. He was a French commander in the French Air Force, and he pioneered transatlantic postal service. He also wrote a number of books. This is often quoted as an example of an engineering mindset of not just trying to add everything possible, but actually taking away all the things that aren't really necessary. Certainly, that's the mindset in a lot of low latency trading systems. You go faster by doing less. One of the big areas that you're looking for is accidental complexity. Essential complexity is complexity that is required to solve the problem that you have at hand. You can't solve the problem any simpler because that is the nature of the problem you're trying to achieve. However, we often have a lot of accidental complexity, usually in levels of abstraction, practice, how you've gone about solving the problem. This is a feature of the solution. It's a feature of how you did it. You can actually reduce accidental complexity significantly by just approaching it in another way, solving the same problem, but doing it another way. It's very difficult to measure accidental complexity, because you don't necessarily know how much you could make it go faster until you actually do it. One approach is to compare multiple ways of solving the same problem. If they also meet your requirements, then the difference between your fastest solution and what you're doing now is probably accidental, because you can achieve the same result with something that's less.

An example of accidental complexity is driving from St Paul's to the Globe Theatre in London. London in particular is not very logically arranged. This layout was originally set up by the Romans in about the first or second century AD. There was actually a great fire in about the 17th century, and they had plans to redesign it all actually in a grid layout. Then, the problem was that took so long that they'd started rebuilding by then. The layout has never changed. Some of the widths of buildings are the same standard Roman width, to give you some idea of how anachronistic the whole approach is. As a result, driving is not generally a straight line in London. You have to go through this fairly complicated route to get from A to B. That's an example of what you actually get in software, as well, is that, to get from A to B, you can do it but it's having to do a lot more work than is entirely necessary. That's not obvious to you, because you don't see it like a layout, you don't see it on a human scale. Just walking is much shorter. It's still not quite a direct line, but it is much shorter, because you've got freedom to take more routes.

A similar problem is in software. This is a round trip latency using Kafka. A message is sent to a microservice and their response comes via a queue, so they're going to be sent over a queue and comes back via a queue. This is difficult to appreciate, because it's 10 times faster than what you can actually see. Even if this latency was 10 times higher, you still wouldn't be able to see it. It's about the frame rate of a classical movie, which is about 40 hertz. You can't see one slide changing to the next in the movie. That's about 25 milliseconds. You can't see this either. This is 10 times faster. You think, that is really fast. It's not on a human scale. It's also the distance you could send a signal between New Haven and Washington, this is a fairly long way. For comparison purposes, this is about half most of the benchmarks I've seen for Kafka for the 99th percentile. They often quote about 5 milliseconds. Generally, we don't quote anything that says milliseconds. That's just a point of embarrassment for us. If it's in milliseconds, we fix it, or we don't publish it. We wouldn't even mention things like that.

I did a performance profile using flight recorder. You see a very standard pattern that you see in Java applications where it's GCing very quickly. I refer to this as GCing like mad. This is actually a very standard thing. For whatever reason, flight recorder shows you the bars for every 2 seconds. It's 2 seconds worth of allocations, but it's actually doing about 6-and-a-half gigabytes per second. Interestingly, this is about half the problem. There's also the broker, which I'm not profiling. It probably produces about the same allocation rate. If you add them together, you're getting about 13 gigabytes per second of allocation. That's close to what I would say is the allocation limit of the machine. If you compare this with Chronicle Queue, which as I mentioned, is open source, it takes about 3.7 microseconds to do the same round trip, which is about the distance of walking across The Battery. It's a lot shorter. The other thing is that GCs are a lot quieter. This is in Java 8. The reason that's significant is that pretty much all of this garbage is flight recorder itself, not the application. This is doing half a million messages per second, at this point, but it doesn't really show up. In fact, with Java 17, they've improved flight recorder, and you don't see anything at all.

The reason I've left this slide up here is that in a real low latency application, this is more typical of what you will see. There will be some garbage. It won't be very much, but it is something you can actually see. It's probably at about this level. This is doing about 300 kilobytes a second of garbage. 300 kilobytes per second is about a gigabyte an hour. At a gigabyte an hour, you've got this nice feature that if you have, say, an Eden space that's 24 gig, then it will take 24 hours to fill up your Eden space. If you've got an overnight maintenance task to do a full GC, you can plan your GCs, so you'd have one minor collection and one full collection, say at 4:00 in the morning, and that's your GC for the day. Then the next day, you can just watch your Eden space slowly fill up. You can have trading systems with this allocation rate that don't GC all day. Your GC pause time isn't an issue, because it becomes irrelevant, because this is not happening. That's one of the benefits of having that low an allocation rate. Of course, there's no particular need to be even lower than that. If you could have it, it wouldn't necessarily achieve very much. It wouldn't change how often you GC because you're not GCing anyway, at this point. Also, it doesn't matter so much which garbage collector you choose, because again, you're not triggering collections in normal operating mode.
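The arithmetic behind that plan is simple enough to keep next to your GC settings. This is a minimal sketch with a hypothetical class `EdenBudgetSketch`. It uses decimal units (1 gigabyte = 10^9 bytes), which is an assumption. With exact figures, 300 kilobytes per second fills 24 gigabytes in a little over 22 hours, in line with the talk's round figure of a day.

```java
public final class EdenBudgetSketch {
    /** Hours until an Eden space of the given size fills at a steady allocation rate. */
    public static double hoursToFill(long edenBytes, long allocationBytesPerSecond) {
        double seconds = edenBytes / (double) allocationBytesPerSecond;
        return seconds / 3600.0;
    }

    public static void main(String[] args) {
        long eden = 24_000_000_000L; // 24 GB Eden, decimal units (assumption)
        long rate = 300_000L;        // 300 KB/s of garbage, as quoted in the talk
        System.out.printf("hours to fill Eden: %.1f%n", hoursToFill(eden, rate));
    }
}
```

If the result comfortably exceeds your trading day, a scheduled overnight collection can be the only collection you see.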

Even though Chronicle Queue is pretty fast, that's the open source version. It's the green line in this case. We have closed source versions, which are based on C++, which have even lower latencies. They are also more consistent, which is more to the point. The three, four, and five nines are still very low. At the five nines, we get about half a microsecond. This is sending a message from one process to another via a persisted queue. It is much lower for the high percentiles. It depends on whether you have that use case as a requirement. For a lot of Java applications, if your 99th percentile is under 2 microseconds, that's usually pretty good. In short, this is a ratio of latency across different percentiles. If there's another solution to what you're using that's 100 times faster, then it's likely that 99% of all the latency involved is entirely accidental to your requirements. This might be surprisingly high, because you might assume, I can get a 10% improvement, I can get a 20% improvement by tuning. You might not be in a position where you think to yourself, actually, I can make it 100 times faster, just by using a different approach or a different technology. It depends on your use case as to what that would mean and what that will be. This does actually happen. This is something we tend to see again and again, because we're asked to do performance consulting. We can find another way to solve a problem and get a much faster solution.

We had one client that was particularly challenging, because they said we got a month, and we want you to speed up the 99th percentile but we don't want you to change the hardware or the software. They originally wanted me to guarantee how much improvement I could put in. I said no to that. I said, "No, I can't guarantee anything at this point." I thought to myself, I'd probably have to change something but we'll keep it to a minimum. It turned out that all that was really needed was analysis. Where were their latency holes coming from? The reason it only needed analysis was that, in each case, someone had already written the solution, in some cases three years ago, but they'd not been able to put forward the case as to why they should switch, for they've not been able to give them the data to say this is what you should be using because it's better designed or it uses a better technique. Because obviously a bank being risk averse, and so while it works now, unless you can show me a good reason to switch, I won't switch because why break something that's already working. I was able to give them the analysis. Then it turned out through a whole series of configuration changes, switching implementations that already solved the problem, they were able to reduce their three nines by a factor of 25, without any additional code. That's the only time this has ever happened. Usually, you'd have to change some code as well.

Distributed Unique Timestamps

"Code is like humor. When you have to explain it, it's bad," Cory House. Distributed unique timestamps, or distributed unique IDs, is a fairly common pattern. You've got a cluster of systems and you want to be able to have an ID that allows you to trace a particular event, or an order, or some particular transaction across your whole application. One technique that some places still use is to have a microservice which distributes IDs. That is often backed by a database that gets a chunk of IDs, and then they're dished out. The problem is, this takes around 100 microseconds, which is a surprisingly long time, especially for clients who then also say they want to be low latency. A simpler approach is to use UUIDs. It's built in. You don't have to do any extra work. They are statistically unique. The chances of you producing two IDs that are exactly the same is extremely low. We can use them as effectively unique IDs. They only take about 0.3 microseconds. They're much more lightweight. There's no network involved. You don't have to be running a central service for them. There is a problem, though, that they're quite opaque. Even though they have a timestamp in them, you can't read it very easily. You'll probably end up having to create another timestamp as well, which adds a bit of overhead. That's a bit needless.

The solution we take is to have a timestamp that has, in the bottom two digits, a host ID. As long as you've got fewer than 100 hosts that you want uniqueness between, you can have a nanosecond resolution timestamp, and have it unique across all systems. You enforce that it's monotonically increasing, so that even if two timestamps are super close to each other, we artificially say, actually, they'll be a different time. That very rarely happens, we just need to guarantee that a duplicate never happens. Now we have a 64-bit value which is very easy to translate into something human readable, and therefore gives you a bit of information, and you don't need an additional timestamp as well. We can do that using shared memory. Using shared memory in Java is very central to a lot of what our products do. This is using Bytes, which is an open source product. Essentially, what it's doing is that it's getting a long value that's stored in a memory-mapped file. That memory-mapped file is shared between every process on the machine, and therefore we can enforce monotonically increasing timestamps for that host ID. Then each machine has to have a different configured host ID to guarantee uniqueness between machines. As you can see, the code is very simple. Then there's a loop. There's an optimistic path, where we're hoping that there's no issue and there's no contention. If there is contention, it will loop around until eventually it gets an ID which is greater than any one previously.
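The slide's code is not part of this transcript, so the following is only a sketch of the idea, not the Chronicle implementation. It swaps the shared memory-mapped long for a `java.util.concurrent.atomic.AtomicLong`, which means it enforces monotonic timestamps only within a single JVM rather than across every process on the host. The class name `UniqueTimestampSketch` and its methods are hypothetical.

```java
import java.time.Instant;
import java.util.concurrent.atomic.AtomicLong;

public final class UniqueTimestampSketch {
    private final int hostId;                        // occupies the bottom two decimal digits
    private final AtomicLong last = new AtomicLong(); // stand-in for the shared memory-mapped long

    public UniqueTimestampSketch(int hostId) {
        if (hostId < 0 || hostId > 99) {
            throw new IllegalArgumentException("hostId must be in 0..99");
        }
        this.hostId = hostId;
    }

    /** Next unique timestamp based on the wall clock, in nanoseconds since the epoch. */
    public long next() {
        Instant now = Instant.now();
        return next(now.getEpochSecond() * 1_000_000_000L + now.getNano());
    }

    /** Same as next(), with the clock reading supplied (non-negative), which makes it testable. */
    public long next(long nowNanos) {
        // Replace the bottom two digits of the clock reading with the host ID.
        long candidate = nowNanos - nowNanos % 100 + hostId;
        while (true) {
            long prev = last.get();
            if (candidate <= prev) {
                // Clock did not advance enough: step forward, keeping the host ID digits.
                candidate = prev + 100;
            }
            // Optimistic path: succeeds first time when there is no contention.
            if (last.compareAndSet(prev, candidate)) {
                return candidate;
            }
        }
    }
}
```

Because the host ID sits in the decimal digits, printing the value as a nanosecond timestamp still reads as a time, and the last two digits identify the machine.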

Even though we're using off heap memory, shared memory in Java, the code isn't that complicated. It's also extremely lightweight, in that it takes around 40 nanoseconds. It's about eight times faster than generating a UUID. You don't need an additional timestamp as well. You can also store it in any 8-byte value, so it can then be used as an ID if it's a long. There's no object creation involved. It has a lot of other benefits as well. This is an example of using a different technique to produce a much more efficient but also, in this case, a more usable solution as well. "Software comes from heaven when you have good hardware." Ken Olsen was one of the founders of DEC, which developed one of the early 64-bit microprocessors. I studied the Alpha quite a bit when I was at university, and in later years, because I was fascinated by that architecture. I was a big fan. Unfortunately, it didn't win out in the end. Nevertheless, I always thought it was a pioneering product.

The Source of Truth

Despite everything we've talked about, a lot of the biggest challenge is determining what your source of truth should be, because, in many cases, the source of truth is your bottleneck, is the biggest source of latency. It's not easy to swap out data. Do you need a database? Do you need a durable messaging? Do you need redundant messaging or just persisted so it goes to disk? These are the questions that will determine often the biggest source of latency in any system. Making sure that that is appropriate for your use case is often the biggest decision you can make. Reference to Little's Law, where the amount of concurrency your system needs to have is the average time to perform a task multiplied by the average arrival rate, so the throughput. You take the average latency multiplied by the throughput, and that is the concurrency that you get, or you require. As an anecdote, I was consulting for one client, and they had this very distributed multi-threaded application. It was all very complicated to manage and deploy. I asked him, what is your average latency? He said, it was about a millisecond. What is your throughput? It was 1000. You go, 1000 times 1 millisecond. On average, you have one thing going through your system at any one time. Maybe you don't need quite so much concurrency. In fact, it may be creating overhead stopping you from going even faster. I'm not sure they liked that answer. It is something that it's useful to get an idea of. It's, what is the level of concurrency you're actually achieving in your system? All you have to do is multiply the average latency by your typical throughput, and that will tell you how much you're actually getting through. How many things are going on at the same time in your system?
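The calculation in that anecdote is Little's Law applied directly. As a minimal sketch, with a hypothetical class name `LittlesLawSketch`:

```java
public final class LittlesLawSketch {
    /** Little's Law: average concurrency = average latency x average arrival rate. */
    public static double concurrency(double averageLatencySeconds, double throughputPerSecond) {
        return averageLatencySeconds * throughputPerSecond;
    }

    public static void main(String[] args) {
        // The client from the anecdote: 1 millisecond average latency, 1000 messages per second.
        System.out.println(concurrency(0.001, 1000));
    }
}
```

For the client in the anecdote this comes to about one task in flight on average, which is the point of the story: the measured concurrency did not justify the distributed, multi-threaded design.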

The reason that matters is that, if you've got a high latency, the amount of concurrency you need to achieve the same throughput goes up. If your latency is 10 times higher, then you need to have 10 times as much concurrency to achieve the same throughput. Conversely, if you lower your latencies, you can achieve the same throughput with a much simpler system with lower amounts of concurrency. The reason this is particularly significant for source of truth is often it's not very concurrent. Often the source of truth is single threaded, or has a very low degree of concurrency, because it often is transactional. The order in which events occur are important. Then that becomes something that's essentially single threaded, to a large degree. Exchanges are a very common example of this. In exchanges, you can have trading on lots of different symbols at once. In reality, at the end of the day, usually one symbol dominates all of the others. For example, in fixed income shorts in London, one symbol, which is the 3 months short fixed income bond, has 90% of all volume. It's all the others combined, and only another 10%. In reality, the only thing you have to tune for is this one symbol, which you can't make multi-threaded, so therefore, that needs to be low latency if you're going to achieve high throughput. A lot of exchanges are like this. There may be thousands of symbols, but in reality, at any given moment, there'll be certain symbols that are really dominating all the volume.

Durability Guarantees

One way to look at different types of guarantees is, where do I need to get assurance from in a distributed system to feel like it's guaranteed or it meets my business requirements? Do I need to have a copy on another server or a redundant server, or does it need to be on disk? These are the sorts of decisions that are taken care of for you. If you use a particular database, it will have a particular way of doing this. Broadly speaking, this is a generic problem you're solving. Do I need to have a redundant copy or should it be on disk, or both? What we find though, is that having a redundant copy on a second machine, as long as you have a decent network, it's often a lot faster than forcing it to go to disk. Again, this is supported in the open source. We have a simple interface where you can trigger a synchronous call at any point. It also creates a record in the queue, so the reader knows that at that point a sync was called. You can have confidence at that point, everything up to that point has been written.

This is how much difference it can make to the throughput. The top blue graph is using an M.2 drive. A lot of banks aren't using M.2 drives yet, and SSDs are quite a bit slower, about five times. I imagine the M.2 style drives, or drives that have this performance profile, will be used in banks in the next couple of years. The headline advantage is often that we can do a large number of IOPS, or a high throughput. If your system is dependent on making sure things have been synced to disk, it'll be the time to write that to disk that's most important. That can determine how fast your database can run, for example. Typical latency is almost 2 milliseconds. The 99th percentile is about 20 milliseconds. That's about as good as you can get if you have to write each record to disk. What we advocate, though, is considering doing it programmatically, and not doing it based on the timing. Some products already do it based on timing or count. They'll sync every n messages or every n seconds. The OS says that after 30 seconds, it will get written to disk. You can tune that.

The problem is that's an IT metric. It's not a business metric. From the business point of view, it's not how many megabytes I might lose, it's how many millions I might lose. They only care about the financial risk. They don't care about how much disk space that uses. So you make it programmatic. If I've got a $50 million trade in, I want to know that's synced to disk. If I get a request for quote, and I lose that request for quote, maybe it's not so important because that doesn't have so much value. In this example, I'm assuming that 10 times a second, there's a message that comes in that absolutely has to be synced to disk. You see something like the best of both worlds, where the typical latency is not really impacted by this. Obviously the high percentiles are, but not to the same degree. Even if you occasionally have messages that have to be synced to disk, doing it programmatically based on business risk can give you a more optimal solution.
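One way to picture a risk-based sync policy is a small wrapper that decides, per message, whether to force a sync. This is a sketch under stated assumptions, not the Chronicle Queue interface: the class `RiskBasedSyncSketch`, the `onAppended` method, and the notional threshold are all hypothetical. The sync action is passed in as a `Runnable`, which in a plain Java application could wrap a call such as `FileChannel.force(true)`.

```java
public final class RiskBasedSyncSketch {
    private final long notionalThreshold; // value at risk above which a sync is mandatory
    private final Runnable sync;          // the actual sync-to-disk action, supplied by the caller
    private long syncCount;

    public RiskBasedSyncSketch(long notionalThreshold, Runnable sync) {
        this.notionalThreshold = notionalThreshold;
        this.sync = sync;
    }

    /** Call after a message has been appended; syncs only when the business risk is high. */
    public void onAppended(long notional) {
        if (notional >= notionalThreshold) {
            sync.run();
            syncCount++;
        }
    }

    /** Number of syncs forced so far. */
    public long syncCount() {
        return syncCount;
    }
}
```

With a threshold of, say, 10 million, a request for quote passes straight through while a 50 million trade pays the disk latency. The typical latency stays low and only the messages that carry real financial risk see the sync cost.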

BDD and AI (Generating Data Driven Tests)

"Code should read like a story, not a puzzle," Venkat Subramaniam. We have event driven systems. That's the model that we use. To test them, we use YAML tests in and YAML results out. Each one of these events or messages translates into a method call. It's like an RPC. You get a very simple translation between what appears in our data files and what is generated as output. One of the benefits of this is that it can be manipulated programmatically. For example, if we just compare the output, even a very complex output, against an expected result, it's just a multiline text comparison. We can very quickly see what the differences are. A simple thing to do, if we like the result because this was an intentional change, is to copy the actual result and overwrite the file that contains the expected result.

We can even do that automatically, have the tool do it for us. In behavior driven tests, the style is usually to put what the test is and the expected result together. This can make maintaining them much more difficult, because if you have to go through and alter many tests, hundreds of thousands of tests in some cases, that can be quite tedious. You have to do that manually. There is no easy way to do that programmatically. Whereas if the output is sitting in its own standalone file, the tool can overwrite the output. We've got a system property to do that, which we just call regress tests. Because it's just data, we can manipulate it programmatically through very simple text alterations. We can explore things like, what happens if a field's missing? What happens if a field's wrong? What happens if messages are out of order? You just rearrange them. Then what we can do is say, all of these combinations might be useful but probably aren't. What we do is we filter them and say, if that doesn't result in a new message, a message no other test produces already, then we'll drop it. In this example, the simple test actually has quite a range of possible variations to consider, all generated automatically. The other ones do not. These are the only examples where it generated an original message.
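The text-in, text-out style described above can be sketched in plain Java. This is a simplified illustration, not Chronicle's test framework: the `TextDrivenTest` and `AccountService` names are hypothetical, and the `event: argument` lines stand in for real YAML. The `regress` flag plays the role of the regress tests switch; in practice it could be read from a system property, whose exact name is not given in the talk.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Text-in, text-out test sketch. The "event: argument" lines stand in for real YAML. */
class TextDrivenTest {
    /** Hypothetical service under test; each input event maps to one method call. */
    static class AccountService {
        private final List<String> out;

        AccountService(List<String> out) {
            this.out = out;
        }

        void createAccount(String name) {
            out.add("accountCreated: " + name);
        }

        void closeAccount(String name) {
            out.add("accountClosed: " + name);
        }
    }

    /** Replays each input line as a method call and returns everything the service emitted. */
    static String run(String input) {
        List<String> out = new ArrayList<>();
        AccountService service = new AccountService(out);
        Map<String, Consumer<String>> handlers = Map.of(
                "createAccount", service::createAccount,
                "closeAccount", service::closeAccount);
        for (String raw : input.split("\n")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            int colon = line.indexOf(':');
            String event = colon < 0 ? line : line.substring(0, colon).trim();
            String arg = colon < 0 ? "" : line.substring(colon + 1).trim();
            Consumer<String> handler = handlers.get(event);
            if (handler == null) {
                out.add("unknownEvent: " + event);
            } else {
                handler.accept(arg);
            }
        }
        return String.join("\n", out);
    }

    /** Compares actual output with the expected file, or overwrites the file when regress is true. */
    static boolean check(Path expectedFile, String actual, boolean regress) {
        try {
            if (regress) {
                Files.writeString(expectedFile, actual); // accept the new behaviour as the baseline
                return true;
            }
            return Files.readString(expectedFile).equals(actual);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path expected = Files.createTempFile("expected", ".yaml");
        String actual = run("createAccount: alice\ncloseAccount: alice");
        System.out.println(actual);
        System.out.println("accepted as baseline: " + check(expected, actual, true));
        System.out.println("matches baseline: " + check(expected, actual, false));
    }
}
```

Because the expected output lives in its own file and the check is a plain string comparison, accepting an intentional change is a single file write rather than an edit to every test.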

That's good for exploring all the things that can go wrong. How do I generate more happy path tests? Because that's what the variations are based on: you take a happy path test, you break it and alter it in certain ways, and you look at all the ways things can go wrong. How do I generate a new happy path test? In this case I used ChatGPT-4. It actually did a pretty good job from a template. I said, based on the following example, generate tests to create multiple accounts. It generated another test suite. One of the nice things is that because we use comments in our YAML, the comment gets copied directly to the output. When you're looking at the output, you've got a frame of reference to check each message. It was actually able to update the comments to reflect the message as well, and kept them consistent. In one case, I thought the comment was wrong, but actually I'd made a typo, and it turned out the comment was right. I'd made a mistake. Having it in English as well as in YAML, I was able to pick that up.

I was able to go a step further and I said, can you create 20 more accounts and 20 transfers? It thought, actually that sounds pretty boring, I'll write a script to do it, or at least that's what it did. I used a YAML template, so it generated a template with a for loop in it for creating the accounts, and another for loop for doing the transfers. I was impressed by this because I didn't think of doing that. It came up with the suggestion of giving me the code instead. By using both of those, each one produced another 11 tests with variations. I've gone from 3 basic tests I'd written myself to 55 tests now. I also asked Bard to do it, just for comparison. It was able to do it but it was a lot more work. Sometimes AI will give an answer which is not quite what you expected, so you have to revise it and be more specific in the prompt to get it to do what you want. I found that is generally more the case with Bard. I have to work a lot harder, and it's usually not my first attempt that I'm happy with, but by about the third or fourth attempt, I'm happy with the result. I think certainly at the moment, ChatGPT-4 has got the advantage for this kind of thing.
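The two generated loops might look roughly like the following Java sketch. The talk does not show the actual template, so everything here is an assumption: the `GeneratedScenario` class, the `createAccount` and `transfer` event names, the field names, the starting balance, and the `---` separator between messages.

```java
/** Generates a happy path test input with one loop for accounts and one for transfers. */
class GeneratedScenario {
    /** Builds YAML-like text; the event and field names are illustrative only. */
    static String generate(int accounts, int transfers) {
        if (accounts < 2) {
            throw new IllegalArgumentException("need at least two accounts to transfer between");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i <= accounts; i++) {
            sb.append("# Create account ").append(i).append('\n')
              .append("createAccount: { id: ").append(i).append(", balance: 1000 }\n")
              .append("---\n");
        }
        for (int i = 1; i <= transfers; i++) {
            int from = (i - 1) % accounts + 1; // cycle through the accounts
            int to = i % accounts + 1;         // always the next account, wrapping around
            sb.append("# Move 10 from account ").append(from)
              .append(" to account ").append(to).append('\n')
              .append("transfer: { from: ").append(from)
              .append(", to: ").append(to).append(", amount: 10 }\n")
              .append("---\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(generate(20, 20));
    }
}
```

Each message carries a comment in plain English, so the generated input can still be read and checked message by message.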

Using a couple of other mundane tests to check the DTOs, I was able to achieve a very high degree of code coverage with very little effort. I can see that using AI to generate tests is going to become a more popular approach in general, in particular for maintaining all of the really mundane tests that often don't get created. Taking a more realistic example, this is from an EFX trading system: a BookBuilder, which is a consolidator of market data from different sources. I identified 57 tests. This is not all the tests, but the ones that were suitable for this. When I explored all the combinations, it went up to about 4000, but the problem was it took about 14 minutes instead of 6 seconds. I'm quite pleased with the fact that the services are so lightweight that starting them up and shutting them down is very quick, so you can run through a lot of tests very quickly. We're keen to keep that. After filtering, which looks for messages that are unique to that test and not produced by any other test, I had it down to 426. It's about a factor of 10 reduction. It's a very simple and effective technique for cutting down generated tests. It still only took 34 seconds, which is adequate.
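The filtering step can be sketched as follows. This is one possible reading of the technique, not the actual tool: the `VariationFilter` class, the two kinds of variation (drop a message, swap two neighbours), and the toy system are assumptions made for the example. A generated test is kept only if it produces at least one output message that no earlier test has produced.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

/** Generates broken variations of a happy path test and keeps only those with novel output. */
class VariationFilter {
    /** Variations: each message dropped once, then each adjacent pair swapped once. */
    static List<List<String>> variations(List<String> messages) {
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i < messages.size(); i++) {
            List<String> dropped = new ArrayList<>(messages);
            dropped.remove(i);
            result.add(dropped);
        }
        for (int i = 0; i + 1 < messages.size(); i++) {
            List<String> swapped = new ArrayList<>(messages);
            Collections.swap(swapped, i, i + 1);
            result.add(swapped);
        }
        return result;
    }

    /** Keeps a candidate only if it emits a message not yet in seenOutputs (which is updated). */
    static List<List<String>> keepNovel(List<List<String>> candidates,
                                        Function<List<String>, List<String>> system,
                                        Set<String> seenOutputs) {
        List<List<String>> kept = new ArrayList<>();
        for (List<String> candidate : candidates) {
            boolean novel = false;
            for (String out : system.apply(candidate)) {
                if (seenOutputs.add(out)) {
                    novel = true;
                }
            }
            if (novel) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    /** Toy system, purely for illustration: paying an account that does not exist is rejected. */
    static List<String> toySystem(List<String> messages) {
        List<String> out = new ArrayList<>();
        Set<String> accounts = new HashSet<>();
        for (String m : messages) {
            String[] parts = m.split(" ", 2);
            if (parts[0].equals("create")) {
                accounts.add(parts[1]);
                out.add("created " + parts[1]);
            } else if (accounts.contains(parts[1])) {
                out.add("paid " + parts[1]);
            } else {
                out.add("rejected " + parts[1]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> happyPath = List.of("create A", "pay A");
        Set<String> seen = new HashSet<>(toySystem(happyPath));
        System.out.println(keepNovel(variations(happyPath), VariationFilter::toySystem, seen));
    }
}
```

In the toy run, three variations are generated and one survives: dropping the `create A` message is the only change that produces a message (`rejected A`) the happy path never emits. The filter is greedy, so which of several equivalent variations survives depends on the order in which they are tried.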

Chronicle Channels for RPC

We've also got a remote RPC API, which is again an open source wire, where you can send and receive messages over TCP and, through configuration, switch to using shared memory. In fact, this is the library I used in the previous benchmark, where we were doing 4 billion messages a second. The typical latencies are around 8 microseconds per TCP hop, and about 2 microseconds per shared memory hop. This is the highest-level API, so this is pretty much the worst result we'll get. We've got lots of different optimization choices to go lower level and get more consistent latencies. It doesn't need a big rewrite, you just focus on the things that need tuning. That's our general approach.




Recorded at:

Feb 01, 2024