
Achieving Low-Latency in the Cloud with OSS



Mark Price explores the improvements in cloud networking technology, looks at performance testing and measurement in cloud environments, and outlines techniques for low-latency messaging from an application and operating-system perspective. Finally, he compares the performance of the latest cloud tech with a bare-metal installation.


Mark Price is an experienced technologist, with an enthusiasm for all things performance-related. He writes a popular blog on JVM and Linux topics, and is a regular speaker at international tech conferences. Currently, he works as a Performance Engineering Specialist at Aitu Software.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Price: My name is Mark [Price]. I'm here to talk to you about achieving low-latency in distributed systems and cloud deployments with open-source software. I've been working on high-throughput, low-latency systems for about 15 years, most recently with a company called Adaptive. The project we've been working on is to build a low-latency trading exchange for one of Adaptive's clients. During the process of developing this system, we thought we would evaluate what kind of latencies we would see if we deployed the system to the cloud rather than to a bare-metal installation in a data center. In terms of the latencies we're talking about, we want to achieve double-digit microseconds at on the order of tens of thousands of requests per second.

Properties of Financial Systems

To start with, I'd like to talk about the properties that we would like our financial systems to exhibit. First off, we want it to be fast. Fast is a moving target, or a sliding scale. If we're talking about a system that needs to be fast from a human perspective, it might be ok for that system to respond within 10 milliseconds. If we've got screen render time to take into account, or a network connection, possibly over a mobile network, 10 milliseconds might be fine for our application. If we want our service to be fast from a machine's perspective, in terms of high-frequency trading algorithms, then 10 milliseconds is far too long, and what we really want to do is keep our response times clamped to some tens of microseconds.

To do this, we need to take a slightly different approach to system design. We can't really use things like databases because we wouldn't get consistent enough response times. We tend to keep our working set in memory, do no IO on the fast path, and use other techniques like lock-free programming. We also want our systems to be durable. If I shut the system down to perform some maintenance operation, or the system crashes due to a machine fault, I should be able to bring the system back up and I shouldn't lose any data. It goes without saying, especially in financial systems, you don't want to be losing people's money, or any record that it ever existed. We also want our systems to be redundant. If I have a particular instance of a service that fails, it should be possible to bring in another instance which will take its place, and the system as a whole will continue to function. The system that we've been working on follows these design principles in order to operate at the speed that we require.

Exchange Architecture

I'm going to talk about a reference architecture - this is the financial trading exchange that I've been working on - just as a way of explaining our system design. In this case, we have clients who are connecting into the system, and they will be connecting to gateways, which are horizontally scalable, stateless services. The gateways will forward requests to a cluster, specifically the leader node of the cluster, which will then perform some processing - this is where the actual business logic lives - and then respond back to the gateway, where the response will be sent back to the client.

The platform that ties all this together is a product that Adaptive has been working on called Hydra. It's built on top of some high-performance open-source libraries, namely Aeron for transport and clustering and Artio for the FIX gateway. What we've tried to do with Hydra is make an opinionated framework which you can use to accelerate application development. If your problem domain is suited to this architecture and you want it to be fast and you can code it in the right way, then the platform should help you achieve that quite quickly, and it adds a whole load of features on top of the open-source libraries we use.

To explain the resiliency side of it: the gateways are stateless and horizontally scalable, so if they die, we can provision a new one. We have resiliency in the cluster, and we also have this other concept called persistence. Because we keep all our data in memory, we need to bound that to some degree. In a trading exchange we're basically dealing with orders to sell an asset or to buy an asset. Within the cluster lives the main business logic, which is a matching engine; it matches up people who want to buy something with people who want to sell something, and generates trades.

We need to store a record of all the orders that we've ever processed, but we can't keep that in memory because that would grow potentially unbounded and exhaust the physical memory on the machine. We have other components called persisters which are listening to the outputs of the cluster and writing relevant information to their own local instances of a database.

We also have resiliency here in that we can run multiple instances of these persisters. They all have the same source data and they all write to their own local database, so we don't have to have a clustered database, which would add complexity and potential latency. We're going to revisit this picture a few times, but that's pretty much the schematic of it. We're dealing with the FIX protocol in this particular application, which is a fairly verbose ASCII-based protocol. One of the things the gateways will do is take this large text message, which might be about 150 bytes, and convert it to something much more efficient, like, say, a roughly 50-byte binary message.
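As a sketch of that conversion, a fixed-layout binary encoding packs the fields of an order into a handful of primitive values. The field set, widths, and layout below are hypothetical, purely to illustrate the size difference against a ~150-byte tag=value FIX string; this is not the exchange's actual wire format.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical fixed-layout binary encoding of an order, replacing a verbose
// ASCII tag=value FIX message such as
// "35=D|11=ORD123|55=EURUSD|54=1|38=1000000|44=1.0925|..." (~150 bytes).
class OrderCodec {
    static ByteBuffer encode(long orderId, int instrumentId, byte side,
                             long quantity, long priceE6) {
        ByteBuffer buf = ByteBuffer.allocate(29).order(ByteOrder.LITTLE_ENDIAN);
        buf.putLong(orderId);        // 8 bytes
        buf.putInt(instrumentId);    // 4 bytes, symbol mapped to an int id
        buf.put(side);               // 1 byte: 1 = buy, 2 = sell
        buf.putLong(quantity);       // 8 bytes
        buf.putLong(priceE6);        // 8 bytes, price scaled by 10^6
        buf.flip();
        return buf;
    }
}
```

A fixed layout also means downstream components can read fields at known offsets, with no text parsing at all.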

Durability and Redundancy

The way that the cluster components achieve their durability and redundancy is by using an append-only log. This is a concept that is also present in database systems, where we have a write-ahead log. The cluster, when it starts up, will elect a leader node, and the leader will then essentially start listening to an ingress channel that clients can talk to - in this case, the clients of the cluster are the gateways. When the leader receives a message from a client, it will, first off, append it to its append-only log, and the leader will then transmit that message out to its follower nodes. The follower nodes will receive the message and append it to their own append-only logs, and then they'll ack back to the leader saying that they've received that message. When the leader receives the acknowledgments, it can then decide to advance what's called the commit position in the log, and then the application itself can process that message. Also, in parallel, it will be sending a message out to the followers to say that they can process the message too.

We use this technique to make sure that by the time the application processes a message, it has been received and appended to the log of a majority of nodes in the cluster, and to make sure that the followers are gated on the leader receiving their acks, so we know that a follower can't go ahead of the leader node. This gives us durability, because if we now suffer a system crash, or we stop the system to do maintenance, what we can do is just replay the log into the business logic and we should get back to the same state that the system was in before it was stopped. This has the caveat that our business logic needs to be fully deterministic: if we apply the same messages to an empty state of the business logic, it should always end up in the same state. We can't have any external inputs into the system that aren't a message in the log. For instance, we can't refer to the current system time anywhere in our business logic, because in a replay scenario this would be a different value and then our state would be different after replaying the data; or indeed, you might have a different value for the clock on your follower nodes, so your business logic state would start to diverge, which is not a good place to be in a cluster.
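The leader-side commit rule described above can be sketched as follows. This is an illustration of the majority-ack idea only - Aeron Cluster's real consensus module is considerably more involved - and the class and method names here are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: a message may only be processed once the leader plus enough
// followers (a majority of the cluster) have appended it to their logs.
class LeaderLog {
    private final List<byte[]> log = new ArrayList<>();
    private final int clusterSize;            // leader + followers
    private final int[] followerAckPositions; // highest position acked per follower
    private long commitPosition = -1;         // index of last committed entry

    LeaderLog(int clusterSize) {
        this.clusterSize = clusterSize;
        this.followerAckPositions = new int[clusterSize - 1];
        Arrays.fill(followerAckPositions, -1);
    }

    // Leader appends locally first; replication to followers happens elsewhere.
    long append(byte[] message) {
        log.add(message);
        return log.size() - 1;
    }

    // Called when follower f acknowledges having appended up to position p.
    void onFollowerAck(int f, int p) {
        followerAckPositions[f] = Math.max(followerAckPositions[f], p);
        advanceCommitPosition();
    }

    private void advanceCommitPosition() {
        // Find the highest position appended by a majority; the leader itself counts.
        for (long candidate = log.size() - 1; candidate > commitPosition; candidate--) {
            int acks = 1; // the leader
            for (int p : followerAckPositions) {
                if (p >= candidate) acks++;
            }
            if (acks > clusterSize / 2) {
                commitPosition = candidate;
                break;
            }
        }
    }

    long commitPosition() { return commitPosition; }
}
```

The application only processes entries up to `commitPosition`, which is what guarantees a committed message survives the loss of any minority of nodes.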

Given deterministic business logic and this append-to-the-log-and-ack protocol, we've got our durability. In an exchange where we're receiving maybe 20,000 requests per second, we might be growing this log by tens of gigabytes a day. When we want to replay the log after stopping, it might take us, say, 15 minutes to replay a day's worth of log. That's probably ok from an operational perspective - if there's a little bit of human involvement in there and you need to provision a new machine, then possibly 15 minutes is an ok recovery time. If we have a month of log that we have to replay, then our recovery time is about six hours. I don't think anyone wants to wait six hours for their services to come back up.

We pair the append-only log with the concept of snapshotting. Snapshots are an immutable record of your application state. They must also be deterministic, such that if the system requests your service to take a snapshot, it can write out the entirety of its in-memory state, and it should also be able to rehydrate its in-memory state when given a snapshot. To keep this uniform across the cluster, the message to snapshot is actually something that's in the log as well, so the leader and all the follower nodes will be snapshotting at exactly the same point. Given deterministic state and deterministic algorithms, they should all have the same snapshot content.
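A minimal sketch of what deterministic snapshot/rehydrate means for a service: it can write out the entirety of its in-memory state, and a fresh instance restored from those bytes must be indistinguishable from the original. The service and its two fields are hypothetical.

```java
import java.nio.ByteBuffer;

// Sketch of deterministic snapshotting: all state is reachable from the
// service, and loadSnapshot(takeSnapshot()) reproduces it exactly.
class CounterService {
    private long totalOrders;
    private long totalQuantity;

    // Deterministic business logic: state depends only on logged inputs.
    void onOrder(long quantity) {
        totalOrders++;
        totalQuantity += quantity;
    }

    // Write the entire in-memory state out as an immutable snapshot.
    ByteBuffer takeSnapshot() {
        ByteBuffer snapshot = ByteBuffer.allocate(16);
        snapshot.putLong(totalOrders).putLong(totalQuantity);
        snapshot.flip();
        return snapshot;
    }

    // Rehydrate the state from a snapshot; replay of the log continues from here.
    void loadSnapshot(ByteBuffer snapshot) {
        totalOrders = snapshot.getLong();
        totalQuantity = snapshot.getLong();
    }

    long totalQuantity() { return totalQuantity; }
}
```

Recovery then becomes: load the most recent snapshot, and replay only the log entries recorded after it, which is what bounds the recovery time.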

In order to keep things consistent, we need to do the snapshotting on the same thread that's actually processing the business logic. We can end up in a situation where we induce a small latency hit when doing a snapshot, so we choose to do snapshotting at a time of day when markets are quiet, in financial-market terms. There is another option, which isn't implemented but which we've talked about, where we could just do the snapshotting on the follower nodes, so that the leader does not see that latency hit.

Given logging - or appending to a log; it's a slightly overloaded term, but this isn't an application log - and snapshotting, we get to the point where our cluster is durable and we have bounded recovery times. Because we have multiple nodes in the cluster, all of which are deterministic in terms of their logic and are replaying exactly the same messages into that logic, we know that if one node was to die, then we would be able to switch over to another node, it would be elected leader, and the system as a whole could continue processing without any loss of service and without any loss of data. The important thing about the design of the system is that we can only really keep things fast if we're making sure that we are consistent and that our business logic is done in - I want to say, a thread-safe manner, although all processing is done on a single thread and always on the same inputs - and that means that our systems can cope with failure in a transparent fashion.


Given a well-designed system that is durable and redundant, how do we optimize for the lowest possible processing times? Here I've grayed out the parts that aren't really important. We care about receiving a FIX message from a client at the gateway, converting it into a binary message, which is more compact, sending it into the cluster, which will then replicate it out to its followers, and then responding back via the gateway. We're trying to optimize for this path. The high-frequency traders and market makers who we want to be pricing into the financial markets will want to be doing this many thousands of times a second with very predictable and low latencies. This is what we're optimizing for. To a large extent, the grayed-out portions at the bottom - the persisters - are more of a throughput-driven target; as long as they can keep up with burstiness from the cluster, then they are ok.

When we are processing a packet, what's happening is that our gateway network card is receiving a piece of data from a cable. It might then be firing an interrupt, which is serviced by an operating system kernel thread, which will copy the data from the receive buffer of the card into a socket buffer. Your application is then polling that socket and accessing the data in the socket buffer, doing some work on it. In the case of the gateway, it's going to be parsing that data from the textual representation and creating a new object to send to the cluster. When it does a send, it's writing to a socket, which is copying the data into a socket buffer, possibly notifying the operating system via an interrupt that some data is going to be there; then a kernel thread is going to wake up and copy that data back out into the network card's output buffer. This is happening every time we have a network call. Because we need to distribute these messages to the follower nodes before we can process them, we've got four network hops at a minimum that we have to go through here.

Whether we're using a cloud deployment or a normal bare-metal deployment, we want to place our systems as close to each other as possible. The names for these things differ. One particular cloud provider allows you to have something called placement groups. You can have cluster placement groups, which is effectively a guarantee - that might be a bit of a strong term - an indication that the machines within that placement group are going to be very close to each other in terms of the topology of the network that they share. If they are in the same cluster placement group, then you get guarantees about the amount of bandwidth that you can have between those nodes. I think of it as analogous to those machines being spun up in a single rack with a top-of-rack switch. You get the minimal latency between them.

If you're trying to deploy a system such as this in the cloud, where you want inter-node latency to be as low as possible, you want to use placement groups to keep the nodes nice and close together. If we're not deploying to the cloud - and in this particular project, it will be a hosted data center - we would use kernel-bypass technologies. You end up paying a bit more for your network hardware, your cards, and along with them you use a user-space networking library, so you end up cutting out the thread hop and the packet copy that would normally happen in the kernel-to-userland handoff.

What happens when you're in your application, in userland, and you are polling a socket? Instead of polling a socket buffer which is being copied into by a kernel thread, you're just polling the receive buffer on the network card. It means that you can do things faster and with less scheduler jitter, essentially. We can't necessarily always do that in the cloud, depending on which technologies you're using. The best thing we can do is use placement groups to get things close together. The tests that we ended up running as part of this project did that.
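That polling style can be sketched as a busy-spin duty cycle: rather than blocking in the kernel, the application thread repeatedly polls a receive buffer and only hints to the CPU when there is no work. The `Receiver` and `MessageHandler` interfaces here are stand-ins for whatever the networking library actually provides; this is a pattern sketch, not a specific library's API.

```java
// A receive source we can poll without blocking (e.g. a ring buffer fed by a
// network card); returns the number of messages handed to the handler.
interface Receiver { int poll(MessageHandler handler, int limit); }
interface MessageHandler { void onMessage(byte[] data); }

final class DutyCycle {
    // Busy-spin loop: poll for work; when idle, hint to the CPU rather than
    // making a kernel call that would add scheduler jitter.
    static void run(Receiver receiver, MessageHandler handler,
                    java.util.concurrent.atomic.AtomicBoolean running) {
        while (running.get()) {
            int workDone = receiver.poll(handler, 10);
            if (workDone == 0) {
                Thread.onSpinWait(); // no sleep, no syscall
            }
        }
    }
}
```

The trade-off is a core pinned at 100% in exchange for never paying a wake-up latency, which is why this pairs naturally with isolated CPUs.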

Even if we can reduce the distance between our programs in terms of the network copy path, we still need to be kind to the computers. We tend to just treat them as workhorses that will brute-force everything, but in the case of processing this FIX message, we could convert the whole thing to a string and then use regex to cut it up into bits, because it's a delimited format. Then we could create a new object representing the binary form of that message and then send it off to the cluster. All of that would induce cache misses and garbage collections at some point in the future. All of this would put jitter into our system. What we're trying to do is keep our response time latency as low, and as consistently low, as possible. We ended up using techniques like object reuse, essentially.

We're just trying to make sure that we're not unnecessarily churning through memory when we don't have to, because every allocation, if it's not stack-allocated, will be a cache miss. What we want to do is try and reduce the work we're doing inside each logical box - each service - as well as the work in between them. We need to have our systems built in such a way that they are easily replaceable when they fail, but can communicate at speed, and these are the techniques we can use.
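As one concrete example of avoiding that churn, an integer field can be parsed straight out of the inbound byte buffer, with no intermediate String, regex, or boxed object allocated on the fast path. This is a simplified sketch of the technique, not Artio's actual codec.

```java
final class AsciiParser {
    // Parses a non-negative ASCII integer from buffer[offset, offset + length)
    // without allocating anything, so the fast path creates no garbage.
    static long parseLong(byte[] buffer, int offset, int length) {
        long value = 0;
        for (int i = offset; i < offset + length; i++) {
            value = value * 10 + (buffer[i] - '0');
        }
        return value;
    }
}
```

Paired with a single reused, mutable flyweight object that is overwritten for each inbound message, this keeps allocation - and therefore cache misses and future GC pauses - off the hot path.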

Testing Latency

If we can get to that point, then we want to start testing latency. This is essentially the process of sending requests into the system, waiting for the responses to come back out, and timing how long they took. It's not rocket science. What we want to be doing, though, is constantly doing this as part of our continuous delivery pipeline. When I'm starting on a project like this, as soon as I've got two services that can talk to each other, that's the point at which I'll start writing my testing harness. It tells you a lot about how your system behaves, even from the word go: you can start applying load to it and seeing how it falls over and how it fails.

As the system gets built out a bit more, we can put this into the CD pipeline, so that as we release new software, we're constantly executing load against it, measuring what results we get from it, and analyzing those results along with any profiling data that we collect. Then if there are any hints or signals in there, we can make modifications to improve the performance and reduce the latency. Then, of course, we need to make sure that the changes we made have actually done the right thing, and that we haven't induced some perturbation elsewhere in the system, which can happen.

When we're doing this kind of work, this is a really good place to start thinking about the monitoring and profiling tools that we want to use to understand our application behavior in the performance test environment. It's worth spending the time here to figure out how the system falls over, what its failure modes are, and also how to recover from them. It gives you a warm, fuzzy feeling if you know that your system falls over at 50,000 requests per second, that the first component to break will be X, and that when it does, you can recover it by process Y. These things are good to find out before you find them out in production. Any tooling around this testing and monitoring that we do pre-production is worth polishing and turning into production-quality tools which we can release with the software.

Measurement in an Ideal World

The process of measuring latency is, facetiously, just: fire in a request and time how long it takes to get the response back. It can be slightly more nuanced, because what you don't want to do is affect the system under test in any way, which is a drawback of some of the techniques you might use. When I say an ideal world, I'm talking about a physical deployment in a data center where I can walk up to the rack, I know what switch it is, and I can use port mirroring on the switch or install fiber taps on the network cards of the actual machines that I want to monitor. I might go and install fiber taps on the gateway ports as they send messages back out to the clients and receive messages from them. Then I can write a processing application which is listening to those streams, and I will be able to get very accurate and precise nanosecond-level timestamps from the network hardware, and that will help us to get a very accurate picture of what the system latency is. All we need to do in this application is figure out how to match up an inbound request with an outbound response, and then aggregate that up over time.

Because it's all out-of-band and all done in hardware, this is quite nice. It means that the processing app you're using to calculate latency can fall behind in burst scenarios, and you're not going to have any unwanted effect on the application that you're measuring. You just need a large enough buffer so that when you do get bursts, you don't lose any data. Obviously, we don't have the ability to go and install fiber taps in a cloud deployment - it goes without saying. There are some commercial solutions that you can use. I saw one the other day that is using, I think, IP forwarding in the cloud provider's networking stack to do this port mirroring for you. I don't know what the actual overhead is - I haven't really looked into it - but we took a slightly different approach.

One other option: you could use something like libpcap, which is the library that tcpdump uses. You could have your own software tap, essentially, on each of the boxes that you cared about. The problem with this is that you end up taking processing or compute resources away from the application that you're actually trying to run, and you have additional packet-copying costs, so you could end up degrading system performance just in trying to measure it, and this is something you definitely want to avoid.

Measurement in the Cloud

The way we go about fixing this in our cloud environment is by using a load-testing tool which is also a measurement client. What we're essentially doing here is taking a current timestamp from the system which is running the measurement client. You want the most accurate and precise timestamp you can get. This would be System.nanoTime in Java, which is basically just a read of the TSC counter. We stash that in the request message that goes through the system. That's fairly easy in the FIX world, because you get to set a client order ID for every request that goes in, and that's automatically echoed back out to you as part of the protocol. When we receive the response, we just look at what the timestamp is in the message, subtract it from the current time, and that gives us our end-to-end latency. Then we're storing each of these latencies into a histogram structure, so that we can make sure we capture all the samples, and it will give us a nice statistical view of what the system latency is.
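The round trip described above can be sketched as follows: stash the send timestamp in the client order ID (FIX tag 11), which the protocol echoes back, and recover it on the response. The class and the string-based encoding are illustrative only - a real measurement client would avoid the String allocation, for exactly the reasons discussed below.

```java
// Sketch of end-to-end latency measurement via the FIX client order ID.
// Simplified: a production client would encode without allocating.
final class LatencyClock {
    // On send: embed the nanosecond timestamp (e.g. from System.nanoTime)
    // in the client order ID so the system carries it for us.
    static String newClOrdId(long sendTimeNanos) {
        return "T" + sendTimeNanos;
    }

    // On receive: recover the send time from the echoed ID and subtract
    // from the current time to get round-trip latency.
    static long latencyNanos(String clOrdId, long receiveTimeNanos) {
        long sendTimeNanos = Long.parseLong(clOrdId.substring(1));
        return receiveTimeNanos - sendTimeNanos;
    }
}
```

Each computed latency is then recorded into a histogram (HdrHistogram in this system) rather than logged individually.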

Because our component which is delivering load is now also measuring latency, it needs to be just as highly tuned and well-designed as the system under test, if not more so. You can imagine that maybe we use some off-the-shelf load generator tool, but it allocates some objects during its operation. Our system under test might always respond within 200 microseconds, but if the measurement client happens to do a five-millisecond garbage collection pause, let's say, during the time that a request is in flight to the system under test, then it will end up recording a latency of some five milliseconds or so. This would be completely inaccurate. All the monitoring and profiling that we do of the actual system we're trying to test needs to be applied to this measurement tool as well. You have to be very careful that you're not measuring stalls in the measurement client itself.

One of the nice things about having such a highly-tuned tool for testing and measuring latency is that we can then package it up and push it into production. If this tool is used for getting an external view of latency in your testing environment, there's no reason why you can't run it in production, assuming you have decent data isolation in your actual system, and give yourself a live view that's more client-centric than any monitoring that you can run within the system itself. It's another case of tooling which is useful in performance testing, and which, with a little bit more work, can be used in production to give you useful information.

Latency Measurements

In order to try and apply enough load to the system to really stress it, we need to scale out the number of measurement clients or load generators that we're using. This also helps to simulate having hundreds or thousands of sessions connected. What do we actually want to record - or report, more importantly? We might say that we want to report the average latency of our system. If you've ever seen other people's talks about performance, you know the average is generally frowned upon as being a not-very-good metric. We could say, "Yes, the average latency of our system is 100 microseconds," but that might be hiding some shocking 10-millisecond outliers, and also the average doesn't necessarily tell us what time period we're talking about.

A better way of expressing the latency that we expect and want to see is by talking about percentiles. We might want to say that 99% of requests will be serviced within 100 microseconds, 99.99% of requests will be serviced in under 200 microseconds, and the max would be 500 microseconds over the course of a day. Given that kind of definition of the latency profile that's required, or that we're attempting to achieve, it's very easy to say whether we've achieved it or not, or whether we need to do more work. There's no hiding place. Certainly, in this business domain, where your customers are expecting very consistent low latencies, telling them that they may have had a bad day and lost money but the average latency was fine isn't really going to wash. You need to be able to put some limits on it with better, more precise phraseology, should we say.

We use HdrHistogram to store our measurements, which is another nice open-source library. It allows you to store your latency measurements at a configurable precision, and it has a very nice compact binary format. Also, you can sum these data structures together, which is useful when you have multiple different measurement sources. If I have a 99th percentile of 100 micros in my first measurement client, a 99th percentile of 150 micros in my second client, and a 99th percentile of 1.5 milliseconds in my third client, what's the overall 99th percentile? It's not the sum of the three divided by three - that's just a meaningless number.

The nice thing about HdrHistogram is that we can add all those up, essentially, because it records every single sample, and then come up with an accurate number for what the system-wide 99th percentile was. In our testing environments, we're storing all this into a database for posterity and longer-term analysis. As we are continuously doing performance runs, we'll have a time-series database where we're storing the percentiles just so we can chart them and look for regressions, but we also store the source histogram data as well, so that we can use it later for analysis.
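The reason summing histograms works where averaging percentiles doesn't is that each client's histogram is just a set of per-bucket counts; adding the counts bucket by bucket (which is what HdrHistogram's add() does, with far cleverer bucketing) reconstructs the true combined distribution, from which any percentile can then be read. A simplified fixed-bucket sketch of the idea:

```java
import java.util.Arrays;

// Simplified histogram: one bucket per microsecond up to 10ms.
// Illustrates why histograms can be merged but percentiles cannot.
final class MicrosHistogram {
    final long[] counts = new long[10_000];

    void record(int micros) {
        counts[Math.min(micros, counts.length - 1)]++;
    }

    // Merging is just bucket-wise addition of counts.
    void add(MicrosHistogram other) {
        for (int i = 0; i < counts.length; i++) {
            counts[i] += other.counts[i];
        }
    }

    // Walk the buckets until we've covered p percent of all samples.
    long percentile(double p) {
        long total = Arrays.stream(counts).sum();
        long target = (long) Math.ceil(total * p / 100.0);
        long seen = 0;
        for (int i = 0; i < counts.length; i++) {
            seen += counts[i];
            if (seen >= target) return i;
        }
        return counts.length - 1;
    }
}
```

With two clients recording 100 samples at 100us and 100 samples at 1000us respectively, the merged 50th percentile is 100us and the merged 99th is 1000us - numbers no arithmetic on the two clients' own percentiles would give you.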

Along with the actual system latencies, we are monitoring, both on the measurement clients and on all the systems that partake in this message flow, any jitter or hiccups on the system, down to things like safepoint pauses or similar. This helps to correlate any latency outliers, because it's usually something like: we've accidentally started allocating some memory, so we've got a garbage collection pause, and that kind of thing is much easier to nail down when you've got a record of all these hiccups.

Cloud Tenancy

We have a fast system, and we can measure it. How do we go about deploying it? There's a fair bit on this slide, so I'll unpack it. When we deploy to the cloud, we're not really in control of what resources we have or what we're running on. What we have here is a schematic of a server-class machine. We've got a network card down the bottom, which is shared by all the VMs on that machine. It's a two-socket system, so it has two NUMA nodes, two banks of RAM, and two L3 caches, and there's a bunch of physical CPU cores which share each L3 cache. Then on top of that, you've got your virtual machines, each of which has some number of CPUs assigned to it.

Our special snowflake application is on the left here. We want it to always have CPU available; we want it to be nice and responsive. Then maybe we're unlucky, and someone has decided to deploy a Bitcoin mining program onto the virtual machine that just happens to be hosted next to us. Let's say this is a very CPU-heavy workload: we might see thrashing of the L3 cache that is being shared between those virtual machines. We could experience higher rates of cache miss than we normally would. We might get CPU throttling effects. If this red box is constantly hammering those CPUs, then we are unlikely to ever enter turbo mode on our CPUs.

The problem with this is, it's very hard to determine whether this is the case from within the context of your virtual machine. You can't really see what's happening on the rest of the box, so you might not even be aware that this is happening. There is a solution to this problem, which I like to term "buy the whole box". It's not terribly efficient, of course, but what we can do - and what we did for these tests, because we wanted to ensure that we weren't getting any jitter from other virtual machines - is go and look at the largest, widest box that I can buy or rent from my cloud provider and look at the CPU model; maybe it's an Intel Xeon something-or-other with 96 CPUs. Then I can go to the vendor's website and confirm that that chip model's highest core count is 96 CPUs. If I buy that whole box, I can be fairly sure that there aren't any other virtual machines running on it, no one is thrashing the cache, and I have the whole thing to myself. If we really care about latency that much, these are the extremes we'd want to go to to make sure that we are getting all the CPU resources to ourselves.

It's not cheap. The machines that we were using were, I think, $5 an hour, so that's about $43,000 a year. At this point, you're really into the question of whether it's worth just buying your own hardware and installing it in a data center. I don't have those numbers to hand, but I'm sure it would pay for itself in not too many years. Also, it's wasteful. One of the ways we would get around this problem with the services we're designing is that we would pack non-latency-sensitive processes onto the other NUMA node, where we weren't running our high-performance software. If we really want to, we can buy the whole box, and then we don't have to worry about having noisy neighbors.


I've pinched this slide from one of Brendan Gregg's talks. I think it's a really good visualization of the improvements that have been made in these virtualized environments, where we've gone from, a few years ago, having things like the networking stack - the virtualization of networking - done in software, to the point now where everything is done either in custom silicon or offloaded to something else, so that we're not really taking any resources away from your application, or in fact from the machine that you're renting, in order to do the virtualization. The key point here is that the second-to-bottom line says the bare-metal systems that were brought out a couple of years ago are supposed to be equivalent, in terms of virtualization overhead, to having no virtualization whatsoever. The system performance should be comparable to that of bare metal.

This is really what I wanted to test when we came up with this idea of running representative workloads and seeing how things stand up. It also informs the machines you should be buying or renting if you care about latency. The tests we ran were based on m5d.metal instances, which are as close to the metal as you can get on the Nitro system. Obviously, if you really care about CPU and network, you need to make sure that you're getting the fastest and lowest-overhead machines you can get.

Minimising Latency

We need to do a few other things to make sure that our systems are fast. There's quite a lot on the slide, so I'll unpack it a little bit. What we've got is like a microcosm of the larger system. In the larger system, we've got machines which are talking to each other. Then when we zoom into one single machine which would be running a cluster node, we've got a whole load of other little components which are actually passing messages to each other, but instead of using their network to pass messages to each other, they're using the L3 cache.

When we're tuning the operating system and the machine for lowest latency, what we want to do is isolate some CPUs from the operating system and run our application on them. Then we turn off hyper-threading. When you look at your system monitor and see how many CPUs your system has got, it's actually telling you how many hardware threads you've got; on most Intel systems, the number of physical cores is half that, and the hyper-threads share the L1 and L2 cache of a physical core. What we don't want is for our application to stall because its cache is being invalidated by another thread. So we switch off hyper-threading, which means that each of our important threads now gets its own whole physical core and all of the L1 and L2 cache on that core.
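As a rough sketch of that isolation step (the core range 2-5 is an assumption; the right choice depends on your machine's topology), the relevant Linux knobs look something like this:

```shell
# Sketch: reserve cores 2-5 for the application (range is illustrative).
# isolcpus keeps the scheduler off those cores; nohz_full and rcu_nocbs
# reduce kernel housekeeping work on them.
ISOLATED="2-5"
echo "isolcpus=${ISOLATED} nohz_full=${ISOLATED} rcu_nocbs=${ISOLATED}"
# Append the printed string to GRUB_CMDLINE_LINUX and reboot for it to apply.
# Hyper-threading can be turned off at runtime (as root, Linux 4.19+) with:
#   echo off > /sys/devices/system/cpu/smt/control
```

After a reboot, `cat /sys/devices/system/cpu/isolated` shows which cores the scheduler is leaving alone.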

When we're running our core business logic in the heart of the cluster, we have a number of components. First of all, we've got the media driver, which does the network send and receive. We've got the archive component, which is responsible for taking inbound messages and appending them to the log that we've already talked about. The consensus module is polling for inbound data and doing the write to the log. The media driver then copies messages back out to the followers and waits for the acks to come back in, and finally our application can process the message. Importantly, all the data structures that we're using are very nice: either reusable circular buffers or very predictable append-only structures, which CPUs, page caches, and disk subsystems are very good at doing predictive loads on.

If we're lucky and everything is working harmoniously and we don't have anyone else thrashing through our cache, what we should get is this really nice behavior where we receive a message into the L3 cache from the network card, it gets copied to the log and appended there, then copied out to the follower nodes, and the application is able to process it from the same place. There's a minimal amount of copying going on. Because all these threads are located on the same NUMA node and share the same L3 cache, all the communication between them should just go through the cache. That helps to keep things a lot faster, because we don't pay main-memory access latencies.

The important thing is that we can't just think of our systems as a closed black box. There are always messages being passed between something. At a lower level there are messages being passed between the L1 and L3 caches, but that's a bit further than I need to go. For us, the other bit of tuning we do is to make sure that each of the threads we care about has its affinity set so that it always runs on the same physical CPU core. That means that if it's going to use a data structure it has just accessed, the data is going to be in its L1 cache, rather than the thread having been moved elsewhere by the operating system and then woken up on a core with a cold cache.
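The pinning itself can be done with `taskset`; here is a minimal sketch that prints the commands rather than running them, since the binary names and core numbers are hypothetical:

```shell
# Sketch: one isolated core per latency-critical component
# (component names and core numbers are made up for illustration).
for spec in "2:media-driver" "3:archive" "4:consensus-module" "5:app"; do
  core=${spec%%:*}
  proc=${spec#*:}
  echo "taskset -c ${core} ./${proc}"
done
```

Running a process under `taskset -c 2` keeps it on core 2, so its working set stays warm in that core's private caches instead of being dragged to whichever core the scheduler picks next.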

Low Latency?

Coming to some results, I should start with the caveat that this is an extreme baseline workload. What we tried to do was measure the lowest possible latency we could get going through a system built on this platform, with a FIX client sending messages through the gateway, round-tripping through the cluster, and back out again. Any business logic that we layer on top of this would be extra. We're doing the absolute minimal amount of work and there's nothing complex in it, though we would hope that the business logic we add contributes fairly little overhead on top.

In terms of results, we saw 5,000 requests per second. It's not a particularly high number, but it was just a number we picked out of the air. We got a three-nines latency of 121 micros through the cluster - this is on the cloud deployment - four nines of 223 micros, and a max of just under 600 micros, which I thought was incredibly good, actually, considering my preconceptions about what virtualization would add in terms of overhead. On our lab hardware, which we used as a baseline, without the overheads of using the kernel network stack we saw a max latency of 342 microseconds. This is the big difference between using the kernel network stack and using kernel bypass. The cloud is still pretty good at only about two times slower and well within a millisecond, which is pretty awesome.
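For illustration, a "three nines" figure like the one above can be pulled out of a flat file of latency samples with standard tools (the real system uses HdrHistogram for this; the sample values below are made up):

```shell
# Toy data: one latency sample (microseconds) per line.
printf '%s\n' 100 110 120 121 223 600 > samples.txt
N=$(wc -l < samples.txt)
# 99.9th percentile index: ceiling(N * 0.999), done in integer arithmetic.
IDX=$(( (N * 999 + 999) / 1000 ))
sort -n samples.txt | sed -n "${IDX}p"   # with only 6 samples, this is the max: 600
```

With realistic sample counts (millions of requests) the 99.9th-percentile line sits well inside the file, which is why the tail percentiles, not the mean, are what these systems are judged on.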

RTT Latency

For those who prefer to consume the numbers as a diagram: the green line is our lab hardware using kernel bypass. You can see it's much quicker. It's by far the lowest latency, mainly because a packet copy and a couple of thread hops have been taken out. The blue line is the same tests on the same hardware using the operating system's kernel network stack. We can see that it's a bit more expensive. At the top end, you get the extra jitter associated with sometimes having to wake up a thread via an interrupt or something similar, and the extra overhead of copying packets between socket buffers.

The purple line, the cloud deployment, is very obviously a bit slower in all cases. It's a bit hard to draw any real conclusions about the difference between the blue and purple lines. In our lab, we're able to fiddle with the BIOS of the system, and we're running a different kernel and a different operating system, so it's not fully apples to apples. It might be that there's work that could be done to bring the cloud results in a lot lower. Also, we can get away with fiddling with things like the BIOS because if we break the system, we can always just go and restart it, whereas in the cloud that's not so easy. We can see that the main problem when you're not using kernel bypass is the jitter at the top end. These results were also pretty good; I was very happy with them. As for the actual client we're doing this for, they weren't willing to accept two times the latency for cloud deployments, which is fair enough.

The Future

In terms of the future, this is interesting. The hardware will continue to improve, I think, and the cloud providers will continue to do more stuff in hardware and try and reduce the overhead of whatever virtualization there is there. Operating systems are always trying to improve, so there's always work going on in the kernel to try and increase the throughput and reduce the latency of the packet processing path. I think the interesting point will be the interaction of the transport libraries that we rely on and how they interact with technologies available in the cloud.

Certainly, two of the largest cloud providers already give you access to Intel's DPDK, which is a kernel-bypass technology allowing you to get around that packet copy, but the problem is it doesn't have a nice sockets API. If the library authors we rely on for our network transport were to implement DPDK support, then we would effectively get a free lunch: we could flip a switch and suddenly start using kernel bypass in the cloud, and then we would probably see cloud latency drop below the level of our lab hardware's bare-metal latency. I think that will be a very nice thing to have when we get there. We'll see when that comes to pass. Essentially, the cost to the consumer will be zero, because the technology is already there; it's just a case of flipping the switch to turn it on.

I should say, we built this platform on a lot of open-source tools, namely Aeron for the clustering and messaging, Artio for FIX, and HdrHistogram for metrics. We use a load more open-source tools, but these are the ones that matter on the fast path. As anyone who publishes benchmark numbers knows, you should also publish what kind of test harness you're using. These are the machines we were running things on. They are quite different; there's no apples-to-apples comparison here. I think if you really wanted to do it scientifically, you would buy exactly the same hardware as you were using in the cloud and run that locally, but we didn't. That could explain some of the differences.

Recorded at:

Oct 22, 2019