Transcript
Barker: For this talk, I'm going to survey a bunch of APIs that give you alternative approaches to accessing your networking infrastructure, ideally with the goal of improving performance. This is in the mechanical sympathy track. What I hope people will get out of this talk is either to learn some interesting new things about certain specific APIs that could be applied to your own software and applications, or even just to learn a little bit more about some of the ways to potentially design your software such that you can take advantage of some of the existing APIs that are already there, in order to improve performance and efficiency in your systems. As we run through these, you will notice there are certain common themes amongst these different APIs. You'll see those as we go through.
Outline
I want to do a bit of an overview of Aeron: what Aeron is and why looking at some of these alternative APIs is important. I'm going to quickly look at the existing BSD Sockets API, talk about why it's expensive, and what we're trying to gain from taking different approaches. I'm going to look at a whole bunch of APIs that exist in Linux, such as sendmmsg and recvmmsg, io_uring, and a few others. Then I'm going to talk, probably in a bit more detail, about DPDK, which is one I've been working with quite a bit recently. Then I'm going to make a bunch of honorable mentions about some of the others.
Aeron
First off, what is Aeron, and why is this important to us? Aeron has a goal of being the world's fastest messaging bus, and we're trying to make it work that way regardless of deployment environment. We want to make it work really well on premises, and we want to make it work really well in the cloud. I'll talk about some of the details of those as we go through. Aeron itself you can consider a layer 4 protocol that exists at the same level as UDP and TCP, but shares the properties of both that we think are the most appropriate for building applications. First off, it is message based, like UDP, whereas TCP is stream based: it's just a whole bunch of bytes, there's no message delineation with TCP. It is connection oriented, though, like TCP. There's a protocol to establish a relationship between a source and an endpoint. It's reliable: it does negative acknowledgments and redelivery, unlike TCP's positive acknowledgments. It's multicast, so we can actually do very efficient delivery to multiple endpoints. It also has some aspects like flow control and congestion control. It actually does flow control over multicast, which is a very unique [inaudible 00:03:18]. Because we have this very strong goal of performance, we're actually going to look at various ways to get the best out of our network infrastructure, in the most mechanically sympathetic way.
BSD Sockets
BSD Sockets. If you haven't programmed in C, this will probably be fairly new to you; if you have, this is very old hat. I just wanted to set a brief tone. The way that we get a socket in C is a socket system call. The parameters we specify determine the nature of the socket that we open. In this particular case, we're looking for an AF_INET socket, which means IPv4, and we're asking for a datagram socket (SOCK_DGRAM), so there's the message orientation part. We're also able to bind that particular socket to a specific address, so we're saying that this particular socket wants to receive data on this particular port and address. We then can set some options. The actual value I've used here is not particularly relevant; we might be setting buffer lengths, that sort of thing. Then we can use the calls sendmsg and recvmsg to pass data to the operating system, which then forwards it onto the network on our behalf. There are a couple of other APIs as well, but I just wanted to focus on those two.
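As a rough sketch of the flow just described (not the exact code from the slides), the sequence in C looks something like this; the port, address, and buffer size are illustrative values, and error handling is omitted:

```c
/* A minimal sketch of the BSD Sockets flow: socket, bind, set an option,
 * then receive via recvmsg. Values here are illustrative, not from the talk. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

int open_udp_socket(void)
{
    /* IPv4, datagram (message) oriented socket. */
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    /* Bind to a specific local address and port to receive data. */
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(40123);
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    /* Set an option, e.g. the receive buffer length. */
    int rcvbuf = 1 << 20;
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf));
    return fd;
}

ssize_t receive_one(int fd, void *buf, size_t len)
{
    /* recvmsg copies data from the kernel's receive buffer into ours. */
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };
    return recvmsg(fd, &msg, 0);
}
```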
Where are the costs in this model? The cost exists in two places. The big one, really, is this transition from user space to kernel space. As the data transitions across here, it all has to be copied. There's a lot of operating system overhead, a bunch of bookkeeping that it has to do. This has got more expensive recently, with Spectre and Meltdown. There's additional work that the operating systems are doing to mitigate those particular issues. Most people don't really seem to notice, but when you're starting to build very fast applications that do a lot of network interaction, the cost of the system call overhead can add up. We also have some data copying through the send buffers and receive buffers. Some of the approaches we look at are trying to amortize, mitigate, or remove those costs altogether. Why? Let's justify it. I've made lots of claims about where these costs are, so let's actually look at them very briefly in some numbers.
This particular test is using a network card vendor's own kernel bypass library. I've done two runs, one using standard network sockets, and one with the same networking infrastructure, but using a library the vendor supplies that looks exactly the same as your standard POSIX APIs. It's implemented all of that stuff in user space. It's not doing this kernel transition. It's able to talk directly to its network cards via DMA and various other bits and pieces of magic. You can see here, on this particular network, which is quite an efficient one, the end-to-end latency with the socket library is about 17 to 18 microseconds, but with the bypass, it drops to about 7 or 8. There's quite a decent benefit to taking this particular approach.
Linux - recvmmsg/sendmmsg
The first API I want to talk about is sendmmsg and recvmmsg. This is an old API, it's actually been around in Linux for ages. It's probably not new to many people. Unfortunately, we can't get at it from Java. I do a lot of work in Java, and one of the most annoying things is you can't actually access this particular API in Java. It's a bit of a shame. It'd be nice if the layers had been opened up such that we could take advantage of this behavior. The reason I wanted to look at sendmmsg and recvmmsg is that they start one of these themes about amortizing cost. The idea with sendmmsg, for example, is to take multiple messages, which is what the extra 'm' stands for, and pass them to the kernel for delivery onto the network in a single system call. If we've got five or six messages, they get copied and then passed over, but we amortize that system call cost across them. When you're thinking about the design of your software, you should be looking at writing your networking code such that you can take advantage of this. One common design idea that I've seen, and applied and used inside of Aeron as well, is you have dedicated networking threads, like a dedicated sender and receiver. You front those with queues. As your application produces data, you just plumb it into a queue, and on the other side is a thread which is picking up those messages in as large batches as it can, or whatever batch size tunes well, and passing them over in bulk, through a sendmmsg style call.
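To illustrate the amortization idea, here is a minimal sketch of handing a batch of already-queued messages to the kernel with sendmmsg; the batch size and the shape of the surrounding queue are assumptions, not details from the talk:

```c
/* A sketch of amortizing the syscall with sendmmsg: take up to a batch of
 * messages that the sender thread has drained from its queue and hand them
 * to the kernel in one call. */
#define _GNU_SOURCE
#include <sys/socket.h>

#define BATCH 16

int send_batch(int fd, struct iovec *iovs, int count)
{
    struct mmsghdr msgs[BATCH];
    if (count > BATCH)
        count = BATCH;

    for (int i = 0; i < count; i++)
    {
        /* One msghdr per queued message; msg_len is filled in by the kernel. */
        msgs[i].msg_hdr = (struct msghdr){ .msg_iov = &iovs[i], .msg_iovlen = 1 };
        msgs[i].msg_len = 0;
    }

    /* One user/kernel transition for up to BATCH datagrams. */
    return sendmmsg(fd, msgs, count, 0);
}
```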
One of the important ideas here is that you then need to make sure your application code is asynchronous to take advantage of some of these [inaudible 00:08:23]. A lot of systems I work with do work that way, and do get enormous performance benefit. What happens is that when your system is under heavy load and you've got bursts of activity, the system becomes more efficient. That's very important.
Linux - io_uring
The second API I want to talk about is io_uring. This originally came about for improving the performance of accessing data storage. First off, it takes that idea of amortizing the cost of a system call. With io_uring, you request that the kernel set up some shared memory between user space and kernel space, represented as a pair of ring buffers: you have a submission queue and a completion queue. You put your requests into the submission queue. One request could be something like, send a message, with its IP addresses and the associated data. The kernel will pick that up and provide a response back on the completion queue. You can batch a whole bunch of messages onto the submission queue. Then you call io_uring_enter to trigger the kernel to do the work to process that queue. One of the other interesting aspects of io_uring is that you can speed this up further, because you can ask the kernel to start a busy spinning thread on the back of that to almost eliminate all of the system call overhead altogether.
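As a rough sketch of the mechanics just described, using the liburing helper library and assuming `fd` is an already-connected UDP socket:

```c
/* A sketch of batching sends through io_uring's submission queue using the
 * liburing helpers; assumes `fd` is an already-connected UDP socket and that
 * the msghdr array has been prepared by the caller. */
#include <liburing.h>

int send_batch_uring(struct io_uring *ring, int fd,
                     struct msghdr *msgs, int count)
{
    for (int i = 0; i < count; i++)
    {
        /* Queue one sendmsg request into the shared submission ring. */
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        if (sqe == NULL)
            break;
        io_uring_prep_sendmsg(sqe, fd, &msgs[i], 0);
    }

    /* One io_uring_enter under the hood submits the whole batch. */
    int submitted = io_uring_submit(ring);

    /* Reap completions from the shared completion ring. */
    for (int i = 0; i < submitted; i++)
    {
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(ring, &cqe);
        io_uring_cqe_seen(ring, cqe);
    }
    return submitted;
}
```

The rings themselves are set up beforehand with io_uring_queue_init(entries, &ring, 0); passing the IORING_SETUP_SQPOLL flag instead of 0 is what asks the kernel for the busy-spinning submission thread mentioned above.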
This is actually shaping up the second theme that we'll see in a couple of places, which is creating a more efficient mechanism for transferring data in and out of the kernel by creating a shared memory area. I've written some io_uring bindings for Aeron; slightly unfortunately, I haven't managed to get them to match the pace of recvmmsg and sendmmsg. This could be a problem with my implementation. This could also be the fact that I'm on a slightly older kernel. One of the other reasons that we haven't gone all in on using io_uring is that it's also quite tricky to get customers to upgrade their kernels, because a lot of the long term releases are still on things like 5.3, 5.4. You really need to be looking at 5.10, 5.11, 5.12 to get the best out of io_uring, especially for the networking side.
Here Be Dragons!
Where do we go from here? At least we still have all our nice operating system abstractions. What happens next? Here Be Dragons. This is where we step away from some of the nice abstractions that we have, and start dealing with some of the much lower layers in order to gain a bit of extra performance. We're going to start looking at, how do we handle raw packets? What are some mechanisms for doing that? Before I jump in, it might be worth me doing a quick overview of what a packet actually looks like.
If we start at the higher layer, layer 4, UDP is what's currently within our purview. A UDP header really needs four items: the source and destination ports, the length of the message, and a checksum that is used to validate that there's been no flipping of bits as the message has gone across the wire. Fairly straightforward. Notice, there are no IP addresses in here. That's because they belong to a slightly lower layer, the IP layer, because we can layer TCP or UDP on top of IP. You can see the IP addresses there. There are also a couple of other fields that are interesting: the protocol, which is UDP, TCP, potentially others. You have Time to Live: how many network hops are you going to allow before the switches and routers start throwing packets away? You also have another header checksum. If you notice, there are checksums at both the UDP layer and at the IP layer.
That's as low as we're actually going to go in this particular case: we're going to drop down to the Ethernet header. This is layer 2, the data link between individual nodes on the network. Not a lot here: source and destination MAC. This is actually important because this is the mechanism you use to communicate between two machines. You're actually using hardware addresses. The IP addresses are really just a mechanism that maps down to these. When you're dealing with raw packets, especially in the case of sending, you have to construct all of these fields. It becomes your responsibility, whereas previously all of these layers were handled by the operating system.
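For illustration, here is a hedged sketch of filling in those three headers by hand using the standard Linux header structs; the field values are placeholders and the checksum calculations are left out:

```c
/* A sketch of the headers you become responsible for when sending raw
 * packets: Ethernet (layer 2), then IPv4, then UDP, then your payload. */
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <netinet/ip.h>
#include <netinet/udp.h>
#include <string.h>

size_t write_headers(unsigned char *frame,
                     const unsigned char src_mac[6], const unsigned char dst_mac[6],
                     uint32_t src_ip, uint32_t dst_ip,
                     uint16_t src_port, uint16_t dst_port,
                     uint16_t payload_len)
{
    memset(frame, 0, sizeof(struct ethhdr) + sizeof(struct iphdr) + sizeof(struct udphdr));

    /* Layer 2: hardware addresses and the protocol of the next layer. */
    struct ethhdr *eth = (struct ethhdr *)frame;
    memcpy(eth->h_source, src_mac, 6);
    memcpy(eth->h_dest, dst_mac, 6);
    eth->h_proto = htons(ETH_P_IP);

    /* Layer 3: IP addresses, TTL, protocol, and a header checksum. */
    struct iphdr *ip = (struct iphdr *)(frame + sizeof(*eth));
    ip->version = 4;
    ip->ihl = 5;            /* 20-byte header, no options */
    ip->ttl = 64;           /* allowed network hops before the packet is dropped */
    ip->protocol = IPPROTO_UDP;
    ip->saddr = src_ip;
    ip->daddr = dst_ip;
    ip->tot_len = htons(sizeof(*ip) + sizeof(struct udphdr) + payload_len);
    ip->check = 0;          /* header checksum: compute it, or offload to the NIC */

    /* Layer 4: ports, length, and the UDP checksum. */
    struct udphdr *udp = (struct udphdr *)((unsigned char *)ip + sizeof(*ip));
    udp->source = htons(src_port);
    udp->dest = htons(dst_port);
    udp->len = htons(sizeof(*udp) + payload_len);
    udp->check = 0;         /* UDP checksum: optional for IPv4, or offload to the NIC */

    return sizeof(*eth) + sizeof(*ip) + sizeof(*udp);
}
```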
Linux - AF_PACKET/PACKET_MMAP
Let's take our first foray into dealing with raw packets. The first interface I'm going to talk about is AF_PACKET/PACKET_MMAP. This is in the operating system largely to do very efficient packet capture. I actually wrote a particular piece of software in a previous organization that used this mechanism. You'll start to see some common themes again. What we're doing is we set up a particular socket that connects directly to a network interface card. It's just saying, give me all the packets off this network interface card: no filtering, no socket buffers, nothing like that. We ask the operating system to create a shared memory area, an RX ring and a TX ring, for receiving and sending. Then, if we want to receive data, we can use a poll call to tell us there is now data available in the RX ring. Similarly, we can fill the TX ring [inaudible 00:13:48]. Again, the same themes: there's a shared memory area between kernel space and user space, and the ability to amortize system calls. I haven't actually had a go at sending packets via AF_PACKET, but I'm pretty sure the mechanisms for doing that are similar to what I've dealt with, with DPDK.
How is this actually set up? Interestingly, the APIs for constructing these things are actually reasonably familiar. They're not new, they just use some of the existing system calls that are there. When we set up a socket, we give it a type of AF_PACKET. We say it is raw data, and we say we only want to look at Ethernet frames; we don't care about frame relay or ATM or anything like that. We set some socket options, the important one being that second one. The structure we're passing in describes the size of the ring buffer, the size of the individual blocks in the ring buffer, and the number of frames. Then once that's set up, we can use an mmap call. This is where we actually establish that shared memory space: the operating system knows that the socket has the ring buffer backing it, so when we want to memory map it, that's the region of memory made available to user space, simply by returning a pointer to it. You can overlay the data structures on top of that, to treat it like a buffer.
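A minimal sketch of that setup, with illustrative ring dimensions; error handling and binding the socket to a specific interface are omitted:

```c
/* A sketch of AF_PACKET + PACKET_MMAP: a raw socket, a ring description via
 * setsockopt, then mmap to share the kernel's RX ring with user space. */
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>

void *map_rx_ring(int *fd_out)
{
    /* Raw socket delivering whole Ethernet frames, no filtering. */
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

    /* Describe the ring: block size, block count, frame size, frame count. */
    struct tpacket_req req;
    memset(&req, 0, sizeof(req));
    req.tp_block_size = 1 << 22;
    req.tp_block_nr = 64;
    req.tp_frame_size = 1 << 11;
    req.tp_frame_nr = (req.tp_block_size / req.tp_frame_size) * req.tp_block_nr;
    setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req));

    /* Map the kernel's ring into user space; tpacket headers are overlaid on it. */
    size_t len = (size_t)req.tp_block_size * req.tp_block_nr;
    void *ring = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    *fd_out = fd;
    return ring;
}
```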
Linux - AF_XDP
The next one I want to talk about on the Linux kernel is XDP. XDP is interesting. A lot of the ones previously have been about shared, efficient access to data between kernel space and user space, amortization of costs, optimization of syscalls across multiple messages. This takes a different approach. This is where you're actually allowed to push your logic down into the kernel. There's an immediate tie-in with the earlier theme of ring buffers: one thing you can do with this is create ring buffers that are shared between user space and kernel space and push data up there. Because you're injecting code into the kernel, you can do fairly arbitrary things. The code is in the form of eBPF, which isn't a fully general language. It doesn't allow things like arbitrary looping, for example, and you get a limited set of data structures. It's used for quite interesting things, for example, packet redirection, load balancing, and filtering, which is something that Facebook does. Their packet filtering is built on top of XDP, and they got significant performance improvements out of doing that, compared to using things like iptables or various other load balancing tools. It's really interesting.
I've not really used XDP a whole lot, but I thought it important to show a quick example of using it. There are a number of interesting entry points, so I've constructed a little BPF program here. There are a couple of possible actions for XDP when it's received a packet: you can drop it, pass it up, or transmit it back out. This is a little echo server, so when I want to echo a packet back, I don't pass it up to user space. What I do is I take the packet, I switch the MAC addresses, I switch the IP addresses, and I switch the ports, and I say, transmit this for me please. It's quite possibly, I think, the most efficient echo server you can write on Linux. Obviously it's a bit of a toy, but it just gives you an example of the types of packet manipulation and interesting logic you could potentially [inaudible 00:17:09].
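Here is a hedged sketch of that echo idea as an XDP program (not the exact program from the talk); it assumes IPv4 UDP with no IP options, and echoes every UDP packet it sees:

```c
/* A sketch of an XDP echo: swap MACs, IPs, and UDP ports, then ask the
 * driver to transmit the frame straight back out. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/udp.h>
#include <bpf/bpf_endian.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_echo(struct xdp_md *ctx)
{
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    struct iphdr *ip = (void *)(eth + 1);
    struct udphdr *udp = (void *)(ip + 1);

    /* The verifier insists every access is bounds-checked against data_end. */
    if ((void *)(udp + 1) > data_end)
        return XDP_PASS;
    if (eth->h_proto != bpf_htons(ETH_P_IP) || ip->protocol != IPPROTO_UDP)
        return XDP_PASS;

    /* Swap source and destination at every layer. */
    unsigned char mac[ETH_ALEN];
    __builtin_memcpy(mac, eth->h_source, ETH_ALEN);
    __builtin_memcpy(eth->h_source, eth->h_dest, ETH_ALEN);
    __builtin_memcpy(eth->h_dest, mac, ETH_ALEN);

    __be32 addr = ip->saddr;
    ip->saddr = ip->daddr;
    ip->daddr = addr;

    __be16 port = udp->source;
    udp->source = udp->dest;
    udp->dest = port;

    return XDP_TX;   /* transmit back out the same interface */
}

char LICENSE[] SEC("license") = "GPL";
```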
DPDK
Last off, and most interesting, and where I would like to go into a lot more detail, is DPDK. This again takes a very different approach to the other ones. What we've looked at so far is shared, efficient communication between user space and kernel space, and putting logic into the kernel to try and make things more efficient. DPDK is the third thing, which is where we try to lift as much as possible out of the kernel and run it in user space. That's what DPDK's goal is. They provide a range of what they call poll mode drivers, for different pieces of hardware. The way that DPDK works is that it uses hugepages. Hugepages generally stay static in physical memory. It asks the operating system to tell it what the logical to physical mappings are for those particular pages. It does require some privileges in certain circumstances. Then it sets up the networking device in memory, so it asks a particular driver, one called VFIO, there's one called UIO as well, for the registers and the mappings for a particular device. It's able to set up the device. It's able to do direct memory transfers to this region of hugepages you've set up in user space. This is really interesting.
Then it completely removes the use of interrupts. That's the reason why they call it a poll mode driver: everything's done via polling. Data comes in, and you're required to call a method to receive packets off of the network device. This is a really interesting approach. It does this with the idea that, by taking the kernel out of it completely, you're able to run the whole thing in user space. The reason why this one has become very interesting is if you want to run very efficient networking applications on AWS: they've got really good support for DPDK on their Elastic Network Adapters. This is where we've been focusing our efforts. One of the things you're doing by taking the operating system away, as with AF_PACKET/PACKET_MMAP, is you don't have an IP stack anymore. That's now your responsibility.
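As a sketch of what that poll loop looks like; the port and queue identifiers are illustrative, and the EAL and device setup calls (rte_eal_init, rte_eth_dev_configure, rte_eth_rx_queue_setup, rte_eth_dev_start) are assumed to have already run:

```c
/* A sketch of a DPDK-style poll-mode receive loop: no interrupts, just
 * repeatedly asking the device for a burst of raw Ethernet frames. */
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST 32

void poll_port(uint16_t port_id, uint16_t queue_id)
{
    struct rte_mbuf *bufs[BURST];

    for (;;)
    {
        /* Poll the device; returns however many frames arrived, possibly zero. */
        uint16_t n = rte_eth_rx_burst(port_id, queue_id, bufs, BURST);

        for (uint16_t i = 0; i < n; i++)
        {
            /* Raw Ethernet frame; parsing the ETH/IP/UDP headers is on us.   */
            /* handle_frame(bufs[i]); -- application-specific, hypothetical    */
            rte_pktmbuf_free(bufs[i]);
        }
    }
}
```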
DPDK - Challenges
What are some of the interesting challenges around dealing with it? You're actually having to do your own IP address to hardware address mapping. There's a protocol called ARP, which I learned at university, and 20 years later actually had to deal with. It's very simple really, a layer 2 protocol, where you pass around pairs of protocol (IP) address and hardware address, and you can send out broadcast messages, and responses come back, to tell other nodes on the same subnet what your IP address to MAC address mapping is. It's pretty simple, but again, that's something the operating system did for you. As well as being an annoying step, you have to do it first, before you're actually able to do anything else, otherwise you can't send messages at all. You can hard code MAC addresses, and that's interesting.
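For illustration, a sketch of building the ARP request payload using the standard ether_arp structure; the surrounding broadcast Ethernet frame is omitted:

```c
/* A sketch of an ARP request: "who has this protocol (IP) address?
 * tell me your hardware (MAC) address". */
#include <arpa/inet.h>
#include <net/if_arp.h>
#include <netinet/if_ether.h>
#include <string.h>

void build_arp_request(struct ether_arp *arp,
                       const unsigned char my_mac[ETH_ALEN],
                       uint32_t my_ip, uint32_t target_ip)
{
    arp->arp_hrd = htons(ARPHRD_ETHER);   /* hardware type: Ethernet   */
    arp->arp_pro = htons(ETHERTYPE_IP);   /* protocol type: IPv4       */
    arp->arp_hln = ETH_ALEN;              /* hardware address length   */
    arp->arp_pln = 4;                     /* protocol address length   */
    arp->arp_op  = htons(ARPOP_REQUEST);

    memcpy(arp->arp_sha, my_mac, ETH_ALEN);   /* sender MAC               */
    memcpy(arp->arp_spa, &my_ip, 4);          /* sender IP                */
    memset(arp->arp_tha, 0, ETH_ALEN);        /* target MAC: unknown      */
    memcpy(arp->arp_tpa, &target_ip, 4);      /* target IP: to resolve    */
}
```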
In AWS, they have a very neat feature: you actually don't need to implement ARP. They allow you to just specify the MAC address of the gateway, and you send everything there. The hardware device, the Nitro cards inside of the AWS instances, will actually figure it all out for you and send it to the right place. I've tried both, using direct MAC addresses of remote nodes and using gateway addresses in AWS. There's no noticeable performance difference. You don't have socket buffers, you just have these receive queues and transmit queues. This creates quite an interesting design challenge. Inside of Aeron, we have these sender threads and receiver threads. It turns out the receiver does a little bit of sending, of things like status messages and NAKs and that sort of thing. The sender does a little bit of receiving of these status messages and control messages.
With DPDK, I had to be very strict about send versus receive. I'm getting all of the raw packets from the device up into one thread. What I actually had to do is figure out which packets are for the receiver and which packets are for the sender, redirect them, and have an internal buffer for managing this. Whereas previously, you could just open up your sockets and be bound to different ports. You also have to do things like manage your own ephemeral ports: you can't just bind to port zero and have the operating system pick one for you, you've got to do that work yourself.
The other thing is when you start using a device with DPDK, it becomes invisible, generally, to the operating system, so the tools like ip, ifconfig, and tcpdump that would normally work off of an interface are gone. It's quite unnerving when your device completely disappears, so you're having to debug some of these things differently. Normally with a network application you'd quite often use a tool like Wireshark; often that's not available. I found the best approach was setting up two machines, having one run using standard operating system sockets and the other using DPDK, and sending messages between them, so I could actually watch the data on one of them, and work out the interactions that way before deploying to [inaudible 00:22:42].
Hardware offloading is an interesting one. I talked about checksums. Modern networking hardware can actually calculate those for you. With DPDK, you have the ability to query devices for the offload features they provide, and then set the device up to do that for you. That's quite neat. Operating systems, again, would normally be doing this on your behalf. In order to get those checksum values calculated for you, you need to tweak the settings on the network card. There are APIs for doing that as well.
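A hedged sketch of that query-then-enable flow in DPDK; the offload constant names shown here are from recent DPDK releases and differ in older ones:

```c
/* A sketch of querying a device's offload capabilities and asking for
 * IP and UDP checksum offload before the port is configured and started. */
#include <rte_ethdev.h>

int enable_cksum_offload(uint16_t port_id, struct rte_eth_conf *port_conf)
{
    struct rte_eth_dev_info dev_info;
    rte_eth_dev_info_get(port_id, &dev_info);

    uint64_t wanted = RTE_ETH_TX_OFFLOAD_IPV4_CKSUM | RTE_ETH_TX_OFFLOAD_UDP_CKSUM;

    /* Only ask for offloads the hardware actually advertises. */
    if ((dev_info.tx_offload_capa & wanted) != wanted)
        return -1;

    port_conf->txmode.offloads |= wanted;
    /* This takes effect via rte_eth_dev_configure(...) before the device is
     * started; per packet, the mbuf's ol_flags then request the calculation. */
    return 0;
}
```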
DPDK Throughput
The last one I want to talk about is MTU configuration. In a standard Ethernet frame, you're normally talking about 1500 bytes. It turns out, in AWS, they often use a 9K MTU. That has some interesting impacts on performance, especially because AWS networks are not as fast as your local area networks. Having a larger MTU allows you to get more data across that expensive boundary in a single shot. I'll show you some numbers based on it.
First off, when I started looking at latency, I was doing basically a ping pong test. I found that with very small messages, there's actually not a big difference between the sockets API and DPDK on AWS. Where I've seen a really big difference is when the data sizes ramp up. You'll see at about the 98th percentile, when I'm dealing with 1K messages, that it stays nice and flat when I'm using DPDK, but I get a big jump up when I use the sockets API. This is really important for Aeron because Aeron tries to very aggressively batch messages over the network. I did some tests with varying sizes of MTU. Interestingly, when I was running with a default 1500 byte MTU, the throughput was significantly worse than the standard BSD Sockets approach. It wasn't until I ramped the MTU up to 9K, to match the default on AWS, that I saw a performance improvement. I was testing with streaming messages through: I streamed little messages at 32 bytes and streamed 500 byte messages. With the larger messages, you can see nearly twice the throughput using DPDK.
Briefly, beyond DPDK, there are a couple of others worth mentioning. There are things like [inaudible 00:25:17], who have their own API specific to their network cards. On Windows there is one called Registered I/O, that again uses this submission queue, completion queue model. There's also a thing called InfiniBand Verbs. Again, that's another queue based one, which is based on that idea of moving all of the networking code up to user space. You set up these various queue pairs, and you're doing what's called RDMA, or Remote Direct Memory Access, to other machines. Often, you need a bit of IP infrastructure to set up the connections between the machines. Then once you set up these queue pairs, you can send messages between them.
Hopefully, you've seen some interesting, different approaches to networking. Hopefully, you've seen some of these themes carry through, with things like DPDK and libibverbs, where you're trying to move your networking code into user space. We see things like io_uring and PACKET_MMAP, where we're trying to share data very efficiently between user space and kernel space. You see a lot of these things trying to look at ways to amortize the cost of system calls across multiple interactions. Then we see the really interesting ones, with things like AF_XDP, which allow us to push code down into the kernel in order to get greater performance.
The Overhead of Implementing ARP and Checksums
Montgomery: I definitely want to circle back to the biggest benefit being the elimination of interrupts, when you were talking about DPDK. The question was about the overhead of implementing ARP, checksums, and so on.
Barker: Checksums are fine. Most hardware does it for you. DPDK gives you the option of enabling offload. There's a whole bunch of setup you need to do with DPDK when you connect to your device. There's a bunch of flags for all the hardware offloads the device supports. Then you've got to say, I want these ones, and switch them on. You might not need some of them. There are things for handling TCP and UDP checksums. There are things for jumbo frames. Some things do some security as well. There's a whole raft of different potential offloads. It's just something you don't really think about when you're just using the networking API, because generally the kernel's looking after it. Checksums are fine.
ARP's an interesting one. In the cloud, it is fine: you just need to resolve the MAC address of your gateway, and you just use that. You don't actually even have to worry about ARP. Typically in a cloud environment, and in the AWS recommended approach for using DPDK, you set up two network cards, two interfaces, and you make one normal, so it's available on the network, and you dedicate the other to DPDK. That way, the one that's connected normally will actually resolve the default gateway and will have an ARP table entry. What we're looking at doing is, when the system starts up, just reading the system ARP table entry for the gateway and using that.
I've got some code working on an Intel network card. The overhead is actually quite small, because you just need to resolve it the first time; you might want to time it out and retry it every now and again. Once you've resolved it once, you've got an address you can send to. The interesting thing is the way that Aeron works: because Aeron handles reliability, if we try to send a message to a particular address and we don't have the MAC address for it, what I actually do is just drop the message and send the ARP request. Then Aeron will take care of doing the redelivery for you. Normally, this is actually happening during setup, so there are setup protocols saying, set up this particular connection, and you've just missed the first one. That could mean a sub-millisecond delay in the initial setup. Maybe every 30 seconds or a minute or something you want to check to see if the mapping has changed; you don't have to delay, you just send a new ARP request. If the mapping happens to be wrong, the data will just get dropped on the network, and eventually it'll end up being redelivered. Actually, in a very high performance network, when you're trying to get the best out of it, you're not going to change the IP to MAC address mappings very often, so it's very unlikely to be an issue.
Questions and Answers
Montgomery: Not unless you want that hit.
DPDK is for specialized software only, meaning it's not suitable for general purpose use. I was thinking of software based firewalls, but because it is not using standard network stacks it's out of the question. With that question, can you talk a little bit about where DPDK may fit and where it may not fit?
Barker: I believe the main targets of DPDK are load balancers, firewalls, those sorts of very network oriented applications, where you just want very fast redirection of packets. Most of the examples you look at are all about taking a packet, fiddling with the values in that packet, and then passing it back out. It's less usual to do something like what we're doing in Aeron, which is to surface it up to the application. It works really well for Aeron, because Aeron uses UDP, and UDP is such a simple protocol that we can write the UDP stack ourselves. Where it gets hard is for applications that are relying on TCP, because if you're using DPDK, you've got to write your own TCP stack. That's probably similar in complexity to writing the messaging side of Aeron. You've got to deal with redelivery and acknowledgments, flow control, congestion control, and all of these things that the operating system has spent many years developing; you then have to find an implementation of that. I believe there are a few user space ones off the shelf in various places, otherwise you have to write it yourself. That's where it gets hard: if you're trying to build TCP or IP based applications on top of it, I think that's where we'd find the most difficulty.
If it's just straight UDP based, I think you can actually have a lot of wins. It's not that hard to write your own UDP framing. It takes a little bit of digging around. You have to deal with things like ARP, but it's a really interesting problem to solve as well. It didn't take me a terribly long time to do it. Yes, definitely in the networking space, for something like Aeron, it's great. If you were just building a pure application, like a web client or something that's very user oriented, it may be overkill. I would say find a messaging tool like Aeron, or something similar that's already done some of the heavy lifting for you.
Montgomery: Or look to things like OpenOnload, or other things like that, I would assume, to give you some of the benefits, but still take care of some of the stack handling for you.
You showed some C, with the BSD Sockets API, and that gives you C++ as well. If you're doing Java, or Python, or Rust, what can you latch on to with things like sendmsg, recvmsg, and all the other calls? What do you have there? Yes, you can do some foreign function interface types of stuff, but for Java you're left to your own devices.
Barker: The Java networking story has a very checkered past. With things like DPDK, your best bet is you're probably going to have to do some foreign function interface work: JNI, or one of the newer options like JNA or something like that. Fortunately, the Java networking stack, certainly in some of the later releases, has gotten a lot of love recently. In versions 15 and 16, they've started doing some work to make it a lot more efficient. I think with Java, the actual Java library is going to be your bottleneck, rather than being able to get down to some of these lower level details. I believe there are some really good tools, things like Netty. Netty has some good operating system accelerations so that you can actually get to recvmmsg and sendmmsg and send your messages that way. They also do some things a bit faster and more efficiently natively.
I think with DPDK or something like that, you are limited to the foreign function interface style of things. Similarly for Rust and all of those. I think it's interesting that C keeps cropping up. It's probably worthwhile mentioning why I think that is: it's the lingua franca for most operating systems. C or C++ generally seems to be where all the interfaces surface. If you want to use Rust, fine, great language, really good for systems, it would be really good for something like this. Aeron would be an interesting thing to write in Rust. There's always just that extra layer: you have to have those language bindings to the C API to then transform it into your Rust style. It's one more thing, one more bit in your infrastructure that can go wrong, that can have bugs, that needs to be troubleshot if you've got an issue that you're trying to figure out. I think this is why C is still really popular and why it's actually gaining in popularity recently. It's just that we've gotten to this point with hardware, and it's not just networking, it's storage and a lot of other things, where we really want to get to these lower level APIs and have greater benefit from them. Using C and C++ tends to be the easiest way to get there. I think you'll see over time that other languages will pull in bindings, maybe even standardize them in their standard APIs over time. Yes, it's a lead time thing in that case.
Montgomery: You mentioned that DPDK is suitable for firewalls, assuming stateless firewalls only?
Barker: No, you could have plenty of state in there. DPDK has a lot of stuff that I didn't talk about. DPDK has a whole application framework for multi-threading. They've got lots of standardized libraries for maps and stuff like that. You can basically build in as much state as you want. You can have hash tables, and you could be injecting rules, so you could be applying custom rules dynamically to it. It's very much like writing a normal C application. They even do some efficient layout of threads, so you have worker type lightweight threads, and that will set up a thread per core model. They'll schedule those for you. They have internal messaging queues for passing messages, so you could have an application or state management part of the system which is receiving messages from users that are saying, change this firewall rule, block this, redirect this, and that could be sending messages to your threads that are dealing with sending, receiving, and transmitting. Yes, it can be as stateful as you want it to be, as stateful as you want to code up really.