David Riddoch on Bypassing the Kernel and Hypervisor for Network I/O, Solarflare, OpenOnload
Bio: David Riddoch leads the development of the Solarflare OpenOnload IP stack.
Hi. I'm the chief architect at Solarflare. Solarflare makes the world's fastest Ethernet network adapters.
On the one hand they're just good network adapters. They do the same job as the network adapters that most people buy today from companies like Intel or Broadcom or Mellanox: you plug them into a server, they connect to a network using maybe 1 gigabit or 10 gigabit or 40 gigabit Ethernet, they move packets in and out of the host, and ours will give you better performance than everybody else's. But in addition to that, we have some special software called OpenOnload that uses a technology called kernel bypass to make the networking go much, much faster. Now, kernel bypass is a technology that was invented in the high performance computing industry, around the 1980s.
Originally it was used in these high performance computing clusters where you have a whole load of machines that are essentially identical. They're all working together on a big number crunching problem, passing messages between each other in order to collaborate on this problem, and they have to have a fast network to do that. And so in those days, that was really the application of fast networking. So they came up with the idea that basically the protocol stacks are slow, so we're going to bypass them and maybe move some of the work into the network adapters. But we want to avoid all those costs like context switching, going in and out of the kernel, interrupts, and all of those slow things. That's essentially what kernel bypass is. It's moving networking into the user space applications and allowing those applications to talk to the network hardware directly.
Now, I was involved in a kernel bypass networking research project at AT&T Labs in around 1999. We had a great technology. It was 1 gigabit, which was reasonably fast at the time, but the key thing was low latency: it could get a message from one machine to another in less than two microseconds, which is pretty comparable with, though slightly worse than, the best you can do today. So that was really, really fast in 1999.
But apart from being super fast, it had the same flaws that every other kernel bypass network technology had at that time, which was that it wasn't Ethernet, the network that everybody uses. It used proprietary link layer protocols and other protocols on top of that, so it didn't use IP, TCP and UDP, and it required that you use new APIs, or adapt your middleware to new APIs, in order to use it at all. So that meant that you can't just take this technology, put it on a server and use it. That means, of course, you can't sell it.
So AT&T ran out of money and we all happily got made redundant in 2002. And that was the opportunity for us to start the company called Cambridge Internetworking that became Solarflare. Our idea for that company was to take what we had learned about high performance networking and apply it in a way that people can actually deploy easily, which means it's got to be Ethernet, it's got to be IP, TCP and UDP, and it's got to be the standard BSD sockets API.
Werner: You basically made normal APIs fast for the rest of us who don't use fancypants interconnects.
Exactly, yes. So you can take a server, put a Solarflare network adapter in it and it'll work exactly the same as it did with anybody else's network adapter. You can put the OpenOnload acceleration technology on there and you can then select which applications you want to accelerate with OpenOnload. And the idea is they'll work in exactly the way they did before. It's just the networking calls suddenly become much, much cheaper. So that gives you lower latency and it gives you high throughput, high CPU efficiency. And it literally can be that easy. Some applications, because the environment that it runs in has different assumptions, different locking behaviors, some applications don't benefit from this technology but many applications do and it can be as simple as: install our drivers, go two to three times as fast or in some cases, install our drivers, do a little bit of tuning, go fast.
Yes. Well first off, the easy answer is the ones that are already fast benefit the most, and in one sense that's Amdahl's law: if you've got bottlenecks elsewhere, we can't do anything about those, but if your bottleneck is in the networking, then we'll make that faster. If networking is a big part of what you do, making that faster will make your application a lot faster. But a classic example of the sort of application that's unlikely to get much speed-up is one where you have many connections, one thread per connection, and you're throwing around lots of small messages distributed over those many connections. What you'll find is that the cost of thread switching between those connections vastly outweighs the cost of the networking, even without our technology, and therefore no matter how cheap we make the networking, that application isn't going to get very much of a speed-up.
That's right. Yes, so it runs on Linux. It works with any kernel. We distribute it as source. It really includes two parts: there are kernel drivers, and then there's the user space library. The kernel drivers are there partly to handle configuration, set-up and resource allocation. The really interesting stuff happens in user space. The user space library gets loaded into your application at runtime using a technique called LD_PRELOAD. What this does is, when your application starts, the runtime linker will load our library ahead of the standard C library.
So in the places where your application calls send and receive and connect and socket and epoll and all of those things, the linker will route those calls to our library, and when the application calls them, they'll come to us. We will then have a look at the file descriptors involved and make a decision as to whether we can handle that call ourselves. If we can, we do, and it goes fast. If we can't, then we just forward it on to libc, which forwards it on to the kernel. And so the normal behavior is preserved for everything that we can't accelerate. That gives us compatibility. It's perfectly okay for you to do some of your networking over Solarflare, and that will go fast, and some of it over the built-in network adapter that's on your motherboard, and that'll just get the normal kernel-based performance.
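As a rough illustration of the LD_PRELOAD interposition technique described here (purely a sketch, not Onload's actual code), a shared library along the following lines could intercept send() and fall back to the real libc implementation via dlsym(RTLD_NEXT, ...):

```c
/* shim.c -- a minimal sketch of LD_PRELOAD interposition; illustrative only,
 * not Onload's actual implementation.
 * Build: gcc -shared -fPIC -o libshim.so shim.c -ldl
 * Run:   LD_PRELOAD=./libshim.so ./your_app
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>

typedef ssize_t (*send_fn)(int, const void *, size_t, int);

ssize_t send(int fd, const void *buf, size_t len, int flags)
{
    static send_fn real_send;
    if (real_send == NULL)
        real_send = (send_fn)dlsym(RTLD_NEXT, "send");  /* the libc symbol */

    /* A kernel-bypass library would check here whether this file descriptor
     * belongs to an accelerated socket and, if so, hand the data straight
     * to its user-space stack.  This sketch just logs the call and falls
     * through to libc, which is the compatibility path described above. */
    fprintf(stderr, "send(fd=%d, len=%zu) intercepted\n", fd, len);
    return real_send(fd, buf, len, flags);
}
```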
Now, essentially, here's what's happening at the stack level. We're implementing the sockets API, accelerating TCP and UDP sockets. So we have our own TCP/UDP stack inside this library implementing those protocols, and we are not offloading the protocols themselves to the adapters. The only offloads the adapters really do are checksum offloads and distributing load: the adapter has to know where to send packets so they arrive at the right application. So the actual protocols are done in software. The conventional wisdom used to be that TCP was horribly complicated and required enormous amounts of CPU time, and it's complete nonsense. TCP is perfectly fast, as we have proven.
The issue really is the efficiency of the interfaces between hardware and software and the costs of going in and out of the kernel: syscalls, interrupts and other context switches. It's all of those things that add up. The reason this technology gets you a big win in terms of performance is that it can bypass interrupts, it does bypass system calls on the fast paths, and it has more efficient locking, so there are far fewer bus-locked atomic operations on our fast paths than there are going through the kernel stack. And we get a huge advantage from just being very tightly integrated, whereas the kernel stack is a very general-purpose piece that supports a huge number of protocols, and does it very well, I should say, with quite incredible performance, but that generality comes at a certain cost in terms of CPU overhead. By being integrated, we cut down the number of CPU cycles that it takes to do these operations enormously.
Just to give you an idea, roughly speaking, a send or a receive call takes less than one-fifth of the CPU cycles it would take through the kernel stack. The effect of that is, on the one hand, lower latency, so you can get messages from A to B much more quickly. With our technology and current-generation adapters (we have some faster ones coming very soon), it takes a little over a microsecond to get a message from an application on one machine to an application on another machine over the network, excluding switches. With the kernel stack, if you tune it really well, you can do that in about four microseconds; more normally, if you just use default settings out of the box, you'll probably get something like 50 microseconds.
Werner: Pretty speedy.
Yes, and this is really important for a number of different people. I mean that low latency aspect is critically important for people like electronic traders who are essentially engaged in a race. They are looking for opportunities to trade. Whoever spots the opportunity and submits an order first makes the money. So those guys will go to great lengths to be the fastest, and this is a technology that helps them do that. But it's kind of like the nuclear arms race. As soon as one of them started buying Solarflare, the rest of them had to buy Solarflare or they were out of the game. So that was great from our point of view. That's the electronic traders.
Other people have interactive, responsive applications: that could be microservices, that could be more traditional architectures, anything where there are multiple hops involved in responding to a request. The faster each of those hops executes, the more responsive your overall application will be, and the more hops you have, the more benefit you will get. But that's just the latency part. The other thing that we are improving is throughput. Reducing the number of CPU cycles for sends and receives and epoll calls just means you can do more of them per second, and that means you can handle a much higher throughput.
So that obviously gives you cost savings. It means you can do more on one machine or you can do the same amount with fewer machines and save money. And it's not just about message rate, it's also about connection rate. So if you are running something like a load balancer or a web server, it's all about how many messages and connections per second can we handle, and we accelerate all of those things.
Werner: So that improves the efficiency, basically. It's not just about being faster, it's about doing more with the same hardware.
Yes. That's right. The capacity of each server is much greater, potentially, with this technology, assuming networking is your bottleneck. We won't help if disk is your bottleneck, but we'll help if networking is the bottleneck. Yes, so more capacity per server. You don't need as many servers.
Werner: Well that's always good because as we know, Moore's Law is history, so we have to make better use of our servers.
Yes. And we are also improving the scaling over the cores within a server as well. So as servers get more and more cores, it becomes harder and harder to actually scale applications on those boxes. Kernel bypass doesn't inherently make that better, but our implementation of it essentially allows you to share nothing between the threads that are running on different cores. By sharing no state, they don't have any of the inefficiencies associated with moving that state between the caches on the different cores.
Werner: So no cache coherence overhead, or less.
Yes, certainly a lot less. I mean you will get those overheads from other parts of your application potentially and there is some cost to cache coherency even if you are completely local to your cache. But we scale a lot better and it can be dramatically better. It depends very much on the nature of the application and its threading model. But for example, a quite challenging application is load balancing, particularly TCP load balancing where you have incoming connections on the internet, outgoing connections into your backend services.
That's very, very hard to scale: with the kernel stack and the standard technologies, the many-core performance is only roughly two or three times the single-core performance, even if you do a really good job of tuning it. But OpenOnload can scale that much, much better: on a 12-core box, it's probably going eight times as fast as a single core. So it's not linear, but it's really good. You're taking advantage of those cores.
Yes. Not so much. I mean the integration, the lack of locking, the saving of CPU cycles, that allows us to be twice as fast on one core compared to the kernel stack technology. But the fact that we can scale when you add more cores, that's all down to the fact that each of those worker processes running on a separate core doesn't share any state with the other workers, so they don't have any contention.
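As an illustration of that shared-nothing worker pattern (not something from the interview, and not Onload-specific), each worker process in the sketch below binds its own listening socket to the same port with the standard SO_REUSEPORT socket option, so the workers share no socket state; the port number and worker count are arbitrary placeholders:

```c
/* workers.c -- a sketch of the shared-nothing worker pattern using the
 * standard sockets API (SO_REUSEPORT); not Onload-specific.  Each worker
 * owns its own listening socket and accept loop, so no socket state is
 * shared between the cores the workers run on.
 */
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static int make_worker_listener(uint16_t port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    struct sockaddr_in addr;

    /* SO_REUSEPORT lets every worker bind its own socket to the same port. */
    if (fd < 0 ||
        setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)) < 0) {
        perror("socket/setsockopt");
        exit(1);
    }
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 128) < 0) {
        perror("bind/listen");
        exit(1);
    }
    return fd;
}

int main(void)
{
    /* Fork four workers (a placeholder count); each handles its own
     * connections entirely within its own process. */
    for (int i = 0; i < 4; i++) {
        if (fork() == 0) {
            int lfd = make_worker_listener(8080);
            for (;;) {
                int cfd = accept(lfd, NULL, NULL);
                if (cfd < 0)
                    continue;
                /* ... per-connection work stays within this worker ... */
                close(cfd);
            }
        }
    }
    pause();  /* parent just waits */
    return 0;
}
```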
No. We don't need to. I mean as long as -- you might get a benefit by pinning. It depends. Particularly, the electronic trading guys, they are very careful to pin their applications and threads and that's primarily because they are very sensitive to latency. If you don't pin your threads, what can happen is another thread will essentially sit on the core that you want your critical path to run on. The critical path is held up, you get a latency spike that costs you money, which is bad.
For that reason, the financial guys are very keen on getting that right. If you only care about throughput, then latency spikes don't matter very much. What you care about is that once a core becomes saturated, maybe one worker stays on that core but anything else migrates off it, in order to keep that core just for that one worker, and then other workers can run on other cores. The scheduler in the kernel will just do that naturally. So as long as you don't mind getting some latency spikes, it's not strictly necessary to pin. You will get slightly more predictable performance if you do. That's really an application tuning issue that is kind of orthogonal to kernel bypass.
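For illustration only (this is not specific to OpenOnload), pinning a latency-critical thread on Linux can be done with the standard affinity API; the core number chosen here is an arbitrary placeholder:

```c
/* pin.c -- a sketch of pinning a latency-critical thread to one core with
 * the standard Linux affinity API; nothing here is Onload-specific.
 * Build: gcc -pthread -o pin pin.c
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

static void *critical_path(void *arg)
{
    (void)arg;
    /* ... the latency-sensitive networking work would run here ... */
    return NULL;
}

int main(void)
{
    pthread_t t;
    cpu_set_t set;
    int err;

    pthread_create(&t, NULL, critical_path, NULL);

    /* Pin the critical thread to core 2 (a placeholder choice) so no other
     * thread sits on that core and causes the latency spikes described
     * above. */
    CPU_ZERO(&set);
    CPU_SET(2, &set);
    err = pthread_setaffinity_np(t, sizeof(set), &set);
    if (err != 0)
        fprintf(stderr, "pthread_setaffinity_np: %s\n", strerror(err));

    pthread_join(t, NULL);
    return 0;
}
```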
Oh, so on the one hand Solarflare does the kernel bypass acceleration technology. On the other hand, we just sell network adapters without that technology, and they're really good, fast network adapters that run on all the different operating systems you might want to support, including all the virtualization platforms: KVM and Hyper-V and ESX and Xen. And we do a similar acceleration technology for those as well. There is a technology called SR-IOV, which stands for Single Root I/O Virtualization, and what that allows you to do is hypervisor bypass.
So when you go into a virtual world, you add an extra layer of management but you also add an extra layer into the data paths, the networking. So your application talks to the kernel which does the protocol and then your kernel talks to a device driver and behind that device driver isn't real hardware, it's a virtual network device. When you send and receive packets through that interface, you are doing traps into the hypervisor and in the hypervisor there's a soft switch and on the other side of that soft switch, there is a real network adapter interfacing into a real network. So you have added this extra layer of stuff into the fast path and it really, really slows networking down.
Hypervisor bypass allows us to cut that layer out and essentially get back to bare-metal performance. What happens is you slice up your network adapter into a bunch of virtual network adapters, virtual channels, and SR-IOV gives you a standard way of exposing those slices of the network adapter, so you can give one slice to each guest. Each virtual machine instance has its own real slice of real hardware and can talk to it directly. The system IOMMU does address translation between the virtual machine's address space and the real physical address space. That's great. It essentially gives you back almost all of the performance that you lost.
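As a hedged sketch of how such slices are typically created on Linux (the interface name "eth0" and the VF count are placeholder assumptions; this needs root, plus adapter, driver, firmware and BIOS/IOMMU support for SR-IOV):

```c
/* sriov.c -- a sketch of carving an SR-IOV capable adapter into virtual
 * functions via the standard Linux sysfs interface.  "eth0" and the VF
 * count are placeholder assumptions; requires root and SR-IOV support in
 * the adapter, its driver and the platform (IOMMU enabled).
 */
#include <stdio.h>

int main(void)
{
    /* Each virtual function created here can be passed through to a guest
     * VM, which then talks to its own slice of the adapter directly. */
    FILE *f = fopen("/sys/class/net/eth0/device/sriov_numvfs", "w");
    if (f == NULL) {
        perror("fopen");
        return 1;
    }
    fprintf(f, "4\n");   /* ask the driver to create 4 virtual functions */
    fclose(f);
    return 0;
}
```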
And the other really cool thing is that we can do that and kernel bypass at the same time. So you can now have applications running in a virtual machine talking directly to network hardware and getting latency around the one point something microseconds mark from application to application and that might be application to application within the same physical box. So two VMs talking to each other or it could be across the network. So that's pretty cool.
Werner: That's pretty exciting.
It means the financial guys can now think about moving their high performance applications into their private cloud infrastructure which obviously gives a lot of flexibility and cost savings and all that stuff.
Werner: Right. Just from a management point of view because obviously virtual machines are nice to use but they introduce overheads.
Exactly. It's not entirely without cost. The downside is your virtual machine now really is talking to real hardware, so you can't just pause it or migrate it to another machine without dealing with that hardware dependency. So there is a small cost to that. But for the people who really care about performance, that's a small price to pay in order to get this really, really good performance even when you're in a virtualized environment. We also do some other special features that are particularly valuable in the financial services area but are useful elsewhere as well.
Our adapters have built-in support for hardware timestamping of packets and they have a stable oscillator. This allows you to do two things. It allows you to get very accurate timestamps for every packet that you send and receive, and it allows you, together with the PTP protocol, to synchronize the clocks on all of your different servers. The effect of that is that the clocks on the network adapters and the clocks on the servers are all synchronized to UTC to within +/-100 nanoseconds or so of each other. And that means that if you take a timestamp on this server, and then you take another timestamp on another server on the other side of the network, and you compare those two timestamps, you'll get an accurate measurement of the time difference between when those timestamps were taken. So you can accurately measure the latency of moving messages between machines.
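As a rough sketch of how an application can ask for those hardware receive timestamps through the standard Linux SO_TIMESTAMPING API (the UDP socket and port are placeholder assumptions, and the adapter and driver must support hardware timestamping, which usually also has to be enabled separately via ethtool or the SIOCSHWTSTAMP ioctl):

```c
/* hwstamp.c -- a rough sketch of requesting hardware receive timestamps
 * through the standard Linux SO_TIMESTAMPING API.  The UDP socket and port
 * are placeholder assumptions; the NIC and driver must support hardware
 * timestamping, usually enabled separately via ethtool or SIOCSHWTSTAMP.
 */
#include <linux/net_tstamp.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <time.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port = htons(9000),
                                .sin_addr.s_addr = htonl(INADDR_ANY) };
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    /* Ask for raw hardware timestamps on received packets. */
    int flags = SOF_TIMESTAMPING_RX_HARDWARE | SOF_TIMESTAMPING_RAW_HARDWARE;
    setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &flags, sizeof(flags));

    char data[2048];
    char ctrl[512];
    struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = ctrl, .msg_controllen = sizeof(ctrl) };

    if (recvmsg(fd, &msg, 0) < 0) {
        perror("recvmsg");
        return 1;
    }

    /* The timestamps arrive as ancillary data; ts[2] is the raw hardware
     * timestamp taken by the adapter. */
    for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c; c = CMSG_NXTHDR(&msg, c)) {
        if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_TIMESTAMPING) {
            struct timespec ts[3];
            memcpy(ts, CMSG_DATA(c), sizeof(ts));
            printf("hw timestamp: %lld.%09ld\n",
                   (long long)ts[2].tv_sec, ts[2].tv_nsec);
        }
    }
    return 0;
}
```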
If you don't use this sort of technology, if you just use standard NTP, then you can't get accuracy better than a millisecond. If it only takes packets 10 microseconds to get from one side of a network to the other, millisecond accuracy isn't enough; it's useless. So that's extremely useful. The other thing is that of course our network adapters are able to send packets at enormously high rates: millions and millions of packets per second, tens of millions actually now, I think.
They're also able to do hardware timestamping, and you can receive packets from a 10 gigabit network at line rate, at any packet size, and do that reliably without dropping anything. That's exactly what a capture card does. The difference is that a network adapter will cost you a few hundred dollars; a capture card will cost you thousands of dollars. So now you have the ability to take a standard server and a standard commodity network adapter and do highly accurate packet capture with timestamping on that platform, and it's the same platform that you might be using for your other applications, so it's very flexible. It's also very cost effective.
So, www.solarflare.com. We sell our adapters through the channel, so your VARs will be able to sell you Solarflare. We also sell through the major OEMs: IBM, HP and Dell all have our adapters available as options to be fitted at the factory. And of course we have sales agents and systems engineers throughout all of the major centers in the world.
Werner: Great. We'll all check it out and thank you, David.
Thank you very much.