Transcript
As I said in the intro earlier today, the operating system is something we often just take for granted. We assume it isn't really changing, and that it hasn't been affected by the way that how we develop all the rest of software has changed. In some ways that's a little bit true, and in some ways it's not true at all. In this talk I'm really going to try to give you an idea of what really is changing, the directions in which we're making interesting changes, and also talk about some of the things that haven't really changed and how we might try to fix those in the future.
I'm Justin Cormack. I'm an engineer at Docker, based in Cambridge, UK. I used to work at Unikernel Systems before that. I'm primarily a security person now, but I have a broad interest in this whole space and how it's changing. I'm a software engineer at heart, although I've done operations stuff too, so I've got a background in how everything fits together. Cambridge is a sort of high-tech village in England. This is just down the road from where I live, where pulsars were discovered back in the '60s by Jocelyn Bell. So it's an exciting place, full of cutting-edge stuff and farms.
So has anything changed? I think yes, things have changed. We've definitely made some progress, and there are a bunch of ongoing changes that are really quite big now. There's a whole bunch of stuff that's performance driven; our two talks this morning explored some of that if you went to them, and we'll talk more about that as well. Operations is changing things too. There's this really exciting field around emulation and portability, which we're going to talk about this afternoon. And there's a gradual death of a lot of legacy things that were holding things back.
But there are also a lot of things which haven't changed. We're going to talk about the whole monolith thing, the lack of diversity, and programming. We've got some stuff about programming languages later on; Brian's going to talk about Rust this afternoon, which is very exciting. And security is a thing.
Performance
So, performance. These are the two talks we just had; if you missed them, I highly recommend you watch the recordings. They covered some of these themes. I want to start with this very old quote, from the '60s I think (I couldn't actually find the date): "A supercomputer is a device for turning compute-bound problems into I/O-bound problems." If you look at the history of computing, at different times we've had problems with getting data in and out of our computers fast enough for different problems, and with actually doing enough compute on it.
One of the things that has really changed noticeably is networking. A decade ago, 1 gigabit Ethernet was the thing, and it had been for a very long time, and it was easy to write code that could handle 1 gigabit Ethernet. Then suddenly, over a fairly short period of time, 10 gigabit Ethernet turned up, first in switches, but then directly in the computer. And then we got this current sort of mess: 25 gigabit, which is what you get, for example, on AWS if you go for high-performance instances, up to 100 gigabit to the server, which is still cutting edge, and lots of in-between speeds for people who have different numbers of fibers. And 10 gigabit is now basically everywhere; you can get laptops with 10 gigabit Ethernet nowadays.
So we went from 1 gigabit up by two orders of magnitude in a few years, and that really changed the way you had to code. Performance at 1 gigabit was easy, at 10 gigabit it was hard, and at 100 gigabit it was really difficult. There were architectural changes about having more I/O bandwidth: during this period, CPU clock speeds only doubled while we had a 100 times increase in I/O performance. That's a big change. The same thing is happening with storage, with SSDs and NVMe and the things Alan was talking about before.
So you can't just adapt. You've got really difficult problems: you've got something like 130 clock cycles per packet at 10 gigabit Ethernet, so if you're dealing with small packets you've really got to absolutely optimize your code. We spent a lot of time getting to the point where 1 gigabit is easy, with things like epoll. There was this problem back in the year 2000 when everyone was concerned about C10K, having 10,000 connections to your server at the same time. That's trivial now; you can have 10 million connections, and C10M is just about doable now. It's really changed. Alan was talking about 500K transactions per second being perfectly realistic on SSDs, where we used to talk about maybe hundreds of transactions on a hard drive. These are really big changes in a fairly short period of time, and you can't just run the same code to do the same thing.
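To make the C10K point concrete, here is a minimal sketch (mine, not from the talk) of the epoll pattern that made tens of thousands of connections per process tractable: one thread asks the kernel which of its many sockets are ready, instead of blocking on each one. The bound, listening, non-blocking socket is assumed to be set up elsewhere.

```c
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_EVENTS 64

/* Minimal epoll loop: one thread multiplexing many connections.
 * listen_fd is assumed to be a bound, listening, non-blocking socket. */
void event_loop(int listen_fd)
{
    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    struct epoll_event events[MAX_EVENTS];
    for (;;) {
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            if (fd == listen_fd) {
                /* New connection: register it for readiness events too. */
                int conn = accept(listen_fd, NULL, NULL);
                struct epoll_event cev = { .events = EPOLLIN, .data.fd = conn };
                epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &cev);
            } else {
                /* Data ready: read it; real code would parse and respond. */
                char buf[4096];
                ssize_t r = read(fd, buf, sizeof(buf));
                if (r <= 0)
                    close(fd);  /* peer closed, or error */
            }
        }
    }
}
```

Even this pattern, though, still pays a system call per batch of events and per read, which is part of why it runs out of steam well before 100 gigabit.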
So that was what happened with networking, and with storage the same process is underway. Pretty much all the databases and so on have switched over to SSD, and it looks like almost all non-archival storage will switch over too. Then NVDIMM is the next stage: a memory form factor, 10 times the density of RAM, lower power consumption, cheaper, but with similar byte addressability and, hand-wavingly, similar-ish latency to RAM. These are really big architectural changes.
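As a rough illustration of what byte addressability changes (again mine, not from the talk): instead of pushing writes through the block layer, you map persistent memory and store to it like RAM. The path and the pre-created file below are assumptions, and real persistent-memory code would typically use MAP_SYNC plus CPU cache flushes (for example via libpmem) rather than a plain msync.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t len = 4096;

    /* Hypothetical file on a DAX-mounted filesystem backed by an NVDIMM. */
    int fd = open("/mnt/pmem/log", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* An ordinary CPU store: no write() syscall, no block layer. */
    strcpy(p, "state that survives a reboot");

    /* Push it to the persistence domain (simplified; see note above). */
    msync(p, len, MS_SYNC);

    munmap(p, len);
    close(fd);
    return 0;
}
```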
How to Fix it 1: User Space
There have been two approaches to dealing with this at the operating system layer, because the big problem was basically the whole programming model around I/O: you make a little buffer of something, you call into the kernel, the kernel copies it to the device; data comes back in, it goes to the kernel and then back to userspace. All these transfers between userspace and kernel space are slow; they add latency, and unpredictable latency at that. The first thing people tried was basically kernel bypass: "Let's just do everything in user space."
Since system calls are slow, just run everything in userspace. We'll write userspace device drivers, we'll do userspace networking, we'll do userspace storage. And this sounds like a weird thing: why are people totally ignoring the kernel, ignoring all these device drivers and things that are already in the kernel, and just rewriting them all from scratch in user space?
Well, one thing is that it's a little bit easier than it used to be to even think about doing that, because the number of actual device drivers for hardware - something we'll talk about a little later as well - actually went down, because hardware stopped being hardware at some point in the last decade. It became a bunch of software that you talk to over an API, and those APIs became more standardized. We saw that with things like the SATA and NVMe command sets. And now Mellanox has a single Ethernet driver for all their Ethernet cards, regardless of what speed they are. You don't talk to the hardware anymore, you talk to an abstraction of the hardware; the hardware itself is another CPU running some software that you talk to over a bus. So it became feasible to write these things. It wasn't trivial.
DPDK was a big networking framework, originally from Intel, for doing this. Then they came out with SPDK for storage. There are some lighter-weight ones like Snabb, which was written in LuaJIT: "Let's do all this, but let's do it in a high-level language as well while we're going to userspace," which is really good fun and very different.
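To give a flavor of what kernel-bypass code looks like, here is a rough sketch of DPDK's poll-mode receive pattern (my sketch, not from the talk): a dedicated core spins pulling bursts of packets from a NIC queue mapped into userspace, with no interrupts and no syscall per packet. EAL, mempool and port setup are omitted, and port 0 is assumed to be configured and started already.

```c
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* DPDK poll-mode receive loop (setup omitted; port assumed started). */
static void rx_loop(uint16_t port)
{
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {
        /* Pull up to BURST_SIZE packets straight off the NIC's RX ring. */
        uint16_t nb = rte_eth_rx_burst(port, 0, bufs, BURST_SIZE);

        for (uint16_t i = 0; i < nb; i++) {
            /* ... parse, filter or forward the packet here ... */
            rte_pktmbuf_free(bufs[i]);   /* hand the buffer back to the pool */
        }
    }
}
```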
One of the biggest examples of this as a successful production framework is Seastar, which is a database application framework. It originally shipped as a unikernel, but it basically runs C++ on DPDK, all built around as much performance as possible: use all your CPUs, avoid locking, just use message passing over ring buffers. And it looks like Cassandra, or Memcached, or Redis, depending on which flavor of it you want to look at. They publish benchmarks - I don't know if they're really up to date - but on 20 CPUs they're talking about maybe twice the performance, multiple millions of transactions per second on a single box if you've got enough CPUs, by avoiding all the locking overhead and so on. So that's been arguably quite successful, but in many ways I think the second approach has recently taken off a lot more.
How to Fix it 2: Kernel Space
So the second approach is: "Well, okay, rather than bypassing the kernel, we could bypass userspace and run code in the kernel instead." If you went to Thomas's talk earlier, he talked about this. The context switch to userspace is too expensive, so let's run our code in the kernel, but not in the traditional way of writing our custom application in the kernel, because we all know that was tried many years ago and we all decided it was a really terrible idea. The idea was to create a new, safe, in-kernel programming language that had a degree of isolation but without giving up the performance. That's eBPF. I like to describe it as AWS Lambda for the Linux kernel: it's an event-driven framework for running small programs in the Linux kernel when things happen. And those things can be a network packet arriving, a system call, an interrupt, or any kind of event in the kernel.
It was originally very limited; it came from the old BPF, which was a very, very small, tiny language. It only gained function calls recently. You'd think function calls would be the first thing you'd have in a programming language, but if you're just doing performance work people say, "Oh, we don't need function calls," and then it turns out maybe function calls would be useful after all. And then there's XDP, the eXpress Data Path, which is a network framework that also lets you hook into doing things in user space easily as well: forwarding, filtering, rewriting and load balancing. Most of the work that's been done here so far is in networking, but this is a framework for absolutely anything in the kernel.
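To give a sense of the programming model, here is about the smallest possible XDP program (a generic example of mine, not from the talk): it runs in the kernel for every packet on the interface it's attached to, before the normal stack sees it, and decides the packet's fate with a return code. A real filter or load balancer would parse the packet data in ctx.

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Minimal XDP program: let every packet through. Returning XDP_DROP
 * instead would discard packets at the driver, at line rate.
 * Typically compiled with clang -target bpf and attached with
 * something like: ip link set dev eth0 xdp obj prog.o sec xdp
 * (the interface and file names here are just examples). */
SEC("xdp")
int xdp_pass_all(struct xdp_md *ctx)
{
    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
```

The verifier checks programs like this before they load, which is what makes it safe to let them run inside the kernel at all.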
So Thomas - if you missed his talk this morning, do watch it later. Cilium is the codebase he's working on, which is all open source: a fast path for networking, designed around microservice [inaudible 00:13:11] use cases. It can bypass TCP connections between sockets on the local host, and it's really a lot faster than a traditional data plane like Nginx or Envoy; for many applications it's 3 to 10 times as fast.
These are some older latency charts from a while back, for communication between two local Kubernetes pods. You can see that if they talk directly, no proxy, that's the line just below the bottom. These are Nginx and Envoy as proxies: you get tail latency in the 99% case that's really quite bad, although they're good in the average case. And then Cilium sits there, and it's actually faster than not having a proxy at all. That's because it can bypass running through whole parts of the kernel and just connect the two processes directly, end to end, because the kernel knows about things. You write what you think is TCP going to another process over the network stack, but the kernel can just take those bytes and stick them straight in the receive buffer of the other process without going through any of the intermediate pieces.
That's really the whole thing, and the tail latency is that lovely flat curve. You don't get a terrible tail latency situation like you do in the worst case, because you're really not going through very much of a set of abstractions and proxies and layers and layers of stuff on top of each other. So definitely catch up with Thomas's talk if you missed it.
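Cilium's real datapath is far more involved, but the kernel mechanism that this local-socket bypass builds on can be sketched with BPF sockmap redirection (my hedged sketch, not Cilium's code): a control process, not shown, puts the two established sockets into a map, and a verdict program steers data from one straight into the other's receive queue, skipping most of the TCP/IP stack in between. The map layout and key choice here are hypothetical.

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Two-slot socket map; userspace control code (not shown) inserts the
 * two local TCP sockets it wants to short-circuit. */
struct {
    __uint(type, BPF_MAP_TYPE_SOCKMAP);
    __uint(max_entries, 2);
    __type(key, __u32);
    __type(value, __u64);
} sock_map SEC(".maps");

/* Verdict program: redirect the payload directly to the peer socket
 * held in the map. */
SEC("sk_skb/stream_verdict")
int redirect_to_peer(struct __sk_buff *skb)
{
    __u32 peer_index = 0;   /* which map slot holds the peer (example value) */
    return bpf_sk_redirect_map(skb, &sock_map, peer_index, 0);
}

char _license[] SEC("license") = "GPL";
```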
The networking piece is the really mature piece; it's ready for production now. People - Facebook in particular - are moving loads and loads of their stuff to eBPF. Basically anything where you were using iptables or anything like that, the old way of talking to Linux: people are starting to get rid of that stuff now, it's all going away, and they're writing clean in-kernel eBPF-based stacks instead. It's obviously a whole new set of APIs you're going to have to learn, and it looks nothing like what programming in userspace looks like, so it's definitely a whole new language and layer that's growing inside the kernel for doing things in a different way. And it's obviously new and different. The nice thing about Cilium is that it provides a framework for working with that stuff that's a bit easier than diving straight in to work on it yourself, so I recommend looking at that.
Choosing User Space or Kernel
In terms of the choice between userspace and kernel, userspace seems more attractive because the tooling is easier and debugging is just debugging a normal userspace process. But it's definitely not as easy as it sounds: apart from DPDK there's not a lot of library support, and DPDK is not that great to use. Kernel space, surprisingly, has more libraries, because you can reuse whole bits of Linux, but the tooling is weird by comparison, and so is the debugging; things will just crash and be weird and you'll stare at them for a long time, and it can be complicated when you get started.
There are hybrid approaches now as well, where DPDK can use XDP to get the packets into userspace, which avoids writing a lot of the drivers. More projects recently have been using eBPF; there's maybe more of a community around it, I think, and it's definitely been more cutting edge. But I think it's definitely becoming a really important tool.
Operations
The next thing I want to talk about after performance is operations. You might possibly recognize the people in this picture - Brian certainly would, I think. These are the founders of Sun Microsystems in, I think, the '80s, with their first Sun workstation. The Sun workstation defined the way we use Unix and the way we use servers, even though, weirdly, it was designed as a workstation. And things have surprisingly not changed very much. The '80s and '90s were a radical time, when Unix reached the workstation, the user, the world. And there was the really exciting concept of the 3M computer: a megabyte of memory, a megapixel of display and a megaflop of compute, for under a megapenny, $10,000.
Our computers now have at least a gigabyte of memory and a gigaflop of compute. We still don't have gigapixel displays, but they cost rather less, so at least prices have come down. But they look much the same. That was the era when we got NetBSD, Slackware, Debian, and Red Hat; all those things came out in a very, very short period of time, and they were all modeled on the Sun workstation that was everyone's $10,000 dream, brought to cheap PC hardware. If you've used a Sun workstation and a modern Linux machine, they're pretty much identical. More packages, faster, more memory, but the way we use them - install packages, use the shell - none of those things have changed.
Operations has started to change a bit. We talk about continuous delivery and all those things, and operating systems have been the last holdout, but we've had some change in this area. The vast majority of operating systems are now, fortunately, no longer installed by hand. We used to go and install every computer by hand from a terrible list of instructions that was inaccurate; now at least it's automated. The vast majority of operating system instances ever deployed almost certainly never have a human actually log in to them; they're created by automated tooling, they do something, and they go away again. Sadly, too many still do have people log in and change things. The whole automation piece around operating systems really started with config management, which was originally created in order to have the same set of cron jobs on a bunch of computers. It grew from there, and became about doing what you would do as a sysadmin, but getting the computer to do it for you. It's very much modeled on copying what a human would have done if you'd had to type in all that stuff yourself, or edit the config files, and things like that. It's very much modeled on this whole workstation model.
Around 2011 we saw probably the first early signs that this was changing, with Netflix's "Building with Legos" article about immutable delivery: "Let's build an operating system image, test it, make sure it's what we want, ship it, and throw it away when we're done. If we want to change it, we launch a new one, a new AMI." But the tooling was still very much based around operating systems which weren't really designed with that paradigm in mind. I want to talk a little bit about some glimpses of changes to that.
So LinuxKit is a project that I've worked on for a few years now. It's an open source project that we built to try to bring a model of modern continuous delivery to the operating system. It's a little toolkit of pieces for how to run Linux. It's designed for continuous delivery, designed to be tested in a CI pipeline, and designed to be fast: none of the slow install you get with Linux distributions. You can build an image in under a minute, test it locally, ship it to production, and build everything in the CI pipeline. Everything is designed to be quick, and fast, and small, and secure, and to fit with a modern microservice-type architecture, not with an old monolith.
It's designed around the same model as a pod in Kubernetes: you sequentially run your configuration steps until everything's configured, and then you run a bunch of services. In Kubernetes you often just run one service, and you can do the same thing here; if you have a single specialist device, you can just run a single service. Like everything modern, it's sadly configured from a yaml file, but a simple-ish yaml file. I'll show you an example in a second.
It's built to be immutable from the start. You can run it from an ISO or a root filesystem or whatever. There's no package manager or things you can log in and do, because it's not designed for humans to log in and do things; you update it with a new image if you want to change it. You don't do install, update, reboot cycles. It's a from-scratch design for modern operations, but for the operating system.
There's loads of tooling around all the modern things you want to do, like creating AMIs, GCP images, and disks in all sorts of formats for different virtualization platforms or for different hardware, for bare metal or VMs, or whatever. And there's a simple build, push and run workflow: you build something, push it as an AMI, and then start running machines from that AMI, or do the same thing with iPXE on bare metal, or locally on KVM, or Hyper-V, or anything you like. So it's just build, push it to some sort of artifact store, and run. That example is for Google Cloud; it's just very, very simple.
So let's think about the operating system as a modern DevOps artifact. We've got a little yaml config file that's basically going to run Nginx. That's a container image of Nginx off Docker Hub, and to set it up we need to run DHCP, because we need to get a network address; we've got a login console so we can log in; and we run a random number generator daemon to get some entropy. We want to build this, so we build from the yaml file. It extracts a bunch of images, which have long reproducible hashes, so we know exactly which pieces we're getting; it puts those together, and the full system just gets created, and there it is. So that's built a Linux image that runs our one application. Then we can run it locally - I'm going to run it on a Mac because I'm running on a Mac - and you're booting Linux; that's Linux booting up. We've got some messages about a bunch of services, and we can see we've got some processes: I've got Nginx running as a process, and we can fetch it locally, get localhost, and we get the default Nginx page. So we've created a totally custom image, we've booted it up, we've run Linux, we can shut it down, and then we can modify it, iterate, change versions of our software, test it remotely, and so on and so forth.
So we've got a real vision of what a modern way of doing operations on operating systems could look like. This was a project we did internally at Docker and open sourced two years ago now, and it's being used by all sorts of large companies to do interesting things. It's very much a change towards doing modern operations on operating systems, pointing to a way to get away from the old approach of manual work and so on.
Emulation
The next big change I want to talk about is emulation. Emulation might seem a weird thing to talk about in an operating systems talk, but it comes from a perhaps accidental decision that Linus Torvalds made when he was building Linux: that it would have a stable ABI. He's very insistent on this now: "If a change results in user programs breaking, it's a bug. We never blame the users." You can run a program on Linux that you wrote a decade ago and it should basically still run.
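That stable contract is ultimately the system call ABI: numbers, argument conventions and semantics. Here is a tiny illustrative example of mine of talking to it directly; whether the thing answering is the real Linux kernel or some compatibility layer is invisible to the binary, which is exactly what makes the emulation story below feasible.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    const char msg[] = "hello via the stable Linux syscall ABI\n";

    /* Invoke write(2) and exit(2) by syscall number, bypassing the libc
     * wrappers. Anything that faithfully emulates the ABI must handle
     * these identically to a real kernel. */
    syscall(SYS_write, 1, msg, sizeof(msg) - 1);
    syscall(SYS_exit, 0);
    return 0;   /* not reached */
}
```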
This has basically been the case forever in Linux, and people have actually used it since the mid '90s to run Linux programs on other operating systems. NetBSD started doing this, running Linux programs, in 1995. It was quite common back then for one Unix to emulate software for other Unixes; Linux did it to some extent for other systems when it started, which is why some Linux architectures like MIPS look really weird: they were designed for easy emulation of SGI's Irix, because that was what everything on MIPS ran back then.
Solaris started doing this in 2004, FreeBSD in 2006. A lot of these weren't used that much; a lot of people used them for running Netscape on the desktop on these weird operating systems for quite a long time, and not much else. But then came the first modern wave of this: I think Joyent, in 2015, decided they actually wanted to run Linux containers on SmartOS, and they revived the effort again, doing 64-bit Linux. Windows launched the WSL emulation in 2016, although internally at Microsoft it had been around for many, many years before that, maybe a decade, in various forms. They started doing it as a developer tool, and now it's becoming an important part of Windows Server; it's not just going to be a development tool. And then there's gVisor - there's a talk about gVisor later this afternoon - a whole userspace emulation of Linux that Google open sourced in 2018.
So it's a bit weird, but it turns out that because Linux, or Linus in particular, made this stability promise, you can just emulate Linux if you spend enough time doing it. Why would you want to do that? Not to run browsers on your desktop on a weird operating system. There's actually a whole bunch of reasons. The first is security, and I think gVisor is probably the best example of this. gVisor was designed so that Google could safely run untrusted user code on Google App Engine without having to run it in a separate VM or some other form of isolation. So you can emulate your operating system for security reasons: you get a security boundary where the user thinks they're calling Linux, but they're really calling your userspace emulation that pretends to be Linux.
There's portability as well: you can run your Linux code on Windows, which is great if you have to run in a Windows environment. It also means that in the future we can start running Linux code anywhere else. The SmartOS use case: we can run those containers on SmartOS, we don't have to run them on Linux. These are all really exciting use cases, because in the longer run it means we're not really tied to Linux. If Linux changes totally, if we all just write eBPF programs, we can still run our old Linux code somehow, on some sort of emulation of the old remnants of the Linux we used to use.
Performance-critical software will always be written in ways that map directly to hardware and so on, but non-performance-critical code is the vast majority of code, and that's going to be able to run anywhere. We're getting to the point where this emulation is getting really, really good; it can run the majority of code. There's work to be done, because the Linux system call interface is gigantic, and then there are all the other interfaces hidden deep inside it, so it's a massive amount of effort, but it's something we can do. So we can start to see platforms like App Engine, and I think probably in the future AWS Lambda, stop running Linux and start running on new platforms that are designed for performance and security, rather than having to be Linux anymore. This is a process that has been underway for a few years; we're still seeing how it's going to play out, but it's really interesting.
Then there's a portability aspect related to that. The vast majority of Linux - far more than 70% - is actually just drivers, for all sorts of obsolete hardware that no one really uses, or for hardware that's not very important for servers in particular. Linux is obviously used for loads of other things as well as servers, but from the point of view of actually running applications on servers, most of the operating system is becoming irrelevant. Lots of applications run on virtual machines; most people don't even touch the hardware anyway, so they don't care. And as we said, hardware is becoming more standardized, with standard interfaces, and servers are all 64-bit. So we're really getting to the stage where we don't need a lot of the operating system to run server applications; we just need a few little bits of it. We don't even need the bits that used to be really important: operating systems used to be multi-user systems that lots of people logged into, and that doesn't really happen anymore; mostly you just run one or two applications on your server.
Users are a bit of a weird legacy abstraction in the operating system. The real scarce resources that matter are physical: memory and I/O bandwidth. Those are the things you need to share between applications and different use cases. But the actual applications really care about things like tail latency and service level agreements, not about the model of which person is logged in, who's typing stuff, who owns which files. All those things are becoming increasingly irrelevant in the operating system.
The Death of Legacy
We don't need all the legacy stuff. There are only a few manufacturers of 10 gigabit and 100 gigabit Ethernet cards now, maybe four or five. So in theory, building a whole new server OS from scratch, running emulated code on it and so on, is easier than it has ever been. It hasn't really happened. I'm going to talk about this bit in a second, but it's surprising: as all this has happened, we've actually got to a stage where there are fewer operating systems in general use than there have been for a long time.
There are only three operating system families with any market share at all: Linux and Android, Windows, and Mac and iOS. For server applications, Linux and Windows are the only ones that matter. And even on Azure, which is probably where there's the most Windows anywhere in the modern application cloud, the amount of Windows is declining rapidly and Linux is now over 50%. On other cloud providers it's even more Linux. So we've got to the point where, in theory, it should be really easy for people to ship new operating systems, and yet we've actually got hardly any.
Lack of Diversity
We've got to the point where we've got a total monoculture of Linux on its own, which is very convenient - everyone just codes everything for Linux - but it's very limiting for new ideas. If you have a new idea about operating systems, you have to push it through the Linux contribution process. There are some other great operating systems that no one uses, with really, really interesting stuff going on, microkernels and all sorts of things, but hardly any users.
Even back in the '70s - this is from Ted Nelson's weird book "Computer Lib," which I highly recommend - systems people were seen as a weird, isolated bunch of people. Linus finally admitted what everyone could tell: he's a total jerk. And that's been terrible for contributor diversity in operating systems. Operating systems haven't had a diversity of ideas; we've got this monoculture, and the gatekeepers are not the friendliest people in the world. This has definitely hampered the whole field of operating systems: they've lagged behind other software in change and diversity and so on, and they are really the last monolith.
Linux is more than 20 million lines of code, literally in one program, effectively. Windows is 50 million. A Linux distro is half a billion lines of code. These are very big, monolithic systems that really just aren't how we make software anymore. Windows is really scary: three and a half million files, eight and a half thousand pushes a day, the biggest Git repo in the world. Linus wrote Git for Linux; Microsoft is rewriting Git for Windows because Windows is even bigger and more difficult to deal with. It's just not an easily approachable system where you can say, "I'm just going to hack on this one day."
It's also the last bastion of code written pretty much entirely in C. Come to Brian's talk at 5:25; he's going to talk to us about Rust and whether we should be rewriting the operating system in Rust. The operating system has not been rewritten in Rust yet, and it's an interesting conversation; we'll talk about that later. But again, it definitely doesn't help in making operating systems approachable, easy to understand, and in helping people understand what our systems are built on.
Security
Security and operating systems. Security has not been a driver for change in operating systems. Linux has really preferred going fast over being secure. It's made some progress, but it definitely hasn't really changed the design space of operating systems. There's a gradual demand for more security, but it's not clear that operating systems are going to be the area that supplies it. Meltdown and Spectre maybe changed that, maybe didn't, probably not. Windows is probably more secure than Linux now, sad to say. Bill Gates came out in 2002 and said, in effect, "Windows security is a pile of total and utter rubbish, we're being laughed at; we've got a stupid operating system and no one's going to use it if we don't fix that." And they largely have fixed that; it's a lot better, probably more secure than Linux, I'd say. And Linux is not on the leading edge of this.
Unikernels: the Radical Answer
A while ago we worked on unikernels, which were a radical answer to a lot of these questions: "Let's build something the way we build modern applications. Let's take a bunch of libraries, link them together, and have that provide the system-like services your application needs. Pick and choose the ones you want, use modern languages, make the whole thing a bunch of libraries." We've had some successes, but they're still a little niche. Microsoft SQL Server on Linux is a unikernel: it's a unikernel that provides the Windows services SQL Server needs, in a layer that can run on Linux. That's probably the most widely used unikernel project there is.
There are many commercial, internal, non-open-source projects as well. If you talk to a telco company, it turns out they've all got five or six unikernel projects, so there's a load of stuff going on. The idea was that you could build stuff as libraries, build systems from components on GitHub that you put together, which is the way we build other software. But you can't just generally take the code out of an operating system and use it like that. For a start, operating system code is designed to run in a really peculiar environment, with peculiar constraints; it doesn't run like normal code, and so there's a lot of rewriting going on. I think Brian will talk about a lot of this in his talk.
To summarize, I would say that operating systems have changed a bit, but they can change more. Performance and operations are two key areas where big changes are really happening now. But the exciting areas around emulation, unikernels, and hopefully more diversity will really be where the breakout changes come in the next 5 to 10 years. These are, I think, the really exciting changes that are going to change the way we actually ship stuff, and I'm looking forward to it. I hope more people are excited about operating systems after all these years. We had decades where it was just the Sun workstation experience - it was amazing, nothing changed - but I think we're getting beyond that now and new things are happening. So thanks very much for coming.