
Justin Cormack on Decomposing the Modern Operating System

Justin Cormack discusses how the modern operating system is being decomposed with toolkits and libraries such as LinuxKit, eBPF, XDP, and what the kernel space service mesh Cilium is doing. Wes Reisz and Justin Cormack also discuss how Cilium differs from service meshes like Istio, Linkerd2 (previously Conduit), or Envoy. Justin is a systems engineer at Docker; he previously worked on unikernels at Unikernel Systems in Cambridge before it was acquired by Docker.

Key Takeaways

  • LinuxKit is an appliance way of thinking about your operating system and is gaining adoption. There are contributions now from Oracle, Cloudflare, Intel, etc. Docker has seen interesting use cases such as customers running LinuxKit on large cloud providers directly on bare metal (more on this coming soon).
  • The operating system of today is essentially unchanged since the Sun workstations of the 1990s. Yet everything else about software - automation, build pipelines, delivery - has changed dramatically.
  • XDP (eXpress Data Path) is a packet processing layer for Linux that lets you run fast, compiled, verified-safe programs in the kernel, written in eBPF. It’s used for things like packet filtering and encapsulation/decapsulation.
  • Cilium is an in-kernel, high performance service mesh that leverages eBPF. Cilium is very good at layer 4 processing, but doesn’t really do the layer 7 things that some of the other service meshes can offer (such as proxying HTTP/1 to HTTP/2).

Show Notes

What are you doing now you work at Docker?

  • 01:05 I have spread out a lot more [from doing unikernels] to the bottom of the stack - a lot of the driving factors was about security, and that’s been my focus recently.
  • 01:20 There’s a lot of infrastructure building around unikernels.
  • 01:30 I worked on LinuxKit which was inspired by the philosophy of unikernels, even though it is more traditional and approachable.

How has LinuxKit been going?

  • 01:50 It’s been interesting - we built it for our own internal purposes, but we released it as an open source project to be useful for others and to get contributions.
  • 2:00 It was released just over a year ago - it is very different now - similar to what CoreOS did originally, but it’s still different.
  • 2:15 We had an initial enthusiasm when we first launched, and it’s been growing gradually.
  • 2:20 Recently we have had more people adopting it - we’re hoping to talk about it soon - but we’ve got people in big cloud providers and big service providers using it.
  • 2:30 We have had contributions from Intel, Oracle, Cloudflare, so it’s finding its home in people’s tech stacks.
  • 2:45 I still find I have to talk to people a lot about it, because it’s a different approach.

How is it different?

  • 2:55 You have to think of it as an appliance rather than a workstation.
  • 3:00 Those changes in how you use your tools don’t just happen overnight.
  • 3:05 People are finding little niches where it makes sense and expanding out to other areas.

What are people using LinuxKit for?

  • 3:30 There is a mix - but if part of your pipeline runs a single service, such as an image transformation pipeline, and you want to do that in a VM, you can build a LinuxKit image to do that.
  • 4:05 We also have people using it on bare metal.
  • 4:10 We have people using it for IoT type systems as well.

Does the hypervisor see it as a VM?

  • 4:25 Yes, when it’s running as a VM the hypervisor will treat it as such, but we have people running on bare metal as well - it doesn’t need a VM.
  • 4:35 We had a lot of people running it on bare metal at first, when we thought the use-case would be VMs.
  • 4:40 It turns out that bare metal automation has made a lot of progress in the last five years or so.
  • 4:45 You can load things onto bare metal straightforwardly and easily, and you don’t need VMs to do that.
  • 4:50 We have seen that with AWS launching bare metal instances, followed by other cloud providers.
  • 5:00 It’s a real change that bare metal is not difficult to deploy to any more.
  • 5:15 One of the questions people have with containers is why you still have VMs if they’re only being used to run containers.
  • 5:20 You have to ask if VMs are useful for your particular use case.

You gave a QCon London talk - The Modern Operating System in 2018 - why are we now starting to rethink it?

  • 6:25 The Unix operating system has been largely unchanged since the 1990s Sun workstations.
  • 6:40 It was an operating system and an environment that was designed for developers.
  • 6:45 It was an exciting time, and led to Linus wanting to build Linux so he could have his own workstation.
  • 7:00 Linux distributions evolved to replicate that experience, and changed from proprietary to open source over that period.
  • 7:10 The design is recognisable - if you put someone down in front of a workstation from the 1990s they would know how to use it; many of the commands would be the same.
  • 7:15 We haven’t really re-thought that design, although everything else has changed.
  • 7:20 What we changed was how we deliver software; the big change has been around speed of delivery, automation of pipelines.
  • 7:35 The delivery and automation of the OS has changed much more slowly than everything else.
  • 7:40 We are catching up to bring OS delivery into that kind of world that we have for our software.
  • 7:50 Operating systems are written in C by niche teams - and that’s not how we write other software.
  • 8:00 Now the operating system is just about the only software still written like that.
  • 8:05 Coming back to LinuxKit, the requirements were that you should be able to build it within a minute and then deploy it.
  • 8:10 You can’t do that with a big monolithic enterprise Linux - you can’t build a new custom image from scratch in under a minute.
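The build-in-a-minute requirement works because a LinuxKit image is defined declaratively: you list a kernel, an init layer, and the containers to run, and nothing else goes in. A minimal sketch of such a definition follows - the top-level sections (kernel, init, onboot, services) are LinuxKit’s real YAML schema, but the image tags and the transformer service are illustrative placeholders, not pinned releases:

```yaml
# Sketch of a LinuxKit definition: kernel + init + services, nothing else.
# Image names and tags are placeholders for illustration.
kernel:
  image: linuxkit/kernel:4.14.x
  cmdline: "console=tty0"
init:
  - linuxkit/init:latest
  - linuxkit/runc:latest
onboot:
  - name: dhcpcd                  # one-shot task: bring up networking at boot
    image: linuxkit/dhcpcd:latest
services:
  - name: transformer             # the single long-running appliance service
    image: example/image-transformer:latest
```

Running `linuxkit build` against a file like this assembles a bootable image that can be launched as a VM or deployed to bare metal.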

Hardware and performance have changed a lot too.

  • 8:35 The real step change that became noticeable was going from 1Gb to 10Gb ethernet.
  • 8:40 Unlike most things in computing that go up bit by bit, where you get a bit more RAM or a few more CPUs and software expands to fill the need, ethernet stepped up by an order of magnitude, but much less frequently.
  • 9:05 Gigabit ethernet was something that had become normal, and the performance requirements from the CPU were not that great.
  • 9:15 When we went to 10 Gigabit ethernet, it changed quite dramatically.
  • 9:20 Most software couldn’t drive it at anything approaching line rate at all - people couldn’t make use of it.
  • 9:30 Networking then moved quite rapidly on to 40Gb - and we would have 100Gb now, but people are having difficulty driving 40Gb.
  • 9:40 The same thing happened with storage, when we went from spinning discs to ssd; you could take advantage of the reduced latency but the software architecture wasn’t built for that.
  • 9:55 You had kernels built around optimising for slow hardware, and a set of kernel interfaces that didn’t have the performance that the new hardware has.

How can you trust a distributed operating system, especially with the CAP theorem?

  • 10:30 It’s difficult, but we’re building distributed systems not because we want to, but because we have to.
  • 10:40 We need reliability, and we’re trying to solve problems that are too big to fit on one computer.
  • 10:50 Those two went hand-in-hand with the transition from scale-up to scale-out computing.

At the OS level, there are things like memory that you can’t distribute?

  • 11:20 There is a dream of a transparent distributed systems layer that sits like an operating system above everything.
  • 11:40 The dream doesn’t go away, but it isn’t a reality yet.
  • 11:40 You still have to think about whether the operation is local or remote, what the latency is and how does it cope with failure.
  • 11:55 Mesos has DC/OS, and more people are thinking about that, but they’re still mostly quite early.
  • 12:10 You have to build in the reliability and the retry and fault tolerance into that so it becomes invisible.
  • 12:15 Obviously you still have to consider the CAP theorem issues, so you have to decide what the semantics are.
  • 12:25 You have to accept that either you will have eventual consistency, or you have to accept that things are going to stop happening.
  • 12:30 We are getting better at understanding what those constraints are; we have systems like Kubernetes where we have a model that works for many applications.
  • 12:35 Arguably it’s like an operating system where you have many applications running in a distributed way.
  • 12:50 We are getting to the point where we will have things that people will call distributed operating systems again, which is good, but we are wary of the pitfalls.

Where are we today with decomposing the monolith?

  • 13:35 You can decompose everything, and have the TCP/IP stack embedded in the application like a unikernel, but some people don’t go that far.
  • 13:45 The Express Data Path (XDP) work in Linux is the cutting edge of what the eBPF world has been working on.
  • 14:00 XDP is a packet processing layer for Linux which runs eBPF programs - fast, compiled, in-kernel programs that can do things like packet filtering, encapsulation, and so on.
  • 14:15 XDP lets you filter those packets into the Linux TCP/IP stack, or into user space.
  • 14:25 It’s a hybrid which means you don’t have to do everything but you can evolve it in parts.
  • 14:30 It’s approaching it incrementally, unlike unikernels which are a big bang change on everything.
  • 14:50 The hybrid path is doing partial in-kernel, partial in-userspace for performance reasons.

What is eBPF?

  • 15:10 eBPF is a programming language in the Linux kernel.
  • 15:15 It came out of the Berkeley Packet Filter, which has been around for a long time on BSDs.
  • 15:20 It was a very simple packet filtering language which could be processed in-kernel.
  • 15:25 eBPF added an extended set of instructions and API, so that you could write more complex programs and consume more in-kernel information, both about network packets but also non-networking data.
  • 15:40 It added a JIT compiler and a path for communicating with user space, so you could summarise statistics information about the state of network, disk or utilisation, and then send the results into user space.
  • 16:00 A lot of it was originally about performance monitoring, but it then grew a fast networking stack that lets you run algorithms on packets, and actually modify and create packets between applications or user space.
  • 16:30 If you think about containers, for example, each one has its own networking stack, but what you really want to do is switch between different containers on the same host and send external traffic off-host.
  • 16:45 If you want to encapsulate packets in a tunnelling protocol, or encrypt them, then eBPF can be really good at doing this kind of stuff.

How approachable is eBPF?

  • 17:30 It has a weird set of APIs, and you have to understand kernel data structures - which is a little bit outside of most people’s experiences.
  • 17:45 There is an LLVM based toolchain for it, so you can compile C, and there’s a Lua compiler based on LuaJIT.
  • 18:00 You can share code between the user space part and the kernel part.
  • 18:05 It’s not your everyday programming language - it’s partly restricted because the program needs to be verified safe (such as whether it will terminate).
  • 18:30 Until recently, it didn’t even have function calls.
  • 18:35 It is still a particular programming language in a particular programming environment.
  • 18:45 It’s definitely not something that a lot of people have played with yet, but it’s a growing area.

What is Cilium?

  • 19:30 Cilium is conceptually very similar to Envoy, but it runs in kernel space rather than user space.
  • 19:35 [Envoy is a side-car model, where all services talk to localhost and then the sidecar knows how to discover, track, and talk to other services]
  • 20:00 It’s interesting because Cilium can do things that may be unexpected.
  • 20:05 If two services want to talk to each other, and they’re talking over the local network, it can actually just connect the TCP flow from one to another over a socket, rather than via the TCP/IP stack.
  • 20:25 You can actually get more performance than without it, which is strange.

How does the control layer work with Cilium?

  • 21:20 It’s difficult to do all the complicated layer 7 and application layer processing directly in the kernel.
  • 21:25 The kernel deals with packets rather than streams, and it’s not the easiest to write.
  • 21:40 There are two things that service meshes give you: a layer 4 data layer, and often layer 7 features (such as HTTP/1 to HTTP/2 conversion, or proxying gRPC to JSON).
  • 22:00 That’s not really the sort of work that you’d want or be able to do in the kernel.
  • 22:05 Those parts aren’t the high performance piece, because those applications aren’t performance critical.
  • 22:15 You can take your high performance gRPC applications and run everything through Cilium.
  • 22:25 Linux actually added support for TLS in kernel, which was an odd case.
  • 22:35 It’s partly in kernel - the TLS handshake is done in user space, and then you hand over the connection to the kernel which then handles the encryption piece.
  • 22:50 Cilium can see unencrypted traffic in the kernel and make decisions about what to do there.
  • 22:55 There are interesting things around transparently adding encryption.
  • 23:05 It’s a lot of the data plane stuff that you expect from a modern service mesh, not the control plane or in-kernel modification of the layer 7 stuff.

What about observability and logging?

  • 23:35 From the observability point of view, I don’t think it’s that different.
  • 23:40 One of the interesting things about eBPF is that a lot of it came from Brendan Gregg’s team at Netflix taking the tools developed for dtrace and porting them over to run on eBPF.
  • 23:50 They have done an amazing job of making the Linux kernel more observable than it ever was, and they have a real interest in a set of tools for observability that are based on eBPF.
  • 24:10 A lot of the people who are doing monitoring and observability on Linux are switching to eBPF because it gives much more visibility than observing from user space.
  • 24:25 A lot of the tools are moving towards eBPF to get information and understanding from what’s happening in the kernel.
  • 24:35 The kernel sees a lot of the interesting things that an application is trying to do, like writing network packets, forking processes, running binaries - everything except computation is a call into the kernel.
  • 24:50 That set of observability tooling which is built for high performance observability is what you’re going to use anyway.

What’s next?

  • 25:10 As well as eBPF doing everything in the kernel, there was also the choice of doing everything in user space.
  • 25:25 You don’t want to context switch, you don’t want to copy data, you don’t want to go through two sets of APIs and so on.
  • 25:35 GitHub recently released their layer 4 load balancer, which is a really interesting thing, based on DPDK.

What’s DPDK?

  • 25:45 DPDK is a Linux Foundation project, started by Intel but now cross-platform, which provides user space drivers for network cards.
  • 25:50 That’s been one of the primary user space libraries for networking cards.
  • 26:05 It has a user space driver that talks to the network cards directly, focussed on 10Gb+ networking and now supports pretty much everything.
  • 26:15 That was one of the main routes to doing everything in user space.
  • 26:20 With Linux 4.18 and the recent XDP patches, there is a hybrid path which lets you do a fast transfer from the kernel into user space, so you can combine some eBPF code with some processing in user space.
  • 26:40 Linux has always had a few ways of bypassing packets from the kernel stack, but this is one that promises to be fast.
  • 26:50 It is the AF_XDP socket: the newer version of the old packet socket format, with zero-copy support.
  • 27:00 It works with almost all network cards.
  • 27:05 I think we’ll see people using that as it’s potentially easier to use.
  • 27:10 You can also use a hybrid approach where you combine the above.
  • 27:15 It’s an interesting development that has been promised for a while and has been merged in Linux 4.18.

What are you hoping to put in QCon SF’s track in November?

  • 27:45 We are definitely going to have a deep dive into eBPF - a speaker announcement is expected shortly.
  • 28:00 We have got Allen (who works on memcache) going to talk about NVRAM and what’s going on in the storage space.
  • 28:15 We’re moving towards having flash memory on the memory bus in large quantities, probably starting to ship next year.
  • 28:25 Having large amounts of fast random access flash memory is going to change how we do things architecturally.
  • 28:45 We have these performance paths that bypass the Linux kernel stack, but the Linux kernel ABI is the one stable thing that is really important.
  • 29:10 A huge amount of code has been written to run on the Linux kernel stack, and there are lots of reasons where you might want to run this in an environment which isn’t a traditional Linux.
  • 29:20 One of the things that we have seen is in the Microsoft effort of having Linux run inside Windows.
  • 29:25 They have been doing an incredible job - emulated Linux has been around on BSDs since the 1990s - but Microsoft have put a huge amount of engineering effort into this.
  • 29:40 They are getting to the point where most of the applications run, they’re filling in the difficult bits like being able to run containers and cgroups.
  • 29:45 It’s starting to become a way of running Linux programs without running Linux at all.
  • 29:55 A few months ago, Google launched gVisor, which is another emulation thing that is used by Google App Engine as a sandbox.
  • 30:10 They don’t want to run untrusted Linux code on their boxes, since it is a shared application host.
  • 30:15 They run the code using gVisor, which has its own TCP/IP stack written in Go.
  • 30:25 The stack comes from their Fuchsia mobile operating system.
  • 30:30 So they’re running this emulation for security reasons this time.
  • 30:35 You get into the situation that you have this whole layer of emulation around emulating Linux binaries on something that is not Linux.
  • 30:45 Potentially that way of running could be used to run existing applications.
  • 31:00 The Windows thing is interesting because it started as a way of developers being able to compile and run tools in a familiar environment.
  • 31:05 That team got moved into the Windows server team, and it’s now shipping on Windows server, potentially allowing Linux applications to be run on the server without using Linux.

More about our podcasts

You can keep up-to-date with the podcasts via our RSS Feed, and they are available via SoundCloud, Apple Podcasts, Spotify, Overcast and the Google Podcast. From this page you also have access to our recorded show notes. They all have clickable links that will take you directly to that part of the audio.
