
eBPF - Rethinking the Linux Kernel



Thomas Graf talks about how companies like Facebook and Google use BPF to patch 0-day exploits, how BPF will change the way features are added to the kernel forever, and how BPF is introducing a new type of application deployment method for the Linux kernel.


Thomas Graf is co-founder & CTO at Isovalent and creator of the Cilium project. Before that, he was a Linux kernel developer at Red Hat for many years. Over more than 15 years of working on the Linux kernel, he was involved in a variety of networking and security subsystems. For the past couple of years, he has been involved in the development of BPF and XDP.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Graf: My name is Thomas Graf. Probably the best description would be a longtime kernel developer, but more recently, I also co-founded a company called Isovalent, and I created Cilium together with a team. This is not going to be a pure talk about Cilium; I'm going to talk more broadly about rethinking the Linux kernel and why that is happening.

Before we get to that, who remembers this age? Nice, so lots of people remember this. This is how the web used to look pre-year 2000; most websites looked similar to this one.

What Enabled This Evolution?

How did we get from websites looking like the one on the left to the age we're used to today, which is what you see on the right, where we spend the majority of our time in a web browser? We went from simple web pages showing nice GIFs to massive applications running in web browsers. What enabled this evolution in roughly 20 years? We went from pretty much markup only – and I'm using the HTML 2.0 Request for Comments here – to an age where we use primarily programmable platforms.

I'm a kernel developer, so I'm not actually familiar with all of these JavaScript frameworks, but there are a ton out there, and they're basically what allowed this to happen. Obviously, there's a lot more to it, but fundamentally, it is the programmability that enabled us to go from pretty much static websites to applications running in web browsers. Why does that matter, and why does that matter for the Linux kernel? We'll get to that.

Programmability Essentials

Before we connect this to the kernel, let's look at a couple of programmability essentials: what does it mean to create a programmable system, the way JavaScript makes a web browser programmable? First of all, we need some notion of safety. If you allow untrusted code to run in a web browser, that code needs to be isolated, it needs to be sandboxed in some way, it needs to be secure.

We need continuous delivery. There's no point in extending our application, in innovating, if we then require the user to install a new web browser; nobody would ever use any web-based application if you had to install a new web browser first. There was an age when you had to upgrade to new versions of browsers, and it was very confusing to users. We're no longer used to this; we're used to pretty much automatically getting updates for both websites and browsers on the fly. You probably don't even notice when your Chrome upgrades at this point, and you definitely don't notice that a website has changed its back end unless there's some visual change. This is all happening continuously and seamlessly. If you deploy a new version of your app, you will potentially have millions of users on that website at that time, and you want to upgrade seamlessly. Any programmable system needs some notion of continuous delivery and seamless upgrades.

The last aspect is performance. If we gain programmability but sacrifice performance, the programmability is probably not worth it. A good example of this is the early years of Java, where there was a huge performance penalty. A lot of that went away later on, but initially, the difference between running a C++ application and running a Java application was huge.

Similarly for JavaScript: before JIT compilers, running JavaScript added a considerable amount of CPU usage on users' laptops or end machines. Any sort of programmability also needs a notion of native execution, and in a lot of instances, this is done with Just-In-Time compilers, or JIT compilers, where some notion of generic bytecode is translated into what the CPU your machine is running actually understands, so we get as close as possible to native execution speed.

Kernel Architecture

The second aspect, before we can connect the two pieces together, is a super quick – not a deep dive, but a super quick – introduction to the kernel: what does the Linux kernel look like? Roughly, there are three pieces: we have user space – kernel people like to put that on top. Below that, there's the Linux kernel, the operating system, and at the very bottom layer is the hardware. We have some notion of a process, an application, some tasks running in user space, and on the bottom layer, we have hardware. I've massively simplified this to storage and network; obviously, there are many more pieces of hardware, but I've simplified it as much as possible.

The first thing the kernel will do is abstract this away using so-called drivers. The kernel obviously needs to understand the hardware and needs to enable it, but it doesn't want to expose this complexity straight to the applications. It introduces the first level of abstraction, so the Linux kernel will understand, "I'm aware of block devices," "I'm aware of network devices," "I'm aware of an input/output device," "I'm aware of a console," and so on; that's the first level of abstraction. Similarly, we have so-called system calls, which are what an application invokes to communicate with the Linux kernel. There are many of them; I've picked a couple of examples here.

If you want to do file I/O, there are read and write, which allow you to read from and write to files, and there are also send message and receive message. These are what an application would use for network interaction, to actually send data on a BSD socket or a TCP socket. This is what the kernel exposes to user space, to your applications, and this is where the kernel provides guarantees in terms of backwards compatibility. This API does not change; we allow applications to continue running even as the kernel itself evolves.
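To make that system call layer concrete, here is a minimal sketch (not from the talk) that drives the same open/write/read/close system calls from user space through Python's thin os-level wrappers; the temporary file path is chosen by tempfile and is purely illustrative.

```python
import os
import tempfile

# open(2)/write(2)/close(2): user space asks the kernel to persist bytes.
fd, path = tempfile.mkstemp()
os.write(fd, b"hello from user space")
os.close(fd)

# open(2)/read(2)/close(2): user space asks the kernel for the bytes back.
fd = os.open(path, os.O_RDONLY)
data = os.read(fd, 4096)
os.close(fd)
os.remove(path)

print(data.decode())  # hello from user space
```

Everything the process does here crosses the user space/kernel boundary through exactly these stable system call entry points.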

Then we have the middle layers, the middleware in the middle. This is where the logic is, the business logic: the virtual file system, the process scheduler, networking, TCP/IP, firewalling technology, and so on; all of this is in the middle.

The last piece is somebody actually operating the system. In the past, that used to be an actual human being; these days, all of this is pretty much automated, and it's done through configuration APIs. Over the years, the kernel has accumulated many different APIs: we have sysfs, procfs, netlink. There are many more APIs through which either a human, a controller, some script, or some component can interact with the Linux kernel and configure these systems – for example, mount a file system, change a firewall rule, load a new driver, load a Linux kernel module. All of these things are done through these APIs. That was a super quick 101 on Linux architecture.
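As a small illustration of these configuration and introspection APIs, the following sketch (not from the talk) reads a couple of procfs entries. On a non-Linux machine those files don't exist, so the helper falls back to a placeholder string; the fallback text is an invention of this example.

```python
def read_proc(path, default="(procfs not available)"):
    """Read a procfs entry if the kernel exposes it (Linux only)."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return default

# Kernel name and release, exposed through the procfs API discussed above.
print(read_proc("/proc/sys/kernel/ostype"))
print(read_proc("/proc/sys/kernel/osrelease"))
```

The same entries can be written (with sufficient privileges) to reconfigure the running kernel, which is exactly the operator role described above.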

Kernel Development 101

Now we'll go into the kernel development process. What do you do if you want to change any aspect of that? What options do I have if the Linux kernel as it stands right now does not provide what I need? You pretty much have two options: you can either do a native implementation or you can write a kernel module, and we'll look at both.

Native support means changing kernel source code. It means going upstream to the Linux kernel mailing list and convincing the world that this change is really needed and that whatever complexity you're adding, the rest of the world should be paying for. It means exposing some sort of configuration API to actually enable it, and then you have to wait five years until all your users actually upgrade to the latest Linux kernel version.

This is nice once it is in and once these five years have passed: the whole world will basically have this capability. The problem is, we don't really have time for that; nobody wants to wait these five years. The second option is a Linux kernel module, which is a loadable plugin that you can load into the Linux kernel at runtime to extend the kernel's functionality. It's basically like a shared library that you can load. The problem with that is the kernel does not have any stable APIs inside; only the user-facing pieces are actually stable, which means if you write a Linux kernel module, it will break with probably every single kernel release.

You will basically be adjusting the source code of your Linux kernel module with every Linux release as it goes along, and you probably have to ship a different kernel module for all the different Linux kernel versions. The most horrible part is that if you have any bug in that Linux kernel module, it will just bluntly crash your kernel, so you want to be very careful before you ship a Linux kernel module to any of your users, because you might literally just kill the machine.

These are not really great options. We've talked about JavaScript and how it helped web browser applications evolve to become huge and basically take over the world. We looked into the Linux kernel and the options we have to extend it, and they're not really good options. How can we combine the two?

How about we add JavaScript-like functionality to the Linux kernel? That sounds appealing; it should solve all the problems we have. This is basically eBPF, almost literally eBPF, and we'll take a deep dive into what it actually does.


First of all, going back to our description of what a user space application or process does when it interacts with the operating system: it performs system calls. For example, the execve system call is used to launch a new process. If you type something into your bash shell and run a new command, the bash shell will run an execve system call to launch a new process – to launch, for example, top or ps or whatever binary you're running. Then the scheduler will basically fork the process and so on.

What eBPF allows us to do is pretty amazing: it allows us to take that system call and run a program on behalf of the system call, for example on its return, so we can do something like this. We can define an eBPF program that will execute on the return of the execve system call and run this code. This example code will basically extract some metadata from the system call, such as the binary name, the comm name, and send that through a BPF map – we'll get to those – and expose it to user space, for example, for audit or tracing purposes. This is how BPF-based tracing works: we can trace all the system calls being made and provide context along with them. This is one example of how BPF allows us to run almost arbitrary code at various hook points. This example uses a system call; we'll look into all the other hooks we have as well.
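Since we can't run kernel code on this page, here is a user-space Python analogy (entirely illustrative, not real eBPF) of attaching a program that runs on the return of a call: the wrapper plays the role of the BPF program, and the `events` list stands in for the BPF map that ferries metadata to user space. The `/usr/bin/top` path and the `execve` stub are made up for the example.

```python
events = []  # stands in for a BPF map shared with user space

def on_return(hook):
    """Attach `hook` so it runs whenever the wrapped call returns."""
    def attach(fn):
        def wrapper(*args, **kwargs):
            result = fn(*args, **kwargs)
            hook(fn.__name__, args, result)  # runs on return, like the BPF hook
            return result
        return wrapper
    return attach

def audit_hook(name, args, result):
    # Extract metadata and "send" it through the map to user space.
    events.append({"call": name, "binary": args[0], "ret": result})

@on_return(audit_hook)
def execve(binary):
    return 0  # placeholder for the real system call

execve("/usr/bin/top")
print(events)  # [{'call': 'execve', 'binary': '/usr/bin/top', 'ret': 0}]
```

The real mechanism attaches in the kernel at the system call boundary, but the control flow – original call, then attached program, then metadata into a map – is the same shape.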

How does that work exactly? How do we actually execute code in the Linux kernel? How do we define that program? How do we load it in? There's a so-called eBPF runtime, and the runtime will ensure that we guarantee and fulfill all of the programmability essentials that we covered earlier. There's some program, BPF bytecode, which is basically the compiled version of the code we saw on the last slide, and you as a user want to load this and actually run it as part of a system call. You will call the bpf system call to load this program and say, "I want to run this program whenever this system call is invoked."

The kernel will then take this program and pass it through the BPF verifier. The verifier will ensure that the program is actually safe to run; this is the first major difference from a Linux kernel module. If the BPF program is buggy or has some flaw, it cannot crash the kernel: the verifier ensures that the program is safe to run, and if it is not, it will reject it and you cannot load the program. It will also ensure, for example, that you have the privileges required to load the BPF program, that you cannot access arbitrary kernel memory, that you cannot expose arbitrary kernel memory to user space, and so on. It guarantees that the BPF program is safe to run, and only if that passes can the program be attached.
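The real verifier is far more sophisticated, but the core idea – statically rejecting programs that could do something unsafe, before they ever run – can be sketched with a toy checker over a made-up mini instruction set (all instruction names and bounds here are invented for illustration):

```python
PKT_SIZE = 64  # pretend every program gets a 64-byte packet buffer

def verify(program):
    """Accept a program only if every load stays inside the packet buffer
    and only known instructions are used."""
    for op, *args in program:
        if op == "load" and not (0 <= args[0] < PKT_SIZE):
            return False  # out-of-bounds memory access: reject
        if op not in ("load", "add", "ret"):
            return False  # unknown instruction: reject
    return True

safe = [("load", 0), ("add", 1), ("ret",)]
unsafe = [("load", 4096), ("ret",)]  # reads far past the buffer

print(verify(safe))    # True
print(verify(unsafe))  # False
```

The key property is that the unsafe program is rejected at load time; it never gets the chance to crash anything, which is exactly the contrast with kernel modules drawn above.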

This is very similar to JavaScript, where exactly the same thing is done: there's also a software-based runtime or sandbox, which ensures that a JavaScript program running in one tab of your Chrome browser cannot access the memory of all your other Chrome tabs, for example. The next piece: once the BPF program has passed verification and is approved, it goes to the JIT compiler, the Just-In-Time compiler. The generic bytecode is entirely portable, so you can load this bytecode on x86 or ARM, whatever CPU you're running.

The JIT compiler will take this bytecode and compile it for the native CPU your system runs, for example, x86, which means that instead of interpreting bytecode in software, you will now be executing a program that runs at the same speed as if it had been compiled natively and loaded in. Now we almost have a kernel module, but done in a safe way: we cannot crash the kernel. After it has passed through the JIT compiler, we can attach it to the system call – or to one of the various hook points that we'll get to – and this is where the last piece comes in, too: continuous delivery. We can replace these BPF programs in live systems, in running Linux kernels, without impacting applications.

For example, if we take a network-related program that processes network packets, we can replace that program atomically. Let's say we're processing hundreds of thousands of packets per second: between one packet and the next, the program is atomically replaced – the last packet sees the old version, the next packet sees the new version, and there is no breakage in any way. This allows us to continuously run systems while upgrading our logic without breaking, for example, TCP connections or the applications actually running. We are fulfilling the three programmability essentials that we covered earlier.
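The atomic-replacement guarantee can be sketched in user space (purely an analogy, not kernel code): each "packet" goes through exactly one version of the handler, and swapping the handler between packets never leaves a packet half-processed.

```python
# The attached program is reached through a single reference; replacing it
# is one atomic swap, observed between packets, never in the middle of one.
current = {"prog": lambda pkt: f"v1:{pkt}"}

def process(pkt):
    return current["prog"](pkt)  # one lookup per packet

out = [process(1), process(2)]
current["prog"] = lambda pkt: f"v2:{pkt}"  # atomic swap between packets
out.append(process(3))

print(out)  # ['v1:1', 'v1:2', 'v2:3']
```

Packets 1 and 2 see the old logic, packet 3 sees the new logic, and no packet ever sees a mixture of the two – the property that keeps TCP connections alive across upgrades.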

Hooks – I've been talking about system calls, but I also mentioned networking. There are many different hooks you can attach an eBPF program to. If we start from the top, we can attach to arbitrary system calls – whether it's a file open, a send message, the creation of a socket, or the execution of a program, we can attach to all the system calls. We can also attach to user space applications: using so-called uprobes, or user space probes, we can run an eBPF program for particular functions in your applications. This is how you can profile applications using BPF.

We can attach to arbitrary trace points in the Linux kernel. A trace point is a well-known, defined function name in the Linux kernel that will stay stable over time. While kernel functions change with every release, trace points stay stable, which allows us to instrument the entire Linux kernel. We can use fentry/fexit, so for every kernel function call, we can attach a BPF program at entry and at exit time. We can attach to various sockets and actual network events; for example, at the TCP level, we can implement TCP congestion algorithms using BPF.

We can attach at the network device level, so for any virtual or physical device, we can attach a BPF program, and it gets invoked for every network packet that is received or sent. We can even offload this to hardware: we can attach a BPF program and work together with the NIC to offload the program, so as your physical network card receives a packet, if the NIC is programmable, it can run the BPF program. We gain programmability, and we can attach this logic at a very wide range of hook points. When we get to the use cases, you will see how wide the range of problems is that can be solved with eBPF.

Nice, we can run programs. What can these programs do? One essential piece of any program, where the complexity usually comes in, is state. Where can we store state, statistics, metrics, and so on? This is where eBPF maps come in. An eBPF program itself is only instructions; it does not contain actual data, and there is no memory in the sense that we can dynamically allocate memory on the fly. Any state is stored in BPF maps. It's very important that these are separate from the program, which means we can keep a map alive – for example, a map implemented as a hash table or as a stack – while replacing the program.

This again enables seamless upgrades: let's say we have a program that's collecting stats or metrics; we can replace the program logic without losing that state. Maps can be accessed from eBPF programs themselves, so they can be used, for example, to share information or state between different programs, but they can also be accessed from user space. We can have a user space application, some CLI, or some tool that allows us to retrieve or configure aspects via BPF maps.

Different map types exist; on the lower left, you see some of them. We have hash tables, and we have Least Recently Used hash tables – the LRU variant allows us to dynamically size the map, meaning the least recently used entry will be expired if the map becomes full or if you run out of resources. We can use ring buffers to send events to user space. We have the stack trace type, and we have Longest Prefix Match, which is a specific networking type used to implement routing tables. This is just a subset of all the map types; more are being added as the use cases for eBPF grow.
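To illustrate the LRU hash semantics described above, here is a user-space Python sketch (an analogy, not the kernel implementation): when the map is full, the least recently used entry is evicted to make room.

```python
from collections import OrderedDict

class LRUHash:
    """Toy model of an LRU hash map: full map evicts the least
    recently used entry instead of failing the update."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def update(self, key, value):
        if key in self.entries:
            self.entries.move_to_end(key)  # refresh recency
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

    def lookup(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)  # lookups also count as use
            return self.entries[key]
        return None

m = LRUHash(capacity=2)
m.update("a", 1)
m.update("b", 2)
m.update("c", 3)  # map was full, so "a" (least recently used) is evicted
print(m.lookup("a"), m.lookup("b"), m.lookup("c"))  # None 2 3
```

This is the property that makes LRU maps attractive for things like connection tracking: the map self-prunes under pressure instead of rejecting new entries.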

Helpers – a Linux kernel module can pretty much call any kernel function, which has upsides and downsides. The upside is that a Linux kernel module can reuse any functionality that is part of the Linux kernel. The downside is that if you misuse it, you will crash your kernel. The second aspect is that the internal list of functions changes all the time; it's not stable. If a function is removed and your kernel module relies on it, the module will no longer compile, and if you try to load it, it will fail, saying, "I cannot resolve a certain symbol name."

With eBPF programs, this is done differently. There are so-called eBPF helpers, which are used to interact with the operating system, and they are stable: the set of helper functions that exists is stable over time and maintained in a backwards-compatible way. For example, an eBPF program does not know how to generate a random number, but there's an eBPF helper that allows it to ask the kernel, "Give me a random number," or, "Give me the current time," or, "Redirect this network packet to this device," or, "Read a certain value from this eBPF map," and so on. Any interaction with the operating system is done via eBPF helpers, and because these are stable APIs, BPF programs are portable across kernel versions.
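The helper idea can be sketched in user space (an analogy; the helper names below are invented for this sketch, loosely modeled on real helpers such as bpf_get_prandom_u32 and bpf_ktime_get_ns): the "program" may only reach the "kernel" through a fixed helper table, never by calling internals directly.

```python
import random
import time

# The stable helper table: the only surface the program may touch.
HELPERS = {
    "get_prandom_u32": lambda: random.getrandbits(32),
    "ktime_get_ns": lambda: time.monotonic_ns(),
}

def program(helpers):
    """A 'BPF program' that asks the kernel for a random number via a
    stable helper; it has no way to reach arbitrary internal functions."""
    return helpers["get_prandom_u32"]() % 2

print(program(HELPERS))  # 0 or 1
```

Because the table's names and semantics never change, the program keeps working even if everything behind the table is rewritten, which is the portability guarantee described above.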

Tail calls and function calls – instead of composing one big program, we can divide our programs into multiple small pieces and do tail calls and function calls. Function calls are pretty much exactly what you would expect from any other programming language; tail calls are a little bit different. A tail call is more like an exec: you can replace the context of your program and basically chain programs together, but once the new program is finished, it will not return to the old program, so it's more of a chaining. It allows programs that may not even be aware of each other to be chained together, so at a certain hook, you can run multiple logical pieces one after the other.

Tail calls and function calls are used, obviously, to make BPF programs composable. They are also used to reduce the size of programs. We could tell the compiler to inline everything and generate one large program, even when the source code uses function calls, but that would just increase the size, so function calls are primarily used to keep our BPF programs small.
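The difference between a function call and a tail call can be sketched in plain Python (an analogy; real BPF tail calls go through a program-array map): a function call returns to its caller, while a tail call hands control over and never comes back.

```python
trace = []  # records the order in which pieces run

def helper():
    trace.append("helper")
    return "from helper"

def prog_b():
    trace.append("prog_b")
    return "done in B"

def prog_a():
    trace.append("prog_a")
    helper()                 # function call: control returns here
    trace.append("back in A")
    return prog_b()          # tail call: A is finished, B takes over

result = prog_a()
print(trace)   # ['prog_a', 'helper', 'back in A', 'prog_b']
print(result)  # done in B
```

Note that nothing runs in A after the tail call to B; B's result is the final result, which is why tail calls behave like chaining rather than nesting.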

eBPF Community

All of this is great. Who is actually responsible for it? Who can I bug if something is not working? Who is behind this? It's a huge community – a pretty invisible one, actually, given how many people have been working on it. The main reason is that most people involved don't really have much of an interest in this becoming widely known; there is no interest in the marketing, let's put it this way. There's definitely an interest in sharing the overall cost of maintaining it, but primarily, eBPF has been used by very large cloud providers and other large-scale infrastructure companies with very specific needs around the Linux operating system. They did not see a way to fulfill those needs through kernel source code changes, so they switched over to using eBPF instead.

You can see large companies like Google and Facebook maintaining this and driving it forward, and the third big player you can see there is Cilium – that's us, and we'll get to Cilium – where, obviously, we have been very much involved in maintaining eBPF as well. Structurally, there are two maintainers: Daniel Borkmann and Alexei Starovoitov from Facebook; Daniel is working for us, co-maintaining eBPF.

eBPF Projects

eBPF projects – I've touched on a couple of them already. Obviously, this is just a very small subset, but it gives you a glimpse into the breadth of active users of eBPF. Starting from the upper left, Facebook makes massive use of BPF. Maybe the most well-known usage is Katran, a high-performance load balancer they built to replace IPVS, another software-based load balancing solution that Facebook had been using before. They switched over to eBPF and saw a massive performance increase because of it, and they have open sourced it, so you can go to GitHub and check out the code. It's definitely built specifically for the Facebook infrastructure, but if you're running a Linux-based software infrastructure, you could use it as well.

Then, obviously, we have Cilium – we'll get to that in a bit more detail – providing networking, security, and load balancing for Kubernetes. We have bcc and bpftrace, with many people working on them; I'm not sure it's fair to say, but the main person behind them is probably Brendan Gregg, who works at Netflix. This is applying BPF to profiling and tracing: gaining an understanding of what your application is doing, what the system is doing, figuring out why an application is not behaving well – for example, how many block I/O calls it is making, and so on.

We have the Android team, and Google in general, investing heavily in BPF. For example, there's a BPF loader in Android, there's a network traffic monitor in Android, there is KRSI, the kernel runtime security instrumentation done by Google, and so on. Then Cloudflare has many blog posts on how they use eBPF for traffic optimization and QoS; if you're interested in that, I suggest you check out those blog posts. They've also open sourced various tooling around BPF. Then there's Falco from Sysdig, a CNCF project, which applies BPF to container runtime security: introspecting system calls and detecting whether certain system calls, or certain metadata around system calls, could indicate that some security-related threat is going on. That's a glimpse into some of the use cases; we'll dive into a few of them.

First of all, there is bcc, maybe the most well-known application of BPF at this point. How does bcc work? bcc stands for BPF Compiler Collection, I think, and it allows application developers to write a Python program which contains both the actual BPF program that will run and generate events and metrics, and the logic in Python to read the state and metrics from the BPF maps and display them in some way. The lower right gives you one example: this is the tcptop bcc program, which will attach a BPF program to all send message and receive message system calls and then maintain a list of all TCP connections – and I think all the UDP connections too, or in this case, only TCP – and record how many bytes are being transmitted, for example.

It's using exactly the same kernel subsystems: it's going through the JIT compiler, it's going through the verifier. What's specific about bcc is that the user space component is controlling this – the ability to write a Python-based program that will then automatically load the BPF program and so on. If you go to iovisor/bcc – obviously, it's open source – there's a huge collection of different programs, many precooked small tools that you can use to monitor your systems, from networking to storage; there's stuff like, "Show me all the files that are being opened." It's hard to mention them all, there are so many different examples. For almost any hook that can somehow be used for tracing and profiling, you'll find an example program in that repo. You don't have to write the BPF programs yourself; you can pretty much reuse a lot of the precooked examples there.
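As a user-space sketch of the bookkeeping a tcptop-style tool does (the events below are fabricated; in reality, the in-kernel BPF program bumps per-connection byte counters in a hash map on every send/receive, and the Python side only reads the map):

```python
from collections import defaultdict

tx_bytes = defaultdict(int)  # stands in for the BPF hash map

# Fabricated (pid, comm, remote address, bytes sent) events for illustration.
fake_events = [
    (4242, "curl", "93.184.216.34:443", 517),
    (4242, "curl", "93.184.216.34:443", 1024),
    (1337, "sshd", "10.0.0.5:22", 128),
]

# The "BPF program" side: aggregate bytes per connection key.
for pid, comm, raddr, nbytes in fake_events:
    tx_bytes[(pid, comm, raddr)] += nbytes

# The "user space" side: read the map and display the totals.
for (pid, comm, raddr), total in sorted(tx_bytes.items()):
    print(f"{pid:6} {comm:8} {raddr:20} {total} B")
```

The division of labor is the point: the hot per-event counting happens in the kernel, and user space only periodically walks the map to render output.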

That is in Python, and then more recently, a project called bpftrace was introduced. To sum it up, it's pretty much DTrace for Linux. The main difference is that instead of writing Python programs, you write a program in a new, higher-level bpftrace syntax. Same scope, same focus – it's still about profiling and tracing – but you don't need to write Python programs. Otherwise, it works exactly the same. In this case, it's an example of how to use bpftrace to record all file opens.

We run bpftrace and specify that we want to attach a program at the kernel probe called do_sys_open. That's an internal kernel function – not the system call itself, but the kernel function that the system call invokes. Then I provide the program logic, the printf, which basically prints the comm string and the first argument, which happens to be the filename being opened. This gives you the programmability: it's a single line, but a lot of logic goes into it, and it automates everything from creating the BPF maps to reading them, and so on. This is bpftrace – also open source, also in the IO Visor org.

Then we're getting to Cilium. Cilium is, broadly, networking, load-balancing, and security for Kubernetes. The main goal is to take all the power of eBPF and bring it to the Kubernetes world without requiring the user base to actually understand BPF. Obviously, it's a very low-level technology. Yes, it can be used to very effectively implement, for example, Kubernetes services, but users just want to load a component that will take Kubernetes services and implement them with eBPF; they don't want to have to care much more about it. Cilium automates all of that. For example, we implement network policies, Kubernetes services, and so on.

Cilium uses a wide range of hooks. We start at the network hardware level, where we can, for example, do DDoS mitigation and load-balancing at the network layer. Then we have the software network device level. We use sockets, where we introspect the data that an application is sending and receiving – that's how we can, for example, look into Layer 7 aspects without running a full Layer 7 proxy. We also work at the system call level to, for example, do load-balancing with the connect system call, and so on.

I will not go into all of the details, but we're using the full range of hooks available and abstracting this away nicely, so you can basically run Cilium as either a CNI plugin or just a DaemonSet and configure everything through the standard Kubernetes objects and resources you're used to, such as services, network policies, and so on.

This is the full overview of everything we're doing. Container networking – we provide very efficient, flexible networking with BPF. We can do, for example, overlays, we can do native routing, and we can integrate with cloud provider IPAMs. We can do multi-cluster routing, and obviously, we have IPv4 and IPv6 support. We can even do things that the Linux kernel itself cannot do, such as NAT46, translation between IPv4 and IPv6. We can do service load-balancing, we can completely replace kube-proxy so you can get rid of iptables entirely on your Kubernetes nodes, we can do direct server return, and so on.

Container security is a huge focus of Cilium. We use an identity-based approach: instead of IP- and port-based firewall technology, we allocate security identities to workloads and enforce based on those – also something we can do with BPF that would not be possible with the existing Linux kernel frameworks. We are API-aware, so we not only understand network packets, we also understand Layer 7 calls such as REST, memcached, Cassandra, Kafka, gRPC, and so on.

We are DNS-aware, so you can define security policies without whitelisting or specifying network firewall rules based on IP addresses: you can, for example, allow a DNS name pattern, and then all IPs returned by DNS for that pattern will be allowed by our firewalls, so you don't have to hard-code IP addresses. We support transparent encryption, we have SSL data visibility, and so on – a wide range of use cases that we can implement using BPF.

Then, very recently, we also open sourced Hubble, which is the visibility component of Cilium. Super exciting – it builds on top of the existing eBPF framework that we have in Cilium and adds a visibility component on top, which provides mainly three things. First, a very nice service map, which you can see in the upper right, where you can dynamically and transparently generate an overview of which services are talking to which other services.

You can go all the way into Layer 7, so you can actually see the HTTP calls being made, and so on. Then we have metrics: all of the visibility that eBPF gives us is translated into metrics that we serve. We also have flow logs, so if you want to record all the network connectivity that went on, you can expose that via flow logs.

This is Hubble, all BPF-powered. A lot of this would simply not be possible without BPF. If you tried to implement this based on an iptables packet counter, for example, it would be incredibly hard to do. This is a good example of the programmability aspect giving us a huge leap forward in terms of driving features.

Go Development Toolchain

What if you want to develop? Let's say you don't want to use Cilium; you want to write your own Go programs or your own BPF programs. How do you do that? How do you write a BPF program? How do you load it? What tools are available? We have open sourced the entire toolchain that we use for Cilium. It lives in the repo cilium/ebpf and, obviously, it still uses exactly the same Linux kernel components; it is a library that sits on top, which we call the eBPF Go library. It abstracts away both the system call that loads programs and creates maps, and the access to the maps. All of that is nicely hidden away and exposed as Go types for programs and maps, so you can load an eBPF program using a Go library and you can create and read/write maps using Go bindings as well.

How do you define the actual program? There are many ways of doing this. The one we use the most is to write it in what is, from a BPF perspective, a high-level language, namely C, and then use clang with the BPF backend target to generate bytecode. You can write the program similar to one of the examples we saw earlier, run clang, and what you get is generic BPF bytecode. Your controller program, using the Go bindings, will open that generated bytecode file and load it into the Linux kernel using the system call. The verifier will verify it, the JIT compiler will compile it for your native CPU, and then it is attached to the system call or the hook that you define.

From a user or developer perspective, you can use clang plus the Go library to inject arbitrary programs at various hook points. At runtime, all that actually runs on the system is your controller with the Go library. Obviously, you don't need to run clang on the system where you load and attach the program, so you can, for example, run clang on your laptop, generate the bytecode, ship the bytecode, and use the Go library to load it.

Outlook: Future of eBPF

An outlook into the future: eBPF sounds amazing, but is this all you can do? The title was "Rethinking the Linux Kernel," so what exactly is the rethinking part? There are two aspects here which are very interesting. All of what I've just talked about sounds normal and natural: "Yes, this makes sense, why haven't we done this earlier?" All of it makes sense, but what's happening is that we basically, by accident, went the microkernel route.

For those of you who are not super deep into operating systems: a long time back, when Linux was first created, there was a huge debate about how operating systems should be designed. Either you go microkernel, similar to how microservices are written today, or you create a huge monolith. Linux became the best example of the argument that the monolith was the better approach: a single, huge codebase, a single binary that you run.

Linux kernel modules came way later, long after this debate. The Linux kernel was not really extensible in the very beginning: a huge codebase compiled into a five-megabyte binary that you load when your machine comes up. With BPF, we have quietly, maybe without even noticing, gone back to that old discussion and are starting to implement a microkernel model where we can dynamically load programs, we can dynamically replace logic in a safe way, and we can make logic composable. We are moving away from the requirement that every single Linux kernel change needs full consensus across the entire industry or the entire development community; instead, you can define your own logic, define your own modules, and load them safely and with the necessary efficiency.

A very good example of this, I think, is Cilium itself. The Linux kernel itself has no clue what a container is, and this may be very surprising, but there is no such thing as a container ID in the Linux kernel. The Linux kernel literally does not know what a container is. The Linux kernel knows about namespaces; the Linux kernel knows how to do resource management. It knows, "I have a cgroup, the cgroup has certain CPU constraints, and these are the processes attached to that cgroup," and it so happens that the container runtime uses cgroups for resource management.

The Linux kernel has the concept of a network namespace: "I can have multiple virtual network scopes and each of these scopes can have its own IP address and so on." The kernel does not know that a container is using a particular network namespace. In fact, if you are using Kubernetes, a Kubernetes pod can have two containers and both of them will share the same network namespace, so there is not even a one-to-one coupling between a network namespace and a particular container. The kernel does not know anything at all about containers; it only provides the tooling. This has been a big problem, because the kernel stopped being able to deliver essential functionality, precisely because it was not aware of containers.

Cilium adds this ability to the kernel, making the kernel aware of containers and, for example, Kubernetes services, without requiring changes to the kernel source code, because the changes that Cilium needs may not be acceptable to, say, the 80% of non-Kubernetes use cases. The kernel community would not be willing to make everybody pay the cost of the complexity that Cilium adds for the Kubernetes-specific use case. This is definitely a huge shift, and there was no debate about it. It just naturally happened, and everybody is fine with this, or at least the majority is.

Then the second aspect; this is the really big outlook into the future. Who's running more than, let's say, 5,000 servers or VMs or something like that? How long does it take you to reboot all of them? You don't know? Maybe you have never had the need for it. Let's say you own 5,000 machines and there is some kernel security bug and you need to reboot all of them. How do you do that without introducing massive downtime? You need some sort of schedule. You can say, "I can reboot 2% of them at a time," and then, let's say, 10 machines per 5 minutes. It will take you quite a long time to actually reboot them all.

Imagine this at the scale of Facebook or Google and others: hundreds of thousands of servers; it may take weeks to reboot all of that. Some of the systems will be vulnerable to the kernel bug for weeks, and that's not really acceptable. If a zero-day bug comes up, you want to protect yourself as quickly as possible. If you're lucky, some firewall rule may be able to catch it, but if it needs a hot patch, you are out of luck and you have two choices: either downtime (or keeping a massive buffer of unused machines), or systems that are unprotected.

The solution to this might be eBPF. I've used the Heartbleed example here, but it could be any type of security bug of similar scope. In this scenario, some kernel function is vulnerable inside the Linux kernel, and a hotfix would typically patch that function and change its logic. Instead of changing the Linux kernel and rebooting, what if we could do the hot-patching on the fly? This is not completely new, not at all. Hot-patching has been available for a while, but it has never been solved well enough to spread widely: it has been insecure or difficult to use.

eBPF could be the solution, where instead of patching the kernel, we basically bring in an eBPF program which runs in place of the insecure function and implements the same logic but without the actual bug. Zero-day fixes are only one example; there are many other bugs that could benefit from a similar treatment. This has definitely sparked a lot of interest, and there is lots of discussion on how BPF can be used to solve this.

With that, I want to say a huge thank you to everybody on the left. I'm not going to read off the names, but I want to give a big shout-out to the huge community that has been involved in getting BPF as far as it is today, starting with the actual BPF maintainers working almost day and night, the Cilium team, the Facebook team, the Google team, and so on; it's a long list. It's an absolutely wonderful technical community. This has been evolving incredibly fast, and it's a joy to work with all of them.

If you want to learn more about BPF and XDP, which is the network-device BPF program type, there is a Getting Started Guide which gives you the full language spec and also introduces how to write BPF programs. There is, obviously, Cilium itself that you can look at, which is Kubernetes-specific, and for bcc and bpftrace from the earlier slides, the links are on the slides themselves. If you want to learn more about Cilium and follow us, you can follow us on Twitter, and if you want to reach out to me, you can follow me on Twitter as well.

Questions and Answers

Participant 1: Thank you for the amazing talk. I have a question about the verification process of this. What happens if there's one program that is in itself safe, the second one is safe, but in combination they are unsafe? Can that happen and what will happen then?

Graf: It should not be able to happen, and the reason is that it's not possible to do arbitrary function calls between programs. If we do a so-called tail call between programs, the amount of context that you can pass into the next program is very limited. You can't just take the memory from the old program and give it to the new program, for example. You can only pass on a very limited set of well-defined types, like numbers, strings, and so on. Because the types are well-defined, the verifier can validate that all possible combinations of input to the new program are safe to run. No matter what state the old program leaves behind before the tail call, the verifier guarantees that the new program is safe to run.

Participant 2: I would like to ask, actually, with the observability features that are built into Cilium, where do you see the line between a service mesh like Istio that's compatible or integratable in this tool stack and Cilium? This is question number one. Question number two, you said that it can run as a CNI plugin. For example, if somebody uses Kubernetes on EKS and it obviously comes with the AWS CNI plugin bundled in, does that work in any sense? Can you still do Cilium or does it exclude that?

Graf: Obviously, there is some overlap, in that both projects provide what's typically referred to as tracing. The biggest difference is that Cilium is 100% transparent. It does not even inject a so-called sidecar proxy, so the application will not even notice that anybody is observing anything. All the observation is done using kernel-level BPF programs. We can also transparently inject Envoy as a proxy on demand, but not in a sidecar model: instead of the proxy that provides the visibility running inside the application container, it runs outside, hidden away from the application. So the biggest difference is the transparency aspect. The second aspect is that, because of BPF, we can provide this visibility with much better efficiency, so you will have less overhead as you add observability. There are other aspects too; for example, the DNS awareness is something that Istio and Envoy do not provide yet, though they may in the future.

In general, it's not an either-or decision at all. We have many users that run Cilium and, for example, Istio or Linkerd or some other service mesh. It's perfectly fine. What we see very commonly is that this visibility is used by the platform team and the security team as the ground layer for everything, and then different application teams, sometimes even different teams running in the same cluster, pick different service meshes based on their needs. The service mesh obviously provides several capabilities that Cilium does not provide at all: Cilium does not do any Layer 7 load-balancing, circuit breaking, or retries, none of that. We are focusing on visibility. The last aspect is that we also provide a lot of operational metrics. For example, we make it possible to detect DNS resolution failures, which are a very common cause of outages, so you can notice and observe them earlier, which is something the service meshes focus on less.

Your second question was EKS. Cilium has two modes. You can run it in native AWS mode, where we basically provide exactly the same model as the standard EKS CNI, so it's fully compatible; it also uses what's called AWS ENI. That is probably what our largest users use to scale to massive numbers. The second deployment mode is chaining mode, where you can run Cilium on top of, for example, the AWS CNI plugin or on top of the GCP network CNI and so on. Both are viable options; typically, we recommend using the native mode for any large deployment.

Participant 3: One question, which actually ties into the previous question: are you OpenTracing API compliant, because they try to merge all the tracing efforts? Is it already compatible or is it planned?

Graf: It's planned; we don't output the OpenTracing format yet. The main reason is that these systems have been built with a slightly different design in mind. With most of these tracing libraries, you're running on behalf of the app and you're reporting on behalf of the app, whereas we're more of a transparent middle system, so we're basically reporting both sides at the same time. This makes it a bit challenging, but we are working on implementing the OpenTracing API.

Participant 4: You mentioned using clang to compile the eBPF programs. Does that mean that any language that has an LLVM frontend can be used or is it only C?

Graf: I think you can use other languages. As long as the toolchain supports lowering the high-level language through its intermediate representation to the BPF backend, you can use any language you want. I'm not enough of a clang expert to claim that every language will work, but as long as clang supports that translation, it should. The Go library, all it expects is BPF bytecode; you can even write that bytecode by hand. GCC also supports a BPF backend by now, so there are multiple ways to generate that bytecode.




Recorded at:

Apr 08, 2020