
How to Make Linux Microservice-Aware with Cilium and eBPF



Thomas Graf talks about a new, efficient in-kernel programming language called eBPF. It allows everyone to extend existing kernel components or glue them together in new forms without requiring changes to the kernel itself.


Thomas Graf is co-founder & CTO at Covalent and creator of the Cilium project. Before this, he was a Linux kernel developer at Red Hat for many years. Over more than 15 years of working on the Linux kernel, he was involved in a variety of networking and security subsystems. For the past couple of years, he has been involved in the development of BPF and XDP.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


My name is Thomas Graf. Before I start, I would like to know a little bit about what your interests are; who is involved with the ops side of things, the platform side, kind of less the dev side, just pure platform? Who is really into development and DevOps? So, first of all, I've been asked to stay behind the podium, so whenever you see me walk in front, just give me a sign and push me back. Apparently, the video camera is not able to catch me if I've walked forward too much.

So what I'm here to talk about is BPF, the Berkeley Packet Filter, and how BPF can be used to turn Linux into what we call a microservice operating system. What makes me qualified to talk about this? I've spent about 15 years working on the Linux kernel. About 10 years of that, I've mostly focused on networking and security from a subsystem perspective. So I helped write potentially the biggest monolith ever, 12 million lines of source code by now. I worked on pretty much all of the networking subsystem, a lot of security userspace stuff, Netlink, iproute2, and so on.

For the past two years, I've created Cilium, which we'll cover a little bit later. And then I co-founded the company that is behind Cilium. What we'll cover in this session: we'll talk about the evolution of running applications, very quickly. Then we'll look at the problems the Linux kernel has right now, based on its history; what are the problems that the kernel has right now in terms of running microservices? And we'll turn to BPF, then to Cilium, and hopefully we'll have time for a quick demo as well.

Evolution: Running Applications

The evolution of running applications. So you have kind of the Dark Age: single-tasking. Basically, a process has full access to the machine. I don't remember this age, I'm too young. We went into the multitasking age, where the CPU was now split between applications. We required stuff like MMUs, we introduced concepts like virtual memory, and so on. This was the age where Linux distributions really took off. You had to run a Linux distribution, you needed package management to actually manage all of your dependencies. All of your applications running on a server had dependencies on shared libraries, and you needed to make sure that all of your shared libraries had the latest version, and so on.

The age of multitasking was still on bare metal, so you were deploying on physical servers. We went to the age of virtualization; all of a sudden everything was running inside of a VM, and we started packaging the OS together with the app. Basically, every app could run on its own OS, and you would deploy this all in one piece. And then we started virtualizing all hardware and putting a V in front, and all of a sudden you had virtual switches and virtual bridges and virtual devices. Everything was software-defined. Essentially, what we had before in hardware, we put in software and started running it inside of VMs or on hypervisors.

We're now entering the age of microservices and containers. Some of us are further down that path, others are earlier on. The main piece is that we're actually back to running on the host operating system. So we're back to a world where the host operating system is shared between multiple applications. It's no longer one virtual machine per app. We have multiple apps that require isolation and resource management, managed by the host operating system: containers, namespaces. We'll talk about this. This has a huge impact in terms of what the operating system has to provide. It's no longer just, "Oh, let me forward network packets to this VM. Let me do some firewalling outside." We actually have to care about the application. Again, this is a huge shift. It's not the same as multitasking, though. All of a sudden we have applications that only exist for a couple of seconds. We have completely different requirements. We have multi-tenant systems, and so on; very different use cases.

Problems of the Linux Kernel in the Age of Microservices

So what are the problems that the Linux kernel has in this context? Obviously, it was not built for this age. Problem one: abstractions. Software developers, we all love to build abstractions. This is very networking-specific, but it gives you an idea of what the abstraction looks like inside the Linux kernel when it comes to networking. If we actually want to do packet filtering using Netfilter, which is kind of the firewalling layer, we always have to go through sockets and TCP. If we're coming from the network card side, we have to go through the network device abstraction, then traffic shaping, Ethernet, IP, and so on. We always have to move up and down the stack.

There are a couple of pros. The first is very, very strong userspace API compatibility guarantees, which means that your 20-year-old binary that you compiled once still works. This is amazing. This is great. The second big benefit is that the maturity of the Linux kernel code is actually independent of hardware. I've worked 15 years, I've never written a driver. I have no clue about hardware, but I've written a lot of low-level code, like IP and routing and firewalling and all of that stuff, with very little knowledge of actual hardware.

Cons: massive performance overhead, and we'll look into why that is and how we can solve it. And the second con: it's very hard to bypass these layers. There are some exceptions, but in the majority of cases, it's very hard to bypass.

Problem number two: every subsystem has its own API. Again, this is not just networking-specific; it applies exactly the same to storage or other things. For all of these layers, all of these subsystems, I've listed a couple of tools here. If we want to configure our Ethernet drivers, our network device, we call ethtool. If we configure IP routing, we call ip. If we configure the system call filter, we call seccomp. If we configure IP filtering, we call iptables; but hey, if you're using raw sockets, all of that stuff gets bypassed, so it's not really a complete firewall. If you're doing traffic shaping, you call tc. If you want visibility, you call tcpdump, but again, that doesn't see quite everything; it only looks at one layer. If you have virtual switches, you call brctl or ovs-vsctl. So every subsystem has its own API, which means if we automate the management of all this, we actually have to use all of these tools independently. There's tooling that does this for us, but it means that we have to understand the complete layering of everything.

Problem number three: the development process. If you have a need for a change in the Linux kernel, this is hard, but we'll cover the good parts first. First of all, what's good about the Linux kernel development process? It's open and transparent; everybody knows what everybody else is doing, and it has excellent code quality. It's relatively stable, probably the most stable out there right now, let's say, and it's available everywhere. Once you have merged your change into the Linux kernel, everybody consuming the Linux kernel will have it available to them, and it's almost entirely vendor-neutral, which I think is positive as well.

The bad: the list is much longer. First of all, the Linux kernel is very, very hard to change. Shouting is involved. Apparently, this is getting better. It's a super large and complicated code base: 12 million lines of source code, C code, some of it 30 years old. Upstreaming code is hard; consensus has to be found. So if you have your specific use case and you don't find others that share your views and will acknowledge your code change, you're not able to actually get it in, and you cannot just modify it on your own. Well, you can do this, but then you have to fork the kernel, and after that you have to maintain 12 million lines of code. Depending on the Linux distribution that you're using, it can take years until that change actually becomes available to your users. So people are running 10-year-old kernels. Probably the biggest one: everybody's maintaining their own fork. Sometimes there are thousands and thousands of backported patches. If you're running Android, yes, you're running Linux, but you're really running Android Linux. If you're running RHEL, yes, you're running kind of Linux, but you have like 40,000 patches. It's not really like the upstream Linux; it's a fork of Linux that you're running. So not everybody's using the same Linux.

The fourth problem, and this is probably the biggest one: the kernel doesn't actually know what a container is. What the kernel knows about is this list here. The kernel knows about processes and threads. It knows about cgroups. A cgroup is a logical construct where you can associate processes with a group and then attach resource limits to it: how much CPU can this group of processes use, how much memory, what are the I/O operations per second, and so on. It has a concept of namespaces: hey, give this group of processes a virtual process space so they can only see the processes in that space; give it a network namespace so it can only see the networking elements in that same space; and so on.

The kernel knows about IP addresses and port numbers. The kernel knows about the system calls that are being made; whenever an application interacts with the Linux kernel, it will perform a system call. The kernel sees these system calls and can actually filter on them. The kernel also knows the SELinux context, which gives it the ability to provide security filtering capabilities: what can a process actually do, and how can it interact with other processes, and so on. Sounds great. These are the building blocks that were built for the multitasking age.

What does the kernel not know? The kernel has no clue what a container or a Kubernetes pod is. You can find the container ID in the cgroup file, but the kernel itself has no clue what an actual container is. The kernel only sees namespaces and cgroups. The kernel also no longer knows whether an application is intended to be exposed outside of the host or not. If you go back to the multitasking age, the kernel actually knew when an application would bind to a port and an IP that would make that port publicly available. Let's say a web server running on port 80: if you bind it to the loopback interface, it will not be publicly exposed. If you bind it to all interface addresses, it will be publicly exposed. In the age of containers and pods, it's no longer clear to the kernel what should be exposed and what should not be exposed.

The next one is actually a huge one: what used to be IPC calls (Unix domain sockets, pipes, and so on) are now API calls (REST, gRPC, and so on). The kernel has no clue about these things. The kernel only knows about network packets, about port numbers. The kernel will know that, "Hey, there is a process. It's listening on port 80 and it's running in its own namespace." The kernel has no clue about what goes beyond that, what's actually running on that port. Before, the kernel actually knew this was an application doing an IPC call to another process; that's obviously service-to-service, process-to-process communication. And service mesh (I'm not sure how many of you are looking at service mesh yet): the kernel has no clue what a service mesh is. Lots of things that the kernel has no clue about.

What do we do about it? A couple of alternatives. First of all, for the first problem, we could just give userspace access to all of our hardware. Just bypass it completely. I mean, it will be fine. Userspace can handle this fine. The application probably knows how to deal with this hardware. Examples of this are userspace DMA and DPDK. There are many, many more frameworks like this. Second alternative: unikernels. Linux was wrong; the app should provide its own OS. Definitely feasible. There are many examples; three of them are ClickOS, MirageOS, and Rumprun. Many, many more. So instead of having a shared operating system, every app can provide its own OS.

Third, move the OS to userspace, like gVisor, or User Mode Linux many, many years back. We can just run the majority of the operating system in userspace; we only need a small, minimal operating system that actually deals with hardware, and everything else we can run in userspace on top, and we don't have to deal with the Linux kernel community to change networking or storage and so on. Yes, it's a great idea. You will have a massive performance disadvantage, of course.

Last one: we can just rewrite everything. Apparently, this is a thing. I think later today Brian will talk about rewriting everything in Rust. I think it's just a massive undertaking. So I just Googled how much it would cost to rewrite the Linux kernel, and this is the number that came up. I'm not sure if the salaries are still up-to-date here. But it would be a massive undertaking to actually rewrite the Linux kernel. So probably not feasible.

What is BPF?

So this brings us to BPF. BPF is a solution to a lot of this. So what is BPF? BPF is a highly efficient sandboxed virtual machine in the Linux kernel that makes the kernel programmable. It's jointly maintained by some of the engineers on our team and Facebook, and we have massive collaboration, with more people joining in from Google, Red Hat, Netflix, Netronome, and many, many others. BPF looks like the code on the right: it's basically bytecode that allows you to program the Linux kernel, and we'll look into how that actually works.

First of all, to understand this: the Linux kernel is fundamentally event-driven. On the top, we have processes doing system calls. They will connect to other applications, they will write to the disk, they will write to a network socket, they will read from a network socket, they will require timers, they will acquire semaphores, and so on. It's all event-driven. All of those are system calls. From the bottom, we have the hardware. This can be actual hardware or virtual hardware, and it will raise interrupts: "Hey, I have a new network packet", "Hey, the data that you requested from this block I/O device is now ready for reading", and so on. So the kernel does everything event-driven. In the middle, we have this giant monolith of 12 million lines of code that basically handles these events and does a lot of magic.

What BPF allows us to do is run a BPF program on an event. Some examples: we can run a BPF program when an application does a read system call that will later read from a block I/O device. We can also run a BPF program when that block I/O event actually happens. We can run a BPF program when a process does a connect system call, for example, connecting to a DNS server or to a web server. If a TCP retransmission happens as part of this, we can again run a BPF program on that particular event, running only when a TCP retransmission happens. And then when the packet is actually going out to the network card, we can again run a BPF program.

So fundamentally, BPF allows us to create these logical programs and implement logic whenever something is happening inside the Linux kernel. We can do this for all kernel functions using kprobes. We can do this for tracepoints; these are well-defined, stable function names. We can even do this for userspace function names using uprobes. So when your userspace application calls a function, we can trigger a uprobe and run a BPF program. This is how some of the profiling and tracing tools that leverage BPF work. We can call BPF programs on system calls and network devices, on socket-level interactions, so whenever data is being enqueued into a socket or when it is read. We can even call this at the network driver level with DMA access for very fast access. And there are more and more attachment points coming in every release of the Linux kernel.

BPF programs can communicate with each other and they can store state via BPF maps. So you have your bytecode, which is just the code that runs, and then separate from that you have the state of the program, which is stored in BPF maps. BPF maps can be accessed from BPF programs and from userspace. So you can write state into a BPF map and then read it from userspace, for example, to export metrics, or you can write configuration into a map and read it from a BPF program to configure that program, and so on. It allows us to store state. Map types that are supported are hash tables, arrays, LRU maps, ring buffers, stack traces, and LPM (longest prefix match) tries. Some of them have per-CPU variants as well, for efficiency.
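As a rough sketch of what that looks like (using a libbpf-style map definition; the map and program names here are illustrative, not from the talk), a BPF program might count events per PID in a hash map that userspace can later read:

```c
// Sketch: a BPF hash map shared between kernel and userspace.
// Compiled with clang -target bpf; loaded and read via libbpf.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1024);
    __type(key, __u32);    /* PID */
    __type(value, __u64);  /* event counter */
} event_count SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_execve")
int count_execve(void *ctx)
{
    __u32 pid = bpf_get_current_pid_tgid() >> 32;
    __u64 init = 1, *val;

    /* Look up existing counter, increment, or insert a new entry. */
    val = bpf_map_lookup_elem(&event_count, &pid);
    if (val)
        __sync_fetch_and_add(val, 1);
    else
        bpf_map_update_elem(&event_count, &pid, &init, BPF_ANY);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```

Userspace can then iterate the same map (via the libbpf lookup APIs) to export the counters as metrics, which is exactly the kernel-to-userspace state sharing described above.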

We can call BPF helpers. The kernel actually allows us to interact with it. For example, a BPF program does not know on its own how to change and manipulate a network packet. For this, we call a helper. A BPF program does not know how to generate a random number. For this, we call a helper. These are stable APIs. They never change. They will be maintained forever. This allows BPF programs to interact with the Linux kernel and use existing functionality that the Linux kernel already provides.

We can do tail calls; we can have one program call another program. We can implement chains of logical programs. We can implement function calls using this. This allows us to build small programs and then call them in a sequence. We have a JIT compiler, a just-in-time compiler. This means that when you load generic BPF bytecode, which is not CPU-specific, the kernel will take it, verify that it's safe to run, and after that compile it to whatever your CPU actually runs, so x86 or whatever. You can see the list of supported CPU architectures; basically all 64-bit CPU architectures are supported at this point. So this ensures native execution speed without requiring the author of the BPF program to understand what the CPU does.
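A minimal sketch of a tail call (illustrative names; the program-array slots would be populated from userspace at load time) looks like this: one program jumps into another through a program-array map using the `bpf_tail_call()` helper.

```c
// Sketch: chaining BPF programs with a tail call.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
    __uint(max_entries, 2);
    __type(key, __u32);
    __type(value, __u32);
} jump_table SEC(".maps");

SEC("xdp")
int handle_ipv4(struct xdp_md *ctx)
{
    /* IPv4-specific handling would go here. */
    return XDP_PASS;
}

SEC("xdp")
int entry(struct xdp_md *ctx)
{
    /* Jump to slot 0 of the table; on success this never returns. */
    bpf_tail_call(ctx, &jump_table, 0);
    /* Fall through only if the slot is empty. */
    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";
```

This is the "small programs called in a sequence" pattern: each stage stays simple enough for the verifier, and stages can even be swapped out at runtime by updating the program array.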

Who owns BPF? This list is growing, so this is just the top 10. This is the number of contributions to BPF on the kernel side in roughly the last two years. You can see that we have the two maintainers, Daniel and Alexei, co-maintaining BPF. And then we have contributions from Facebook, Red Hat, Netronome; the list goes on and on. There are, I think, 186 contributors to this subsystem of the Linux kernel alone. So it's one of the fastest-growing subsystems right now.

Who is using BPF for what? This gets a bit to what Justin was mentioning. This is revolutionizing a lot, and people don't quite see it yet. Use case number one: Facebook is basically rewriting the majority of their infrastructure-related code using BPF. Small spoiler: there will be a very interesting talk about how Facebook is replacing iptables and network filtering with BPF at the upcoming BPF summit on November 12th. That talk will definitely be online. If you are interested in this stuff, definitely check that talk out. They have lots and lots of details and performance numbers. So basically, Facebook has already replaced their DDoS mitigation and L3-L4 load balancers with BPF, moving away from IPVS. They are already using it for traffic optimization, and the talk will cover how they will use it for network security in the future.

Google started out using it for profiling, to actually figure out how much CPU applications are using in a distributed system, and they're now moving to traffic optimization and network security as well. Red Hat is working upstream on a project called bpfilter, which will replace the kernel portion of iptables; that is, the part of the iptables packet filter that runs in the kernel will be ripped out and replaced with BPF. And there are also several papers and projects out there using XDP and BPF for NFV use cases.

And last but not least, Netflix. If you have seen BPF before, it's quite likely that you've seen Brendan Gregg talk about performance troubleshooting using BPF: looking at production, let's say, high-scale environments where applications consume CPU and where it's difficult to troubleshoot and profile unless you have a profiling tool with very low overhead, and using BPF to extract so-called flame graphs and troubleshooting data. Most recently, he has open sourced a new tool called bpftrace, which is kind of the DTrace for Linux and offers a nice syntax for performance troubleshooting. These are just a few. There are many, many more examples of how BPF is being used. A massive number of projects have started in the last two years.

To give you a very simple example (at least to me, this is simple): what does a BPF program look like? Basically, you write in a high-level language, such as C, and you can execute code when a certain event happens. In this case, this small program runs whenever the exec system call returns. So whenever a process executes another binary and that system call returns, this code is invoked. We get the current process ID and group ID from that process, and we retrieve the command (it could be curl, it could be httpd), the actual binary name, and expose that through a perf ring buffer to userspace. So this gives us an event, an invocation, every time exec is called by a process, and you get the PID and the actual binary name. You can monitor your systems. A very, very simple example, but this is exactly how profiling and monitoring systems, for example Sysdig, use BPF to monitor your system.
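A hedged sketch of the program described above (illustrative, not the speaker's exact code; structure names are assumptions) would attach to the execve tracepoint, grab the PID and command name, and push an event to userspace through a perf ring buffer:

```c
// Sketch: emit an event with PID and binary name on every exec return.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct event {
    __u32 pid;
    char comm[16];
};

struct {
    __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
    __uint(key_size, sizeof(__u32));
    __uint(value_size, sizeof(__u32));
} events SEC(".maps");

SEC("tracepoint/syscalls/sys_exit_execve")
int trace_exec(void *ctx)
{
    struct event e = {};

    e.pid = bpf_get_current_pid_tgid() >> 32;
    bpf_get_current_comm(&e.comm, sizeof(e.comm));
    /* Hand the event to userspace; a poller reads and prints it. */
    bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
                          &e, sizeof(e));
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```

A userspace loader then polls the perf buffer and prints one line per exec, which is the monitoring pattern tools like execsnoop in BCC use.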

What is Cilium?

That was an intro to BPF. Very low-level: what is BPF? Now, do you need to fully understand all aspects of BPF to actually use it? No. This is why we created Cilium. Cilium is an open source project aiming to provide networking, load balancing, and security for microservices in environments such as Kubernetes and Docker and so on. It is primarily aimed at containers; the technology itself is not container-specific at all, but right now we're primarily supporting containers. And we're making the Linux kernel microservice-aware with the help of BPF.

Project goals. First of all: approachable BPF. BPF is fantastic, flexible, and highly efficient, but it's super hard to use unless you understand kernel-level development. There's no doubt that the majority of people will not want to write BPF programs on their own, but you want to benefit from its flexibility and power. So you want to automate program creation, you want to automate the management part, and so on. That's goal number one.

The second goal is to use the flexibility of BPF to make the Linux kernel aware of cloud native application use cases, and we'll dive into some of these. Security: use the power of BPF to make the kernel API-aware. So you actually have the kernel understand, "Hey, you have two applications talking to each other; what are the API calls they are making?" Enable the kernel to actually secure API calls. Build an identity-based mechanism to secure service communication. So instead of filtering on IP addresses, actually understand what a microservice is, what the labels on that microservice are, what the security identity of the microservice is, and then build an identity-based security mechanism and firewall, instead of just understanding IP addresses and ports.

And process-level context enforcement: actually using the power of BPF to understand what the binary is, what the process inside of a container is, that is making a certain API call. This is highly useful. For example, everybody using Kubernetes understands that you can run kubectl exec to execute a command inside of a pod. So who's going to secure the communication that this invocation makes? It's obviously not the service itself that you're running that is making this call. So how can you secure this additional command that you're executing, whether it should be able to reach out somewhere else, and so on? And the last aspect is to leverage the performance of BPF.

Some use cases: first of all, we're a CNI and CNM plugin, so you can use Cilium for container networking, for pod-to-pod networking, for container-to-container networking. We support IPv4, IPv6, NAT46, and we have concepts for multi-cluster routing. So we use all of the BPF flexibility for container networking that is fast and flexible. We implement service load balancing. If you're running Kubernetes or Docker or something like this, you will put a so-called service construct in front of your replicas of containers or pods to make it HA, to make it available. We implement, for example, Kubernetes services using BPF, the same technology Facebook is using for their load balancers, so it's a highly efficient, highly scalable implementation.

We implement microservice security. Instead of just providing pure IP- and port-based network security, we enforce using identities. So we give services identities, and we allow you to define a security policy based on service labels. It's like: my service frontend can talk to my service backend, and we ensure that we enforce it at the networking level. We have an accelerated, BPF-based way of doing API security. So instead of just saying two services can talk to each other, we can say they can only make certain REST API calls, or this service can talk to my Kafka cluster but it can only produce on a particular topic, or it can only consume from a particular topic.

We have DNS server policies. And upcoming in one of the next releases, we'll have SSL data visibility via kTLS: the ability to actually understand what data an application is sending, even if the application is using SSL encryption. This is very important if you want to have visibility and security for data that is leaving your cluster and maybe going to a SaaS offering outside of the cluster; this data would almost always be encrypted. Or you're using cloud-managed databases; maybe you're doing calls to, let's say, some database service that will always be SSL encrypted. So if you want to have, for example, service mesh functionality or API visibility, this is where Cilium's SSL data visibility with kTLS can give you visibility.

Now, the last part is service mesh acceleration. We'll do a quick deep dive into that because I think it might be interesting to a lot of people. Who has heard about service mesh? Most people, that's great. It's growing. This is usually the picture that you see when people talk about service mesh: two services are talking to each other, and instead of talking directly, you have a sidecar proxy that is basically in the way of all this communication. Looks pretty simple and neat. What it really looks like is something like this. This is how it looks on the data path side of things. The service talks using the regular data path, the networking mechanisms, and then you have an iptables rule or similar that transparently redirects all traffic to the sidecar. So instead of being able to go out to the world directly, it has to go to the sidecar. The sidecar will terminate that TCP connection, look at the HTTP headers and everything, run its services, and then send it out again. It's not really efficient. You will have about a 10x impact just from injecting this sidecar architecture. It's not because the proxy (in this case, I have a picture of Envoy up there) is slow or not efficient; it's because of this transparent injection, because the proxy is getting forced into the way.

Why do we do TCP, which was designed for a lossy environment? It was literally designed to sustain a nuclear attack. So why do we run TCP if the service and the sidecar proxy are always running next to each other on the same host? We can choose not to. We can just shortcut these sockets. If they're always running on the same host, we can detect that and copy the data from one socket, the application's, to the other socket, for example Envoy's. We did that, and there was about a 3x to 4x performance improvement in terms of the number of requests per second that we can forward. So this is one example of how Cilium and BPF can make the kernel microservice-aware: shape it in a way that is built and tailored for this new age of microservices, for example, to bring in functionality such as service mesh.
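The socket shortcut described above can be sketched with a BPF sk_msg program and a sockmap (illustrative only; Cilium's actual implementation is more involved, and the key derivation here is an assumption): data written to one local socket is redirected straight to the peer socket, skipping the TCP/IP stack for node-local traffic.

```c
// Sketch: socket-to-socket redirect via a sockmap, bypassing TCP/IP
// for two sockets on the same host (e.g. app <-> sidecar proxy).
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_SOCKMAP);
    __uint(max_entries, 65535);
    __type(key, __u32);
    __type(value, __u32);
} sock_map SEC(".maps");

SEC("sk_msg")
int msg_redirect(struct sk_msg_md *msg)
{
    /* Illustrative: real code derives the key from the message's
     * local/remote address and port to find the peer socket. */
    __u32 key = 0;

    /* Hand the payload directly to the peer socket in the map. */
    return bpf_msg_redirect_map(msg, &sock_map, key, BPF_F_INGRESS);
}

char LICENSE[] SEC("license") = "GPL";
```

The sockets themselves are inserted into the sockmap by a companion sock_ops program as connections are established; once both ends are in the map, the data copy happens socket to socket, which is where the reported 3x to 4x gain comes from.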

There are many, many other BPF projects, and I've only listed a couple of them. I've already mentioned bpftrace, which is the DTrace for Linux. There's bpfd, which is a way to load BPF programs across a large cluster of servers; this is done by a Google engineer, mostly for profiling. There are several frameworks; I've listed two here: gobpf and BCC. They allow you to write BPF programs using either Python or Go. Often you still have to write the actual BPF program in C, but all of the logic around it can be written in a higher-level language. I guess to me, C is still a high-level language.

For load balancing, there's Katran, Facebook's load balancer that has been open sourced. There are security tools; for example, seccomp has a BPF mode. There are DDoS mitigation tools: Cloudflare has open sourced the BPF tool that they use to mitigate DDoS attacks. It's a tool with an iptables-like syntax where you can block IP addresses or entire ranges, and do this at massive speeds, up to 40 million packets per second. That's basically what you need if you are a CDN that brands itself as providing DDoS mitigation. It shows how easy it is to actually use BPF if you build the right level of abstraction on top. There are many, many more that I have not listed here. If you want the full Cilium BPF documentation reference, there will be a full list on the last slide as well.

Before we go into the Q&A, I want to do a quick demo. I talked a lot about low-level stuff, but I want to do a quick demo as well. How much time do we have? Yes, I think I still have a couple of minutes. So everybody has heard about Kubernetes, I assume, right? Just making sure. I have a Kubernetes cluster here; it's actually just minikube, and I've already started a couple of pods, because of conference Wi-Fi. You can see there are some deathstars, some spaceships, and some xwings. So what could it be? Let's see, I want to do my intro. This demo is Star Wars-themed. You may have guessed it.

So, a long time ago in a container cluster far, far away, it is a period of civil war. The empire has adopted microservices and continuous delivery. Despite this, rebel spaceships, striking from a hidden cluster, have won their first victory against the evil Galactic Empire. During the battle, rebel spies managed to steal the Swagger API specification to the empire's ultimate weapon, the deathstar. So that's our demo. We'll demo some functionality around this.

So as you see, the empire has constructed the deathstar; it's now running here with three replicas. We should also have a service. Yeah. So we have a deathstar service, which is a cluster IP, and it will load balance to my three deathstar replicas. So the empire is now running the deathstar in an HA manner; that's good. And then there's a small script which will just give me a kubectl command line. For those who have not used Kubernetes yet: kubectl exec will basically execute this curl command here, inside this pod, which is basically a container.

So what we're doing here is we're basically doing curl, and we're calling the deathstar service name, and the deathstar will give us something like, "Hey, I'm the deathstar, here are some attributes. And by the way, here's my entire API that you can access." So you can do a GET to /, you can do a GET to /health, you can request a landing, you can put something into the cargo bay, you can get the status of the hyper-matter reactor, or you can put something into the exhaust port. Amazing. I'm assuming everybody has seen the movie. So obviously the rebels know what they have to do: just invoke that REST API and the deathstar is going down, right? Way easier than in the actual movie.

So fortunately, the secops team of the empire has adopted Cilium. Very bad for the rebels; they're doing an L7, API-aware policy. Let's have a quick look. So this is a so-called Kubernetes CRD, custom resource definition, for which we defined our own kind, which is a Cilium policy. So this is what a Cilium policy looks like, which will then be implemented using BPF. It allows you to basically specify, first of all, in this block, that this policy will apply to all pods in your Kubernetes cluster that have the following two labels: class deathstar and organization empire. So all our deathstar replicas will have this policy applied. Then we have an ingress policy, the policy that applies to traffic going into a pod. It means you can be talked to from all pods which have the label class spaceship. That could be a thousand pods, could be one pod; all pods that have this label can talk to me on port 80 TCP. And then here are the API rules: you can do a GET to /v1, or you can do a POST to /v1/request-landing. So this is one policy that includes service-to-service, includes L4 on the port, and includes the API.
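The policy described above can be sketched as a CiliumNetworkPolicy resource. This is a reconstruction based on the demo narration, not the exact file shown on screen; the field names follow the `cilium.io/v2` CRD schema, and the label keys and paths are assumptions drawn from what the speaker describes:

```yaml
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "deathstar-api-policy"
spec:
  # Applies to all pods carrying both labels (the deathstar replicas).
  endpointSelector:
    matchLabels:
      class: deathstar
      org: empire
  ingress:
  # Only pods labeled class=spaceship may talk to the deathstar at all.
  - fromEndpoints:
    - matchLabels:
        class: spaceship
    # ...and only on TCP port 80, and only these two API calls.
    toPorts:
    - ports:
      - port: "80"
        protocol: TCP
      rules:
        http:
        - method: "GET"
          path: "/v1"
        - method: "POST"
          path: "/v1/request-landing"
```

A policy like this would be loaded with `kubectl create -f`, after which any other request to the deathstar pods (for example a PUT to the exhaust port) is rejected at layer 7.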

Let's get out of here. So I've created this policy; it's now loaded in here. We could also extract it again if we wanted to. Kubernetes now understands this policy and has loaded it. So let's go back, and let's do the GET to /v1 again. That still works. Now let me change that, going from a GET to a PUT. The rebels are trying to kill the deathstar. Access denied. So this was Cilium using BPF and Envoy to enforce a layer 7 security rule that restricted the API calls. That's what I call a success.

Not quite. So what we've missed is that while the deathstar was being constructed, while the Terraform scripts were running, the Jedis managed to infiltrate it, and they actually loaded a slightly different policy than the one that I showed you. So let's do a diff. This is the policy that I showed you guys, this is the policy that I actually loaded, and if we do the diff, we can see that there is one more rule in there, which says you can do a PUT to /v1/exhaust-port if you have the HTTP header X-Has-Force: true. Let me see if I can do that. Let me go back here, and we add the header X-Has-Force: true. Luke is coming, panic, deathstar exploded. So we're good again; we can actually bypass the rule if we want to.
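The hidden rule the diff reveals would look something like the fragment below, appended to the policy's `rules.http` list. Again a sketch of what the speaker describes rather than the exact file from the demo; Cilium's HTTP rules do support matching on headers as `"Name: Value"` strings:

```yaml
      rules:
        http:
        # The extra rule smuggled in by the Jedis: the exhaust port is
        # reachable, but only with the right header set.
        - method: "PUT"
          path: "/v1/exhaust-port"
          headers:
          - "X-Has-Force: True"
```

With this in place, a plain `PUT /v1/exhaust-port` is still denied, but adding `-H 'X-Has-Force: True'` to the curl call matches the rule and the request goes through.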

So it shows you the flexibility and the power of this. We can filter on path, on method name, on headers, and so on. One of the next versions will include functionality to enforce on the payload as well. We have extended Envoy with so-called Golang extensions, where you can write your own small Golang extension that will then run on the entire HTTP message, including the payload. This would, for example, allow you to validate your own tokens, or enforce a particular schema on the payload, and so on. So, while it's nice to have enforcement on the request, a lot of APIs obviously carry some of the data as part of the HTTP body, which we cannot enforce on right now. So one of the next versions (I don't want to promise the next version) will feature this.

With that, here are the links. First of all, thank you for listening to me. All of this is open source; there's the GitHub page, you can check it out. We have an extensive BPF reference guide which goes into all of the gory details, all the way down to the individual BPF register numbers and everything. You can follow us on Twitter, and we have a website as well.

Questions and Answers

Man 1: So I'm relatively new to this, so I'm asking a very basic question. What exactly is DPDK? How does DPDK differ from what you're describing?

Graf: So the fundamental difference is that DPDK allows you to write a userspace program that has direct access to either the hardware or the data itself. So instead of solving your problem in kernel space, you're solving it in userspace. Previously, DPDK would come with drivers for each specific network card that was supported, and the newer version of DPDK actually now runs on top of a special BPF mode that gives direct access to the data only. The one-sentence answer is kernel space versus userspace.

Man 2: You are basically extending the Kernel with a [inaudible 00:41:07]. Is there anything in the design to sort of help with the fact that if you're running six different tools, and someone else is running 12, that you're basically running a unique kernel [inaudible 00:41:16]?

Graf: So the BPF API has been designed so that all hooks can be used by multiple users, which will run in sequence. That said, if you have a hook that filters and says, "Hey, drop this," then the hook afterwards will not see it. So this is definitely an area that is still being worked on; for example, what is the exact ordering? I think this is always a problem with any user-space-facing API: how does this all work together?

Man 3: [inaudible 00:41:53]

Graf: BPF is older than I am. It was invented by basically the same guy that pretty much invented the internet. And you've all used BPF. If you have been running tcpdump, you have been using BPF; you've been using BPF as its filtering technique. If you are using Chrome, you have been using BPF; it's using BPF to do system call filtering for the plugins of Chrome. But that's classic BPF, cBPF, the old BPF. Relatively basic. eBPF is the extended, or "evil", BPF; both names are being used. And there are lots of technical details in what's different: 32-bit to 64-bit, many more registers, many more helper calls, and so on. Lots and lots and lots of additions, but it's pretty much the same thing; that's why I'm just saying BPF. Classic BPF doesn't really exist anymore: if you load classic BPF, it will be translated into eBPF at this point, but technically they are two different instruction sets.



Recorded at:

Jan 10, 2019