A Gentle Introduction to eBPF

May 03, 2021 11 min read

Key Takeaways

eBPF is a mechanism for Linux applications to execute code in Linux kernel space. eBPF has already been used to create programs for networking, debugging, tracing, firewalls, and more.
eBPF can run sandboxed programs in the Linux kernel without changing kernel source code or loading kernel modules.
Several complex components are involved in the functioning of eBPF programs and their execution.
Teleport, an identity-aware access proxy, is an example of an open source project using eBPF. It can be used to collect events from SSH sessions such as network connections, filesystem changes, etc.

In this article, we will review what eBPF is, what it does, and how it works. Then, we will explain how to execute an eBPF program and provide an example of eBPF in action. Finally, we will conclude with recommendations for next steps.

eBPF lets programmers execute custom bytecode within the kernel without having to change the kernel or load kernel modules. Exciting? Maybe not yet.

What is eBPF?

Linux divides its memory into two distinct areas: kernel space and user space. Kernel space is where the core of the operating system resides. It has full and unrestricted access to all hardware (memory, storage, CPU, etc.). Due to the privileged nature of kernel access, kernel space is protected and runs only the most trusted code, which includes the kernel itself and various device drivers.

User space is where anything that is not a kernel process runs, e.g. regular applications. User space code has limited access to hardware and relies on code running in kernel space for privileged operations such as disk or network I/O. For example, to send a network packet, a user space application must talk to the kernel space network card driver via a kernel API referred to as “system calls”.
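As a rough illustration of that boundary, the Python sketch below sends a few bytes over a local socket pair. It uses only standard-library calls; the message and variable names are arbitrary. The point is that user space code never touches the kernel's socket buffers itself: each operation is a request that crosses into kernel space through a system call.

```python
import socket

# socketpair() asks the kernel for two already-connected sockets.
left, right = socket.socketpair()

# sendall() and recv() cannot reach the kernel's buffers directly;
# each one crosses into kernel space through a system call.
left.sendall(b"ping")
reply = right.recv(4)

left.close()
right.close()
print(reply)
```

Running the script under a tracer such as strace would show the corresponding system calls being made on the program's behalf.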

While the system call interface is sufficient in most cases, developers may need more flexibility to add support for new hardware, implement new filesystems, or even add custom system calls. For this to be possible, there must be a way for programmers to extend the base kernel without adding directly to the kernel source code. Linux Kernel Modules (LKMs) serve this function.

Unlike system calls, whereby requests traverse from user space to kernel space, LKMs are loaded directly into the kernel. Perhaps the most valuable feature of LKMs is that they can be loaded at runtime, removing the need to recompile the entire kernel and reboot the machine each time a new kernel module is required.

Figure 1 - LKMs can be dynamically loaded and unloaded as part of kernel space (Source)

As helpful as LKMs are, they introduce a lot of risks to the system. Indeed, the separation between kernel and user spaces adds a number of important security measures to the OS. The kernel space is meant to run only a privileged OS kernel, with the intermediate layer, labelled as Kernel Services in the picture above, separating user space programs and preventing them from messing with finely tuned hardware.

In other words, LKMs can make the kernel crash. Additionally, and aside from the wide blast radius of security vulnerabilities, modules incur a large maintenance cost in that kernel version upgrades can break them.

What does eBPF do?

eBPF is a more recent mechanism for writing code to be executed in the Linux kernel space that has already been used to create programs for networking, debugging, tracing, firewalls, and more.

Born out of a need for better Linux tracing tools, eBPF drew inspiration from dtrace, a dynamic tracing tool available primarily for the Solaris and BSD operating systems. Unlike dtrace, the existing Linux tools could not provide a global overview of a running system, since they were limited to specific frameworks for system calls, library calls, and functions. Building on the Berkeley Packet Filter (BPF), a tool for writing packet-filtering code using an in-kernel VM, a small group of engineers began to extend the BPF backend to provide a similar set of features to dtrace. eBPF was born.

eBPF was first released in limited capacity in 2014 with Linux 3.18; making full use of it requires Linux 4.4 or above.

Figure 2 - Simplified eBPF architecture

In Figure 2, we see a simplified visualization of eBPF architecture.

eBPF allows regular user space applications to package the logic to be executed within the Linux kernel as bytecode. These packages are called eBPF programs, and they are produced by an eBPF compiler toolchain such as BCC. eBPF programs are invoked by the kernel when certain events, called hooks, happen. Examples of such hooks include system calls, network events, and others.

Before being loaded into the kernel, an eBPF program must pass a certain set of checks. Verification involves simulating the program's execution inside the kernel's verifier rather than actually running it. Doing so allows the verifier, with 10,000+ lines of code, to perform a series of checks. The verifier traverses the potential paths the eBPF program may take when executed in the kernel, making sure the program does indeed run to completion without unbounded looping, which would cause a kernel lockup. Other checks, from valid register state and program size to out-of-bounds jumps, are also carried out.
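To make the idea of path traversal concrete, here is a toy verifier in Python. It is only a sketch of the concept and not the kernel's algorithm: the instruction format (tuples such as `("mov", reg, imm)`) is invented for this example, and loops are rejected conservatively by refusing every backward jump. It walks every reachable path and checks program size, register numbers, jump targets, and that each path ends in an exit.

```python
MAX_INSNS = 4096   # mirrors the kernel's historical BPF_MAXINSNS limit
NUM_REGS = 11      # r0..r10

def verify(prog):
    """Return None if the toy program is accepted, else a rejection reason."""
    if not prog or len(prog) > MAX_INSNS:
        return "bad program size"
    # Walk every path from instruction 0, following both sides of a branch.
    stack, seen = [0], set()
    while stack:
        pc = stack.pop()
        if pc in seen:
            continue
        seen.add(pc)
        if pc >= len(prog):
            return "fell off the end without exit"
        op, *args = prog[pc]
        if op == "exit":
            continue
        if op in ("mov", "jeq") and not 0 <= args[0] < NUM_REGS:
            return f"invalid register r{args[0]}"
        if op == "mov":
            stack.append(pc + 1)
        elif op in ("jmp", "jeq"):
            off = args[-1]
            if off < 0:
                return "back-edge (loop) detected"
            target = pc + 1 + off
            if target >= len(prog):
                return "jump out of bounds"
            stack.append(target)
            if op == "jeq":
                stack.append(pc + 1)  # fall-through path
        else:
            return f"unknown opcode {op}"
    return None

ok_prog = [("mov", 0, 0), ("exit",)]
loop_prog = [("mov", 0, 0), ("jmp", -2), ("exit",)]
print(verify(ok_prog))
print(verify(loop_prog))
```

The first program is accepted, while the second is rejected because its jump points backwards. The real verifier tracks far more state (register types, stack contents, pointer bounds), which is where most of its size comes from.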

From the outset, eBPF sets itself apart from LKMs with important safety controls in place. Only if all checks pass is the eBPF program loaded and compiled into the kernel, where it starts waiting for the right hook. Once triggered, the bytecode executes.

The end result is that eBPF lets programmers safely execute custom bytecode within the Linux kernel without modifying or adding to kernel source code. While still a far cry from replacing LKMs as a whole, eBPF programs introduce custom code to interact with protected hardware resources with minimal risk for the kernel.

How does eBPF work?

So far, I’ve reduced eBPF to its bare architecture, but there are more components working together, each of which has layers of complexity of its own.

Dissecting an eBPF program

Events and Hooking

As we have already covered, eBPF programs execute in an event-driven environment. They are triggered by kernel hooks. The diversity of hook locations is one of the many aspects that make eBPF so useful. A quick sampling of these includes:

System Calls - Inserted when user space functions transfer execution to the kernel
Function Entry and Exit - Intercepts calls to pre-existing functions
Network Events - Executes when packets are received
Kprobes and uprobes - Attach to probes for kernel or user functions
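As a minimal sketch of hooking, the BCC script below attaches a tiny eBPF program to the kernel entry point of the execve() system call through a kprobe. It assumes the bcc Python bindings are installed and that `main()` is called with root privileges; the function name `trace_execve` and the message it prints are made up for this example.

```python
# The eBPF program itself is written in restricted C and handed to BCC as text.
BPF_PROGRAM = r"""
int trace_execve(void *ctx) {
    bpf_trace_printk("execve called\n");
    return 0;
}
"""

def main():
    # Imported lazily so the file can be read and imported without bcc installed.
    from bcc import BPF

    b = BPF(text=BPF_PROGRAM)  # compile the C text to eBPF bytecode and load it
    # get_syscall_fnname() resolves the architecture-specific kernel symbol.
    b.attach_kprobe(event=b.get_syscall_fnname("execve"), fn_name="trace_execve")
    print("Tracing execve()... press Ctrl-C to stop")
    b.trace_print()  # stream the kernel trace pipe to stdout
```

Calling `main()` as root and then starting any program in another terminal should produce one trace line per execve() call.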

Helper Functions

When eBPF programs are triggered at their hook points, they can call helper functions. These special functions are what make eBPF feature-rich. For example, helpers can perform a wide variety of tasks:

Search, update, and delete key-value pairs in tables
Generate a pseudo-random number
Collect and flag tunnel metadata
Chain eBPF programs together, known as tail calls
Perform tasks with sockets, like binding, retrieving cookies, redirecting packets, etc.

These helper functions must be defined by the kernel, meaning there is a whitelist of calls eBPF programs can make. But their number is large and continues to grow.

eBPF Maps

eBPF maps allow eBPF programs to keep state between invocations and to share data with the user-space applications. An eBPF map is basically a key-value store, where values are generally treated as binary blobs of arbitrary data.

They are created using the `bpf()` syscall with the BPF_MAP_CREATE command and, like everything else in the Linux world, they are addressed via a file descriptor. Interacting with a map happens through the lookup, update, and delete commands of the same syscall.
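The semantics of those operations can be modelled in a few lines of Python. The class below is a user space toy, not a binding to the real syscall: `ToyHashMap` and its method names are invented for this sketch, while the three update flags and the error conditions noted in the comments follow the bpf(2) manual page. Keys and values are fixed-size binary blobs, as they are in a real map.

```python
import struct

BPF_ANY, BPF_NOEXIST, BPF_EXIST = 0, 1, 2  # update flags from the kernel UAPI

class ToyHashMap:
    """User space model of a BPF hash map with fixed-size keys and values."""

    def __init__(self, key_size, value_size, max_entries):
        self.key_size, self.value_size = key_size, value_size
        self.max_entries = max_entries
        self._data = {}

    def _check(self, blob, size, what):
        if len(blob) != size:
            raise ValueError(f"{what} must be exactly {size} bytes")

    def update(self, key, value, flags=BPF_ANY):
        self._check(key, self.key_size, "key")
        self._check(value, self.value_size, "value")
        exists = key in self._data
        if flags == BPF_NOEXIST and exists:
            raise FileExistsError("key already present")   # kernel: EEXIST
        if flags == BPF_EXIST and not exists:
            raise KeyError("key not present")              # kernel: ENOENT
        if not exists and len(self._data) >= self.max_entries:
            raise OverflowError("map is full")             # kernel: E2BIG
        self._data[key] = value

    def lookup(self, key):
        self._check(key, self.key_size, "key")
        return self._data.get(key)  # None models ENOENT

    def delete(self, key):
        self._check(key, self.key_size, "key")
        return self._data.pop(key, None) is not None

# Count events per PID: 4-byte key (u32 pid), 8-byte value (u64 counter).
conn_count = ToyHashMap(key_size=4, value_size=8, max_entries=2)
pid_key = struct.pack("=I", 2315)
for _ in range(3):
    current = conn_count.lookup(pid_key)
    count = struct.unpack("=Q", current)[0] if current else 0
    conn_count.update(pid_key, struct.pack("=Q", count + 1))
total = struct.unpack("=Q", conn_count.lookup(pid_key))[0]
print(total)
```

In a real deployment the eBPF program would do the counting inside the kernel, and a user space process would read the same map through its file descriptor.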

Executing an eBPF Program

Building eBPF Programs

The kernel expects all eBPF programs to be loaded as bytecode, so we need a way to create the bytecode using higher-level languages. The most popular toolchain for writing and debugging eBPF programs is called BPF Compiler Collection (BCC), and it is based on LLVM and Clang.

Just-In-Time (JIT) Compiler

After verification, eBPF bytecode is just-in-time (JIT) compiled into native machine code. eBPF has a modern design: it uses 64-bit registers and a fixed 64-bit instruction encoding, with 11 registers in total. This maps eBPF closely to hardware architectures such as x86_64, ARM, and arm64, amongst others. Fast compilation at runtime makes it possible for eBPF to remain performant even though it must first pass through a VM.
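To show what the 64-bit encoding looks like, the sketch below packs eBPF instructions by hand in Python. The field layout (8-bit opcode, two 4-bit register fields, 16-bit offset, 32-bit immediate) and the two opcode values come from the kernel's instruction set definition; the helper name `encode_insn` is invented here, and the packing assumes the little-endian layout used on x86_64 and arm64.

```python
import struct

def encode_insn(opcode, dst=0, src=0, off=0, imm=0):
    """Pack one 8-byte eBPF instruction (little-endian host layout)."""
    if not (0 <= dst <= 10 and 0 <= src <= 10):
        raise ValueError("eBPF registers are r0..r10")
    # dst occupies the low 4 bits of the register byte, src the high 4 bits.
    return struct.pack("<BBhi", opcode, (src << 4) | dst, off, imm)

BPF_MOV64_IMM = 0xB7  # BPF_ALU64 | BPF_MOV | BPF_K
BPF_EXIT = 0x95       # BPF_JMP | BPF_EXIT

# "r0 = 0; exit" is about the smallest program a loader will accept.
program = encode_insn(BPF_MOV64_IMM, dst=0, imm=0) + encode_insn(BPF_EXIT)
print(program.hex())
```

Because every instruction has the same width and the registers map onto native ones, a JIT compiler can translate this bytecode into machine code in a single straightforward pass.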

Figure 3 - eBPF Architecture

The important takeaway here is understanding that eBPF unlocks access to kernel level events without the typical restrictions found when changing kernel code directly. Summarizing, eBPF works by:

Compiling eBPF programs into bytecode
Verifying that programs execute safely in a VM before being loaded at the hook point
Attaching programs to hook points within the kernel that are triggered by specified events
Compiling at runtime for maximum efficiency
Calling helper functions to manipulate data when a program is triggered
Using maps (key-value pairs) to share data between the user space and kernel space and for keeping state.

eBPF in Action

Teleport is an open source multi-protocol identity-aware access proxy. It provides a convenient and secure way of accessing SSH servers, Kubernetes clusters, databases and other resources behind NAT. Think of it as a cloud-native replacement for OpenSSH.

One of the project goals was to provide the detailed audit log of what actually happens during SSH sessions. To achieve that, Teleport logs the following data:

Bytestreams of interactive sessions, so they can be replayed in a YouTube-like web interface.
JSON audit log of structured events.

The session recording shows what a user was typing in her terminal during an interactive session. If she executed a bash script, for instance, the recording will show that. But the recording will not show whether any file system changes took place, or whether the script downloaded or uploaded any data to/from this machine.

That’s what the JSON event log is for, and Teleport uses eBPF to “spy” on user’s actions during interactive SSH sessions. Consider, for example, the command:

echo Y3VybCBodHRwOi8vd3d3LmV4YW1wbGUuY29tCg== | base64 --decode | sh

Even though we can capture this command as printed out in the terminal, it means nothing to us as the user has obfuscated the command that is piped into sh by encoding it in base64. But by looking into the JSON log, we learn the user was attempting to obfuscate curl:

{
  "event": "session.command",
  "path": "/bin/sh",
  "program": "sh",
  "argv": [],
  "login": "centos",
  "user": "jsmith"
}
{
  "event": "session.command",
  "path": "/bin/base64",
  "program": "base64",
  "argv": [
    "--decode"
  ],
  "login": "centos",
  "user": "jsmith"
}
{
  "event": "session.command",
  "path": "/bin/curl",
  "argv": [
    "http://www.example.com"
  ],
  "program": "curl",
  "return_code": 0,
  "login": "centos",
  "user": "jsmith"
}
{
  "event": "session.network",
  "program": "curl",
  "src_addr": "172.31.43.104",
  "dst_addr": "93.184.216.34",
  "dst_port": 80,
  "login": "centos",
  "user": "jsmith",
  "version": 4
}
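The difference between the two views can be reproduced with a short Python sketch. It decodes the base64 payload from the command above and then rebuilds the command lines from two of the logged events. The one-event-per-line layout and the abbreviated fields are assumptions made for this example, not a description of Teleport's actual log format.

```python
import base64
import json

# What the terminal recording sees: an opaque base64 string.
payload = "Y3VybCBodHRwOi8vd3d3LmV4YW1wbGUuY29tCg=="
hidden = base64.b64decode(payload).decode().strip()

# Two of the events from the log above, abbreviated to the fields used here.
audit_log = """
{"event": "session.command", "program": "base64", "argv": ["--decode"]}
{"event": "session.command", "program": "curl", "argv": ["http://www.example.com"]}
"""

commands = []
for line in audit_log.strip().splitlines():
    event = json.loads(line)
    if event["event"] == "session.command":
        commands.append(" ".join([event["program"], *event["argv"]]))

print(hidden)
print(commands)
```

The decoded payload matches the curl invocation recorded in the structured events, which is exactly the information the terminal recording alone does not reveal.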

How did Teleport collect these events? By installing eBPF hooks at the beginning of the SSH session. Specifically, it uses three BPF programs to get this data: execsnoop to capture the script execution, opensnoop to capture files opened by the script, and tcpconnect to capture TCP connections established during the session.

Let’s focus on tcpconnect, which gives us the information in the final JSON object:

tcpconnect traces TCP connections. For a tool like Teleport that manages access using SSH certificates, knowing when TCP connections are made is vital. tcpconnect can be used to trace the connect() syscall, which initiates a connection on a socket. To trace these entries, tcpconnect inserts a kprobe into the kernel to dynamically break into the relevant routine. A kprobe collects debugging and performance information non-disruptively and can be inserted on virtually any instruction in the kernel.

from bcc import BPF

# initialize BPF (bpf_text holds the tool's C source) and hook tcp_v4_connect()
b = BPF(text=bpf_text)
b.attach_kprobe(event="tcp_v4_connect", fn_name="trace_connect_entry")
b.attach_kretprobe(event="tcp_v4_connect", fn_name="trace_connect_v4_return")

When the program is triggered along the code path, tcpconnect will start outputting information. The table below exemplifies some of this information.

# ./tcpconnect
PID   COMM  SADDR             DADDR           DPORT
-----------------------------------------------------
2315  curl  172.31.43.104     93.184.216.34   80

All this data has been collected with the help of helper functions. In fact, when we look at the C code embedded in the (Python) tool, we can see tcpconnect filling in the record behind the output above and calling the bpf_get_current_comm helper to obtain the command name.

...
struct ipv4_data_t data4 = {.pid = pid, .ip = ipver}; 
data4.saddr = skp->__sk_common.skc_rcv_saddr; 
data4.daddr = skp->__sk_common.skc_daddr; 
data4.dport = ntohs(dport); 
bpf_get_current_comm(&data4.task, sizeof(data4.task));
...
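On the user space side, the raw fields of such a record still have to be turned into the text shown in the table. The Python sketch below shows one way to do that with the standard library. It assumes the addresses arrive as unsigned 32-bit integers holding network-byte-order bytes; the helper name `u32_to_ipv4` and the sample values are chosen for this example.

```python
import socket
import struct

def u32_to_ipv4(addr):
    """Format an address stored as a raw u32 whose bytes are in network order."""
    return socket.inet_ntop(socket.AF_INET, struct.pack("=I", addr))

# Build sample values the way they would sit in kernel memory.
raw_saddr = struct.unpack("=I", socket.inet_aton("172.31.43.104"))[0]
raw_dport = socket.htons(80)  # ports are kept in network byte order too

# ntohs() is the same conversion the C snippet applies to dport.
print(u32_to_ipv4(raw_saddr), socket.ntohs(raw_dport))
```

The port is converted in the kernel snippet with ntohs(), whereas an address is typically left as raw bytes and only formatted once it reaches user space.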

Where to go from here

Now would be a good time to explore more technical documentation and articles, referencing this post if you start to get lost in the minutiae. Below is the list of resources I have referred to in this article, so you can explore on your own:

Read more about using BPF To Transform SSH Sessions into Structured Events
BCC - “BCC is a toolkit for creating efficient kernel tracing and manipulation programs, and includes several useful tools and examples […] BCC makes BPF programs easier to write, with kernel instrumentation in C (and includes a C wrapper around LLVM), and front-ends in Python and lua. It is suited for many tasks, including performance analysis and network traffic control.” BCC also provides an API for other programs to use.
bpftrace - “BPFtrace is a high-level tracing language [that] uses LLVM as a backend to compile scripts to BPF-bytecode and makes use of BCC for interacting with the Linux BPF system, as well as existing Linux tracing capabilities: kernel dynamic tracing (kprobes), user-level dynamic tracing (uprobes), and tracepoints.”
Generic libraries for Go, C/C++, and Rust
Probably the most exhaustive accumulation of eBPF resources is Quentin Monnet’s blog, Whirl Offload.

If you’ve made it to this point, my hope is you’ve got at least a baseline understanding of what eBPF is, why it’s important, and the basics of how it works. In this article, we have briefly covered the following points:

eBPF is a revolutionary technology because it lets programmers execute custom bytecode within the kernel without having to change the kernel or load kernel modules.
eBPF is event-driven, i.e. each eBPF program is an event handler. These events are called “hooks”.
eBPF programs interact with user-space programs via eBPF maps, which are key-value stores.
You can see eBPF in action by playing with the audit log in Teleport, an open source alternative to OpenSSH.

About the Author

Virag Mody joined Teleport in January of 2020, after co-founding a software code auditing company for Ethereum applications. Having joined Teleport, Virag continues to learn about trending technologies and produces high-quality written and video content. In his free time, Virag enjoys rock climbing, video games, and walking his dog.
