Zero to Performance Hero: How to Benchmark and Profile Your eBPF Code in Rust

Key Takeaways

  • Kernel-space eBPF code can be written in C or Rust. User-space bindings to eBPF are often written in C, Rust, Go, or Python. Using Rust for kernel and user-space code provides unmatched speed, safety, and developer experience.
  • Blind optimization is the root of all evil. Profiling your code allows you to see where to focus your performance optimizations.
  • Different profiling techniques illuminate different areas of interest. Use several profiling tools to triangulate the root cause of your performance problems.
  • Benchmarking allows you to measure your performance optimizations for your kernel and user space eBPF Rust code.
  • Continuous benchmarking with tools like Bencher helps prevent performance regressions before they get released.

The silent eBPF revolution is well underway. From networking to observability to security, Extended Berkeley Packet Filter (eBPF) is used across the cloud-native world to enable faster and more customizable computing. eBPF is a virtual machine within the Linux kernel that allows for extending the kernel’s functionality in a safe and maintainable way. This has empowered modern systems software to move more custom logic into the kernel.

As more logic moves into the kernel, we must ensure that our systems stay performant. Profiling eBPF code is crucial as it allows developers to identify areas that need performance optimizations. This prevents blind optimizations, which often lead to needless complexity. Different profiling techniques can highlight various areas of interest, helping to pinpoint the root cause of performance problems.

In this article, we will walk through creating a basic eBPF program in Rust. This simple and approachable eBPF program will intentionally include a performance regression. We will then explore using an instrumenting profiler and a sampling profiler to locate this performance bug. Once we have located the bug, we must create benchmarks to measure our performance optimizations. Finally, we will track our benchmarks with a continuous benchmarking tool to catch performance regressions as a part of continuous integration (CI).

Getting Started with eBPF

eBPF is a virtual machine within the Linux kernel that executes bytecode compiled from languages like C and Rust. eBPF allows you to extend the kernel’s functionality without developing and maintaining a kernel module. The eBPF verifier ensures that your code doesn’t harm the kernel by checking it at load time.

These load time checks include: a one million instruction limit, no unbounded loops, no heap allocations, and no waiting for user space events.
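To give a feel for what these constraints mean in practice, here is a minimal sketch in plain Rust (not eBPF code) of the style the verifier favors: a fixed-size buffer and a loop whose bound is known at compile time. The names `MAX_FRAMES` and `count_frames` are hypothetical and exist only for this illustration.

```rust
/// Hypothetical upper bound on how many entries we are willing to scan.
const MAX_FRAMES: usize = 16;

/// Count the leading non-zero entries in a fixed-size buffer.
/// The loop runs at most `MAX_FRAMES` times and nothing is heap allocated.
fn count_frames(frames: &[u64; MAX_FRAMES]) -> usize {
    let mut count = 0;
    // The array length is a compile-time constant, so the loop is bounded.
    for &frame in frames.iter() {
        if frame == 0 {
            // Treat a zero address as the end of the data.
            break;
        }
        count += 1;
    }
    count
}

fn main() {
    let mut frames = [0u64; MAX_FRAMES];
    frames[0] = 0x1000;
    frames[1] = 0x2000;
    println!("frames: {}", count_frames(&frames));
}
```

A real eBPF program is checked against the compiled bytecode rather than the source, so this only approximates the idea.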

Once verified, the eBPF bytecode is loaded into the eBPF virtual machine and executed within the kernel to perform tasks like tracing syscalls, probing user or kernel space, capturing perf events, instrumenting Linux Security Modules (LSM), and filtering packets.

Initially known as the Berkeley Packet Filter (BPF), it evolved into eBPF as new use cases were added.

eBPF programs can be written with several different programming languages and tool sets. This includes the canonical libbpf developed in C within the Linux kernel source tree. Additional tools like bcc and ebpf-go allow user space programs to be written in Python and Go. However, they require C and libbpf for the eBPF side of things. The Rust eBPF ecosystem includes libbpf-rs, RedBPF, and Aya. Aya enables writing user space and eBPF programs in Rust without a dependency on libbpf. We will be using Aya throughout the rest of this article. Aya will allow us to leverage Rust’s strengths in performance, safety, and productivity for systems programming.

Building an eBPF Profiler

For our example, we will create a very basic eBPF sampling profiler. A sampling profiler sits outside your target application, and at a set interval, it samples the state of your application. The Parca Agent and Pyroscope Agent are two examples of production-grade eBPF sampling profilers. We will discuss the benefits and drawbacks of sampling profilers in depth later in this article. For now, it’s just important to understand that our goal is to periodically get a snapshot of the stack of a target application. Let’s dive in!

First, use Aya to set up an eBPF development environment. Name your project profiler.

Inside of profiler-ebpf/src/, we’re going to add:

// In eBPF, we can’t use the Rust standard library.
#![no_std]
// The kernel calls our `perf_event`, so there is no `main` function.
#![no_main]

// We use the `aya_ebpf` crate to make the magic happen.
use aya_ebpf::{
    helpers::gen::{bpf_get_stack, bpf_ktime_get_ns},
    macros::{map, perf_event},
    maps::RingBuf,
    programs::PerfEventContext,
    EbpfContext,
};
use profiler_common::{Sample, SampleHeader};

// Create a global variable that will be set by user space.
// It will be set to the process identifier (PID) of the target application.
// To do this, we must use the `no_mangle` attribute.
// This keeps Rust from mangling the `PID` symbol so it can be properly linked.
#[no_mangle]
static PID: u32 = 0;

// Use the Aya `map` procedural macro to create a ring buffer eBPF map.
// This map will be used to hold our profile samples.
// The byte size for the ring buffer must be a power of 2 multiple of the page size.
#[map]
static SAMPLES: RingBuf = RingBuf::with_byte_size(4_096 * 4_096, 0);

// Use the Aya `perf_event` procedural macro to create an eBPF perf event.
// We take in one argument, the context for the perf event.
// This context is provided by the Linux kernel.
#[perf_event]
pub fn perf_profiler(ctx: PerfEventContext) -> u32 {
    // Reserve memory in the ring buffer to fit our sample.
    // If the ring buffer is full, then we will return early.
    let Some(mut sample) = SAMPLES.reserve::<Sample>(0) else {
        aya_log_ebpf::error!(&ctx, "Failed to reserve sample.");
        return 0;
    };

    // The rest of our code is `unsafe` as we are dealing with raw pointers.
    unsafe {
        // Use the eBPF `bpf_get_stack` helper function
        // to get a user space stack trace.
        let stack_len = bpf_get_stack(
            // Provide the Linux kernel context for the tracing program.
            ctx.as_ptr(),
            // Write the stack trace to the reserved sample buffer.
            // We make sure to offset by the size of the sample header.
            sample.as_mut_ptr().byte_add(SampleHeader::SIZE) as *mut core::ffi::c_void,
            // The size of the reserved sample buffer allocated for the stack trace.
            Sample::STACK_SIZE as u32,
            // Set the flag to collect a user space stack trace.
            aya_ebpf::bindings::BPF_F_USER_STACK as u64,
        );

        // If the length of the stack trace is negative, then there was an error.
        let Ok(stack_len) = u64::try_from(stack_len) else {
            aya_log_ebpf::error!(&ctx, "Failed to get stack.");
            // If there was an error, discard the sample.
            sample.discard(aya_ebpf::bindings::BPF_RB_NO_WAKEUP as u64);
            return 0;
        };

        // Write the sample header to the reserved sample buffer.
        // This header includes important metadata about the stack trace.
        core::ptr::write_unaligned(
            sample.as_mut_ptr() as *mut SampleHeader,
            SampleHeader {
                // Get the current time in nanoseconds since system boot.
                ktime: bpf_ktime_get_ns(),
                // Get the current thread group ID.
                pid: ctx.tgid(),
                // Get the current thread ID, confusingly called the `pid`.
                tid: ctx.pid(),
                // The length of the stack trace.
                // This is needed to safely read the stack trace in user space.
                stack_len,
            },
        );
    }

    // Commit our sample as an entry in the ring buffer.
    // The sample will then be made visible to the user space.
    sample.submit(0);

    // Our result is an unsigned 32-bit integer, which we always set to `0`.
    0
}

// Finally, we have to create a custom panic handler.
// This custom panic handler tells the Rust compiler that we should never panic.
// Making this guarantee is required to satisfy the eBPF verifier.
#[panic_handler]
fn panic(_info: &core::panic::PanicInfo) -> ! {
    unsafe { core::hint::unreachable_unchecked() }
}

This Rust program uses Aya to turn the perf_profiler function into an eBPF perf event. Every time this perf event is triggered, we capture a stack trace for our target application using the bpf_get_stack eBPF helper function.
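The listing notes that the ring buffer's byte size must be a power-of-2 multiple of the page size. As a small sanity check of that rule, the sketch below validates a candidate size in plain Rust. The function name `is_valid_ring_buf_size` is hypothetical, and the 4,096-byte page size is an assumption that matches the listing but varies by platform.

```rust
/// Assumed page size in bytes; 4 KiB is common, but it varies by platform.
const PAGE_SIZE: u64 = 4_096;

/// Returns true if `byte_size` is a power-of-2 multiple of the page size.
fn is_valid_ring_buf_size(byte_size: u64) -> bool {
    byte_size >= PAGE_SIZE
        && byte_size % PAGE_SIZE == 0
        && (byte_size / PAGE_SIZE).is_power_of_two()
}

fn main() {
    // The size used for the `SAMPLES` map: 4,096 pages of 4,096 bytes (16 MiB).
    println!("16 MiB valid: {}", is_valid_ring_buf_size(4_096 * 4_096));
    // Three pages is a multiple of the page size, but not a power of 2.
    println!("3 pages valid: {}", is_valid_ring_buf_size(3 * 4_096));
}
```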

To get our eBPF loaded into the kernel, we must set things up in user space. Inside of profiler/src/, we’re going to add the following:

// In user space, we use the `aya` crate to make the magic happen.
use aya::{include_bytes_aligned, maps::ring_buf::RingBuf, programs::perf_event, BpfLoader};

// Run our `main` function using the `tokio` async runtime.
// On success, simply return a unit tuple.
// If there is an error, return a catch-all `anyhow::Error`.
#[tokio::main]
async fn main() -> Result<(), anyhow::Error> {
    // Initialize the user space logger.

    // Our user space program expects one and only one argument,
    // the process identifier (PID) for the process to be profiled.
    let pid: u32 = std::env::args().last().unwrap().parse()?;

    // Use Aya to set up our eBPF program.
    // The eBPF byte code is included in our user space binary
    // to make it much easier to deploy.
    // When loading the eBPF byte code,
    // set the PID of the process to be profiled as a global variable.
    let mut bpf = BpfLoader::new()
        .set_global("PID", &pid, true)
    let mut bpf = BpfLoader::new()
        .set_global("PID", &pid, true)
    // Initialize the eBPF logger.
    // This allows us to receive the logs from our eBPF program.
    aya_log::BpfLogger::init(&mut bpf)?;

    // Get a handle to our `perf_event` eBPF program named `perf_profiler`.
    let program: &mut perf_event::PerfEvent =
        bpf.program_mut("perf_profiler").unwrap().try_into()?;
    // Load our `perf_event` eBPF program into the Linux kernel.
    program.load()?;
    // Attach to our `perf_event` eBPF program that is now running in the Linux kernel.
    program.attach(
        // We are expecting to attach to a software application.
        perf_event::PerfTypeId::Software,
        // We will use the `cpu-clock` counter to time our sampling frequency.
        perf_event::perf_sw_ids::PERF_COUNT_SW_CPU_CLOCK as u64,
        // We want to profile just a single process across any CPU.
        // That process is the one we have the PID for.
        perf_event::PerfEventScope::OneProcessAnyCpu { pid },
        // We want to collect samples 100 times per second.
        perf_event::SamplePolicy::Frequency(100),
        // We want to profile any child processes spawned by the profiled process.
        true,
    )?;

    // Spawn a task to handle reading profile samples.
    tokio::spawn(async move {
        // Create a user space handle to our `SAMPLES` ring buffer eBPF map.
        let samples = RingBuf::try_from(bpf.take_map("SAMPLES").unwrap()).unwrap();
        // Create an asynchronous way to poll the samples ring buffer.
        let mut poll = tokio::io::unix::AsyncFd::new(samples).unwrap();

        loop {
            let mut guard = poll.readable_mut().await.unwrap();
            let ring_buf = guard.get_inner_mut();
            // While the ring buffer is valid, try to read the next sample.
            // To keep things simple, we just log each sample.
            while let Some(sample) = {
                // Don't look at me!
                let _oops = Box::new(std::thread::sleep(std::time::Duration::from_millis(

    // Run our program until the user enters `CTRL` + `c`.

Our user-space code can now load our perf event eBPF program. Once loaded, our eBPF program will use the cpu-clock counter to time our sampling frequency. We will sample the target application and capture a stack trace one hundred times a second. This stack trace sample is then sent to user space via the ring buffer. Finally, the stack trace sample is printed to standard out.

This is obviously a very simple profiler. We aren’t symbolicating the call stack. All we have is a list of memory addresses with some metadata. Nor are we able to sample our target program while it is sleeping. For that, we would have to add a sched tracepoint for sched_switch. However, this is already enough code for a performance regression. Did you spot it?
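To make "a list of memory addresses with some metadata" concrete, here is a hedged sketch of how user space might decode one raw sample. The byte layout is an assumption made for this illustration (a 24-byte header of `ktime`, `pid`, `tid`, and `stack_len`, followed by 8-byte addresses); the real `profiler_common` types are not shown in this article, and `decode_stack` and `build_sample` are hypothetical helpers.

```rust
/// Hypothetical header size: ktime (8) + pid (4) + tid (4) + stack_len (8) bytes.
const HEADER_SIZE: usize = 24;

/// Read a native-endian `u64` from exactly 8 bytes.
fn read_u64(bytes: &[u8]) -> u64 {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(bytes);
    u64::from_ne_bytes(buf)
}

/// Decode the stack addresses that follow the header in a raw sample.
/// Returns `None` if the sample is too short or the length is inconsistent.
fn decode_stack(sample: &[u8]) -> Option<Vec<u64>> {
    let header = sample.get(..HEADER_SIZE)?;
    // Assumed layout: `stack_len` (in bytes) is the last field of the header.
    let stack_len = read_u64(&header[16..24]) as usize;
    if stack_len % 8 != 0 {
        return None;
    }
    let end = HEADER_SIZE.checked_add(stack_len)?;
    let stack = sample.get(HEADER_SIZE..end)?;
    Some(stack.chunks_exact(8).map(read_u64).collect())
}

/// Build a raw sample with the assumed layout, for demonstration only.
fn build_sample(addresses: &[u64]) -> Vec<u8> {
    // Leave `ktime`, `pid`, and `tid` zeroed.
    let mut bytes = vec![0u8; 16];
    bytes.extend_from_slice(&((addresses.len() * 8) as u64).to_ne_bytes());
    for address in addresses {
        bytes.extend_from_slice(&address.to_ne_bytes());
    }
    bytes
}

fn main() {
    let sample = build_sample(&[0x5555_0000_1000, 0x5555_0000_2000]);
    match decode_stack(&sample) {
        Some(stack) => {
            for address in stack {
                println!("{:#x}", address);
            }
        }
        None => println!("malformed sample"),
    }
}
```

Reading the length from the header before touching the stack bytes is what lets user space avoid reading past the data the kernel actually wrote.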

Profiling the Profiler

Users of our simple profiler have given us feedback that it seems rather sluggish. They don’t mind having to symbolicate the call stack for their sleepless programs by hand. What bothers them is that the samples take a while to print. Sometimes, things even appear to get backed up. Right about now, the seemingly ubiquitous adage "premature optimization is the root of all evil" usually starts to get bandied around.

However, let’s take a look at what Donald Knuth actually said back in 1974:

Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.

Donald E. Knuth, Structured Programming with go to Statements

So that is exactly what we need to do: look for "opportunities in that critical 3%." To do so, we will explore two different kinds of profilers: sampling and instrumenting. We will then use each type of profiler to find that critical 3% in our simple profiler.

Our simple eBPF profiler is an example of a sampling profiler. It sits outside of the target application. At a given interval, it collects a sample of the target application’s stack trace. Because a sampling profiler only runs periodically, it has relatively little overhead. However, this means that we may miss some things. By analogy, this is like watching a movie by only looking at one out of every one hundred frames. Movies are usually shot at 24 frames per second. That means you would only see a new frame about once every 4 seconds. Besides being a boring way to watch a film, this can also lead to a distorted view of what is happening. The frame you see could be a momentary flashback (overweighting). Conversely, there could have just been an amazing action sequence, and you only caught the closeup of the lead actor’s face on either side of it (underweighting).
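The overweighting and underweighting effect can be reproduced with a toy simulation. In the sketch below, a made-up function `hot` runs for 3 out of every 10 ticks, so its true share of time is 30%. The workload, the tick model, and the function names are all hypothetical; the point is only that a sampling interval which lines up with the workload's rhythm can badly skew the estimate.

```rust
/// Returns true if the hypothetical function `hot` is running at `tick`.
/// In this made-up workload, `hot` runs for 3 out of every 10 ticks.
fn hot_is_running(tick: u64) -> bool {
    tick % 10 < 3
}

/// Estimate the share of time spent in `hot` by sampling every `interval` ticks,
/// starting at `offset`.
fn sampled_share(total_ticks: u64, interval: u64, offset: u64) -> f64 {
    let mut samples = 0u64;
    let mut hits = 0u64;
    let mut tick = offset;
    while tick < total_ticks {
        samples += 1;
        if hot_is_running(tick) {
            hits += 1;
        }
        tick += interval;
    }
    if samples == 0 {
        return 0.0;
    }
    hits as f64 / samples as f64
}

fn main() {
    // Looking at every tick gives the true share.
    println!("every tick:      {:.2}", sampled_share(1_000, 1, 0));
    // Sampling in lockstep with the workload always lands on `hot`.
    println!("every 100 ticks: {:.2}", sampled_share(1_000, 100, 0));
    // The same interval with a different start never sees `hot`.
    println!("offset by 5:     {:.2}", sampled_share(1_000, 100, 5));
    // An interval that does not line up with the workload gets close.
    println!("every 7 ticks:   {:.2}", sampled_share(1_000, 7, 0));
}
```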

The other major kind of profiler is an instrumenting profiler. Unlike a sampling profiler, an instrumenting profiler is a part of the target application. Inside the target application, an instrumenting profiler collects information about the work being done. This usually leads instrumenting profilers to have a much higher overhead than sampling profilers. Therefore, a sampling profiler is more likely to give you an accurate picture of what is happening in production than an instrumenting profiler. To continue our analogy from above, an instrumenting profiler is like watching a movie shot on an old 35mm hand-cranked camera. Being hand-cranked, it was challenging to film at 24 frames per second consistently. So, cinematographers settled for around 18 frames per second. Likewise, you can view all of the proverbial frames with an instrumenting profiler, but everything has to run much slower. You can run right into the observer effect.
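As a rough illustration of what "being part of the target application" means, here is a minimal hand-rolled instrumentation sketch. The `Instrument` type is hypothetical and far simpler than a real instrumenting profiler, but it shows where the overhead comes from: every measured call pays for the extra bookkeeping.

```rust
use std::time::{Duration, Instant};

/// A tiny, hypothetical instrumentation record for one piece of work.
#[derive(Debug, Default)]
struct Instrument {
    calls: u64,
    total: Duration,
}

impl Instrument {
    /// Run `work`, recording that it was called and how long it took.
    fn measure<T>(&mut self, work: impl FnOnce() -> T) -> T {
        let start = Instant::now();
        let result = work();
        self.total += start.elapsed();
        self.calls += 1;
        result
    }
}

fn main() {
    let mut instrument = Instrument::default();
    let mut sum = 0u64;
    for n in 0..1_000u64 {
        // Every call is observed, unlike with a sampling profiler.
        sum += instrument.measure(|| n * 2);
    }
    println!(
        "sum={} calls={} total={:?}",
        sum, instrument.calls, instrument.total
    );
}
```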

Sampling Profiler

The go-to sampling profiler on Linux is perf. Under the hood, perf uses the same perf events as our simple profiler. There is a fantastic tool for Rust developers that wraps perf and generates beautiful flame graphs. It is aptly named flamegraph. Flame graphs, created by Brendan Gregg, are a technique for visualizing stack traces.

To get started, follow the flamegraph installation steps. Once you have flamegraph installed, we can finally profile the profiler!

The flame graph that is produced is an interactive SVG file. The length along the x-axis indicates the percentage of time that a stack was present in the samples. This is accomplished by sorting the stacks alphabetically and then merging identically named stacks into a single rectangle. It is important to note that the x-axis of a flame graph is not sorted by time. Instead, it is meant to show the proportion of time used, like a mini rectangular pie chart for each row of the diagram. The height along the y-axis indicates the stack depth, going from the bottom up. That is, the longest-lived stacks are at the bottom, and newer generations are at the top. Therefore, the stack frames with a top edge exposed were the bits of code that were actively running when a sample was taken.
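The sort-and-merge step can be sketched in a few lines. The example below folds a handful of made-up stack samples into counts, using a `BTreeMap` so the merged stacks stay in alphabetical order. The frame names and the `fold_stacks` helper are hypothetical; real tools work on symbolicated perf output rather than hand-written vectors.

```rust
use std::collections::BTreeMap;

/// Merge identical stacks and count how often each one was sampled.
/// A `BTreeMap` keeps the folded stacks sorted alphabetically.
fn fold_stacks(samples: &[Vec<&str>]) -> BTreeMap<String, u64> {
    let mut folded = BTreeMap::new();
    for stack in samples {
        // Frames are joined root-first, for example `main;poll;sleep`.
        *folded.entry(stack.join(";")).or_insert(0) += 1;
    }
    folded
}

fn main() {
    let samples = vec![
        vec!["main", "poll", "sleep"],
        vec!["main", "load"],
        vec!["main", "poll", "sleep"],
        vec!["main", "poll"],
    ];
    let folded = fold_stacks(&samples);
    let total = samples.len() as f64;
    for (stack, count) in &folded {
        // The width of a rectangle is proportional to this share.
        println!("{} {} ({:.0}%)", stack, count, *count as f64 / total * 100.0);
    }
}
```

Because the map is keyed by the joined stack, the order in which samples arrived is lost, which is why the x-axis of a flame graph does not represent time.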

Zooming in on this peak, we can see the call stack for our task that reads from the samples map. We seem to be doing quite a bit of sleeping. Now, let’s hop over to using an instrumenting profiler to get another vantage point.

Instrumenting Profiler

There are many different things that one could measure at runtime within their application. Some are distinctive to the application under observation, and others are more general. For measures particular to your application, the counts crate is a valuable tool. A measure that is useful for almost all applications is heap allocations. The easiest way to measure heap allocations in Rust is with the dhat-rs crate.

To use dhat-rs, we have to update our profiler/src/ file:


// Create a custom global allocator
#[cfg(feature = "dhat-heap")]
#[global_allocator]
static ALLOC: dhat::Alloc = dhat::Alloc;

#[tokio::main]
async fn main() -> Result<(), anyhow::Error> {
    // Instantiate an instance of the heap instrumenting profiler
    #[cfg(feature = "dhat-heap")]
    let _profiler = dhat::Profiler::new_heap();

    // The rest of `main` is unchanged.
}

With dhat-heap added as a feature, and our release builds set to keep debug symbols in our Cargo.toml file, we can now run our simple profiler with the --features dhat-heap option.

dhat: Total:     21,310,798 bytes in 135,500 blocks
dhat: At t-gmax: 13,715,186 bytes in 35,036 blocks
dhat: At t-end:  64,173 bytes in 47 blocks
dhat: The data has been saved to dhat-heap.json, and is viewable with dhat/dh_view.html

The Total is the total memory allocated by our simple profiler over its entire run. That is a total of 21,310,798 bytes in 135,500 blocks (allocations). Next, At t-gmax indicates the largest that the heap got while running. Finally, At t-end is the size of the heap at the end of our application.
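The relationship between these three figures is easier to see with a toy model. The sketch below tracks a short, made-up sequence of allocations and frees. The `HeapStats` type is hypothetical and is not how DHAT is implemented; it only mirrors the meaning of the total, the global maximum, and the end-of-run values.

```rust
/// A hypothetical tracker that mirrors the three figures DHAT reports.
#[derive(Debug, Default)]
struct HeapStats {
    /// Every byte ever allocated (like `Total`).
    total_bytes: u64,
    /// Every block ever allocated (like the block count in `Total`).
    total_blocks: u64,
    /// Bytes currently live (this is `At t-end` once the program finishes).
    current_bytes: u64,
    /// The largest value `current_bytes` ever reached (like `At t-gmax`).
    peak_bytes: u64,
}

impl HeapStats {
    fn alloc(&mut self, bytes: u64) {
        self.total_bytes += bytes;
        self.total_blocks += 1;
        self.current_bytes += bytes;
        self.peak_bytes = self.peak_bytes.max(self.current_bytes);
    }

    fn free(&mut self, bytes: u64) {
        self.current_bytes = self.current_bytes.saturating_sub(bytes);
    }
}

fn main() {
    let mut stats = HeapStats::default();
    stats.alloc(100);
    stats.alloc(50);
    stats.free(100);
    stats.alloc(20);
    println!(
        "Total: {} bytes in {} blocks",
        stats.total_bytes, stats.total_blocks
    );
    println!("At t-gmax: {} bytes", stats.peak_bytes);
    println!("At t-end: {} bytes", stats.current_bytes);
}
```

A large gap between the total and the peak, as in the DHAT output above, suggests many short-lived allocations rather than one large, long-lived one.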

As for that dhat-heap.json, you can open it in the online viewer.


This shows you a tree structure of when and where heap allocations occurred. The outer nodes are the parent, and the inner nodes are its children. That is, the longest-lived stack frames are on the outside, and newer generations are on the inside. We can examine the allocation stack trace by zooming in on one of those blocks.

Here, the highest-numbered field will be the line from our source code. As we descend numerically, we actually go up the stack trace. Now spin around three times and tell me which way an icicle graph goes.

Looking at the percentages in the DHAT viewer, it seems like we are doing quite a bit of allocating. To get a more visual representation of the DHAT results, we can open them in the Firefox Profiler. The Firefox Profiler also allows you to create shareable links. This is the link for my DHAT profile.

At this point, I think we have narrowed down the culprit:

// Don't look at me!
let _oops = Box::new(std::thread::sleep(std::time::Duration::from_millis(

We could probably remove this line and call it a day. However, let’s heed Donald Knuth's words and really make sure we have found that critical 3%.

Benchmarking the Profiler

It seems like our slowdown is in the user space, so we are going to focus our benchmarking efforts there. If that were not the case, we would have to build a custom eBPF benchmarking harness. Lucky for us, we can use a less bespoke solution to test our user-space source code.

We will need to refactor our profiler/src/ file. Benchmarks in Rust can only be run against libraries and not binaries. Thus, we must create a new profiler/src/ file that both our binary and benchmarks will use.

Refactoring our code to break out our sample processing logic gives us this library function:

pub fn process_sample(sample: profiler_common::Sample) -> Result<(), anyhow::Error> {
    // Don't look at me!
    let _oops = Box::new(std::thread::sleep(std::time::Duration::from_millis(


Next, we will add benchmarks using Criterion. After adding Criterion as our benchmarking harness in our Cargo.toml, we can create a benchmark for our process_sample library function.

// The benchmark function for `process_sample`
fn bench_process_sample(c: &mut criterion::Criterion) {
    c.bench_function("process_sample", |b| {
        // Criterion will run our benchmark multiple times
        // to try to get a statistically significant result.
        b.iter(|| {
            // Call our `process_sample` library function with a test sample.
            profiler::process_sample(profiler_common::Sample::default()).unwrap();
        })
    });
}

// Create a custom benchmarking harness named `benchmark_profiler`
criterion::criterion_main!(benchmark_profiler);
// Register our `bench_process_sample` benchmark
// with our custom `benchmark_profiler` benchmarking harness.
criterion::criterion_group!(benchmark_profiler, bench_process_sample);

When we run our benchmark with cargo bench, we get a result that looks something like this:

Benchmarking process_sample: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 89.5s, or reduce sample count to 10.
Benchmarking process_sample: Collecting 100 samples in estimated 89.471 s (10
process_sample          time:   [379.52 ms 427.63 ms 476.17 ms]

Now let's remove that pesky oops line from process_sample and see what happens:

Benchmarking process_sample: Collecting 100 samples in estimated 5.0002 s (12
process_sample          time:   [40.614 ns 40.771 ns 40.937 ns]
                        change: [-100.000% -100.000% -100.000%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

Excellent! Criterion can compare the results between our local runs and let us know that our performance has improved. If you're interested in a step-by-step guide, you can also dig deeper into how to benchmark Rust code with Criterion. Going the other way, if we add that oops line back, Criterion will let us know we have a performance regression.

Benchmarking process_sample: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 62.0s, or reduce sample count to 10.
Benchmarking process_sample: Collecting 100 samples in estimated 62.013 s (10
process_sample          time:   [502.01 ms 554.24 ms 606.55 ms]
                        change: [+1229331031% +1349277891% +1484780037%] (p = 0.00 < 0.05)
                        Performance has regressed.
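The change line is a relative difference against the previous run. As a back-of-the-envelope check, the sketch below applies the usual percent-change formula to the point estimates from the first two runs above. The `percent_change` helper is hypothetical, and Criterion derives its own figures from the full sample distributions, so its numbers will not match this simple arithmetic exactly.

```rust
/// Relative change from `baseline` to `new`, expressed as a percentage.
fn percent_change(baseline: f64, new: f64) -> f64 {
    (new - baseline) / baseline * 100.0
}

fn main() {
    // Point estimates from the runs above, both converted to nanoseconds.
    let with_oops_ns = 427.63 * 1_000_000.0;
    let without_oops_ns = 40.771;

    // Going from hundreds of milliseconds to tens of nanoseconds
    // is so large a drop that it rounds to -100.000%.
    println!("{:.3}%", percent_change(with_oops_ns, without_oops_ns));

    // A more everyday example: 100 ns to 150 ns is a 50% regression.
    println!("{:.3}%", percent_change(100.0, 150.0));
}
```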

It’s tempting to call this a job well done. We have found and fixed our opportunity in that critical 3%. However, what’s preventing us from introducing another performance regression like oops in the future? Surprisingly, for most software projects, the answer is "Nothing." This is where continuous benchmarking comes in.

Continuous Benchmarking

Continuous benchmarking is a software development practice that involves frequent, automated benchmarking to quickly detect performance regressions. This reduces the cycle time for detecting performance regressions from days and weeks to hours and minutes. For the same reasons that unit tests are run as part of continuous integration for each code change, benchmarks should be run as part of continuous benchmarking for each code change.
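Conceptually, detecting a regression automatically comes down to comparing each new result with what came before. The sketch below shows one deliberately simple way to do that: flag a result that exceeds the historical mean by more than an allowed fraction. This is an illustrative assumption and not the algorithm used by any particular tool; the `is_regression` helper and the 10% limit are hypothetical.

```rust
/// Returns true if `latest` exceeds the mean of `history`
/// by more than `max_increase` (for example, 0.10 for 10%).
fn is_regression(history: &[f64], latest: f64, max_increase: f64) -> bool {
    if history.is_empty() {
        // With no history there is nothing to compare against.
        return false;
    }
    let mean = history.iter().sum::<f64>() / history.len() as f64;
    latest > mean * (1.0 + max_increase)
}

fn main() {
    // Made-up latencies in nanoseconds from earlier runs on the main branch.
    let history = [40.0, 41.0, 42.0];

    for latest in [44.0, 46.0] {
        if is_regression(&history, latest, 0.10) {
            println!("{} ns: regression, fail the CI job", latest);
        } else {
            println!("{} ns: within the allowed range", latest);
        }
    }
}
```

Benchmark results are noisy, so a fixed percentage is a blunt instrument; statistical thresholds that account for variance tend to produce fewer false alarms.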

To add continuous benchmarking to our simple profiler, we’re going to use Bencher. Bencher is an open source continuous benchmarking tool. This means you can easily self-host Bencher. However, for this tutorial, we will use a free account on Bencher Cloud. Go ahead and sign up for a free account. Once you are logged in, new user onboarding should provide you with an API token and ask you to name your first project. Name your project Simple Profiler. Next, follow the instructions to install the bencher CLI. With the bencher CLI installed, we can now start tracking our benchmarks.

bencher run \
    --project simple-profiler \
    --token $BENCHER_API_TOKEN \
    cargo bench

This command uses the bencher CLI to run cargo bench for us. The bencher run command parses the results of cargo bench and sends them to the Bencher API server under our Simple Profiler project. Click on the link in the CLI output to view a plot of your first results. After running our benchmarks a couple times, my Perf Page looks like this:

By saving our results to Bencher, we can now track and compare our results over time and across several dimensions. Bencher supports tracking results based on the following:

  • Branch: The git branch used (ex: main)
  • Testbed: The testing environment (ex: ubuntu-latest for GitHub Actions)
  • Benchmark: The performance test that was run (ex: process_sample)
  • Measure: The unit of measure for the benchmark (ex: latency in nanoseconds)

Now that we have benchmark tracking in place, it is time to take care of the "continuous" part. Step-by-step guides are available for continuous benchmarking in GitHub Actions and GitLab CI/CD. For our example, though, we will implement continuous benchmarking without worrying about the specific CI provider.

We will have two different CI jobs. One will track our default main branch, and the other will catch performance regressions in candidate branches (pull requests, merge requests, etc.).

For our main branch job, we’ll have a command like this:

bencher run \
    --project simple-profiler \
    --token $BENCHER_API_TOKEN \
    --branch main \
    --testbed ci-runner \
    --err \
    cargo bench

For clarity, we explicitly set our branch as main. We also set our testbed to a name for the CI runner, ci-runner. Finally, we set things to fail if we generate an alert with the --err flag.

For the candidate branch, we’ll have a command like this:

bencher run \
    --project simple-profiler \
    --token $BENCHER_API_TOKEN \
    --branch $CANDIDATE_BRANCH \
    --branch-start-point $DEFAULT_BRANCH \
    --branch-start-point-hash $DEFAULT_BRANCH_HASH \
    --testbed ci-runner \
    --err \
    cargo bench

Here, things get a little more complicated. Since we want our candidate branch to be compared to our default branch, we will need to use some environment variables provided by our CI system.

  • $CANDIDATE_BRANCH should be the candidate branch name
  • $DEFAULT_BRANCH should be the default branch name (e.g., main)
  • $DEFAULT_BRANCH_HASH should be the current default branch git hash

For a more detailed, step-by-step walkthrough, see how to track benchmarks in CI. If you want to explore real-world examples of continuous benchmarking, you can do so here.

With continuous benchmarking in place, we can iterate on our simple profiler without worrying about introducing performance regressions into our code. Continuous benchmarking is not meant to replace profiling or running benchmarks locally. It is meant to complement them. Analogously, continuous integration has not replaced using a debugger or running unit tests locally. It has complemented them by providing a backstop for feature regressions. In this same vein, continuous benchmarking provides a backstop for preventing performance regressions before they make it to production.

Wrap Up

eBPF is a powerful tool that allows software engineers to add custom capabilities to the Linux kernel without having to be a kernel developer. We surveyed the existing options for creating eBPF programs. Based on the requirements of speed, safety, and developer experience, we built our sample program in Rust using Aya.

The simple profiler we built contained a performance regression. Following Donald Knuth's wisdom, we set out to discover what critical 3% of our simple profiler we needed to fix. We triangulated our performance regressions using a sampling profiler based on perf visualized with flame graphs and an instrumenting profiler for heap allocations based on DHAT.

With our performance regression pinpointed, we verified our fix with a benchmark. The Criterion benchmarking harness proved invaluable for local benchmarking. However, to prevent performance regressions before they merged, we implemented continuous benchmarking. Using Bencher, we set up continuous benchmarking to catch performance regressions in CI.

All of the source code for this guide is available on GitHub.