From Monitoring to Observability: eBPF Chaos

Summary

Michael Friedrich discusses the learning steps with eBPF and traditional metrics monitoring and future Observability data collection, storage and visualization.

Bio

Michael Friedrich is a Senior Developer Evangelist at GitLab, focussing on Observability, SRE, Ops. He loves to educate everyone and regularly speaks at events and meetups. Michael co-founded the #EveryoneCanContribute cafe meetup group to learn cloud-native & DevSecOps. Michael created o11y.love.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Friedrich: I want to dive into, "From Monitoring to Observability: eBPF Chaos." We will hear my learning story about eBPF, how to use tools based on eBPF, and debugging certain things in production like an incident, and how chaos engineering and chaos can help with that. My name is Michael. I'm a Senior Developer Evangelist at GitLab. I have my own newsletter. We will learn about many things.

It's key to remember them even if you don't immediately understand them, research, look into what is an eBPF program, what is a Berkeley Packet Filter, diving into kernel user space, bytecode, compilers, C, C++, Go and Rust might be related. Then going a little more high-level with observability, DevSecOps, security, chaos experiments, obviously a little bit of DNS will be involved. Then tying everything together with security chaos, eBPF Probes, reliability, and more ideas which should inspire you to get more efficient with anything production related incidents and whatnot.

Observability

To get started with observability: we had monitoring, now we have observability. How would someone define observability? My personal definition is that modern application development and deployment with microservices, using cloud native technologies, requires a new approach beyond traditional metrics monitoring, or state-based monitoring. We are collecting a lot of data types, a lot of events, a lot of signals, so we are able to answer not only known questions, but also: what is the overall state of the production environment?

It's also key to identify unknown unknowns. For example, DNS response latency in a CI/CD pipeline actually caused the deployment cost to rise significantly. The cloud cost was something like 10,000 euros or dollars a month. This is something you probably wouldn't figure out with the individual data sources and metrics on their own. Combining them is what describes observability. It's also a way to help reduce infrastructure cost.

Considering that there are many different data types involved with observability, we started with metrics, then there were traces, logs, and events. Profiling comes to mind. Error tracking. Real user monitoring, or end-to-end monitoring. Test reports even can be treated as observability data. Also, NetFlow or network data. There's much more which can add to the bigger picture within observability. Metrics are a key value with text stored in a time-series database. Prometheus is defining the standard in the cloud native monitoring and observability community.

It provides a query language, the OpenMetrics specification, which was also adopted into OpenTelemetry, and the ability to visualize that as a graph, doing forecast trends and whatnot. There are different data sources within observability, like metrics from a Prometheus exporter, code instrumentation could be sending traces.
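To make the "metrics are key-value data" idea concrete, here is a minimal sketch of rendering and parsing one sample in the style of the Prometheus text exposition format. The metric name, the label names, and the helper functions are made up for illustration; the parser assumes label values contain no commas, spaces, or quotes, which a real implementation would have to handle.

```python
from typing import Dict, Tuple


def render_sample(name: str, labels: Dict[str, str], value: float) -> str:
    """Render one sample as `name{label="value",...} value`."""
    if labels:
        label_text = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        return f"{name}{{{label_text}}} {value}"
    return f"{name} {value}"


def parse_sample(line: str) -> Tuple[str, Dict[str, str], float]:
    """Parse a simple sample line back into (name, labels, value).

    Assumption: label values contain no commas, spaces, or quotes.
    """
    head, value_text = line.rsplit(" ", 1)
    name = head
    labels: Dict[str, str] = {}
    if "{" in head:
        name, rest = head.split("{", 1)
        for pair in rest.rstrip("}").split(","):
            key, raw = pair.split("=", 1)
            labels[key] = raw.strip('"')
    return name, labels, float(value_text)
```

A time-series database would store many such samples per label set, each with a timestamp, which is what makes graphs and trend queries possible.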

Potentially, in a Kubernetes cluster, there's a sidecar fetching the pod logs and then sending it to a central storage. Everything happens on the user level. This is great to some degree. Sometimes we really want to look deeper, so there are more data sources, specifically like a syscall, network events, resource access. In a microservices cluster, this would need a deeper look into the kernel level, so we have more observability data possibilities even.

(e)BPF

Is the problem solved? There is eBPF and everyone talks about it. It's on the kernel level. What is it? What problem does it solve? By definition, it provides observability, security, and networking at the kernel level, which is pretty much how the ebpf.io website describes it. The thing is, the kernel needs to be stable, so there is less innovation within the kernel itself. The idea with eBPF was to run eBPF programs as an operating system runtime addition. It's an addition to the kernel, and you can execute small programs in a safe environment.

Looking into the use cases, for these small programs, one of them is high-performance networking and load balancing, which is done by Cilium and others. You can trace applications, what they're doing on the inside, which function calls are being executed. It also helps with performance troubleshooting. Different use cases come to mind with fine-grained security observability, or even something around application or container runtime security. Being able to see which network connections a container opens or something like that. This is all possible on the kernel level with an eBPF program when provided and developed.

An eBPF program itself is a little complicated to start with, because the kernel expects bytecode and nobody writes bytecode, so I have no idea what it looks like. The thing is, we need an abstraction layer for that. Cilium provides a Go library. There is BCC as a toolchain. There's bpftrace. A lot of tools and names are floating around, which provide an abstraction layer in a higher-level programming language and are able to convert it and create bytecode for the kernel. At the kernel level, the bytecode is verified and then just-in-time compiled to the machine-specific instruction set. This is essentially the idea behind it in the background.

eBPF: Getting Started

From a user side, it's like, I need a lot to learn and this can be quite overwhelming. For me, personally, it took me quite some time to really say, ok, where should I be starting? What is the best learning strategy? I started my own research and documented everything on a knowledge base, which I maintain on o11y.love. At some point, everyone was saying, there's Brendan Gregg's tutorial, the blog post from 2019. It's current. It's accurate. It provides tutorial and examples for beginners, for intermediate, and advanced users.

The best way to get started is to start with a Linux virtual machine on a Linux host, and use a kernel greater than 4.17, which provides access to eBPF support. Also, on the way of learning all these terms and tools and technologies, it's important to note them. When you don't really understand what something does, write the term down. Also, think about how you would explain what it does when you have the first success moment in running bpftrace, and think, this solved my problem.

How would I explain this complex technology to others? Doing so really helped me understand, or even verify my knowledge, and sometimes realize that I actually was wrong. It's really a good thing to practice explaining this. This is also why I'm doing this talk. I got started looking into the BCC toolchain, which was mentioned quite often. It's also mentioned in Brendan Gregg's tutorial. I looked into what's available and what the tools are, and thought of something similar to strace telling me that a specific binary has been executed.

Like execsnoop -t, means trace all programs which are actually executing something, like executing a binary. In the first terminal, I ran the command. On the second terminal, I decided to run some curl commands, just to simulate an outgoing connection as well. It could be something like a malicious actor downloading something, which could be an interesting use case for later on. In essence, I saw something working. The commands have been logged, sshd was also doing something. I was like, ok, this is my first success moment, but what else is out there?

I looked into the next tool or platform or framework, which was bpftrace, so I was really addicted to learning now. Because bpftrace provides many use cases, or many things you can actually probe or look at, and most obvious like Ethernet traffic, but also things like looking into file systems, and much more. Getting a better insight when high-performance scaling systems are not working. It provides a high-level tracing language, so it's not necessary to write deep down C code or something. It's more inspired by DTrace and others. It can help with Ops and SRE tooling, maybe replacing even something like strace.

Because, oftentimes, it's really hard to remember what all the CLI tools are doing. With bpftrace, I thought, there is opensnoop, which is able to trace open calls. I thought, I could open a file by myself, but what if I write a quick C program, which just opens a file, creates it, and then closes it again, in order to compile it and then see what opensnoop is actually doing.

I could write my own code, and then see how the eBPF tooling is handling that. I made this happen, compiled the C binary, executed it. Then I saw not only that the ebpf-chaos.txt was created, but also that the libc was loaded by the binary. It was like, this actually makes sense, because the header include for the standard library is there and libc provides that. This was an interesting insight, also in a way of saying, I can verify what other files are being opened by the specific binary and maybe see whether the call to malloc or jemalloc, or something else, is actually happening.
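The talk used a small C program for this experiment. As a rough stand-in, here is a minimal Python sketch that does the same thing: create a file, write to it, and close it, so that a tool like opensnoop has something to observe. The function name and the file content are hypothetical. Note that the Python interpreter opens many files of its own at startup, so the trace will be noisier than with a compiled C binary.

```python
import os


def touch_and_close(path: str) -> int:
    """Create (or truncate) a file, write one line, and close it again.

    Returns the number of bytes written. Uses the low-level os.open/os.close
    calls so the open and close map closely onto the underlying syscalls.
    """
    fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o644)
    try:
        return os.write(fd, b"hello from user space\n")
    finally:
        os.close(fd)
```

To try it, run opensnoop in one terminal, call `touch_and_close("ebpf-chaos.txt")` from a second terminal, and look for the file name in the trace output.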

It got me thinking, what else is out there? Folks have been mentioning like BCC, but this needs C and Python knowledge, or you should know at least C on the side for the kernel instrumentation, and Python and Lua actually as a frontend. It can be used to run these programs. BCC means BPF Compiler Collection. I found it pretty interesting because it's the first time I saw [inaudible 00:12:57] as a hook into the kprobe_sys_clone. Whenever this happened, it was printing the Hello World command.

This really was interesting for network traffic control and performance analysis and whatnot. I was like, ok, I'm bookmarking this now and documenting it now. What else do we have? Looking into libbpf. This got me interested as a C or Rust developer, because the great thing about it is there are bootstrap demos available in a separate repository, which also provided me with the term XDP, like measuring the ingress path of packets. I was curious to like, how would I be compiling and installing this?

It could be something like a tcpdump if I'm able to capture packets, but in a faster and more efficient way. I tried compiling the tools, tried several things. Like, is it the network interface name? Is it the network interface ID? After a while, I was able to actually see the packet size being captured and sent by, for example, a systemd-resolve process and also a curl command, just to verify it's actually doing something. The slides provide all the instructions on how to compile that. I've also linked a demo project at the end where everything is documented, so you can reproduce what I was doing back then.

eBPF: Development

Considering that these are all great tools, what if I want to write my own eBPF program? What are the development steps to get going? Learning development is similar to learning eBPF on its own. I would recommend, think about a use case which is either fun or which helps solve a production problem. Think about an SRE or DevOps use case. A program is starting or exiting, there are control groups. There are TCP sessions, network interfaces, something where you can easily see something, or verify specific values, or whatnot.

Then it's required to select a compiler. There is LLVM, and even GCC in version 10, as far as I know, supports compiling the high-level code into eBPF bytecode. We don't need to actually worry about what the bytecode looks like in the background.

For specific libraries, it's recommended to know Go, Rust, or C, C++ in the basics. Probably, intermediate or advanced knowledge is required to some degree. All the libraries provide great examples and how to get started documentation. Sometimes it's really like, you should know the language to really understand what is the next step or what is the design pattern being used in the implementation.

For one, I looked into the Cilium Go eBPF library, which was interesting, because it also provided more use case examples. Cloning the repository allowed me to navigate into the examples path, and then run the XDP measuring again. In this case, for example, I saw that it's storing the network traffic in so-called maps. It's like a persistent storage within eBPF, being able to see, this IP address was sending this many packets, and so on. There is more with the Cilium Go library.
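The idea of a map that counts packets per source IP can be modeled in plain user-space code. The sketch below is not eBPF and does not use the Cilium library; it is a hypothetical Python model of a bounded hash map, assuming (as with eBPF hash maps, which are created with a fixed maximum number of entries) that new keys are rejected once the map is full.

```python
from typing import Dict


class PacketCountMap:
    """User-space model of a bounded per-source-IP packet counter."""

    def __init__(self, max_entries: int) -> None:
        self.max_entries = max_entries
        self._counts: Dict[str, int] = {}

    def record(self, source_ip: str) -> bool:
        """Count one packet; return False if the map is full and the key is new."""
        if source_ip not in self._counts and len(self._counts) >= self.max_entries:
            return False
        self._counts[source_ip] = self._counts.get(source_ip, 0) + 1
        return True

    def snapshot(self) -> Dict[str, int]:
        """Return a copy, similar to a user-space reader dumping the map."""
        return dict(self._counts)
```

In the real setup, the kernel-side program updates the map on every packet and a user-space program reads it periodically, which is how the example can print packet counts per IP address.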

From the examples you can attach the program to cgroups, and again to the network interfaces, which is a great way to start. For Rust, I've been looking into aya-rs, which is a Rust developer toolchain. Anything you know about Rust, you can just continue using it, and use cargo to build and even to run the examples. There is a book tutorial available online, which is fun to learn from and look into. For example, the xdp-hello program was sparking my interest again, just to see how this example for measuring network traffic is being implemented.

The most interesting part for me was like in production, Parca is a tool or an agent for continuous profiling from Polar Signals. This is actually using aya and eBPF for magic function calls, stack unwinding, and other things in different languages using eBPF, and using it in Rust because it's more memory safe, or there's better memory safety than in raw C code, which is quite interesting.

eBPF Use Cases (Debug and Troubleshoot Production)

We probably don't want to get started immediately with developing our own use cases and reinventing the wheel, because someone else has already thought about eBPF debugging and troubleshooting in production, and about what we should be building. People have been looking into this already. I think it's important to create an overview of the different use cases.

Considering that we think about observability, which is often the case in a distributed Kubernetes cluster or somewhere else, there is actually a Prometheus exporter using eBPF. There are OpenTelemetry Collectors collecting metrics in different ways, which we will look into in a bit. A different example, specifically for developers, is auto-instrumentation when something is deployed in a Kubernetes cluster. Pixie is an example for that.

More on the Ops side, I found Coroot, which has an interesting way of implementing service maps using eBPF and providing general Kubernetes observability. I've mentioned Parca, already, for continuous profiling. These are some tools. It's obviously not complete. The ecosystem and community is growing fast in 2023, but it's important to keep this in mind.

Looking at the security side, you can see tools like Cilium for network connectivity, security, and observability. Most recently, Tetragon was released for runtime security enforcement, specifically around preventing an attacker from accessing a file, or specific other things. Tracee, on the other side, also provides runtime security and forensics. We will see in a bit how it handles a rootkit.

I think one of the most mature, or even the most mature, tool is the Kubernetes threat detection engine called Falco, which provides different use cases, also to inspect what containers are doing. The teams at GitLab have been inventing the Package Hunter, which does software dependency scanning using Falco, just by installing a dependency in a container and then seeing whether it calls home, or downloads some malicious software, and whatnot. It's a pretty interesting space, especially knowing that eBPF is used in the background.

When we consider the third use case, or the third area, I'm thinking of like for SRE and DevOps, what tools are out there, what could be helpful. For eBPF, I found Inspektor Gadget, which is a collection of eBPF based gadgets to debug and inspect Kubernetes apps and resources. There's a wide range of tools and things, like trace outgoing connections, DNS, and even more.

It's like, install it, try it out, and get to see what it's capable of. Another tool I found was Caretta, for instant Kubernetes service dependency maps, which also looks pretty awesome to get a visual picture of what is actually going on in a Kubernetes cluster. Last, I was thinking of like, an eBPF program needs to be distributed somehow, like package it, tarball, or ZIP file, whatever. BumbleBee actually goes into the direction of building, running, and distributing eBPF programs using OCI images. You actually use a container image to distribute the eBPF programs, which is a nice isolated way and can also be tested and automated.

Observability: Storage (All Things)

Considering that this is all awesome, we also need to store all the events we are collecting. Changing the topic from collecting the data or collecting the events, to more storage with all things observability. We have so many different storage types over time. There is a time-series database. There's logs databases, traces databases, maybe an eBPF event database or something else like network traffic, NetFlow database, everything all together.

Maybe it's time to create a unified observability data storage, which is something our teams are doing at GitLab, but also others are doing that as well. It's probably something to consider in the future. Now for the storage itself, it's like, what should be the best retention period? How long do I need this data? The incident that got resolved three days ago, do I really need to keep the data for future SLA reporting, or is it just good for troubleshooting a live incident?

Another question is like, do I really want to self-host everything, then scale it and invest money to buy new hardware, buy new resources? Or would I be just uploading everything to a SaaS provider and then pay for the amount of traffic or data being pushed or pulled? Coming to the overall question like, which data do I really need to troubleshoot an incident, debug something?

Also, considering a way of like, we want to become more efficient and also more cost efficient. We need capacity planning, forecasting, trending. The SRE teams or infrastructure teams at GitLab have been creating Tamland which provides that. It can also be used in estimating the storage needed by the observability systems, by using observability metrics, which can be helpful to really say, our growth of observability data is like a petabyte next year. Do we really need that data, in order to reduce cost?
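To illustrate the kind of question capacity forecasting answers, here is a minimal sketch that fits a straight line to past storage usage and extrapolates it. This is not how Tamland works internally; it is a simple least-squares trend under the assumption that samples are evenly spaced (for example, one per month), and the function name is hypothetical.

```python
from typing import Sequence


def linear_forecast(samples: Sequence[float], steps_ahead: int) -> float:
    """Extrapolate evenly spaced samples with an ordinary least-squares line."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    # The last sample sits at x = n - 1; forecast steps_ahead beyond it.
    return intercept + slope * (n - 1 + steps_ahead)
```

With monthly usage of 100, 110, 120, and 130 TB, a 12-month forecast extends the 10 TB per month trend. Real observability data rarely grows this linearly, so a production forecast would need seasonality and confidence intervals.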

Observability: Alerts and Dashboards

Observability also means that we're doing something with the data. We're defining alert thresholds. We have dashboards. We want to reduce the mean time to response. An alert is fired when a threshold is violated, so we want to do something about it. We also want to correlate, analyze, and suppress all these alerts, because when too many alerts are being fired, it's not fun debugging at 3 a.m. Also, if there are new possibilities with eBPF event data, this will also be an interesting use case to actually add that. Considering that we also have dashboards, we need to do something with the data: creating summaries and correlations, providing the overall health state, reducing the mean time to response. Also considering forecasts and trends.
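One common way to reduce alert noise is to fire only when a threshold has been violated for several consecutive samples, similar in spirit to a "for" duration on an alerting rule. The following is a minimal sketch of that idea; the function name and parameters are hypothetical and not taken from any specific alerting system.

```python
from typing import List, Sequence


def firing_indexes(values: Sequence[float], threshold: float, for_samples: int) -> List[int]:
    """Return the indexes at which an alert starts firing.

    The alert fires once a value has stayed above the threshold for
    `for_samples` consecutive samples, and is not re-fired while the
    violation continues.
    """
    fired: List[int] = []
    streak = 0
    for index, value in enumerate(values):
        if value > threshold:
            streak += 1
            if streak == for_samples:
                fired.append(index)
        else:
            streak = 0
    return fired
```

A single short spike does not page anyone, while a sustained violation fires exactly once.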

Verify Reliability

The thing is, if the dashboard is green, and everything is healthy, this doesn't prove anything. Everything looks OK, but how do I verify reliability, and all the tools and dashboards and whatnot? Which brings me to chaos engineering. We can break things in a controlled way in order to verify service level objectives, alerts, and dashboards. For that, we use chaos frameworks and experiments. The example before was Chaos Mesh, but there are different chaos frameworks available, such as Chaos Toolkit, which can be run on the CLI, for example.

Providing extensions with Pixie can be integrated in CI/CD. You can develop your own extension. It's like a wide variety of ensuring I can break things in my environment, and then verify that all the data collection, which also happens with eBPF, is actually in good shape. Chaos engineering isn't just breaking things and then watching the observability dashboards turn red.

It's also a way of, going beyond traditional chaos engineering, injecting unexpected behavior, doing security testing, even like hardening software and doing some fuzz testing, which could also be seen or defined as a chaos experiment. Which is helpful knowledge by looking into all the things which are helpful. We will be talking about specific tools. We also want to break them to verify that they're actually working, and what are the weaknesses, what are the edge cases which are not yet implemented and not yet solved?

eBPF Observability Chaos (Let's Break Everything eBPF)

Let's consider some ideas and some use cases specifically tied to observability. For golden signals, it's rather easy to create chaos experiments or use chaos experiments for latency, traffic, errors, and saturation. There are tools and examples already available. This can be verified. Considering that we might be using the eBPF to Prometheus exporter, we can collect metrics. The exporter uses libbpf, supports CO-RE, which is like, you compile it once and can literally run it with every kernel. It's a good way to run it on different systems.

I did it using a container, specifying, in the box at the bottom with the command, the different configuration names which are available, like, for example, the TCP SYN backlog window, and so on. This is really helpful to look inside. Thinking about how to verify this is actually working: add some chaos experiments, such as CPU stress testing, I/O stress testing, memory stress testing, adding TCP delays. Maybe even doing a network attack or something, in order to see that the metrics being collected with this exporter are not always the same, but you can see the spikes, see the behavior of the system. Then, also, get an insight into whether the tool is working or not.
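The check "did the chaos experiment actually show up in the metrics?" can be automated. Here is a minimal sketch that compares the average of a metric during the experiment window against the average outside of it. The function name, the window indexes, and the default ratio of 2.0 are assumptions for illustration; a real check would query the metrics backend and pick a threshold per metric.

```python
from statistics import mean
from typing import Sequence


def experiment_visible(series: Sequence[float], start: int, end: int,
                       min_ratio: float = 2.0) -> bool:
    """Return True if the metric during [start, end) clearly exceeds the baseline."""
    window = list(series[start:end])
    baseline = list(series[:start]) + list(series[end:])
    if not window or not baseline:
        raise ValueError("need samples both inside and outside the window")
    baseline_mean = mean(baseline)
    if baseline_mean == 0:
        return mean(window) > 0
    return mean(window) / baseline_mean >= min_ratio
```

If a CPU stress experiment runs and this returns False for the CPU-related metric, either the experiment did not take effect or the exporter is not collecting what we think it is.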

Looking into a different example, which I mentioned before, around how developers can benefit from Kubernetes observability, where Pixie is one of the tools. It provides auto-instrumentation for a deployed application. There's also a way to get an insight with service maps, which can be a great way to visualize things. It got me thinking of, if there was a service map, how does this change when there was an incident or when there's something broken?

Stress testing this again, or even running a network attack to see if the service map changes, or to see if the application insights, the traces are taking longer. There might be some race conditions, some deadlock, something weird going on, which could be a production incident. Then we can see, the tool is actually working and providing the insight we actually need. Again, this is all using eBPF in the background. For Kubernetes troubleshooting, there's Inspektor Gadget, so we can trace DNS and even more within the Kubernetes cluster. It's not bound to Kubernetes only.

There's also a local CLI which can be run in a virtual machine on Linux, which is a great way to get an insight: what are the DNS requests doing? Is there something blocked, or returning NXDOMAIN, or something like that? For chaos experiments to verify that these tools are actually providing the expected results, inject some DNS chaos, which provides random results or NXDOMAIN results. Think about breaking the network, or even doing a network DDoS attack, a traffic attack, out-of-memory kills, certain other things. This is really a helpful toolchain, but in order to verify it's working, we need to break it. This is why I'm always thinking of, test it with chaos experiments or chaos engineering.
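The DNS chaos idea, randomly answering NXDOMAIN, can be sketched at the application level as a wrapper around any resolver function. This is a hypothetical illustration, not how Chaos Mesh or Inspektor Gadget work: the `NXDomain` exception, the `chaotic_resolver` name, and the failure-rate parameter are all made up for this example.

```python
import random
from typing import Callable, Optional


class NXDomain(Exception):
    """Raised when the chaos wrapper pretends the name does not exist."""


def chaotic_resolver(resolve: Callable[[str], str], failure_rate: float,
                     seed: Optional[int] = None) -> Callable[[str], str]:
    """Wrap a resolver so a fraction of lookups fail with NXDomain.

    failure_rate is a probability between 0.0 (never fail) and 1.0 (always fail).
    A seed makes the injected failures reproducible between test runs.
    """
    rng = random.Random(seed)

    def wrapped(name: str) -> str:
        if rng.random() < failure_rate:
            raise NXDomain(name)
        return resolve(name)

    return wrapped
```

With the wrapper in place, a DNS tracing tool should report a matching share of failed lookups; if it does not, the tool or the experiment needs a closer look.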

Speaking of which, for Kubernetes observability, it's also great to have service maps or getting an overview which container is talking to another container, Coroot using eBPF for creating network service maps, which is a super interesting feature, in my opinion. Because it also provides an insight of, what is the traffic going on, the CPU usage on the nodes, and so on. If we break TCP connections, or increase the network traffic, or even stress the memory, how would this graph behave?

What is the actual Kubernetes cluster doing? Which is a way to also verify that the tool is actually providing the solution for our use cases. It's working reliably. We can use it in production in the future. From time to time, again, we run the chaos experiment to really verify that the tool is still working after an update or something like that. Lots of ideas. Lots of things to consider.

When looking into profiling, and this is an example with Parca. Parca uses eBPF to auto-instrument the code, which means like function calls, stack debug symbol unwinding. It's really interesting that it's like auto-instrumentation and I as a developer don't need to take any action on adding this, or understanding how perf calls work. The most interesting part here is that the agent provides all this functionality.

There is a demo available. The Polar Signals folks also started an e-learning series called Let's Profile, where they are actually profiling Kubernetes and then looking into how to optimize it, which can be a use case for your projects as well. Continuous profiling is on the rise also in 2023. The idea is how to verify that the behavior is there, that we can simulate a spike with code crisis.

The function calls are taking too long, we maybe want to reveal a race condition or a lock in the software. We can run CPU or memory stress tests to see whether the continuous profiling results are actually showing that under CPU stress, everything is behaving as expected, or maybe it is not. This is really a runtime verification using some chaos experiments with continuous profiling.

Considering that OpenTelemetry moved beyond traces, also adding support for metrics, logs, and different other observability data types in the future. There is a project which implements eBPF in OpenTelemetry to collect low-level metrics directly from the kernel, from a Kubernetes cluster, or even from a cloud collector. I think AWS and GCP currently support it.

The idea really is to send that to a reducer, which I think is like an ingestor, allowing to modify the data or sanitize it, and then send it, or then provide it either as a scrape target for metrics in Prometheus, or send it to the OpenTelemetry Collector, which then can forward the metrics, and to move with that. Again, in order to verify that the data collection is usable, add some chaos experiments when testing the tool.

Think about CPU, memory, and also network attacks to really see whether the data being collected is actually something valid or something useful in this regard. Last, I think DNS is my favorite topic. There is a thorough guide on DNS monitoring with eBPF which has a lot of source code and examples to learn from. To test that, again, is similar: let's break DNS, add some DNS chaos to the tests, which I think is always a great idea, because it's always DNS that has a problem, whether we're in a chaos experiment or in production.

eBPF Security Chaos

I've talked a lot about observability chaos with eBPF. Now, let's add some security chaos, which is pretty interesting, especially because we want to verify security policies, with everything going on. Thinking about how to break things, we want to, for example, inject behavioral data that simulates privilege escalation. Or another idea could be, there is multi-tenancy with data separation, and we want to simulate an access to a sector of a dataset we shouldn't have access to.

Which brought me to the idea of, what are the tools out there promising all these things? I read a lot about Tracee from Aqua Security, which had some interesting features described in a blog post and also in a recording, saying it can detect syscall hooking. In the beginning, I wasn't really sure what a syscall hook means. Then I read on and thought, this is actually like a rootkit, which can be installed on Linux.

Then, it hooks a syscall and overwrites the kill syscall or the getdents syscall, which I think is for directory listing, or could overwrite any syscall in order to do anything malicious, or just read password credentials, or do Bitcoin mining, or whatever. I was curious: how can Tracee detect a rootkit? This was the first time I actually installed a rootkit on a fresh Linux virtual machine, although not in a production environment.

I was able to inject the rootkit, and then run Tracee with some modifications to the container command shown in the box here to really see, there's actually a syscall being hooked. It's overwritten or hidden in the screenshot, but it shows something weird is going on in the system. I was like, this proves to be useful.

Considering that I also wanted to test different tools, I looked into Cilium Tetragon, which provides its own abstraction layer and security policy domain specific language. I thought, this could also be used for detecting a rootkit, so like simulating a rootkit as a chaos experiment, or simulating file access that matches certain policies. After running Tetragon in a container, I was able to also see what the rootkit was doing, because it provided me with the insight that there were some strange binaries being created, which then ran some commands, which were calling home, opening a port.

Yes, some fancy things. Potentially, the virtual machine is now compromised, and we shouldn't be using it. It was really an interesting use case. This got me to the idea, could we do a chaos experiment which is like a rootkit simulation? Something which hooks a syscall but does nothing. I'm not sure if this actually is possible.

It would be an interesting way to do some Red Team pentesting in production, trying to verify that, for example, the policies with Tetragon and Tracee are actually working, so impersonating the attacker, again. Installing a full rootkit in production is real unwanted chaos, you don't want to do that. I deleted the virtual machine after doing the demos, or the ideas for this talk, and documented all the steps to do it again, which should be fine then.

eBPF Chaos (Visions and Adventures)

Considering that there is more to that, so what is the idea behind combining eBPF with chaos engineering and verifying that everything is working. I also had the idea of like, we could be using eBPF on the kernel level to inject some chaos, change a syscall, change the responses, modify a DNS request into a different response, maybe even access protected data and try to protect it.

When I thought about this, I also saw my friend, Jason Yee, having a similar idea, using eBPF to collect everything. Then also he's thinking about, how does this help our chaos engineering work in that regard? There's also a great research paper, which I linked at the bottom, for maximizing error injection realism for chaos engineering with system calls, which is called the Phoebe project, which brings more ideas into that area.

Thinking of a real-world example with Chaos Mesh and DNSChaos: it's using a plugin that implements the Kubernetes DNS-based service discovery spec. It needs CoreDNS. What if we could change that to an eBPF program that intercepts the DNS request, and then does some chaos engineering? While looking into this, and researching a little bit, I found Xpress DNS as an experimental DNS server, which is written in BPF, and which has a user space application where you can add DNS records or modify DNS records in a BPF map, which is then read directly by the module or by the eBPF program. I thought, this would actually be an interesting way to do chaos engineering 2.0, something like that, to be able to modify DNS requests on the fly, and add some high-performance DNS chaos engineering to production environments.

Another idea around this was eBPF Probes. I thought this could be a reasonable term for a chaos experiment. We can simulate the rootkit behavior in small programs or snippets, and use a feature flag or whatever we want for enabling a specific chaos experiment. We can simulate a call home using some HTTP requests, or intercept the traffic and cause delays: I'm reading something from the buffer, I'm sleeping for 10 seconds, and then I'm continuing, maybe. Also consider CPU stress testing, DNS chaos, all the things which can be broken or which you want to break in a professional way, and then verify reliability.
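The "read from the buffer, sleep, continue" experiment behind a feature flag could look roughly like this in user space. This is a Python sketch, not an eBPF program. The environment variable convention `CHAOS_<NAME>` and the function names are assumptions invented for this example.

```python
import os
import time


def chaos_enabled(experiment, environ=os.environ):
    """Feature flag check: CHAOS_<NAME>=1 turns an experiment on (assumed convention)."""
    return environ.get(f"CHAOS_{experiment.upper()}", "0") == "1"


def delayed_read(read_func, delay_seconds, experiment="read_delay",
                 environ=os.environ, sleep=time.sleep):
    """Wrap a read function so it pauses after reading while the flag is set."""

    def wrapper(*args, **kwargs):
        data = read_func(*args, **kwargs)  # read something from the buffer
        if chaos_enabled(experiment, environ):
            sleep(delay_seconds)           # hold the caller back
        return data                        # then continue as normal

    return wrapper


# Usage: reads are delayed by 10 seconds only when CHAOS_READ_DELAY=1 is set.
slow_read = delayed_read(os.read, delay_seconds=10)
```

Passing `environ` and `sleep` as parameters keeps the experiment testable without actually waiting, and the flag is checked on every call so the experiment can be switched off while the process is running.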

Chaos eBPF (We've Got Work to Do)

These are all great ideas, but we've also got work to do. Vice versa: chaos with eBPF. From a development or DevSecOps perspective, eBPF programs are like normal source code. You want to compile it. You want to run it. You want to build it. You want to test it. You want to look into code quality. It's complicated to write good eBPF program code. Humans make mistakes, so we need security scanning in the CI/CD pipeline, or shifting left.

We also want to see if there are programs that could be slowing down the kernel, or doing something without good intentions, like supply chain attacks: installing an eBPF program and magically it becomes a Bitcoin miner. This is an interesting problem to solve, because at the moment the kernel verifies eBPF programs at load time and rejects anything which is deemed unsafe. In CI/CD, this is a nightmare to test because you need something to be able to test it. You cannot really run it in an actual kernel. One of the attempts is to create a harness which moves the eBPF verifier outside of the running kernel. The linked article is an interesting read. Let's see how far this goes in order to improve everything.

There are certainly risks involved. eBPF is root on the kernel level. There are real world exploits, rootkits, and vulnerabilities which are actually bypassing eBPF, because there are also limitations of eBPF security enforcement using different programming techniques, different ring buffers, and so on.

It's like a cat and mouse game with eBPF. My own wish list for eBPF would be having a fiber or a sleepable eBPF program, in order to sleep and continue at a later point. Also consider monitoring the monitor: eBPF programs that observe other eBPF programs for malicious behavior, or having that out of the box in the kernel. And a better developer experience, plus abstraction by platforms.

Conclusion

Consider eBPF as a new way to collect observability data. It provides you with network insights. It provides security observability and enforcement. Add chaos engineering to verify this observability data. Verify the eBPF program behavior. Also consider integrating eBPF Probes for chaos experiments, hopefully provided upstream in the future. We have moved from monitoring to observability.

We have moved from traditional metrics monitoring to being able to correlate, verify, and observe. We also need to consider that DataOps is coming, and that we want to use the observability data for MLOps and AIOps. In the future we might be seeing AllOps, or whatever Ops.

Also consider the benefits. We have observability-driven development with auto-instrumentation, so developers can focus on writing code and not something else. We can verify the reliability from an Ops perspective with chaos engineering. From the Sec perspective, we hopefully get better cloud native security defaults from everything we learn while using eBPF.

To-dos: eBPF program verification in CI/CD, chaos experiments using new technology, and more ready-to-use eBPF-level abstractions. Consider these learning tips. Start in a virtual machine. Use Ansible or Vagrant provisioning, or something else, and share that with your teams.

I did that in the demo project, which uses Ansible to install all the tools in an Ubuntu virtual machine. Consider taking a step back when you don't understand names or technologies. Take a note, read on. You don't need to understand everything about eBPF. A general understanding can help you when the data collection breaks, or something else is going on and the tools are not working. This is helpful information to get a deeper insight into what's actually going on.

Resources

You can read more about the GitLab Observability Direction on the left-hand side. You can access the demo project where all the tools and all the scripts are located. Here's my newsletter where I write everything I learn about eBPF, about observability and also chaos engineering, https://opsindev.news/.

Recorded at:

Nov 24, 2023

Michael Friedrich