gVisor: Building and Battle Testing a Userspace OS in Go



Adin Scannell talks about gVisor - a container runtime that implements the Linux kernel API in userspace using Go. He talks about the architectural challenges associated with userspace kernels, the positive and negative experiences with Go as an implementation language, and finally, how to ensure API coverage and compatibility.


Adin Scannell is a Software Engineer at Google, where he leads the gVisor team and focuses on container security and isolation. He has been virtualizing things for a while: he was previously co-founder and CTO at Gridcentric, which pioneered rapid virtual machine cloning technology and was acquired by Google.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


I'm Adin [Scannell], and I'm here to talk about gVisor. I'm an engineer at Google. I actually think “What's this thing?" is a really good summary of the presentation that I'm going to give today. And hopefully, people will be able to walk out with an answer to that question. So we've got a nice, small-ish room here, intimate. So although I have time for questions at the end, if there's something you really don't understand, you want me to dive into in some slide, you can stop me and I'll tell you if I'm going to address it more later on. And also, please forgive me, I can't see my slides in the screen, so I'll be walking back and forth a little bit.

Operating Systems & Containers

I'm going to start with some background. And the first thing I want to do is talk about what an operating system is. So everyone knows an operating system is the blue box that sits between hardware and application. Everyone's on board? It's the blue box. Well, for my purposes, I'm going to define an operating system as something that does two things. The first thing it does is it takes a set of fixed physical resources and transforms those into a set of fungible, virtual resources for applications to use. So that's kind of the bottom half there. And then the other thing that it's responsible for is providing a set of abstractions and a system API for applications to actually be able to create those resources, use them, destroy them.

And so, an example is cores, physical CPUs. An operating system is going to take some set of cores, it has a scheduler, that's its basic mechanism, and it's going to provide threads as an application abstraction. And the system API for threads are things like clone, fork, wait, futex. Those are all interacting with that scheduler mechanism. Memory is very similar. So, you have physical memory, and the operating system uses virtual memory, the MMU, to provide the notion of mappings that applications will interact with. So the [inaudible 00:01:58] mapping is mmap, munmap, fork. The operating system multiplexes these resources using reclaim, Unified Buffer Cache, swap. These are that first half that it's doing. And finally, one more example, just because I wanted to include a 10BASE-T NIC in the slides. The operating system has a single network device, and it's using a network stack and providing socket abstractions for applications to use.

Next, since I'm going to be talking about containers and container isolation, I want to say what a container is. It's really two different things. The first is packaging format. It's a content addressable bundle of content addressable layers. And these are the things that are in Docker Hub or Google Container Registry or wherever your preferred source of container truth is. And it's also a set of abstractions around those system abstractions, or a sort of set of standard semantics for the system API. So when a container starts, the first process of that container is PID 1. It has a particular view of the file system which is defined by that, what was in the set of layers there. And that's typically implemented in Linux, for example, with a set of kernel features called namespaces, and a set of resource controls called cgroups. I'm not going to get into too much detail on those things. Hopefully, those words are vaguely familiar to everyone.

And it goes without saying that containers are pretty amazing. They've transformed the landscape of how people run services. They've transformed infrastructure itself, and it's because they're incredibly useful. They're very portable. You can run the same workload on your workstation as you can in production pretty easily. They're very, very performant, very efficient. And those second ones, the performance and efficiency, a lot of that actually comes from the fact that you have a single shared kernel that's providing those virtualized abstractions. You have a single scheduler that is scheduling threads across all containers. You have a single memory management mechanism that is optimizing for all the applications that are running on a host.

Container Isolation

But there's a bit of a problem, which is containers are really amazing but they're not actually that amazing at containing. And it's for the exact same reason that they're performant and efficient: you have this single shared kernel. And the kernel is very, very big and very, very complicated. It's an incredibly high quality piece of software, but it's just a simple fact that the bigger and more complicated something is, the more bugs it's likely to have. I was actually struck by this headline yesterday, "The Linux 5.0 Kernel Is the Biggest All Year with 350,000 Lines of New Code." So, something to keep in mind: although the kernel seems relatively stable, like there's not that many new features going in, development's actually increasing in pace. And the number of backports to stable branches is going up over time. So there's a fantastic presentation you can look up by Dmitry Vyukov called "Syzbot and the Tale of Thousand Kernel Bugs," where he lays all this out and lays this reality on you.

While everyone is familiar with the big headline bugs, Spectre, Meltdown, Dirty COW, things that have logos associated with them, there's actually a lot more to it. In 2017, there were 450 kernel CVEs, including 169 code execution bugs. And they're all varying severities. They're not all super severe, but there are a lot of bugs in the kernel, and it's not that hard to find them. So, for example, syzbot, which is the tool by Dmitry [Vyukov], it's run in the open. It's fuzzing kernel interfaces, and you can go look up 371 different race conditions, crashes, use-after-free bugs, whatever you want here. And some of these are exploitable, some of them are not. It's just the state of the world.

So, this is a problem because although there are a lot of talks on microservices and how to build the right kind of architecture, you really need all those things, you really need to ensure that your microservices are limited in scopes and you have network policy enforcement and all these kinds of stuff. But a lot of those tools rely on proper container isolation. If an attacker compromised one of your microservices, everyone should know they can talk to everything that microservice can talk to. So you always have to keep in mind when you're architecting your application, that's just a fact. But an attacker may try to move laterally. They may try to gain more privileges on the host, they'll move to other services, and you really have to keep that in mind. This sounds very scary, but I do want to temper this by saying that container isolation is actually pretty good, and if you're running all first party code, you don't have to be too scared by this. But a lot of people are at more risk than they might think. There are a lot of cases where you're running external code, you're running third party code and you may not even be aware of that. So you could have pretty exploitable tools that you're running on user data and that's one vector that people will take to get into your services. A couple good examples there, Ghostscript, PDF rendering, and transcoding FFmpeg both have RCEs associated with them. There's the famous incident with Struts and the remote code execution. I can't remember the company, the credit company. It starts with an E. Equifax. Yes, that was a relatively famous incident. And, of course, you may be running user code through plugins, or extensions, or other kinds of services that you are just not really thinking about too carefully.


All of this is the reason that people use sandboxes. Sandboxes are a way of getting an extra layer of isolation for those kinds of services on the last slide where you're running user code directly or you want to have a little bit more protection than just using container isolation natively. And a couple of sandboxes that people may be familiar with: seccomp, a really good example. So, seccomp is an in kernel mechanism, but you can use it to limit the surface of the kernel that you're exposing to applications. This is something that's standard. I think Docker has a seccomp profile that you can apply to individual containers. But there's a fundamental tradeoff that exists for seccomp which is it's more effective the more restrictive the policy you can define. So, if you have an application that you can start up and it only does read, write, exit, it takes some input, does a few operations in that input, spits out some output, and then quits, that's going to be a really, really effective sandbox, but there's almost nothing that can run with that particular policy.

So, another challenge that seccomp has is you can have things that you think are pretty safe, but turn out to be not safe. There's a very famous example of this from 2014 with a futex bug. Futex is a system call used to implement mutexes, and it interacts with the scheduler. And it's sort of, “Oh it's really important, really safe, and really fundamental.” And then, “Oh, it wasn't actually that safe.” So there was probably a whole bunch of scrambling for people to get their sandbox policies in line or update their kernels or deal with that in some way. And it also is exposing the actual host kernel surface, which means if an application, say the JVM, needs to read /proc/meminfo, or /proc/self/maps, it's going to read the actual host data from that file. And you may actually be giving people a little more information than you want to give them: how much memory is in this host, maybe enough to figure out if they're co-located with other services. There may be some value in that.
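As a rough illustration of the information exposure described above (not code from the talk), the following Go sketch reads the `MemTotal` line the way any unprivileged process could. Under a seccomp-only sandbox that allows the read, the value comes straight from the host kernel. The helper name `parseMemTotalKB` is an assumption made for this example.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseMemTotalKB extracts the MemTotal value (in kB) from text in the
// /proc/meminfo format. Hypothetical helper, for illustration only.
func parseMemTotalKB(meminfo string) (uint64, bool) {
	sc := bufio.NewScanner(strings.NewReader(meminfo))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		// Expected shape of the line: "MemTotal:" "<number>" "kB"
		if len(fields) >= 2 && fields[0] == "MemTotal:" {
			kb, err := strconv.ParseUint(fields[1], 10, 64)
			return kb, err == nil
		}
	}
	return 0, false
}

func main() {
	data, err := os.ReadFile("/proc/meminfo")
	if err != nil {
		fmt.Println("cannot read /proc/meminfo:", err)
		return
	}
	if kb, ok := parseMemTotalKB(string(data)); ok {
		// Without a virtualized /proc, this is the host's memory size.
		fmt.Printf("this process can see MemTotal = %d kB\n", kb)
	}
}
```

A fully virtualized environment would instead synthesize this file, so the number reflects the sandbox rather than the machine.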

Another kind of sandbox that everyone should be familiar with is a VM. A VM transforms the way that an application interacts with the host completely. You're no longer exposing a system call API, you're exposing some virtualized hardware that's implemented by some VMM. And this is really powerful because the x86 virtualized interface, hopefully, is a little bit simpler and easier to lock down than the system interface, maybe. I wouldn't make that claim, but hopefully it is. But it's challenging because you have completely different semantics for the application. You can't pass in a file anymore and just expect it to read that file and spit out a result. If you're doing a transcode, the FFmpeg case I was talking about, you have to get the video data into that VM somehow. Maybe you have [inaudible 00:10:19] devices, you have some agent in there, maybe it's over the network. There's a whole different set of semantics that apply.

And the other really important challenge here is that now you have some guest kernel that is interacting with the virtualized hardware, and it has its own mechanisms for transforming physical resources into fungible, virtual ones. So it has a scheduler, and it has a memory reclaim mechanism, and it has swap, and that can all fight with the host kernel's mechanisms.


So that's what led us to gVisor. What we really wanted it to have was some additional defense against arbitrary kernel exploits. We didn't want to deal with arbitrary constraints anymore. We didn't want to have to say, "This application runs, this one doesn't." And we wanted it to work with the existing host mechanisms around scheduling and memory management specifically.

So, to circle back to that first definition for operating systems, we wanted the kernel to do what it was really good at: multiplexing physical hardware. So it was handling scheduling, it was handling memory management. But we were going to handle the system API part. We're not going to let applications that we don't trust talk directly to the kernel anymore. And on the container side, we really like the way containers are packaged up, but in terms of using the kernel mechanisms, the kernel abstractions on top of abstractions, we didn't want to rely on namespaces. Although we rely on the cgroups, we didn't want to rely on all that stuff exclusively to implement containers.

What we needed to build was a distinct, type-safe implementation of Linux. And that sounds a little bit insane, but as Justin [Cormack] mentioned in his talk, it's possible because the Linux system call interface is stable. You can't break user programs. Once you put something in there, it's stuck. So it actually takes a long time to add things to the Linux kernel interface. I think restartable sequences were just merged, and they were proposed five years ago or something. So, long discussion around that. My favorite system calls, my favorite calls in the Linux kernel interface, which are a great example of this being stable, are only in 32-bit, because with every new architecture you get a chance to kind of reset the list of calls you have. But my favorites are uname, olduname, and oldolduname. So that's a good example of how it's stable.

And we do need to have bug-for-bug compatibility in this implementation, but we don't want to have exploit-for-exploit compatibility. So, if there's some exploit that works on Linux with raw sockets, we don't want that same exploit to compromise our implementation. This implementation we built not on physical hardware, but on these existing system abstractions so that we can work cleanly with them.

We also want to have a fully virtualized environment, so we're not going to pass anything through to the host. If we want something to work, if we want to support something, then we have to implement it ourselves. This is a big challenge, right? It sort of limits our compatibility and it takes a long time to get to the point where we're running substantial things. But it also means that we reduce our chances of having accidental exposure to CVEs or accidental leaks of information.

So that's what we built. gVisor has this component called the Sentry, which is a pretty complete implementation of the system surface. I mean, it's not complete in the sense there's a long tail of things that we don't do, but there's a lot of stuff there and it has all the things that you would normally think of a kernel having. There's a full MM layer, a VFS layer. I'm going to talk about this stuff in more detail anyways.

Architecture & Challenges

Architecture. How does that work? To start, I'm just going to review some basics about how an operating system works, because it's sort of germane here. This might be boring for some people or it might be new to others, I'm not sure. And this is Linux, by the way; a different operating system such as Windows applies its compatibility layer at a different level, but Linux has a system call layer. When an application is running and it wants to interact with the system, it'll execute a system call. There are a number of mechanisms to execute a system call; this is just the fastest and most common one that you'd see on AMD64, so I'll just stick to this syscall instruction.

So it's going to set the registers in such a way that it encodes the system call arguments: RDI, RSI, RDX, the first three arguments. It sets the AX register to hold the system call number and executes the syscall instruction. At that point, the hardware is going to load the contents from a privileged MSR, a privileged special register, jump to that address, set the code privilege level to zero, mask a bunch of flags. Probably a few other operations in there. And essentially, you're executing in kernel mode at this point. There's some mapping in that address space that corresponds to the kernel. The kernel is going to switch to the kernel stack, push a bunch of registers and call a function that dispatches the system call. Eventually, that system call is done, it's handled it, done whatever you've asked it to. It's going to execute some other instruction, sysret or iret, to return to user space, which kind of undoes all that stuff that happened. So, that's how a system call works. Pre-kernel page-table isolation, it's probably like 100 cycles. Post-kernel page-table isolation, it's probably 500 to 1,000 cycles, something like that.

Very similar story for traps and faults. Say an application has some address in R14 that corresponds to some page that it's never touched before. Maybe it's a page that's been swapped out or maybe it's a page that's never been allocated. When it executes this instruction, the hardware is going to look up the page fault handler in the IDT. It's going to set segments appropriately for that fault handler, it's going to switch the stack, push a bunch of stuff to the kernel stack and jump to the page fault handler itself. The kernel page fault handler is going to dispatch, it's going to look up, in its own representation, the list of mappings, these virtual mappings that the application created in the past. And it's going to say, "What does this address correspond to?" Okay, it's this anonymous memory that I haven't allocated yet. It's going to allocate a new page, it's going to fill in the page tables. It's going to return, call iret, undo all that work, jump back into the application and continue executing. So this is how an operating system works in three minutes. I think it was about three minutes.

So, the Sentry is a userspace kernel, or gVisor is a userspace kernel. What the heck does that mean? Userspace has nothing to do with that explanation there. I think if there's one technical detail that I hope people walk away from this with, it's that a very, very key component within the Sentry is the platform API. So the platform API is a set of abstractions around these core mechanisms for how an operating system actually executes user code. So, it looks a little something like this. You have this platform, this top level interface. The platform can create new address spaces and new contexts for execution. An address space is something that you can map data into. A context is something that you can switch into. You take a register set and you can switch into that context and begin executing. When the application executes a system call or has a trap or anything else, the switch call returns. So it's just seen as a synchronous call through this API.

So this is a monster slide. But this is the same version of that OS trapping slide as implemented by the Sentry. So the Sentry has a memory management subsystem. It keeps track of all the VMAs that the application has mapped, all the memory regions that it's created. Just like a regular operating system, it sort of populates these things on demand based on faults. And so, a flow for trapping a fault might look like this. is the entry point for a task, and I'll talk more about that in a little bit. You call Switch to switch in the application. Application's running, it executes an mmap system call. Switch returns and says, "Mmap was executed." You will create the VMA inside the Sentry, you'll call Switch again. The application may actually touch that page at this point. Switch returns and says, "There was a trap. There was a fault that occurred." We handle it internally, we see if that fault was valid. If it's not valid, you may inject a signal or whatever. Well, you synthesize, you create a signal and you simulate that for the application. But if it's valid, we call Mappable Mapinto which is in that platform API, call Switch again, the application continues running. So that's sort of the view of faulting two slides ago, as seen by the Sentry through that platform API.
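The platform API and the fault flow described above can be sketched in Go as follows. This is a simplified model under stated assumptions, not gVisor's actual interfaces: every type and function name here (`Platform`, `AddressSpace`, `Context`, `Event`, `runTask`, `scriptedPlatform`, `demoScript`) is hypothetical, and the "application" is a scripted list of traps rather than real code.

```go
package main

import "fmt"

// EventKind says why Switch returned control to the kernel.
type EventKind int

const (
	EventSyscall EventKind = iota // the application executed a system call
	EventFault                    // the application touched an unmapped address
	EventExit                     // the application finished
)

// Event describes one trap out of application code.
type Event struct {
	Kind      EventKind
	Name      string  // system call name, for EventSyscall
	Addr, Len uintptr // mmap range, or the faulting address in Addr
}

// AddressSpace is something that data can be mapped into.
type AddressSpace interface{ MapInto(addr, length uintptr) }

// Context is something that can be switched into. Switch returns
// synchronously when the application traps.
type Context interface{ Switch(as AddressSpace) Event }

// Platform creates address spaces and contexts.
type Platform interface {
	NewAddressSpace() AddressSpace
	NewContext() Context
}

// scriptedPlatform is a fake platform that replays a fixed list of events.
type scriptedPlatform struct{ script []Event }
type scriptedContext struct{ script []Event }
type fakeAS struct{ mapped int }

func (p *scriptedPlatform) NewAddressSpace() AddressSpace { return &fakeAS{} }
func (p *scriptedPlatform) NewContext() Context           { return &scriptedContext{script: p.script} }
func (a *fakeAS) MapInto(addr, length uintptr)            { a.mapped++ }
func (c *scriptedContext) Switch(AddressSpace) Event {
	if len(c.script) == 0 {
		return Event{Kind: EventExit}
	}
	ev := c.script[0]
	c.script = c.script[1:]
	return ev
}

type vma struct{ start, end uintptr }

// runTask is the task loop: switch in, handle whatever came back, repeat.
func runTask(p Platform) []string {
	var vmas []vma
	var trace []string
	as, ctx := p.NewAddressSpace(), p.NewContext()
	for {
		ev := ctx.Switch(as)
		switch ev.Kind {
		case EventSyscall:
			if ev.Name == "mmap" {
				// Only record the mapping; pages are populated on fault.
				vmas = append(vmas, vma{ev.Addr, ev.Addr + ev.Len})
				trace = append(trace, "mmap: VMA recorded")
			}
		case EventFault:
			valid := false
			for _, v := range vmas {
				if ev.Addr >= v.start && ev.Addr < v.end {
					valid = true
				}
			}
			if valid {
				as.MapInto(ev.Addr&^0xfff, 0x1000) // map the containing 4 KiB page
				trace = append(trace, "fault: page mapped")
			} else {
				trace = append(trace, "fault: SIGSEGV delivered")
			}
		case EventExit:
			return trace
		}
	}
}

// demoScript: mmap a range, fault inside it, then fault outside it.
func demoScript() []Event {
	return []Event{
		{Kind: EventSyscall, Name: "mmap", Addr: 0x1000, Len: 0x2000},
		{Kind: EventFault, Addr: 0x1800},
		{Kind: EventFault, Addr: 0x9000},
	}
}

func main() {
	for _, line := range runTask(&scriptedPlatform{script: demoScript()}) {
		fmt.Println(line)
	}
}
```

The point of the shape is that the kernel logic in `runTask` never needs to know how `Switch` is implemented, which is what lets one Sentry run on top of different platforms.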

So there are multiple platform implementations. The one by default, because it works everywhere, is Ptrace. Ptrace essentially uses an existing kernel mechanism that lets you trap system calls and faults in other processes. This is pretty slow and complicated and it involves a bunch of … That should say “context switches”, not “content switches”. But as I was saying, it's universal. So I won't step through every single line here, but there's a couple of reschedule calls that you can see where you're swapping in and out, and those are kind of costly. So this is somewhere between 5,000 and 20,000 cycles, something like that, for a full trap, which is higher than that last number I said.

So it sort of looks like this. I won't spend more than a few seconds on this slide. But that untrusted box on the left represents an address space in that platform API. So you create a number of these stubs. Each stub represents a unique address space, and each one of those stubs will have a set of threads in it, which are just those contexts of execution. And when you call switch, you'll figure out which thread you are, which is the tracer thread for some thread inside that stub, and you'll inject the call via Ptrace [inaudible 00:21:04].

So another platform is the KVM platform. This is deeply confusing to people, because I think they hear KVM and they think it's a VM. But it's not a VM. There is no machine model, there's no virtual hardware. What happens is the Sentry runs in host ring 3 and guest ring 0 simultaneously. So it's the kernel that runs inside the guest. And it transitions between those two dynamically, which is pretty bizarre, but it's what happens. When the Sentry attempts to return to user code by executing an iret, essentially that's trapped. The Sentry state is automatically stuffed into a KVM VCPU. And that VCPU resumes, the iret succeeds, you start running the application code. When the application executes a system call or a fault occurs, you jump into the trap handler in the Sentry. If it handles it without needing a host system call, it can iret back into the application, no VM exits occur. If it needs a host system call to service that application system call, it'll be automatically transitioned back into host mode. The state is essentially extracted from that VCPU.

So, in a very straightforward state diagram, it looks something like this. If you start in the middle there, in that switch call, the first thing that happens might be an iret in host mode. That blue pill is what this functionality is called that automatically transitions the Sentry between guest and host mode. I'm not going to walk through this, but hopefully it's clear there's two platforms.


I want to talk a little bit about some of our experiences of building gVisor. The first which I'm not going to be able to speak too much about, is that it's used inside Google. It's used by the next generation App Engine runtimes. You can connect this slide pretty directly with that requirement slide and you can understand why we wanted to build this thing if you're familiar with App Engine. It's also used for other internal sandboxing use cases. We've served billions of containers at this point. Probably more, although I have never done the math. And one very interesting part of our experience is that there is a lot of scale there, and a lot of things happen at scale. So you hit every race condition, every deadlock, every kind of CPU bug. The thing is never a CPU bug, but sometimes at scale it is a CPU bug.

So the first part of our experience is in the Linux system call surface. I mentioned that it's stable, and that's totally true, but it's not necessarily well documented. It's a total de facto standard. And it's not always to spec. There are some bugs that are a critical feature or a critical part of the system call surface. One great example from production is all of a sudden we started seeing these crazy gRPC requests that took two minutes. They succeeded, but where once they took a second, now they took two minutes, and exactly two minutes. And something weird is happening here. It turns out that epoll, even if it's edge-triggered, will always trigger on eventfds when they're written to. Even if the readable or writable state doesn't change, they always trigger. And there was a bug in gRPC where they depended on this behavior. And so, although it's a very weird set of semantics, because normally an edge-triggered epoll is not going to fire on no state change, it's very much part of this system call surface and very much something that we have to be bug-for-bug compatible with.

Another point is that system calls are not a great metric for tracking how much of the system API you implement. There's 300-something calls, but far, far more proc files, hidden complexity and tiny little flags and crazy things going on in the surface there. So I hate talking about system calls as if it's super significant.

A final point here is, it takes a little bit of time to understand exactly what is important in the system call surface. Someone pointed me to some paper, the name I can't remember, that talked about what is the minimum set of system calls you need to support X percent of applications. And I think one of the top ones there was set_robust_list. It's required for robust futexes. And they arrived at that conclusion because every application, when it starts up, calls set_robust_list, so you must need it. But no one uses robust futexes because they're a terrible idea. And glibc just probes for that every single time to determine if the feature is available. But it really does not matter at all.

So in terms of testing and validating, I mentioned this de facto standard. So you're doing A/B testing. You're checking the behavior on Linux, you're checking that a test passes on Linux, and then you're checking that you have the exact same behavior in what you're implementing. And there's a variety of different strategies we employ. We use common workloads, language test suites, platform test suites. The Chrome tests are pretty phenomenal for exercising a system surface. And then dedicated system call test suites. We have our own stuff here that we're hoping to talk more about in the future as well.

And then one final point that I'll make is, in any effort like this, I think it's really, really critical to know that no matter what you do, you're going to get it wrong. You're going to miss things. So, have a plan for failure. When you're rolling things out, have a plan to identify those cases, write out appropriate monitoring metrics, deadlock detection, watchdogs, all that kind of stuff.

So, the final segment here, I just want to talk a little bit about our experience using Go. So the Sentry there, I don't think I said it yet, but the Sentry is written in Go. And I said we wanted a type-safe implementation, and I'll just pretend that it's type safe. And language nerds can talk to me later. So, I mean, it's certainly better than C. It's certainly a step up from where we were before and we have a whole new set of reasonable types that we can introduce in the kernel. So we have arch.SyscallArgument, usermem.Addr. We enforce that you don't do direct additions on usermem.Addr. You always check for overflow, and that's very helpful for certain classes of bugs. You don't have C style arrays, you get bounds checking on everything. You don't have pointer arithmetic anymore. All that is great. It's phenomenal.
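A distinct address type with overflow-checked addition, as described above, might look roughly like this in Go. This is a sketch in the spirit of the idea, not a copy of gVisor's usermem package; the `Addr` type and `AddLength` method here are written for illustration, and their exact shape is an assumption.

```go
package main

import "fmt"

// Addr is a user-space virtual address. Making it a distinct type means it
// cannot be mixed with ordinary integers without an explicit conversion.
type Addr uintptr

// AddLength returns a+length. ok is false if the addition wraps around the
// address space, or if length does not fit in an Addr on this platform.
func (a Addr) AddLength(length uint64) (end Addr, ok bool) {
	end = a + Addr(length)
	return end, end >= a && uint64(Addr(length)) == length
}

func main() {
	if end, ok := Addr(0x1000).AddLength(0x10); ok {
		fmt.Printf("range ends at %#x\n", uintptr(end))
	}
	// A length chosen by an untrusted caller to wrap the address space.
	if _, ok := Addr(^uintptr(0)).AddLength(1); !ok {
		fmt.Println("overflow rejected")
	}
}
```

The caller is forced to look at `ok` before using the end address, which is the class of bug the speaker says this convention helps with.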

A big question that everyone asks about Go, is garbage collection. How can you write an OS or a kernel with garbage collection? Isn't that a terrible idea? And of course, we're aware that it's a risky move. And we have a simple philosophy around this, which is to not have allocations on hot paths. Right? It's fine to allocate things when you're setting up a new file or you're opening a new socket, but in the read path, read and write path, you shouldn't be allocating anything. And garbage collection actually has a set of benefits as well. We eliminate use after free, double free, basically putting the free in the wrong place. All of those classes of bugs are gone.
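The "no allocations on hot paths" philosophy above can be illustrated with a small Go sketch. It is not from the talk, and the `file` type and its methods are hypothetical: one read path allocates a fresh buffer per call, the other reuses a buffer allocated once at open time. `testing.AllocsPerRun` from the standard library reports the average number of heap allocations per call.

```go
package main

import (
	"fmt"
	"testing"
)

// sink keeps results reachable so the compiler cannot optimize them away.
var sink []byte

type file struct {
	data    []byte
	scratch []byte // allocated once, when the file is opened
}

// openFile is the setup path, where allocating is considered fine.
func openFile(data []byte) *file {
	return &file{data: data, scratch: make([]byte, 4096)}
}

// readAlloc allocates a new buffer on every call, creating garbage on the
// hot path.
func (f *file) readAlloc() []byte {
	buf := make([]byte, len(f.data))
	copy(buf, f.data)
	return buf
}

// readReuse copies into the preallocated scratch buffer instead.
func (f *file) readReuse() []byte {
	n := copy(f.scratch, f.data)
	return f.scratch[:n]
}

func main() {
	f := openFile([]byte("hello"))
	fmt.Println("allocating read:", testing.AllocsPerRun(100, func() { sink = f.readAlloc() }))
	fmt.Println("reusing read:   ", testing.AllocsPerRun(100, func() { sink = f.readReuse() }))
}
```

The reusing version trades safety for speed: callers must not hold on to the returned slice across reads, which is the kind of constraint a kernel read path can usually enforce.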

And surprisingly, a lot of the things that we thought would be problems upfront were not problems. There's no real allocation cost. It's pretty cheap to allocate. And stalls due to scanning have not been a problem for us either. And a big part of that is because of the threading model that I'll talk about in a second: you can still run application code while garbage collection is happening. The big problem that we have seen from garbage collection, though, is that we waste cycles on scanning that is unnecessary. And the biggest source of unnecessary scanning is bad escape analysis.

Go has been getting better all the time at doing escape analysis. Escape analysis is, so everyone knows, when the compiler decides whether it needs to allocate something on the heap or whether it can keep something on the stack. And if it can keep something on the stack, then it's not going to be garbage collected. If it has to allocate something on the heap, then it will have to go through and collect that thing later. And when you pass or send some value through an interface, it almost always defeats escape analysis. That thing escapes and gets heap-allocated, no matter what it is. You know, you can have one implementation of that interface that just does a very, very simple thing with that object and it never escapes. But the compiler, I think, assumes that, “well, there could be many, many implementations and I can't know, so I'm going to have to assume that all these values escape.” So that's been the biggest source of problems with respect to garbage collection.
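The interface effect described above can be reproduced with a few lines of Go. This is an illustrative sketch, not code from the talk; the `stat` and `visitor` names are hypothetical. The same trivial operation is performed through a direct call and through an interface method, and `testing.AllocsPerRun` measures heap allocations for each. The exact counts depend on the compiler version, so none are asserted in the comments.

```go
package main

import (
	"fmt"
	"testing"
)

type stat struct{ size, blocks int64 }

type visitor interface{ visit(s *stat) int64 }

type sizeVisitor struct{}

// visit does something trivial with s and does not retain it.
func (sizeVisitor) visit(s *stat) int64 { return s.size }

// v is known to the compiler only as "some visitor", so at the call site it
// cannot see which implementation will run.
var v visitor = sizeVisitor{}

//go:noinline
func direct(s *stat) int64 { return s.size }

// viaDirect passes &s to a concrete function; the compiler can prove that s
// does not escape and keep it on the stack.
func viaDirect() int64 {
	s := stat{size: 42}
	return direct(&s)
}

// viaInterface passes &s through an interface method; the compiler has to
// assume some implementation might retain the pointer.
func viaInterface() int64 {
	s := stat{size: 42}
	return v.visit(&s)
}

func main() {
	fmt.Println("direct call allocs:   ", testing.AllocsPerRun(100, func() { viaDirect() }))
	fmt.Println("interface call allocs:", testing.AllocsPerRun(100, func() { viaInterface() }))
}
```

Building with `go build -gcflags=-m` prints the compiler's escape decisions for each variable, which is a common way to track down this kind of unexpected heap allocation.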

Participant 1: I'm dying to ask, did you use Rust?

Scannell: Bryan [Cantrill] is giving this talk this afternoon. We started this project a little while ago. I think we talk about this stuff sometimes, but I think Go was a good direction for us to go in at the time. Yes.

The threading model is very interesting. It's something that we try to leverage in Go where we can. The tasks, individual tasks, are goroutines. And switching into application code is a system call from the perspective of the runtime. And going back to the last slide there, garbage collection can proceed when a goroutine is in a system call. So if you have a bunch of application threads that are executing user code, then you can garbage collect in the background. And that's maybe why it's not as big a problem as it seems. One interesting consequence of using the Go scheduler or using goroutines as the basis for a task, is that the number of system threads converges to the number of active threads in the system. Because you can have a million "untrusted" application threads, and if they're all blocked doing nothing, you may only have two system threads that are actually running the system. If you have a million application threads that are actually busy doing things, then you'll end up with a million system threads in the host system. We have had a couple of issues with the scheduler. We found some of our lead cycles were a problem. Work stealing was overly aggressive. It was just hammering the global run queue all the time, just trying to find additional work. And we've submitted some fixes upstream for some of these things as well.
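The claim that blocked tasks do not hold system threads is easy to observe in plain Go. The sketch below is not from the talk and does not involve gVisor; it parks many goroutines on a channel and compares the goroutine count with the number of OS threads the runtime has created, using the standard `threadcreate` profile. `parkTasks` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/pprof"
)

// parkTasks starts n goroutines that block on a channel. It returns the
// number of live goroutines, the number of OS threads created so far, and a
// function that releases the blocked goroutines.
func parkTasks(n int) (goroutines, threads int, release func()) {
	stop := make(chan struct{})
	started := make(chan struct{})
	for i := 0; i < n; i++ {
		go func() {
			started <- struct{}{}
			<-stop // parked by the Go scheduler; no OS thread is held
		}()
	}
	for i := 0; i < n; i++ {
		<-started // wait until every goroutine is running or parked
	}
	return runtime.NumGoroutine(),
		pprof.Lookup("threadcreate").Count(),
		func() { close(stop) }
}

func main() {
	g, th, release := parkTasks(10000)
	defer release()
	fmt.Printf("%d goroutines alive on %d OS threads\n", g, th)
}
```

Goroutines that block inside a real system call behave differently: each one pins an OS thread for the duration, which is why busy application threads translate into host threads.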

So, this brings me to my miscellaneous slide on Go. I'm a big fan of Go generally, and I like a lot of the language features. From five years ago when I started programming in Go, I thought it's sort of like C but a lot of the bad stuff is fixed and there's a few new, cool things. The bad news with respect to gVisor is most of the new cool things we can't really use. So we just find there's a lot of cost that we can't necessarily pay. Defer is a great example of that. It's actually gotten way cheaper. I think there was a change that was made in Go 1.10 or Go 1.11 that made defers 1/10th the cost for simple cases. But they tend to trigger that allocation cost and garbage collection as well. So, sort of compounding other factors.

And one of the big areas where I think there's been a lot of improvements, but there's a lot of room for improvement, is in code generation. We got all these amazing checks. I mentioned bounds checks for slice accesses. But I don't think the compiler is eliding those checks in a lot of cases where it might be able to. So we tend to find the code generated by the Go compiler is much less performant than by other back ends. Two quick highlights at the bottom here. The race detector is awesome though. That's a really great tool to have for a kernel. And the AST tooling is really good as well, but it would be even better if we didn't have to use that, because we have metacode generation that we use for certain things. Go generics.
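Where the compiler does not remove bounds checks on its own, a common Go idiom is to give it a hint. The sketch below is a generic illustration of that idiom, not code from gVisor; `loadU32` is a hypothetical helper. A single early access to the highest index proves to the compiler that the lower indices are in range, so it can drop the per-access checks that follow.

```go
package main

import "fmt"

// loadU32 decodes a little-endian uint32 from the first four bytes of b.
// It panics if b is shorter than four bytes.
func loadU32(b []byte) uint32 {
	_ = b[3] // one bounds check up front; the four accesses below need none
	return uint32(b[0]) | uint32(b[1])<<8 | uint32(b[2])<<16 | uint32(b[3])<<24
}

func main() {
	fmt.Printf("%#x\n", loadU32([]byte{0x78, 0x56, 0x34, 0x12}))
}
```

The hint does not change behavior for valid input; it only moves the failure for a short slice to a single, predictable place.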


So, just a couple quick points here to wrap-up. Like Justin mentioned, we released this project in May. We're looking to make some of our roadmap a bit more public and do a bit more public issues and tracking. The big rocks, performance and compatibility, we're continuing to invest; there’s a team. And make it run more and more things. And we're continuing to invest in security as well. We recognize that security is a process and we have to keep improving it and finding issues and vulnerabilities, and find all those things before they find you. So, that's what I got. Happy to take questions.



Recorded at:

Jan 15, 2019