
Google Release "gVisor", a Lightweight Container Runtime Sandbox Used to Provide Secure Isolation


Google has released gVisor, a new kind of sandbox that provides secure isolation for containers while being less resource-intensive than running a full virtual machine (VM). At its core, gVisor is an open source user-space kernel that implements a substantial portion of the Linux system surface. It is written in Go and makes different trade-offs from existing container technology. The project includes an Open Container Initiative (OCI) runtime called "runsc" that integrates with Docker and Kubernetes.

According to the gVisor project's GitHub README, the core of gVisor is a kernel that runs as a normal, unprivileged process and supports most Linux system calls. Just as within a VM, an application running in a gVisor sandbox gets its own kernel and set of virtualized devices, distinct from the host and from other sandboxes. gVisor provides a strong isolation boundary by intercepting application system calls and acting as the guest kernel, and can be thought of as an extremely paravirtualized operating system with a "flexible resource footprint and lower fixed cost than a full VM". However, this flexibility comes with trade-offs in performance and compatibility. gVisor may perform poorly for system-call-heavy workloads. Although it implements a large part of the Linux system API (currently 200 system calls), several system calls and arguments are not supported, and neither are some parts of the /proc and /sys filesystems. As a result, not all applications will run inside gVisor.

 

gVisor layers (image taken from project's GitHub repo)

 

The Google Cloud Platform (GCP) blog announcement for gVisor notes that containers have revolutionised how organisations develop, package, and deploy applications. It also states that the system surface exposed to containers is broad enough that many security experts "don't recommend them for running untrusted or potentially malicious applications". The blog post cites the opensource.com article "Are Docker containers really secure?" to support this claim. That article, however, was published in 2014, and much has changed in the container security landscape since then, particularly in relation to Docker.

There are, however, still widely acknowledged security challenges with current container technology, as catalogued in a previously published InfoQ article, "Docker and High Security Microservices: A Summary of Aaron Grattafiori's DockerCon 2016 Talk". One of the primary issues is a side effect of sharing a single kernel: the same design that brings efficiency and performance gains also means that a single vulnerability can be enough to allow a container escape. Accordingly, Google posit that a growing desire to run more heterogeneous and less trusted workloads has created new interest in sandboxed containers: "containers that help provide a secure isolation boundary between the host OS and the application running inside the container".

gVisor limits the host kernel surface that the application can reach, while still giving the application access to all the features it expects. Unlike most kernels, gVisor does not assume or require a fixed set of physical resources. Instead, it leverages existing host kernel functionality and runs as a normal user-space process. gVisor intercepts all system calls made by the application and does the necessary work to service them. A key distinction from other container technology is that gVisor does not simply redirect application system calls to the host kernel. Instead, it implements most kernel primitives itself (signals, file systems, futexes, pipes, memory management, etc.) and builds complete system call handlers on top of these primitives.

To provide defense-in-depth and limit the host system surface, the gVisor runtime is split into two separate processes. The first, the Sentry, includes the kernel and is responsible for executing user code and handling system calls. The second, a proxy called the Gofer, receives file system operations that extend beyond the sandbox (i.e., not internal /proc or tmp files, pipes, etc.) from the Sentry via a 9P connection.

 

gVisor Sentry and Gofer Architecture (image taken from project's GitHub repo)

 

The Sentry requires a platform to implement basic context switching and memory mapping functionality. Today, gVisor supports two platforms: the Ptrace platform uses SYSEMU functionality to execute user code without executing host system calls; and the KVM platform (experimental) allows the Sentry to act as both guest OS and Virtual Machine Monitor (VMM), switching back and forth between the two worlds seamlessly.
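The platform abstraction can be pictured as a Go interface that the Sentry programs against, with one implementation per mechanism. The sketch below is hypothetical. Its method set is a simplification of "context switching and memory mapping", and its implementations are stubs that do not actually use ptrace or KVM. gVisor's real interface is considerably richer.

```go
package main

import "fmt"

// Platform is a hypothetical, simplified view of what the Sentry needs.
type Platform interface {
	Name() string
	// Switch runs application code until it traps (e.g. on a system call).
	Switch() (trap string, err error)
	// MapMemory maps length bytes at addr into the application address space.
	MapMemory(addr, length uint64) error
}

// ptracePlatform stub: a real one would resume the traced thread with
// PTRACE_SYSEMU and wait for it to stop at its next system call.
type ptracePlatform struct{}

func (ptracePlatform) Name() string                         { return "ptrace" }
func (ptracePlatform) Switch() (string, error)              { return "syscall", nil }
func (ptracePlatform) MapMemory(addr, length uint64) error  { return nil }

// kvmPlatform stub: a real one would run application code on a KVM virtual
// CPU, with the Sentry acting as both guest kernel and VMM.
type kvmPlatform struct{}

func (kvmPlatform) Name() string                        { return "kvm" }
func (kvmPlatform) Switch() (string, error)             { return "syscall", nil }
func (kvmPlatform) MapMemory(addr, length uint64) error { return nil }

// newPlatform selects an implementation by name.
func newPlatform(name string) (Platform, error) {
	switch name {
	case "ptrace":
		return ptracePlatform{}, nil
	case "kvm":
		return kvmPlatform{}, nil
	default:
		return nil, fmt.Errorf("unknown platform %q", name)
	}
}

// runOnce is the Sentry's view: it depends only on the interface, so the
// same kernel code runs unchanged on either platform.
func runOnce(p Platform) (string, error) {
	if err := p.MapMemory(0x400000, 4096); err != nil {
		return "", err
	}
	trap, err := p.Switch()
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("%s: stopped on %s", p.Name(), trap), nil
}

func main() {
	for _, name := range []string{"ptrace", "kvm"} {
		p, err := newPlatform(name)
		if err != nil {
			panic(err)
		}
		msg, err := runOnce(p)
		if err != nil {
			panic(err)
		}
		fmt.Println(msg)
	}
}
```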

The gVisor runtime integrates with Docker and Kubernetes via "runsc" (short for "run Sandboxed Container"), which conforms to the OCI runtime API. The runsc runtime is interchangeable with runc, which is Docker's default container runtime. In Kubernetes, most resource isolation occurs at the pod level, making the pod a natural fit for a gVisor sandbox boundary. The Kubernetes community is currently formalizing the sandbox pod API, but experimental support is available today. The runsc runtime can run sandboxed pods in a Kubernetes cluster through the use of either the cri-o or cri-containerd projects, which convert messages from the Kubelet into OCI runtime commands.

Among related projects, Kata Containers is an open source project that uses "extremely lightweight" VMs to keep the resource footprint of container isolation minimal. Like gVisor, Kata provides an OCI runtime that is compatible with Docker and Kubernetes. There has been much discussion on Hacker News about the trade-offs between the technologies, with one user, "jsolson", suggesting that "the tradeoffs between [these differing sandbox technologies] are mostly with respect to compatibility, robustness of the security boundaries, and performance".

gVisor is written in Golang (Go), which was chosen for its memory- and type-safety. It is worth noting that gVisor can currently only build and run on x86_64 Linux 3.17+ and only supports x86_64 binaries inside the sandbox (i.e., it cannot run 32-bit binaries).
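A pre-flight check for these host requirements (x86_64 Linux, kernel 3.17 or later) might look like the Go sketch below. The function names are hypothetical. The sketch also relies on two assumptions. First, `/proc/sys/kernel/osrelease` holds the kernel release string. Second, the architecture this checker binary was compiled for (`runtime.GOARCH`) is a reasonable proxy for the host architecture.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"
)

// kernelAtLeast reports whether a release string such as "4.15.0-1034-gcp"
// or "3.17-rc1" is at least major.minor.
func kernelAtLeast(release string, major, minor int) bool {
	parts := strings.SplitN(release, ".", 3)
	if len(parts) < 2 {
		return false
	}
	majN, err := strconv.Atoi(parts[0])
	if err != nil {
		return false
	}
	// Strip any non-numeric suffix from the minor part, e.g. "17-rc1" -> "17".
	minStr := parts[1]
	if i := strings.IndexFunc(minStr, func(r rune) bool { return r < '0' || r > '9' }); i >= 0 {
		minStr = minStr[:i]
	}
	minN, err := strconv.Atoi(minStr)
	if err != nil {
		return false
	}
	return majN > major || (majN == major && minN >= minor)
}

// hostSupported combines the OS, architecture, and kernel-version checks.
func hostSupported(goos, goarch, release string) bool {
	return goos == "linux" && goarch == "amd64" && kernelAtLeast(release, 3, 17)
}

func main() {
	release, err := os.ReadFile("/proc/sys/kernel/osrelease")
	if err != nil {
		fmt.Println("cannot read kernel release:", err)
		return
	}
	rel := strings.TrimSpace(string(release))
	fmt.Println(rel, "supported:", hostSupported(runtime.GOOS, runtime.GOARCH, rel))
}
```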

Additional information can be found in the gVisor repository on GitHub, and there is also a Google group for engineers wanting to take part in the discussion.
