Security Insights into the LXD Container Hypervisor

A presentation at the Linux Security Summit last month covered various container security issues and edge cases in LXD, a container hypervisor from Canonical that uses Linux Containers (LXC) under the hood. In the session, Stéphane Graber and Tycho Andersen went into the internal details of some of these problems.

LXD is not a separate virtualization technology but a tool built on LXC features. LXC is implemented using the namespaces and control groups (cgroups) features of the Linux kernel. As a result, it relies on the isolation and security features that kernel namespaces provide.
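
As a rough illustration of the kernel namespaces that LXC builds on, the following Python sketch lists the namespaces a process belongs to by reading the symbolic links under /proc/<pid>/ns. The helper name list_namespaces and its proc_root parameter are assumptions made for this example and are not part of LXC or LXD.

```python
import os


def list_namespaces(pid="self", proc_root="/proc"):
    """Return {namespace_name: identifier} for a process.

    Each entry under /proc/<pid>/ns is a symlink whose target looks like
    'pid:[4026531836]'. Two processes share a namespace when the targets match.
    Returns an empty dict if the directory cannot be read (e.g. not on Linux).
    """
    ns_dir = os.path.join(proc_root, str(pid), "ns")
    namespaces = {}
    try:
        names = sorted(os.listdir(ns_dir))
    except OSError:
        return namespaces
    for name in names:
        try:
            namespaces[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            # Reading another user's process may be refused.
            namespaces[name] = None
    return namespaces


if __name__ == "__main__":
    for name, ident in list_namespaces().items():
        print(f"{name:20} {ident}")
```

Run once on the host and once inside a container, a script like this would typically print different identifiers for the namespaces that the container does not share with the host.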

LXD makes extensive use of cgroups to enforce resource quotas for containers, such as limits on CPU, swap, disk and network traffic. The containers still share a single kernel, however, so any issue arising from a shared kernel resource impacts all running containers. One example is inotify handles, which are used to track filesystem changes. The global limit per user is 512, which translates to a limit of 512 across all running containers on a host. This is far too low for something like systemd. The situation is made worse by the fact that systemd fails when it runs out of inotify handles rather than using a fallback mechanism like polling the filesystem. Other examples include networking tables (e.g. those used to store routing entries) and ulimit.
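
To see the inotify limits that a host kernel currently applies, a minimal Python sketch can read the sysctl files under /proc/sys/fs/inotify. The function names below are assumptions made for this example. The values printed are whatever the running kernel reports and may differ from the figure quoted in the talk.

```python
import os

INOTIFY_DIR = "/proc/sys/fs/inotify"
LIMIT_FILES = ("max_user_instances", "max_user_watches", "max_queued_events")


def parse_limit(text):
    """Convert the content of a sysctl file to int, or None if it is not a number."""
    try:
        return int(text.strip())
    except ValueError:
        return None


def read_inotify_limits(base_dir=INOTIFY_DIR):
    """Return {limit_name: value or None} for the per-user inotify limits."""
    limits = {}
    for name in LIMIT_FILES:
        try:
            with open(os.path.join(base_dir, name)) as handle:
                limits[name] = parse_limit(handle.read())
        except OSError:
            # File missing or unreadable (e.g. not on Linux).
            limits[name] = None
    return limits


if __name__ == "__main__":
    for name, value in read_inotify_limits().items():
        print(f"{name}: {value}")
```

Because these limits are accounted per user rather than per container, the output is a budget shared by every container that runs under the same user ID on the host.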

For some of these problems, the suggested solution is to virtualize the environment in which the limits exist, i.e., tie the limit to a namespace so that it becomes local to the container. However, for things like ulimit, it’s not entirely clear which namespace would fit.

LXD runs as a root-privileged daemon, which means that it has more privileges than LXC. LXD does drop a few capabilities from its containers, such as the ability to load and unload kernel modules, but keeps most of them because it cannot know beforehand which capabilities the applications running in the containers might require.
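
One way to check which capabilities a process can still acquire is to decode the CapBnd (capability bounding set) field of /proc/<pid>/status. The sketch below is an illustration under stated assumptions: it lists only a handful of capability bit numbers from the kernel's capability.h, the function names are invented for this example, and the sample mask is made up rather than taken from a real LXD container. Loading and unloading kernel modules is governed by CAP_SYS_MODULE.

```python
import re

# Bit numbers as defined in <linux/capability.h>; only a small subset is listed.
CAP_BITS = {
    "CAP_CHOWN": 0,
    "CAP_NET_ADMIN": 12,
    "CAP_SYS_MODULE": 16,
    "CAP_SYS_ADMIN": 21,
}


def parse_capbnd(status_text):
    """Extract the CapBnd hex mask from /proc/<pid>/status text as an int."""
    match = re.search(r"^CapBnd:\s*([0-9a-fA-F]+)\s*$", status_text, re.MULTILINE)
    return int(match.group(1), 16) if match else None


def has_capability(mask, name):
    """Return True if the named capability bit is set in the mask."""
    return bool((mask >> CAP_BITS[name]) & 1)


if __name__ == "__main__":
    try:
        with open("/proc/self/status") as handle:
            mask = parse_capbnd(handle.read())
    except OSError:
        mask = None
    if mask is None:
        print("CapBnd not available on this system")
    else:
        for cap in CAP_BITS:
            print(f"{cap}: {'present' if has_capability(mask, cap) else 'dropped'}")
```

For a hypothetical mask of 000001fffffeffff, bit 16 is cleared, so the sketch would report CAP_SYS_MODULE as dropped while CAP_SYS_ADMIN remains present.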

Linux Security Modules (LSM) is a Linux framework that enables plugging in a security model implementation for enforcing access control without becoming dependent on any specific model. Examples of LSMs are AppArmor and SELinux. While LXC supports both AppArmor and SELinux, LXD currently supports only AppArmor. The primary mechanism for LXD container isolation is namespaces, but an AppArmor profile is also installed to prevent cross-container access to resources like files.
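
On a host where AppArmor is the active LSM, the confinement of a process can be inspected through /proc/<pid>/attr/current, which typically contains either "unconfined" or a profile name followed by its mode in parentheses. The following sketch parses that format. The function name parse_label and the profile name used in the usage note are hypothetical, and on an SELinux system the same file holds an SELinux context instead, which this sketch returns unparsed.

```python
def parse_label(raw):
    """Split an AppArmor label such as 'name (enforce)' into (name, mode).

    Returns (label, None) when no mode suffix is present, for example for
    'unconfined' or for a label written by a different LSM.
    """
    label = raw.replace("\x00", "").strip()
    if label.endswith(")") and " (" in label:
        name, _, mode = label.rpartition(" (")
        return name, mode[:-1]  # drop the trailing ')'
    return label, None


if __name__ == "__main__":
    try:
        with open("/proc/self/attr/current") as handle:
            profile, mode = parse_label(handle.read())
        print(f"profile={profile!r} mode={mode!r}")
    except OSError:
        print("LSM label not available on this system")
```

For a hypothetical label "example-profile (enforce)", parse_label returns the pair ("example-profile", "enforce").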

The second part of the presentation covered checkpoint/restore of containers. The checkpoint and restore process saves the memory state of a running container and enables restoring it at a later time. The techniques used for checkpoint/restore involve looking deep into a process's state via system calls like ptrace. However, security measures like seccomp can block such calls, so checkpointing requires special handling.
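
Whether a process is subject to seccomp can be read from the Seccomp field of /proc/<pid>/status, where 0 means disabled, 1 means strict mode and 2 means filter mode. The sketch below only shows how that state could be detected before attempting to inspect a process. It is not a description of how the checkpoint/restore tooling discussed in the talk handles seccomp, and the function name seccomp_mode is an assumption made for this example.

```python
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def seccomp_mode(status_text):
    """Return the seccomp mode named in /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        # 'Seccomp_filters:' (newer kernels) does not match this prefix.
        if line.startswith("Seccomp:"):
            try:
                value = int(line.split(":", 1)[1].strip())
            except ValueError:
                return None
            return SECCOMP_MODES.get(value, f"unknown ({value})")
    return None


if __name__ == "__main__":
    try:
        with open("/proc/self/status") as handle:
            print("seccomp:", seccomp_mode(handle.read()))
    except OSError:
        print("process status not available on this system")
```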
