As container technology evolves, its implementation of security, isolation, and resource control has continually improved. In this article, we review how Docker containers achieve security and isolation using native container features of Linux such as namespaces, cgroups, and capabilities.
Virtualization and Isolation
Operating system-level virtualization, containers, zones, or even "chroot with steroids" are names for the same concept of user-space isolation. Products such as Docker make use of user-space isolation on top of OS-level virtualization facilities to provide extra security.
Since version 0.9, Docker includes the libcontainer library as its own way to directly use virtualization facilities provided by the Linux kernel, in addition to using abstracted virtualization interfaces via LXC, systemd-nspawn, and libvirt.
These virtualization libraries all utilize native container features of Linux, such as namespaces, cgroups, and capabilities (see Diagram above).
The default container format is called libcontainer. Docker also supports traditional Linux containers using LXC. In the future, Docker may support other container formats, for example, by integrating with BSD Jails or Solaris Zones.
An execution driver is the implementation of a specific container format and is used for running Docker containers. In the latest release, libcontainer:
- Is the default execution driver for running Docker containers
- Is shipped alongside the LXC driver
- Is a pure Go library which is developed to access the kernel’s container APIs directly, without any other dependencies
- Docker out of the box can now manipulate namespaces, control groups, capabilities, apparmor profiles, network interfaces and firewalling rules – all in a consistent and predictable way, and without depending on LXC or any other userland package.
- You provide a root filesystem and a configuration on how libcontainer is supposed to execute a container and it does the rest.
- It allows spawning new containers or attaching to an existing container.
- In fact, libcontainer delivered so much needed stability that the team decided to make it the default.
- As of Docker 0.9, LXC is optional.
- Note that the LXC driver will continue to be supported going forward.
- To switch back to the LXC driver, simply restart the Docker daemon with:
- docker -d -e lxc
Docker isn't virtualization, as such – instead, it's an abstraction on top of the kernel's support for namespaces, which provide the isolated workspace (or container). When you run a container, Docker creates a set of namespaces for that container.
Some of the namespaces that Docker uses on Linux are:
- pid namespace
- Used for process isolation (PID: Process ID).
- Processes running inside the container appear to be running on a normal Linux system although they are sharing the underlying kernel with processes located in other namespaces.
- net namespace
- Used for managing network interfaces (NET: Networking).
- DNAT allows you to configure your guest's networking independently of your host's and have a convenient interface for forwarding only the ports you want between them.
- However, you can replace this with a bridge to a physical interface.
- ipc namespace
- Used for managing access to IPC resources (IPC: InterProcess Communication).
- mnt namespace
- Used for managing mount-points (MNT: Mount).
- uts namespace
- Used for isolating kernel and version identifiers. (UTS: Unix Timesharing System).
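To make the namespace list above more concrete, here is a minimal Python sketch that lists the namespaces the current process belongs to. It relies on the links the Linux kernel exposes under /proc/self/ns, whose targets look like `pid:[4026531836]`. Two processes that show the same number for a given type share that namespace. The helper names `parse_ns_link` and `current_namespaces` are made up for this illustration and are not part of Docker or libcontainer.

```python
import os
import re

NS_DIR = "/proc/self/ns"  # per-process namespace links exposed by the kernel


def parse_ns_link(target):
    """Split a link target such as 'pid:[4026531836]' into (type, inode)."""
    match = re.fullmatch(r"(\w+):\[(\d+)\]", target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def current_namespaces(ns_dir=NS_DIR):
    """Return {entry name: namespace inode} for the calling process."""
    result = {}
    if not os.path.isdir(ns_dir):  # not Linux, or /proc is not mounted
        return result
    for name in sorted(os.listdir(ns_dir)):
        try:
            target = os.readlink(os.path.join(ns_dir, name))
            _ns_type, inode = parse_ns_link(target)
        except (OSError, ValueError):
            continue  # entry unreadable or in an unexpected format
        result[name] = inode
    return result


if __name__ == "__main__":
    for name, inode in current_namespaces().items():
        print(f"{name:20s} {inode}")
```

Running the script once on the host and once inside a container should print different numbers for the namespaces that Docker created for that container.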
cgroups (or Control Groups)
Docker on Linux also makes use of another kernel technology called cgroups. Because a container's processes are ordinary Linux processes, all normal Linux resource management facilities such as scheduling and cgroups apply to them. Furthermore, there is only one level of resource allocation and scheduling, because a containerized Linux system has only one kernel and that kernel has full visibility into the containers.
In summary, cgroups allow Docker to
- Group processes and manage their aggregate resource consumption
- Share available hardware resources among containers
- Limit the memory and CPU consumption of containers
- A container can be resized by simply changing the limits of its corresponding cgroup.
- You can gather information about resource usage in the container by examining Linux control groups in /sys/fs/cgroup.
- Provide a reliable way of terminating all processes inside a container.
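As a rough illustration of inspecting and resizing a container through its cgroup, the sketch below reads and writes the memory controller's files in a cgroup directory. It assumes the cgroup v1 file names `memory.usage_in_bytes` and `memory.limit_in_bytes` (cgroup v2 uses `memory.current` and `memory.max` instead). The directory that belongs to a given container depends on the Docker version and host setup, so it is passed in as a parameter rather than guessed. The helper names are hypothetical, and `demo()` runs against a temporary directory so it can be tried without root.

```python
import tempfile
from pathlib import Path


def read_int(path):
    """Read a cgroup control file that holds a single integer."""
    return int(Path(path).read_text().strip())


def memory_usage(cgroup_dir):
    """Current memory usage of the cgroup, in bytes (cgroup v1 file name)."""
    return read_int(Path(cgroup_dir) / "memory.usage_in_bytes")


def memory_limit(cgroup_dir):
    """Configured memory limit of the cgroup, in bytes (cgroup v1 file name)."""
    return read_int(Path(cgroup_dir) / "memory.limit_in_bytes")


def set_memory_limit(cgroup_dir, limit_bytes):
    """'Resize' the container by writing a new limit into its cgroup."""
    if limit_bytes <= 0:
        raise ValueError("limit must be a positive number of bytes")
    (Path(cgroup_dir) / "memory.limit_in_bytes").write_text(f"{limit_bytes}\n")


def demo():
    """Exercise the helpers against a fake cgroup directory (no root needed)."""
    with tempfile.TemporaryDirectory() as fake:
        (Path(fake) / "memory.usage_in_bytes").write_text("1048576\n")
        (Path(fake) / "memory.limit_in_bytes").write_text("268435456\n")
        set_memory_limit(fake, 512 * 1024 * 1024)  # raise the limit to 512 MiB
        return memory_usage(fake), memory_limit(fake)


if __name__ == "__main__":
    usage, limit = demo()
    print(f"usage={usage} bytes, limit={limit} bytes")
```

On a real host, the same functions would be pointed at the container's directory under /sys/fs/cgroup, and writing the limit normally requires root.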
Linux implements what are known as "POSIX capabilities". These capabilities partition the all-powerful root privilege into a set of distinct privileges. You can see the full list of available capabilities in the Linux man pages (capabilities(7)). Docker drops all capabilities except those needed, a whitelist rather than a blacklist approach.
Your average server (bare metal or virtual machine) needs to run a bunch of processes as root. Those typically include:
- Hardware management tools (e.g., load modules)
- Network configuration tools (e.g., to handle DHCP, WPA, or VPNs)
A container is very different, because almost all of those tasks are handled by the infrastructure around the container. By default, Docker starts containers with a restricted set of capabilities. In most cases, containers do not need “real” root privileges at all. For example, processes (like web servers) that just need to bind to a port below 1024 do not have to run as root: they can be granted the CAP_NET_BIND_SERVICE capability instead. Containers can therefore run with a reduced capability set, meaning that “root” within a container has far fewer privileges than the real “root”.
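To see what a reduced capability set looks like from inside a process, the sketch below decodes the `CapEff` (effective capabilities) bitmask that the kernel reports in /proc/self/status. The bit numbers come from the Linux capability definitions. Only a handful are named here for brevity, and any other bit that is set is printed as `cap_<n>`. The function names are illustrative assumptions, not a Docker API.

```python
import os

# A small subset of Linux capability bit numbers; unnamed bits are shown as cap_<n>.
CAP_BITS = {
    0: "CAP_CHOWN",
    10: "CAP_NET_BIND_SERVICE",
    12: "CAP_NET_ADMIN",
    16: "CAP_SYS_MODULE",
    21: "CAP_SYS_ADMIN",
}


def decode_caps(mask):
    """Turn a capability bitmask into a list of capability names."""
    names = []
    bit = 0
    while mask >> bit:  # stop once no higher bits are set
        if mask & (1 << bit):
            names.append(CAP_BITS.get(bit, f"cap_{bit}"))
        bit += 1
    return names


def effective_caps(status_text):
    """Extract and decode the CapEff line from /proc/<pid>/status content."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return decode_caps(int(line.split()[1], 16))
    return []


if __name__ == "__main__":
    status_path = "/proc/self/status"
    if os.path.exists(status_path):
        with open(status_path) as handle:
            print(effective_caps(handle.read()))
```

A process that was granted only CAP_NET_BIND_SERVICE would report a mask with just bit 10 set, which is enough to bind to port 80 without holding any of the other root privileges.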
Capabilities are just one of the many security features provided by modern Linux kernels. To harden a Docker host, you can also leverage other existing, well-known systems, for example AppArmor profiles, which Docker can manipulate out of the box.
- Docker uses layered filesystems such as AuFS.
- There is one key metric where Docker containers are weaker than virtual machines, and that is isolation. Intel’s VT-d and VT-x technologies provide virtual machines with ring -1 hardware isolation, of which they take full advantage. This helps prevent virtual machines from breaking down and interfering with each other.
- Control groups are exposed through a pseudo-filesystem. In recent distros, you should find this filesystem under /sys/fs/cgroup.
- On older systems, the control groups might be mounted on /cgroup, without distinct hierarchies.
- To figure out where your control groups are mounted, you can run:
- $ grep cgroup /proc/mounts
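The same lookup can be done programmatically. The following sketch parses the content of /proc/mounts and returns the mount points whose filesystem type is `cgroup` or `cgroup2`. It is slightly stricter than the grep command above, which also matches any other line that merely contains the word "cgroup" (for example a tmpfs mounted at /sys/fs/cgroup). The function name `cgroup_mounts` is an assumption made for this example.

```python
from pathlib import Path


def cgroup_mounts(mounts_text):
    """Return [(mount point, filesystem type)] for cgroup filesystems.

    Each line of /proc/mounts has the form:
    device mountpoint fstype options dump pass
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[2] in ("cgroup", "cgroup2"):
            found.append((fields[1], fields[2]))
    return found


if __name__ == "__main__":
    mounts = Path("/proc/mounts")
    if mounts.exists():
        for mount_point, fstype in cgroup_mounts(mounts.read_text()):
            print(f"{fstype:8s} {mount_point}")
```

On a system with distinct hierarchies this prints one line per controller (memory, cpu, and so on), while an older system with a single /cgroup mount would print only that one entry.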