The current adoption status of cgroup v2 in containers
Fedora 31 was released on October 29, 2019. This is the first major distro that comes with cgroup v2 (aka unified hierarchy) enabled by default, 5 years after it first appeared in Linux kernel 3.16 (Aug 3, 2014).
While the adoption of cgroup v2 is an inevitable step toward the 2020s, most container implementations, including Docker/Moby and Kubernetes, still don't support cgroup v2.
If you attempt to install and run Docker/Moby (sudo dnf install -y moby-engine), you will notice that the Docker daemon can no longer start up :(
$ sudo journalctl -u docker.service
...
dockerd[10141]: Error starting daemon: Devices cgroup isn't mounted
systemd[1]: docker.service: Main process exited, code=exited, status=1/FAILURE
TL;DR if you just want to start Docker on Fedora 31 right now
Update (December 9, 2020): Docker 20.10 supports cgroup v2 and works on Fedora by default. No need to change the cgroup configuration.
https://medium.com/nttlabs/docker-20-10-59cc4bd59d37
Run the following command and reboot:
$ sudo dnf install -y grubby && \
sudo grubby \
--update-kernel=ALL \
--args="systemd.unified_cgroup_hierarchy=0"
This command reverts the systemd configuration to use cgroup v1. Other cgroup-v1-based container software, including Kubernetes, requires this command as well.
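To check which cgroup version is actually in use after the reboot (a generic check, not specific to Fedora), look at the filesystem type mounted on /sys/fs/cgroup:
$ stat -fc %T /sys/fs/cgroup/
tmpfs
If cgroup v2 were still enabled, this would print cgroup2fs instead of tmpfs.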
Or if you don't want to roll back the cgroup version, you can try Podman instead of Docker.
$ podman run --rm docker.io/library/hello-world
This is the solution recommended by Fedora maintainers, but some caveats apply (discussed later).
So what's new in cgroup v2?
Simple architecture
cgroup v1 has an independent tree for each controller. e.g. a process can join group "foo" for CPU (/sys/fs/cgroup/cpu/foo) while joining group "bar" for memory (/sys/fs/cgroup/memory/bar). While this design seemed to provide good flexibility, it did not prove very useful in practice.
cgroup v2 focuses on simplicity: /sys/fs/cgroup/cpu/$GROUPNAME and /sys/fs/cgroup/memory/$GROUPNAME in v1 are now unified as /sys/fs/cgroup/$GROUPNAME, and a process can no longer join different groups for different controllers. If the process joins foo (/sys/fs/cgroup/foo), all controllers enabled for foo take control of the process.
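As a minimal sketch of the v2 interface (the group name foo and the limit values are arbitrary; on a systemd-managed host you would normally let systemd create groups rather than touch sysfs directly):
$ echo "+cpu +memory" | sudo tee /sys/fs/cgroup/cgroup.subtree_control
$ sudo mkdir /sys/fs/cgroup/foo
$ echo "50000 100000" | sudo tee /sys/fs/cgroup/foo/cpu.max    # 50% of one CPU
$ echo $((1024*1024*1024)) | sudo tee /sys/fs/cgroup/foo/memory.max    # 1 GiB
$ echo $$ | sudo tee /sys/fs/cgroup/foo/cgroup.procs    # move the current shell into foo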
eBPF-oriented
cgroup is not only for imposing limitations on CPU and memory usage; it also limits access to device files such as /dev/sda1. In cgroup v1, the device access control is implemented by writing a static configuration to /sys/fs/cgroup/devices. For example, to allow the processes in the foo group only read/write/mknod access to a character device with major=10, minor=200 (i.e. /dev/net/tun):
$ echo "a *:* rwm" > /sys/fs/cgroup/devices/foo/devices.deny
$ echo "c 10:200 rwm" > /sys/fs/cgroup/devices/foo/devices.allow
In cgroup v2, the device access control is implemented by attaching an eBPF program (BPF_PROG_TYPE_CGROUP_DEVICE) to the file descriptor of the /sys/fs/cgroup/foo directory. The example above for cgroup v1 becomes as follows in cgroup v2 eBPF (in cilium-flavored assembler syntax):
// * R1 refers to the struct {u32 access_type, u32 major, u32 minor}
// * R2 = type = (u16)access_type
// * R3 = access = access_type >> 16
// * R4 = major
// * R5 = minor
LdXMemH dst: r2 src: r1 off: 0
LdXMemW dst: r3 src: r1 off: 0
RSh32Imm dst: r3 imm: 16
LdXMemW dst: r4 src: r1 off: 4
LdXMemW dst: r5 src: r1 off: 8
block-0:
// return 1 if type == 'c' && major == 10 && minor == 200 (allow)
JNEImm dst: r2 imm: 2 <block-1>
JNEImm dst: r4 imm: 10 <block-1>
JNEImm dst: r5 imm: 200 <block-1>
Mov32Imm dst: r0 imm: 1
Exit
block-1:
// return 0 (deny)
Mov32Imm dst: r0 imm: 0
Exit
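The program is attached to an open file descriptor of the cgroup directory via the BPF_PROG_ATTACH command of the bpf(2) syscall. For illustration, the same can be done with bpftool, assuming the compiled program has been pinned to /sys/fs/bpf/allow_tun (a hypothetical path):
$ sudo bpftool cgroup attach /sys/fs/cgroup/foo device pinned /sys/fs/bpf/allow_tun
$ sudo bpftool cgroup show /sys/fs/cgroup/foo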
The eBPF support is not only for device access control. systemd uses cgroup v2 eBPF (BPF_PROG_TYPE_CGROUP_SKB) for implementing a firewall. See Kai Lüke's blog series for further information.
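For example, systemd's IPAddressDeny= and IPAddressAllow= unit directives are backed by such per-cgroup eBPF programs; a drop-in like the following (the unit name is just an example) restricts a service to local and private addresses:
$ cat /etc/systemd/system/myservice.service.d/firewall.conf
[Service]
IPAddressDeny=any
IPAddressAllow=localhost 10.0.0.0/8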
Friendly to rootless containers
Rootless containers allow running containers as a non-root user on the host to mitigate potential runtime vulnerabilities. Rootless containers became a trend this year; however, most rootless container implementations still don't support imposing resource quotas (e.g. docker run --cpus), because delegating cgroup v1 access to non-root users has been considered dangerous.
Think twice before delegating cgroup v1 controllers to less privileged containers. It’s not safe, you basically allow your containers to freeze the system with that and worse. Delegation is a strongpoint of cgroup v2 though, and there it’s safe to treat delegation boundaries as privilege boundaries.
— https://systemd.io/CGROUP_DELEGATION.html#some-donts
With the adoption of cgroup v2, rootless containers are officially gaining support for imposing resource quotas.
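On a systemd host, you can see which controllers are delegated to your unprivileged user as follows (the output shown corresponds to the default Delegate=pids memory configuration):
$ cat /sys/fs/cgroup/user.slice/user-$(id -u).slice/user@$(id -u).service/cgroup.controllers
memory pids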
Support for modern features
Some of the recent features added in the kernel are supported only on cgroup v2. For example, Pressure Stall Information (PSI), which appeared in kernel 4.20 (Dec 23, 2018), provides "pressure" files (a kind of loadavg, but different) such as /sys/fs/cgroup/foo/cpu.pressure only for the v2 hierarchy.
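The pressure files are plain text; for example, reading cpu.pressure on a mostly idle group returns something like:
$ cat /sys/fs/cgroup/foo/cpu.pressure
some avg10=0.00 avg60=0.00 avg300=0.00 total=0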
See also “Issues with v1 and Rationales for v2” in the kernel documentation for the detailed reason behind deprecating cgroup v1. If you can read Japanese, you should definitely take a look at Hiroyuki Kamezawa-san’s slidedeck as well.
Why couldn’t we migrate to cgroup v2 earlier?
cgroup v2 became official in Linux kernel 4.5 (March 13, 2016). However, it wasn't considered useful for containers until the release of kernel 5.2 (July 7, 2019), due to the lack of support for the device controller and the freezer. The lack of the device controller was the main blocker, because it means the root user in a container can break out of the container by directly accessing device files such as /dev/sda1. The lack of the freezer was also considered a major issue, because freezing containers is sometimes useful for preventing TOCTOU attacks that may result in container breakout. After the introduction of the v2 device controller in kernel 4.15 (Jan 28, 2018) and the v2 freezer in kernel 5.2, cgroup v2 is now considered ready for containers.
Also note that there was no easy migration path that could avoid breaking cgroup v1 containers, because cgroup v1 and v2 are incompatible and can't be enabled simultaneously. Although there is a "hybrid" configuration that allows mounting both the v1 and v2 hierarchies, the "hybrid" mode is of little use for containers because v2 controllers can't be enabled if they are already enabled for v1.
Adoption status of cgroup v2 in Linux distros
As far as I know, only Fedora 31 adopts cgroup v2 by default. But you can enable cgroup v2 right now on other distros as well, as long as you are running systemd ≥ v226 with kernel ≥ v4.2. Just add systemd.unified_cgroup_hierarchy=1 to the kernel arguments. I confirmed it works fine on Ubuntu 19.10 (kernel 5.3).
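For example, on Ubuntu this can be done with a GRUB drop-in (a sketch; adjust for your distro and bootloader):
$ echo 'GRUB_CMDLINE_LINUX="$GRUB_CMDLINE_LINUX systemd.unified_cgroup_hierarchy=1"' | \
  sudo tee /etc/default/grub.d/cgroupv2.cfg
$ sudo update-grub && sudo reboot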
I predict community-driven distros will switch to cgroup v2 by default in 2020–2021. Enterprise distros will probably stay on cgroup v1 until 2022–2023.
Adoption status in low-level container runtimes (OCI)
runc
runc, the reference implementation of the OCI Runtime Spec, gained initial support for cgroup v2 just last month (PR: #2113). This is not ready for production, especially because it lacks the implementation of the eBPF device controller (PR: #2145). Moreover, the current implementation is almost untested because of the lack of CI infrastructure with cgroup v2 enabled (Issue: #2124).
There is no official timeline, but I predict it will reach feature-complete status (w.r.t. the features implemented in crun) in mid-November on git master. More eyeballs are needed to review it toward production readiness. Also, before announcing general availability of cgroup v2 support, the OCI Runtime Spec probably needs to be amended (Issue: opencontainers/runtime-spec#1002).
Update (Nov 6, 2019): Now runc is almost feature-ready for cgroup v2 except rootless mode, but there are still a bunch of open issues. https://github.com/opencontainers/runc/issues
crun
crun is yet another implementation of the OCI Runtime Spec, led by Red Hat. crun already provides full support for cgroup v2, and hence is adopted as the default runtime in Fedora 31.
Adoption status in high-level container runtimes
containerd
containerd still doesn't work in cgroup v2 environments even with crun, because containerd-shim still doesn't support cgroup v2 (PR: containerd/cgroups#102). This work isn't hard and will be implemented soon on git master, but the official release with support for cgroup v2 (containerd 1.4) probably won't be available until early 2020.
Update (Nov 6, 2019): PR is ready: https://github.com/containerd/containerd/pull/3799
Docker / Moby
Docker / Moby will gain support for cgroup v2 as soon as runc and containerd gain it. Hopefully, we may be able to get a nightly Moby build that works with cgroup v2 by the end of this year, if everything goes well.
Update (Nov 6, 2019): PR is ready: https://github.com/moby/moby/pull/40174
Update (December 9, 2020): Docker 20.10 supports cgroup v2
https://medium.com/nttlabs/docker-20-10-59cc4bd59d37
Podman
Podman already supports cgroup v2 along with crun, and works like a charm without any extra configuration on Fedora 31.
But if you want to enable the CPU controller (podman run --cpus) for rootless mode, you need to modify the configuration for cgroup v2 delegation:
$ cat > /etc/systemd/system/user@.service.d/foo.conf << EOF
[Service]
# default: Delegate=pids memory
Delegate=pids memory cpu
EOF
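After creating the drop-in, reload systemd and log in again (or simply reboot) so that the new delegation takes effect, then verify that a CPU quota can be set in rootless mode (the alpine image is an arbitrary example):
$ sudo systemctl daemon-reload
$ podman run --rm --cpus 0.5 docker.io/library/alpine echo it works
it works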
The CLI of Podman is almost fully compatible with Docker's and can replace Docker in many use cases (alias docker=podman). But note that some caveats apply:
- API is not compatible (while Docker implements a REST API, Podman implements a varlink API)
- podman network create is not supported for rootless mode
- BuildKit is not integrated into podman build, i.e. no support for concurrent build, cache volumes, secrets, ssh-agent proxy…
- SwarmKit is not integrated
- Notary is not integrated
The biggest issue is the API incompatibility. If you have applications that call the Docker API, you can't migrate to Podman unless you rewrite the applications to execve the Docker/Podman CLI.
The second biggest drawback of Podman, I think, is the lack of BuildKit integration, but it is not a huge deal anyway, because BuildKit can be executed as a standalone tool and can export OCI tarballs that Podman can import:
$ buildctl-daemonless.sh build \
--frontend dockerfile.v0 \
--local dockerfile=. \
--local context=. \
--output type=oci \
| podman load foo
$ podman run -it --rm foo
BuildKit works fine in cgroup v2 environments, but requires crun to be used instead of runc.
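If you run buildkitd manually instead of using buildctl-daemonless.sh, the OCI worker's runtime binary can be selected with the --oci-worker-binary flag (a sketch, assuming crun is installed in $PATH):
$ buildkitd --oci-worker-binary=crun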
Adoption status in Kubernetes
CRI runtimes (containerd and CRI-O), kubelet, and cAdvisor will need to support cgroup v2. Giuseppe Scrivano, a maintainer of crun/Podman/CRI-O, is already preparing PRs (huge props!):
https://github.com/giuseppe/kubernetes/commits/cgroupv2
https://github.com/giuseppe/cadvisor/commits/libcontainer-cgroupv2
The Kubernetes Enhancement Proposal (KEP) for cgroup v2 will be officially available soon, according to Giuseppe. It will probably be merged as an "alpha" feature in 2020 (Kubernetes 1.18? 1.19?), and will graduate to GA in 2021–2022.
With the cgroup v2 KEP, we will also be able to bring Rootless Kubernetes ("Usernetes") to the upstream.
Update (Nov 18, 2019): KEP is now ready https://github.com/kubernetes/enhancements/pull/1370
Conclusion
- Migration to cgroup v2 might be a pain, but it is a necessary step.
- If you want to roll back to cgroup v1 due to compatibility issues, reboot the kernel with systemd.unified_cgroup_hierarchy=0.
- Podman+crun already supports cgroup v2, even for rootless containers.
- Docker/Moby+containerd+runc will follow soon. If everything goes well, we might be able to get nightly binaries for cgroup v2 by the end of 2019.
- Kubernetes with support for cgroup v2 will be available in the early 2020s.
We’re hiring!
NTT is looking for engineers who work in open source communities such as the Kubernetes & Docker projects. If you wish to work on such projects, please visit our recruitment page.
To learn more about NTT's contributions to open source projects, please visit our Software Innovation Center page. We have many maintainers and contributors in several open source projects.
Our offices are located in the downtown area of Tokyo (Tamachi, Shinagawa) and Musashino.