cgroup v2 in detail

Charles Vissol
25 min read · Jan 15, 2024


Credit: Charles Vissol

In this article, I will describe cgroup version 2 in detail.

This article follows previous articles on the same topic, which are necessary to understand this one:

To be honest, the subject is difficult, so you will need time to read and digest it. However, I hope it brings you a good understanding of the topic, along with useful practical command lines.

Deprecated cgroup version 1 core features

Compared to cgroup v1, what changes in cgroup v2?

  1. Multiple hierarchies including named ones are not supported.
  2. All v1 mount options are not supported.
  3. The “tasks” file is removed and cgroup.procs is not sorted.
  4. cgroup.clone_children is removed.
  5. /proc/cgroups is meaningless for v2. Use cgroup.controllers file at the root instead.
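
For the last point, for example, you can list the controllers available at the root of the v2 hierarchy directly (this matches the output shown later in this article):

$ cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory hugetlb pids rdma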

Understanding cgroup version 2

cgroup v1 has too many resource controllers and too many attributes per controller.

Some unnecessary controllers have been removed from cgroup version 2.

Also, the version 1 hierarchy is too complex because each controller has its own subdirectory inside the /sys/fs/cgroup filesystem.

Remember that when you wanted to control CPUQuota, memory, and BlockIOReadBandwidth for user 1001, you had 3 different locations:

#CPUQuota
/sys/fs/cgroup/cpu/user.slice/user-1001.slice/cpu.cfs_quota_us
#memory
/sys/fs/cgroup/memory/user.slice/user-1001.slice/memory.max_usage_in_bytes
#BlockIOReadBandwidth
/sys/fs/cgroup/blkio/user.slice/user-1001.slice/blkio.throttle.read_bps_device

Another problem with version 1 is that there is no consistent naming convention for the attribute files of the different resource controllers.

For example, the MemoryMax value is in the memory.max_usage_in_bytes file, but CPUQuota is in the cpu.cfs_quota_us file...

Important

In the Debian 11 man systemd.resource-control page, cgroup version 2 is referred to as the unified control group hierarchy, as opposed to cgroups version 1.
Access the man page running man systemd.resource-control.

Moving to cgroup version 2

To adopt cgroup version 2, you need at least systemd v226 and kernel v4.17.
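
You can quickly check both prerequisites, for example:

# Need systemd >= 226 and kernel >= 4.17
systemctl --version | head -n 1
uname -r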

If your system is compliant, to move to cgroup version 2, you first edit /etc/default/grub and add systemd.unified_cgroup_hierarchy=1 to the line beginning with GRUB_CMDLINE_LINUX=.

GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=1"

Next, rebuild the grub configuration:

sudo update-grub   # Debian equivalent of grub-mkconfig -o /boot/grub/grub.cfg

Reboot the machine and execute the following command to check the cgroup version:

$ mount | grep cgroup

Output should be something like this:

cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)

Important

At any time you can return to cgroup version 1 by editing the /etc/default/grub file and rebuilding the grub configuration.
If your system is already on cgroup v2, simply run the mount | grep command above to verify, but do not modify anything in the grub file.
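
Another quick check (an alternative to mount | grep cgroup) is to query the filesystem type directly; it prints cgroup2fs when the unified v2 hierarchy is mounted:

stat -fc %T /sys/fs/cgroup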

cgroup v2 overview

The cgroup design is based on a hierarchically organized filesystem (cgroupfs) where each directory represents a cgroup (i.e. a bounded group of processes). This filesystem starts at /sys/fs/cgroup, which is called the root control group. This root cgroup is the cgroup to which all processes belong.

Each cgroup is represented by a folder inside this cgroupfs hierarchy of folders (a tree structure), and each cgroup/folder has the same set of files, except for the root cgroup.

Adding or removing cgroups is done by adding or removing folders to/from cgroupfs.

By default, the root control group contains:

  • interface files (starting with cgroup.*)
  • controller-specific files such as cpuset.cpus.effective and cpuset.mems.effective.

Info

The cpuset.* files belong to the cpuset controller, which provides a mechanism for assigning a set of CPUs and memory nodes to a set of tasks. cpusets are mainly needed for the management of large computer systems, with many processors (CPUs), complex memory cache hierarchies and multiple memory nodes with non-uniform access times (NUMA).

  • cpuset.cpus.effective: List of the physical numbers of the CPUs on which processes in that cpuset are allowed to execute. For example, for an HP EliteBook G3 with a Core i5, the value of this file is 0-3.
  • cpuset.mems.effective: List of memory nodes on which processes in this cpuset are allowed to allocate memory.
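
On the laptop mentioned above, for example, reading these two files gives something like this (a single memory node 0 is an assumption for a typical non-NUMA laptop):

$ cat /sys/fs/cgroup/cpuset.cpus.effective
0-3
$ cat /sys/fs/cgroup/cpuset.mems.effective
0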

In addition, there are some directories related to systemd, such as, /sys/fs/cgroup/init.scope, /sys/fs/cgroup/system.slice, and /sys/fs/cgroup/user.slice.

On Debian 11, /sys/fs/cgroup contains the following files and directories:

$ ls -l /sys/fs/cgroup
total 0
-r--r--r-- 1 root root 0 Jul 27 16:16 cgroup.controllers
-rw-r--r-- 1 root root 0 Jul 27 16:16 cgroup.max.depth
-rw-r--r-- 1 root root 0 Jul 27 16:16 cgroup.max.descendants
-rw-r--r-- 1 root root 0 Jul 27 16:16 cgroup.procs
-r--r--r-- 1 root root 0 Jul 27 16:16 cgroup.stat
-rw-r--r-- 1 root root 0 Jul 26 07:24 cgroup.subtree_control
-rw-r--r-- 1 root root 0 Jul 27 16:16 cgroup.threads
-rw-r--r-- 1 root root 0 Jul 27 16:16 cpu.pressure
-r--r--r-- 1 root root 0 Jul 27 16:16 cpuset.cpus.effective
-r--r--r-- 1 root root 0 Jul 27 16:16 cpuset.mems.effective
-r--r--r-- 1 root root 0 Jul 27 16:16 cpu.stat
drwxr-xr-x 2 root root 0 Jul 26 07:24 dev-hugepages.mount
drwxr-xr-x 2 root root 0 Jul 26 07:24 dev-mqueue.mount
drwxr-xr-x 2 root root 0 Jul 27 16:16 init.scope
-rw-r--r-- 1 root root 0 Jul 27 16:16 io.cost.model
-rw-r--r-- 1 root root 0 Jul 27 16:16 io.cost.qos
-rw-r--r-- 1 root root 0 Jul 27 16:16 io.pressure
-r--r--r-- 1 root root 0 Jul 27 16:16 io.stat
-r--r--r-- 1 root root 0 Jul 27 16:16 memory.numa_stat
-rw-r--r-- 1 root root 0 Jul 27 16:16 memory.pressure
-r--r--r-- 1 root root 0 Jul 27 16:16 memory.stat
drwxr-xr-x 2 root root 0 Jul 26 07:25 proc-sys-fs-binfmt_misc.mount
drwxr-xr-x 2 root root 0 Jul 26 07:24 sys-fs-fuse-connections.mount
drwxr-xr-x 2 root root 0 Jul 26 07:24 sys-kernel-config.mount
drwxr-xr-x 2 root root 0 Jul 26 07:24 sys-kernel-debug.mount
drwxr-xr-x 2 root root 0 Jul 26 07:24 sys-kernel-tracing.mount
drwxr-xr-x 45 root root 0 Jul 27 16:10 system.slice
drwxr-xr-x 3 root root 0 Jul 26 07:25 user.slice

Interface files

The content of each core interface file (the cgroup.* files) existing in all cgroups is described in Control Group v2 — The Linux Kernel documentation.

To complete the list, that documentation also describes the core files that exist only on non-root cgroups (i.e. child cgroups).

CPU interface files

The cpu controller regulates the distribution of CPU cycles. This controller implements weight and absolute bandwidth limit models for normal scheduling policy and absolute bandwidth allocation model for real-time scheduling policy.

Time duration is in microseconds.

See Documentation/accounting/psi.rst for details.
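
For instance, you can read the aggregated CPU statistics of the root cgroup from its cpu.stat file (the values below are illustrative):

$ cat /sys/fs/cgroup/cpu.stat
usage_usec 531720345
user_usec 301812604
system_usec 229907741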

Memory interface files

The memory controller regulates distribution of memory. Memory is stateful and implements both limit and protection models.

Full description of memory interface files here: Control Group v2 — The Linux Kernel documentation

memory.high

A read-write single value file which exists on non-root cgroups. The default is max.

Memory usage throttle limit. This is the main mechanism to control memory usage of a cgroup. If a cgroup usage goes over the high boundary, the processes of the cgroup are throttled and put under heavy reclaim pressure.

Going over the high limit never invokes the OOM killer and under extreme conditions the limit may be breached.
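
As a minimal sketch (assuming a child cgroup named foo, as created later in this article, and running as root), you could throttle it at roughly 512 MiB and then remove the limit again:

# Throttle the hypothetical "foo" cgroup above ~512 MiB
echo 512M > /sys/fs/cgroup/foo/memory.high
# Remove the throttle limit
echo max > /sys/fs/cgroup/foo/memory.high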

Usage Guidelines

memory.high is the main mechanism to control memory usage.

Over-committing on the high limit (sum of high limits > available memory) and letting global memory pressure distribute memory according to usage is a viable strategy.

Because breach of the high limit doesn’t trigger the OOM killer but throttles the offending cgroup, a management agent has ample opportunities to monitor and take appropriate actions such as granting more memory or terminating the workload.

Determining whether a cgroup has enough memory is not trivial as memory usage doesn’t indicate whether the workload can benefit from more memory.

For example, a workload which writes data received from the network to a file can use all available memory but can also perform just as well with a small amount of memory.

A measure of memory pressure (how much the workload is being impacted due to lack of memory) is necessary to determine whether a workload needs more memory. Unfortunately, memory pressure monitoring mechanism isn’t implemented yet.

Memory Ownership

A memory area is charged to the cgroup which instantiated it and stays charged to the cgroup until the area is released. Migrating a process to a different cgroup doesn’t move the memory usages.

A memory area may be used by processes belonging to different cgroups. Which cgroup the area will be charged to is indeterminate. However, over time, the memory area is likely to end up in a cgroup which has enough memory allowance to avoid high reclaim pressure.

If a cgroup sweeps a considerable amount of memory which is expected to be accessed repeatedly by other cgroups, it may make sense to use POSIX_FADV_DONTNEED (see posix_fadvise(2) - Linux manual page) to relinquish the ownership of memory areas belonging to the affected files to ensure correct memory ownership.

IO interface files

The io controller regulates the distribution of IO resources. This controller implements both weight based and absolute bandwidth or IOPS limit distribution; however, weight based distribution is available only if cfq-iosched (See Kernel/Reference/IOSchedulers - Ubuntu Wiki) is in use and neither scheme is available for blk-mq devices.

The full list of IO controllers files is here: Control Group v2 — The Linux Kernel documentation

io.max

A read-write nested-keyed file which exists on non-root cgroups.

BPS and IOPS based IO limit. Lines are keyed by $MAJ:$MIN device numbers and not ordered. The following nested keys are defined:

  • rbps: Max read bytes per second
  • wbps: Max write bytes per second
  • riops: Max read IO operations per second
  • wiops: Max write IO operations per second

When writing, any number of nested key-value pairs can be specified in any order. max can be specified as the value to remove a specific limit. If the same key is specified multiple times, the outcome is undefined.

BPS and IOPS are measured in each IO direction and IOs are delayed if limit is reached. Temporary bursts are allowed.
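
As a sketch (assuming a child cgroup named foo with the io controller enabled, and running as root), limits can be set or removed per device like this:

# Limit reads to ~2 MB/s and writes to 120 IOPS on device 8:0 (sda in this article)
echo "8:0 rbps=2097152 wiops=120" > /sys/fs/cgroup/foo/io.max
# Remove the read bandwidth limit only
echo "8:0 rbps=max" > /sys/fs/cgroup/foo/io.max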

Writeback

Page cache is dirtied through buffered writes and shared mmaps and written asynchronously to the backing filesystem by the writeback mechanism. Writeback sits between the memory and IO domains and regulates the proportion of dirty memory by balancing dirtying and write IOs.

More at Control Group v2 — The Linux Kernel documentation

IO latency

This is a cgroup v2 controller for IO workload protection.

More at Control Group v2 — The Linux Kernel documentation

PID interface files

The process number controller is used to allow a cgroup to stop any new tasks from being fork()'d or clone()’d after a specified limit is reached.
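
With systemd, this controller is driven by the TasksMax= setting; for example (foo.service is a placeholder):

sudo systemctl set-property foo.service TasksMax=100
# The limit then appears in the pids.max file of the service's cgroup
cat /sys/fs/cgroup/system.slice/foo.service/pids.max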

More at Control Group v2 — The Linux Kernel documentation

Cpuset interface files

The cpuset controller provides a mechanism for constraining the CPU and memory node placement of tasks to only the resources specified in the cpuset interface files in a task’s current cgroup.

The cpuset controller is hierarchical: a child cpuset cannot use CPUs or memory nodes that are not allowed in its parent.

More at Control Group v2 — The Linux Kernel documentation

Device controller

The device controller manages access to device files. It covers both the creation of new device files (using mknod) and access to existing device files.

More at Control Group v2 — The Linux Kernel documentation

RDMA interface files

The rdma controller regulates the distribution and accounting of RDMA resources.

More at Control Group v2 — The Linux Kernel documentation

HugeTLB interface files

The HugeTLB controller allows limiting HugeTLB usage per control group and enforces the limit during page faults.

More at Control Group v2 — The Linux Kernel documentation

Misc interface files

The Miscellaneous cgroup provides the resource limiting and tracking mechanism for scalar resources which cannot be abstracted like the other cgroup resources. The controller is enabled by the CONFIG_CGROUP_MISC kernel config option.

More at Control Group v2 — The Linux Kernel documentation

Namespace

cgroup namespace provides a mechanism to virtualize the view of the /proc/$PID/cgroup file and cgroup mounts.

See Kernel documentation: Control Group v2 — The Linux Kernel documentation

Controlling resource usage with cgroup v2

After setting a value with systemctl set-property (as shown below), you can check that it was successfully applied by running:

systemctl show --property $parameter $user_service

Here:

  • $parameter is the resource-control parameter (for example MemoryMax)
  • $user_service is the user slice (for example user-1001.slice) or the service unit name
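
For example, to check the memory limit applied to the user-1001.slice unit used in the rest of this article:

systemctl show --property MemoryMax user-1001.slice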

Controlling CPU for a user

Let's cap CPU usage at 40% for user 1001. Setting CPUQuota uses the same command for cgroup version 1 and cgroup version 2:

sudo systemctl set-property user-1001.slice CPUQuota=40%
sudo systemctl daemon-reload

But in this case, the value is stored in the cpu.max file:

vissol@debian:~$ cat /sys/fs/cgroup/user.slice/user-1001.slice/cpu.max
40000 100000

The 40000 figure represents the 40% CPU quota (in microseconds of CPU time allowed per period), and 100000 represents the period over which the quota is measured. The default period, which you see here, is 100 milliseconds.

Controlling the memory usage for a user

Let's limit user 1001 to 1G of memory usage with the same command line as for cgroup version 1:

sudo systemctl set-property user-1001.slice MemoryMax=1G
sudo systemctl daemon-reload

In the cgroup version 2 case, the value is written to the memory.max file:

vissol@debian:~$ cat /sys/fs/cgroup/user.slice/user-1001.slice/memory.max
1073741824

Important

In cgroup version 2, MemoryMax replaces MemoryLimit from cgroup version 1.

You need to understand that this MemoryMax setting is a hard limit: user 1001 cannot use more memory than MemoryMax allows.

Note also that in the systemd.resource-control man page, for cgroup v2 there are two more parameters, MemoryLow and MemoryHigh, which act more as soft limits. They don't exist in cgroup version 1.
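
For example, to add a soft limit below the 1G hard limit set above (the 768M value is illustrative):

sudo systemctl set-property user-1001.slice MemoryHigh=768M
cat /sys/fs/cgroup/user.slice/user-1001.slice/memory.high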

Controlling IO usage for a user

IO parameters have changed: the cgroup version 1 prefix BlockIO turned to IO prefix in cgroup version 2.

As in version 1, you can set limits on an entire drive, but not on an individual partition.

Let's limit the read bandwidth for user 1001 with IOReadBandwidthMax:

sudo systemctl set-property user-1001.slice IOReadBandwidthMax="/dev/sda 1M"
sudo systemctl daemon-reload

Now let's look at the io.max file in the user-1001 slice directory, which should look like this:

vissol@debian:~$ cat /sys/fs/cgroup/user.slice/user-1001.slice/io.max
8:0 rbps=1000000 wbps=max riops=max wiops=max

Here, we see another benefit of using Version 2. Instead of having four separate attribute files for the four available parameter settings, as we had with Version 1, Version 2 places the IOReadBandwidthMax, IOWriteBandwidthMax, IOReadIOPSMax, and IOWriteIOPSMax settings all in one file.

Note that the 8:0 at the beginning of the line in this io.max file represents the major and minor numbers of the entire sda drive, as you can see here:

vissol@debian:~$ cd /dev
vissol@debian:/dev$ ls -l sd*
brw-rw---- 1 root disk 8, 0 Oct 10 08:06 sda
brw-rw---- 1 root disk 8, 1 Oct 10 08:06 sda1
brw-rw---- 1 root disk 8, 2 Oct 10 08:06 sda2
brw-rw---- 1 root disk 8, 3 Oct 10 08:06 sda3
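
You can also get the major:minor numbers directly from lsblk; the output will look roughly like this:

$ lsblk -o NAME,MAJ:MIN /dev/sda
NAME MAJ:MIN
sda    8:0
sda1   8:1
sda2   8:2
sda3   8:3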

Controlling resources usage for a service

The principle of controlling user resources applies fully to services: under the /sys/fs/cgroup/system.slice/ directory, each service has its own directory where you can find the same files as in the user-1001 directory.

The command to define a limit is the same as in version 1.

Note also that you can combine several parameters applied to a service or a user.

Example:

systemctl set-property httpd.service CPUWeight=600 MemoryMax=500M

Exhaustive list of cgroup version 2 controllers

The following controllers are available for cgroup version 2:

  • io - A follow-up to blkio of cgroup v1.
  • memory - A follow-up to memory of cgroup v1.
  • pids - Same as pids in cgroup v1.
  • rdma - Same as rdma in cgroup v1.
  • cpu - A follow-up to cpu and cpuacct of cgroup v1. Guarantees a minimum number of "CPU shares" when the system is busy. It also provides CPU "bandwidth" (kernel config CONFIG_CFS_BANDWIDTH=y) to define an upper limit on the CPU time allocated to the processes of a cgroup.
  • cpuset - Supports only the core functionality (cpus{,.effective}, mems{,.effective}) with a new partition feature.
  • perf_event - Support is inherent, no explicit control file. You can specify a v2 cgroup as a parameter to the perf command that will profile all the tasks within that cgroup.
  • freezer - Same as freezer in cgroup v1

See more via man cgroup or cgroups(7) - Linux manual page.

Parameters evolution from version 1 to version 2

For CPU controlling:

  • CPUWeight (version 2) replaces CPUShares (version 1)
  • StartupCPUWeight (version 2) replaces StartupCPUShares (version 1)

For Memory controlling:

  • MemoryMax (version 2) replaces MemoryLimit (version 1)
  • MemoryLow and MemoryHigh are new in version 2 and do not exist in version 1

For IO:

  • All the parameters start with IO (version 2) instead of BlockIO (version 1)

More about resources management…

Resources distribution model

Depending on your objectives, you can apply one or more of the following resource distribution models:

  • Weights

The resource is distributed by adding up the weights of all sub-groups and giving each sub-group the fraction matching its ratio against the sum.

For example, if you have 10 cgroups, each with a weight of 100, the sum is 1000 and each cgroup receives one tenth of the resource.

Weights are usually used to distribute stateless resources (resources that do not maintain persistent data across process invocations). The CPUWeight parameter is an implementation of this resource distribution model (see the example after this list).

  • Limits

A cgroup can consume up to the configured amount of the resource, but you can also overcommit resources. Therefore, the sum of sub-group limits can exceed the limit of the parent cgroup.

The MemoryMax option is an implementation of this resource distribution model.

  • Protections

A protected amount of a resource can be set up for a cgroup. If the resource usage is below the protection boundary, the kernel will try not to penalize this cgroup in favor of other cgroups that compete for the same resource. Overcommit is also allowed.

The MemoryLow option is an implementation of this resource distribution model.

  • Allocations

Exclusive allocation of an absolute amount of a finite resource. Overcommit is not allowed. The sum of the allocations of the children cannot exceed the amount of the resource available to the parent.
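
As a small illustration of the weight model (service names are placeholders), giving one service twice the weight of another means it receives roughly two thirds of the contended CPU time between the two when both are busy (weights of other units also count):

sudo systemctl set-property important.service CPUWeight=200
sudo systemctl set-property batch.service CPUWeight=100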

CPU time allocation

The most frequently used CPU time allocation policy options include:

  • CPUWeight

Assigns higher priority to a particular service over all other services. You can select a value from the interval 1–10,000. The default value is 100.

For example, to give httpd.service twice as much CPU as to all other services, set the value to CPUWeight=200.

Note that CPUWeight is applied only in cases when the operating system is overloaded.

  • CPUQuota

Assigns the absolute CPU time quota to a service. The value of this option specifies the maximum percentage of CPU time that a service will receive relative to the total CPU time available, for example CPUQuota=30%.

Note that CPUQuota= is an implementation of the limit resource distribution model described earlier.

For more information on CPUQuota=, see the man systemd.resource-control documentation.

Memory allocation

You can use the following options when using systemd to configure system memory allocation:

  • MemoryMin

Hard memory protection. If the memory usage is below the limit, the cgroup memory will not be reclaimed.

  • MemoryLow

Soft memory protection. If the memory usage is below the limit, the cgroup memory can be reclaimed only if no memory is reclaimed from unprotected cgroups.

  • MemoryHigh

Memory throttle limit. If the memory usage goes above the limit, the processes in the cgroup are throttled and put under a heavy reclaim pressure.

  • MemoryMax

Absolute limit for the memory usage. You can use the kilo (K), mega (M), giga (G), tera (T) suffixes, for example MemoryMax=1G.

  • MemorySwapMax

Hard limit on the swap usage.
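
A combined sketch (foo.service is a placeholder, values are illustrative):

sudo systemctl set-property foo.service MemoryMin=256M MemoryLow=512M MemoryHigh=768M MemoryMax=1G MemorySwapMax=0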

Info

When you exhaust your memory limit, the Out-of-memory (OOM) killer will stop the running service. To prevent this, lower the OOMScoreAdjust value to increase the memory tolerance.

IO bandwidth configuration

To manage the block layer I/O policies with systemd, the following configuration options are available:

  • IOWeight

Sets the default I/O weight. The weight value is used as a basis for the calculation of how much of the real I/O bandwidth the service receives in relation to the other services.

  • IODeviceWeight

Sets the I/O weight for a specific block device.

For example, IODeviceWeight=/dev/disk/by-id/dm-name-root 200.

  • IOReadBandwidthMax, IOWriteBandwidthMax

Sets the absolute bandwidth per device or a mount point.

For example, IOWriteBandwidthMax=/var/log 5M

  • IOReadIOPSMax, IOWriteIOPSMax

Similar to the previous options, but sets the absolute limit in Input/Output Operations Per Second (IOPS).
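
A short sketch with placeholder unit and device names:

sudo systemctl set-property foo.service IODeviceWeight="/dev/sda 200"
sudo systemctl set-property foo.service IOWriteBandwidthMax="/dev/sda 5M"
sudo systemctl set-property foo.service IOReadIOPSMax="/dev/sda 1000"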

Important

Weight-based options are supported only if the block device is using the CFQ I/O scheduler. No option is supported if the device uses the Multi-Queue Block I/O queuing mechanism.

Configuring CPUSET controller using systemd

The systemd resource management API allows the user to configure limits on a set of CPUs and NUMA (Non Uniform Memory Access) nodes that a service can use. This limit restricts access to system resources utilized by the processes. The requested configuration is written in cpuset.cpus and cpuset.mems. However, the requested configuration may not be used, as the parent cgroup limits either cpus or mems. To access the current configuration, the cpuset.cpus.effective and cpuset.mems.effective files are exported to the users.

  • To set AllowedCPUs:

systemctl set-property $service_name.service AllowedCPUs=value

For example:

systemctl set-property $service_name.service AllowedCPUs=0-5

  • To set AllowedMemoryNodes:

systemctl set-property $service_name.service AllowedMemoryNodes=value

For example:

systemctl set-property $service_name.service AllowedMemoryNodes=0

Mounting the cgroup v2 filesystem

Because cgroup has a virtual file system, you can mount it like this:

mount -t cgroup2 none $MOUNT_POINT

  • Option -t indicates the filesystem type: here it is a cgroup v2 filesystem
  • The none argument means that there is no physical partition linked to the mount point

cgroup v2 has several specific mount options, such as nsdelegate and memory_recursiveprot.

The full list of cgroup v2-specific mount options is given in the kernel documentation.

These options are set at mount time with the -o option, which takes a comma-separated list of generic options such as noatime, nodev, nosuid, noexec, etc. plus, for cgroup v2, one or more of the cgroup2-specific options (such as nsdelegate or memory_recursiveprot).

In a Debian 11 environment cgroup v2 is typically mounted like this:

$ mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)

cgroup v2 is mounted by default on Debian 11, so it is not necessary to mount it yourself. However, if you ever need to do it, run the following command:

mount -t cgroup2 -o rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot none /sys/fs/cgroup

The cgroup version 2 hierarchy in practice

Each cgroup version 2 hierarchy contains:

  • cgroup.controllers: read-only file containing the list of controllers available in this cgroup, i.e. the controllers that can be enabled for its children. For a child cgroup, its content is determined by the parent's cgroup.subtree_control file.
  • cgroup.subtree_control: contains the list of controllers enabled for the cgroup's children (when writing to it, use a + prefix to enable a controller and a - prefix to disable it). This set is always a subset of the cgroup.controllers list.

On Debian 11, the default list of controllers is:

$ cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory hugetlb pids rdma

This is the full list of controllers available on the platform for all cgroups (children of the root control group).

But if you create a child cgroup where you want specific controllers to be enabled, you need to:

  1. Enable the controllers you want to apply to the child group. Here, for example, cpu and cpuset to control CPU consumption (and disable io):
$ cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory hugetlb pids rdma

$ # Note: 'sudo echo "+cpu" >> file' would fail because the redirection is not performed as root
$ echo "+cpu" | sudo tee -a /sys/fs/cgroup/cgroup.subtree_control
$ echo "+cpuset" | sudo tee -a /sys/fs/cgroup/cgroup.subtree_control
$ echo "-io" | sudo tee -a /sys/fs/cgroup/cgroup.subtree_control

Important

Resources are distributed top-down and a cgroup can further distribute a resource only if the resource has been distributed to it from the parent. This means that all non-root “cgroup.subtree_control” files can only contain controllers which are enabled in the parent’s “cgroup.subtree_control” file. A controller can be enabled only if the parent has the controller enabled and a controller can’t be disabled if one or more children have it enabled.

2. Create the child cgroup sub-directory (here, the foo cgroup):

mkdir /sys/fs/cgroup/foo/

The kernel automatically populates the folder with the control files:

$ ll /sys/fs/cgroup/foo/
-r--r--r--. 1 root root 0 Jun 1 10:33 cgroup.controllers
-r--r--r--. 1 root root 0 Jun 1 10:33 cgroup.events
-rw-r--r--. 1 root root 0 Jun 1 10:33 cgroup.freeze
-rw-r--r--. 1 root root 0 Jun 1 10:33 cgroup.max.depth
-rw-r--r--. 1 root root 0 Jun 1 10:33 cgroup.max.descendants
-rw-r--r--. 1 root root 0 Jun 1 10:33 cgroup.procs
-r--r--r--. 1 root root 0 Jun 1 10:33 cgroup.stat
-rw-r--r--. 1 root root 0 Jun 1 10:33 cgroup.subtree_control
...
-rw-r--r--. 1 root root 0 Jun 1 10:33 cpuset.cpus
-r--r--r--. 1 root root 0 Jun 1 10:33 cpuset.cpus.effective
-rw-r--r--. 1 root root 0 Jun 1 10:33 cpuset.cpus.partition
-rw-r--r--. 1 root root 0 Jun 1 10:33 cpuset.mems
-r--r--r--. 1 root root 0 Jun 1 10:33 cpuset.mems.effective
-r--r--r--. 1 root root 0 Jun 1 10:33 cpu.stat
-rw-r--r--. 1 root root 0 Jun 1 10:33 cpu.weight
-rw-r--r--. 1 root root 0 Jun 1 10:33 cpu.weight.nice
...
-r--r--r--. 1 root root 0 Jun 1 10:33 memory.events.local
-rw-r--r--. 1 root root 0 Jun 1 10:33 memory.high
-rw-r--r--. 1 root root 0 Jun 1 10:33 memory.low
...
-r--r--r--. 1 root root 0 Jun 1 10:33 pids.current
-r--r--r--. 1 root root 0 Jun 1 10:33 pids.events
-rw-r--r--. 1 root root 0 Jun 1 10:33 pids.max

The output shows files such as cpuset.cpus and cpu.max (some entries are elided above). These files are specific to the cpuset and cpu controllers, which were manually enabled for the root's (/sys/fs/cgroup/) direct child control groups using the /sys/fs/cgroup/cgroup.subtree_control file in step 1.

The directory also includes general cgroup.* control interface files such as cgroup.procs or cgroup.controllers, which are common to all control groups regardless of the enabled controllers.

Files such as memory.high and pids.max relate to the memory and pids controllers, which are enabled by default in the root control group (/sys/fs/cgroup/).

By default, the newly created child group inherits access to all of the system’s CPU and memory resources, without any limits.

3. Enable the CPU-related controllers in /sys/fs/cgroup/foo/ so that its children only get controllers relevant to CPU:

echo "+cpu" >> /sys/fs/cgroup/foo/cgroup.subtree_control
echo "+cpuset" >> /sys/fs/cgroup/foo/cgroup.subtree_control

These commands ensure that the immediate child control groups will only have controllers relevant to regulating CPU time distribution, not the memory or pids controllers.

4. Create the /sys/fs/cgroup/foo/tasks/ directory:

mkdir /sys/fs/cgroup/foo/tasks/

The /sys/fs/cgroup/foo/tasks/ directory defines a child group with files that relate only to cpu and cpuset controllers.

5. Inspect the newly created folder

$ ll /sys/fs/cgroup/foo/tasks
-r--r--r--. 1 root root 0 Jun 1 11:45 cgroup.controllers
-r--r--r--. 1 root root 0 Jun 1 11:45 cgroup.events
-rw-r--r--. 1 root root 0 Jun 1 11:45 cgroup.freeze
-rw-r--r--. 1 root root 0 Jun 1 11:45 cgroup.max.depth
-rw-r--r--. 1 root root 0 Jun 1 11:45 cgroup.max.descendants
-rw-r--r--. 1 root root 0 Jun 1 11:45 cgroup.procs
-r--r--r--. 1 root root 0 Jun 1 11:45 cgroup.stat
-rw-r--r--. 1 root root 0 Jun 1 11:45 cgroup.subtree_control
-rw-r--r--. 1 root root 0 Jun 1 11:45 cgroup.threads
-rw-r--r--. 1 root root 0 Jun 1 11:45 cgroup.type
-rw-r--r--. 1 root root 0 Jun 1 11:45 cpu.max
-rw-r--r--. 1 root root 0 Jun 1 11:45 cpu.pressure
-rw-r--r--. 1 root root 0 Jun 1 11:45 cpuset.cpus
-r--r--r--. 1 root root 0 Jun 1 11:45 cpuset.cpus.effective
-rw-r--r--. 1 root root 0 Jun 1 11:45 cpuset.cpus.partition
-rw-r--r--. 1 root root 0 Jun 1 11:45 cpuset.mems
-r--r--r--. 1 root root 0 Jun 1 11:45 cpuset.mems.effective
-r--r--r--. 1 root root 0 Jun 1 11:45 cpu.stat
-rw-r--r--. 1 root root 0 Jun 1 11:45 cpu.weight
-rw-r--r--. 1 root root 0 Jun 1 11:45 cpu.weight.nice
-rw-r--r--. 1 root root 0 Jun 1 11:45 io.pressure
-rw-r--r--. 1 root root 0 Jun 1 11:45 memory.pressure

6. Ensure the processes that you want to control for CPU time compete on the same CPU:

echo "1" > /sys/fs/cgroup/foo/tasks/cpuset.cpus

The previous command ensures that the processes you will place in the foo/tasks child control group compete on the same CPU. This setting is important for the cpu controller to activate.

Important

The cpu controller is only activated if the relevant child control group has at least 2 processes which compete for time on a single CPU.
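
To see it in action, you could place a couple of CPU-hungry processes into the child group and cap them; a minimal sketch, run as root (the 20% quota is illustrative):

# Move the current shell (and its future children) into the child cgroup
echo $$ > /sys/fs/cgroup/foo/tasks/cgroup.procs
# Cap the group at 20% of one CPU: 20 ms of CPU time every 100 ms period
echo "20000 100000" > /sys/fs/cgroup/foo/tasks/cpu.max
# Two competing busy loops started from this shell land in the same cgroup
( while :; do :; done ) & ( while :; do :; done ) &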

Processes

Initially, only the root cgroup exists, to which all processes belong. A child cgroup can be created by creating a sub-directory:

mkdir $CGROUP_NAME

When you create any child directory of the root control group, it is automatically populated with control files. You can create a complete tree of cgroup nodes, where each cgroup has a read-writable interface file, cgroup.procs, listing the PIDs of all processes belonging to the cgroup, one per line.

The PIDs are not ordered and the same PID may show up more than once if the process got moved to another cgroup and then back or the PID got recycled while reading.

A process can be migrated into another cgroup by writing its PID to the target cgroup's cgroup.procs file. If a process is composed of multiple threads, writing the PID of any thread migrates all threads of the process.

After exiting, a process stays associated with the cgroup that it belonged to at the time of exit until it's reaped; however, a zombie process does not appear in cgroup.procs and thus can't be moved to another cgroup.

A cgroup which doesn't have any children or live processes can be destroyed by removing the directory.

Info

Note that a cgroup which doesn't have any children and is associated only with zombie processes is considered empty and can be removed (at the latest, it disappears at system reboot).

rmdir $CGROUP_NAME

For any PID, you can find its position in the cgroup hierarchy in /proc/$PID/cgroup. On a pure cgroup v2 system, this file contains a single line in the format 0::$PATH (it contains additional lines only when v1 hierarchies are also mounted).

Example:

cat /proc/34715/cgroup
0::/user.slice/user-1000.slice/user@1000.service/app.slice/app-marktext-52feaf77faa24c5c80878368e88bf7ff.scope

Here we see that the marktext process (a Markdown editor) runs under the user 1000 hierarchy as an application (which is why its cgroup gets a *.scope suffix). See the systemd chapter for more information.

If the process becomes a zombie and the associated cgroup is removed subsequently, (deleted) is appended to the path:

cat /proc/34715/cgroup
0::/user.slice/user-1000.slice/user@1000.service/app.slice/app-marktext-52feaf77faa24c5c80878368e88bf7ff.scope (deleted)

Threads

cgroup v2 supports thread granularity for a subset of controllers.

By default, all threads of a process belong to the same cgroup but the thread mode allows threads to be spread across a subtree while still maintaining the common resource domain for them.

Controllers which support thread mode are called threaded controllers (the others are called domain controllers).

In a cgroup, the cgroup.type file indicates whether the cgroup is a normal domain (value domain) or a threaded cgroup (value threaded).

On creation, a cgroup is always a domain and can be made threaded by writing threaded to the cgroup.type file. Once threaded, the cgroup can't be turned back into a domain:

echo threaded > cgroup.type

To enable the thread mode, you must meet the following conditions:

  • As the cgroup will join the parent's resource domain, the parent must either be a valid (threaded) domain or a threaded cgroup.
  • When the parent is an unthreaded domain, it must not have any domain controllers enabled or populated domain children. The root cgroup is exempt from this requirement.

A domain cgroup is turned into a threaded domain when one of its child cgroups becomes threaded, or when threaded controllers are enabled in its cgroup.subtree_control file while there are processes in the cgroup. A threaded domain reverts to a normal domain when these conditions clear.

cgroup.threads contains the list of thread IDs of all threads in the cgroup. If there are no threads, this file is empty. cgroup.threads can be written to in any cgroup, but as it can only move threads inside the same threaded domain, its operations are confined to each threaded subtree.

Create or remove cgroup

When you create a new cgroup, you choose the name of the cgroup and you attach one of the controllers above to the new cgroup.

The name of the cgroup is $controllerResource-$cgroupName

  • $controllerResource: one of the controllers listed above, for example memory
  • $cgroupName: name you can choose

When you remove a cgroup, you simply delete the corresponding cgroupfs folder, but first ensure that:

  • there are no child cgroups
  • there are no live member processes (zombie processes do not prevent removal)

cgroup v2 release notification

To be precise, the kernel provides notifications when a cgroup becomes empty (no child cgroups and no member processes).

In cgroup v1, this mechanism is managed by 2 files:

  • release_agent: in the root directory of each cgroup hierarchy. By default this file is empty, or it contains the pathname of a program invoked when a cgroup in the hierarchy becomes empty.
  • notify_on_release: in the corresponding cgroup directory. If this file contains the value 0, the release_agent is not invoked; if it contains the value 1, the release_agent is invoked.

In cgroup v2, these two files are replaced by the populated key of the cgroup.events file, described in the next section.

cgroups v2 cgroup.events file

The populated key has been available since cgroup v2 was introduced; the frozen key was introduced in Linux 5.2.

Each non-root cgroup contains a read-only cgroup.events file containing key-value pairs:

  • populated 1: the cgroup or any of its descendants has member processes; otherwise populated 0
  • frozen 1: the cgroup is frozen; otherwise frozen 0

cgroups v2 cgroup.stat file

Feature introduced in Linux 4.14.

Read-only file in each cgroup containing key-value pairs:

  • nr_descendants key: value is the total number of visible (i.e. living) descendant cgroups underneath this cgroup
  • nr_dying_descendants key: value is the total number of dying descendant cgroups underneath this cgroup

Limiting the number of descendant cgroups

Feature introduced in Linux 4.14.

Each cgroup contains 2 files (cgroup.max.depth, cgroup.max.descendants) used to view and set limits on the number of descendant cgroup nodes under that cgroup.

  • cgroup.max.depth file: defines a limit on the depth of nesting of descendant cgroups. A value of 0 in this file means that no descendant cgroups can be created
  • cgroup.max.descendants file: defines a limit on the number of live descendant cgroups that this cgroup may have.

For both files, writing the string max means that no limit is imposed; max is also the default value.
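
For example (run as root, with the foo cgroup created earlier; values are illustrative):

# Allow at most 5 live descendant cgroups and 2 levels of nesting under foo
echo 5 > /sys/fs/cgroup/foo/cgroup.max.descendants
echo 2 > /sys/fs/cgroup/foo/cgroup.max.depth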

cgroups v2 delegation process

With cgroup v2, it is possible to delegate the management of a subtree of the cgroup hierarchy to non-privileged users.

A cgroup can be delegated in two ways. First, to a less privileged user by granting write access of the directory and its cgroup.procs, cgroup.threads and cgroup.subtree_control files to the user. Second, if the nsdelegate mount option is set, automatically to a cgroup namespace on namespace creation.

  • cgroup v1 supports delegation based on file permissions in the cgroup hierarchy.
  • cgroup v2 supports delegation with containment by explicit design. The unprivileged delegatee can move processes between cgroups within the delegated subtree, but not from outside into the delegated subtree or vice versa. A consequence of these containment rules is that the unprivileged delegatee can't place the first process into the delegated subtree: the delegater must place the first process into the delegated subtree.

The cgroup v2 delegation process is based on delegation rules, i.e. the delegater makes certain directories and files writable by the delegatee.

Assuming that we want to delegate the hierarchy rooted at (say) /dlgt_grp and that there are not yet any child cgroups under that cgroup, the ownership of the following is changed to the user ID of the delegatee:

  • /dlgt_grp: changing the ownership of the root of the subtree means that any new cgroups created under the subtree (and the files they contain) will also be owned by the delegatee
  • /dlgt_grp/cgroup.procs: changing the ownership of this file means that the delegatee can move processes into the root of the delegated subtree
  • /dlgt_grp/cgroup.subtree_control: Changing the ownership of this file means that the delegatee can enable controllers (that are present in /dlgt_grp/cgroup.controllers) in order to further redistribute resources at lower levels in the subtree.
  • /dlgt_grp/cgroup.threads: Changing the ownership of this file is necessary if a threaded subtree is being delegated. This permits the delegatee to write thread IDs to the file to move a thread between domain cgroups
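
In shell terms, the delegation above boils down to something like this (alice is a placeholder for the delegatee; run as root):

chown alice /sys/fs/cgroup/dlgt_grp
chown alice /sys/fs/cgroup/dlgt_grp/cgroup.procs
chown alice /sys/fs/cgroup/dlgt_grp/cgroup.subtree_control
chown alice /sys/fs/cgroup/dlgt_grp/cgroup.threads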

cgroup v2 adds a second way to perform cgroup delegation: mounting or remounting the cgroup filesystem with the nsdelegate mount option. The effect of this option is that cgroup namespaces automatically become delegation boundaries.

Note: On some systems, systemd automatically mounts the cgroup v2 filesystem. In order to experiment with the nsdelegate operation, it may be useful to boot the kernel with the following command-line options:

cgroup_no_v1=all systemd.legacy_systemd_cgroup_controller

These options cause the kernel to boot with the cgroup v1 controllers disabled (meaning that the controllers are available in the v2 hierarchy), and tell systemd not to mount and use the cgroup v2 hierarchy, so that the v2 hierarchy can be manually mounted with the desired options after boot-up.

cgroup for resource management & performance

cgroup v2 enforces a so-called "no internal processes" rule. This rule means that, with the exception of the root cgroup, processes may reside only in leaf nodes (cgroups that do not themselves contain child cgroups).

More precisely, the rule is that a (non-root) cgroup can't both have member processes and distribute resources into child cgroups, that is, have a non-empty cgroup.subtree_control file. Thus, it is possible for a cgroup to have both member processes and child cgroups, but before controllers can be enabled for that cgroup, the member processes must be moved out of it (for example, into child cgroups).

With the Linux 4.14 addition of “thread mode”, the “no internal processes” rule has been relaxed in some cases.

References

Here are the official references for cgroup v2:

  • Control Group v2 — The Linux Kernel documentation: https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html
  • cgroups(7) — Linux manual page: https://man7.org/linux/man-pages/man7/cgroups.7.html
  • systemd.resource-control(5): man systemd.resource-control
