Cgroup v1 in detail

Charles Vissol
10 min read · Dec 24, 2023


(Credit: Charles Vissol)

In this article, I describe in detail version 1 of cgroup and its usage with systemd.
This article requires prior knowledge of systemd and cgroup: see my previous articles.

cgroup filesystem

The cgroup file system lives in /sys/fs/cgroup, a virtual file system (it exists only in memory and disappears when you shut down the machine).

In the /sys/fs/cgroup directory you see something like this:

blkio         
cpu -> cpu,cpuacct
cpu,cpuacct
cpuacct -> cpu,cpuacct
cpuset
devices
freezer
hugetlb
memory
net_cls -> net_cls,net_prio
net_cls,net_prio
net_prio -> net_cls,net_prio
perf_event
pids
rdma
systemd
unified

Each of these directories represents a cgroup resource controller (also called a subsystem or simply a controller).

Controllers

The following controllers are available in cgroup v1:

  • blkio: Short for Block Input/Output. It lets you set limits on how fast processes and users can read from or write to block devices. (A block device is something such as a hard drive or a hard drive partition.)
  • cpu, cpuacct: these two controllers are combined into a single one. This controller lets you control CPU usage for either processes or users. On a multi-tenant system, it also allows you to monitor users' CPU usage.
  • cpuset: on systems with multiple CPU cores, this allows you to assign a process to one specific CPU core or a set of CPU cores. This enhances performance by forcing a process to use a portion of the CPU cache that has already been filled with the data and the instructions the process needs. By default, the Linux kernel scheduler can move processes from one CPU core to another, or from one set of CPU cores to another. Every time this happens, the running process must access main system memory to refill the CPU cache. This costs extra CPU cycles, which can hurt performance.
  • devices: allows controlling access to system devices.
  • freezer: allows you to suspend running processes in a cgroup. This is useful when you need to move a process from one cgroup to another.
  • memory: allows you to set limits on the amount of system memory that a process or a user can use. It also generates automatic reports on the memory resources used by those processes and users.
  • net_cls, net_prio: allows you to tag network packets with a class identifier (classid) so that the Linux traffic controller (the tc command) and Linux firewalls can identify packets originating from a particular cgroup. This is useful to control and prioritize network traffic for various cgroups.
  • pids: can set a limit on the number of processes that can run in a cgroup.
  • perf_event: can group tasks for monitoring by the perf performance monitoring and reporting utility.
  • rdma: Remote Direct Memory Access allows one computer to directly access the memory of another computer without involving either computer's operating system. This controller is mainly used on parallel computing clusters.
  • hugetlb: allows you to limit the usage of huge memory pages by the processes in a cgroup.
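To check which controllers a given process is attached to, you can read its cgroup membership from /proc. Here is a minimal check; the exact hierarchy numbers and paths vary from one system to another, so the output below is only illustrative:

$ cat /proc/self/cgroup

12:pids:/user.slice/user-1001.slice/session-3.scope
11:blkio:/user.slice
...
1:name=systemd:/user.slice/user-1001.slice/session-3.scope
0::/user.slice/user-1001.slice/session-3.scope

Each line shows a hierarchy ID, the controller(s) mounted on it, and the cgroup path of the process in that hierarchy.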

Tuning options

Each controller directory has a set of files representing the cgroup tuning options. These files hold information about any resource control or tuning parameters that you would set.

For example, in the blkio directory you find:

blkio.bfq.io_service_bytes
blkio.bfq.io_service_bytes_recursive
blkio.bfq.io_serviced
blkio.bfq.io_serviced_recursive
blkio.reset_stats
blkio.throttle.io_service_bytes
blkio.throttle.io_service_bytes_recursive
blkio.throttle.io_serviced
blkio.throttle.io_serviced_recursive
blkio.throttle.read_bps_device
blkio.throttle.read_iops_device
blkio.throttle.write_bps_device
blkio.throttle.write_iops_device
cgroup.clone_children
cgroup.procs
cgroup.sane_behavior
init.scope
machine.slice
notify_on_release
release_agent
system.slice
tasks
user.slice

Each file represents a parameter you can tune for better performance. We also see directories such as init.scope, machine.slice, system.slice, and user.slice, each with its own set of tuning parameters.
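For illustration, here is a minimal sketch of tuning one of these parameters by hand, without systemd. The cgroup name mygroup is purely hypothetical, and the values assume device 8:0 (/dev/sda) and a 1 MiB/s read limit:

# Run as root: create a child cgroup in the blkio hierarchy
mkdir /sys/fs/cgroup/blkio/mygroup
# Throttle reads from device 8:0 to 1 MiB/s for this cgroup
echo "8:0 1048576" > /sys/fs/cgroup/blkio/mygroup/blkio.throttle.read_bps_device
# Move the current shell into the cgroup; its future children inherit it
echo $$ > /sys/fs/cgroup/blkio/mygroup/cgroup.procs

In the rest of this article, systemd does this kind of work for us.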

Controlling resources

Resource controllers v1

To examine the resource controllers of cgroup v1, you should install cgroup-tools:

sudo apt install cgroup-tools

Once installed, you can run lssubsys to view the active resource controllers:

$ lssubsys

cpuset
cpu,cpuacct
blkio
memory
devices
freezer
net_cls,net_prio
perf_event
hugetlb
pids
rdma

In this case all resource controllers of cgroup v1 are active.
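If your cgroup-tools version supports it, you can also ask lssubsys where each hierarchy is mounted with the -m option:

$ lssubsys -m

cpuset /sys/fs/cgroup/cpuset
cpu,cpuacct /sys/fs/cgroup/cpu,cpuacct
blkio /sys/fs/cgroup/blkio
memory /sys/fs/cgroup/memory
...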

If you go to the /sys/fs/cgroup directory, you see that each resource controller has its own directory:

$ ls /sys/fs/cgroup

blkio
cpu -> cpu,cpuacct
cpu,cpuacct
cpuacct -> cpu,cpuacct
cpuset
devices
freezer
hugetlb
memory
net_cls -> net_cls,net_prio
net_cls,net_prio
net_prio -> net_cls,net_prio
perf_event
pids
rdma
systemd
unified

Note

Ignore systemd and unified:

systemd is for the root cgroup

unified is for Version 2 controllers

There are 2 symbolic links for the cpu and cpuacct controllers because these 2 separate controllers are now combined into one.

The same applies to the net_cls and net_prio controllers.

Here we focus on 3 resource controllers: cpu, memory, and blkio, because they can be configured directly via systemd.
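Before diving in, note that systemd also ships a top-like viewer for cgroups. It is handy for watching the effect of the limits we set below; which columns are populated depends on the accounting enabled on your system:

# Live, per-cgroup view of tasks, CPU, memory and IO (press q to quit)
systemd-cgtop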

Controlling user CPU usage

Imagine a user consuming 100% of your CPU…

First, you need to identify the user with the systemd-cgls command to get their user ID (let's say 1001 here).

If, for example, you want to drastically reduce that user's CPU quota to 10%, run:

sudo systemctl set-property user-1001.slice CPUQuota=10%

This command creates new files under /etc/systemd, so you need to run:

sudo systemctl daemon-reload

…as you created a new unit file.

Once done, the user will be limited to 10%, but to be more precise: on a system with 4 CPU cores, defining a CPU quota of 10% means 2.5% per core, for a total of 10% spread across the 4 cores.

So if you allocate CPUQuota=100%, in reality you allocate 25% per core in this case.

You can simulate CPU stress on a Linux system by installing stress-ng:

sudo apt install stress-ng

To stress 4 CPU cores, you can run the following as the 1001 user, for example:

# As the 1001 user, stress 4 CPU cores
stress-ng -c 4

In this case, the 1001 user takes all the CPU resources, because by default on Linux all users have unlimited use of system resources.

The first time you execute a systemctl set-property command, it creates the system.control directory under /etc/systemd, which looks like this:

$ cd /etc/systemd 
$ ls -ld system.control/

drwxr-xr-x 3 root root 4096 Jul 14 19:59 system.control/

Under this directory you find the slice of the 1001 user, and under that slice directory the configuration file for its CPUQuota:

$ cd /etc/systemd/system.control
$ ls -l

total 4
drwxr-xr-x 2 root root 4096 Jul 14 20:25 user-1001.slice.d

$ cd user-1001.slice.d/
$ ls -l

total 4
-rw-r--r-- 1 root root 143 Jul 14 20:25 50-CPUQuota.conf

$ cat 50-CPUQuota.conf

# This is a drop-in unit file extension, created via "systemctl set-property"
# or an equivalent operation. Do not edit.
[Slice]
CPUQuota=10%

Now, be aware that you only need to do a daemon-reload when you create this file for the first time. Any subsequent changes you make to this file with the systemctl set-property command will take effect immediately.
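You can also query systemd directly to check the value currently applied to the slice; systemd exposes the quota through the CPUQuotaPerSecUSec property, where 100ms of CPU time per second corresponds to 10%:

$ systemctl show user-1001.slice -p CPUQuotaPerSecUSec

CPUQuotaPerSecUSec=100ms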

In the cgroup file system, under the 1001 user's slice directory, you can see the current CPUQuota setting in the cpu.cfs_quota_us file. Here is what it looks like when set to 10%:

$ cd /sys/fs/cgroup/cpu/user.slice/user-1001.slice
$ cat cpu.cfs_quota_us

10000
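This value is expressed in microseconds of CPU time per scheduling period. Assuming the default CFS period of 100000 microseconds (100 ms), 10000 / 100000 gives exactly the 10% quota:

$ cat cpu.cfs_period_us

100000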

Controlling service CPU usage

To show how to control CPU usage for a service, I create a dedicated service, cputest.service, managed by root, and I reuse the stress-ng tool from the previous section.

To create the service, run:

sudo systemctl edit --full --force cputest.service

It creates a unit file in /etc/systemd/system, and its content should look like this, assuming we have 4 CPU cores:

[Unit]
Description=CPU stress test service
[Service]
ExecStart=/usr/bin/stress-ng -c 4

Start the service:

sudo systemctl daemon-reload
sudo systemctl start cputest.service

At this time, the service should take 100% of the CPU resources.

As we did for the user, you can set a CPU quota on the service, for example 90% in this case, by running:

sudo systemctl set-property cputest.service CPUQuota=90%
sudo systemctl daemon-reload

This command creates a directory in /etc/systemd/system.control/:

$ cd /etc/systemd/system.control
$ ls -l

total 8
drwxr-xr-x 2 root root 4096 Jul 15 19:15 cputest.service.d
drwxr-xr-x 2 root root 4096 Jul 15 17:53 user-1001.slice.d

Inside the /etc/systemd/system.control/cputest.service.d/ directory, you'll see the 50-CPUQuota.conf file:

$ cd /etc/systemd/system.control/cputest.service.d
$ cat 50-CPUQuota.conf

# This is a drop-in unit file extension, created via "systemctl set-property"
# or an equivalent operation. Do not edit.
[Service]
CPUQuota=90%

This allows cputest.service to use only about 22.5% of each CPU core (90% spread across the 4 cores), as explained in the previous section.
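If you leave the service running, you can observe this with top (or systemd-cgtop): the four stress-ng workers should each settle at roughly 22.5% CPU. A quick non-interactive check could be:

# One batch iteration of top, filtered on the stress-ng workers
top -b -n 1 | grep stress-ng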

In the cgroup file system, you can see the CPUQuota set to 90%:

$ cd /sys/fs/cgroup/cpu/system.slice/cputest.service
$ cat cpu.cfs_quota_us

90000

This limit is only placed on the service, and not on the root user who owns the service. The root user can still run other programs and services without any limits.

Another way to limit the CPU usage for this service is to directly specify the quota inside the service file:

# Stop the service
$ sudo systemctl stop cputest.service
# Delete the cputest.service.d/ directory created by systemctl set-property
$ cd /etc/systemd/system.control
$ sudo rm -rf cputest.service.d/
# Reload systemd
$ sudo systemctl daemon-reload
# Start cputest.service
$ sudo systemctl start cputest.service

Then stop the service again and edit its unit file:

$ sudo systemctl edit --full cputest.service

Add the CPUQuota=90% line, so that the file now looks like this:

[Unit]
Description=CPU stress test service
[Service]
ExecStart=/usr/bin/stress-ng -c 4
CPUQuota=90%

Save the file and start the service.
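After a daemon-reload and a restart of the service, the quota should show up again on the cgroup side, exactly as before:

$ cat /sys/fs/cgroup/cpu/system.slice/cputest.service/cpu.cfs_quota_us

90000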

Controlling the memory usage for a user

Again, imagine that the 1001 user runs a process that is using all the system memory.

You can simulate high memory usage by the 1001 user by running stress-ng:

stress-ng --brk 4

If, for example, you want to limit the 1001 user's programs to a maximum of 1 GB of memory, run the following command and then execute a daemon-reload:

sudo systemctl set-property user-1001.slice MemoryLimit=1G
sudo systemctl daemon-reload

As for the CPU quota, this command creates a specific file in /etc/systemd/system.control/user-1001.slice.d:

$ cd /etc/systemd/system.control/user-1001.slice.d
$ cat 50-MemoryLimit.conf

# This is a drop-in unit file extension, created via "systemctl set-property"
# or an equivalent operation. Do not edit.
[Slice]
MemoryLimit=1073741824
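On the cgroup side, the limit lands in the memory controller hierarchy, in the memory.limit_in_bytes file of the user's slice (path assuming the standard v1 layout shown earlier):

$ cat /sys/fs/cgroup/memory/user.slice/user-1001.slice/memory.limit_in_bytes

1073741824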

If you want the modification to be temporary, just add the --runtime option:

sudo systemctl set-property --runtime user-1001.slice MemoryLimit=1G
sudo systemctl daemon-reload

In this case, you lose your modification when you reboot the system; the command creates a temporary configuration file in the /run/systemd/system.control/user-1001.slice.d/ directory, which looks like this:

$ cd /run/systemd/system.control/user-1001.slice.d
$ cat 50-MemoryLimit.conf

# This is a drop-in unit file extension, created via "systemctl set-property"
# or an equivalent operation. Do not edit.
[Slice]
MemoryLimit=1073741824

Controlling the memory usage for a service

In short, adding memory control to a service is similar to doing it for a user process. Execute the following command (here, an example with the Apache2 service):

sudo systemctl set-property apache2.service MemoryLimit=1G

Alternatively, edit the Apache2 service file and, in the [Service] section, add the new parameter directly:

[Service]
...
MemoryLimit=1G

Next, run sudo systemctl daemon-reload.
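As with the user slice, the limit should then be visible under the service's memory cgroup (again assuming the standard v1 layout):

$ cat /sys/fs/cgroup/memory/system.slice/apache2.service/memory.limit_in_bytes

1073741824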

Controlling IO usage for a user

Controlling IO means using the blkio controller.

To monitor IO usage, you can install iotop:

sudo apt install iotop

iotop allows you to view live consumption of disk IO.
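For example, to display only the processes that are currently doing IO, you can run it as root with the -o (--only) option:

sudo iotop -o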

Imagine a 1001 user whose IO you want to limit on a specific partition. To simulate this control, first let's create a large test file with the dd command under the 1001 user's profile:

$ dd if=/dev/zero of=afile bs=1M count=10000

10000+0 records in
10000+0 records out
10485760000 bytes (10 GB, 9.8 GiB) copied, 17.4288 s, 602 MB/s

We have created a 10 GB file; next, the user copies the contents of this file to the /dev/null device.

$ dd if=afile of=/dev/null

20480000+0 records in
20480000+0 records out
10485760000 bytes (10 GB, 9.8 GiB) copied, 69.2341 s, 151 MB/s

It appears that the user reads this file at an average rate of 151 MB per second.

First, we have to know from which device the user reads the file:

$ lsblk

NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
loop0 7:0 0 99.4M 1 loop /snap/core/11316
. . .
. . .
sda 8:0 0 1T 0 disk
├─sda1 8:1 0 1M 0 part
├─sda2 8:2 0 1G 0 part /boot
└─sda3 8:3 0 930.5G 0 part
└─sda3_crypt 253:0 0 930.5G 0 crypt
├─debian--vg-root 253:1 0 23.3G 0 lvm /
├─debian--vg-var 253:2 0 9.3G 0 lvm /var
├─debian--vg-swap_1 253:3 0 976M 0 lvm [SWAP]
├─debian--vg-tmp 253:4 0 1.9G 0 lvm /tmp
└─debian--vg-home 253:5 0 895.1G 0 lvm /home
sdb 8:16 0 10G 0 disk
└─sdb1 8:17 0 10G 0 part /media/backup
sr0 11:0 1 1024M 0 rom

The 1001 user has their own home directory. The /home directory is mounted as a logical volume on the /dev/sda3 partition.

Let's say we want to limit the read bandwidth to 1 MB/s; in this case, we run:

sudo systemctl set-property user-1001.slice BlockIOReadBandwidth="/dev/sda3 1M"
sudo systemctl daemon-reload

In this case, the modification is permanent: it creates a new file in /etc/systemd/system.control/user-1001.slice.d:

$ cd /etc/systemd/system.control/user-1001.slice.d
$ cat 50-BlockIOReadBandwidth.conf

# This is a drop-in unit file extension, created via "systemctl set-property"
# or an equivalent operation. Do not edit.
[Slice]
BlockIOReadBandwidth=1000000

On the cgroup side, the following setting appears:

$ cd /sys/fs/cgroup/blkio/user.slice/user-1001.slice
$ cat blkio.throttle.read_bps_device

8:0 1000000

In this blkio.throttle.read_bps_device file, the 8:0 represents the major and minor numbers of the /dev/sda device:

vissol@debian:/dev$ ls -l sda

brw-rw---- 1 root disk 8, 0 Oct 5 08:01 sda

Controlling IO usage for a service

As for a user, you can control IO for a service using the BlockIOReadBandwidth parameter. Here is an example with Apache2, controlling IO from the command line:

sudo systemctl set-property apache2.service BlockIOReadBandwidth="/dev/sda 1M"

Note that on the command line you have to surround the /dev/sda 1M part with a pair of double quotes, but when you set this parameter in a service file, you do not surround /dev/sda 1M with double quotes.

Edit the Apache2 service file and, in the [Service] section, add the new parameter without double quotes:

[Service]
. . .
BlockIOReadBandwidth=/dev/sda 1M

Next, run sudo systemctl daemon-reload.

This modification adds the following file and value in the cgroup file system:

$ cd /sys/fs/cgroup/blkio/system.slice/apache2.service
$ cat blkio.throttle.read_bps_device

8:0 1000000

In this blkio.throttle.read_bps_device file, 8:0 again represents the major and minor numbers of the /dev/sda device:

vissol@debian:/dev$ ls -l sda
brw-rw---- 1 root disk 8, 0 Oct 5 08:01 sda

More about systemd resource control

If you want more parameters for controlling resources via systemd and cgroup, run:

man systemd.resource-control

You will find all the parameters you can set to control user and service programs.
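For a quick overview of the resource-control properties currently applied to a unit, you can also filter the output of systemctl show, for example:

systemctl show user-1001.slice | grep -E 'CPU|Memory|BlockIO'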

Simply be careful of the cgroup version supported by your system…
