Systemd and cgroup

Charles Vissol
8 min read · Dec 16, 2023


(Credit: Charles Vissol)

This article requires a good knowledge of Linux and a minimum knowledge of cgroup (read my previous article: Cgroup introduction). It describes how systemd and cgroup work together in Linux systems.

In recent Debian distros, systemd automatically mounts the cgroupfs (cgroup file system) version 2 at /sys/fs/cgroup during the boot process. The systemd service manager uses cgroup to organize all units and services, meaning that systemd and cgroup work together.
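You can check which version is mounted on your own system; stat reports the filesystem type of /sys/fs/cgroup, which is cgroup2fs when version 2 is in use (output illustrative):

$ stat -fc %T /sys/fs/cgroup
cgroup2fs

$ mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)

On a hybrid or version 1 setup, stat returns tmpfs instead, because the individual controllers are mounted below /sys/fs/cgroup.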

What is systemd?

systemd is a system and service manager for Linux operating systems. systemd is started during the early boot of the system and runs as the first process (PID 1).

systemd acts as the "init" system that brings up and maintains user-space services.

systemd is usually not invoked directly by the user, but user manager instances are started automatically through the user@.service unit.

Additionally, systemd provides a number of interfaces used to create and manage sets of processes, monitoring and controlling them with regard to their resource usage.

As a consequence, the main cgroup tree becomes the private property of that user-space component and is no longer a shared resource.

On systemd distros, the PID 1 process takes this role and hence needs to provide APIs for clients to take advantage of the cgroup features.
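For example, the systemd-run command is a thin client of these APIs: it can place an arbitrary command in its own transient scope unit with resource limits attached (the sleep command and the resulting unit name below are just illustrative):

# Run a command in a transient scope with a 100 MiB memory cap
$ systemd-run --user --scope -p MemoryMax=100M sleep 300
Running scope as unit: run-r1b2c3.scope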

Note

Services running on systemd distros may manage their own sub-trees of the cgroup tree, as long as they explicitly turn on delegation mode for them.

systemd has two categories of instances:

1st Category: System instance

When it runs as a system instance, systemd interprets the configuration file /etc/systemd/system.conf and the files in the /etc/systemd/system.conf.d directory.

See systemd-system.conf(5) for more information.

2nd Category: User instance

When it runs as a user instance, systemd usually interprets the configuration file ~/.config/systemd/user.conf and the files in the /etc/systemd/user.conf.d directory.

Note

In some cases, you can also find unit files (services) in ~/.config/systemd/user/.
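As a quick sketch, a user can drop a unit file there and manage it with systemctl --user; hello.service below is a hypothetical example:

# ~/.config/systemd/user/hello.service (hypothetical unit)
[Unit]
Description=Hello service running in the user instance

[Service]
ExecStart=/bin/sleep 3600

$ systemctl --user daemon-reload
$ systemctl --user start hello.service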

systemd provides a dependency system between various entities called "units", of 11 different types. Units encapsulate various objects that are relevant for system boot-up and maintenance. The majority of those units are configured in unit configuration files (see systemd.unit(5) for syntax details).

Let’s describe them a little:

  1. .service units, which start and control daemons and the processes they consist of. For details, see systemd.service(5).
  2. .socket units, which encapsulate local IPC (Inter-Process Communication) or network sockets in the system, useful for socket-based activation. For details about socket units, see systemd.socket(5) and for details on socket-based activation and other forms of activation, see daemon(7).
  3. .target units are useful to group units, or provide well-known synchronization points during boot-up, see systemd.target(5).
  4. .device units expose kernel devices in systemd and may be used to implement device-based activation. For details, see systemd.device(5).
  5. .mount units control mount points in the file system, for details see systemd.mount(5).
  6. .automount units provide auto-mount capabilities, for on-demand mounting of file systems as well as parallelized boot-up. See systemd.automount(5).
  7. .timer units are useful for triggering activation of other units based on timers. You may find details in systemd.timer(5).
  8. .swap units are very similar to mount units and encapsulate memory swap partitions or files of the operating system. They are described in systemd.swap(5).
  9. .path units may be used to activate other services when file system objects change or are modified. See systemd.path(5).
  10. .slice units may be used to group units which manage system processes (such as service and scope units) in a hierarchical tree for resource management purposes. See systemd.slice(5).
  11. .scope units are similar to service units, but manage foreign processes instead of starting them as well. See systemd.scope(5).

Units are named after their configuration files. Some units have special semantics; a detailed list is available in systemd.special(7).
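You can list the loaded units of any of these types with systemctl, for example:

$ systemctl list-units --type=slice
$ systemctl list-units --type=scope
$ systemctl --user list-units --type=service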

cgroup & systemd

cgroup (invented by Google) is independent from systemd (invented by Red Hat) and is older.

Today, enterprise-grade Linux systems enable cgroup version 2 by default with systemd.

Info

Fedora, Arch, Ubuntu 21.10+, and Debian 11 are the only Linux distros that run cgroup version 2 by default at this date. However, many container technologies are still on version 1.

More accurately, systemd is built on top of the kernel's cgroup API, which requires that each individual cgroup be managed by a single writer only.

By default, systemd creates a new cgroup under system.slice for each service it monitors; you can change this behavior by editing the systemd service files.

There are three options with regards to cgroup management with systemd:

  • Editing the service file itself.
  • Using drop-in files.
  • Using systemctl set-property commands, which have the same effect as manually editing the files, but systemctl creates the required entries for you.

More at https://www.redhat.com/sysadmin/cgroups-part-four
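As an illustration of the third option, the following commands cap a hypothetical httpd.service; systemctl writes the required drop-in entries for you and applies the change at runtime:

# Cap CPU and memory for a (hypothetical) httpd.service
$ sudo systemctl set-property httpd.service CPUQuota=40% MemoryMax=1G

# Verify the resulting properties
$ systemctl show httpd.service -p CPUQuotaPerSecUSec -p MemoryMax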

Structure of cgroup

I focus on the slice unit, which is the systemd unit that interacts with cgroup.

A slice does not contain any processes. It’s a group of hierarchically organized units. A slice manages processes that are running in either scopes or services. The four default slices are as follows:

  • -.slice: root slice, which is the root of the whole slice hierarchy. Normally, it won't directly contain any other units. However, you can use it to create default settings for the entire slice tree.
  • system.slice: system services that have been started by systemd.
  • user.slice: user-mode services. An implicit slice is assigned to each logged-in user.
  • machine.slice: services dedicated to running containers or virtual machines.

Note

Services are started by systemd, but scopes are started by external means (virtual machines, containers, user sessions...).
A sysadmin can define custom slices and assign scopes and services to them (see the sketch after this note).
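A minimal sketch of such a custom slice; limited.slice is a hypothetical name, while MemoryMax and CPUQuota are standard resource-control directives:

# /etc/systemd/system/limited.slice (hypothetical custom slice)
[Unit]
Description=Custom slice with capped resources
Before=slices.target

[Slice]
# No more than 2 GiB of RAM and 80% of one CPU for everything in this slice
MemoryMax=2G
CPUQuota=80%

A service is then assigned to it by setting Slice=limited.slice in its [Service] section.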

To see a graphical representation of these processes, run the command systemd-cgls:

systemd-cgls

On Debian 11.4, the output should look like this (it is unchanged between version 1 and version 2):

Control group /:
-.slice
├─user.slice
│ └─user-1000.slice
│   ├─user@1000.service
│   │ ├─background.slice
│   │ │ └─plasma-kglobalaccel.service
│   │ │   └─1977 /usr/bin/kglobalaccel5
│   │ ├─app.slice
│   │ │ ├─app-org.kde.kate-b498c13a5e274a0c882c324e5d1f72f7.scope
│   │ │ │ └─38353 /usr/bin/kate -b /home/vissol/Downloads/linux-5.19-rc8/Documentation/vm/numa.rst
│   │ │ ├─app-org.kde.kate-bd04ec663c48458388b9fa5763b21475.scope
│   │ │ │ └─36045 /usr/bin/kate -b /home/vissol/Downloads/linux-5.19-rc8/Documentation/admin-guide/tainted-kernels.rst
│   │ │ ├─app-org.kde.kate-701de1e0f47c4040b04c2b14b0736814.scope
│   │ │ │ └─36107 /usr/bin/kate -b /home/vissol/Downloads/linux-5.19-rc8/Documentation/admin-guide/perf-security.rst
│   │ │ ├─xdg-permission-store.service
│   │ │ │ └─1877 /usr/libexec/xdg-permission-store
│   │ │ ├─app-\x2fusr\x2fbin\x2fkorgac-fba6fc922f304fd892acdbd09d5c57e6.scope
│   │ │ │ └─2059 /usr/bin/korgac -session 10dfd7e29f000165373856000000016460011_1659082942_32052
│   │ │ ├─xdg-document-portal.service
│   │ │ │ ├─1873 /usr/libexec/xdg-document-portal
│   │ │ │ └─1883 fusermount -o rw,nosuid,nodev,fsname=portal,auto_unmount,subtype=portal -- /run/user/1000/doc
│   │ │ ├─app-org.kde.kate-3c8915e087fd4680ba5fed65f42a4f88.scope
│   │ │ │ └─36427 /usr/bin/kate -b /home/vissol/Downloads/linux-5.19-rc8/Documentation/admin-guide/sysctl/vm.rst
│   │ │ ├─app-org.kde.kate-1a49d3c474e34ad283440d3d1298394a.scope
│   │ │ │ └─36893 /usr/bin/kate -b /home/vissol/Downloads/linux-5.19-rc8/Documentation/admin-guide/laptops/laptop-mode.rst
│   │ │ ├─xdg-desktop-portal.service
│   │ │ │ └─1864 /usr/libexec/xdg-desktop-portal
│   │ │ ├─app-org.kde.kate-2b0ef5011a5344989296587e17dde86e.scope
[lines 1-29]

Info

On any desktop machine, you’ll always have a lot more running services than you’d have on a strictly text-mode machine.

The first cgroup is the / cgroup, which is the root cgroup. The second line begins the listing for the root slice (-.slice), with its direct child user.slice, followed by user-1000.slice. Here, 1000 corresponds to my user ID.

Important

To see user slices, you need to run systemd-cgls outside of the cgroup file system: the deeper you dive into the /sys/fs/cgroup filesystem, the less you see with systemd-cgls.
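You can see this effect by running it from inside the tree; the output narrows to the subtree of the current directory (abridged, illustrative):

$ cd /sys/fs/cgroup/user.slice
$ systemd-cgls
Working directory /sys/fs/cgroup/user.slice:
└─user-1000.slice
  └─...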

user.slice

The user.slice is defined by the /lib/systemd/system/user.slice unit file, which looks like:

[Unit]
Description=User and Session Slice
Documentation=man:systemd.special(7)
Before=slices.target

This slice has to finish starting before slices.target (located in the same directory as user.slice), which contains:

[Unit]
Description=Slices
Documentation=man:systemd.special(7)
Wants=-.slice system.slice
After=-.slice system.slice

slices.target is responsible for setting up the slices that run when you boot up your machine: by default, it starts up system.slice and the root slice (-.slice), as we can see in the Wants= and After= parameters.
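You can confirm these dependencies with systemctl list-dependencies (output abridged):

$ systemctl list-dependencies slices.target
slices.target
├─-.slice
└─system.slice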

Note

We can add more slices to the current list in user.slice and slices.target.

At the same level as user.slice, we have init.scope and system.slice:

-.slice
├─user.slice
│ . . .
├─init.scope
│ . . .
├─system.slice
│ . . .

user-1000.slice

The first child of user-1000.slice is user@1000.service, which is responsible for all services running in user 1000's slice and is set up from the user@.service template (in /lib/systemd/system/user@.service).

The user@.service template has two sections, [Unit] and [Service]:

[Unit]
Description=User Manager for UID %i
Documentation=man:user@.service(5)
After=systemd-user-sessions.service user-runtime-dir@%i.service dbus.service
Requires=user-runtime-dir@%i.service
IgnoreOnIsolate=yes

[Service]
User=%i
PAMName=systemd-user
Type=notify
ExecStart=/lib/systemd/systemd --user
Slice=user-%i.slice
KillMode=mixed
Delegate=pids memory
TasksMax=infinity
TimeoutStopSec=120s
KeyringMode=inherit

Important

At runtime, %i is replaced by the user ID number.

Let’s look at some interesting clauses of the [Service] section:

  • ExecStart: systemd starts a new systemd instance for each user who logs in.
  • Slice: creates a separate slice for each user.
  • TasksMax: limits (or not) the number of processes; here infinity means there is no limit.
  • Delegate: allows delegation for the controllers listed here, namely pids and memory (delegation is for cgroup version 2 only); see the check after this list.
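You can check what has actually been delegated to a user instance; the exact controller list depends on your systemd version and configuration (output illustrative):

$ systemctl show user@1000.service -p Delegate -p DelegateControllers
Delegate=yes
DelegateControllers=pids memory

$ cat /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgroup.controllers
memory pids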

All services running in user 1000's slice are children of user@1000.service. In this tree, we can also see the scopes corresponding to the user's locally executed programs:

systemd-cgls | grep scope
│   │ │ ├─app-gnome\x2dtodo-c8bbb1ea42124e4eabe95eca8c02e5f7.scope
│   │ │ ├─app-google\x2dchrome-fa92ae1ce8f54f4298975211065460e7.scope
│   │ │ ├─app-protonvpn-8065209d48094182b0b0c0352d51cd10.scope
│   │ │ ├─app-org.kde.konsole-de12e69356754c5dae23bdfdc108d53a.scope
│   │ │ │ └─44710 grep scope
│   │ │ ├─app-firefox\x2desr-73828cd3b1754ee0b4ffafdd6750507d.scope
│   │ │ ├─app-\x2fusr\x2flib\x2fx86_64\x2dlinux\x2dgnu\x2flibexec\x2fDiscoverNotifier-72a3c7b93c0947d5bbcf6c3beee4e003.scope
│   │ │ ├─app-marktext-d9ed06b1e1724ec482580148c7aa057c.scope
│   │ │ ├─app-\x2fusr\x2fbin\x2fkorgac-9e6550aaf81c4723a7afd6ba45888d14.scope
│   │ └─init.scope
│   └─session-3.scope
├─init.scope

Info

The local terminal session is designated by session-2.scope.
A remote terminal session is designated by session-3.scope.
Here, the terminal session is hosted by Konsole, the KDE terminal program.

Important

.scope units are only created programmatically at runtime (they are not created using unit files), so you can't expect to see any .scope files in the /lib/systemd/system/ directory.
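Session scopes are a good example: they are created by systemd-logind when you log in, and you can inspect them with loginctl (output illustrative):

$ loginctl list-sessions
SESSION  UID USER   SEAT  TTY
      3 1000 vissol seat0 tty2

$ loginctl session-status 3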

machine.slice

In my configuration, Podman is running (the libpod- prefix comes from Podman's container library), and we can see the machine.slice representation like this:

. . .
└─machine.slice
  └─libpod-cc06c35f21cedd4d2384cf2c048f013748e84cabdc594b110a8c8529173f4c81.sco>
    ├─1438 apache2 -DFOREGROUND
    ├─1560 apache2 -DFOREGROUND
    ├─1561 apache2 -DFOREGROUND
    ├─1562 apache2 -DFOREGROUND
    ├─1563 apache2 -DFOREGROUND
    └─1564 apache2 -DFOREGROUND

Managing tree view of systemd processes

When systemd-cgls runs without parameters, it returns the entire cgroup hierarchy. The highest level of the cgroup tree is formed by slices and can look as follows:

├─system
│ ├─1 /usr/lib/systemd/systemd --switched-root --system --deserialize 20
│ ...
├─user
│ ├─user-1000
│ │ └─ ...
│ ├─user-2000
│ │ └─ ...
│ ...
└─machine
  ├─machine-1000
  │ └─ ...
  ...

Info

The machine slice is present only if you are running a virtual machine or a container.

To reduce the output of systemd-cgls and view a specified part of the hierarchy, execute:

$ systemd-cgls $NAME

$NAME is the resource controller you want to inspect.

Example: memory controller

$ systemd-cgls memory
memory:
├─ 1 /usr/lib/systemd/systemd --switched-root --system --deserialize 23
├─ 475 /usr/lib/systemd/systemd-journald
[...]

systemd also provides the machinectl command, dedicated to monitoring Linux containers and virtual machines.
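A few useful subcommands; note that machinectl only sees machines registered with systemd-machined, so containers that do not register there will not show up:

$ machinectl list
$ machinectl status <machine-name>
$ machinectl shell <machine-name>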

Linux also provides systemctl to get the tree view of processes, using a systemd unit as a parameter to filter the query, with the syntax systemctl status $systemd_unit. For example: systemctl status user.slice.
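The output includes a CGroup section showing the corresponding subtree (abridged, values illustrative):

$ systemctl status user.slice
● user.slice - User and Session Slice
     Loaded: loaded (/lib/systemd/system/user.slice; static)
     Active: active since Sat 2023-12-16 09:12:34 CET; 2h ago
      Tasks: 247
     Memory: 3.1G
     CGroup: /user.slice
             └─user-1000.slice
               └─...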

Why is cgroup important?

Nowadays, servers come with one or more multi-core CPUs and a large quantity of memory. Resource management on these “monsters” is more important than it was on older systems. In fact, a server can run multiple services, multiple virtual machines, multiple containers, and multiple user accounts at the same time, so managing resources becomes a priority.

This situation requires more powerful tools to ensure that all these processes and users play nicely together. This is the purpose of cgroup.

What can a sysadmin do with cgroup?

  • Manage resource usage by either processes or users.
  • Keep track of resource usage by users on multi-tenant systems to provide accurate billing.
  • More easily isolate running processes from each other. This not only makes for better security but also allows us to have better containerization technologies than we had previously.
  • Run servers that are densely packed with virtual machines and containers due to better resource management and process isolation.
  • Enhance performance by ensuring that processes always run on the same CPU core or set of CPU cores, instead of allowing the Linux kernel to move them around to different cores (see the sketch after this list).
  • Whitelist or blacklist hardware devices.
  • Set up network traffic shaping.
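For the CPU-pinning point above, systemd exposes this capability through the CPUAffinity= directive and, with the cgroup v2 cpuset controller, AllowedCPUs= (available since systemd 244). A minimal sketch as a drop-in file; myapp.service is a hypothetical unit name:

# /etc/systemd/system/myapp.service.d/pin-cpus.conf (hypothetical drop-in)
[Service]
# Pin the service's processes to the first two CPU cores
CPUAffinity=0 1
# Equivalent restriction enforced by the cgroup v2 cpuset controller
AllowedCPUs=0-1

After creating the drop-in, apply it with sudo systemctl daemon-reload && sudo systemctl restart myapp.service.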
