Systemd and cgroup
This article requires a good knowledge of Linux and at least a minimal knowledge of cgroup (read my previous article: Cgroup introduction). It describes how systemd and cgroup work together on Linux systems.
On recent Debian distros, systemd automatically mounts the cgroupfs (cgroup file system) version 2 at /sys/fs/cgroup during the boot process. The systemd service manager then uses cgroup to organize all units and services; in other words, systemd and cgroup work together.
What is systemd?
systemd is a system and service manager for Linux operating systems. It is started during early boot and runs as the first process (PID 1).
systemd acts as the "init" system that brings up and maintains user-space services. systemd is usually not invoked directly by the user; user manager instances are started automatically through the user@.service service.
Additionally, systemd provides a number of interfaces used to create and manage sets of processes, in order to monitor and control them with regard to their resource usage.
As a consequence, the main cgroup tree becomes the private property of that user-space component and is no longer a shared resource. On systemd distros, the PID 1 process takes this role and hence needs to provide APIs for clients to take advantage of the cgroup features.
Note
Services running on systemd distros may manage their own sub-trees of the cgroup tree, as long as they explicitly turn on delegation mode for them.
systemd has two categories of instances:
1st Category: System instance
When it runs as a system instance, systemd interprets the configuration file /etc/systemd/system.conf and the files in the /etc/systemd/system.conf.d directory. See systemd-system.conf for more information.
2nd Category: User instance
When it runs as a user instance, systemd usually interprets the configuration file ~/.config/systemd/user.conf and the files in the /etc/systemd/user.conf.d directory.
Note
In some cases, you can also find unit files (services) in ~/.config/systemd/user/.
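As a sketch, here is how a user-level unit could end up in that directory; the unit name (backup.service) and the rsync command are hypothetical, not taken from the article:

```shell
# Hypothetical user service placed in ~/.config/systemd/user/
mkdir -p ~/.config/systemd/user
cat > ~/.config/systemd/user/backup.service <<'EOF'
[Unit]
Description=Example user-level backup service

[Service]
Type=oneshot
ExecStart=/usr/bin/rsync -a %h/Documents/ %h/backup/
EOF

# Reload the user manager and start the unit (no root needed)
systemctl --user daemon-reload
systemctl --user start backup.service
```

Because this runs under the user instance, the resulting process lands in the user's slice, not in system.slice.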
systemd provides a dependency system between various entities called "units", of 11 different types. Units encapsulate various objects that are relevant for system boot-up and maintenance. The majority of those units are configured in unit configuration files (see systemd.unit for syntax details).
Let's describe them a little:
- .service units start and control daemons and the processes they consist of. For details, see systemd.service(5).
- .socket units encapsulate local IPC (Inter-Process Communication) or network sockets in the system, useful for socket-based activation. For details about socket units, see systemd.socket(5); for details on socket-based activation and other forms of activation, see daemon(7).
- .target units are useful to group units, or to provide well-known synchronization points during boot-up; see systemd.target(5).
- .device units expose kernel devices in systemd and may be used to implement device-based activation. For details, see systemd.device(5).
- .mount units control mount points in the file system; for details see systemd.mount(5).
- .automount units provide auto-mount capabilities, for on-demand mounting of file systems as well as parallelized boot-up. See systemd.automount(5).
- .timer units are useful for triggering activation of other units based on timers. You may find details in systemd.timer(5).
- .swap units are very similar to mount units and encapsulate memory swap partitions or files of the operating system. They are described in systemd.swap(5).
- .path units may be used to activate other services when file system objects change or are modified. See systemd.path(5).
- .slice units may be used to group units which manage system processes (such as service and scope units) in a hierarchical tree for resource management purposes. See systemd.slice(5).
- .scope units are similar to service units, but manage foreign processes instead of starting them themselves. See systemd.scope(5).

Units are named after their configuration files. Some units have special semantics. A detailed list is available in systemd.special(7).
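To see which units of a given type are currently loaded on your system, systemctl can filter by type; this is a quick way to relate the list above to a running machine:

```shell
# List active slice units (system instance)
systemctl list-units --type=slice

# List scope units of the current user's instance
systemctl --user list-units --type=scope
```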
cgroup & systemd
cgroup (invented at Google) is independent from systemd (invented at Red Hat) and is older. Today, enterprise-grade Linux systems enable cgroup version 2 by default together with systemd.
Info
Fedora, Arch, Ubuntu 21.10+, and Debian 11 are the only Linux distros that run cgroup version 2 by default at this date. However, many container technologies are still on version 1.
More accurately, systemd is built on top of the kernel's cgroup API, which requires that each individual cgroup is managed by a single writer only.
By default, systemd creates a new cgroup under system.slice for each service it monitors; you can change this behavior by editing the systemd service files.
There are three options with regard to cgroup management with systemd:
- Editing the service file itself.
- Using drop-in files.
- Using systemctl set-property commands, which is equivalent to manually editing the files, except that systemctl creates the required entries for you.
More at https://www.redhat.com/sysadmin/cgroups-part-four
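As an illustration of the last two options, here is a sketch; the service name (httpd.service) and the limit value are assumptions, not taken from the article:

```shell
# Option 3: let systemctl create the entries for you
# (recent systemd writes a drop-in under /etc/systemd/system.control/)
sudo systemctl set-property httpd.service MemoryMax=1G

# Option 2: the equivalent manual drop-in file
sudo mkdir -p /etc/systemd/system/httpd.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/httpd.service.d/memory.conf
[Service]
MemoryMax=1G
EOF
sudo systemctl daemon-reload
sudo systemctl restart httpd.service
```

Both approaches end up setting the memory.max file in the service's cgroup; set-property simply saves you the file editing.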
Structure of cgroup
I focus on the slice unit, which is the unit systemd uses to interact with cgroup.
A slice does not contain any processes. It’s a group of hierarchically organized units. A slice manages processes that are running in either scopes or services. The four default slices are as follows:
- -.slice: the root slice, which is the root of the whole slice hierarchy. Normally, it won't directly contain any other units. However, you can use it to create default settings for the entire slice tree.
- system.slice: system services that have been started by systemd.
- user.slice: user-mode services. An implicit slice is assigned to each logged-in user.
- machine.slice: services dedicated to running containers or virtual machines.
Note
Services are started by systemd, but scopes are started by external means (virtual machines, containers, user sessions...).
A sysadmin can define custom slices and assign scopes and services to them.
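For example, a custom slice could be defined and a service assigned to it through its Slice= directive; the slice name, service name, and limits below are hypothetical:

```shell
# Hypothetical custom slice with resource limits
cat <<'EOF' | sudo tee /etc/systemd/system/limited.slice
[Unit]
Description=Example slice with resource limits

[Slice]
CPUQuota=50%
MemoryMax=2G
EOF

# Assign an existing service to that slice via a drop-in
sudo mkdir -p /etc/systemd/system/myapp.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/myapp.service.d/slice.conf
[Service]
Slice=limited.slice
EOF
sudo systemctl daemon-reload
```

Every service assigned to limited.slice then shares the slice's CPU and memory limits.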
To see a graphical representation of these processes, run the systemd-cgls command:
systemd-cgls
The output on Debian 11.4 should look like this (unchanged between cgroup version 1 and version 2):
Control group /:
-.slice
├─user.slice
│ └─user-1000.slice
│ ├─user@1000.service
│ │ ├─background.slice
│ │ │ └─plasma-kglobalaccel.service
│ │ │ └─1977 /usr/bin/kglobalaccel5
│ │ ├─app.slice
│ │ │ ├─app-org.kde.kate-b498c13a5e274a0c882c324e5d1f72f7.scope
│ │ │ │ └─38353 /usr/bin/kate -b /home/vissol/Downloads/linux-5.19-rc8/Documentation/vm/numa.rst
│ │ │ ├─app-org.kde.kate-bd04ec663c48458388b9fa5763b21475.scope
│ │ │ │ └─36045 /usr/bin/kate -b /home/vissol/Downloads/linux-5.19-rc8/Documentation/admin-guide/tainted-kernels.rst
│ │ │ ├─app-org.kde.kate-701de1e0f47c4040b04c2b14b0736814.scope
│ │ │ │ └─36107 /usr/bin/kate -b /home/vissol/Downloads/linux-5.19-rc8/Documentation/admin-guide/perf-security.rst
│ │ │ ├─xdg-permission-store.service
│ │ │ │ └─1877 /usr/libexec/xdg-permission-store
│ │ │ ├─app-\x2fusr\x2fbin\x2fkorgac-fba6fc922f304fd892acdbd09d5c57e6.scope
│ │ │ │ └─2059 /usr/bin/korgac -session 10dfd7e29f000165373856000000016460011_1659082942_32052
│ │ │ ├─xdg-document-portal.service
│ │ │ │ ├─1873 /usr/libexec/xdg-document-portal
│ │ │ │ └─1883 fusermount -o rw,nosuid,nodev,fsname=portal,auto_unmount,subtype=portal -- /run/user/1000/doc
│ │ │ ├─app-org.kde.kate-3c8915e087fd4680ba5fed65f42a4f88.scope
│ │ │ │ └─36427 /usr/bin/kate -b /home/vissol/Downloads/linux-5.19-rc8/Documentation/admin-guide/sysctl/vm.rst
│ │ │ ├─app-org.kde.kate-1a49d3c474e34ad283440d3d1298394a.scope
│ │ │ │ └─36893 /usr/bin/kate -b /home/vissol/Downloads/linux-5.19-rc8/Documentation/admin-guide/laptops/laptop-mode.rst
│ │ │ ├─xdg-desktop-portal.service
│ │ │ │ └─1864 /usr/libexec/xdg-desktop-portal
│ │ │ ├─app-org.kde.kate-2b0ef5011a5344989296587e17dde86e.scope
[lines 1-29]
Info
On any desktop machine, you’ll always have a lot more running services than you’d have on a strictly text-mode machine.
The first cgroup is the / cgroup, which is the root cgroup. The second line begins the listing for the root slice (-.slice), with a direct child user.slice, followed by user-1000.slice. Here 1000 corresponds to my user ID.
Important
To see user slices, you need to run systemd-cgls outside of the cgroup file system. The deeper you dive into the /sys/fs/cgroup filesystem, the less you see with systemd-cgls.
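Another way to see where a process lives in this tree is to read /proc/<pid>/cgroup; on a cgroup v2 system it contains a single line giving the path relative to /sys/fs/cgroup:

```shell
# Show the cgroup of the current shell.
# On cgroup v2 the line looks like: 0::/user.slice/user-1000.slice/...
cat /proc/self/cgroup
```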
user.slice
The user.slice is defined by the /lib/systemd/system/user.slice unit file, which looks like:
[Unit]
Description=User and Session Slice
Documentation=man:systemd.special(7)
Before=slices.target
This slice has to finish starting before slices.target (in the same directory as user.slice), which contains:
[Unit]
Description=Slices
Documentation=man:systemd.special(7)
Wants=-.slice system.slice
After=-.slice system.slice
slices.target is responsible for setting up the slices that run when you boot up your machine: by default, it starts system.slice and the root slice (-.slice), as we can see in the Wants and After parameters.
Note
We can add more slices to the current list in user.slice and slices.target.
At the same level as user.slice, we have init.scope and system.slice:
-.slice
├─user.slice
. . .
├─init.scope
. . .
├─system.slice
│. . .
user-1000.slice
The first child of user-1000.slice is user@1000.service. user@1000.service is responsible for all services running in user 1000's slice and is set up by the user@.service template (in /lib/systemd/system/user@.service).
The user@.service
template has 2 sections:
[Unit]
Description=User Manager for UID %i
Documentation=man:user@.service(5)
After=systemd-user-sessions.service user-runtime-dir@%i.service dbus.service
Requires=user-runtime-dir@%i.service
IgnoreOnIsolate=yes
[Service]
User=%i
PAMName=systemd-user
Type=notify
ExecStart=/lib/systemd/systemd --user
Slice=user-%i.slice
KillMode=mixed
Delegate=pids memory
TasksMax=infinity
TimeoutStopSec=120s
KeyringMode=inherit
Important
At runtime, %i is replaced by the user ID number.
Let's look at some interesting clauses of the [Service] section:
- ExecStart: systemd starts a new systemd session for each user who logs in.
- Slice: creates a separate slice for each user.
- TasksMax: limits (or not) the number of processes; here infinity means there is no limit.
- Delegate: allows delegation for the controllers listed here, namely pids and memory (delegation exists in cgroup version 2 only).
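You can check which controllers were actually delegated to the user manager by reading cgroup.controllers inside the user's cgroup; the UID 1000 path below matches the article's example and will differ for other users:

```shell
# Controllers delegated to user 1000's systemd user instance
cat /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgroup.controllers
# Typically lists at least: memory pids
```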
All services running in user 1000's slice are children of user@1000.service. In this tree, we can also see scopes corresponding to the user's locally executed programs:
systemd-cgls | grep scope
│ │ │ ├─app-gnome\x2dtodo-c8bbb1ea42124e4eabe95eca8c02e5f7.scope
│ │ │ ├─app-google\x2dchrome-fa92ae1ce8f54f4298975211065460e7.scope
│ │ │ ├─app-protonvpn-8065209d48094182b0b0c0352d51cd10.scope
│ │ │ ├─app-org.kde.konsole-de12e69356754c5dae23bdfdc108d53a.scope
│ │ │ │ └─44710 grep scope
│ │ │ ├─app-firefox\x2desr-73828cd3b1754ee0b4ffafdd6750507d.scope
│ │ │ ├─app-\x2fusr\x2flib\x2fx86_64\x2dlinux\x2dgnu\x2flibexec\x2fDiscoverNotifier-72a3c7b93c0947d5bbcf6c3beee4e003.scope
│ │ │ ├─app-marktext-d9ed06b1e1724ec482580148c7aa057c.scope
│ │ │ ├─app-\x2fusr\x2fbin\x2fkorgac-9e6550aaf81c4723a7afd6ba45888d14.scope
│ │ └─init.scope
│ └─session-3.scope
├─init.scope
Info
The local terminal session is designated by session-2.scope; the remote terminal session is designated by session-3.scope. Here the terminal session is hosted by konsole, the KDE terminal program.
Important
.scope units are only created programmatically at runtime (not created using unit files), so you can't expect to see any .scope files in the /lib/systemd/system/ directory.
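You can create such a scope yourself: systemd-run --scope wraps an arbitrary command in a transient scope unit, which is exactly how scopes appear at runtime without any unit file (the unit name "demo" below is arbitrary):

```shell
# Run a command in a transient scope under the user manager
systemd-run --user --scope --unit=demo sleep 60 &

# While it runs, the scope is visible in the cgroup tree:
systemd-cgls --user-unit demo.scope
```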
machine.slice
In my configuration, Podman is running (libpod is Podman's container library) and we can see the machine.slice representation like this:
. . .
└─machine.slice
└─libpod-cc06c35f21cedd4d2384cf2c048f013748e84cabdc594b110a8c8529173f4c81.sco>
├─1438 apache2 -DFOREGROUND
├─1560 apache2 -DFOREGROUND
├─1561 apache2 -DFOREGROUND
├─1562 apache2 -DFOREGROUND
├─1563 apache2 -DFOREGROUND
└─1564 apache2 -DFOREGROUND
Managing the tree view of systemd processes
When systemd-cgls is run without parameters, it returns the entire cgroup hierarchy. The highest level of the cgroup tree is formed by slices and can look as follows:
├─system
│ ├─1 /usr/lib/systemd/systemd --switched-root --system --deserialize 20
│ ...
│
├─user
│ ├─user-1000
│ │ └─ ...
│ ├─user-2000
│ │ └─ ...
│ ...
│
└─machine
├─machine-1000
│ └─ ...
...
Info
The machine slice is present only if you are running a virtual machine or a container.
To reduce the output of systemd-cgls and view a specified part of the hierarchy, execute:
$ systemd-cgls $NAME
where $NAME is the resource controller you want to inspect.
Example: the memory controller
$ systemd-cgls memory
memory:
├─ 1 /usr/lib/systemd/systemd --switched-root --system --deserialize 23
├─ 475 /usr/lib/systemd/systemd-journald
[...]
systemd also provides the machinectl command, dedicated to monitoring Linux containers.
Linux also provides systemctl to get the tree view of processes, using systemd units as a parameter to filter queries, with the syntax systemctl status $systemd_unit.
Example: systemctl status user.slice
Why is cgroup important?
Nowadays, servers come with one or more multi-core CPUs and large quantities of memory. Resource management on these "monsters" is more important than it was on older systems. In fact, a server can run multiple services, multiple virtual machines, multiple containers, and multiple user accounts at the same time, so managing resources becomes a priority.
This situation requires more powerful tools to ensure that all these processes and users play nicely together. This is the purpose of cgroup.
What can a sysadmin do with cgroup?
- Manage resource usage by either processes or users.
- Keep track of resource usage by users on multi-tenant systems to provide accurate billing.
- More easily isolate running processes from each other. This not only makes for better security but also allows us to have better containerization technologies than we had previously.
- Run servers that are densely packed with virtual machines and containers due to better resource management and process isolation.
- Enhance performance by ensuring that processes always run on the same CPU core or set of CPU cores, instead of allowing the Linux kernel to move them around to different cores.
- Whitelist or blacklist hardware devices.
- Set up network traffic shaping.
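Several of the points above map directly to systemd resource-control properties; as a sketch, with a hypothetical service name:

```shell
# CPU pinning: keep the service on cores 0 and 1 (cgroup v2 cpuset controller)
sudo systemctl set-property myapp.service AllowedCPUs=0-1

# Device access control: allow only read/write access to /dev/null
sudo systemctl set-property myapp.service DeviceAllow="/dev/null rw"
```

Both properties are documented in systemd.resource-control(5) and are applied through the service's cgroup, without touching /sys/fs/cgroup by hand.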