Building an init system
“We build our computer (systems) the way we build our cities: over time, without a plan, on top of ruins.” (Ellen Ullman)
Basics of a boot process
Init is just a piece of userspace software. The almighty kernel does most of the work, so let’s review what it does:
The BIOS or UEFI bootloader loads the kernel image.
Optionally, it also loads an initial ramdisk.
The kernel then initializes itself; most steps are called from init/main.c::start_kernel():
- Initialize boot CPU, cgroups, interrupts, arch-specific, memory zones, memory pages
- Parse early command-line parameters, both common and arch-dependent. See the calls to early_param()
- Initialize a lot of internal structures, like caches, locks, trap, scheduler, rcu, timers, softirq, profiling, numa, process table, signals, /proc, cpuset, ftrace…
- Finally, call rest_init() (for all non core-init stuff)
rest_init() creates a kernel thread for the function kernel_init(). As this is the first call to kernel_thread(), it gets PID 1.
Right after creating init, it creates another kernel thread for kthreadd (in kernel/kthread.c), which gets PID 2.
kernel_init() is now running as PID 1 (still in kernel space), and calls kernel_init_freeable(), which will:
- Wait for kthreadd to be fully started
- Initialize all SMP CPUs (until now we only had the boot CPU, often CPU0)
- Call do_basic_setup() to finalize the initialization:
  - Initialize the cpusets for SMP
  - Initialize the shared memory
  - Initialize the driver core through driver_init(), its pieces being: devtmpfs, devices, buses, classes, firmware, hypervisor, platform_bus, cpu_dev, memory_dev, container_dev (mostly located in drivers/base/)
    Some drivers require firmware from userspace through the request_firmware() function. As of 4.10, only a handful of DVB drivers require it. In userspace, this is taken care of by udev.
  - Initialize interrupt lines.
  - Start initcalls (early, core, postcore, arch, subsys, fs, device, late).
- If no “rdinit=” parameter is set, it defaults to “/init”, which is then checked for availability. If it is not available, the value is reset to null.
Once kernel_init_freeable() returns, kernel_init() frees the “init memory” and puts the system in the “running” state. Once all cleanup is done, it calls run_init_process() to try to run the ramdisk init.
If none is set, it tries the one specified by the “init=” boot parameter.
If none is set either, it tries in turn: /sbin/init, /etc/init, /bin/init, /bin/sh.
run_init_process() calls “do_execve” to run the specified binary in userspace. Note that init can be in any format that has a “binfmt” handler available in the kernel (look under fs/binfmt_* for all of them, but mostly you’ll have the ELF or SCRIPT format). The MISC handler can also be used to handle any format, like JRE or PE… Well, forget what I said, don’t do it for init ;-)
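The fallback order described above can be modelled in a few lines. This is only a sketch of the selection logic, not kernel code: `parse_cmdline`, `init_candidates` and the `ramdisk_has` callback are hypothetical names introduced for illustration, and the rule that an explicit “init=” suppresses the default paths reflects the behaviour of recent kernels, which panic when the requested init cannot be run.

```python
# Default paths tried when neither rdinit= nor init= yields a program.
DEFAULT_INIT_PATHS = ["/sbin/init", "/etc/init", "/bin/init", "/bin/sh"]


def parse_cmdline(cmdline):
    """Return the key=value parameters of a kernel command line as a dict."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep:  # ignore bare flags such as "ro" or "quiet"
            params[key] = value
    return params


def init_candidates(cmdline, ramdisk_has):
    """List the programs that would be tried, in order.

    ramdisk_has is a hypothetical callback telling whether a path exists
    in the ramdisk; it stands in for the kernel's own availability check.
    """
    params = parse_cmdline(cmdline)
    candidates = []

    rdinit = params.get("rdinit", "/init")  # rdinit= defaults to /init
    if ramdisk_has(rdinit):
        candidates.append(rdinit)

    if "init" in params:
        # Assumption: as in recent kernels, a failing init= is fatal,
        # so the default paths are not consulted afterwards.
        candidates.append(params["init"])
    else:
        candidates.extend(DEFAULT_INIT_PATHS)
    return candidates
```

On a running system the command line could be read from /proc/cmdline and passed to `init_candidates()` to see which program the kernel would have tried first.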
Now we are in init, and our job begins!
What should an init do?
The very basics of an init system are to:
- Keep running. Init must always be up and running; if it exits, the kernel panics (see find_child_reaper()).
- Manage services (processes): provide their dependencies, create an environment for them (cgroups, limits, namespaces… maybe filesystem and network, but that is out of scope for now), start them in this environment and monitor them (restart / respawn).
- Reap zombies. Every process is a zombie at the end of its life, but usually for such a short time that we never see it. When a child terminates, it releases all its resources (threads, memory, fds…), then the kernel keeps its entry in the process table and sends SIGCHLD to its parent to tell it that the child has finished its job. The parent must collect the child’s exit status (using the wait()-like syscalls) for the child to be fully removed from the process table. Until it does, the zombie remains visible. Orphaned processes are re-parented to init, so init has to reap them as well.
As of Linux 3.4, a new option “PR_SET_CHILD_SUBREAPER” was added to prctl(). It marks a process as a “subreaper”: its orphaned descendants are re-parented to it instead of init. It is mostly used for containers, where a supervisor manages the life of the whole namespace and its applications.
- Halt / reboot the system. Aside from stopping the processes, this is done with the reboot(2) syscall (number 169 on x86_64), which also covers halt and power-off. Note that shutdown(2) (number 48 on x86_64) is unrelated: it shuts down a socket connection.
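The zombie-reaping duty from the list above can be sketched with the standard `os.waitpid()` call. This is a minimal illustration, not a full init loop; the function name `reap_children` is hypothetical, and a real init would typically call it whenever SIGCHLD is received.

```python
import os


def reap_children():
    """Collect every child that has already exited, without blocking.

    Returns a list of (pid, code) tuples; code is the exit status, or the
    negated signal number when the child was killed by a signal.
    """
    reaped = []
    while True:
        try:
            # -1 means "any child"; WNOHANG makes the call non-blocking.
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        if os.WIFEXITED(status):
            code = os.WEXITSTATUS(status)
        else:
            code = -os.WTERMSIG(status)
        reaped.append((pid, code))
    return reaped
```

Looping until `waitpid()` reports nothing left matters because several children may exit before the parent gets to handle a single SIGCHLD.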
There are also non-core features that init usually provides, such as:
- Spawning terminals (tty0, ttyS0)
- Handling special events, like “Ctrl+Alt+Del” or ACPI calls (power button pressed)
- Delegating device management to udev
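Some of these events reach init as plain signals. For instance, once the Ctrl+Alt+Del keystroke has been disabled through reboot(2), the kernel sends SIGINT to PID 1 instead of rebooting immediately. The sketch below records incoming signals as symbolic events for a main loop to process later; the event names, the `pending` list and the use of SIGTERM as a shutdown request are assumptions made for this example.

```python
import signal

# Signal -> symbolic event name. The names are illustrative only.
EVENTS = {
    signal.SIGINT: "ctrl-alt-del",         # sent to PID 1 when CAD is disabled
    signal.SIGCHLD: "child-exited",        # a child needs to be reaped
    signal.SIGTERM: "shutdown-requested",  # assumption: convention of this sketch
}

# Events waiting to be processed by the main loop.
pending = []


def _record(signum, frame):
    """Signal handler: only record the event, do the real work later."""
    pending.append(EVENTS[signum])


def install_handlers():
    """Route every signal listed in EVENTS to the recording handler."""
    for signum in EVENTS:
        signal.signal(signum, _record)
```

Keeping the handler this small is a deliberate choice: the actual reaction (reaping, rebooting, stopping services) then happens in the main loop, where it is easier to reason about.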
Who is our target audience?
Nowadays, init may run on systems that serve very different purposes:
- Bare-metal server: standard server, hypervisor, container host, storage server… For these servers, stability is the keyword. The hardware checks at startup take far longer than the boot sequence itself, so gaining 30 seconds on the boot process is totally useless.
All core systems (CPU, memory, network, init) should not change, and you’ll want to remove every feature that may affect them. No need for NetworkManager: just configure the NICs and never touch them again. Should init be updated, it must be re-executable without rebooting the whole system.
- Virtual: be it a container or a VM, we are in a safer environment, as most of the hardware parts are virtualized / abstracted by the host kernel. There may be a greater need for speed, memory restriction and service management in this kind of instance. But most likely, you don’t want the guests to go back to the host.
- Embedded: these are maybe the most critical pieces of IT. Embedded is not just your home assistant running on a Raspberry Pi; it’s mostly industrial equipment controlling lasers, injection systems… Most likely running on a real-time kernel, predictability and minimal jitter are the requirements here. Startup does not need to be fast, it must be constant. The same goes for process management: the core applications must not be disturbed by another process that may start. Also, memory is constrained, so no fancy stuff.
- Desktop: modern computers have decent hardware, but that is no reason to consume resources in excess (even for debugging). The BIOS boot process is quite fast compared to bare-metal servers, so we want a fast boot that lets the user get to work quickly. Systemd does quite a good job in that setup, as it
- Laptop: same as desktop, but with mobility added. That mostly means a whole lot of changes in network and power management. For this, NetworkManager is really recommended, contrary to the server world.
Where to place the slider?
As we saw, systems are used in a wide range of ways, some totally incompatible with others (embedded vs. laptop, server vs. virtual). This leads to compromises:
- Use standard, existing tools with known and tested behaviour and pay the fork() price, or re-implement them in the init system?
- Use buffers and increase latency for the sake of throughput, or keep the memory footprint low at the cost of a slower startup?
- Do IPC with Unix sockets, loopback or shared memory?
- How to handle transient errors: retry? Up to which point? Or just fail to keep data safe?
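To make the last trade-off concrete, here is one possible policy: retry a bounded number of times with an increasing delay, then fail loudly. It is a sketch under stated assumptions, not a recommendation: the helper name `retry`, its default values and the choice to treat only `OSError` as transient are all made up for this example.

```python
import time


def retry(action, attempts=3, base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Call action() until it succeeds or the attempts are exhausted.

    Only OSError is considered transient here (an assumption of this
    sketch); any other exception propagates immediately.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except OSError:
            if attempt == attempts:
                raise  # give up: failing is safer than retrying forever
            sleep(delay)
            delay = min(delay * 2, max_delay)  # exponential backoff, capped
```

Both the number of attempts and the delays are parameters, so that an embedded target can disable retries entirely while a laptop on a flaky network can be more patient.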
We can still fulfill most of these needs with a single system, if it is configurable enough and its parts are loosely coupled!
Let the users choose whether they want parallel or sequential startup, custom logging or standard syslog, embedded network management or pure one-shot configuration.
There is no “one size fits all” in IT. Any product that forces users to do things its way will be flawed and generate a lot of anger.
What features should be provided?
Use cgroups and namespaces to follow the life of processes. They provide a reliable way to track processes and their resource consumption, with no possibility of escaping (by double fork). However, not all systems have them, so make it a nice feature, not a mandatory requirement.
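Treating cgroups as optional can be as simple as probing for them at startup and falling back to plain PID tracking. The sketch below only detects a cgroup v2 hierarchy, which exposes a `cgroup.controllers` file at its root; the function names and the "cgroup" / "pid" labels are hypothetical.

```python
import os


def cgroup_v2_available(root="/sys/fs/cgroup"):
    """Tell whether a cgroup v2 hierarchy is mounted at root.

    The root of a cgroup v2 hierarchy contains a cgroup.controllers file.
    """
    return os.path.isfile(os.path.join(root, "cgroup.controllers"))


def pick_tracker(root="/sys/fs/cgroup"):
    """Choose how services will be tracked (labels are illustrative)."""
    if cgroup_v2_available(root):
        return "cgroup"  # reliable: children cannot escape by double fork
    return "pid"         # fallback: follow the main PID only
```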
Non-root users want to use services too, by providing their own configuration with non-privileged ports and filesystem locations, while keeping the nice daemon/service/keepalive features.
Multiple instances are useful. OpenRC provides them by symlinking to the main script and checking the name under which it was called. This allows a single script to be used with different included configurations.
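The name-based trick can be illustrated as follows. This is loosely modelled on the "service.instance" naming seen with OpenRC (for example a net.eth0 symlink), but it is not OpenRC code: `split_instance`, `config_files` and the `/etc/conf.d` default are assumptions made for the example.

```python
import os


def split_instance(argv0):
    """Split the invoked name into (service, instance).

    "net.eth0" gives ("net", "eth0"); a name without a dot gives
    (name, None), meaning the script was called directly.
    """
    name = os.path.basename(argv0)
    service, sep, instance = name.partition(".")
    return service, (instance if sep else None)


def config_files(argv0, conf_dir="/etc/conf.d"):
    """List configuration files to include, generic first, then per instance."""
    service, instance = split_instance(argv0)
    files = [os.path.join(conf_dir, service)]
    if instance is not None:
        files.append(os.path.join(conf_dir, service + "." + instance))
    return files
```

Loading the generic file first lets the per-instance file override only what differs.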
Dependencies are hard. Don’t dive into the code right away; think a lot about the different situations and use cases the users will want covered, in order to get the best architecture possible.
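The simplest part of the dependency problem, ordering, can be sketched with the standard `graphlib` module (Python 3.9+). The service names are invented, and real-world concerns such as optional dependencies, failures or runtime changes are deliberately left out. `start_order` gives a sequential startup order, `start_batches` groups services that could be started in parallel.

```python
from graphlib import TopologicalSorter


def start_order(deps):
    """Return one valid sequential start order.

    deps maps each service to the set of services it needs first.
    Raises graphlib.CycleError if the dependencies form a loop.
    """
    return list(TopologicalSorter(deps).static_order())


def start_batches(deps):
    """Return successive batches; services in a batch may start in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose needs are met
        batches.append(ready)
        sorter.done(*ready)                 # pretend the whole batch started fine
    return batches


# Hypothetical services, for illustration only.
EXAMPLE = {
    "sshd": {"network", "syslog"},
    "network": {"udev"},
    "syslog": set(),
    "udev": set(),
}
```

With `EXAMPLE`, `start_batches()` yields three batches: syslog and udev first, then network, then sshd.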
Spare maintainers from having to write startup scripts (these have proved to be of uneven quality) by providing unified management. But do provide custom features too. There will be corner cases you can’t even think about.
How to keep the product safe?
Such a critical piece of software is invariably put under stress, so multiple rules must be enforced during development.
Somewhere, someday, things will go terribly wrong. Stability is the key, especially in the worst situations: not enough memory (filled or fragmented), not enough disk space, stalled I/O due to a bus error, corrupted environment variables, garbled communications, a flapping network in a diskless NFS environment… init *must* be paranoid and fault-tolerant, with an intelligent timeout system.
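As a small illustration of the timeout idea, the sketch below runs a helper command but refuses to wait for it forever. It relies on the standard `subprocess.run()` timeout, which kills the child when the deadline expires; the function name and the convention of returning `None` on timeout are assumptions of this example, and it says nothing about choosing a sensible timeout value.

```python
import subprocess


def run_with_timeout(argv, timeout):
    """Run argv and return its exit code, or None if it timed out.

    On timeout, subprocess.run() kills the child and waits for it,
    so no zombie is left behind.
    """
    try:
        result = subprocess.run(argv, timeout=timeout, stdin=subprocess.DEVNULL)
    except subprocess.TimeoutExpired:
        return None
    return result.returncode
```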
There will be bugs. Do fuzzing before releasing a stable version. You’ll avoid the situation where an end user hits them and has to report them. Use a build chain to get repeatable and logged builds.
Don’t add features where there is no need. You don’t want to replace crontab, ntp, syslog… Once the PID 1 process is production-ready, you may try to provide features like these, but let the end users choose whether they want them or not. That will seriously reduce the attack surface.
Be transparent. The biggest issue in software is not how critical the bugs are, it’s how developers react to them.