A Generic Health Check Framework for systemd
Not too long ago, I applied to Google Summer of Code as a student, with a Fedora project proposed by Peter Robinson, the principal IoT architect at Red Hat, named Fedora IoT: Atomic Host Upgrade Daemon. As you may have guessed by now, I was very fortunate and the proposal was accepted! The coding phase started on the 14th of May, and in this blog post I’ll try to give a little insight into my first month working on the project.
Swimming in Cold Waters
aka Bash in my Head
On the first day, my mentors Peter, Dusty Mabe, Jonathan Lebon and I (as well as a few others!) had our first video chat meeting to discuss the design and milestones of the GSoC project.
The proposed Upgrade Daemon should frequently check for available updates, apply them, restart the system, check the system health, and in case of failures, reboot into the old, previously working version of the operating system. It should also enable the user to define their custom, highly system-specific health check routines that are run as part of it in order to determine the system health status.
While I had been using (and had absolutely fallen in love with) Fedora Atomic Host for a while, I had only a very slight idea of what was about to come.
A program as described in our proposal touches on quite a number of moving parts in the Linux world: GRUB, systemd, and OSTree (libostree & rpm-ostree), which is the base for Project Atomic (a project labeled ‘Git for Operating Systems’ by its architect Colin Walters).
Much of the needed functionality already exists in various places, so I just had to tie the pieces together in a sensible manner. This may sound simple to some, but for me it was A LOT of information to digest and to read up on.
OSTree has recently gained auto-update functionality, so a few days into GSoC we agreed that re-using systemd for many of the missing parts was a good way forward:
On boot, systemd runs check services grouped under a health check target. Whether the target is reached determines the system’s health status. Another systemd service then runs one set of procedures on success and a different set on failure.
On atomic systems it could call rpm-ostree rollback --reboot on failure, rolling the system back to its previous version.
This approach is also abstract and versatile enough to be useful on non-OSTree based systems, essentially making it a ‘Generic Health Check Framework for systemd’.
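To make the GREEN/RED decision concrete, here is a minimal sketch of the same logic in plain POSIX shell. It is not how greenboot does it (greenboot relies on a systemd target being reached), and the function names and the two directory arguments are assumptions made up for this illustration.

```shell
#!/bin/sh
# Sketch of the boot status decision, outside of systemd.
# The directories passed in are hypothetical; greenboot's real paths may differ.

# Run every executable file in a directory; return non-zero if any failed.
run_checks() {
    dir=$1
    failed=0
    for script in "$dir"/*; do
        # Skip unmatched globs, subdirectories and non-executable files.
        [ -f "$script" ] && [ -x "$script" ] || continue
        if ! "$script"; then
            echo "check failed: $script" >&2
            failed=1
        fi
    done
    return "$failed"
}

# Print GREEN if all required checks pass, RED otherwise.
# Failing wanted checks are reported but do not change the status.
boot_status() {
    required_dir=$1
    wanted_dir=$2
    run_checks "$wanted_dir" || echo "some wanted checks failed (ignored)" >&2
    if run_checks "$required_dir"; then
        echo GREEN
    else
        echo RED
    fi
}
```

Calling boot_status with a directory of required checks and a directory of wanted checks prints the resulting status. An empty or missing directory counts as "no failures" in this sketch.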
For the sake of moving quickly, I decided to prototype using bash scripts instead of C as I had originally planned (do not worry, C is still planned for future releases). And while I hadn’t produced a single working line of code in the first week, at times flicking through man pages like a madman, I now finally had a clear idea of how to approach this whole thing. And even better, I almost immediately came up with a pretty cool name for it.
Putting on the Boots
On systems running Fedora 28, you can install, run and check the output of the current alpha version of greenboot like this (you could additionally install greenboot-reboot to reboot on RED status):
dnf copr enable lorbus/greenboot
dnf install greenboot greenboot-notifications
systemctl enable greenboot.target
systemctl start greenboot
journalctl -u greenboot.target
journalctl -u greenboot
journalctl -t greenboot.sh
The following directory structure is created to place your custom scripts in:
Customize Health Checking Behaviour
You now have multiple options to customize greenboot’s health checking behaviour:
- Put scripts representing health checks that MUST NOT FAIL in order to reach a GREEN boot status into
- Put scripts representing health checks that MAY FAIL into
- Create oneshot health check service units that MUST NOT FAIL like the following and put them into
Description=Custom Required Health Check
- Create oneshot health check service units that MAY FAIL like the following and put them into
Description=Custom Wanted Health Check
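For orientation, here is a hedged sketch of what a complete oneshot health check unit might look like, written as a small shell helper that prints the unit to stdout. Only the Description= lines and the oneshot type come from the text above. The function name, the command path in the usage note, and hooking the unit to greenboot.target via RequiredBy=/WantedBy= are my assumptions, so check them against the greenboot repository before relying on them.

```shell
#!/bin/sh
# Print a oneshot health check unit to stdout.
#   $1: unit description
#   $2: absolute path of the check command (hypothetical, supply your own)
#   $3: "required" (MUST NOT FAIL) or "wanted" (MAY FAIL)
# Assumption: checks are attached to greenboot.target through the
# [Install] section; the real greenboot layout may differ.
health_check_unit() {
    case $3 in
        required) dependency=RequiredBy ;;
        wanted)   dependency=WantedBy ;;
        *)
            echo "usage: health_check_unit DESC CMD required|wanted" >&2
            return 1
            ;;
    esac
    cat <<EOF
[Unit]
Description=$1

[Service]
Type=oneshot
ExecStart=$2

[Install]
$dependency=greenboot.target
EOF
}
```

For example, health_check_unit "Custom Required Health Check" /usr/local/bin/my-check required > my-check.service writes a unit for a made-up check command. With Type=oneshot, systemd waits for the command to exit, and a non-zero exit status marks the unit as failed.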
Customize GREEN Status Behaviour
- Put scripts representing procedures you want to run after a GREEN boot status has been reached into
Customize RED Status Behaviour
- Put scripts representing procedures you want to run after a RED boot status has been reached into
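As an illustration of a RED procedure, the rollback mentioned earlier could be wrapped in a small script along these lines. This is a sketch under assumptions: the function name and the DRY_RUN switch are invented here for safe experimentation, and it does nothing to prevent endless reboot loops.

```shell
#!/bin/sh
# Sketch of a RED status procedure: roll back to the previous deployment.
# DRY_RUN is a hypothetical switch, not something greenboot provides.
red_rollback() {
    # Only OSTree-based systems ship rpm-ostree; bail out elsewhere.
    if ! command -v rpm-ostree >/dev/null 2>&1; then
        echo "rpm-ostree not found; nothing to roll back" >&2
        return 1
    fi
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "would run: rpm-ostree rollback --reboot"
        return 0
    fi
    # Switch to the previous deployment and reboot into it.
    rpm-ostree rollback --reboot
}
```

Running DRY_RUN=1 red_rollback on an Atomic host prints the command instead of executing it, which is handy while testing custom RED scripts.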
While the prototype is working reasonably well now, there are still quite a few things on the ToDo list for greenboot:
- Create reasonable default health checks and red/green procedures
- Support/document hardware watchdog services as health checks
- Create Documentation
- CI on Fedora Atomic
- Convert source to C
- Create protocol for tracking number of boot attempts to avoid endless reboot loops (this will probably not be part of greenboot itself)
In conclusion I can say this past month has been an amazing experience! For the first time I got to work side by side with professionals. I have already learned so much and it just keeps on getting better!
I would like to thank
- Google for making it possible for me to work on Open Source,
- my mentors Peter, Dusty and Jonathan for their amazing help, patience and reviews,
- the Fedora Atomic Community for their great attitude, helpfulness and creating a friendly and productive atmosphere, and
- everyone who has already weighed in on greenboot or shared it online!
Feel free to comment and give feedback on:
- The Fedora IoT Pagure Project: https://pagure.io/fedora-iot/issues
- The systemd mailing list: https://lists.freedesktop.org/archives/systemd-devel/2018-May/040796.html
- The GitHub repository: https://github.com/LorbusChris/greenboot
- My Twitter: https://twitter.com/LorbusChris