IoT Lightweight Linux Containers
As Linux containers continue to evolve rapidly, their adoption in the Embedded and IoT industry sometimes requires adaptation, mainly to fit the final run-time environment. In this article we will see how some Linux features can be used to give containers a degree of privacy.
Everyone has their own definition of “Containers”, but most will probably agree that they should offer: 1) a quick develop-and-ship workflow, achieved by isolating the program and its resources from the rest of the system, and 2) a sandbox mechanism. Tools already exist to meet these needs; however, in an Embedded and IoT context some of them may need to be adapted to better suit their environment and the final devices where apps are shipped.
IoT Containers and Sandboxes
Resin.io is one notable example. They offer a platform running resinOS, which is generated with a Yocto layer named “meta-resin”; this infrastructure allows them to support multiple devices and to bring Linux containers and Docker support transparently. There are other examples too, and other IoT vendors seem to use lightweight sandbox tools to mimic container functionality to some degree. Maybe we will see more flavors.
Using one specific resource bundling format in the IoT world can be a bit restrictive: one has to ensure that apps can easily be ported from one container engine or version to another, not to mention the security implications if this fails. Therefore, I think it would be better to have a generic setup where the container resources take the form of a simple directory tree, a tarball, a subvolume, a raw image, etc. The Linux kernel maintains backward compatibility, mount namespaces can be used to solve resource bundling and isolation, and the tools to encode and decode such generic formats already exist. All this brings us to the next point: Lightweight Containers and Sandboxes.
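As a rough illustration of the generic bundle idea, here is a minimal sketch that ships an app as a plain tarball and unpacks it into a directory tree using only the Python standard library. The function names pack_app and unpack_app are hypothetical, and nothing here is tied to any particular container engine; the unpacked tree is simply what a mount namespace could later expose to the app.

```python
import tarfile
import tempfile
from pathlib import Path


def pack_app(app_dir: Path, bundle: Path) -> None:
    """Pack an app directory tree into a gzip-compressed tarball."""
    with tarfile.open(bundle, "w:gz") as tar:
        # arcname="." stores member paths relative to the app root,
        # so the bundle can be unpacked anywhere.
        tar.add(app_dir, arcname=".")


def unpack_app(bundle: Path, rootfs: Path) -> None:
    """Unpack a bundle into the directory that will become the app root."""
    rootfs.mkdir(parents=True, exist_ok=True)
    with tarfile.open(bundle, "r:gz") as tar:
        # Only unpack bundles you trust: extractall() follows the
        # member paths recorded in the archive.
        tar.extractall(rootfs)
```

The same two steps work for any tool that understands tarballs, which is the point: the format does not lock the app to one engine or version.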
Lightweight Containers and Sandboxes
First, it makes sense to cite the building blocks that make Linux containers possible, namespaces and cgroups, and to point to the LWN article “Namespaces in operation”, which explains Linux namespaces. As said in the beginning, everyone has their own view on what “Containers” means, so I will spare you the pain and say that “Lightweight Containers” are containers that do not use all of the Linux namespace features.
Lennart Poettering already presented the “portable system services” idea, which is basically to integrate services or apps with the system, have some kind of resource bundling, and let the sandbox mechanism handle host and service security. You can read more about it in the LWN article “Portable system services”.
Some of these ideas make a lot of sense in the IoT world: we may need some Linux namespaces, but not all of them. The mount namespace is the most important one, as we need it to ship apps with their related resources while keeping them isolated from each other.
On the other hand, the PID (Process ID) namespace can be omitted, depending on the use case. Creating PID namespaces consumes resources, and each PID namespace needs a reaper: the Linux kernel implementation assumes that there is always one reaper per PID namespace. Container folks are aware of this, hence they introduce workarounds such as a stub process that collects signals, reaps processes, etc.
I have been using resinOS, but at the same time I also built my own setup, and came to the conclusion that in some cases I really should avoid PID namespaces. I need one supervisor plus a monitoring feature, so why should I duplicate that functionality in each PID namespace? Then why am I using PID namespaces at all? Most of the time the answer is: “privacy”.
If I have some apps shipped with all their resources on my IoT device, I do not want them to peek at each other. Currently in Linux you can access another process's memory with the ptrace syscall, or use other direct system calls to check or compare process-related resources. All of these can be restricted by a sandbox mechanism using seccomp filters, or by running apps under different UIDs. On the other hand, procfs /proc/<pids>/ can still be used to peek at processes through ordinary filesystem system calls. Seccomp filters will not help in this case, since they cannot inspect the path strings passed to those calls.
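To make the reaper role concrete, here is a minimal sketch of the kind of loop such a stub process runs: it collects every child that has already exited so that no zombies pile up. The function name reap_children is hypothetical, and a real supervisor would typically call it from a SIGCHLD handler or its main loop.

```python
import os


def reap_children() -> list[tuple[int, int]]:
    """Collect every child that has already exited, without blocking.

    Returns a list of (pid, exit_code) pairs.
    """
    reaped = []
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append((pid, os.waitstatus_to_exitcode(status)))
    return reaped
```

With one PID namespace per app, every namespace needs a process doing this job; with a single shared namespace, one supervisor can do it for all apps.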
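The following sketch shows how little is needed to peek at other processes through procfs: only directory listing and file reads, no ptrace. The function name visible_processes and its proc_root parameter are hypothetical, added so the code can also be pointed at a test directory.

```python
import os
from pathlib import Path


def visible_processes(proc_root: str = "/proc") -> dict[int, str]:
    """Map each PID visible under proc_root to its command line."""
    found = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo"
        try:
            raw = Path(proc_root, entry, "cmdline").read_bytes()
        except OSError:
            continue  # the process exited, or access was denied
        # Arguments in /proc/<pid>/cmdline are separated by NUL bytes.
        text = raw.replace(b"\0", b" ").decode(errors="replace").strip()
        found[int(entry)] = text
    return found
```

Running visible_processes() inside an app shows what that app can learn about its neighbours; this is exactly the view that needs to be restricted.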
There is a procfs protection, “hidepid”, that hides from procfs the processes that we cannot ptrace. It is really effective, and it comes from the Openwall and grsecurity patches. It was originally designed for shared hosting, to prevent users and scripts from revealing website access activity, and it was upstreamed by Vasiliy Kulikov. Heya :)! Thanks to him!
However, for today’s use cases the protection needs to be modernized. It is “all or nothing”: if you have one app that needs to read /proc/<pids>/, then applying the “hidepid” protection affects all apps, including the one that needs to read /proc/<pids>/, which is too restrictive. The only way around it is to create a new PID namespace, which brings us back to the first issue.
The real issue is that, internally, procfs mounts are shared inside the same PID namespace: updating the options of one procfs mount propagates to all the other mounts.
Different Linux subsystems have been modernized to work better under containers or sandboxes. The devpts filesystem is one real example: different mounts used to be shared, whereas now each new devpts mount is by default a distinct instance, totally separated from the others. It seems that procfs needs to go through the same modernization, which would allow separate procfs mounts per app under the same PID namespace and block processes from peeking at each other by default, without creating a new PID namespace for each app.
I had some private patches that blocked processes from reading other processes through /proc/<pids>/ without affecting other procfs mounts or apps. I was using them successfully, but I never submitted them upstream. After a discussion with Andy Lutomirski it turned out that they would not really fit upstream; therefore, as suggested by Andy, another approach was proposed upstream. It allows procfs to have new, separate instances: each app can have its own private instance, where mount or remount operations do not propagate to the other procfs mounts inside the same PID namespace. This is better for Linux, and it allows us to modernize and fix some leftover things.
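To check how “hidepid” is currently applied, one can look at the mount options of every procfs mount. Below is a minimal sketch that parses text in the /proc/mounts format; the function name procfs_hidepid is hypothetical. It assumes that the option is not listed when the protection is off (reported here as "0"), and it returns the value as a string because, depending on the kernel version, it may be numeric or symbolic.

```python
def procfs_hidepid(mounts_text: str) -> dict[str, str]:
    """Map each procfs mount point to its hidepid value.

    mounts_text uses the /proc/mounts layout:
    device mountpoint fstype options dump pass
    """
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] != "proc":
            continue  # not a procfs mount
        value = "0"  # assumed default when the option is not listed
        for option in fields[3].split(","):
            if option.startswith("hidepid="):
                value = option.split("=", 1)[1]
        result[fields[1]] = value
    return result
```

On a live system it can be fed with the content of /proc/mounts, for example procfs_hidepid(open("/proc/mounts").read()).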
The actual patches: “RFC v2 proc: support private proc instances per pidnamespace”.
With these patches, apps will not be able to notice other processes through /proc/<pids>/, which means that running “ps” or “top” won’t show other processes. Of course there are other ways to find out whether a PID is active; these can be addressed separately later.
The referenced proc patches modernize procfs. They are still an RFC that will hopefully be cleaned up and submitted again.
The other improvement brought by these patches is that they give Linux Security Modules a security path inside procfs for PIDs. LSMs can use it to tighten access; as an example, the Yama LSM could be updated to only allow access to inferior processes through /proc/<pids>/, maybe later allowing some privacy for processes that run under the same user.
To conclude without straying from the context: this article focused mainly on Embedded and IoT Linux. Of course everyone has their own use cases; some will use all the features, others only a subset, and that is totally fine!
Maybe in the next article we will see how to improve and optimize app monitoring, or lightweight monitoring, with such setups.
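One such way, shown here only as an illustration, is to send signal 0 with kill(): no signal is delivered, but the error code reveals whether the PID exists. The function name pid_is_active is hypothetical.

```python
import os


def pid_is_active(pid: int) -> bool:
    """Probe a PID without procfs by sending the null signal."""
    try:
        os.kill(pid, 0)  # signal 0 performs error checking only
    except ProcessLookupError:
        return False  # ESRCH: no such process
    except PermissionError:
        return True  # EPERM: the process exists but belongs to someone else
    return True
```

Hiding /proc/<pids>/ alone does not stop this kind of probing, which is why it has to be solved on its own.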