Containers from scratch — Part 2
It's time to go a step beyond namespaces. First, we need to finish our container's process isolation, and then we'll isolate the container's filesystem.
This is the second post in my Containers from Scratch series. If you haven't read the first part yet, I suggest starting there. I also recommend looking at SumUp's Medium page, which is full of exciting content.
In the last post, we finished with a piece of code that allowed us to execute a shell inside our container. We were unsharing the PID and Mount namespaces, and we'll keep only these two for now. The code for that version is available as a gist.
The image below shows an issue: all of the host's processes are still visible from inside our container. Unsharing the PID namespace is therefore not enough, and it's time to understand how this namespace and the ps command work before moving forward.
The PID Namespace
As I said in the last post, the PID Namespace isolates the PID number space, which means two processes can have the same ID if they are in different namespaces. But this doesn't mean that tasks in one PID namespace can't see the ones in another namespace.
As mentioned before, namespaces have parents. The parent PID namespace can see all the processes in its child, but the child can't see its parent's processes. So, as we can see in the image below, a mapping is built between parent and child.
The host's process number two is mapped to process number one in the child namespace, the init process, so if you kill the host's process with ID two, you kill every process in the child namespace. But, as you can see, process number one in the parent has no mapping in the child, because the child can't see its parent's tasks.
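The mapping described above can also be observed from the parent side: on Linux 4.1 and later, /proc/<pid>/status contains an NSpid line that lists a process's ID in each nested PID namespace, outermost first. Below is a minimal Python sketch that parses that line. The function name parse_nspid and the sample status text are made up for illustration, and the values simply mirror the two-to-one mapping from the example.

```python
def parse_nspid(status_text):
    """Return the PIDs on the NSpid line, outermost namespace first.

    Returns an empty list when the line is missing (kernels before 4.1).
    """
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []


# Illustrative excerpt of /proc/<pid>/status as read from the host.
# The values are invented to match the mapping described in the text.
sample_status = "Name:\tsh\nPid:\t2\nNSpid:\t2\t1\n"

host_pid, container_pid = parse_nspid(sample_status)
print(f"host PID {host_pid} -> container PID {container_pid}")
```

To try it on a real container process, read /proc/<pid>/status on the host using the PID that the host assigned to the container's shell, and pass the text to parse_nspid.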
If the child isn't supposed to see its parent's processes, why can we see them when running the ps command? To understand it better, we shall trace all the syscalls made by the ps command.
As we can see, the program reads the /proc directory to find information about every process. Our proc filesystem is the same as the host's because we haven't isolated anything at the filesystem level yet, so ps reads the host's process information. The proc filesystem is a special one: the kernel generates its contents on the fly rather than storing them on disk, and the linux.com website has a good post about it.
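To make the dependency on /proc concrete, here is a rough Python sketch of the same idea: it lists processes purely by scanning the numeric directories under a proc mount. It is a simplification, not how ps is implemented (the real tool reads more files per process), and the function name list_processes and its proc_root parameter are assumptions for this example.

```python
import os


def list_processes(proc_root="/proc"):
    """Return {pid: command name} for every numeric directory in proc_root."""
    processes = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "sys" or "self"
        try:
            with open(os.path.join(proc_root, entry, "comm")) as comm_file:
                processes[int(entry)] = comm_file.read().strip()
        except OSError:
            continue  # the process exited (or is unreadable) mid-scan
    return processes
```

Calling list_processes() inside the container at this stage would return the host's processes, because the only thing that decides the result is which proc filesystem is mounted at /proc.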
Running the mount command shown above inside the container solves the issue, but don't forget that you have just changed the host's proc filesystem, so you need to rerun it after leaving the container to restore the host's view. We'll fix this properly by isolating the filesystem.
A new filesystem
I decided to use Alpine's Mini Root Filesystem because it's light and has everything I need. So, after downloading it, I created a rootfs directory and extracted the whole filesystem into it.
Our run_child function has changed, and now we can't see the host's processes because we mounted a new proc filesystem. It also won't affect our host, because we did it after changing the root, which means that the mount point is rootfs/proc for the host but /proc for the container. It's all about perspective.
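Since the updated function isn't reproduced in the text, here is a hedged Python sketch of the steps just described: change the root, then mount a fresh proc filesystem, then start the shell. It is an illustration under assumptions, not the original code. It assumes the process is already running in new PID and Mount namespaces, has root privileges, and that the extracted filesystem lives in a directory called rootfs containing /bin/sh. The helper mount_proc is a hypothetical name that wraps the mount(2) syscall through ctypes.

```python
import ctypes
import os

# Handle to the C library of the running process, used to reach mount(2).
libc = ctypes.CDLL(None, use_errno=True)
libc.mount.argtypes = [
    ctypes.c_char_p,  # source
    ctypes.c_char_p,  # target
    ctypes.c_char_p,  # filesystem type
    ctypes.c_ulong,   # mount flags
    ctypes.c_void_p,  # filesystem-specific data
]


def mount_proc(target="/proc"):
    """Mount a new proc filesystem at target (requires CAP_SYS_ADMIN)."""
    if libc.mount(b"proc", target.encode(), b"proc", 0, None) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err), target)


def run_child(rootfs="rootfs", shell="/bin/sh"):
    """Enter rootfs, mount /proc inside it, and replace this process with a shell.

    Assumption: the caller already unshared the PID and Mount namespaces.
    """
    if not os.path.isdir(rootfs):
        raise FileNotFoundError(f"root filesystem not found: {rootfs}")
    os.chroot(rootfs)        # from now on "/" means the rootfs directory
    os.chdir("/")            # avoid keeping a working directory outside the jail
    mount_proc("/proc")      # rootfs/proc for the host, /proc for the container
    os.execv(shell, [shell])
```

The order matters: mounting after chroot is what makes the path resolve to /proc inside the new root rather than to the host's /proc.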
Processes in one Mount Namespace can't see the mount points created in another, even if that other namespace is a child of theirs, as long as mount propagation between the two is private. That's why we can't see the rootfs/proc files from our host.
As I mentioned in the last post, we put our process inside a chroot jail. Therefore, we have also achieved a level of filesystem isolation. If we had several containers, we would have to copy the rootfs directory for each of them, which is wasteful and is why there are better ways to deal with this, for example, copy-on-write filesystems.
I believe we have enough to say we have created a container and not just some random namespace unsharing. We have also gathered enough knowledge to keep studying and improving this code.
I will stop here, but I recommend understanding and adding Cgroups and User Namespaces. In a future post, I intend to cover only User Namespaces and how we can get rootless containers with them.