Containers from scratch — Part 1
There is no better way to learn something than by building it. So let's understand and build a container from scratch.
Nowadays, almost everything runs in containers, and SumUp is no different. Since we use them a lot, I want to improve my understanding of how containers work behind the scenes, which is why we are going to build one from scratch.
Early containers were little more than a mix of Linux Namespaces and Cgroups. Modern container tools like Docker and Podman have grown more complex and rely on more than that. We'll stay in the namespaces field in this post and slowly explore more possibilities in future posts.
Namespaces isolate the Linux kernel's global system resources, like mount points or network devices. Cgroups limit resource usage, like memory and CPU. One provides isolation, while the other keeps the container from starving the host machine.
We have a few types of namespaces, each wrapping specific global system resources in an abstraction. For example, the PID Namespace isolates the process ID number space. It allows every container to have its own init process, and it is the first step to creating containers that won't be able to see the host's processes.
Unsharing the PID Namespace alone won't be enough, though. The second step involves calling the mount syscall, which is something we will be doing in subsequent posts.
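To make the namespace types a bit more concrete, here is a minimal sketch that reads the namespace identifiers of the current process from /proc/self/ns. It assumes a Linux system with /proc mounted, and the function names namespace_id and print_namespaces are made up for this illustration. Two processes that show the same identifier for a given type are in the same namespace of that type.

```c
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: reads the identifier of one namespace type
 * ("pid", "mnt", "net", ...) of the calling process into buf.
 * Returns 0 on success and -1 on error. */
int namespace_id(const char *ns_name, char *buf, size_t len) {
    char path[64];
    int n = snprintf(path, sizeof path, "/proc/self/ns/%s", ns_name);
    if (n < 0 || (size_t)n >= sizeof path || len == 0)
        return -1;

    /* Each entry is a symbolic link whose target names the namespace. */
    ssize_t got = readlink(path, buf, len - 1);
    if (got == -1)
        return -1;
    buf[got] = '\0'; /* readlink does not NUL-terminate */
    return 0;
}

/* Hypothetical helper: prints the identifiers of a few namespace types. */
void print_namespaces(void) {
    const char *types[] = {"pid", "mnt", "net", "uts", "ipc", "user"};
    char id[64];
    for (size_t i = 0; i < sizeof types / sizeof types[0]; i++) {
        if (namespace_id(types[i], id, sizeof id) == 0)
            printf("%-5s %s\n", types[i], id);
    }
}
```

Calling print_namespaces from the host and again from inside a container should show different identifiers for every namespace type the container has unshared.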
Processes created during boot are in the root namespaces, and children spawned by a fork or clone syscall inherit their parent's namespaces. The only way not to inherit them is to ask to unshare the namespaces, either by passing specific flags to the clone syscall or by calling the unshare syscall.
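A minimal sketch of such a program could look like the following. It uses the glibc clone wrapper with the CLONE_NEWPID flag; the names child_fn, spawn_in_new_pid_namespace, and STACK_SIZE are assumptions made for this sketch, as is the choice to return a negative errno value when clone fails.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024) /* assumed 1 MiB stack for the child */

/* Entry point of the child; it runs inside the new PID namespace. */
static int child_fn(void *arg) {
    (void)arg;
    printf("Child with PID %d here!\n", (int)getpid());
    fflush(stdout); /* the child ends without flushing stdio buffers */
    return getpid() == 1 ? 0 : 1; /* we expect to be the namespace's init */
}

/* Clones a child into a new PID namespace and waits for it.
 * Returns the child's exit status, or a negative errno value on failure. */
int spawn_in_new_pid_namespace(void) {
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -ENOMEM;

    fflush(stdout); /* avoid duplicating buffered output in the child */

    /* The stack grows downward on most architectures, so pass its top. */
    pid_t pid = clone(child_fn, stack + STACK_SIZE,
                      CLONE_NEWPID | SIGCHLD, NULL);
    int status = 0;
    int rc;
    if (pid == -1 || waitpid(pid, &status, 0) == -1)
        rc = -errno;
    else
        rc = WIFEXITED(status) ? WEXITSTATUS(status) : 255;

    printf("Parent with PID %d here!\n", (int)getpid());
    free(stack);
    return rc;
}
```

Without the required privileges, clone fails and the function returns -EPERM instead of spawning the child.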
The code above is simple, even though it's written in C. The first part prepares and calls the clone syscall, and then the parent process waits for the child while each of them prints its PID. It's important to note that the child process does not share the same PID Namespace with its parent.
The output is interesting: the child process has PID 1, yet it is a different process from the host's PID 1.
vitor@sumup $ sudo ./bin # run the compiled code
Child with PID 1 here!
Parent with PID 21529 here!
vitor@sumup $ ps -o user,pid,command -p 1 # get host's init process
USER PID COMMAND
root 1 /sbin/init
Running these commands requires being a superuser, and we could avoid this by using User Namespaces. But this is a discussion for later when we talk about rootless containers.
PID namespaces can be nested, and every PID namespace except the root one has a parent. So, for example, the PID namespace we created above for the child process has the root namespace as its parent, but if the child calls the clone syscall again with the same flag, the new namespace's parent won't be the root anymore.
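As a rough sketch of that nesting, the child below clones again with CLONE_NEWPID, so the grandchild lands in a namespace whose parent is the child's namespace rather than the root. All function names here (spawn_in_new_pid_ns, nested_child_fn, grandchild_fn, nested_pid_namespaces) are assumptions made for this illustration, and the code needs the same privileges as before.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024) /* assumed 1 MiB stack per child */

/* Clones fn into a new PID namespace and waits for it. Returns the child's
 * exit status, or a negative errno value if clone or waitpid fails. */
static int spawn_in_new_pid_ns(int (*fn)(void *)) {
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -ENOMEM;

    fflush(stdout); /* avoid duplicating buffered output in the child */
    pid_t pid = clone(fn, stack + STACK_SIZE, CLONE_NEWPID | SIGCHLD, NULL);
    int status = 0;
    int rc;
    if (pid == -1 || waitpid(pid, &status, 0) == -1)
        rc = -errno;
    else
        rc = WIFEXITED(status) ? WEXITSTATUS(status) : 255;
    free(stack);
    return rc;
}

/* Innermost process: the parent of its namespace is the child's namespace. */
static int grandchild_fn(void *arg) {
    (void)arg;
    printf("Grandchild sees itself as PID %d\n", (int)getpid());
    fflush(stdout); /* the process ends without flushing stdio buffers */
    return getpid() == 1 ? 0 : 1;
}

/* First level: lives in a namespace whose parent is the caller's one. */
static int nested_child_fn(void *arg) {
    (void)arg;
    printf("Child sees itself as PID %d\n", (int)getpid());
    int rc = spawn_in_new_pid_ns(grandchild_fn);
    return (getpid() == 1 && rc == 0) ? 0 : 1;
}

/* Returns 0 if both levels ran as PID 1 of their own namespace. */
int nested_pid_namespaces(void) {
    return spawn_in_new_pid_ns(nested_child_fn);
}
```

Each level is the init process of its own namespace, which is why both the child and the grandchild are expected to report PID 1.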
We could add more kinds of namespaces by changing the flags passed to the clone syscall. Another essential one for containers is the Mount Namespace, which isolates the mount points seen from within the namespace. We will create chroot jails soon, and Mount Namespaces will be essential for this task.
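// clones the process, unsharing the PID and Mount Namespaces
pid_t pid = clone(
    child_fn,                     // hypothetical name for the child's entry point
    pchild_stack + 1024 * 1024,   // top of the 1 MiB child stack
    CLONE_NEWPID | CLONE_NEWNS | SIGCHLD,
    NULL);                        // assumed: no argument for the child yet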
Note: if you want a detailed explanation of namespaces, I recommend Michael Kerrisk's series.
Executing a shell
We want to execute different commands inside our container, so we must modify our code to spawn a shell that we can control.
I modified the child's function to execute the file given in the command line. I also changed the clone syscall parameters to unshare the Mount Namespace and pass the arguments to the child function.
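A sketch of what that change might look like is below. The names exec_child and run_in_namespaces are hypothetical, and using execvp (which also searches PATH) is an assumption made here for convenience. The namespace flags are a parameter so the same function can be tried with 0, which needs no special privileges.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024) /* assumed 1 MiB stack for the child */

/* Child entry point: replaces the child with the requested program.
 * arg is a NULL-terminated argument vector, like main's argv. */
static int exec_child(void *arg) {
    char **argv = arg;
    execvp(argv[0], argv);
    perror("execvp"); /* only reached if exec fails */
    return 127;
}

/* Runs argv in a child cloned with the given namespace flags (0 for none).
 * Returns the child's exit status, or a negative errno value on failure. */
int run_in_namespaces(int ns_flags, char **argv) {
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -ENOMEM;

    fflush(stdout); /* avoid duplicating buffered output in the child */
    pid_t pid = clone(exec_child, stack + STACK_SIZE,
                      ns_flags | SIGCHLD, argv);
    int status = 0;
    int rc;
    if (pid == -1 || waitpid(pid, &status, 0) == -1)
        rc = -errno;
    else
        rc = WIFEXITED(status) ? WEXITSTATUS(status) : 255;
    free(stack);
    return rc;
}
```

From a main function, a call such as run_in_namespaces(CLONE_NEWPID | CLONE_NEWNS, &argv[1]) would run the command given on the command line, provided at least one argument was passed.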
vitor@sumup $ sudo ./bin /bin/pwd
/home/vagrant/rootless
vitor@sumup $ sudo ./bin /bin/bash
root@sumup $ whoami
We can now get a shell inside our isolated environment, and we could even run other commands, as shown above.
Isolating mount points and the PID number space was great, but it's time to go further, because we didn't unshare the Mount Namespace in vain. It is the first step toward controlling which processes the container can see and confining its filesystem.
See you in the next post!