Step-by-Step guide to implement a basic Linux container / sandbox (I)

I wanted to find a lightweight sandbox for my dirt-cheap Linux server. I came across this blog on implementing a container in 500 lines of code, but it was apparently written for people with more background on this topic than I had. I couldn’t understand some of the concepts and had to google around. I still have questions, but I got a basic sandbox working. This is my gentle version of how to implement a basic Linux sandbox. I can’t fit everything in one post, so this will be a short series. The code is shared on my GitHub, and I will also clean this up into a formal tutorial in the future. BTW, rabbitc is a similar container written in Rust, and cri-o seems to be another lightweight choice.

The Linux APIs you need to implement a container look a bit messy. I feel that they are the result of unplanned organic growth. There is no single set of APIs to get the job done. Different kinds of isolation are handled by different groups of APIs, and certain steps are more complex than I expected. As the original author said:

The scope of each mechanism is pretty unclear. They overlap a lot and it’s tricky to find the best way to limit things.

It is worth mentioning the overall architecture of the container. There are 3 processes involved, not just 2 (the container and an untrusted process). The container itself actually consists of 2 processes. One runs outside the contained environment and launches everything; the other runs inside the contained environment, helps set it up, and then launches the untrusted process.

The container process A launches a new process B and notifies the system that this process will be containerized. But at the point B starts, not everything is configured properly, so B needs to work with A to perform the configuration. Once that part is done, B launches the untrusted process C.

A key concept behind this is the namespace mechanism. It can give each process an individual view of the system. For example, a process X can see a different set of network interfaces than a process Y does. They can also each see a different list of users and even a different file system.
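One way to see which namespaces a process lives in is to read the symbolic links under /proc/self/ns. The helper below is a small sketch of that idea; the name namespace_id is my own and is not part of the original code.

```cpp
#include <unistd.h>
#include <climits>
#include <cstddef>
#include <string>

// Hypothetical helper (not from the original post): returns the identifier of
// one of the calling process's namespaces, in the form "uts:[<inode number>]",
// or an empty string if the link cannot be read.
std::string namespace_id(const std::string &kind)
{
    const std::string path = "/proc/self/ns/" + kind;
    char buf[PATH_MAX];
    // Each entry under /proc/<pid>/ns is a symbolic link whose target names
    // the namespace. Two processes are in the same namespace exactly when
    // their link targets are identical.
    const ssize_t len = readlink(path.c_str(), buf, sizeof(buf) - 1);
    if (len < 0)
    {
        return {};
    }
    return std::string(buf, static_cast<std::size_t>(len));
}
```

Printing namespace_id("net") or namespace_id("uts") in process A and again in process B should give different values once B has been started in its own namespaces.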

The first step of starting a contained environment is starting the process B and telling the system that B’s namespaces need to be separated. This is achieved by calling the clone function. The clone function takes a function pointer, a pointer to the child’s stack, a set of flags and a payload as its inputs. The function pointer will be called within the newly created process B. This is quite similar to the function fork. The difference is that fork duplicates the program at the point where fork is called, and the resulting 2 processes take different branches based on the process id returned by fork. Clone, on the other hand, launches a new process and lets it execute the function you provided.

int flags{CLONE_NEWNS | CLONE_NEWCGROUP | CLONE_NEWPID | CLONE_NEWIPC | CLONE_NEWNET | CLONE_NEWUTS};
// stack + STACK_SIZE: clone expects a pointer to the top of the child's stack.
// SIGCHLD is the signal the parent receives when the child exits.
if ((childPid = clone(child, stack + STACK_SIZE, flags | SIGCHLD, &config)) == -1)
{
    std::cerr << "Process Launch Failed." << std::endl;
    return 1;
} else {
    std::cout << "Parent Process PID: " << getpid() << " Child Process PID: " << childPid << std::endl;
}

Here child is the function, and &config is the payload (input argument) for the child function. What’s interesting is the flags passed into the clone function. They basically say that the new process needs to be separated from the rest of the system in the listed areas: mounts, cgroups, process ids, IPC, network and hostname (UTS).
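The snippet above leaves out where the stack comes from and how the parent waits for the child. Here is a minimal sketch of those missing pieces. The names run_in_child and add_one are made up for illustration, and it passes no namespace flags by default so that it also runs without root.

```cpp
#include <sched.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <cstddef>

// Hypothetical helper: runs fn(arg) in a child created with clone() and
// returns the child's exit code, or -1 if anything went wrong.
int run_in_child(int (*fn)(void *), void *arg, int namespace_flags = 0)
{
    constexpr std::size_t kStackSize = 1024 * 1024;
    void *stack = mmap(nullptr, kStackSize, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED)
    {
        return -1;
    }
    // The stack grows downwards, so clone() gets a pointer to its top.
    char *stack_top = static_cast<char *>(stack) + kStackSize;
    const pid_t pid = clone(fn, stack_top, namespace_flags | SIGCHLD, arg);

    int result = -1;
    int status = 0;
    if (pid != -1 && waitpid(pid, &status, 0) == pid && WIFEXITED(status))
    {
        // The value returned by fn becomes the child's exit code.
        result = WEXITSTATUS(status);
    }
    munmap(stack, kStackSize);
    return result;
}

// Example payload: the child reads the int behind arg and returns it plus one.
int add_one(void *arg)
{
    return *static_cast<int *>(arg) + 1;
}
```

Calling run_in_child(add_one, &value) with value set to 6 makes the child exit with code 7. Passing the CLONE_NEW* flags as the third argument should behave like the snippet above, but then the call needs the privileges to create those namespaces.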

A bare-bones implementation of the child process could look like this:

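int child(void *arg)
{
    // execve takes non-const char pointers, so keep the path in a writable array.
    char path[]{"/bin/bash"};
    char * const argv[]{path, nullptr};
    // On success execve never returns: the process image is replaced by bash.
    if (execve(argv[0], argv, nullptr) == -1)
    {
        std::cerr << "Unable to start process: " << errno << std::endl;
        return -1;
    }
    return 0;
}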

Here you do nothing other than launch another (untrusted) process C, which in this case is the bash shell. I like to use bash to debug this code because it lets you easily check a lot of things, for example the network interfaces and the root filesystem, so you can quickly verify whether things are working as expected. But while using bash, I found that it would mysteriously quit. When it did, the terminal emulator I use received some signal from it and closed itself as well, which makes debugging difficult. So I wrote a simple shell script to be launched by my toy container instead. Within the shell script, I perform the same checks:

#!/bin/bash
ls /
ip address show
ifconfig
hostname
echo "done"

You could say the above code is a basic container already. For example, you will notice that the network interfaces printed by the above shell script are now different: it only shows an inactive loopback interface. But this container isn’t secure at all. For example, it can still access the same rootfs.

The first step is assigning a different hostname, to fool the untrusted process into thinking that this is a different machine.

if (sethostname("caged", 5) == -1)
{
    std::cerr << "Unable to set Hostname." << std::endl;
    return -1;
}

The original blog has fairly complex logic to pick an interesting hostname at random. For my use case I don’t need fancy hostnames, so I just hardcoded it to “caged”.
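To convince myself that the new hostname stays inside the container, a small experiment like the following can help. It is only a sketch: the helper names are made up, it only separates the UTS namespace, and creating that namespace normally needs root. Without the privilege, clone fails and the helper returns -1.

```cpp
#include <sched.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: returns the hostname as seen by the calling process.
std::string current_hostname()
{
    char buf[256] = {0};
    if (gethostname(buf, sizeof(buf) - 1) == -1)
    {
        return {};
    }
    return buf;
}

// Runs inside the child: rename the machine, then check that the child
// really sees the new name.
int rename_and_check(void *arg)
{
    const char *name = static_cast<const char *>(arg);
    if (sethostname(name, std::strlen(name)) == -1)
    {
        return 1;
    }
    return current_hostname() == name ? 0 : 2;
}

// Hypothetical helper: starts a child in its own UTS namespace and lets it
// change the hostname there. Returns the child's exit code (0 on success),
// or -1 if the child could not be created.
int rename_in_new_uts_namespace(const char *name)
{
    constexpr std::size_t kStackSize = 1024 * 1024;
    void *stack = mmap(nullptr, kStackSize, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED)
    {
        return -1;
    }
    char *stack_top = static_cast<char *>(stack) + kStackSize;
    const pid_t pid = clone(rename_and_check, stack_top, CLONE_NEWUTS | SIGCHLD,
                            const_cast<char *>(name));
    int result = -1;
    int status = 0;
    if (pid != -1 && waitpid(pid, &status, 0) == pid && WIFEXITED(status))
    {
        result = WEXITSTATUS(status);
    }
    munmap(stack, kStackSize);
    return result;
}
```

Whatever rename_in_new_uts_namespace("caged") returns, current_hostname() in the parent should give the same answer before and after the call, because the change only happens in the child’s UTS namespace.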

After setting the hostname, the next step is separating the user namespace. The idea of the user namespace is to give each process a map from fake users to real users. For example, the untrusted process C could see a user K that doesn’t actually exist in the system. It is just an alias of an actual user. With this, you can even fool process C into believing that it has root privilege (uid 0) while in reality it only runs as a normal user.

To set up this mapping between fake users and real users, you need to write to 2 files: /proc/process_id_of_B/uid_map and /proc/process_id_of_B/gid_map.

The problem is, process B doesn’t know its real process id. If you print the pid within process B, it thinks it has pid 1 and that it was launched by the system (pid 0). Only in the parent process A can we obtain the real process id of B.
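The two views of the pid can be compared with a small experiment. This is a sketch with made-up names (PidViews, compare_pid_views). It only separates the pid namespace, and creating it normally needs root, so on failure both fields are set to -1.

```cpp
#include <sched.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstddef>

// Hypothetical result type: the same child as seen from outside and inside.
struct PidViews
{
    pid_t seen_by_parent; // value returned by clone() in process A
    pid_t seen_by_child;  // value returned by getpid() in process B
};

// Runs inside the child: report our own pid through the exit code.
// An exit code only carries 8 bits, which is enough for the expected value.
int report_own_pid(void *)
{
    return static_cast<int>(getpid());
}

// Hypothetical helper: starts a child in a new pid namespace and collects
// both views of its pid. Both fields are -1 if the child could not be run.
PidViews compare_pid_views()
{
    PidViews views{-1, -1};
    constexpr std::size_t kStackSize = 1024 * 1024;
    void *stack = mmap(nullptr, kStackSize, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED)
    {
        return views;
    }
    char *stack_top = static_cast<char *>(stack) + kStackSize;
    const pid_t pid = clone(report_own_pid, stack_top, CLONE_NEWPID | SIGCHLD, nullptr);
    int status = 0;
    if (pid != -1 && waitpid(pid, &status, 0) == pid && WIFEXITED(status))
    {
        views = {pid, WEXITSTATUS(status)};
    }
    munmap(stack, kStackSize);
    return views;
}
```

When the call succeeds, seen_by_child should be 1 while seen_by_parent is an ordinary pid on the host. That host pid is the one needed to build the /proc paths above.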

Second, in this setup the above 2 files are modified by the parent process A. Therefore, at this point, process B needs to wait for process A to set the uid_map and gid_map files first. This is why the original author created a pair of sockets. Their purpose is to synchronize the two processes: process B waits for process A to perform the file changes before it continues.

int sockets[2]{0};
if (socketpair(AF_LOCAL, SOCK_SEQPACKET, 0, sockets) == -1)
{
    std::cerr << "Failed to Create Socket Pair to Communicate With the Child Process: " << errno << std::endl;
    return 1;
}

The socket pair represents 2 endpoints of a communication channel. One endpoint is kept by the process A, and the other one is given to the process B.

One weird thing is that once process B starts, both A and B need to close the socket endpoint (file descriptor) that is assigned to the other party. A needs to close the socket file descriptor that was given to B, and similarly B needs to close the file descriptor kept by A. After these steps, each of them communicates through its one remaining open file descriptor.

I didn’t understand the reason for closing the descriptor at first. As far as I can tell, it is because B starts with copies of both descriptors. If a process keeps the other party’s end open, that end never looks closed, so a read would keep waiting even after the other process is gone instead of reporting end-of-file.
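Here is a small self-contained sketch of the pattern: create the pair, start the child, close the other party’s end on each side, then exchange one message. To keep it short it uses fork instead of clone, and the function name round_trip is made up.

```cpp
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical demo: sends value to a child process over a socket pair and
// returns the child's reply (value + 1), or -1 if the exchange failed.
int round_trip(int value)
{
    const ssize_t kSize = sizeof(int);
    int sockets[2] = {-1, -1};
    if (socketpair(AF_LOCAL, SOCK_SEQPACKET, 0, sockets) == -1)
    {
        return -1;
    }
    const pid_t pid = fork();
    if (pid == -1)
    {
        close(sockets[0]);
        close(sockets[1]);
        return -1;
    }
    if (pid == 0)
    {
        // Child (process B): keep sockets[1], close the end that belongs to A.
        close(sockets[0]);
        int received = 0;
        if (read(sockets[1], &received, sizeof(received)) != kSize)
        {
            _exit(1);
        }
        const int reply = received + 1;
        if (write(sockets[1], &reply, sizeof(reply)) != kSize)
        {
            _exit(1);
        }
        close(sockets[1]);
        _exit(0);
    }
    // Parent (process A): keep sockets[0], close the end that belongs to B.
    close(sockets[1]);
    int reply = -1;
    if (write(sockets[0], &value, sizeof(value)) != kSize ||
        read(sockets[0], &reply, sizeof(reply)) != kSize)
    {
        reply = -1;
    }
    close(sockets[0]);
    waitpid(pid, nullptr, 0);
    return reply;
}
```

In the container, the message from B would be "I am ready for my uid_map" and the reply from A would be "done", but the plumbing is the same.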

Once the communication channel is established, the first step is letting process B unshare its user namespace from that of the parent process A. Then B needs to signal A to modify the above 2 files.

int has_userns = !unshare(CLONE_NEWUSER);

Now this is something I don’t understand, because CLONE_NEWUSER is also a flag for the clone function. Why can’t we simply launch process B with the CLONE_NEWUSER flag? Why do we have to unshare here within B? I actually tried calling clone with CLONE_NEWUSER and skipping the unshare, and things failed. Based on logging, the socket pair didn’t seem to work if I launched B this way, but I didn’t spend time debugging.
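Note that the line above does not treat a failed unshare as fatal; it only records whether it worked. A quick way to check whether user namespaces can be created at all on a given machine is to try the same call in a throwaway child. This is a sketch with a made-up function name, and it uses fork so that the calling process keeps its own namespace.

```cpp
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical probe: returns true if a child process managed to move itself
// into a new user namespace, false otherwise.
bool can_unshare_user_namespace()
{
    const pid_t pid = fork();
    if (pid == -1)
    {
        return false;
    }
    if (pid == 0)
    {
        // Same expression as in the post: unshare() returns 0 on success,
        // so has_userns is 1 when the new user namespace was created.
        const int has_userns = !unshare(CLONE_NEWUSER);
        _exit(has_userns ? 0 : 1);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
    {
        return false;
    }
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

Some systems disable user namespaces for unprivileged users, so a false result here is not necessarily a bug in the container code.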

Next, process A modifies the 2 files, since it knows the pid of process B. The content of each file is simple:

cat /proc/pid_of_process_b/uid_map
0 1001 1024

It contains 3 numbers. The first is the first uid inside the container, here 0 (root). The second is the actual uid behind the scenes, outside the container, here 1001. The last is how many consecutive uids the mapping covers. With this line, uids 0 to 1023 inside the container correspond to uids 1001 to 2024 outside. So you can assign the root uid to a process within the container, but it’s not the real root; it is uid 1001 outside the container environment.
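The arithmetic behind those 3 numbers can be sketched in a few lines. The struct and function names below are made up for illustration. write_id_map shows how process A could write the line, assuming it has the permission to do so.

```cpp
#include <sys/types.h>
#include <fstream>
#include <string>

// Hypothetical type: one line of a uid_map / gid_map file.
struct IdMap
{
    unsigned inside_start;  // first id as seen inside the container
    unsigned outside_start; // real id it corresponds to outside
    unsigned count;         // how many consecutive ids are mapped
};

// Builds the text of the line, e.g. "0 1001 1024\n".
std::string format_id_map(const IdMap &m)
{
    return std::to_string(m.inside_start) + " " +
           std::to_string(m.outside_start) + " " +
           std::to_string(m.count) + "\n";
}

// Translates an id inside the container to the real id outside,
// or returns -1 if the id is not covered by the mapping.
long to_outside_id(const IdMap &m, unsigned inside_id)
{
    if (inside_id < m.inside_start || inside_id - m.inside_start >= m.count)
    {
        return -1;
    }
    return static_cast<long>(m.outside_start) + (inside_id - m.inside_start);
}

// Run by process A: writes the mapping into /proc/<pid>/<file>, where file
// is "uid_map" or "gid_map" and pid is the real pid of process B.
bool write_id_map(pid_t pid, const std::string &file, const IdMap &m)
{
    std::ofstream out("/proc/" + std::to_string(pid) + "/" + file);
    out << format_id_map(m);
    out.flush();
    return static_cast<bool>(out);
}
```

With the map {0, 1001, 1024}, to_outside_id gives 1001 for uid 0 and 1257 for uid 256, and it rejects uid 1024 because only ids 0 to 1023 are covered.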

Once the files are updated, the next step is notifying process B so that it can assign itself a uid and a gid.

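gid_t gid = 256;
uid_t uid = 256;
// Set the groups first: after the uid changes, the process may no longer be allowed to change its groups.
if (setgroups(1, &gid) == -1 ||
    setresgid(gid, gid, gid) == -1 ||
    setresuid(uid, uid, uid) == -1)
{
    std::cerr << "Unable to set uid / gid: " << errno << std::endl;
    return -1;
}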

Here I just hardcode the uid and gid to 256 (a number within the [0, 1024) range defined in the 2 files). Now if you check your uid and gid in the shell script:

id -u
id -g

You will see 256 being printed. So far we have successfully changed our uid and gid to 256. Now we need to change the container’s filesystem, so that the untrusted program C can’t change any file that could damage the host system.

Now this step is a bit more complex than I initially expected. I was assuming that there was a single function like bool useThisFolderAsTheNewRootFS(const char *folderName); that would let me do this. But in reality, this task requires a few steps, and the main step is pivot_root.

The idea of pivot_root is that instead of directly specifying a new folder as the new root, you perform a swap between the current root and the new location you specified. After the swap, the folder you specified will be at /, whereas the old root will appear at another location you specified. There is a requirement for that location: it has to be within the new root folder. To summarize the steps:

1. mount --bind the new root to /tmp/new_root
2. mkdir /tmp/new_root/old_root
3. cd /tmp/new_root
4. pivot_root . old_root

pivot_root is not only a system call, it is also a command, so you can actually try the above in a shell. I didn’t understand why we can’t simply appoint a new folder as the new root. The explanation I found is in this email from Linus:

‘/’ is special exactly the same way ‘.’ is: one is shorthand for “current
process’ root”, and the other is shorthand for “current process’ cwd”.
So if you mount over ‘/’, it won’t actually do what you think it does:
because when you open “/”, it will continue to open the _old_ “/”. Exactly
the same way that mounting over somebody’s cwd won’t do what you think it
does — because the root and the cwd have been looked-up earlier and are
cached with the process.
This is why we have “pivot_root()” and “chroot()”, which can both be used
to do what you want to do. You mount the new root somewhere else, and then
you chroot (or pivot-root) to it. And THEN you do ‘chdir(“/”)’ to move the
cwd into the new root too (and only at that point have you “lost” the old
root — although you can actually get it back if you have some file
descriptor open to it).

And this Quora question:

Another way to look at it is to say that you want to unmount the root file system. Now, obviously you can’t do that directly, because it’s “still in use”. Normally, you would first get every process in the system to stop using an FS and then unmount it. But you can’t stop using the root FS.

So the actual code to swap the rootfs follows exactly the same steps as the above commands. It is worth noting that a C library function called pivot_root doesn’t actually exist. You have to wrap the system call yourself.
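A sketch of what that could look like is below. The function names are made up, and the code follows the 4 steps listed earlier plus the chdir("/") from Linus’s email. One step is my own assumption and not part of the steps above: marking the mounts as private first, because pivot_root refuses to work when the mounts involved are shared. All of this needs to run inside the container’s own mount namespace with enough privileges.

```cpp
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cerrno>
#include <string>

// Thin wrapper around the raw system call (hypothetical name).
int do_pivot_root(const char *new_root, const char *put_old)
{
    return static_cast<int>(syscall(SYS_pivot_root, new_root, put_old));
}

// Hypothetical helper: makes rootfs the new / of the calling process.
// rootfs is the folder with the unpacked root filesystem, mount_point is an
// existing empty folder such as /tmp/new_root.
bool enter_new_root(const std::string &rootfs, const std::string &mount_point)
{
    // Assumption (not in the steps above): stop mount changes from
    // propagating, otherwise pivot_root fails on systems with shared mounts.
    if (mount(nullptr, "/", nullptr, MS_REC | MS_PRIVATE, nullptr) == -1)
    {
        return false;
    }
    // 1. mount --bind the new root to the mount point
    if (mount(rootfs.c_str(), mount_point.c_str(), nullptr, MS_BIND, nullptr) == -1)
    {
        return false;
    }
    // 2. mkdir old_root inside the new root
    const std::string old_root = mount_point + "/old_root";
    if (mkdir(old_root.c_str(), 0700) == -1 && errno != EEXIST)
    {
        return false;
    }
    // 3. cd into the new root
    if (chdir(mount_point.c_str()) == -1)
    {
        return false;
    }
    // 4. pivot_root . old_root
    if (do_pivot_root(".", "old_root") == -1)
    {
        return false;
    }
    // Finally move the cwd into the new root, as the email describes.
    return chdir("/") == 0;
}
```

After a successful call, the old root is still reachable under /old_root. A real container would also need to get rid of it, which this sketch does not attempt.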

I also want to talk about where to find a new root filesystem to use in the container. rabbitc suggests using Alpine Linux. There also seems to be a busybox rootfs, but I’m very confused by busybox. The official site seems to say it’s a wrapper of common Linux command tools, and the downloads seem to be only single binaries. I don’t know exactly what its relationship with a minimal rootfs is. As an Ubuntu user, I happened to find the Ubuntu base image, which is the minimal base you can use to build an Ubuntu release. The size of the image is only 27 MB when zipped. I used it with my container. You just need to download it, unzip it into a folder and, when running the container, appoint that folder as the new root.

To be continued …