Containers From Scratch with Golang

Whether you’re a developer, an operator, or a DevOps engineer, chances are you’ve at least heard of Docker, a handy tool for packing, shipping, and running applications within “containers”.
But what is a container, really?

A Linux container is a set of one or more processes that are isolated from the rest of the system.

The Linux kernel provides two mechanisms for this. Cgroups limit and prioritize resources (CPU, memory, block I/O, network, etc.) without the need to start any virtual machines. Namespaces isolate an application’s view of the operating environment, including process trees, networking, user IDs, and mounted file systems.

That sounds a little confusing, so let's get our hands dirty and create a container from scratch in Go to better understand the mechanisms used under the hood.

First, let’s run a container from the alpine image:

As mentioned before, a container is just an isolated process running on the host, so how does it have a different hostname from the host?
This is where the UNIX Time-Sharing (UTS) namespace comes in.

UTS: The UTS namespace gives processes their own view of the system’s hostname and domain name.
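
Before building anything, it can help to see which UTS namespace a process currently lives in. The sketch below is only an illustration: it assumes a Linux host where /proc/self/ns/uts is a symlink with a target of the form "uts:[inode]", and the helper name parseNsLink is made up for this example. Two processes that print the same inode share a hostname; a process in a new UTS namespace prints a different one.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseNsLink (hypothetical helper) splits a namespace link target such as
// "uts:[4026531838]" into its type ("uts") and inode number (4026531838).
func parseNsLink(target string) (string, uint64, error) {
	kind, rest, ok := strings.Cut(target, ":[")
	if !ok || !strings.HasSuffix(rest, "]") {
		return "", 0, fmt.Errorf("unexpected namespace link %q", target)
	}
	inode, err := strconv.ParseUint(strings.TrimSuffix(rest, "]"), 10, 64)
	if err != nil {
		return "", 0, err
	}
	return kind, inode, nil
}

func main() {
	host, err := os.Hostname()
	if err != nil {
		fmt.Println("hostname:", err)
		return
	}
	// Assumption: Linux, where this symlink names the caller's UTS namespace.
	target, err := os.Readlink("/proc/self/ns/uts")
	if err != nil {
		fmt.Println("readlink:", err)
		return
	}
	kind, inode, err := parseNsLink(target)
	if err != nil {
		fmt.Println("parse:", err)
		return
	}
	fmt.Printf("hostname=%s namespace=%s inode=%d\n", host, kind, inode)
}
```

Running it once on the host and once inside the containerized shell we build below should show two different inode numbers.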

Let’s create our own container from scratch that has a different hostname from the host.

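package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	// Guard against missing arguments before indexing os.Args.
	if len(os.Args) < 3 {
		panic("usage: container run <command> [args...]")
	}
	switch os.Args[1] {
	case "run":
		run()
	default:
		panic("pass me an argument please")
	}
}

// run starts the requested command in a new UTS namespace.
func run() {
	fmt.Printf("Running %v\n", os.Args[2:])
	cmd := exec.Command(os.Args[2], os.Args[3:]...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS,
	}
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "run:", err)
	}
}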

Let’s run it:

As expected, the program takes its first argument, which is run in “container.go run /bin/bash”, and then executes the arbitrary command that follows it, “/bin/bash”.
The interesting part here is “syscall.CLONE_NEWUTS”, which creates a UTS namespace for our process (/bin/bash here). But as you can see, our container still has the same hostname as the host! That is because a new UTS namespace starts out with a copy of the hostname of the namespace it was created from.
You might think we are not in a container at all, but looking at the ps output we see PID 10172, which is our new containerized shell.
Let’s try changing the hostname in our containerized shell:

We were able to change the hostname inside our containerized shell.
Now let’s try setting the hostname before the shell starts, so the shell sees it right away.

syscall.Sethostname([]byte("inside-container"))

But we cannot put it after cmd.Run(), because that code only executes once cmd.Run() returns, that is, after the shell has exited. We cannot put it before cmd.Run() either: although we requested a UTS namespace for our process, the namespace is only created by the clone(2) syscall, which is invoked by cmd.Run(). Calling Sethostname any earlier would rename the host itself.

Let’s tweak our code to handle this:

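package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	// Guard against missing arguments before indexing os.Args.
	if len(os.Args) < 3 {
		panic("usage: container run <command> [args...]")
	}
	switch os.Args[1] {
	case "run":
		run()
	case "ns":
		ns()
	default:
		panic("pass me an argument please")
	}
}

// run re-executes this binary with the "ns" subcommand in a new UTS namespace.
func run() {
	fmt.Printf("Running %v\n", os.Args[2:])
	cmd := exec.Command("/proc/self/exe", append([]string{"ns"}, os.Args[2:]...)...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS,
	}
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "run:", err)
	}
}

// ns runs inside the new namespace: it sets the hostname, then starts the command.
func ns() {
	fmt.Printf("Running in new UTS namespace %v\n", os.Args[2:])

	if err := syscall.Sethostname([]byte("inside-container")); err != nil {
		panic(err)
	}
	cmd := exec.Command(os.Args[2], os.Args[3:]...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "ns:", err)
	}
}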

Let’s run it:

Wow, that worked.
So what is “/proc/self/exe” in the code above?

/proc/self

This directory refers to the process accessing the /proc filesystem, and is identical to the /proc directory named by the process ID of the same process.

When a process accesses this magic symbolic link, it resolves to the process’s own /proc/[pid] directory.

“/proc/self/exe” is a symbolic link to the executable of the process that accesses it, so executing it re-invokes the current program.
Put simply, “/proc/self/exe” in the code above runs the same binary that “go run container.go” built, but with new arguments, which is equivalent to
## go run container.go ns /bin/bash
in its new namespace.
Up to this point we have only created a new UTS namespace. Now the ns() function sets the hostname inside the newly created UTS namespace and executes our arbitrary command, /bin/bash.
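
To make the re-exec trick a bit more concrete, here is a minimal sketch that only prints what would happen instead of doing it. It assumes a Linux host; the helper name childArgs is invented for this example and is not part of the program above. When started with go run, the link usually points at a temporary binary built by the go tool rather than at container.go itself.

```go
package main

import (
	"fmt"
	"os"
)

// childArgs (hypothetical helper) builds the argument list for the
// re-executed process: the marker "ns" followed by the user's command.
func childArgs(userArgs []string) []string {
	return append([]string{"ns"}, userArgs...)
}

func main() {
	// Assumption: Linux, where this symlink points at the binary of the caller.
	exe, err := os.Readlink("/proc/self/exe")
	if err != nil {
		fmt.Println("readlink:", err)
		return
	}
	fmt.Println("this process was started from:", exe)
	fmt.Println("a re-exec would pass:", childArgs([]string{"/bin/bash"}))
}
```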

What about PIDs in the container?
Let’s take a look at the ps output in our container:

We can still see those high-numbered PIDs in our container. As you probably guessed, we can handle that using PID namespaces.

PID: The PID namespace gives a process and its children their own view of a subset of the processes in the system.
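
One way to observe this "own view" is the NSpid line in /proc/<pid>/status, which lists a process's PID in each nested PID namespace it belongs to, outermost first. The sketch below is a small illustration, assuming a Linux kernel recent enough to expose NSpid (4.1 or later); parseNSpid is a name made up for this example.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseNSpid (hypothetical helper) extracts the PIDs from the "NSpid:" line
// of a /proc/<pid>/status file. The first value belongs to the outermost
// namespace visible through this procfs, the last to the innermost one.
func parseNSpid(status string) []int {
	for _, line := range strings.Split(status, "\n") {
		if !strings.HasPrefix(line, "NSpid:") {
			continue
		}
		var pids []int
		for _, field := range strings.Fields(strings.TrimPrefix(line, "NSpid:")) {
			n, err := strconv.Atoi(field)
			if err != nil {
				return nil
			}
			pids = append(pids, n)
		}
		return pids
	}
	return nil
}

func main() {
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		fmt.Println("read status:", err)
		return
	}
	fmt.Println("os.Getpid():", os.Getpid())
	fmt.Println("NSpid chain:", parseNSpid(string(data)))
}
```

On the host the chain has a single entry. For a process inside a nested PID namespace, reading its status file from the host shows two entries: the host PID and the PID inside the container.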

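package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	// Guard against missing arguments before indexing os.Args.
	if len(os.Args) < 3 {
		panic("usage: container run <command> [args...]")
	}
	switch os.Args[1] {
	case "run":
		run()
	case "ns":
		ns()
	default:
		panic("pass me an argument please")
	}
}

// run re-executes this binary as "ns" in new UTS and PID namespaces.
func run() {
	fmt.Printf("Running %v as %d\n", os.Args[2:], os.Getpid())
	cmd := exec.Command("/proc/self/exe", append([]string{"ns"}, os.Args[2:]...)...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID,
	}
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "run:", err)
	}
}

// ns is the first process in the new PID namespace, so os.Getpid() reports 1.
func ns() {
	fmt.Printf("Running in new UTS namespace %v as %d\n", os.Args[2:], os.Getpid())

	if err := syscall.Sethostname([]byte("inside-container")); err != nil {
		panic(err)
	}
	cmd := exec.Command(os.Args[2], os.Args[3:]...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "ns:", err)
	}
}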

Let’s run it:

The /bin/bash process now gets its own PID namespace.

Now let’s execute ps inside our container:

Well, we didn’t expect that result. It happens because ps reads the /proc directory, which still contains information about all processes running on the host.
Our container therefore needs its own /proc directory.
This is where “chroot” comes in: with “chroot” we can change the root directory for a process.

We’re going to use the alpine image filesystem for our container:

Now we can chroot our container to /root/containerFS/.

Let’s tweak our code:

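package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	// Guard against missing arguments before indexing os.Args.
	if len(os.Args) < 3 {
		panic("usage: container run <command> [args...]")
	}
	switch os.Args[1] {
	case "run":
		run()
	case "ns":
		ns()
	default:
		panic("pass me an argument please")
	}
}

// run re-executes this binary as "ns" in new UTS and PID namespaces.
func run() {
	fmt.Printf("Running %v as %d\n", os.Args[2:], os.Getpid())
	cmd := exec.Command("/proc/self/exe", append([]string{"ns"}, os.Args[2:]...)...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID,
	}
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "run:", err)
	}
}

// ns sets the hostname, switches the root directory, then starts the command.
func ns() {
	fmt.Printf("Running in new UTS namespace %v as %d\n", os.Args[2:], os.Getpid())

	must(syscall.Sethostname([]byte("inside-container")))
	must(syscall.Chroot("/root/containerFS"))
	must(syscall.Chdir("/")) // set the working directory inside container
	cmd := exec.Command(os.Args[2], os.Args[3:]...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "ns:", err)
	}
}

// must aborts on a setup error instead of silently continuing.
func must(err error) {
	if err != nil {
		panic(err)
	}
}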

Let’s run it:

Our containerized shell did not execute! But why?
Because /bin/bash is not present in the alpine image. Since we changed our container’s root to /root/containerFS, the program looks for /bin/bash in the new filesystem, which is /root/containerFS/bin/bash from the host’s perspective.
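
The lookup described above can be checked from the host before starting the container. The following is a rough sketch under the assumption that the image was unpacked to /root/containerFS as in this article; hostPath is a helper name invented for the example. It uses Lstat rather than Stat because entries such as /bin/sh may be symlinks with absolute targets, which would otherwise be resolved against the host's root instead of the container's.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// hostPath (hypothetical helper) returns where a path inside the container
// lives on the host, given the directory that becomes the container's root.
func hostPath(root, inside string) string {
	return filepath.Join(root, inside)
}

func main() {
	root := "/root/containerFS" // the root directory used in this article
	for _, shell := range []string{"/bin/bash", "/bin/sh"} {
		p := hostPath(root, shell)
		// Lstat checks the entry itself without following symlinks.
		if _, err := os.Lstat(p); err != nil {
			fmt.Printf("%s -> %s: not available (%v)\n", shell, p, err)
			continue
		}
		fmt.Printf("%s -> %s: found\n", shell, p)
	}
}
```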

Let’s run /bin/sh:

We have successfully changed our container filesystem.
Now let’s try ps inside the container again:

Well, nothing! That’s because /proc is a pseudo filesystem, a mechanism for sharing information between user space and kernel space, and nothing is mounted at /proc inside our new root yet.
We should mount a proc filesystem inside the container.
This also makes it clearer how containers share the host kernel.

Let’s tweak the code again:

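package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	// Guard against missing arguments before indexing os.Args.
	if len(os.Args) < 3 {
		panic("usage: container run <command> [args...]")
	}
	switch os.Args[1] {
	case "run":
		run()
	case "ns":
		ns()
	default:
		panic("pass me an argument please")
	}
}

// run re-executes this binary as "ns" in new UTS and PID namespaces.
func run() {
	fmt.Printf("Running %v as %d\n", os.Args[2:], os.Getpid())
	cmd := exec.Command("/proc/self/exe", append([]string{"ns"}, os.Args[2:]...)...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID,
	}
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "run:", err)
	}
}

// ns prepares the container (hostname, root, /proc), then starts the command.
func ns() {
	fmt.Printf("Running in new UTS namespace %v as %d\n", os.Args[2:], os.Getpid())

	must(syscall.Sethostname([]byte("inside-container")))
	must(syscall.Chroot("/root/containerFS"))
	must(syscall.Chdir("/")) // set the working directory inside container
	// The target "proc" is relative to the new working directory, i.e. /proc.
	must(syscall.Mount("proc", "proc", "proc", 0, ""))
	cmd := exec.Command(os.Args[2], os.Args[3:]...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "ns:", err)
	}

	must(syscall.Unmount("/proc", 0))
}

// must aborts on a setup error instead of silently continuing.
func must(err error) {
	if err != nil {
		panic(err)
	}
}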

We added syscall.Mount("proc", "proc", "proc", 0, "") to mount /proc into the container filesystem, and we unmount it with syscall.Unmount("/proc", 0) after the process exits.

Let’s run ps again:

That worked :)

Let’s take a look at the mounts on the host:

By default, the host can see every mount made by the container, but we can create a new mount namespace and unshare it so the container’s mounts are no longer visible to the host.

Let’s do it:

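package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	// Guard against missing arguments before indexing os.Args.
	if len(os.Args) < 3 {
		panic("usage: container run <command> [args...]")
	}
	switch os.Args[1] {
	case "run":
		run()
	case "ns":
		ns()
	default:
		panic("pass me an argument please")
	}
}

// run re-executes this binary as "ns" in new UTS, PID, and mount namespaces.
func run() {
	fmt.Printf("Running %v as %d\n", os.Args[2:], os.Getpid())
	cmd := exec.Command("/proc/self/exe", append([]string{"ns"}, os.Args[2:]...)...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags:   syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS,
		Unshareflags: syscall.CLONE_NEWNS,
	}
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "run:", err)
	}
}

// ns prepares the container (hostname, root, /proc), then starts the command.
func ns() {
	fmt.Printf("Running in new UTS namespace %v as %d\n", os.Args[2:], os.Getpid())

	must(syscall.Sethostname([]byte("inside-container")))
	must(syscall.Chroot("/root/containerFS"))
	must(syscall.Chdir("/")) // set the working directory inside container
	// The target "proc" is relative to the new working directory, i.e. /proc.
	must(syscall.Mount("proc", "proc", "proc", 0, ""))
	cmd := exec.Command(os.Args[2], os.Args[3:]...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "ns:", err)
	}

	must(syscall.Unmount("/proc", 0))
}

// must aborts on a setup error instead of silently continuing.
func must(err error) {
	if err != nil {
		panic(err)
	}
}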

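We created a new mount namespace using “syscall.CLONE_NEWNS” and unshared it using “Unshareflags: syscall.CLONE_NEWNS”. In Go, putting CLONE_NEWNS in Unshareflags also marks the root mount as private in the child, so mounts made there do not propagate back to the host.
By the way, NEWNS, which stands for “new namespace”, is actually the mount namespace; it was the first namespace type added to Linux.
We can still see the mounts of processes inside the container by looking at /proc/{pid}/mounts, but at least they no longer clutter the host’s mount table.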
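
To check this from code rather than by eye, one could parse a mounts file and look for the proc entry. The sketch below is an illustration only: hasMount is a name invented here, and it assumes the usual whitespace-separated layout of /proc/<pid>/mounts (device, mount point, filesystem type, options, and two numeric fields).

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// hasMount (hypothetical helper) reports whether a mounts table contains an
// entry of the given filesystem type mounted at target.
func hasMount(mounts, target, fstype string) bool {
	for _, line := range strings.Split(mounts, "\n") {
		fields := strings.Fields(line)
		// fields[1] is the mount point, fields[2] the filesystem type.
		if len(fields) >= 3 && fields[1] == target && fields[2] == fstype {
			return true
		}
	}
	return false
}

func main() {
	// Replace "self" with a container process's PID to inspect its view.
	data, err := os.ReadFile("/proc/self/mounts")
	if err != nil {
		fmt.Println("read mounts:", err)
		return
	}
	fmt.Println("proc mounted at /proc:", hasMount(string(data), "/proc", "proc"))
}
```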

Other namespaces, such as:
network namespace
user namespace
IPC namespace
can be applied the same way we did above.
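
As a quick reference, here is a sketch that maps clone flags to the namespace each one creates. The numeric values are copied from the kernel's linux/sched.h and match the syscall.CLONE_NEW* constants on Linux; they are redefined locally only to keep the example self-contained. The describe function and the short names it prints are made up for this illustration. Keep in mind that a user namespace usually also needs UID and GID mappings before it is useful, which is not shown here.

```go
package main

import (
	"fmt"
	"strings"
)

// Values from <linux/sched.h>; on Linux they equal syscall.CLONE_NEWNS,
// syscall.CLONE_NEWUTS, and so on.
const (
	cloneNewNS   = 0x00020000 // mount namespace
	cloneNewUTS  = 0x04000000 // hostname and domain name
	cloneNewIPC  = 0x08000000 // System V IPC, POSIX message queues
	cloneNewUser = 0x10000000 // user and group IDs
	cloneNewPID  = 0x20000000 // process IDs
	cloneNewNet  = 0x40000000 // network stack
)

var namespaceFlags = []struct {
	flag uintptr
	name string
}{
	{cloneNewNS, "mount"},
	{cloneNewUTS, "uts"},
	{cloneNewIPC, "ipc"},
	{cloneNewUser, "user"},
	{cloneNewPID, "pid"},
	{cloneNewNet, "net"},
}

// describe (hypothetical helper) lists the namespaces a flag mask requests.
func describe(flags uintptr) string {
	var out []string
	for _, n := range namespaceFlags {
		if flags&n.flag != 0 {
			out = append(out, n.name)
		}
	}
	return strings.Join(out, "|")
}

func main() {
	// The combination used so far in this article.
	fmt.Println(describe(cloneNewUTS | cloneNewPID | cloneNewNS))
	// Adding network and IPC isolation is a matter of OR-ing in more flags.
	fmt.Println(describe(cloneNewUTS | cloneNewPID | cloneNewNS | cloneNewNet | cloneNewIPC))
}
```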

CGROUPS
Cgroups are exposed through another pseudo filesystem interface: it looks like directories and files, but we use it to exchange properties between user space and kernel space.
Put simply, cgroups can limit the resources a container uses.

Let’s explore it:

As you can see, there is a directory for each type of cgroup.
Let’s take a look at memory.limit_in_bytes in the “memory” directory:

This number means there is no memory limit for host processes.
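
Reading that file from Go could look like the sketch below. It rests on a few assumptions that are not from the article: the cgroup v1 path /sys/fs/cgroup/memory/memory.limit_in_bytes exists on your system, and "no limit" shows up as a very large number (commonly 9223372036854771712 on 64-bit systems with 4 KiB pages). The threshold of 2^62 bytes and the name parseLimit are arbitrary choices made for this example.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// unlimitedThreshold is an assumption of this sketch: any limit at or above
// 2^62 bytes is treated as "no limit".
const unlimitedThreshold uint64 = 1 << 62

// parseLimit (hypothetical helper) parses the content of a cgroup v1
// memory.limit_in_bytes file and reports whether it is effectively unlimited.
func parseLimit(raw string) (limit uint64, unlimited bool, err error) {
	limit, err = strconv.ParseUint(strings.TrimSpace(raw), 10, 64)
	if err != nil {
		return 0, false, err
	}
	return limit, limit >= unlimitedThreshold, nil
}

func main() {
	// Assumption: cgroup v1 layout; on cgroup v2 hosts this file is absent.
	data, err := os.ReadFile("/sys/fs/cgroup/memory/memory.limit_in_bytes")
	if err != nil {
		fmt.Println("read limit:", err)
		return
	}
	limit, unlimited, err := parseLimit(string(data))
	if err != nil {
		fmt.Println("parse limit:", err)
		return
	}
	if unlimited {
		fmt.Println("no memory limit")
		return
	}
	fmt.Printf("memory limit: %d bytes\n", limit)
}
```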

There is a “docker” directory in /sys/fs/cgroup/memory/ which contains the memory cgroups for Docker containers.

Let’s take an example with no cgroup limitation applied:

Let’s take an example with cgroup limitation:

So now that we know how to limit container resources, let’s put it into code:

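package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"strconv"
	"syscall"
)

func main() {
	// Guard against missing arguments before indexing os.Args.
	if len(os.Args) < 3 {
		panic("usage: container run <command> [args...]")
	}
	switch os.Args[1] {
	case "run":
		run()
	case "ns":
		ns()
	default:
		panic("pass me an argument please")
	}
}

// run re-executes this binary as "ns" in new UTS, PID, and mount namespaces.
func run() {
	fmt.Printf("Running %v as %d\n", os.Args[2:], os.Getpid())
	cmd := exec.Command("/proc/self/exe", append([]string{"ns"}, os.Args[2:]...)...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags:   syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS,
		Unshareflags: syscall.CLONE_NEWNS,
	}
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "run:", err)
	}
}

// ns joins the cgroup, prepares the container, then starts the command.
func ns() {
	fmt.Printf("Running in new UTS namespace %v as %d\n", os.Args[2:], os.Getpid())

	cg()
	must(syscall.Sethostname([]byte("inside-container")))
	must(syscall.Chroot("/root/containerFS"))
	must(syscall.Chdir("/")) // set the working directory inside container
	must(syscall.Mount("proc", "proc", "proc", 0, ""))
	cmd := exec.Command(os.Args[2], os.Args[3:]...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "ns:", err)
	}

	must(syscall.Unmount("/proc", 0))
}

// cg creates a pids cgroup (cgroup v1 layout) and moves this process into it.
func cg() {
	cgroups := "/sys/fs/cgroup/"
	pids := filepath.Join(cgroups, "pids")
	if err := os.Mkdir(filepath.Join(pids, "ourContainer"), 0755); err != nil && !os.IsExist(err) {
		panic(err)
	}
	// Allow at most 10 tasks in this cgroup.
	must(os.WriteFile(filepath.Join(pids, "ourContainer/pids.max"), []byte("10"), 0700))

	// Ask the kernel to invoke the release agent, if one is configured, once the cgroup is empty.
	must(os.WriteFile(filepath.Join(pids, "ourContainer/notify_on_release"), []byte("1"), 0700))

	// Add this process to the cgroup; the children it starts inherit the membership.
	must(os.WriteFile(filepath.Join(pids, "ourContainer/cgroup.procs"), []byte(strconv.Itoa(os.Getpid())), 0700))
}

// must aborts on a setup error instead of silently continuing.
func must(err error) {
	if err != nil {
		panic(err)
	}
}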

Let’s run it:

Here are the pids cgroup files created by the program:

But what happened?
When we run “go run container.go run /bin/sh”, the command turns into “/proc/self/exe ns /bin/sh”, which leads us to the ns() function, which in turn executes cg().
In cg(), the re-invoked process writes its PID to the “/sys/fs/cgroup/pids/ourContainer/cgroup.procs” file and thereby subjects itself and its child processes to this cgroup.
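
To watch the limit while the container is running, a small reader for the pids controller files could look like this. It is a sketch, assuming the cgroup v1 path and the ourContainer name used above; parsePidsMax is a helper invented for the example. A pids.max file holds either a decimal number or the literal word max, which means no limit.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parsePidsMax (hypothetical helper) interprets the content of a pids.max
// file: either a decimal number or the literal "max" (no limit).
func parsePidsMax(raw string) (limit int, unlimited bool, err error) {
	s := strings.TrimSpace(raw)
	if s == "max" {
		return 0, true, nil
	}
	limit, err = strconv.Atoi(s)
	if err != nil {
		return 0, false, err
	}
	return limit, false, nil
}

func main() {
	dir := "/sys/fs/cgroup/pids/ourContainer" // path created by cg() in this article
	maxRaw, err := os.ReadFile(dir + "/pids.max")
	if err != nil {
		fmt.Println("read pids.max:", err)
		return
	}
	curRaw, err := os.ReadFile(dir + "/pids.current")
	if err != nil {
		fmt.Println("read pids.current:", err)
		return
	}
	limit, unlimited, err := parsePidsMax(string(maxRaw))
	if err != nil {
		fmt.Println("parse pids.max:", err)
		return
	}
	current := strings.TrimSpace(string(curRaw))
	if unlimited {
		fmt.Printf("%s tasks running, no limit\n", current)
		return
	}
	fmt.Printf("%s tasks running, limit %d\n", current, limit)
}
```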

I suggest you try spawning more than “ourContainer/pids.max” processes using a fork bomb and see what happens.
A fork bomb is a short Bash function, so testing it is left as an exercise for you ;))

Well, that’s it.
I hope it helps you understand containers more deeply.
Please let me know your thoughts in the comments section below.
Special thanks to Liz Rice for her amazing work.

Sources:
Container From Scratch
Namespaces with clone(2)
Namespaces
infoQ
