Containers From Scratch with Golang

Whether you’re a developer, an operator, or a DevOps engineer, chances are you’ve at least heard of Docker, a handy tool for packing, shipping, and running applications within “containers”.
But what is a container, really?

A Linux container is a set of one or more processes that are isolated from the rest of the system.

The Linux kernel provides two features that make this possible. Control groups (cgroups) limit and prioritize resources (CPU, memory, block I/O, network, etc.) without the need to start any virtual machines. Namespaces isolate an application’s view of the operating environment, including process trees, networking, user IDs, and mounted file systems.

That may sound a little abstract, so let’s get our hands dirty and create a container from scratch in Golang to better understand the mechanisms used under the hood.

First, let’s run a container from the alpine image:

root@host:~# docker run --rm -it alpine:latest /bin/sh
/ # hostname
be5c81c5e607

As mentioned before, a container is just an isolated process running on the host, so how can it have a different hostname from the host?
This is where the UTS (Unix Time-Sharing) namespace comes in.

UTS: The UTS namespace gives its processes their own view of the system’s hostname and domain name.
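On Linux, every process exposes the namespaces it belongs to as symbolic links under /proc/<pid>/ns/. The following is a minimal sketch, not part of the container program itself, that reads the UTS link of the current process; the helper name parseNsLink is made up for this illustration. Two processes that print the same number share a hostname; a process started with its own UTS namespace prints a different one.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseNsLink splits a namespace link target such as "uts:[4026531838]"
// into the namespace kind ("uts") and its inode number.
// (Hypothetical helper, named for this example only.)
func parseNsLink(target string) (string, uint64, error) {
	kind, rest, ok := strings.Cut(target, ":[")
	if !ok || !strings.HasSuffix(rest, "]") {
		return "", 0, fmt.Errorf("unexpected namespace link %q", target)
	}
	inode, err := strconv.ParseUint(strings.TrimSuffix(rest, "]"), 10, 64)
	if err != nil {
		return "", 0, err
	}
	return kind, inode, nil
}

func main() {
	// /proc/self/ns/uts identifies the UTS namespace of the calling process.
	target, err := os.Readlink("/proc/self/ns/uts")
	if err != nil {
		fmt.Println("cannot read namespace link:", err)
		return
	}
	kind, inode, err := parseNsLink(target)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("this process is in %s namespace %d\n", kind, inode)
}
```

Running it once on the host and once inside the containerized shell built below is a quick way to confirm that the two really live in different UTS namespaces.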

Let’s create our own container from scratch that has a different hostname from the host.

package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	switch os.Args[1] {
	case "run":
		run()
	default:
		panic("pass me an argument please")
	}
}

func run() {
	fmt.Printf("Running %v\n", os.Args[2:])
	cmd := exec.Command(os.Args[2], os.Args[3:]...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS,
	}
	cmd.Run()
}

Let’s run it:

root@host:~# go run container.go run /bin/bash
Running [/bin/bash]
root@host:~# ps
 PID TTY TIME CMD
 8564 pts/6 00:00:00 bash
10158 pts/6 00:00:00 go
10169 pts/6 00:00:00 container
10172 pts/6 00:00:00 bash
10192 pts/6 00:00:00 ps

As expected, the program reads its first argument (os.Args[1]), which is run in “container.go run /bin/bash”, and then executes the command that follows, “/bin/bash”.
The interesting part here is “syscall.CLONE_NEWUTS”, which creates a new UTS namespace for our process (/bin/bash here). But as you can see, our container still shows the same hostname as the host! That is because a new UTS namespace starts out with a copy of its parent’s hostname.
You might think we are not in the container at all, but the ps output shows PID 10172, which is our new containerized shell.
Let’s try changing the hostname in our containerized shell:

root@host:~# hostname container
root@host:~# hostname
container
root@host:~# exit
exit
root@host:~# hostname
host

We were able to change the hostname inside our containerized shell without affecting the host.
Now let’s try setting the hostname before the shell starts, so that the shell sees it from the beginning.

syscall.Sethostname([]byte("inside-container"))

But where do we put this call? Not after cmd.Run(), because that code only executes once cmd.Run() returns, that is, after the shell has exited. Not before cmd.Run() either: although we requested a UTS namespace for our process, the namespace is only created by the clone(2) syscall, which is invoked by cmd.Run(), so a call made earlier would change the host’s hostname instead.

Let’s tweak our code to handle this:

package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	switch os.Args[1] {
	case "run":
		run()
	case "ns":
		ns()
	default:
		panic("pass me an argument please")
	}
}

func run() {
	fmt.Printf("Running %v\n", os.Args[2:])
	cmd := exec.Command("/proc/self/exe", append([]string{"ns"}, os.Args[2:]...)...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS,
	}
	cmd.Run()
}

func ns() {
	fmt.Printf("Running in new UTS namespace %v\n", os.Args[2:])

	syscall.Sethostname([]byte("inside-container"))
	cmd := exec.Command(os.Args[2], os.Args[3:]...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	cmd.Run()
}

Let’s run it:

root@host:~# go run container.go run /bin/bash
Running [/bin/bash]
Running in new UTS namespace [/bin/bash]
root@inside-container:~#

Wow, that worked.
So what is “/proc/self/exe” in the code above?

/proc/self
This directory refers to the process accessing the /proc filesystem, and is identical to the /proc directory named by the process ID of the same process.
When a process accesses this magic symbolic link, it resolves to the process’s own /proc/[pid] directory.
root@host:~# ls -l /proc/self/exe
lrwxrwxrwx 1 root root 0 Apr 4 20:31 /proc/self/exe -> /bin/ls

“/proc/self/exe” is a symbolic link to the executable of the process that accesses it, so executing it runs the current program again.
Put simply, “/proc/self/exe” in the code above re-runs our compiled program, but with new arguments, as if we had typed
## go run container.go ns /bin/bash
in its new namespace.
Up to this point we have only created a new UTS namespace. The ns() function then sets the hostname in the newly created UTS namespace and executes our arbitrary command, which is /bin/bash.
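The re-exec trick can be looked at in isolation. The sketch below only prints what would be executed; it does not create any namespace. The helper name childArgs is an assumption made for this example, and the command passed to it is arbitrary.

```go
package main

import (
	"fmt"
	"os"
)

// childArgs builds the argument list for the re-executed program:
// the "ns" subcommand followed by the user's command and its arguments.
// (Hypothetical helper, named for this example only.)
func childArgs(userArgs []string) []string {
	return append([]string{"ns"}, userArgs...)
}

func main() {
	// On Linux, /proc/self/exe is a symlink to the running executable.
	self, err := os.Readlink("/proc/self/exe")
	if err != nil {
		fmt.Println("cannot resolve /proc/self/exe:", err)
		return
	}
	fmt.Println("current executable:", self)
	fmt.Println("re-exec would run:", self, childArgs([]string{"/bin/sh", "-c", "hostname"}))
}
```

When started with “go run”, the printed path points at a temporary binary built by the Go tool, which is also why ps shows a process named “exe” rather than “go” for the re-invoked program.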

What about PIDs in Container?
Let’s take a look at ps command output in our container:

root@inside-container:~# ps
 PID TTY TIME CMD
 8564 pts/6 00:00:00 bash
13409 pts/6 00:00:00 go
13420 pts/6 00:00:00 container
13423 pts/6 00:00:00 exe
13426 pts/6 00:00:00 bash
13436 pts/6 00:00:00 ps

We can still see those high-numbered PIDs in our container. As you probably guessed, we can handle that with the PID namespace.

PID: The PID namespace gives a process and its children their own view of a subset of the processes in the system.

package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	switch os.Args[1] {
	case "run":
		run()
	case "ns":
		ns()
	default:
		panic("pass me an argument please")
	}
}

func run() {
	fmt.Printf("Running %v as %d\n", os.Args[2:], os.Getpid())
	cmd := exec.Command("/proc/self/exe", append([]string{"ns"}, os.Args[2:]...)...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID,
	}
	cmd.Run()
}

func ns() {
	fmt.Printf("Running in new UTS namespace %v as %d\n", os.Args[2:], os.Getpid())

	syscall.Sethostname([]byte("inside-container"))
	cmd := exec.Command(os.Args[2], os.Args[3:]...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	cmd.Run()
}

Let’s run it:

root@host:~# go run container.go run /bin/bash
Running [/bin/bash] as 13769
Running in new UTS namespace [/bin/bash] as 1
root@inside-container:~#

The re-invoked process now runs as PID 1 in its own PID namespace, and /bin/bash is started inside it.
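The same process therefore has two PIDs: one as seen from the host and one inside the new namespace. Assuming a kernel recent enough to provide it (Linux 4.1 or later), the “NSpid:” line of /proc/<pid>/status lists both. Here is a small sketch that parses that line; parseNSpid is a name invented for this example.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseNSpid extracts the "NSpid:" line from the contents of
// /proc/<pid>/status. The line lists the PID of the process in each
// nested PID namespace, outermost namespace first.
// (Hypothetical helper, named for this example only.)
func parseNSpid(status string) []int {
	for _, line := range strings.Split(status, "\n") {
		if !strings.HasPrefix(line, "NSpid:") {
			continue
		}
		var pids []int
		for _, field := range strings.Fields(strings.TrimPrefix(line, "NSpid:")) {
			if n, err := strconv.Atoi(field); err == nil {
				pids = append(pids, n)
			}
		}
		return pids
	}
	return nil
}

func main() {
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		fmt.Println("cannot read status file:", err)
		return
	}
	fmt.Println("PID per namespace level:", parseNSpid(string(data)))
}
```

Pointed at the status file of the re-invoked process from the host, a line such as “NSpid: 13772 1” would mean PID 13772 on the host and PID 1 inside the container.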

Now let’s execute ps inside our container:

root@inside-container:~# ps
 PID TTY TIME CMD
 8564 pts/6 00:00:00 bash
13758 pts/6 00:00:00 go
13769 pts/6 00:00:00 container
13772 pts/6 00:00:00 exe
13775 pts/6 00:00:00 bash
14204 pts/6 00:00:00 ps

Well, that is not the result we expected. The reason is the /proc directory, which still contains information about all processes running on the host.
The ps command reads /proc, so our container needs its own /proc directory.
This is where “chroot” comes in: with “chroot” we can change the root directory of a process.
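To make the dependency on /proc concrete, here is a rough sketch of what a ps-like tool does: it walks /proc, keeps the entries whose names are numbers, and reads each command name from the “comm” file. It is a simplification (the real ps reads more files and filters by terminal), and isPID is a helper name chosen for this example.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// isPID reports whether a /proc entry name is a process ID,
// i.e. it consists only of decimal digits.
// (Hypothetical helper, named for this example only.)
func isPID(name string) bool {
	if name == "" {
		return false
	}
	for _, r := range name {
		if r < '0' || r > '9' {
			return false
		}
	}
	return true
}

func main() {
	entries, err := os.ReadDir("/proc")
	if err != nil {
		fmt.Println("cannot read /proc:", err)
		return
	}
	for _, e := range entries {
		if !isPID(e.Name()) {
			continue
		}
		// comm holds the short command name of the process.
		comm, err := os.ReadFile(filepath.Join("/proc", e.Name(), "comm"))
		if err != nil {
			continue // the process may have exited in the meantime
		}
		fmt.Printf("%6s %s\n", e.Name(), strings.TrimSpace(string(comm)))
	}
}
```

Whatever is mounted at /proc decides what this loop sees, which is why a new PID namespace alone does not change the ps output.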

We’re going to use the alpine image filesystem for our container:

root@host:~# docker image inspect alpine:latest -f '{{.GraphDriver.Data.UpperDir}}'
/var/lib/docker/overlay2/4cb707a20edd03410a4190e1d0a8402655b985a152239afefe7c8e2ed055e994/diff
root@host:~# cp -r /var/lib/docker/overlay2/4cb707a20edd03410a4190e1d0a8402655b985a152239afefe7c8e2ed055e994/diff \
/root/containerFS
root@host:~# mkdir /root/containerFS/helloContainer && \
ls /root/containerFS/
bin dev etc helloContainer home lib media mnt proc root run sbin srv sys tmp usr var

Now we can chroot our container to /root/containerFS/

Let’s tweak our code:

package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	switch os.Args[1] {
	case "run":
		run()
	case "ns":
		ns()
	default:
		panic("pass me an argument please")
	}
}

func run() {
	fmt.Printf("Running %v as %d\n", os.Args[2:], os.Getpid())
	cmd := exec.Command("/proc/self/exe", append([]string{"ns"}, os.Args[2:]...)...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID,
	}
	cmd.Run()
}

func ns() {
	fmt.Printf("Running in new UTS namespace %v as %d\n", os.Args[2:], os.Getpid())

	syscall.Sethostname([]byte("inside-container"))
	syscall.Chroot("/root/containerFS")
	syscall.Chdir("/") // set the working directory inside container
	cmd := exec.Command(os.Args[2], os.Args[3:]...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	cmd.Run()
}

Let’s run it:

root@host:~# go run container.go run /bin/bash
Running [/bin/bash] as 18015
Running in new UTS namespace [/bin/bash] as 1
root@host:~#

Our containerized shell did not start! But why?
Because /bin/bash is not present in the alpine image. Since we changed the container’s root filesystem to /root/containerFS, the program looks for /bin/bash in the new filesystem, which is /root/containerFS/bin/bash from the host’s perspective. And because our code ignores the error returned by cmd.Run(), the failure is silent.
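One way to catch this kind of mistake early is to check, from the host, whether the command exists under the directory that is about to become the root. The following is only a sketch: pathInRoot is a made-up helper, and the two commands checked are the ones used in this article. Checking the error returned by cmd.Run() in the container program would reveal the same problem at run time.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// pathInRoot returns where a command path inside the container
// lives on the host, given the directory that will become the root.
// (Hypothetical helper, named for this example only.)
func pathInRoot(root, cmd string) string {
	return filepath.Join(root, cmd)
}

func main() {
	root := "/root/containerFS" // the directory used in this article
	for _, cmd := range []string{"/bin/bash", "/bin/sh"} {
		hostPath := pathInRoot(root, cmd)
		// Lstat does not follow symlinks; a link such as /bin/sh may
		// point to an absolute path that only resolves inside the chroot.
		if _, err := os.Lstat(hostPath); err != nil {
			fmt.Printf("%s: not usable in the container: %v\n", cmd, err)
			continue
		}
		fmt.Printf("%s: found at %s\n", cmd, hostPath)
	}
}
```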

Let’s run /bin/sh:

root@host:~# go run container.go run /bin/sh
Running [/bin/sh] as 18203
Running in new UTS namespace [/bin/sh] as 1
/ # ls
bin etc home media proc run srv tmp var
dev helloContainer lib mnt root sbin sys usr
/ #

We have successfully changed our container filesystem.
Now let’s try ps inside the container again:

/ # ps
PID USER TIME COMMAND
/ # ls /proc/

Well, nothing! That’s because /proc is a pseudo filesystem, a mechanism for sharing information between user space and kernel space, and nothing is mounted at /proc inside our new root yet.
We need to mount a proc filesystem inside the container.
This also makes it clearer how containers share the host kernel.

Let’s tweak the code again:

package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	switch os.Args[1] {
	case "run":
		run()
	case "ns":
		ns()
	default:
		panic("pass me an argument please")
	}
}

func run() {
	fmt.Printf("Running %v as %d\n", os.Args[2:], os.Getpid())
	cmd := exec.Command("/proc/self/exe", append([]string{"ns"}, os.Args[2:]...)...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID,
	}
	cmd.Run()
}

func ns() {
	fmt.Printf("Running in new UTS namespace %v as %d\n", os.Args[2:], os.Getpid())

	syscall.Sethostname([]byte("inside-container"))
	syscall.Chroot("/root/containerFS")
	syscall.Chdir("/") // set the working directory inside container
	syscall.Mount("proc", "proc", "proc", 0, "")
	cmd := exec.Command(os.Args[2], os.Args[3:]...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	cmd.Run()

	syscall.Unmount("/proc", 0)
}

We added “syscall.Mount("proc", "proc", "proc", 0, "")” to mount /proc into the container filesystem, and we unmount it with “syscall.Unmount("/proc", 0)” after the process exits.

Let’s run ps again:

root@host:~# go run container.go run /bin/sh
Running [/bin/sh] as 18957
Running in new UTS namespace [/bin/sh] as 1
/ # ps
PID USER TIME COMMAND
 1 root 0:00 /proc/self/exe ns /bin/sh
 4 root 0:00 /bin/sh
 5 root 0:00 ps
/ #

That worked :)

Let’s take a look at the mounts on the host:

root@host:~# mount | grep /proc
proc on /root/containerFS/proc type proc (rw,relatime)

By default, the mounts we make in the container are visible from the host, but we can give the container its own mount namespace so that its mounts no longer show up on the host.

Let’s do it:

package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	switch os.Args[1] {
	case "run":
		run()
	case "ns":
		ns()
	default:
		panic("pass me an argument please")
	}
}

func run() {
	fmt.Printf("Running %v as %d\n", os.Args[2:], os.Getpid())
	cmd := exec.Command("/proc/self/exe", append([]string{"ns"}, os.Args[2:]...)...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags:   syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS,
		Unshareflags: syscall.CLONE_NEWNS,
	}
	cmd.Run()
}

func ns() {
	fmt.Printf("Running in new UTS namespace %v as %d\n", os.Args[2:], os.Getpid())

	syscall.Sethostname([]byte("inside-container"))
	syscall.Chroot("/root/containerFS")
	syscall.Chdir("/") // set the working directory inside container
	syscall.Mount("proc", "proc", "proc", 0, "")
	cmd := exec.Command(os.Args[2], os.Args[3:]...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	cmd.Run()

	syscall.Unmount("/proc", 0)
}

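We created a new mount namespace using “syscall.CLONE_NEWNS” and unshared it using “Unshareflags: syscall.CLONE_NEWNS”. The second setting matters on hosts where / is mounted as shared (as systemd does): with it, Go also remounts / as private in the child, so the container’s mounts do not propagate back to the host.
By the way, NEWNS, which stands for “new namespace”, refers to the mount namespace; it was the first kind of namespace added to Linux, hence the generic name.
We can still see the mounts of processes inside the container by looking at /proc/{pid}/mounts, but at least they no longer clutter the host’s mount table.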
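The format of /proc/{pid}/mounts is simple enough to inspect from a program: each line holds the device, the mount point, the filesystem type, and the options. As an illustration, the sketch below lists the mount points of all proc filesystems visible to the current process; procMounts is a helper name introduced only for this example.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// procMounts returns the mount points of all "proc" filesystems found in
// text formatted like /proc/<pid>/mounts, where the whitespace-separated
// fields of each line are: device, mount point, type, options, ...
// (Hypothetical helper, named for this example only.)
func procMounts(mounts string) []string {
	var points []string
	for _, line := range strings.Split(mounts, "\n") {
		fields := strings.Fields(line)
		if len(fields) >= 3 && fields[2] == "proc" {
			points = append(points, fields[1])
		}
	}
	return points
}

func main() {
	data, err := os.ReadFile("/proc/self/mounts")
	if err != nil {
		fmt.Println("cannot read mounts:", err)
		return
	}
	for _, p := range procMounts(string(data)) {
		fmt.Println("proc mounted at", p)
	}
}
```

Replacing “self” with the host PID of the containerized process shows the container’s view instead of the host’s.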

Other namespaces, such as the network, user, and IPC namespaces, can be applied in the same way.
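As a rough sketch of what that could look like, the function below builds a SysProcAttr that requests all six namespaces. The uid and gid mappings are an assumption of this example (they map one host user to root inside a user namespace); newAttr and flagNames are illustrative names, and the program only prints the requested namespaces instead of starting a process.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// newAttr returns process attributes that request UTS, PID, mount,
// network, IPC and user namespaces. The user namespace maps the given
// host uid/gid to root (ID 0) inside the container.
// (Hypothetical helper, named for this example only.)
func newAttr(uid, gid int) *syscall.SysProcAttr {
	return &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS |
			syscall.CLONE_NEWNET | syscall.CLONE_NEWIPC | syscall.CLONE_NEWUSER,
		UidMappings: []syscall.SysProcIDMap{{ContainerID: 0, HostID: uid, Size: 1}},
		GidMappings: []syscall.SysProcIDMap{{ContainerID: 0, HostID: gid, Size: 1}},
	}
}

// flagNames lists, in a fixed order, the namespaces requested by a
// Cloneflags value.
func flagNames(flags uintptr) []string {
	known := []struct {
		flag uintptr
		name string
	}{
		{syscall.CLONE_NEWUTS, "uts"},
		{syscall.CLONE_NEWPID, "pid"},
		{syscall.CLONE_NEWNS, "mount"},
		{syscall.CLONE_NEWNET, "network"},
		{syscall.CLONE_NEWIPC, "ipc"},
		{syscall.CLONE_NEWUSER, "user"},
	}
	var names []string
	for _, k := range known {
		if flags&k.flag != 0 {
			names = append(names, k.name)
		}
	}
	return names
}

func main() {
	attr := newAttr(os.Getuid(), os.Getgid())
	fmt.Println("requested namespaces:", flagNames(attr.Cloneflags))
}
```

Assigning the result to cmd.SysProcAttr in run() would be the next step; note that a fresh network namespace starts with only a loopback interface, so networking needs additional setup that is outside the scope of this article.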

CGROUPS
Cgroups are exposed through another pseudo filesystem: the interface looks like directories and files, and we use it to exchange settings between user space and kernel space.
Put simply, cgroups limit the resources a container can use.

Let’s explore it:

root@host:~# ls /sys/fs/cgroup/
blkio cpu cpuacct cpu,cpuacct cpuset devices freezer hugetlb memory net_cls net_cls,net_prio net_prio perf_event pids systemd
root@host:~#

As you can see, there is a directory for each type of cgroup (each resource controller).
Let’s take a look at memory.limit_in_bytes in directory “memory”:

root@host:~# cat /sys/fs/cgroup/memory/memory.limit_in_bytes
9223372036854771712

This number (the largest signed 64-bit integer rounded down to a 4096-byte page boundary) means there is no memory limit for host processes.

There is a “docker” directory in /sys/fs/cgroup/memory/ which contains cgroups of type memory, for Docker containers.

Let’s take an example with no cgroup limitation applied:

## Run a Docker container
root@host:~# docker run --name alpine --rm -it alpine:latest /bin/sh
## Retrieving the container ID
root@host:~# docker container inspect alpine -f {{.ID}}
797257f234a5df64518bd470bd937a0d19f7b49cf3ad84c86ac0bb03fe34ebea
## Check for cgroup limits
root@host:~# cat \
/sys/fs/cgroup/memory/docker/797257f234a5df64518bd470bd937a0d19f7b49cf3ad84c86ac0bb03fe34ebea/memory.limit_in_bytes
9223372036854771712
## There is no limit on this container’s processes

Let’s take an example with cgroup limitation:

## Run a Docker container
root@host:~# docker run --name alpine --rm -it --memory 50M alpine:latest /bin/sh
## Retrieving the container ID
root@host:~# docker container inspect alpine -f {{.ID}}
bd8f7b86aca0ddcca61a562b43349a067e8c7eb5ec62fd854cd1b2e009006037
## Check for cgroup limits
root@host:~# cat \
/sys/fs/cgroup/memory/docker/bd8f7b86aca0ddcca61a562b43349a067e8c7eb5ec62fd854cd1b2e009006037/memory.limit_in_bytes
52428800
## There is a 50M memory limit on this container’s processes
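The two values seen above can be interpreted programmatically. The sketch below assumes the cgroup v1 layout used in this article and a 4096-byte page size; describeLimit and the constant unlimited are names chosen for this example. On hosts that use cgroup v2 the file does not exist, and the program simply reports that.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// unlimited is the value reported when no memory limit is set: the
// largest signed 64-bit integer rounded down to a 4096-byte page
// (the page size is an assumption of this example).
const unlimited uint64 = (1<<63 - 1) &^ 4095

// describeLimit turns the contents of memory.limit_in_bytes into a
// human-readable description.
// (Hypothetical helper, named for this example only.)
func describeLimit(raw string) (string, error) {
	n, err := strconv.ParseUint(strings.TrimSpace(raw), 10, 64)
	if err != nil {
		return "", err
	}
	if n >= unlimited {
		return "unlimited", nil
	}
	return fmt.Sprintf("%d MiB", n/(1024*1024)), nil
}

func main() {
	data, err := os.ReadFile("/sys/fs/cgroup/memory/memory.limit_in_bytes")
	if err != nil {
		fmt.Println("cannot read the cgroup v1 memory limit:", err)
		return
	}
	desc, err := describeLimit(string(data))
	if err != nil {
		fmt.Println("unexpected file contents:", err)
		return
	}
	fmt.Println("memory limit:", desc)
}
```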

So now that we know how to limit container resources, let’s put it into code:

package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"strconv"
	"syscall"
)

func main() {
	switch os.Args[1] {
	case "run":
		run()
	case "ns":
		ns()
	default:
		panic("pass me an argument please")
	}
}

func run() {
	fmt.Printf("Running %v as %d\n", os.Args[2:], os.Getpid())
	cmd := exec.Command("/proc/self/exe", append([]string{"ns"}, os.Args[2:]...)...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags:   syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS,
		Unshareflags: syscall.CLONE_NEWNS,
	}
	cmd.Run()
}

func ns() {
	fmt.Printf("Running in new UTS namespace %v as %d\n", os.Args[2:], os.Getpid())

	cg()
	syscall.Sethostname([]byte("inside-container"))
	syscall.Chroot("/root/containerFS")
	syscall.Chdir("/") // set the working directory inside container
	syscall.Mount("proc", "proc", "proc", 0, "")
	cmd := exec.Command(os.Args[2], os.Args[3:]...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	cmd.Run()

	syscall.Unmount("/proc", 0)
}

func cg() {
	cgroups := "/sys/fs/cgroup/"
	pids := filepath.Join(cgroups, "pids")
	os.Mkdir(filepath.Join(pids, "ourContainer"), 0755)

	// limit the number of processes in this cgroup to 10
	// (os.WriteFile replaces the deprecated ioutil.WriteFile since Go 1.16)
	os.WriteFile(filepath.Join(pids, "ourContainer/pids.max"), []byte("10"), 0700)

	// ask the kernel to run the release agent, if one is configured,
	// when the cgroup becomes empty
	os.WriteFile(filepath.Join(pids, "ourContainer/notify_on_release"), []byte("1"), 0700)

	// move the current process into the cgroup; its children inherit the membership
	os.WriteFile(filepath.Join(pids, "ourContainer/cgroup.procs"), []byte(strconv.Itoa(os.Getpid())), 0700)
}

Let’s run it:

root@host:~# go run container.go run /bin/sh
Running [/bin/sh] as 11219
Running in new UTS namespace [/bin/sh] as 1
/ #

Here are the files of the pids cgroup created by the program:

root@host:~# cat /sys/fs/cgroup/pids/ourContainer/pids.max
10
root@host:~# cat /sys/fs/cgroup/pids/ourContainer/cgroup.procs
11222
11225
root@host:~# ps -aux | grep "/proc/self/exe ns /bin/sh"
root 11222 /proc/self/exe ns /bin/sh

So what happened?
When we run “go run container.go run /bin/sh”, the program re-invokes itself as “/proc/self/exe ns /bin/sh”, which leads to the ns() function, which in turn calls cg().
In cg(), the re-invoked process writes its PID to the “/sys/fs/cgroup/pids/ourContainer/cgroup.procs” file and thereby subjects itself and its child processes to this cgroup.
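To watch the limit from the host while the container is running, the pids controller also exposes the current process count in a pids.current file next to pids.max. The sketch below reads both; it assumes the cgroup v1 path and the “ourContainer” name used above, and parsePidsMax is a helper name made up for this example. A pids.max file contains either a number or the word “max”, which means no limit.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// parsePidsMax interprets the contents of a pids.max file. It returns
// the limit and true, or 0 and false when the file says "max" (no limit).
// (Hypothetical helper, named for this example only.)
func parsePidsMax(raw string) (int, bool, error) {
	s := strings.TrimSpace(raw)
	if s == "max" {
		return 0, false, nil
	}
	n, err := strconv.Atoi(s)
	if err != nil {
		return 0, false, err
	}
	return n, true, nil
}

func main() {
	dir := "/sys/fs/cgroup/pids/ourContainer" // cgroup created by cg() in this article
	maxRaw, err := os.ReadFile(filepath.Join(dir, "pids.max"))
	if err != nil {
		fmt.Println("cannot read pids.max:", err)
		return
	}
	curRaw, err := os.ReadFile(filepath.Join(dir, "pids.current"))
	if err != nil {
		fmt.Println("cannot read pids.current:", err)
		return
	}
	limit, limited, err := parsePidsMax(string(maxRaw))
	if err != nil {
		fmt.Println("unexpected pids.max contents:", err)
		return
	}
	current := strings.TrimSpace(string(curRaw))
	if !limited {
		fmt.Printf("%s processes, no limit\n", current)
		return
	}
	fmt.Printf("%s of at most %d processes\n", current, limit)
}
```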

I suggest you try to start more than “ourContainer/pids.max” processes, for example with a fork bomb, and see what happens.
A fork bomb is a short shell function, so testing it is left as an exercise for you ;))

Well, that’s it.
I hope this helps you understand containers more deeply.
Please let me know your thoughts in the comments section below.
Special thanks to Liz Rice for her amazing work.
