Container Sandboxing | gVisor

An overview of container sandboxing and usage of gVisor

Md Shamim
Geek Culture
Dec 1, 2022


Compared to VMs, containers are weakly isolated from the host operating system. VMs run with their own OS (kernel), whereas containers share the host kernel among themselves. As a result, containers are less secure than VMs.

(Image source: https://unit42.paloaltonetworks.com/)

In this article, we will discuss container sandboxing, which helps us make containers more isolated.

Traditional Linux containers are not sandboxed. A container communicates with the OS kernel using system calls.

(Image source: Google Cloud)
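To get a feel for this, we can trace the system calls an ordinary process makes on any Linux host (assuming ‘strace’ is installed; the command prints a per-syscall summary):

# summarise the syscalls made by a simple command
>> strace -c ls /tmp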

By using a seccomp filter we can restrict the syscalls a container is allowed to make, and an AppArmor profile helps us restrict the resources a container can access. But creating a seccomp filter or AppArmor profile for hundreds of applications in a production environment is a tedious task. To overcome this kind of situation, we can use container sandboxing.
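For reference, this is roughly what attaching a seccomp profile to a single Pod looks like in Kubernetes (a minimal sketch; the pod name and image below are placeholders, and ‘RuntimeDefault’ simply applies the container runtime's built-in profile). The tedious part is writing and maintaining a custom profile per application:

apiVersion: v1
kind: Pod
metadata:
  name: seccomp-demo              # <-- hypothetical name, just for illustration
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault        # <-- use the container runtime's default seccomp profile
  containers:
  - name: app
    image: busybox
    args: ["sleep", "2000"]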

Sandboxing techniques

● VM-based container technology —
One approach to improve container isolation is to run each container in its own virtual machine (VM). This gives each container its own “machine,” including the kernel and virtualized devices, completely separate from the host. Even if there is a vulnerability in the guest, the hypervisor still isolates the host, as well as other applications/containers running on the host.
Kata Containers is an open-source community working to build a secure container runtime with lightweight virtual machines that feel and perform like containers, but provide stronger workload isolation using hardware virtualization technology as a second layer of defense. More from the documentation.

(Image source: https://katacontainers.io/)

● Sandboxed containers with gVisor —
gVisor intercepts application system calls and acts as the guest kernel, without the need for translation through virtualized hardware. More from the documentation.

(Image source: Google Cloud)
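As a quick illustration, gVisor can also be used with plain Docker before we bring it into Kubernetes (a sketch; it assumes the ‘runsc’ binary is already installed on the machine and Docker is running):

# register runsc as a Docker runtime (updates /etc/docker/daemon.json) and restart Docker
>> sudo runsc install
>> sudo systemctl restart docker

# run a container under gVisor and check which kernel it reports
>> docker run --rm --runtime=runsc busybox uname -sr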

In this article, our main focus is gVisor. In the following sections, we will discuss how to install gVisor on our system and how to use it with Kubernetes pods.

Without gVisor

In Kubernetes, containers run on the worker node's kernel, although they are isolated using namespaces and cgroups. To verify that a container runs on the worker node's kernel, let's run the ‘uname’ command on the host machine.

# execute 'uname' command
>> uname -sr
Linux 5.4.0-131-generic # <------

Then let's create a pod, ‘exec’ into it, and run the ‘uname’ command.

# create a pod
>> kubectl run normal-pod --image=busybox -- sleep 2000

# exec into the pod (busybox ships 'sh', not 'bash')
>> kubectl exec -it normal-pod -- sh

# execute 'uname' command
-> uname -sr
Linux 5.4.0-131-generic # <------

In the illustration above, the kernel name and kernel release of the worker node and of the pod are the same, so we can assume that pods share the worker node's kernel. To be certain of that, we can check the process ID (PID): a container is nothing but a process running on the worker node. To verify that, once again ‘exec’ into the pod —

# exec into the pod
>> kubectl exec -it normal-pod -- sh

# run the 'ps aux' command
-> ps aux
-----------------------------------------------------------------------------
PID   USER   TIME   COMMAND
  1   root   0:00   sleep 2000 #<----------
 22   root   0:00   sh
 27   root   0:00   ps aux

Now, ‘ssh’ into the worker node where the pod is running, run the ‘ps aux’ command, and search for the process that is running as the container.

>> ssh node01
>> ps aux | grep sleep

root 34930 0.0 0.0 1312 4 ? Ss 02:12 0:00 sleep 2000

We can see that the same process is running on the worker node, but the process ID is different, because processes are isolated using PID namespaces.
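PID namespaces are a kernel feature rather than anything Kubernetes-specific; we can reproduce the same effect directly with ‘unshare’ (assuming util-linux is available and we have root):

# start a shell in a new PID namespace with its own /proc
>> sudo unshare --pid --fork --mount-proc sh

# inside the new namespace, the shell sees itself as PID 1
-> ps aux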

So we can say that pods/containers share the worker node's kernel, which can be a major security risk: a compromised pod/container can affect other pods/containers.

With gVisor Sandboxing

1. gVisor Installation on worker nodes
We can use the following script to install gVisor on the nodes where pods will be running:

#!/bin/bash

# To download and install the latest release

(
  set -e
  ARCH=$(uname -m)
  URL=https://storage.googleapis.com/gvisor/releases/release/latest/${ARCH}
  wget ${URL}/runsc ${URL}/runsc.sha512 \
    ${URL}/containerd-shim-runsc-v1 ${URL}/containerd-shim-runsc-v1.sha512
  sha512sum -c runsc.sha512 \
    -c containerd-shim-runsc-v1.sha512
  rm -f *.sha512
  chmod a+rx runsc containerd-shim-runsc-v1
  sudo mv runsc containerd-shim-runsc-v1 /usr/local/bin
)


# Update /etc/containerd/config.toml

cat <<EOF | sudo tee /etc/containerd/config.toml
version = 2
[plugins."io.containerd.runtime.v1.linux"]
  shim_debug = true
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runsc]
  runtime_type = "io.containerd.runsc.v1"
EOF

# Reload containerd

sudo systemctl restart containerd
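As a quick sanity check (assuming the script above finished without errors), we can confirm that the binaries are in place and that containerd picked up the new runtime:

# runsc should now be on the PATH
>> runsc --version

# the runsc runtime should appear in containerd's effective configuration
>> sudo containerd config dump | grep runsc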

2. Runtime Class
RuntimeClass is a feature for selecting the container runtime configuration. The container runtime configuration is used to run a pod's containers.

We can set a different RuntimeClass between different Pods to provide a balance of performance versus security. For example, if part of our workload deserves a high level of information security assurance, we might choose to schedule those Pods so that they run in a container runtime that uses hardware virtualization. We'd then benefit from the extra isolation of the alternative runtime, at the expense of some additional overhead.

Now, configure a runtime class for gVisor:

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor   # <-- name can be anything
handler: runsc   # <-- gVisor provides a container runtime named 'runsc' to create containers

The default container runtime is ‘runc’, which implements the OCI runtime specification. gVisor provides a runtime named ‘runsc’, and Kata Containers provides one named ‘kata-runtime’.
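For comparison, a RuntimeClass for Kata Containers would look the same except for the handler (a sketch; it assumes Kata's runtime is installed and registered in containerd under that handler name):

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata             # <-- name can be anything
handler: kata-runtime    # <-- must match the runtime handler registered in containerd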

3. Deploy a pod using RuntimeClasses
Once RuntimeClasses are configured for the cluster, we can specify a runtimeClassName in the Pod’s spec section to use it. For example:

apiVersion: v1
kind: Pod
metadata:
  labels:
    run: sandboxed
  name: sandboxed
spec:
  runtimeClassName: gvisor #<--
  containers:
  - args:
    - sleep
    - "2000"
    image: busybox
    name: sandboxed
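We can save the manifest to a file (the file name below is arbitrary), apply it, and confirm which runtime class the Pod was created with:

# create the pod
>> kubectl apply -f sandboxed-pod.yaml

# confirm the runtime class recorded in the pod spec
>> kubectl get pod sandboxed -o jsonpath='{.spec.runtimeClassName}{"\n"}'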

4. Verification

● ‘exec’ into the pod and run the ‘dmesg’ command to verify whether gVisor is working:

>> kubectl exec -it sandboxed -- sh

--> dmesg | grep gVisor

[ 0.000000] Starting gVisor..  #<--

● ‘exec’ into the pod and run the ‘uname’ command:

>> kubectl exec -it sandboxed -- sh

# run uname command
--> uname -sr
Linux 4.4.0 #<---

And then run the ‘uname’ command on the worker node where the pod is running:

>> ssh node01
>> uname -sr
Linux 5.4.0-131-generic #<---

As we can see, the kernel name and kernel release version reported inside the container differ from those of the worker node. This implies that the container is not talking directly to the worker node's kernel; it has been sandboxed.
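We can also observe the sandbox from the host side: on the worker node, the gVisor Sentry runs as ‘runsc’ processes alongside the ‘containerd-shim-runsc-v1’ shim (the exact process list varies by version):

>> ssh node01
>> ps aux | grep -E 'runsc|containerd-shim-runsc' | grep -v grep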

