Secure container runtime

Tuấn Anh Phạm
Coccoc Engineering Blog
6 min read · Apr 18, 2022


Containers are a great technology, but they have a well-known security weakness: weak isolation. To compensate, our company has a policy that enforces all containers in our k8s cluster to run as non-root users. However, there are still some exceptions that need root privileges, such as our mail relay pod running postfix. To let those services run on our k8s cluster while maintaining our security policies, we need a container runtime with tighter isolation.
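For context, this is the kind of constraint our non-root policy implies at the pod level (a minimal sketch; the securityContext fields are standard Kubernetes, but the UID is illustrative):

securityContext:
  runAsNonRoot: true  # kubelet refuses to start a container that would run as UID 0
  runAsUser: 1000     # illustrative non-root UID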

Besides providing better isolation than runc (our current runtime), the new container runtime must also satisfy these requirements:

  • Easy to integrate with our current infrastructure
    The new container runtime must follow the OCI standard.
  • Keep our current workflows and templates
    We should be able to reuse our current k8s manifests and workflows with minimal changes to get the new runtime working.
  • Ensure performance and compatibility
    Of course, there are always trade-offs when tightening security. However, it's meaningless if the service fails to run or performance is drastically reduced.

At first, we had 3 candidates picked by our boss: Kata Containers, gVisor and Firecracker. However, I removed Firecracker from the list because it's not actually a container runtime. It's a VMM (virtual machine monitor) optimized to run serverless functions. It doesn't follow the OCI standard, so there's no easy way to make it work with our k8s. That leaves us with gVisor and Kata Containers to choose from.

gVisor

gVisor is an application kernel written in Go. Its runtime is called runsc and follows the OCI standard, making it easy to integrate with k8s.

gVisor creates a sandbox that runs containers on a user-space kernel implementing a subset of the Linux system call interface. Instead of running on the host kernel, containers run on a separate gVisor kernel, which is quite similar to a virtual machine. Unlike a VM, however, gVisor doesn't need translation through virtual hardware.

(Image source: https://gvisor.dev/docs/)

Because gVisor places a filter between the application and the host kernel, it intercepts system calls, which causes performance degradation for syscall-heavy workloads.

At the time of this post, gVisor provides an environment roughly equivalent to Linux 4.4 with a limited set of system calls, so incompatibilities may occur.
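A quick way to see the user-space kernel in action, assuming a test host where Docker is configured with the runsc runtime (a sketch, not part of our k8s setup):

$ docker run --rm --runtime=runsc alpine uname -r
# reports a 4.4-series kernel version supplied by gVisor's
# user-space kernel, regardless of the host's actual kernel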

Kata Containers

Kata Containers is a combination of Clear Containers (Intel) and runV (Hyper.sh). It follows the OCI standard, so it can be plugged seamlessly into k8s.

Kata Containers uses virtualization technology to secure containers by putting them into a lightweight virtual machine. The machine image is optimized for running containers, with only the virtual devices and software needed for that purpose. Because of that, you get the isolation of a VM for your containers: the best of both worlds.

Kata Containers works by starting a VM with a standard Linux kernel. Inside the VM, an agent manages the containers using libcontainer, while the kata runtime on the host manages the agent via vsock.

(Image source: https://katacontainers.io/learn/)
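One practical consequence: each kata pod shows up as its own hypervisor process on the node, which you can observe directly (the exact binary name, e.g. qemu-lite or qemu-system-x86_64, varies by kata version and configuration):

$ ps -ef | grep -i qemu
# one hypervisor process per running kata pod on this node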

The downside is that each VM needs some resource overhead. We can reduce the OS memory footprint by enabling KSM (kernel samepage merging), at the cost of some CPU power, when multiple kata pods run on the same host.
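KSM is controlled through the standard sysfs interface on Linux; a minimal sketch (production deployments usually tune the scan rate as well):

$ echo 1 | sudo tee /sys/kernel/mm/ksm/run    # start the KSM daemon
$ cat /sys/kernel/mm/ksm/pages_sharing        # pages currently deduplicated across VMs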

Comparison

Based on what we know so far, we can draw some conclusions:

  • Both gVisor and Kata provide better isolation than runc.
  • Kata should have better compatibility than gVisor.
  • Both should add little CPU/memory overhead, but gVisor should have a smaller footprint than Kata (for a single pod) and may boot faster.
  • gVisor may show reduced performance on syscall-heavy workloads.

At the time of this post, Kata is the better candidate to replace runc for running untrusted workloads in a production environment. One more thing to consider: Kata needs hardware virtualization enabled to run (the vmx CPU flag for Intel, or svm for AMD). That's not a problem for our system, as our k8s nodes are bare-metal. However, if your k8s nodes are VMs, you need nested virtualization enabled, and that comes with restrictions and a performance penalty. For more info about nested virtualization, see the Google Cloud docs.
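Checking for those CPU flags on a node is a one-liner:

$ grep -cE 'vmx|svm' /proc/cpuinfo
# a non-zero count means hardware virtualization is available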

Benchmark

To compare the performance of kata and gVisor, I created a simple sysbench container and started it with each of the 3 runtimes in turn on the same host.

limits:
  cpu: 4
  memory: 4000Mi

Test commands:

$ sysbench cpu --threads=4 --cpu-max-prime=20000 run
$ sysbench memory run

The results show that for the CPU benchmark, all 3 runtimes get the same score (within a very small margin that we can ignore). That's understandable: sysbench evaluates CPU power by computing the prime numbers less than or equal to the cpu-max-prime parameter. It's pure arithmetic and involves almost no syscalls.

cpu benchmark results:
runc:   13619
kata:   13219
gvisor: 13761

In the memory benchmark, runc scores about 13% higher than kata and about 16% higher than gvisor:

memory benchmark results:
runc:   42132608
kata:   37118940
gvisor: 36232345

I also found a real-life benchmark comparing runc, gvisor and kata:

https://object-storage-ca-ymq-1.vexxhost.net/swift/v1/6e4619c416ff4bd19e1c087f27a43eea/www-assets-prod/presentation-media/kata-containers-and-gvisor-a-quantitave-comparison.pdf

You can see that gVisor's performance is very poor in that comparison, which is the main reason I chose kata over gVisor.

Integrating the new runtime into k8s

Prerequisites:

  • Kubernetes with kubelet (>= v1.4.3)
  • containerd (>= v1.2.0) with the CRI plugin
  • Kata Containers (>= v1.5.0)

As both gVisor and kata follow the OCI standard, integrating them with k8s is easy.

Integrating kata into k8s

  1. Define a new runtime class on the k8s cluster

Create a kata-runtime.yaml file with the following content:

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata
overhead:
  podFixed:
    memory: "70Mi"
    cpu: "150m"

Note: We need to define the overhead of the runtime class; it counts toward resource quota calculations, node scheduling, and Pod cgroup sizing. For example, with the values above, a kata pod requesting 1 CPU and 1Gi of memory is scheduled as if it needed 1.15 CPU and 1094Mi.

Apply it on k8s:

$ kubectl apply -f kata-runtime.yaml
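Then confirm the class is registered:

$ kubectl get runtimeclass
# should list "kata" with handler "kata"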

2. Install the kata runtime on the nodes

The installation guide can be found in the kata docs on GitHub.

As we're using Debian:

$ ARCH=$(arch)
$ BRANCH="${BRANCH:-master}"
$ source /etc/os-release    # provides the ${VERSION_ID} used below
$ echo "deb http://download.opensuse.org/repositories/home:/katacontainers:/releases:/${ARCH}:/${BRANCH}/Debian_${VERSION_ID}/ /" | sudo tee /etc/apt/sources.list.d/kata-containers.list
$ curl -sL "http://download.opensuse.org/repositories/home:/katacontainers:/releases:/${ARCH}:/${BRANCH}/Debian_${VERSION_ID}/Release.key" | sudo apt-key add -
$ sudo apt-get update
$ sudo apt-get install -y kata-runtime kata-proxy kata-shim
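After installation, kata ships a self-check (the kata-check subcommand of the kata 1.x CLI) that verifies the host can actually run it:

$ kata-runtime kata-check
# reports whether KVM, the required CPU flags and kernel
# modules are present on this host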

3. Configure containerd by adding the following to /etc/containerd/config.toml:

[plugins]
  [plugins.cri]
    [plugins.cri.containerd]
      [plugins.cri.containerd.runtimes.kata]
        runtime_type = "io.containerd.kata.v2"
        privileged_without_host_devices = false

Restart containerd to apply the new config:

$ systemctl restart containerd
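If crictl is available on the node, a quick way to confirm containerd picked up the new runtime (assuming crictl points at containerd's socket):

$ sudo crictl info | grep -A2 kata
# the CRI config dump should now contain a runtime entry named kata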

4. Deploy a new pod with the kata runtime

We just need to set "runtimeClassName: kata" in the pod spec:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: katademo
  namespace: inf-dev
spec:
  selector:
    matchLabels:
      app: katademo
  template:
    metadata:
      labels:
        app: katademo
    spec:
      runtimeClassName: kata
      containers:          # a containers section is required for the manifest to apply;
      - name: katademo     # the image below is just a placeholder
        image: nginx
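Apply it (assuming the manifest is saved as katademo.yaml) and check that the pod really runs on kata; one quick signal is that the kernel inside the pod is the guest VM's, not the node's:

$ kubectl apply -f katademo.yaml
$ kubectl -n inf-dev exec deploy/katademo -- uname -r
# inside a kata pod this reports the guest kernel version,
# which differs from the node's kernel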

Conclusion

I set up a prototype pod using the kata runtime. It has run well for more than a month without any problems, and the setup is simple: you could do it within minutes. The overhead is small, and we're willing to accept it for the increased security. When I presented this solution to my teammates, one of them raised a concern that adding one more layer to our stack will make troubleshooting more difficult. That's reasonable, and he has a point. However, security comes at a price, and it's a price we're willing to pay.
