Kubernetes and Container Run-times
On the trail of Linux namespaces and sandboxed containers: containerd, gVisor, Kata Containers and friends.
Recently I was exploring the gVisor project. Instead of using Docker and its associated runtime, runc, we can use a different container runtime that prevents applications running inside pods and containers from breaking out of their sandbox/namespace and invading the host, which is usually done by exploiting a vulnerability or leaky abstraction in the container infrastructure.
https://kubernetes.io/docs/setup/production-environment/container-runtimes/
“A flaw was found in the way runc handled system file descriptors when running containers. A malicious container could use this flaw to overwrite contents of the runc binary and consequently run arbitrary commands on the container host system.” (https://access.redhat.com/security/cve/cve-2019-5736). A detailed write-up here.
Such bugs are not so scarce. A container is just an abstraction in Linux userspace built on the kernel features of namespaces and control groups; the kernel itself has no concept of a container. Also, most containers do not use all the namespaces available in the kernel, such as the user namespace. This means, for example, that the root user inside a container has all the privileges of the root user on the host; the PID, net, mount and other namespaces in which the container 'jails' the process and file system are the only lines of defence here. No surprise that the law of leaky abstractions hits container land. However, containers were not born yesterday; over time all the major holes have been addressed, or are known and avoided, and this should still improve with time.
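You can see these building blocks directly on any modern Linux box: every process exposes the namespaces it lives in as symlinks under /proc, and the unshare tool creates new ones. A minimal sketch (the user-namespace step assumes your distro allows unprivileged user namespaces):

```shell
# Every process exposes its namespaces as symlinks in /proc/<pid>/ns.
# Two processes are in the same namespace iff the inode numbers match.
ls -l /proc/self/ns/

# The PID namespace of the current shell, printed as e.g. pid:[4026531836]
readlink /proc/self/ns/pid

# Create a new user namespace (no root needed on most distros) and map
# ourselves to root inside it: "root in the container" is just an
# unprivileged user outside it. Guarded, since some kernels restrict this.
unshare --user --map-root-user sh -c 'id -u; readlink /proc/self/ns/user' \
  2>/dev/null || echo "unprivileged user namespaces restricted here"
```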
However, your operations team will still be unwilling to run an untrusted payload in a container on a host that has other sensitive or business-critical containers running. Usually, a separate cluster is built on virtual machine hosts to isolate these.
And then we are back to the inefficiency of virtual machines, the overhead of VM management, and the reasons containers became an idea that never looked back. This is where more secure runtimes like gVisor/runsc, Kata Containers and the like come into play.
Since we are so accustomed to working with higher-level abstractions like containers, pods and Kubernetes, understanding how gVisor and similar projects work needs us to go down the rabbit hole from abstraction land time and again. Also, the terminology has changed over the years. For example, runc, which is used by Docker, was termed a container runtime when it was spun off from Docker. However, it is not a container runtime in the current Kubernetes sense of implementing the CRI (Container Runtime Interface). Instead, it has become the reference OCI (Open Container Initiative) runtime implementation. So in this article, we will backtrack and deep-dive time and again.
Let’s backtrack quite a bit. In the beginning there were namespaces: the mount namespace arrived around 2001, and finally the user namespace in 2013. In parallel came control groups (cgroups), initially from Google developers and expanded on quite a bit (~2006 to 2014). Note that systemd became the de facto init process in most Linux distributions around that time, and it has very deep integration with cgroups for process/service management.
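Cgroup membership is just as visible as namespaces. A sketch (output differs between cgroup v1 and the unified v2 hierarchy, so both formats are noted):

```shell
# Which cgroups does this shell belong to?
# cgroup v2 prints a single line like "0::/user.slice/..."
# cgroup v1 prints one line per controller, e.g. "4:memory:/..."
cat /proc/self/cgroup

# The controllers themselves live under /sys/fs/cgroup. A container
# runtime limits a container simply by creating a sub-directory here and
# writing into files like memory.max (v2) or memory.limit_in_bytes (v1).
ls /sys/fs/cgroup | head
```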
Once namespaces and cgroups were somewhat stable in the kernel, Docker arrived (around 2013-14) and introduced the world to containers, using namespaces to jail processes and cgroups to limit resources: a sort of lightweight VM.
Kubernetes was born around this time (2014-15) as a means of container orchestration. It grew out of Borg, an internal Google project that predates it by many years, and was open-sourced during this period.
In 2015, Docker separated the runtime out into a lower layer, runc, and contributed it to the OCI (https://www.docker.com/blog/runc/). This is the reference implementation of the OCI runtime specification.
In 2016 the CRI was introduced by Kubernetes to host container runtimes other than Docker. Recall that on each Kubernetes worker node there is a kubelet service that runs and listens to the master node to orchestrate the container lifecycle.
At the lowest layers of a Kubernetes node is the software that, among other things, starts and stops containers. We call this the “Container Runtime”. The most widely known container runtime is Docker, but it is not alone in this space. … In the Kubernetes 1.5 release, we are proud to introduce the Container Runtime Interface (CRI) — a plugin interface which enables kubelet to use a wide variety of container runtimes, without the need to recompile. https://kubernetes.io/blog/2016/12/container-runtime-interface-cri-in-kubernetes/
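Concretely, the CRI is a gRPC interface that kubelet calls over a local socket. A trimmed sketch of its RuntimeService gives the flavour (the method names below are from the real kubernetes cri-api protobuf; the service is much larger, and the message definitions are omitted here):

```protobuf
// Abridged sketch of the CRI RuntimeService that kubelet talks to.
// A CRI implementation (containerd, CRI-O, dockershim) serves this API.
service RuntimeService {
  rpc Version(VersionRequest) returns (VersionResponse) {}
  // A "pod sandbox" is the shared environment (namespaces etc.) for a pod.
  rpc RunPodSandbox(RunPodSandboxRequest) returns (RunPodSandboxResponse) {}
  rpc StopPodSandbox(StopPodSandboxRequest) returns (StopPodSandboxResponse) {}
  // Individual containers are created inside a sandbox.
  rpc CreateContainer(CreateContainerRequest) returns (CreateContainerResponse) {}
  rpc StartContainer(StartContainerRequest) returns (StartContainerResponse) {}
  rpc StopContainer(StopContainerRequest) returns (StopContainerResponse) {}
}
```

This is why swapping the runtime under Kubernetes is possible at all: kubelet only knows this interface, not the runtime behind it.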
In 2017 Docker spun out its integration layer to runc as a separate project, containerd, and open-sourced it.
So we started the containerd project to move the container supervision out of the core Docker Engine and into a separate daemon. containerd has full support for starting OCI bundles and managing their lifecycle. This allows users to replace the runc binary on their system with an alternate runtime and get the benefits of still using Docker’s API. https://www.docker.com/blog/docker-containerd-integration/
Note: in February 2019 containerd became the fifth project to graduate from the CNCF, no mean achievement, as the only others to graduate until then were Kubernetes, Prometheus, Envoy and CoreDNS, an elite batch.
Docker gets its work done via containerd using runc. Kubernetes gets its work done, in the overwhelming majority of installations, via containerd and runc, either through Docker or with a CRI shim in place of Docker.
Now, this is how the Docker call flow goes.
Back to kubelet: kubelet can talk to containerd directly. Older docs mention a separate service called cri-containerd for this; that functionality is now part of containerd itself, and cri-containerd is defunct.
If you have installed your Kubernetes cluster with kubeadm, you may see that the kubelet service is by default dependent on the docker service, like below.
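To point kubelet at containerd instead, the kubelet needs the remote-runtime flags. With kubeadm in the 1.17 era these typically go into the kubelet flags file; a sketch, with the file path and flag names assumed from that version:

```
# /var/lib/kubelet/kubeadm-flags.env (sketch for Kubernetes ~1.17)
KUBELET_KUBEADM_ARGS="--container-runtime=remote --container-runtime-endpoint=unix:///run/containerd/containerd.sock"
```

After a `systemctl restart kubelet`, the node's CONTAINER-RUNTIME column in `kubectl get nodes -o wide` should report containerd instead of docker.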
However, the Kubernetes built-in dockershim CRI does not support runtime handlers. This does not prevent one from using the containerd service and configuring the kubelet to use containerd directly. containerd can then be configured to use other OCI-compatible runtimes like gVisor or Kata Containers via its runtime plugins, as shown below.
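As a sketch, for containerd 1.3 with the shim v2 interface, registering gVisor as an additional runtime is a few lines in /etc/containerd/config.toml (the section names follow the 1.3-era config layout; later containerd versions use longer, versioned section names):

```toml
# /etc/containerd/config.toml (containerd 1.3-era layout, sketch)
[plugins.cri.containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v1"
[plugins.cri.containerd.runtimes.runsc]
  # containerd resolves this runtime_type to a binary named
  # containerd-shim-runsc-v1 on $PATH (shipped with gVisor).
  runtime_type = "io.containerd.runsc.v1"
```

Restart containerd after editing, and both runc and runsc become selectable handlers for the CRI.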
Those experimenting with a kubeadm-based cluster may find the current documentation a bit incomplete. One can refer to this bug/support issue I raised until the documentation is improved.
If you follow https://github.com/google/gvisor-containerd-shim/issues/46 you can set your worker node (worker-1 below) to use containerd instead of Docker:
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
master-1 Ready master 69d v1.17.0 192.168.0.26 <none> CentOS Linux 7 (Core) 4.4.211-1.el7.elrepo.x86_64 docker://1.13.1
worker-1 Ready <none> 69d v1.17.0 192.168.0.6 <none> CentOS Linux 7 (Core) 4.4.211-1.el7.elrepo.x86_64 containerd://1.3.2
gVisor has a user-mode lightweight kernel that traps the syscalls/kernel invocations of applications running in the namespace and acts as a sort of firewall. More information here https://gvisor.dev/docs/architecture_guide/ and, even better, here https://blog.loof.fr/2018/06/gvisor-in-depth.html
You can see below how, in an existing K8s setup, this sample nginx runs in runsc while the rest run in runc, after swapping out Docker and using containerd directly with gVisor and its shim installed.
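Selecting runsc per pod is done with a RuntimeClass, whose handler must match the runtime name registered in containerd's config (here assumed to be runsc). A sketch for the 1.17 era, where RuntimeClass was still v1beta1:

```yaml
# Map a name usable in pod specs to the "runsc" handler in containerd (sketch)
apiVersion: node.k8s.io/v1beta1   # RuntimeClass was v1beta1 around K8s 1.17
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
---
apiVersion: v1
kind: Pod
metadata:
  name: nginx-sandboxed
spec:
  runtimeClassName: gvisor   # this pod runs under runsc instead of runc
  containers:
  - name: nginx
    image: nginx
```

A commonly used sanity check: running `dmesg` inside such a pod prints gVisor's own boot messages rather than the host kernel's log.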
If you have 35 minutes, the video below is completely worth it and gives a good perspective on many aspects of Kubernetes.
Shim v1 and v2: between containerd and the container there is a containerd-shim interface (for runc). That is the v1 version. More recently the shim v2 version was released: https://www.alibabacloud.com/blog/cri-and-shimv2-a-new-idea-for-kubernetes-integrating-container-runtime_594783
gVisor-related configuration: https://gist.github.com/alexcpn/8b0550b01dd69df5e0a8fd1116dbd073