Instance-per-Pod Webhook: IaaS-level isolation for Kubernetes Pods

Akihiro Suda
nttlabs
Dec 9, 2019

Although Kubernetes pods are hardened in several ways by default, they are still unprotected from potential vulnerabilities in the runtimes (kubelet/CRI/OCI), the kernel, and even the hardware.

To mitigate such container-breakout attacks, I’m now working on a project named “Instance-per-Pod Webhook”, which automatically creates IaaS instances to avoid having multiple pods on the same node.

Traditional node pool vs. Instance-per-Pod node pool

While Instance-per-Pod Webhook itself doesn’t prevent any breakout attack, it can prevent a compromised pod from gaining privileges for other pods. It can even mitigate hardware vulnerabilities when dedicated IaaS instances are used, e.g. EC2 i3.metal, Azure Dedicated Host, or Google Compute Engine Sole-tenant Node.

Instance-per-Pod Webhook is implemented as a Kubernetes Mutating Admission Webhook, which hooks into the Kubernetes control plane to inject custom configuration into Pod resources.

You can find the code here: https://github.com/AkihiroSuda/instance-per-pod

Implementation

Instance-per-Pod Webhook injects custom tolerations, nodeAffinity, and podAntiAffinity into a Pod manifest so that Kubernetes Cluster Autoscaler will scale out the cluster and create a dedicated Node (IaaS instance) for that pod.

  • tolerations : Nodes created by Instance-per-Pod Webhook (via Cluster Autoscaler) are tainted so that pods are not scheduled onto them by default. Instance-per-Pod Webhook uses tolerations to allow a pod to be scheduled onto the expected tainted node.
  • nodeAffinity : the tolerations explained above “allow” scheduling a pod onto the expected node, but they don’t “force” such scheduling. Instance-per-Pod Webhook uses nodeAffinity in conjunction with tolerations to force it.
  • podAntiAffinity : used to avoid having multiple pods on a single node.
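Concretely, the injected fields look roughly like the following Pod spec fragment. This is an illustrative sketch: the ipp label/taint key and the ipp-class selector follow the quick-start setup described later, but the exact patch produced by the webhook may differ.

```yaml
spec:
  # Allow scheduling onto nodes tainted with ipp=true:NoSchedule.
  tolerations:
  - key: ipp
    operator: Equal
    value: "true"
    effect: NoSchedule
  affinity:
    # Force scheduling onto nodes labeled ipp=true.
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: ipp
            operator: In
            values: ["true"]
    # Refuse to share a node with pods of a different ipp-class.
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: ipp-class
            operator: NotIn
            values: ["class0"]   # this pod's own class
        topologyKey: kubernetes.io/hostname
```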

As all IaaS operations are handled by Cluster Autoscaler, Instance-per-Pod Webhook is completely agnostic to the IaaS provider. It therefore works on any cluster on any cloud supported by Cluster Autoscaler, including AWS, Azure, Google Cloud, and OpenStack.

Cluster Autoscaler implementations for several cloud providers

Quick start with GKE

Instance-per-Pod Webhook can even work with managed Kubernetes clusters such as Google Kubernetes Engine (GKE).

To use Instance-per-Pod with GKE, a node pool needs to be created with “Enable autoscaling”. The minimum number of nodes can be set to an arbitrary number: a smaller number minimizes cost, while a larger number minimizes pod startup latency. The maximum number of nodes (= the maximum number of Instance-per-Pod pods) can be set to an arbitrary number as well.

Create a GKE node pool with “Enable autoscaling”

The node pool must also have the Kubernetes node label ipp=true and the node taint ipp=true with the NO_SCHEDULE effect. The label and taint names may change in future releases of Instance-per-Pod Webhook.

Set a node label “ipp” and a node taint “ipp”
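If you prefer the CLI over the Cloud Console, a node pool like this can be created with gcloud. The pool name, cluster name, and node counts below are placeholders; adjust them to your environment.

```shell
$ gcloud container node-pools create ipp-nodepool \
    --cluster=YOUR_CLUSTER \
    --enable-autoscaling --min-nodes=0 --max-nodes=10 \
    --node-labels=ipp=true \
    --node-taints=ipp=true:NoSchedule
```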

Then check out the code (v0.1.1) from the repo:

$ git clone https://github.com/AkihiroSuda/instance-per-pod.git
$ cd instance-per-pod
$ git checkout v0.1.1

The Webhook can be installed to the cluster as follows:

$ IMAGE=gcr.io/YOUR_GOOGLE_CLOUD_PROJECT/ipp:0.1.1
$ docker build -t $IMAGE . && docker push $IMAGE
$ ./ipp.yaml.sh $IMAGE | kubectl apply -f -

To create a pod with the Webhook enabled, you need to set the ipp-class label to an arbitrary unique string. Note that pods with the same label value can get co-located on the same node.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: foo
  labels:
    app: foo
    ipp-class: class0
spec:
  selector:
    matchLabels:
      app: foo
  template:
    metadata:
      labels:
        app: foo
        ipp-class: class0
    spec:
      containers:
      - name: nginx
        image: nginx:alpine

After a few tens of seconds, the pod will be launched on a newly created node.

$ kubectl get nodes
NAME STATUS ROLES AGE ...
****-default-pool-01244521-lr14 Ready <none> 12d ...
$ kubectl apply -f examples/class0.yaml
deployment.apps/foo created
$ kubectl get pods -o wide
NAME READY STATUS ... AGE ... NODE ...
foo-78f895cd4-wt2fz 1/1 Running ... 70s ... ****-ipp-nodepool-9e0ac2b5-ts6p ...
$ kubectl get nodes
NAME STATUS ROLES AGE ...
****-default-pool-01244521-lr14 Ready <none> 12d ...
****-ipp-nodepool-9e0ac2b5-ts6p Ready <none> 24s ...

In the above example, the pod foo-78f895cd4-wt2fz was launched on the node ****-ipp-nodepool-9e0ac2b5-ts6p . The overhead on pod startup latency due to node creation was 46 (= 70 − 24) seconds.

Instance-per-Pod Webhook can be uninstalled from the cluster at any time, just by deleting the MutatingWebhookConfiguration resource and the ipp-system namespace:

$ kubectl delete mutatingwebhookconfiguration ipp
$ kubectl delete namespace ipp-system

Comparison: Kata Containers

Kata Containers is an OCI runtime implementation that creates a virtual machine per pod, using QEMU, NEMU, or Firecracker.

Kata Containers might be cheaper than Instance-per-Pod Webhook with regard to IaaS fees, because multiple pods can be safely co-located on a single IaaS instance, as long as you trust the VM and the CPU. Also, its pod startup latency is only a few seconds, while Instance-per-Pod Webhook requires a few tens of seconds.

However, Kata Containers requires either bare-metal instances (e.g. EC2 i3.metal) or instances that support nested virtualization (e.g. Azure Dv3, or Google Compute Engine with a special license). Not all IaaS providers offer such instances. Also, nested virtualization can be slow.
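For comparison, running a pod under Kata Containers is typically opted into via a RuntimeClass rather than a webhook. The sketch below assumes a node whose container runtime is already configured with a Kata handler named kata; the apiVersion and handler name may vary with your cluster and Kata setup.

```yaml
apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: kata
handler: kata        # must match the handler configured in the container runtime
---
apiVersion: v1
kind: Pod
metadata:
  name: nginx-kata
spec:
  runtimeClassName: kata   # run this pod inside a Kata VM
  containers:
  - name: nginx
    image: nginx:alpine
```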

Comparison: EKS-on-Fargate

Coincidentally, Amazon recently announced a commercial product that is very similar to Instance-per-Pod Webhook: “Amazon EKS on AWS Fargate”.

EKS-on-Fargate seems to be implemented as a Mutating Admission Webhook that injects a custom schedulerName into Pod manifests, while Instance-per-Pod Webhook injects custom tolerations, nodeAffinity, and podAntiAffinity without replacing the scheduler.
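In other words, a pod mutated for Fargate ends up with something like the following. The scheduler name shown here is the one EKS-on-Fargate appears to inject; it is an observation, not a documented contract.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: foo
spec:
  schedulerName: fargate-scheduler   # injected by the EKS-on-Fargate webhook
  containers:
  - name: nginx
    image: nginx:alpine
```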

The current version of EKS-on-Fargate seems to use Xen-based virtualization infrastructure, which is probably the same as the plain old EC2 infrastructure (though the instances don’t show up in the EC2 Console). Starting a pod takes several tens of seconds, as with starting an EC2 instance, but this will probably improve when they migrate to the new Firecracker-based infrastructure in the near future.

EKS-on-Fargate will use Firecracker, but not now

Although EKS-on-Fargate strongly isolates pods using Fargate instances, it doesn’t support privileged pods for some reason, nor does it support DaemonSets yet. Instance-per-Pod Webhook, on the other hand, works well with any kind of pod.

Wrap-up

Instance-per-Pod Webhook strongly isolates pods by creating a dedicated IaaS instance per pod, and it works on any cluster on any cloud.

Comparison (green=pros; red=cons)

We’re hiring!

NTT is looking for engineers who work in open source communities such as the Kubernetes and Docker projects. If you wish to work on such projects, please visit our recruitment page.

To learn more about NTT’s contributions to open source projects, please visit our Software Innovation Center page. We have many maintainers and contributors in several open source projects.

Our offices are located in the downtown area of Tokyo (Tamachi, Shinagawa) and Musashino.
