How to enable Kubernetes container RuntimeDefault seccomp profile for all workloads

Lachlan Evenson
8 min read · Aug 22, 2021

Photo by Filip Mroz on Unsplash

Kubernetes v1.22 shipped with a new feature in alpha that provides a way to use the container RuntimeDefault as the default seccomp profile for all workloads. At this point you might be asking, “What are RuntimeDefaults and why should I care?” By default, when Kubernetes makes a call to the container runtime to create a container it provides a seccomp profile of Unconfined. This means that seccomp filtering is disabled. This is the default Kubernetes behavior as it ensures maximum application compatibility at the risk of leaving a larger surface area of the Linux kernel open to exploit by a compromised container.

Before we take a deeper look, I would like to say thank you to Duffie Cooley for taking the time to educate me on this topic. I was inspired to look deeper after watching “This Week in Cloud Native” which Duffie hosts. You should check it out.

NOTE: Kubernetes features that are in alpha aren’t typically available on managed Kubernetes services like AKS, EKS, and GKE for stability reasons. Please check your provider’s documentation.

Seccomp profile what?

Seccomp (Secure Computing) is a feature in the Linux kernel that allows a userspace program to create syscall filters. In the context of containers, these syscall filters are collated into seccomp profiles that can be used to restrict which syscalls and arguments are permitted. Applying seccomp profiles to containers reduces the chance that a Linux kernel vulnerability will be exploited.
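To see the kernel's own view of this, a Linux process can read the Seccomp field of /proc/self/status, where 0 means disabled, 1 means strict mode and 2 means filter mode. The snippet below is a minimal sketch and not part of the original walkthrough; the function names seccomp_mode_name and seccomp_mode_of are made up for illustration.

```shell
# Map the numeric Seccomp field from /proc/<pid>/status to a readable name.
seccomp_mode_name() {
  case "$1" in
    0) echo "disabled" ;;
    1) echo "strict" ;;
    2) echo "filter" ;;
    *) echo "unknown" ;;
  esac
}

# Print the seccomp mode recorded in a status file.
# Defaults to /proc/self/status; the seccomp mode is inherited by child
# processes, so the awk child reports the same mode as the calling shell.
seccomp_mode_of() {
  status_file="${1:-/proc/self/status}"
  mode="$(awk '/^Seccomp:/ {print $2}' "$status_file")"
  seccomp_mode_name "$mode"
}
```

Running seccomp_mode_of with no arguments inside a container should report whether a filter is attached to the shell you are using.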

Most container runtimes ship with a default seccomp profile (the RuntimeDefault) that is applied to containers. These default profiles aim to strike a balance between a secure set of defaults and the functionality of the workload. The problem arises when these container runtimes are integrated with Kubernetes: Kubernetes explicitly sets the seccomp profile to Unconfined, which disables seccomp filtering.

Docker Desktop ships with a default seccomp profile that is used whenever you run a container directly using the Docker CLI. The default profile disables approximately 44 system calls of the 300+ currently available. At this point most other well-known container runtimes also ship with a default seccomp profile, and they are very similar to, if not the same as, the one Docker uses (I’ve only done some light research). The Docker documentation on seccomp profiles is great and I recommend reading it.

We can easily test that the default seccomp profile is being applied on Docker Desktop by running the following commands.

$ docker run -it bash bash
bash-5.1# apk add curl
fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/community/x86_64/APKINDEX.tar.gz
(1/4) Installing ca-certificates (20191127-r4)
(2/4) Installing nghttp2-libs (1.41.0-r0)
(3/4) Installing libcurl (7.78.0-r0)
(4/4) Installing curl (7.78.0-r0)
Executing busybox-1.31.1-r20.trigger
Executing ca-certificates-20191127-r4.trigger
OK: 8 MiB in 21 packages
bash-5.1# curl -LO k8s.work/amicontained
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 237 100 237 0 0 1022 0 --:--:-- --:--:-- --:--:-- 1025
100 5936k 100 5936k 0 0 4069k 0 0:00:01 0:00:01 --:--:-- 6415k
bash-5.1# chmod +x amicontained
bash-5.1# ./amicontained
Container Runtime: docker
Has Namespaces:
pid: true
user: false
AppArmor Profile: unconfined
Capabilities:
BOUNDING -> chown dac_override fowner fsetid kill setgid setuid setpcap net_bind_service net_raw sys_chroot mknod audit_write setfcap
Seccomp: filtering
Blocked Syscalls (60):
MSGRCV SYSLOG SETSID USELIB USTAT SYSFS VHANGUP PIVOT_ROOT _SYSCTL ACCT SETTIMEOFDAY MOUNT UMOUNT2 SWAPON SWAPOFF REBOOT SETHOSTNAME SETDOMAINNAME IOPL IOPERM CREATE_MODULE INIT_MODULE DELETE_MODULE GET_KERNEL_SYMS QUERY_MODULE QUOTACTL NFSSERVCTL GETPMSG PUTPMSG AFS_SYSCALL TUXCALL SECURITY LOOKUP_DCOOKIE CLOCK_SETTIME VSERVER MBIND SET_MEMPOLICY GET_MEMPOLICY KEXEC_LOAD ADD_KEY REQUEST_KEY KEYCTL MIGRATE_PAGES UNSHARE MOVE_PAGES PERF_EVENT_OPEN FANOTIFY_INIT NAME_TO_HANDLE_AT OPEN_BY_HANDLE_AT SETNS PROCESS_VM_READV PROCESS_VM_WRITEV KCMP FINIT_MODULE KEXEC_FILE_LOAD BPF USERFAULTFD PKEY_MPROTECT PKEY_ALLOC PKEY_FREE
Looking for Docker.sock

In the output above you can see the following lines, which show that seccomp filtering is enabled and that 60 syscalls are being blocked:

Seccomp: filtering
Blocked Syscalls (60):
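If you want to compare runs without reading the full output, those two lines can be pulled out with a small filter. This is a sketch that assumes the output format shown in this post; seccomp_summary is a hypothetical helper name, not something amicontained provides.

```shell
# Read amicontained output on stdin and print a one-line summary of the
# seccomp status and the number of blocked syscalls.
seccomp_summary() {
  awk '
    /^[ \t]*Seccomp:/ { status = $2 }
    /^[ \t]*Blocked Syscalls \(/ {
      count = $3
      gsub(/[^0-9]/, "", count)   # "(60):" becomes "60"
    }
    END { printf "seccomp=%s blocked=%s\n", status, count }
  '
}
```

Piping the test through it, for example ./amicontained | seccomp_summary, gives one line per run that is easy to diff across environments.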

Here are the commands that I used to run the test; you can use them to try it yourself. Massive thanks to Duffie for providing me with these commands.

docker run -it bash bash
apk add curl
curl -LO k8s.work/amicontained
chmod +x amicontained
./amicontained

Configuring Kubernetes RuntimeDefault

Now that we know all about seccomp profiles and RuntimeDefault, let’s take a look at how we can configure Kubernetes to use the RuntimeDefault seccomp profile rather than Unconfined. First, I would like to demonstrate how a default Kubernetes cluster operates without this new feature enabled. We are going to prove that Kubernetes sets the seccomp profile to Unconfined.

We will need a Kubernetes cluster to run the tests. I use Kind to quickly and easily spin up a Kubernetes cluster locally. Create the following config file called kind-config.yaml.

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker

Create a Kubernetes cluster using Kind.

$ kind create cluster --image=kindest/node:v1.22.0@sha256:b8bda84bb3a190e6e028b1760d277454a72267a5454b57db34437c34a588d047 --config kind-config.yaml
Creating cluster "kind" ...
✓ Ensuring node image (kindest/node:v1.22.0) 🖼
✓ Preparing nodes 📦 📦
✓ Writing configuration 📜
✓ Starting control-plane 🕹️
✓ Installing CNI 🔌
✓ Installing StorageClass 💾
✓ Joining worker nodes 🚜
Set kubectl context to "kind-kind"
You can now use your cluster with:
kubectl cluster-info --context kind-kind
Thanks for using kind!

Let’s again run the same test as we did in the previous section but this time on the Kubernetes cluster.

$ kubectl run -it bash --image=bash --restart=Never bash
If you don't see a command prompt, try pressing enter.
bash-5.1# apk add curl
fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/community/x86_64/APKINDEX.tar.gz
(1/4) Installing ca-certificates (20191127-r4)
(2/4) Installing nghttp2-libs (1.41.0-r0)
(3/4) Installing libcurl (7.78.0-r0)
(4/4) Installing curl (7.78.0-r0)
Executing busybox-1.31.1-r20.trigger
Executing ca-certificates-20191127-r4.trigger
OK: 8 MiB in 21 packages
bash-5.1# curl -LO k8s.work/amicontained
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 237 100 237 0 0 1032 0 --:--:-- --:--:-- --:--:-- 1034
100 5936k 100 5936k 0 0 4580k 0 0:00:01 0:00:01 --:--:-- 16.6M
bash-5.1# chmod +x amicontained
bash-5.1# ./amicontained
Container Runtime: docker
Has Namespaces:
pid: true
user: false
AppArmor Profile: unconfined
Capabilities:
BOUNDING -> chown dac_override fowner fsetid kill setgid setuid setpcap net_bind_service net_raw sys_chroot mknod audit_write setfcap
Seccomp: disabled
Blocked Syscalls (21):
MSGRCV SYSLOG SETSID VHANGUP PIVOT_ROOT ACCT SETTIMEOFDAY SWAPON SWAPOFF REBOOT SETHOSTNAME SETDOMAINNAME INIT_MODULE DELETE_MODULE LOOKUP_DCOOKIE KEXEC_LOAD FANOTIFY_INIT OPEN_BY_HANDLE_AT FINIT_MODULE KEXEC_FILE_LOAD BPF
Looking for Docker.sock

In the output above you can see that seccomp is disabled and that 21 syscalls are being blocked; with seccomp off, those 21 are most likely denied by other mechanisms such as dropped capabilities. In comparison, the same test run on my dev machine using the Docker CLI showed seccomp filtering with 60 syscalls blocked. Now that we have a baseline, let’s create a cluster with the new RuntimeDefault feature configured.

Delete the last Kind Kubernetes cluster using the following command:

$ kind delete cluster
Deleting cluster "kind" ...

Now let’s create a new config file called kind-runtimedefault-config.yaml with the following content. Notice that we set the SeccompDefault feature gate to true and pass the seccomp-default flag to the kubelet on every node, which together enable the new feature:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  SeccompDefault: true
nodes:
  - role: control-plane
    kubeadmConfigPatches:
      - |
        kind: InitConfiguration
        nodeRegistration:
          kubeletExtraArgs:
            seccomp-default: "true"
            feature-gates: "SeccompDefault=true"
  - role: worker
    kubeadmConfigPatches:
      - |
        kind: JoinConfiguration
        nodeRegistration:
          kubeletExtraArgs:
            seccomp-default: "true"
            feature-gates: "SeccompDefault=true"

Create a Kubernetes cluster using Kind and the config file you just created:

$ kind create cluster --image=kindest/node:v1.22.0@sha256:b8bda84bb3a190e6e028b1760d277454a72267a5454b57db34437c34a588d047 --config ~/Downloads/kind-runtimedefault-config.yaml
Creating cluster "kind" ...
✓ Ensuring node image (kindest/node:v1.22.0) 🖼
✓ Preparing nodes 📦 📦
✓ Writing configuration 📜
✓ Starting control-plane 🕹️
✓ Installing CNI 🔌
✓ Installing StorageClass 💾
✓ Joining worker nodes 🚜
Set kubectl context to "kind-kind"
You can now use your cluster with:
kubectl cluster-info --context kind-kind
Have a nice day! 👋

Let’s run the same test again. This time we expect the RuntimeDefault seccomp profile to be applied.

$ kubectl run -it bash --image=bash --restart=Never bash
If you don't see a command prompt, try pressing enter.
bash-5.1# apk add curl
fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/community/x86_64/APKINDEX.tar.gz
(1/4) Installing ca-certificates (20191127-r4)
(2/4) Installing nghttp2-libs (1.41.0-r0)
(3/4) Installing libcurl (7.78.0-r0)
(4/4) Installing curl (7.78.0-r0)
Executing busybox-1.31.1-r20.trigger
Executing ca-certificates-20191127-r4.trigger
OK: 8 MiB in 21 packages
bash-5.1#
bash-5.1# curl -LO k8s.work/amicontained
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 237 100 237 0 0 916 0 --:--:-- --:--:-- --:--:-- 915
100 5936k 100 5936k 0 0 4313k 0 0:00:01 0:00:01 --:--:-- 14.8M
bash-5.1# chmod +x amicontained
bash-5.1# ./amicontained
Container Runtime: docker
Has Namespaces:
pid: true
user: false
AppArmor Profile: unconfined
Capabilities:
BOUNDING -> chown dac_override fowner fsetid kill setgid setuid setpcap net_bind_service net_raw sys_chroot mknod audit_write setfcap
Seccomp: filtering
Blocked Syscalls (61):
MSGRCV PTRACE SYSLOG SETSID USELIB USTAT SYSFS VHANGUP PIVOT_ROOT _SYSCTL ACCT SETTIMEOFDAY MOUNT UMOUNT2 SWAPON SWAPOFF REBOOT SETHOSTNAME SETDOMAINNAME IOPL IOPERM CREATE_MODULE INIT_MODULE DELETE_MODULE GET_KERNEL_SYMS QUERY_MODULE QUOTACTL NFSSERVCTL GETPMSG PUTPMSG AFS_SYSCALL TUXCALL SECURITY LOOKUP_DCOOKIE CLOCK_SETTIME VSERVER MBIND SET_MEMPOLICY GET_MEMPOLICY KEXEC_LOAD ADD_KEY REQUEST_KEY KEYCTL MIGRATE_PAGES UNSHARE MOVE_PAGES PERF_EVENT_OPEN FANOTIFY_INIT NAME_TO_HANDLE_AT OPEN_BY_HANDLE_AT SETNS PROCESS_VM_READV PROCESS_VM_WRITEV KCMP FINIT_MODULE KEXEC_FILE_LOAD BPF USERFAULTFD PKEY_MPROTECT PKEY_ALLOC PKEY_FREE
Looking for Docker.sock
bash-5.1#

In the output above you can see that seccomp is filtering and that 61 syscalls are being blocked. This validates that the RuntimeDefault seccomp profile is being successfully applied.
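Comparing this list with the one from the Docker CLI run earlier, the only extra entry in the outputs shown appears to be PTRACE, which accounts for 61 versus 60. A small helper makes that kind of comparison repeatable. This is a sketch; only_in_second is a hypothetical name, and the two variables hold only the first few entries of each list from this post, not the full output.

```shell
# Print each name that appears in the second space-separated list
# but not in the first.
only_in_second() {
  for name in $2; do
    case " $1 " in
      *" $name "*) ;;        # present in both lists
      *) echo "$name" ;;     # only in the second list
    esac
  done
}

# Excerpts of the blocked-syscall lists shown in this post.
docker_cli_excerpt="MSGRCV SYSLOG SETSID USELIB"
kubernetes_excerpt="MSGRCV PTRACE SYSLOG SETSID USELIB"

only_in_second "$docker_cli_excerpt" "$kubernetes_excerpt"   # prints PTRACE
```

Pasting the full lists into the two variables lets you repeat the check for your own runtime and kernel.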

Rolling this out in production

Now that you know how to configure this on a Kubernetes cluster, you probably want to know how to roll it out. The Kubernetes documentation recommends enabling this feature flag on a subset of nodes and testing your workloads thoroughly before rolling it out to an entire Kubernetes cluster.
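Independently of the cluster-wide default, an individual workload can request a profile explicitly through the Pod securityContext.seccompProfile field. That gives you a way to try RuntimeDefault one workload at a time, or to set Unconfined on a workload that breaks. The walkthrough above does not use this approach, so treat the following as a sketch; seccomp_pod_manifest and the pod name are hypothetical, and the image and command are just placeholders.

```shell
# Print a Pod manifest that requests a seccomp profile type explicitly.
# $1: pod name (hypothetical, pick your own)
# $2: profile type, RuntimeDefault or Unconfined (defaults to RuntimeDefault)
seccomp_pod_manifest() {
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: $1
spec:
  securityContext:
    seccompProfile:
      type: ${2:-RuntimeDefault}
  containers:
  - name: $1
    image: bash
    command: ["sleep", "3600"]
EOF
}
```

For example, seccomp_pod_manifest seccomp-test | kubectl apply -f - would create a pod with the runtime's default profile, which you can then exec into and inspect with the same amicontained test.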

It’s also worth mentioning again that this feature is in alpha so it might be worth waiting for it to mature before enabling it in production.

The jury is out on whether we can make the syscall filter responses more application friendly, especially for new applications that use libraries depending on newer syscalls that may not yet be allowed by the RuntimeDefault seccomp profile. Currently the RuntimeDefault seccomp profile responds to the application with the equivalent of “You don’t have permission”, or EPERM, for blocked syscalls, which might include newer syscalls that haven’t yet been allowed. The other suggestion is to respond with “function not implemented”, or ENOSYS, by default for blocked syscalls so that applications using newer libraries can fall back to other syscalls. You can read more on this discussion in the following GitHub issue. Thanks to Brian Goff for educating me on the up-to-date happenings regarding seccomp profiles in container runtimes.
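The difference between the two responses is easiest to see from the caller's side. The snippet below is only a simulation of how a library wrapper might react to each error code; it makes no real syscalls, call_with_fallback is a hypothetical name, and the actual fallback behavior depends on the library. The errno values are the Linux ones (EPERM is 1, ENOSYS is 38).

```shell
EPERM=1     # "Operation not permitted"
ENOSYS=38   # "Function not implemented"

# Simulate a wrapper that tried a newer syscall and got back the errno in $1.
call_with_fallback() {
  case "$1" in
    0) echo "new syscall succeeded" ;;
    "$ENOSYS") echo "falling back to older syscall" ;;   # kernel looks too old: try the old path
    "$EPERM") echo "giving up: permission denied" ;;     # looks like a real denial: no retry
    *) echo "giving up: errno $1" ;;
  esac
}

call_with_fallback "$ENOSYS"   # prints: falling back to older syscall
call_with_fallback "$EPERM"    # prints: giving up: permission denied
```

A wrapper written this way treats EPERM as final, which is why a profile that answers EPERM for an unknown syscall can break an application that would have worked fine on the older code path.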

Conclusion

We covered a lot of ground in this post, starting with seccomp profiles and how they are used in the context of container runtimes. We then enabled a new feature that lets Kubernetes use the RuntimeDefault seccomp profile. The question now is “How much does this matter?” and that’s something I will leave you to answer. Certainly, anything you can do to reduce the attack surface area on the Linux kernel from a container is an incremental improvement on the overall security posture of the Kubernetes cluster. Let me know what you think.

Finally, I will leave you with a great resource from Duffie, “Kubernetes Seccomp Profiles: A Practical Guide”, if you are looking for a more in-depth look at seccomp profiles.
