Security Challenges with Kubernetes

Shashank Jain
11 min read · Jan 1, 2019

The motivation behind this blog is to highlight the security aspects of building, deploying, and running secure containers with Kubernetes as the orchestration engine. These aspects cover what ownership a container runs under and what privileges it should be allowed in terms of file system access, port access, and capabilities, so that the whole environment runs securely and the chances of a breach are limited. Because Docker as a container runtime shares the host kernel, kernel vulnerabilities found in the wild increase the attack surface tremendously. To mitigate this, the approach taken follows the defense-in-depth principle, which entails having different layers of security around the workloads. So the document is not limited to applying static security checks; it also proposes runtime measures that can be applied to harden the security perimeter.

K8s cluster operator security perspective:

From a cluster-operations point of view, the default setup should be hardened. This includes what the cluster's security looks like from a network perspective: how are the tenants isolated at the network level, and with what privileges can a containerized workload run? Keeping this in mind, and using the concept of pod security policies, certain aspects can be hardened as illustrated below.

Pod Security Policies: Kubernetes v1.13 [beta]

a. Read-only rootfs — The root file system is the set of binaries and userspace utilities needed to interface with the kernel. If the rootfs is writable, a container image can download and install binaries at runtime. Since it is hard to check at runtime which binaries have been downloaded, the recommended approach is to disallow any further installation at runtime by making the rootfs read-only. This also fits well with the notion of immutable infrastructure. LinuxKit, as an example, provides a read-only rootfs.
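A minimal sketch of such a PodSecurityPolicy (the name and the volume whitelist are illustrative; the remaining rule blocks are required by the PSP schema):

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted-psp           # illustrative name
spec:
  readOnlyRootFilesystem: true   # reject containers asking for a writable rootfs
  runAsUser:
    rule: MustRunAsNonRoot
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  volumes:                       # writable scratch space stays possible via emptyDir
  - configMap
  - secret
  - emptyDir

Pods that genuinely need a writable path (e.g., /tmp) can mount an emptyDir volume there instead of writing to the rootfs.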

b. Host port access disallowed — Several processes run on the host, like the kubelet and the Docker daemon (on a domain socket), which can be subject to attacks from within the pods. The idea here is to block access to such ports.

c. Capabilities restriction — Assess the set of Linux capabilities a pod actually needs and define that set as part of the PSP. A service operator can then choose to apply capabilities from within that set; anything beyond it should be blocked. Another aspect to consider is that with user namespaces, certain capabilities take on a different meaning inside the user namespace. As an example, a process may have no capability to mount outside the namespace, yet hold it for selected file systems within the user namespace.
A popular exploit, Shocker, showed that a container with the CAP_DAC_READ_SEARCH capability could use the open_by_handle_at system call to access privileged file data on the host (http://stealth.openwall.net/xSports/shocker.c). Since kernel 4.14, file system capabilities are scoped to the user namespace, so a capability set on a file within a user namespace does not hold if the same file is executed from the host.
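The corresponding PSP fields, sketched as a fragment to merge into the policy above (the allowed set shown is illustrative; a real service may need a different one):

spec:
  requiredDropCapabilities:   # dropped from every container unconditionally
  - ALL
  allowedCapabilities:        # the only capabilities a container may add back
  - NET_BIND_SERVICE
  defaultAddCapabilities: []  # nothing is granted implicitly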

d. Seccomp profiling — A service pod should have only a certain set of system calls enabled, to reduce the attack vector arising from access to the rest. A classic example is disabling the ptrace system call from within the pod. This reduces the attack surface tremendously and is also the basis of mechanisms like gVisor or Nabla containers, whose whole idea is to shrink this surface. As an example, one known CVE exploited the waitid system call, which allowed writing data into kernel memory and thereby attaining capabilities such as cap_net_admin. Other reported CVEs fixed by using seccomp filters are listed below, followed by a sample configuration.

CVE-2014-4699 — A bug in ptrace allowed privilege escalation. Fixed by disallowing the ptrace system call via seccomp.

CVE-2014-9529 — Crafted keyctl() calls could cause DoS and memory corruption. Again fixed by disallowing the system call via seccomp.

CVE-2016-0728 — Using keyctl() in a crafty way could lead to privilege escalation. Seccomp filters to the rescue.
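In Kubernetes v1.13, seccomp is wired into the PSP through annotations rather than a first-class field; a sketch (the profile names are the defaults shipped with Docker/containerd):

metadata:
  annotations:
    # profiles pods are permitted to request
    seccomp.security.alpha.kubernetes.io/allowedProfileNames: 'runtime/default,docker/default'
    # profile applied when a pod specifies none
    seccomp.security.alpha.kubernetes.io/defaultProfileName: 'runtime/default'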

e. Apparmor/SELinux/Landlock — These are mechanisms to enforce Mandatory Access Control (MAC) via the Linux Security Modules, on top of the DAC mechanisms of the Linux OS. Evaluate whether such fine-grained access control is needed: it lets us enforce privileges based on kernel objects, for instance allowing only one specific process to open a file while every other process, even with the same ownership, cannot; DAC based purely on file ownership would not discriminate between the two processes. This allows much finer-grained sandboxing. As an example, a process that manages to elevate its privileges will still be denied access to specific kernel objects like files or network ports. The LSM policies are still enforced at the host level; work is in progress in the community to scope them to namespaces. Approaches like namespaces, seccomp, SELinux, and Landlock can be compared on parameters like how fine-grained the security is, where the policies can be embedded (at the host level or on containers), and whether they can be applied only by privileged users or by unprivileged users as well.

Fine-grained approaches like SELinux and Landlock have to be evaluated and applied where we have very specific use cases, for example allowing only the kubelet to access the kubeconfig file, or only the kubelet to access certain ports.

CVEs like https://github.com/mirchr/security-research/blob/master/vulnerabilities/CVE-2018-19788.sh, found in software like PolicyKit, allow privilege escalation (on most Linux operating systems, a low-privileged user account with a UID greater than 2147483647 can execute any systemctl command unauthorized). With SELinux such users can be confined and a breach avoided.

There may be overlap between the mechanisms mentioned: seccomp, capabilities, and MAC mechanisms like Apparmor/SELinux. An example is a mount call, which would be disabled by the Apparmor profile, blocked by the seccomp syscall filter, and additionally denied because the cap_sys_admin capability is not allowed. This fits well into the principle of defense in depth, as there are multiple layers of protection and a breach of one still means the others have to be breached.
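AppArmor is controlled the same way as seccomp, via annotations on the PSP; a sketch that pins pods to the runtime's default profile:

metadata:
  annotations:
    # only the container runtime's default AppArmor profile may be used
    apparmor.security.beta.kubernetes.io/allowedProfileNames: 'runtime/default'
    # applied when a pod specifies nothing
    apparmor.security.beta.kubernetes.io/defaultProfileName: 'runtime/default'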

a. Not allow usage of host namespaces — Kubernetes allows creating a pod within the host's namespaces. Service pods should not be allowed access to the host network namespace. If the cap_net_admin capability is allowed in the PSP, its combination with a pod in the host network namespace can lead to a situation where the pod can not only enumerate ports on the host network but also intercept and capture traffic on the host.

b. AllowHostPaths — Disallow mounting of host directories like /, /dev, or the Docker daemon's domain socket, which can otherwise provide backdoors and privilege-escalation paths to pods.

c. Run as non-root (possibly explore user namespaces) — How does the rootfs ownership get impacted if the container is run as non-root? One challenge with user namespaces (if the same host user is mapped onto multiple containers) is that some resource limits are not scoped to the user namespace. Limits like the following

1. Pending signals

2. Max user processes

3. Max FDs per user

can be exploited by one container to cause a DoS on the other containers.
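Pulling items a–c together, a PSP fragment for these controls (to be merged into the policy sketched earlier):

spec:
  hostNetwork: false     # no pod may join the host network namespace
  hostPID: false         # nor the host PID namespace
  hostIPC: false         # nor the host IPC namespace
  hostPorts: []          # no host ports may be bound
  runAsUser:
    rule: MustRunAsNonRoot
  volumes:               # hostPath is deliberately absent, so /, /dev,
  - configMap            # or /var/run/docker.sock cannot be mounted
  - secret
  - emptyDir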

Other reported CVEs involving user namespaces:

CVE-2013-1956, -1957, -1958, -1959, -1979; CVE-2014-4014, -5206, -5207, -7970, -7975; CVE-2015-2925, -8543; CVE-2016-3134, -3135.

These are mitigated by using a seccomp profile to prevent processes inside a namespace from creating nested namespaces.

Sample workflow for applying the Pod security policy within the cluster/namespace

1. Define a PodSecurityPolicy (PSP) like the ones above.

2. Create a role that grants use of the above PSP.

3. Create a service account for Service Fabrik.

4. Create a ClusterRoleBinding for Service Fabrik to bind the above role to the service account.

First create the PSP resource in the cluster; any pods then provisioned within the cluster (across namespaces) will have the policy applied to them. This can also be fine-grained at the namespace level if needed. A sketch of the workflow follows.
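A sketch of steps 2–4, assuming the PSP above is named restricted-psp and the service account lives in a namespace called service-fabrik (both names are illustrative):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: psp-restricted
rules:
- apiGroups: ['policy']
  resources: ['podsecuritypolicies']
  resourceNames: ['restricted-psp']
  verbs: ['use']          # 'use' is the verb the PSP admission controller checks
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: service-fabrik
  namespace: service-fabrik
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: psp-restricted-service-fabrik
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: psp-restricted
subjects:
- kind: ServiceAccount
  name: service-fabrik
  namespace: service-fabrik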

Applying the Security Contexts in k8s

Using the above constructs, which may or may not be supported by K8s and Docker, how can we restrict the access given to sidecar/main processes?

Like:

(1) Allow only certain system calls (whitelisting).

(2) Only allow network calls.

(3) Allow/control access to container directories like mounts, /dev, and /proc.

K8s provides the notion of Security Contexts.

A security context is defined as part of the pod metadata and describes the context a pod should operate under. The contexts are evaluated against the pod security policies defined. The following aspects are covered as part of security contexts:

d. Linux capabilities — The PSP governs which capabilities are allowed. If user namespaces are allowed, certain capabilities can be given to containers to use within the container's own namespaces.

e. Seccomp.

f. AppArmor/SELinux

g. Other Linux Security Modules like Landlock — Landlock uses eBPF to apply fine-grained security checks on kernel objects like files, unlike seccomp, which restricts checks to system calls. This allows building a sandbox around file system access, which means it can be used to restrict containers from accessing files not owned by the container.
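A sketch of a pod carrying such a security context (the image name and numeric IDs are illustrative; seccomp still rides on an annotation in v1.13):

apiVersion: v1
kind: Pod
metadata:
  name: hardened-demo
  annotations:
    seccomp.security.alpha.kubernetes.io/pod: 'runtime/default'
spec:
  securityContext:             # pod-level: applies to all containers
    runAsNonRoot: true
    runAsUser: 10001
  containers:
  - name: app
    image: myorg/service:1.0   # illustrative image
    securityContext:           # container-level overrides/additions
      readOnlyRootFilesystem: true
      allowPrivilegeEscalation: false
      capabilities:
        drop: ['ALL']
        add: ['NET_BIND_SERVICE']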

Both of the above measures, PSP and security context, are based on the assumption that the attacker has already attained some level of access within the pod by means of an exploit on the service instances.

Resource Limits and DoS possibilities

The cgroups controller used by Docker allows memory/CPU/IO accounting. This primarily protects the system from a single workload hogging resources and causing noisy-neighbour scenarios.

For the memory cgroup, the controller keeps accounting of anonymous memory as well as page cache per container. There are additions to also account for kernel memory, say in the form of TCP sockets, kernel stacks, and slab pages (which back kernel object caches). Any of these kernel resources can also put memory pressure on the node.

What would be interesting to check is whether the cgroup accounts for the memory pages used to create pipes. Since pipe buffers are memory allocated within the kernel, a container can keep creating pipes (by default with a 4k buffer) and keep that memory blocked by simply holding one end open, or even cause a DoS if the global file-descriptor limits allow it. Initial tests suggest that this memory is accounted for, and that when it exceeds the pod memory limit the container is killed and recreated. But this still needs more tests to confirm the behaviour.
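Whatever the accounting details, the per-container memory limit is the backstop; a sketch with illustrative values:

apiVersion: v1
kind: Pod
metadata:
  name: limited-demo
spec:
  containers:
  - name: app
    image: myorg/service:1.0   # illustrative image
    resources:
      requests:                # what the scheduler reserves
        memory: '128Mi'
        cpu: '250m'
      limits:                  # exceeding the memory limit gets the container OOM-killed
        memory: '256Mi'
        cpu: '500m'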

Access to instance meta-data

We have to validate whether pods/containers running on the instance can retrieve instance metadata. This can expose the VM's user data to the containers, and may even expose credentials for the instance profile.

On AWS

curl http://169.254.169.254/latest/meta-data/iam/security-credentials/default

{
  "Code" : "Success",
  "LastUpdated" : "2018-04-29T13:28:16Z",
  "Type" : "AWS-HMAC",
  "AccessKeyId" : "AAAEEEEEIIIIOOO",
  "SecretAccessKey" : "aaaaaaaa2222222233333330000aaa",
  "Token" : "a-very-long-token-that-looks-to-be-base64-encoded=",
  "Expiration" : "2018-04-29T19:50:14Z"
}

On GCP

curl "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token" -H "Metadata-Flavor: Google"

Returns

{"access_token":"ya29.c.EmZ7BgVx-bctVbU0NgY4ts9Dhg3RWbnx4AL9M6D_Qbg0J6P_6fTlphmXzsnhXJYeJa0lj7QlxdijMwOA55N3lVhiplcSmG-8m7rS4-PArfIJ8","expires_in":3458,"token_type":"Bearer"}

On OpenStack, accessing the metadata service from within the container (curl http://169.254.169.254/openstack/2012-08-10/user_data) can expose some critical information.

On AWS, though, a simple routing rule can be added to block this:

route add -host 169.254.169.254 reject

But this also prevents the host itself from accessing the metadata service. In the past we have seen systems like BOSH use this metadata service, and such a routing rule can block their normal functioning.

Is use of the iptables DOCKER-USER chain a possibility?

iptables \
  --insert DOCKER-USER \
  --destination 169.254.169.254 \
  --jump REJECT

The component which applies this should have the cap_net_admin capability. A DaemonSet can be used to apply these rules on every node.
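Alternatively, where the CNI plugin enforces NetworkPolicy, an egress rule can keep pods away from the metadata endpoint without touching host routing; a sketch that applies to all pods in the namespace it is created in:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-metadata-egress
spec:
  podSelector: {}        # select all pods in this namespace
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 169.254.169.254/32   # the link-local metadata endpoint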

Security areas around GitOps

This part covers how we should build/deploy/operate containers in general. The aim should be rootless execution end to end. Some aspects to be addressed:

1. Can we build the container images with a build container? The build container does not have to run as root. Using user namespaces to create the build container, with certain privileges like mount granted inside that user namespace, should allow creating a container image without being root on the host. This makes rootless builds part of the GitOps (CI/CD) mechanism.

A few projects are doing this already:

a. Buildah

b. Img by Jessie (https://github.com/genuinetools/img)

2. Rootless containers

3. Running the actual service workloads within rootless containers

4. What other tools do we run as part of CI/CD, like image/config scanning tools such as kube-hunter, kube-bench, or Docker Bench?

5. Handling of credentials: they should be stored encrypted in Git and decrypted during the actual build process. Tools like kubeseal (https://github.com/bitnami-labs/sealed-secrets) or similar can be recommended; a sketch follows this list.

6. Mounting the rootfs as read-only. This avoids any further download and installation at runtime and can be enforced by the pod security policy.
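As a sketch of the credentials item above, the shape of a SealedSecret as produced by kubeseal (the name and ciphertext are placeholders, not real output):

apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: db-credentials          # illustrative name
  namespace: default
spec:
  encryptedData:
    password: AgBy3i4OJSWK...   # placeholder ciphertext; real value comes from kubeseal

The encrypted blob is safe to commit to Git; the sealed-secrets controller decrypts it into a regular Secret inside the cluster.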

Security Auditing

Linux provides a mechanism to audit the system calls made by running processes. This information can be handy in determining what kinds of system calls are being used and in detecting anomalies in those accesses. As a simple example, changes to files like /etc/passwd or /etc/sudoers can generate an audit event to be acted upon.

The audit system consists of:

1. The kernel part, kauditd, which captures system calls and passes them to user space via netlink sockets.

2. auditd, the userspace daemon, which receives the audit events and can write them to persistent storage or transfer them over the network.

The audit system is not namespaced within the kernel, so it is hard to determine which container generated a specific event. In the Docker world there is a way around this via tools like https://github.com/ubercoolsec/go-audit-container, which traverse the process tree upwards and treat the containerd-shim PID as the container id, tagging the specific event with it.

Use of Intrusion detection systems

Most of the measures above try to apply the right policies and grant access only to the resources that are needed. Still, attackers might find a way to exploit a service- or OS-level vulnerability, attain access, and attempt privilege escalation. To preempt such attacks, we need the ability to determine at runtime whether malicious access is happening (like someone doing an exec into service pods), and a system that generates alerts and can possibly lock down the service to avoid any illegal access.

Examples:

1. Sysdig Falco.

2. eBPF — using tracepoints combined with eBPF to intercept the system calls that lead, for example, to shell access within containers/pods.

3. OpenVAS — a vulnerability assessment tool to determine whether some pods are running with vulnerabilities. This is an intensive operation, as it has to do a reconnaissance of the systems, which generates traffic.

4. Kube-hunter
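As an illustration of the first item, a simplified Falco-style rule (Falco ships a more complete built-in version) that alerts when a shell is spawned inside a container:

- rule: Terminal shell in container   # simplified sketch of Falco's built-in rule
  desc: Alert when a shell process starts inside a container
  condition: >
    evt.type = execve and container.id != host
    and proc.name in (bash, sh, zsh)
  output: >
    Shell spawned in a container
    (user=%user.name container=%container.id command=%proc.cmdline)
  priority: WARNING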

This was also a point at KubeCon NA 2018:

https://schd.ws/hosted_files/kccna18/1c/KubeCon%20NA%20-%20This%20year%2C%20it%27s%20about%20security%20-%2020181211.pdf

Static checks:

1. Image scanning

2. Cluster configuration checks via tools like kube-bench (https://github.com/aquasecurity/kube-bench).

Miscellaneous:

Securing the API server and other services against attacks:

(a) Denial of service for API server

Attacks:

(1) SYN flooding.

(2) Fuzzing — to determine whether some servers run with vulnerabilities like buffer overflows and can be brought down by fuzzers.

Analyze some tools available:

1. Prowler: AWS Security Best Practices Assessment, Auditing, Hardening and Forensics Readiness Tool. https://github.com/toniblyx/prowler

2. Lunar: A UNIX security auditing tool based on several security frameworks. https://github.com/lateralblast/lunar

3. Security Monkey: https://github.com/Netflix/security_monkey. We can hook this in as an operator that keeps a watch on the account and on changes happening to the policies.

4. Diffy: It allows a forensic investigator to quickly scope a compromise across cloud instances during an incident, and triage those instances for follow-up actions.

https://github.com/Netflix-Skunkworks/diffy

5. Rootkit detection — The most common rootkit functions involve hiding the attacker's malicious files, processes, or network connections, providing unauthorized access for future events (backdoors), deploying keyloggers, and deleting system logs that would reveal the attacker's presence. Rootkits can live in user space, using mechanisms like LD_PRELOAD, or in the kernel, installed via a kernel module. Although it may be extremely difficult to get a rootkit installed, there are tools which can detect whether a machine has one. One such tool is https://github.com/nbulischeck/tyton, which can be used to detect rootkits if needed.

Honeypots: An extreme idea is to keep some instances open to exploits, in a network separate from the prod instances. This can give some insight into what techniques attackers in general are applying to attack the systems.
