Threat Modeling — EKS

Published in Nerd For Tech · 5 min read · Jan 7, 2022

Embedding security into the service offering

Threat modeling is the process of identifying potential threats to systems and data. By using this process, we can develop a plan to protect our systems from these threats.

At my current organization, we use AWS Service Catalog to provide products that other teams in the company consume: a consumable pattern. Because services are created from a single source, we build security into the product itself. To do so, we run a threat modeling exercise for the service in question. In this blog, I will share the slimmed-down approach that we follow.

Identify Service/App

In this step, we define the scope of the threat modeling.

For this blog, I will use an Amazon EKS cluster.

Create a Responsibility & ownership matrix

In this step, we define who is responsible for each component in question: AWS or the customer. In EKS, the control plane, Fargate, and managed nodes are managed by AWS, and the data plane is managed by us. Custom node groups, custom controllers, and mutating webhooks are all our responsibility.
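The ownership split can be kept as a small lookup table next to the product code. This is a minimal sketch that only restates the split described in this post; the `Owner` enum and the `owned_by` helper are hypothetical names introduced for illustration.

```python
from enum import Enum


class Owner(Enum):
    AWS = "AWS"
    CUSTOMER = "Customer"


# Ownership as described in this post; adjust it to your own setup.
RESPONSIBILITY = {
    "control plane": Owner.AWS,
    "Fargate": Owner.AWS,
    "managed nodes": Owner.AWS,
    "data plane": Owner.CUSTOMER,
    "custom node groups": Owner.CUSTOMER,
    "custom controllers": Owner.CUSTOMER,
    "mutating webhooks": Owner.CUSTOMER,
}


def owned_by(owner):
    """Return the components a given party is responsible for, sorted by name."""
    return sorted(name for name, who in RESPONSIBILITY.items() if who is owner)


print(owned_by(Owner.CUSTOMER))
```

Everything returned for `Owner.CUSTOMER` is in scope for the rest of the exercise.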

Identify Admin APIs

CreateNodegroup
DeleteNodegroup
UpdateNodegroupConfig
UpdateNodegroupVersion
CreateFargateProfile
DeleteFargateProfile
CreateCluster
DeleteCluster
UpdateClusterConfig
UpdateClusterVersion
AssociateIdentityProviderConfig
DisassociateIdentityProviderConfig
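One way to use this list is to generate a deny policy (for example, an SCP) that blocks these calls for everyone except an admin role. The sketch below only builds the JSON document. The role ARN pattern is a made-up placeholder, and whether an SCP or an IAM policy is the right place for this is an assumption to validate for your own accounts.

```python
import json

# The EKS admin APIs identified above.
ADMIN_ACTIONS = [
    "CreateNodegroup", "DeleteNodegroup",
    "UpdateNodegroupConfig", "UpdateNodegroupVersion",
    "CreateFargateProfile", "DeleteFargateProfile",
    "CreateCluster", "DeleteCluster",
    "UpdateClusterConfig", "UpdateClusterVersion",
    "AssociateIdentityProviderConfig", "DisassociateIdentityProviderConfig",
]


def deny_admin_apis_except(admin_role_arn_pattern):
    """Build a policy that denies the admin APIs to every principal
    whose ARN does not match the given pattern."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyEksAdminApisOutsideAdminRole",
            "Effect": "Deny",
            "Action": [f"eks:{name}" for name in ADMIN_ACTIONS],
            "Resource": "*",
            "Condition": {
                "ArnNotLike": {"aws:PrincipalArn": admin_role_arn_pattern}
            },
        }],
    }


# Hypothetical role name, for illustration only.
policy = deny_admin_apis_except("arn:aws:iam::*:role/eks-platform-admin")
print(json.dumps(policy, indent=2))
```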

Identify Access/Entry Points

In this step, we find all entry points of the service in question.

API Endpoint
SSH to Host
Access to Container by kubelet
Access to cluster by launching rogue containers

Identify Exfiltration Points

In this step, we find all the ways data can be moved out of the service.

Uploading artifacts to an outside or unintended system (S3)
Pulling images from ECR and pushing them to another ECR registry or Artifactory
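As an illustration of a control for the S3 exfiltration point, the sketch below builds a deny statement that blocks uploads to buckets owned by accounts outside a trusted list, using the `s3:ResourceAccount` condition key. The account ID is the placeholder used in AWS documentation, and the function name is hypothetical. It is one possible guardrail under those assumptions, not necessarily the one we use.

```python
import json


def deny_s3_writes_outside(trusted_account_ids):
    """Deny s3:PutObject when the target bucket is owned by an
    account that is not in the trusted list."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyS3WritesToUntrustedAccounts",
            "Effect": "Deny",
            "Action": ["s3:PutObject"],
            "Resource": "*",
            "Condition": {
                "StringNotEquals": {
                    "s3:ResourceAccount": sorted(trusted_account_ids)
                }
            },
        }],
    }


# 111122223333 is a documentation placeholder, not a real account.
print(json.dumps(deny_s3_writes_outside({"111122223333"}), indent=2))
```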

Identify places where you can put controls

Based on the above two steps we know all the places where we need to put a control mechanism.

Dataflow diagram for EKS (Security View)

With the data flow diagram, we can summarize the last three steps. In the diagram, I have marked all places where we can put a control with a red line. In this case, these controls can be an SCP, a security group, aws-auth, a role, endpoint access points, or OS mounts.

Identify Actors

In this step, we identify everyone and everything that can cause a security concern in the service.

External Attackers
Malicious Containers (Tampered image)
Vulnerable third party packages
Malicious User / Stolen Credentials
Misuse of legitimate privileges

Identify Threats (What can go wrong?)

In this step, we find the threats associated with the actors. This is basically what an actor can do in the service.

People who have no access to the cluster may be able to reach the applications running on it and/or the management port(s) over a network
An attacker has access to a single container and would like to expand their access to take over the whole cluster
An attacker has valid credentials to execute commands against the Kubernetes API, as well as network access to the port
If a user has access to the Admin APIs, they can modify the cluster configuration

These threats are then mapped to a threat classification framework such as STRIDE: Spoofing, Tampering, Repudiation, Information Disclosure, Denial of Service, and Elevation of Privilege.
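The mapping can be recorded as data so that unaddressed categories stand out. In the sketch below, the STRIDE categories assigned to each threat are my own illustrative guesses, not an authoritative classification, and `uncovered_categories` is a hypothetical helper.

```python
from enum import Enum


class Stride(Enum):
    SPOOFING = "Spoofing"
    TAMPERING = "Tampering"
    REPUDIATION = "Repudiation"
    INFORMATION_DISCLOSURE = "Information Disclosure"
    DENIAL_OF_SERVICE = "Denial of Service"
    ELEVATION_OF_PRIVILEGE = "Elevation of Privilege"


# Illustrative mapping only; a real exercise would debate each entry.
THREATS = {
    "network access to apps or management ports without cluster access":
        {Stride.INFORMATION_DISCLOSURE, Stride.DENIAL_OF_SERVICE},
    "single container compromise expanding to the whole cluster":
        {Stride.ELEVATION_OF_PRIVILEGE},
    "valid credentials used against the Kubernetes API":
        {Stride.SPOOFING, Stride.TAMPERING},
    "Admin API access used to modify cluster configuration":
        {Stride.TAMPERING},
}


def uncovered_categories(threats):
    """Return STRIDE categories that no listed threat maps to."""
    covered = set().union(*threats.values())
    return [category.value for category in Stride if category not in covered]


print(uncovered_categories(THREATS))
```

A category that shows up as uncovered is a prompt to ask whether a threat was missed, not proof that the category does not apply.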

Identify the Impact of the threat

In this step, we evaluate the impact of the action performed by various actors.

Data destruction (deleting configurations, storage)
Resource Hijacking (running digital currency mining)
Denial of service (makes the service unavailable)
Disruption of service

Identify Persistence of breach

In this step, we work out how, and for how long, the effect of that action can persist in the system.

Backdoor container
Writable Hostpath (creating a cron job on the host)
Kubernetes Cronjob (scheduled pods in the cluster)
Malicious Admission controller
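The writable hostPath case is easy to check mechanically. The following sketch scans a pod manifest (as a plain dict, in the shape of a Kubernetes Pod object) for hostPath volumes that are mounted without `readOnly: true`. The function name and the sample pod are hypothetical.

```python
def writable_host_paths(pod):
    """Return (container name, host path) pairs for hostPath volumes
    that a container mounts without readOnly set to true."""
    spec = pod.get("spec", {})
    host_volumes = {
        volume["name"]: volume["hostPath"]["path"]
        for volume in spec.get("volumes", [])
        if "hostPath" in volume
    }
    findings = []
    containers = spec.get("initContainers", []) + spec.get("containers", [])
    for container in containers:
        for mount in container.get("volumeMounts", []):
            is_host_volume = mount["name"] in host_volumes
            if is_host_volume and not mount.get("readOnly", False):
                findings.append((container["name"], host_volumes[mount["name"]]))
    return findings


# Hypothetical pod used only to exercise the check.
sample_pod = {
    "spec": {
        "volumes": [
            {"name": "cron", "hostPath": {"path": "/etc/cron.d"}},
            {"name": "logs", "hostPath": {"path": "/var/log"}},
            {"name": "scratch", "emptyDir": {}},
        ],
        "containers": [{
            "name": "app",
            "volumeMounts": [
                {"name": "cron", "mountPath": "/host-cron"},
                {"name": "logs", "mountPath": "/host-logs", "readOnly": True},
                {"name": "scratch", "mountPath": "/tmp/work"},
            ],
        }],
    }
}

print(writable_host_paths(sample_pod))
```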

Identify Lateral movements

In this step, we assess all the touchpoints the malicious actor can reach. This includes systems and services outside the scope of the service being evaluated.

Access cloud services
Application Credentials in container
Kubernetes secrets
CoreDNS poisoning
Writable volume mounts in the host node

Identify Preventive measures

In this step we find the preventive controls to mitigate the threats identified.

Ensure that management services (API server, kubelet) are not exposed to untrusted networks without authentication controls in place
- API Server Authentication
- API Server Authorisation
- Kubelet Authentication
Ensure that service accounts are either not mounted in containers or have restricted rights (i.e. not cluster-admin)
RBAC, IRSA, Pod Security Policy
Separate security groups for control plane & workers
Calico Network Policies
Private registry authentication
Ensure Admin APIs are granted only to admin roles, and that those roles can be assumed only by specific entities from the corporate network.
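The service account check above can be automated against the output of `kubectl get clusterrolebindings -o json`. This is a minimal sketch that works on the parsed JSON. The function name and the sample data are hypothetical, and it only looks at ClusterRoleBindings, so namespaced RoleBindings would need a similar pass.

```python
def cluster_admin_service_accounts(bindings):
    """Return 'namespace/name' for every service account bound to the
    cluster-admin ClusterRole in a ClusterRoleBinding list."""
    flagged = []
    for binding in bindings.get("items", []):
        if binding.get("roleRef", {}).get("name") != "cluster-admin":
            continue
        # 'subjects' can be missing or null on a binding.
        for subject in binding.get("subjects") or []:
            if subject.get("kind") == "ServiceAccount":
                flagged.append(f'{subject.get("namespace", "")}/{subject["name"]}')
    return flagged


# Hypothetical data in the shape kubectl returns.
sample = {
    "items": [
        {
            "roleRef": {"kind": "ClusterRole", "name": "cluster-admin"},
            "subjects": [
                {"kind": "ServiceAccount", "name": "ci-deployer", "namespace": "build"},
                {"kind": "Group", "name": "system:masters"},
            ],
        },
        {
            "roleRef": {"kind": "ClusterRole", "name": "view"},
            "subjects": [
                {"kind": "ServiceAccount", "name": "dashboard", "namespace": "ops"},
            ],
        },
        {"roleRef": {"kind": "ClusterRole", "name": "cluster-admin"}, "subjects": None},
    ]
}

print(cluster_admin_service_accounts(sample))
```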

Generic preventive measures to implement

Input validation, Authentication, Session handling, and contextual bound handling

Create a controls matrix

Once we have the controls, we create a controls matrix that maps each threat to all the controls for that threat. We separate the controls into categories as shown below. The matrix needs to be reviewed and approved by a security SME.

Directive
- Put configuration in the product against the threats.
- Calico Network policy
Preventive
- Instead of enabling SSH access, use SSM Session Manager when you need to remote into a host.
Detective
- Event-based detection of misconfiguration (AWS Config & Custom Security tool)
- Periodically run kube-bench to verify compliance with the CIS Amazon EKS Benchmark
- Periodically use Amazon Inspector to assess hosts for exposure, vulnerabilities, and deviations from best practices
- Periodically scan your container images
Remediative
- Incident response plan
Corrective
- Iterative Hardening
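Keeping the matrix as data makes it easy to spot threats that lack a control in a given category before the SME review. The sketch below is an illustration: the pairing of threats with controls is an assumption made for the example, and `control_gaps` is a hypothetical helper.

```python
CATEGORIES = ("directive", "preventive", "detective", "remediative", "corrective")

# Illustrative entries only; the threat-to-control pairing is assumed.
CONTROLS_MATRIX = {
    "SSH access to worker nodes": {
        "preventive": ["Use SSM Session Manager instead of SSH"],
        "detective": ["Amazon Inspector host assessment"],
    },
    "Tampered container image": {
        "detective": ["Periodic container image scanning"],
    },
    "Cluster misconfiguration": {
        "directive": ["Secure configuration built into the product"],
        "detective": ["AWS Config rules", "kube-bench runs"],
        "remediative": ["Incident response plan"],
    },
}


def control_gaps(matrix, required=("preventive", "detective")):
    """Return, per threat, the required categories that have no control."""
    gaps = {}
    for threat, controls in matrix.items():
        missing = [category for category in required if not controls.get(category)]
        if missing:
            gaps[threat] = missing
    return gaps


print(control_gaps(CONTROLS_MATRIX))
```

A reported gap is either a control to add or a candidate for the exemption list.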

Create an Incident Response plan

This is a runbook or SOP for each threat, describing how to deal with it if the identified threat materializes.

Identify the offending pod and worker node (by worker node, by deployment, by label, or by service account name)
Isolate the pod/node (Network policy)
Revoke temporary security credentials assigned to pod/node
Cordon the worker node
Enable termination protection on the impacted worker node
Capture volatile artifacts on the worker node (OS memory, netstat tree dump)
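Two of these steps can be prepared ahead of time as templates. The sketch below builds a NetworkPolicy that selects pods by label and allows no ingress or egress, and an IAM policy that denies all actions for credentials issued before a cutoff time, using the `aws:TokenIssueTime` condition key. The label, namespace, and function names are hypothetical. Note that a NetworkPolicy is only enforced if the cluster runs a network policy engine such as Calico.

```python
import json
from datetime import datetime, timezone


def isolation_policy(namespace, label_key, label_value):
    """NetworkPolicy that selects the labelled pods and, by listing both
    policy types with no rules, allows no ingress or egress traffic."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "isolate-compromised-pod", "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": {label_key: label_value}},
            "policyTypes": ["Ingress", "Egress"],
        },
    }


def revoke_sessions_policy(cutoff_utc):
    """IAM policy that denies every action for temporary credentials
    issued before the given UTC time."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Deny",
            "Action": "*",
            "Resource": "*",
            "Condition": {
                "DateLessThan": {
                    "aws:TokenIssueTime": cutoff_utc.strftime("%Y-%m-%dT%H:%M:%SZ")
                }
            },
        }],
    }


# Hypothetical label and namespace; label the offending pod first.
print(json.dumps(isolation_policy("payments", "quarantine", "true"), indent=2))
print(json.dumps(revoke_sessions_policy(datetime.now(timezone.utc)), indent=2))
```

The revoke policy would be attached to the role used by the pod or node, so that credentials obtained before the cutoff stop working while newly issued ones are unaffected.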

If a measure is not implemented, or is exempted because some other mechanism prevents the misuse, get approval from the SME and record it in a separate exemption list.

This practice with detailed documentation helps in creating and distributing a secure consumable.

Happy Reading !!

Note: All views are personal. No endorsement from current or previous organizations.