Threat Modeling — EKS
Embedding security into the service offering
Threat modeling is the process of identifying potential threats to systems and data. By using this process, we can develop a plan to protect our systems from these threats.
At my current organization, we use AWS Service Catalog to provide products that other teams in the company consume, a consumable pattern. Because service creation goes through this single source, we build security into the product itself. To do so, we run a threat modeling exercise for the service in question. In this blog, I will share the slimmed-down approach that we follow.
Identify Service/App
In this step, we define the scope of the threat modeling.
For this blog, I will use AWS EKS Cluster.
Create a Responsibility & ownership matrix
In this step, we define who is responsible for each component in question: AWS or the customer. In EKS, the control plane, Fargate, and managed nodes are managed by AWS, and the data plane is managed by us. Custom node groups, custom controllers, and mutating webhooks are all our responsibility.
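One way to keep the ownership matrix reviewable is to record it as data. The following is a minimal sketch that encodes the split described above; the `Owner` enum, the `OWNERSHIP` table, and the component labels are illustrative names, not an AWS artifact.

```python
from enum import Enum


class Owner(Enum):
    AWS = "AWS"
    CUSTOMER = "Customer"


# Ownership split as described in this post; the labels are illustrative.
OWNERSHIP = {
    "control plane": Owner.AWS,
    "Fargate": Owner.AWS,
    "managed node": Owner.AWS,
    "data plane": Owner.CUSTOMER,
    "custom node group": Owner.CUSTOMER,
    "custom controllers": Owner.CUSTOMER,
    "mutating webhooks": Owner.CUSTOMER,
}


def components_owned_by(owner: Owner) -> list[str]:
    """Return the components a given party must secure, sorted by name."""
    return sorted(name for name, who in OWNERSHIP.items() if who is owner)


if __name__ == "__main__":
    # Everything in this list needs a control that we build into the product.
    print(components_owned_by(Owner.CUSTOMER))
```

Keeping the matrix in one table makes it easy to review when AWS changes what a managed offering covers.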
Identify Admin APIs
- CreateNodegroup
- DeleteNodegroup
- UpdateNodegroupConfig
- UpdateNodegroupVersion
- CreateFargateProfile
- DeleteFargateProfile
- CreateCluster
- DeleteCluster
- UpdateClusterConfig
- UpdateClusterVersion
- AssociateIdentityProviderConfig
- DisassociateIdentityProviderConfig
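Each of these operations is exposed to IAM as an action with the `eks:` prefix, so the list can be turned directly into a policy. The sketch below builds a deny statement for every principal except an admin role. The function name and the role ARN pattern are hypothetical, and whether you attach the result as an SCP or in another way depends on your setup.

```python
import json

# EKS admin operations listed above; IAM exposes each one under the "eks:" prefix.
ADMIN_ACTIONS = [
    "CreateNodegroup",
    "DeleteNodegroup",
    "UpdateNodegroupConfig",
    "UpdateNodegroupVersion",
    "CreateFargateProfile",
    "DeleteFargateProfile",
    "CreateCluster",
    "DeleteCluster",
    "UpdateClusterConfig",
    "UpdateClusterVersion",
    "AssociateIdentityProviderConfig",
    "DisassociateIdentityProviderConfig",
]


def deny_admin_apis_except(admin_role_arn_pattern: str) -> dict:
    """Build a policy document that denies the admin APIs to all other principals."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyEksAdminApisForNonAdmins",
                "Effect": "Deny",
                "Action": [f"eks:{name}" for name in ADMIN_ACTIONS],
                "Resource": "*",
                # The deny applies only when the caller's ARN does NOT match.
                "Condition": {
                    "ArnNotLike": {"aws:PrincipalArn": admin_role_arn_pattern}
                },
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical role name; replace it with the admin role used in your accounts.
    print(json.dumps(deny_admin_apis_except("arn:aws:iam::*:role/eks-admin"), indent=2))
```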
Identify Access/Entry Points
In this step, we find all entry points of the service in question.
- API Endpoint
- SSH to Host
- Access to Container by kubelet
- Access to cluster by launching rogue containers
Identify Exfiltration Points
In this step, we find all the ways in which data can be moved out of the service.
- Access to upload artifacts to outside/unintended system (S3)
- Downloading images from ECR and uploading them to another ECR registry or Artifactory
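For the image path, one possible check is to compare the registry in an image reference against an allowlist of approved AWS accounts. This is a minimal sketch: the function name and account IDs are made up, and the pattern only covers the standard `amazonaws.com` ECR hostname format.

```python
import re

# Standard private ECR hostname: <account>.dkr.ecr.<region>.amazonaws.com/<repo>
# Other partitions (for example hostnames ending in .com.cn) are not covered here.
ECR_IMAGE = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
)


def is_approved_image(image: str, approved_accounts: set[str]) -> bool:
    """Return True only if the image lives in an ECR registry of an approved account."""
    match = ECR_IMAGE.match(image)
    return bool(match) and match.group("account") in approved_accounts


if __name__ == "__main__":
    approved = {"111122223333"}  # hypothetical account ID
    print(is_approved_image(
        "111122223333.dkr.ecr.eu-west-1.amazonaws.com/team/app:1.0", approved
    ))
    print(is_approved_image("docker.io/library/nginx:latest", approved))
```

The same comparison can be applied to push destinations seen in audit logs, so that an image copied to an unapproved registry shows up as a finding.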
Identify places where you can put controls
Based on the above two steps we know all the places where we need to put a control mechanism.
With the data flow diagram, we can summarize the last three steps. In the above diagram, I have marked all places where we can put a control with a red line. In this case, these controls can be an SCP, a security group, the aws-auth ConfigMap, an IAM role, endpoint access points, or OS mounts.
Identify Actors
In this step, we try to find who can cause a security concern in the service.
- External Attackers
- Malicious Containers (Tampered image)
- Vulnerable third party packages
- Malicious User / Stolen Credentials
- Misuse of legitimate privileges
Identify Threats (What can go wrong?)
In this step, we find the threats associated with the actors. This is basically what an actor can do in the service.
- People who have no access to the cluster may be able to reach the applications running on it and/or the management port(s) over a network
- An attacker has access to a single container and would like to expand their access to take over the whole cluster
- An attacker has valid credentials to execute commands against the Kubernetes API, as well as network access to the port
- A user who has access to the Admin APIs can modify the cluster configuration
These threats are then mapped to a threat classification framework such as STRIDE: Spoofing, Tampering, Repudiation, Information Disclosure, Denial of Service, and Elevation of Privilege.
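The mapping can be kept as a small table so that gaps are easy to spot. In the sketch below, the threat descriptions paraphrase the list above, and the category assigned to each threat is one possible reading chosen for illustration, not a definitive classification.

```python
from enum import Enum


class Stride(Enum):
    SPOOFING = "Spoofing"
    TAMPERING = "Tampering"
    REPUDIATION = "Repudiation"
    INFORMATION_DISCLOSURE = "Information Disclosure"
    DENIAL_OF_SERVICE = "Denial of Service"
    ELEVATION_OF_PRIVILEGE = "Elevation of Privilege"


# Illustrative assignments; a real exercise would review each one with a security SME.
THREATS = {
    "Unauthorized network access to applications or management ports": {
        Stride.INFORMATION_DISCLOSURE,
        Stride.DENIAL_OF_SERVICE,
    },
    "Single-container access expanded to the whole cluster": {
        Stride.ELEVATION_OF_PRIVILEGE,
    },
    "Valid credentials used against the Kubernetes API": {Stride.SPOOFING},
    "Admin API access used to modify cluster configuration": {Stride.TAMPERING},
}


def uncovered_categories(threats: dict[str, set[Stride]]) -> set[Stride]:
    """Return the STRIDE categories that no listed threat has been mapped to."""
    covered = set().union(*threats.values())
    return set(Stride) - covered


if __name__ == "__main__":
    # A category with no threats is a prompt to ask "what did we miss?".
    print(sorted(c.value for c in uncovered_categories(THREATS)))
```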
Identify the Impact of the threat
In this step, we evaluate the impact of the action performed by various actors.
- Data destruction (deleting configurations, storage)
- Resource Hijacking (running digital currency mining)
- Denial of service (makes the service unavailable)
- Disruption of service
Identify Persistence of breach
In this step, we try to find how the malicious action can remain present in the system.
- Backdoor container
- Writable hostPath (creating a cron job on the host)
- Kubernetes CronJob (scheduled pods in the cluster)
- Malicious Admission controller
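The writable hostPath case can be detected from a pod manifest alone. The sketch below assumes the manifest has already been loaded into a dictionary (for example from `kubectl get pod -o json`). The function name is hypothetical, and for brevity it checks regular containers only, not init containers.

```python
def writable_hostpath_mounts(pod: dict) -> list[tuple[str, str]]:
    """Return (container name, host path) pairs for hostPath volumes mounted read-write."""
    spec = pod.get("spec", {})
    # Map volume name -> host path for every hostPath volume in the pod.
    host_paths = {
        volume["name"]: volume["hostPath"]["path"]
        for volume in spec.get("volumes", [])
        if "hostPath" in volume
    }
    findings = []
    for container in spec.get("containers", []):
        for mount in container.get("volumeMounts", []):
            path = host_paths.get(mount["name"])
            # Kubernetes treats a mount as writable unless readOnly is set to true.
            if path is not None and not mount.get("readOnly", False):
                findings.append((container["name"], path))
    return findings


if __name__ == "__main__":
    sample_pod = {
        "spec": {
            "volumes": [{"name": "cron", "hostPath": {"path": "/etc/cron.d"}}],
            "containers": [
                {"name": "app", "volumeMounts": [{"name": "cron", "mountPath": "/host-cron"}]}
            ],
        }
    }
    print(writable_hostpath_mounts(sample_pod))
```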
Identify Lateral movements
In this step, we assess all the touchpoints the malicious actor can reach. This includes systems and services outside the scope of service being evaluated.
- Access cloud services
- Application Credentials in container
- Kubernetes secrets
- CoreDNS poisoning
- Writable volume mounts in the host node
Identify Preventive measures
In this step we find the preventive controls to mitigate the threats identified.
- Ensure that management services (API server, kubelet) are not exposed to untrusted networks without authentication controls in place
- API Server Authentication
- API Server Authorisation
- Kubelet Authentication
- Ensure that service accounts are either not mounted in containers or have restricted rights (i.e. not cluster-admin)
- RBAC, IRSA, Pod Security Policy
- Separate security groups for control plane & workers
- Calico Network Policies
- Private registry authentication
- Ensure Admin APIs are only given to the Admin roles and are restricted to be assumed by certain entities from the corporate network.
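The first measure, not exposing the API server to untrusted networks, can be checked against the cluster description returned by the EKS DescribeCluster API. This is a minimal sketch, assuming the input is the `cluster` object from that response; the function name and the finding messages are illustrative.

```python
def endpoint_findings(cluster: dict) -> list[str]:
    """Flag API endpoint settings that expose the cluster to untrusted networks."""
    vpc_config = cluster.get("resourcesVpcConfig", {})
    findings = []
    if vpc_config.get("endpointPublicAccess", False):
        cidrs = vpc_config.get("publicAccessCidrs", [])
        # Assumption: a missing CIDR list is treated like the open default.
        if not cidrs or "0.0.0.0/0" in cidrs:
            findings.append("public API endpoint is reachable from any IPv4 address")
    if not vpc_config.get("endpointPrivateAccess", False):
        findings.append("private endpoint access is disabled")
    return findings


if __name__ == "__main__":
    sample_cluster = {
        "resourcesVpcConfig": {
            "endpointPublicAccess": True,
            "endpointPrivateAccess": False,
            "publicAccessCidrs": ["0.0.0.0/0"],
        }
    }
    for finding in endpoint_findings(sample_cluster):
        print(finding)
```

Because the function takes a plain dictionary, it can run in a unit test or a pipeline step before the product is published, without calling AWS.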
Generic preventive measures to implement
- Input validation, Authentication, Session handling, and contextual bound handling
Create a controls matrix
Once we have the controls we create a control matrix, which maps threats and all controls for that threat. We separate them into different categories like below. This needs to be reviewed by a security SME and approved.
- Directive
  - Put configuration in the product against the threats.
  - Calico Network policy
- Preventive
  - Instead of enabling SSH access, use SSM Session Manager when you need to remote into a host.
- Detective
  - Event-based detection of misconfiguration (AWS Config & Custom Security tool)
  - Periodically run kube-bench to verify compliance with the CIS Amazon EKS Benchmark
  - Periodically use Amazon Inspector to assess hosts for exposure, vulnerabilities, and deviations from best practices
  - Periodically scan your container images
- Remediative
  - Incident response plan
- Corrective
  - Iterative hardening
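A controls matrix is easier to review when each row can report which categories it still lacks. The following sketch shows one possible shape for a row. The class name, the example threat, and the way controls are paired with it are illustrative assumptions.

```python
from dataclasses import dataclass, field

CATEGORIES = ("directive", "preventive", "detective", "remediative", "corrective")


@dataclass
class ThreatControls:
    """One row of the matrix: a threat and its controls, grouped by category."""

    threat: str
    controls: dict[str, list[str]] = field(default_factory=dict)

    def missing_categories(self) -> list[str]:
        """Return the categories that have no control recorded for this threat."""
        return [c for c in CATEGORIES if not self.controls.get(c)]


if __name__ == "__main__":
    # Illustrative row; the pairing of threat and controls is an assumption.
    row = ThreatControls(
        threat="Attacker gains shell access on a worker node",
        controls={
            "preventive": ["Use SSM Session Manager instead of SSH"],
            "detective": ["Periodic kube-bench run against the CIS Amazon EKS Benchmark"],
            "remediative": ["Incident response plan"],
        },
    )
    print(row.missing_categories())
```

An empty category is not automatically a gap, but it gives the security SME a concrete question to answer during review.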
Create an Incident Response plan
This step defines how to deal with an identified threat if it occurs. The result is a runbook or SOP for each threat.
- Identify the offending pod and worker node (by worker node, by deployment, by label, or by service account name)
- Isolate the pod/node (Network policy)
- Revoke temporary security credentials assigned to pod/node
- Cordon the worker node
- Enable termination protection on the impacted worker node
- Capture volatile artifacts on the worker node (operating system memory, netstat tree dump)
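The isolation step can be prepared ahead of time as part of the runbook. The sketch below builds a Kubernetes NetworkPolicy that selects pods by label and allows no ingress or egress. The function name, the default policy name, and the example namespace and label are hypothetical, and the policy only takes effect if the cluster runs a network plugin that enforces NetworkPolicy, such as Calico.

```python
import json


def isolation_policy(
    namespace: str, labels: dict[str, str], name: str = "isolate-compromised-pod"
) -> dict:
    """Build a NetworkPolicy that blocks all traffic to and from the matching pods."""
    if not labels:
        # An empty selector matches every pod in the namespace.
        raise ValueError("labels must not be empty")
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": dict(labels)},
            # Both directions are listed and no rules are given, so nothing is allowed.
            "policyTypes": ["Ingress", "Egress"],
        },
    }


if __name__ == "__main__":
    # Hypothetical namespace and label identifying the offending pod.
    print(json.dumps(isolation_policy("payments", {"app": "checkout"}), indent=2))
```

The JSON output can be saved to a file and applied with `kubectl apply -f`. Relabeling the offending pod with a dedicated quarantine label before applying the policy avoids isolating healthy replicas that share the original label.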
If a measure is not implemented, or needs to be exempted because some other mechanism prevents the misuse, get approval from the SME and record it in a separate exemption list.
This practice with detailed documentation helps in creating and distributing a secure consumable.
Happy Reading !!
Note: All views are personal. No endorsement from current or previous organizations.