On Amazon EKS and FIS

Dirk Michel
12 min read · Sep 4, 2022

Distributed computing and cloud-native software applications have, in many ways, precipitated the discipline of reliability engineering. How do you reason about application reliability in a situation in which the application itself is constructed out of many loosely coupled components that are deployed across clusters of networked servers which collaborate over a web of remote connections?

Unit testing and integration testing, performance testing and benchmarking, and resource consumption profiling (broadly speaking, functional and non-functional validation) are foundational activities when building reliable software components. However, building software applications on microservices architecture principles tends to create circumstances in which well-established component validation approaches and assumptions about reliability may not capture all the failure modes we care about.

A way to help uncover opportunities to improve the resiliency of distributed applications is through chaos engineering.

Akin to the scientific method, we can carefully formulate expectations or hypotheses about application resiliency, observability, and performance and then validate them experimentally on a deployed and running system by injecting controlled failures. The hope is that we can discover insights into the reliability behaviour of our application in this way and iteratively improve our overall system by teasing out ever-weaker failure modes.

The practicalities of implementing chaos engineering principles for Kubernetes applications, however, may not always be obvious. Many popular tools and utilities can inject perturbations and faults at various layers of the system stack. Equally, we can choose from many options for auxiliary systems that help observe impacts, capture all the evidence, and implement experimental safeguards such as stop definitions and automatic rollbacks.

Consider Kubernetes applications deployed on Amazon EKS. This would typically entail experiments at an AWS resource level, perhaps via specialist tooling such as the AWS Fault Injection Simulator or through a collection of custom scripts and toolkits. Fault injections into the Kubernetes application layer would commonly be covered via a separate set of tools: Cloud-native chaos engineering projects such as LitmusChaos or ChaosMesh can provide specialised platforms.

However, the boundaries between AWS resources and software applications are rapidly dissolving. Applications running on Amazon EKS increasingly leverage the broader ecosystem of AWS resources and services in many ways, including through AWS Controllers for Kubernetes. The development pattern of using Kubernetes as the control plane for AWS Services opens up the opportunity to build Kubernetes applications that directly provision and control AWS resources they need. We can refer to this as cloud-native infrastructure.

Therefore we increasingly want to run coordinated experiments across our AWS Services layer, the Kubernetes layer, and the application workloads themselves.

Streamlining experimentation across the layers of the stack is growing in relevance. Ultimately, we look to improve the aggregate resiliency of the entire system.

In this blog, we explore the possibilities of using AWS FIS to consolidate and coordinate experimentation across the layers of the stack. The following diagram illustrates the target workflow.

AWS FIS Coordinating experiments across AWS Resources and Kubernetes Resources

For those on a tight time budget: The TL;DR of the following sections shows that we can streamline experimentation with AWS FIS across AWS Services and cloud-native chaos engineering projects. At the same time, we can fit the approach into a GitOps model that further helps audit, log, peer-review, and authorise fault injection definitions and their execution.

Let’s do a quick recap before we get started:

The AWS FIS managed service directly ties into AWS IAM, which helps us authenticate and authorise the definition and execution of experiments. This cuts across FIS experiments defined as part of automated CICD pipeline stages as well as human-in-the-loop activities such as executing experiments on production systems.

AWS FIS Experiment Templates are used to encapsulate the definition of an experiment. The template is where we codify our overall experiment design, including its failure actions, the injection flow sequence, rollbacks, and observability. The building blocks of experiment templates include an Action Set, Targets, Stop Conditions, and Log Destinations.

The Action Set is where we combine one or more individual fault injection Actions. The AWS FIS team publishes a growing choice of curated and predefined Actions; a list of the predefined Actions provided by AWS FIS is available here. The predefined Actions follow the <aws:service-name:action-type> syntax, take a documented set of Action Parameters as input, and include Action Types for common AWS resources such as EC2, RDS, and EKS. AWS FIS itself delivers predefined Action Types that target IAM Roles. Interestingly, AWS FIS integrates with AWS SSM and provides Actions based on curated and preconfigured SSM documents.

We can define and control the Action execution flow with the “start after” parameter on each Action. By default, AWS FIS starts all Actions at the beginning of the experiment run; any sequencing we want is expressed through the “start after” parameter, which defines the flow of Actions within an Action Set. Another flow element is the <aws:fis:wait> Action, which introduces a wait timer between Actions.
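
As an illustrative excerpt (the full template appears further below), chaining one hypothetical Action after another and inserting a pause could look like this; the action names and the target reference are placeholders:

"actions": {
  "terminate-nodes": {
    "actionId": "aws:eks:terminate-nodegroup-instances",
    "parameters": { "instanceTerminationPercentage": "20" },
    "targets": { "Nodegroups": "Nodegroups-Target-1" }
  },
  "pause-before-next-action": {
    "actionId": "aws:fis:wait",
    "parameters": { "duration": "PT2M" },
    "startAfter": [ "terminate-nodes" ]
  }
}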

Where possible, we adopt predefined Actions and leave the heavy lifting of creating them to the AWS FIS service team. We can, however, implement custom use cases via “open” predefined Actions when needed.

The AWS FIS team have catered for flexible Custom Action Types that are “open” in the sense that we can define what the Actions are.

AWS Systems Manager Actions are a type of “open” predefined Action that we can leverage. The <aws:ssm:send-command> and <aws:ssm:start-automation-execution> Action Types, for example, take an ARN as input and simply point to the location of specific SSM documents or SSM automation documents. We get to decide what the documents contain and can leverage the wider ecosystem of SSM documents and playbooks we control and author. SSM-based Actions are useful for Amazon EKS clusters whose worker node AMIs contain the SSM agent; if your AMIs don’t, implement these failure actions at the Kubernetes layer via cloud-native chaos engines instead.

We can create user-defined SSM-Document-based Actions for custom fault injections. AWS FIS also offers pre-configured AWS FIS SSM Documents for common use cases that can be used for worker nodes based on AmazonLinux2 and Ubuntu AMIs.

The Amazon EKS Action <aws:eks:inject-kubernetes-custom-resource> is another “open” Action. With this Action, AWS FIS executes fault actions that are defined as part of ChaosMesh and Litmus experiments for containerised applications running on Amazon EKS. This helps us extend “AWS FIS native” AWS-resource-based Actions and bridge into Kubernetes and cloud-native chaos experiment definitions. The AWS FIS Kubernetes Custom Resource Action allows us to define and apply Kubernetes manifests to Amazon EKS clusters. We can define AWS FIS supported ChaosMesh and Litmus custom resource spec definitions in JSON format, have AWS FIS apply them to Amazon EKS target clusters, and then rely on the respective ChaosMesh or Litmus Controllers to execute them for us.

This allows us to use AWS FIS as the “control plane” for fault injection actions executed by multiple tools, including cloud-native.

This is where the critical concept emerges:

We can run “AWS FIS native” actions on AWS resources, such as terminating a randomly selected percentage of EC2-backed worker nodes and triggering an RDS instance failover alongside Kubernetes pod deletion and pod stress tests on a Deployment using LitmusChaos or ChaosMesh faults definitions.

Being aware of these concepts helps with implementing our target workflow. We assume that an Amazon EKS cluster, cluster add-ons, the LitmusChaos Operator, worker nodes and sample applications are already deployed. Equally, we assume that the wider observability tooling and alerting are already in place, for example, via Amazon CloudWatch Logs, Amazon Managed Prometheus (AMP), and Amazon Managed Grafana (AMG).

Finally, we also assume that you have the relevant AWS FIS IAM Role and policy statements in place to run the FIS experiments. The Action Type examples we’ll be using require the EKS policy, the EC2 policy, the RDS policy, and the SSM policy.
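
For orientation, the trust policy on that IAM Role (my-fis-role in the examples that follow) would typically allow the AWS FIS service principal to assume it, with the EKS, EC2, RDS, and SSM permissions attached as identity policies on the same role; a minimal sketch:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "fis.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}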

Let’s do it…

We start by authorising AWS FIS to interact with our target Amazon EKS cluster environment. This is done via the aws-auth ConfigMap, where we add the IAM Role for AWS FIS into the mapRoles field. The updated ConfigMap can look like this. For illustration purposes, we’re assigning the AWS FIS IAM Role to the Kubernetes system:masters group.

apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    [...]
    - groups:
        - system:masters
      rolearn: arn:aws:iam::your-aws-account-id:role/my-fis-role
      username: aws-fis
    [...]
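
If you prefer not to edit the ConfigMap by hand, the same identity mapping can be created with eksctl; a sketch, assuming a cluster named my-eks-cluster in eu-west-1:

# Map the AWS FIS IAM Role into the aws-auth ConfigMap via eksctl
eksctl create iamidentitymapping \
  --cluster my-eks-cluster \
  --region eu-west-1 \
  --arn arn:aws:iam::your-aws-account-id:role/my-fis-role \
  --username aws-fis \
  --group system:masters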

Then we check the readiness of the LitmusChaos Operator and the LitmusChaos Delegate agent on our target Amazon EKS cluster.

$ kubectl get pods -n litmus
NAME                           READY   STATUS    RESTARTS   AGE
chaos-exporter-xyz             1/1     Running   0          5m27s
chaos-operator-ce-xyz          1/1     Running   0          5m27s
event-tracker-xyz              1/1     Running   0          5m28s
litmusportal-frontend-xyz      1/1     Running   0          15m
litmusportal-server-xyz        1/1     Running   0          15m
litmusportal-auth-server-xyz   1/1     Running   0          15m
mongo-0                        1/1     Running   0          15m
subscriber-xyz                 1/1     Running   0          5m30s
workflow-controller-xyz7       1/1     Running   0          5m32s

Analogous to AWS FIS Experiment Templates, LitmusChaos has its own experiment templates, which we deploy into the Kubernetes cluster as ChaosExperiment custom resources.

Use predefined LitmusChaos experiment templates from the community on LitmusChaosHub or create your own ChaosExperiment template definitions. For example, use the community pod-delete ChaosExperiment YAMLs and apply them to the litmus namespace. Of course, each ChaosExperiment needs to have Kubernetes RBAC permissions related to the actions it contains. Therefore we also use Kubernetes RBAC YAMLs associated with each ChaosExperiment, which include the ServiceAccount definition and the Kubernetes Role and RoleBindings we need for the experiment. I’ve found it helpful to keep all ChaosExperiment, ServiceAccount, and RBAC resources together and deployed into the litmus namespace.

It is a good practice to use dedicated ServiceAccounts and Kubernetes RBAC resources for each experiment. Notice also that we need to authorise the ChaosExperiment’s ServiceAccounts for the namespace in which the target application workloads are running. Depending on your set-up, you will need to define Role and RoleBindings for the ServiceAccounts in the application namespace as well.
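
As an illustration, a minimal ServiceAccount and RBAC set for the community pod-delete experiment could look like the sketch below, assuming the target application runs in the default namespace; the exact rules should follow the RBAC manifest published alongside the experiment, and depending on your set-up an equivalent Role and RoleBinding may also be needed in the litmus namespace:

# Sketch only: ServiceAccount in the litmus namespace...
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pod-delete-sa
  namespace: litmus
---
# ...authorised in the application namespace via a Role and RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-delete-sa
  namespace: default
rules:
  - apiGroups: [""]
    resources: ["pods", "events"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete", "deletecollection"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list", "create", "delete", "deletecollection"]
  - apiGroups: ["litmuschaos.io"]
    resources: ["chaosengines", "chaosexperiments", "chaosresults"]
    verbs: ["get", "list", "create", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-delete-sa
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-delete-sa
subjects:
  - kind: ServiceAccount
    name: pod-delete-sa
    namespace: litmus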

Also, apply and manage the lifecycle of cloud-native experiment templates and their associated RBAC resources securely via GitOps controllers such as FluxCD: use pull requests and peer-review approval workflows for the Git repositories that hold your experiment YAMLs and let Flux reconcile them onto the target environments.

Any LitmusChaos ChaosExperiment template we want to use and trigger via AWS FIS would need to be pre-installed and RBAC pre-authorised on the target Amazon EKS cluster.

FluxCD will then apply and install our chosen ChaosExperiments and their RBAC resources. The Amazon EKS cluster is now ready to receive experiment triggers from AWS FIS.
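
A minimal Flux Kustomization pointing at a Git path that holds the experiment and RBAC YAMLs might look like this sketch; the GitRepository name (chaos-definitions) and the path are assumptions:

apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: litmus-experiments
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: chaos-definitions   # assumed to be defined separately
  path: ./litmus/experiments  # assumed repository layout
  prune: true
  targetNamespace: litmus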

To do that, we need to define the AWS FIS Experiment Template. These can be defined, for example, through the AWS Management Console as illustrated in the screenshot below.

AWS FIS Experiment Template on the AWS Management Console

The example screenshot shows an AWS FIS Action Set containing Actions across Kubernetes Pods, EC2 instances, RDS instances, and SSM documents.

We can also use the AWS CLI. The below code snippet shows how.

aws fis create-experiment-template --cli-input-json file://<path-to-json-file> --profile $MY_PROFILE_NAME

Notice the --profile flag, as you’ll probably want to refer to the AWS CLI profile name in the ~/.aws/credentials file that contains your AWS account details. The --cli-input-json flag references a file that contains the JSON representation of the AWS FIS Experiment Template we want to create.

The following snippet shows the equivalent of the AWS FIS Experiment Template we saw in the AWS Management Console screenshot. The JSON file defines the components of the AWS FIS Experiment Template, namely the Actions, Targets, Stop Conditions, and the IAM Role.

{
  "description": "multi-action-experiment",
  "actions": {
    "ec2-terminate": {
      "actionId": "aws:eks:terminate-nodegroup-instances",
      "parameters": {
        "instanceTerminationPercentage": "20"
      },
      "targets": {
        "Nodegroups": "Nodegroups-Target-2"
      },
      "startAfter": [
        "k8s-pod-cpu-stress"
      ]
    },
    "fis-wait-timer": {
      "actionId": "aws:fis:wait",
      "parameters": {
        "duration": "PT5M"
      },
      "startAfter": [
        "rds-reboot"
      ]
    },
    "k8s-pod-cpu-stress": {
      "actionId": "aws:eks:inject-kubernetes-custom-resource",
      "parameters": {
        "kubernetesApiVersion": "litmuschaos.io/v1alpha1",
        "kubernetesKind": "ChaosEngine",
        "kubernetesNamespace": "litmus",
        "kubernetesSpec": "***",
        "maxDuration": "PT10M"
      },
      "targets": {
        "Cluster": "Cluster-Target-1"
      }
    },
    "k8s-pod-delete": {
      "actionId": "aws:eks:inject-kubernetes-custom-resource",
      "parameters": {
        "kubernetesApiVersion": "litmuschaos.io/v1alpha1",
        "kubernetesKind": "ChaosEngine",
        "kubernetesNamespace": "litmus",
        "kubernetesSpec": "***",
        "maxDuration": "PT10M"
      },
      "targets": {
        "Cluster": "Cluster-Target-1"
      }
    },
    "rds-reboot": {
      "actionId": "aws:rds:reboot-db-instances",
      "parameters": {
        "forceFailover": "true"
      },
      "targets": {
        "DBInstances": "DBInstances-Target-3"
      },
      "startAfter": [
        "ec2-terminate"
      ]
    },
    "ssm-ec2-stress": {
      "actionId": "aws:ssm:send-command",
      "parameters": {
        "documentArn": "arn:aws:ssm:eu-west-1::document/AWSFIS-Run-CPU-Stress",
        "documentParameters": "{\"DurationSeconds\":\"60\", \"InstallDependencies\":\"True\"}",
        "duration": "PT5M"
      },
      "targets": {
        "Instances": "Instances-Target-4"
      },
      "startAfter": [
        "fis-wait-timer"
      ]
    }
  },
  "targets": {
    "Cluster-Target-1": {
      "resourceType": "aws:eks:cluster",
      "resourceArns": [
        "arn:aws:eks:eu-west-1:xyz:cluster/my-eks-cluster"
      ],
      "selectionMode": "ALL"
    },
    "DBInstances-Target-3": {
      "resourceType": "aws:rds:db",
      "resourceTags": {
        "ChaosReady": "Yes"
      },
      "selectionMode": "ALL"
    },
    "Instances-Target-4": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": {
        "ChaosReady": "Yes"
      },
      "selectionMode": "ALL"
    },
    "Nodegroups-Target-2": {
      "resourceType": "aws:eks:nodegroup",
      "resourceTags": {
        "ChaosReady": "Yes"
      },
      "selectionMode": "ALL"
    }
  },
  "stopConditions": [
    {
      "source": "none"
    }
  ],
  "roleArn": "arn:aws:iam::xyz:role/my-fis-role",
  "tags": {
    "Name": "My example experiment"
  }
}

The k8s-pod-delete and k8s-pod-cpu-stress Actions in the snippet are the ones that reference the LitmusChaos details.

With LitmusChaos, we can trigger any ChaosExperiment Template by defining a ChaosEngine custom resource. Every ChaosEngine custom resource contains the details with which we want to execute the ChaosExperiment Template.

Hence, as with any Kubernetes resource, we specify the ApiVersion, the Kind, and the namespace in which the resource will be deployed. To keep things tidy, we deploy ChaosEngine resources into the litmus namespace, alongside all the other LitmusChaos resources. The *** placeholders contain the JSON representation of the ChaosEngine spec for the two inject-kubernetes-custom-resource Actions we defined.

The below ChaosEngine spec body references the pod-delete ChaosExperiment Template.

{
  "engineState": "active",
  "appinfo": {
    "appns": "default",
    "applabel": "app=nginx",
    "appkind": "deployment"
  },
  "chaosServiceAccount": "pod-delete-sa",
  "experiments": [
    {
      "name": "pod-delete",
      "spec": {
        "components": {
          "env": [
            {
              "name": "TOTAL_CHAOS_DURATION",
              "value": "300"
            },
            {
              "name": "CHAOS_INTERVAL",
              "value": "60"
            },
            {
              "name": "FORCE",
              "value": "true"
            },
            {
              "name": "PODS_AFFECTED_PERC",
              "value": "30"
            }
          ]
        },
        "probe": []
      }
    }
  ],
  "annotationCheck": "false"
}

The snippet shows the ChaosEngine spec with which we pass in the details for the ChaosExperiment run: the appinfo section tells the ChaosExperiment how to identify our targeted application, we reference the ServiceAccount (pod-delete-sa) we created for the experiment earlier, and the experiments section references the name of the ChaosExperiment (pod-delete) we want to trigger.
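
For reference, the same definition expressed as the full YAML ChaosEngine manifest that Litmus users would typically write could look like the sketch below; the metadata name is illustrative, and the apiVersion, kind, and namespace are the values we supplied as separate AWS FIS parameters in the Experiment Template:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-pod-delete   # illustrative name
  namespace: litmus
spec:
  engineState: active
  annotationCheck: "false"
  appinfo:
    appns: default
    applabel: app=nginx
    appkind: deployment
  chaosServiceAccount: pod-delete-sa
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "300"
            - name: CHAOS_INTERVAL
              value: "60"
            - name: FORCE
              value: "true"
            - name: PODS_AFFECTED_PERC
              value: "30"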

The below ChaosEngine spec body references our pod-cpu-stress ChaosExperiment Template.

{
  "engineState": "active",
  "appinfo": {
    "appns": "default",
    "applabel": "app=nginx",
    "appkind": "deployment"
  },
  "chaosServiceAccount": "pod-cpu-hog-exec-sa",
  "jobCleanUpPolicy": "delete",
  "experiments": [
    {
      "name": "pod-cpu-hog-exec",
      "spec": {
        "components": {
          "env": [
            {
              "name": "TOTAL_CHAOS_DURATION",
              "value": "300"
            },
            {
              "name": "CPU_CORES",
              "value": "1"
            },
            {
              "name": "PODS_AFFECTED_PERC",
              "value": "100"
            },
            {
              "name": "CHAOS_INJECT_COMMAND",
              "value": "md5sum /dev/zero"
            },
            {
              "name": "CHAOS_KILL_COMMAND",
              "value": "kill $(find /proc -name exe -lname '*/md5sum' 2>&1 | grep -v 'Permission denied' | awk -F/ '{print $(NF-1)}')"
            }
          ]
        }
      }
    }
  ]
}

Once we start an AWS FIS Experiment, the Actions are executed as per the flow definitions. In our example, AWS FIS executes the k8s-pod-delete and k8s-pod-cpu-stress Actions right at the beginning of the flow. The ec2-terminate Action starts after k8s-pod-cpu-stress completes, and the rds-reboot Action starts once ec2-terminate completes. Then the fis-wait-timer runs, the final ssm-ec2-stress Action is triggered, and the Experiment run concludes after that.
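
Starting a run from the AWS CLI is a one-liner; the experiment template ID below is a placeholder for the value returned by create-experiment-template:

aws fis start-experiment \
  --experiment-template-id <your-experiment-template-id> \
  --profile $MY_PROFILE_NAME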

The AWS FIS Experiment Template shown above, with its various Action Types, illustrates the capabilities and flexibility of the AWS FIS service. Incorporating AWS FIS into our working practices requires thoughtful planning, articulation of objectives, and coordination with stakeholders and various personas.

This journey can begin by defining the objectives and use cases we want to address. They may include introducing experimentation into the early phases of the software development lifecycle, such as development and release engineering. Automating experiment executions through cron-like schedules or on event triggers could be useful to validate release candidates. Automating experiments as part of continuous build and delivery pipelines can also increase the total number of experiment runs and allows us to construct trends over time: with this pattern, each code commit would trigger a validating experiment run. Improving the resiliency, performance, and observability of software applications early in the development life cycle can be fast and cost-effective.
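
In a pipeline stage, this can be a shell step that starts the experiment and polls its state before deciding whether to promote the release candidate; a sketch, with the template ID as a placeholder:

# Start the experiment and capture its ID
EXPERIMENT_ID=$(aws fis start-experiment \
  --experiment-template-id <your-experiment-template-id> \
  --query 'experiment.id' --output text)

# Poll the experiment state until it reaches a terminal status
while true; do
  STATUS=$(aws fis get-experiment --id "$EXPERIMENT_ID" \
    --query 'experiment.state.status' --output text)
  case "$STATUS" in
    completed) echo "Experiment completed"; break ;;
    stopped|failed) echo "Experiment ended with status: $STATUS"; exit 1 ;;
    *) sleep 30 ;;
  esac
done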

Equally, we may wish to run failure injections on production systems, which can help build confidence in the reliability behaviour of customer-facing workloads. These practices embed and help systematise experimentation into our daily work and improve our ability for continuous learning.

For Developers: We equip developers with integrated AWS resource and application experiments during development in a way that extends and builds upon existing unit and integration testing.
For Release Engineers: We extend CICD pipelines with new stages and gates for automated and integrated fault injection validation experiments. We also want to provide pre-production clusters on which we deploy release candidates and continuously or recurrently run a set of experiments, too.
For Site Reliability Engineers: We want to enable SREs to plan and schedule integrated experiments on production systems. This helps identify opportunities to improve resilience, observability, and performance based on insights gained from live customer-facing workloads.

Conclusion

We can streamline and coordinate experimentation flows with AWS FIS and orchestrate failure injection actions into AWS Services as well as cloud-native failure injection tools. In this way, we help reduce the amount of context-switching for developers, release engineers, and SREs. At the same time, we can fit the cloud-native element of the approach into a GitOps model that further helps audit, log, peer-review, and authorise experiment templates and their execution.

Dirk Michel

SVP SaaS and Digital Technology | AWS Ambassador. Talks Cloud Engineering, Platform Engineering, Release Engineering, and Reliability Engineering.