Litmus Chaos Tests (CPU & Memory) in GKE (Google Kubernetes Engine)

Doddipalli Vamsy Reddy
Google Cloud - Community


First, why chaos tests? Chaos testing builds confidence in the reliability of ever-growing, large-scale distributed software systems by injecting simulated catastrophic events (pod deletion, CPU stress, memory stress, and so on) that could plausibly happen in a production environment.

Second, this walkthrough assumes working knowledge of Google Kubernetes Engine (GKE), LitmusChaos, and Helm or kubectl.

We are going to test the reliability of GKE clusters at the node level by running CPU and memory stress experiments. To do this, we install Litmus ChaosCenter and a Chaos Delegate inside the cluster.

Create a GKE cluster:

To create a GKE cluster, make sure billing is enabled for your project, the Kubernetes Engine API is enabled, and you have available quota.

gcloud container clusters create <CLUSTER_NAME> --zone "us-central1-c" --machine-type "e2-custom-4-4096" --image-type "UBUNTU_CONTAINERD" --num-nodes "2" --node-locations "us-central1-c"

Connect to the newly created cluster:

gcloud container clusters get-credentials <CLUSTER_NAME> --zone us-central1-c --project <PROJECT_NAME>

List the nodes and namespaces:

kubectl get nodes
kubectl get ns

Install Litmus ChaosCenter using either Helm or kubectl. With Helm:

helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo list
# You can use any namespace you prefer
kubectl create ns litmus
helm install chaos litmuschaos/litmus --namespace=litmus
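Optionally, confirm the Helm release was created before moving on:

helm ls -n litmus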

Alternatively, install Litmus ChaosCenter with kubectl:

kubectl apply -f https://litmuschaos.github.io/litmus/2.12.0/litmus-2.12.0.yaml

We can verify the Litmus installation with:

kubectl get pods -n litmus
kubectl get svc -n litmus

You should see litmusportal-frontend-service along with four other dependent services, and their pods in Running state.
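If you would rather block until everything is ready instead of polling by hand, kubectl wait can do it (the timeout below is just an illustrative value):

kubectl wait --namespace litmus --for=condition=Ready pod --all --timeout=300s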

Create Google Cloud firewall rules for the frontend and server services to allow connections from the internet to the Litmus ChaosCenter:

gcloud compute firewall-rules create <NAME_FRONTEND_SERVICE> --allow tcp:<FRONTEND_NODE_PORT>
gcloud compute firewall-rules create <NAME_SERVER_SERVICE> --allow tcp:<SERVER_NODE_PORT>
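If you are unsure which NodePorts to open, you can read them straight off the services; the service names below match a default Litmus 2.x install, so adjust them if yours differ:

kubectl get svc -n litmus litmusportal-frontend-service -o jsonpath='{.spec.ports[0].nodePort}{"\n"}'
kubectl get svc -n litmus litmusportal-server-service -o jsonpath='{.spec.ports[0].nodePort}{"\n"}'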

The URL of Litmus ChaosCenter can be obtained with the commands below. The output will look something like http://172.17.0.3:31186

LITMUS_PORTAL_NAMESPACE=litmus
export NODE_NAME=$(kubectl -n $LITMUS_PORTAL_NAMESPACE get pod -l "component=litmusportal-frontend" -o=jsonpath='{.items[*].spec.nodeName}')
export EXTERNAL_IP=$(kubectl -n $LITMUS_PORTAL_NAMESPACE get nodes $NODE_NAME -o jsonpath='{.status.addresses[?(@.type=="ExternalIP")].address}')
export NODE_PORT=$(kubectl -n $LITMUS_PORTAL_NAMESPACE get -o jsonpath="{.spec.ports[0].nodePort}" services litmusportal-frontend-service)
echo "URL: http://$EXTERNAL_IP:$NODE_PORT"

The default credentials are provided below; however, you will be asked to change the password on your first login.

Username: admin
Password: litmus

Now we are going to test the reliability of the GKE nodes by injecting CPU and memory stress. In the ChaosCenter UI, you should see self-agent under ChaosDelegates in Active status.
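You can also sanity-check the delegate from the CLI; the self-agent installs components such as the subscriber, chaos-operator, and event-tracker into the same namespace (component names per the Litmus 2.x defaults):

kubectl get pods -n litmus | grep -E 'subscriber|chaos-operator|event-tracker'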

Click Schedule a Chaos Scenario and select self-agent as the delegate.

You can choose a pre-defined scenario, bring your own, use a template, or select experiments from ChaosHub. I have chosen ChaosHub.

Select generic/node-cpu-hog and generic/node-memory-hog.

The resiliency score is a measure of how resilient the Kubernetes system proves to be when the chaos scenarios are performed against it. Each experiment is assigned a weight that sets its priority: 0–3 (low priority), 4–6 (medium priority), 7–10 (high priority).

ChaosHub is a marketplace where you can get various experiments.

You can edit the sequence to run the CPU and memory experiments either simultaneously or one after the other.

Click Edit YAML and change the environment variable TARGET_NODES to your GKE node names. You can target multiple nodes by separating them with commas.
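To grab the exact node names to paste into TARGET_NODES, you can list them with:

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'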

Once you have scheduled it as per your requirements and verified the details, click Finish to run the experiment. You can then watch the experiment pods in the litmus namespace:

kubectl get pods -n litmus

You can see the experiment pods popping up and inducing stress on CPU and memory, which is reflected in the node metrics:

kubectl top nodes
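If you have watch available, you can poll node usage while the hog experiments run (kubectl top relies on metrics-server, which GKE enables by default):

watch -n 5 kubectl top nodes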

Comprehensive details of the running experiments are available in both graphical and table views. Logs are recorded and streamed live during execution, and the same applies to the Chaos Result. In the image above, the verdict is Awaited, which means the job is still in progress.
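The verdict can also be followed from the CLI through the Litmus custom resources backing the scenario (assuming the default litmus namespace):

kubectl get chaosengines -n litmus
kubectl get chaosresults -n litmus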

Once the scenario is completed, a report is generated that can be found in the statistics section under analytics. In the above scenario, the resilience score is 100%: both experiments carry a weight of 10 points (CPU stress and memory stress) and both passed, so the weighted average works out to (10 × 100% + 10 × 100%) / (10 + 10) = 100%. This indicates the GKE nodes are resilient under the configuration defined for this particular experiment.

This brings us to the completion of the chaos tests (CPU/memory) on GKE nodes. Moreover, Litmus has the capability to integrate with Prometheus and Grafana; you can find more details in the Litmus documentation.

Thank you for visiting the blog and have a great day!
