Chaos Engineering — Failures Become Reliability

Improving Kubernetes Resiliency with Chaos Engineering

Failures are inevitable: even the strongest platform, backed by solid operations infrastructure, can face an outage in production when the turbulent conditions it is expected to withstand spiral out of control. There is no single reason why a system fails, and it is not possible to address a failure immediately without prior knowledge of why and when that specific failure might occur; the same applies to the widely used Kubernetes platform. Even when all of the individual services in a Kubernetes environment are functioning properly, the interactions between those services can cause unpredictable outcomes.

Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

In simpler terms: breaking things on purpose! While seemingly counterintuitive, chaos engineering means intentionally injecting something chaotic into a complex system (Kubernetes, here) in order to prevent a future issue, and recording how the system behaves while under that chaos. It is the practice of carefully injecting failures into our systems to test their ability to respond. This is an effective way to practice, prepare, and prevent or minimize downtime and outages before they occur.

Chaos engineering is a relatively new field. It all started with Netflix's transition to a true cloud environment. Netflix saw the cloud as vulnerable: they believed that no instance in the cloud could guarantee permanent uptime. So they created Chaos Monkey, designed to randomly disable production instances to ensure survivability during common types of failures. Failure Injection Testing (FIT) was later designed to give developers a "blast radius" rather than unmanaged chaos.

There are multiple tools in the Kubernetes space that can create controlled chaos: kube-monkey, PowerfulSeal, Pod-Reaper, and others. All of these tools let users design a planned fault scenario and apply it to specific, targeted areas of Kubernetes. This post outlines PowerfulSeal, along with other tools that can be used together with it for testing.

PowerfulSeal

PowerfulSeal from Bloomberg follows the Principles of Chaos Engineering and is inspired by the infamous Netflix Chaos Monkey. The tool allows engineers to "break things on purpose" and observe the issues caused by introducing various failure modes. PowerfulSeal, written in Python, is currently Kubernetes-specific; it ships "cloud drivers" for managing infrastructure failure only on OpenStack, and provides an abstract driver for supporting additional cloud platforms.

PowerfulSeal works in several modes, so users can provide fault configurations in whichever way suits their requirements. Interactive mode lets you discover the cluster's components and break things manually. Autonomous mode reads a policy file containing matches, filters, and actions, and executes the tests in a loop. Label mode allows killing targeted objects selected by their labels.

A minimal no-op policy file (YAML) looks like this:

config:
  minSecondsBetweenRuns: 47    # minimum time between each execution
  maxSecondsBetweenRuns: 452   # maximum time between each execution
nodeScenarios: []              # scenarios specific to Kubernetes nodes
podScenarios: []               # scenarios specific to Kubernetes pods

podScenarios

Scenarios describing actions on kubernetes pods.

config:
  minSecondsBetweenRuns: 1
  maxSecondsBetweenRuns: 30
podScenarios:
  - name: "Kill Pods"
    # Match the initial set of pods.
    # The set of pods will be a union of all matches.
    match:
      # you can pick a whole namespace
      - namespace:
          name: "blue"
      # you can pick a particular deployment
      - deployment:
          name: "nginx"
          namespace: "blue"
      # throw in another one to the union
      - deployment:
          name: "busybox"
          namespace: "red"
      # you can also select labels in a namespace
      # (note that the labels are always strings)
      - labels:
          namespace: "blue"
          selector: "app=nginx"
    filters:
      # property filters (all the property filters support regexp)
      - property:
          name: "name"
          value: "application-X-*"
      - property:
          name: "state"
          value: "Running"
      # time of execution filters
      # to restrict the actions to work days, you can do
      - dayTime:
          onlyDays:
            - "monday"
            - "tuesday"
            - "wednesday"
            - "thursday"
            - "friday"
          startTime:
            hour: 10
            minute: 0
            second: 0
          endTime:
            hour: 17
            minute: 30
            second: 0
      # to pick a random sample of pods, you can specify either a size
      - randomSample:
          size: 5
      # or a ratio (will be rounded down to an integer)
      - randomSample:
          ratio: 0.2
      # this will pass all the pods with the given probability,
      # or none otherwise
      - probability:
          probabilityPassAll: 0.5
    # The actions will be executed in the order specified
    actions:
      - kill:
          probability: 0.5
          force: true
      - wait:
          seconds: 5
      - kill:
          probability: 1
          force: true

nodeScenarios

Scenarios describing actions on nodes.

config:
  minSecondsBetweenRuns: 60
  maxSecondsBetweenRuns: 360
nodeScenarios:
  # example of a policy using all the filters available
  - name: "kill nodes"
    # Choose the initial set of nodes to operate on.
    # Note that this will be a union of all the nodes you match (logical OR).
    match:
      - property:
          name: "name"
          value: "minion-*"
      - property:
          name: "ip"
          value: "127.0.0.1"
      - property:
          name: "group"
          value: "minion"
      - property:
          name: "az"
          value: "A1|A2"
      - property:
          name: "state"
          value: "UP"
    # The filters are executed in the order specified, can be
    # used multiple times, and are piped from one to the next.
    filters:
      # property filters (all the property filters support regexp)
      - property:
          name: "name"
          value: "minion-*"
      - property:
          name: "ip"
          value: "127.0.0.1"
      - property:
          name: "group"
          value: "minion"
      - property:
          name: "az"
          value: "AZ1|AZ2"
      - property:
          name: "state"
          value: "UP"
      # time of execution filters
      # to restrict the actions to work days, you can do
      - dayTime:
          onlyDays:
            - "monday"
            - "tuesday"
            - "wednesday"
            - "thursday"
            - "friday"
          startTime:
            hour: 10
            minute: 0
            second: 0
          endTime:
            hour: 17
            minute: 30
            second: 0
      # to pick a random sample of nodes, you can specify either a size
      - randomSample:
          size: 5
      # or a ratio (will be rounded down to an integer)
      - randomSample:
          ratio: 0.2
      # this will pass all the nodes with the given probability,
      # or none otherwise
      - probability:
          probabilityPassAll: 0.5
    # The actions will be executed in the order specified
    actions:
      - stop:
          force: false
      - wait:
          seconds: 30
      - start:
      - execute:
          cmd: "sudo service docker restart"

Running PowerfulSeal in autonomous mode

Take a simple example: killing pods that are part of a deployment in a specific namespace. PowerfulSeal runs here in autonomous mode, where users can provide a policy file (or adjust the configuration through the UI) and export metrics to Prometheus for visualization in Grafana.

Starting PowerfulSeal in autonomous mode:

# --kubeconfig:            path to the kube config
# --host / --port:         host and port for the PowerfulSeal web UI
# --policy-file:           policy file to execute
# --inventory-kubernetes:  build the inventory from the cluster's nodes
# --prometheus-*:          expose metrics for Prometheus
# --no-cloud:              run without a cloud driver
seal autonomous \
  --kubeconfig ~/.kube/config \
  --host 172.29.86.34 \
  --port 30089 \
  --policy-file /root/policy_kill_random_default.yml \
  --inventory-kubernetes \
  --remote-user root \
  --ssh-path-to-private-key ~/.ssh/ansible_rsa \
  --prometheus-collector \
  --prometheus-host 172.29.86.34 \
  --prometheus-port 9090 \
  --no-cloud

On initialization, PowerfulSeal discovers the nodes and all Kubernetes objects, then starts its activity based on the policy file supplied above.

PowerfulSeal Initialization
PowerfulSeal UI
PowerfulSeal Discovering all objects on Kubernetes

Configuration:

Policy Configuration on PowerfulSeal UI
Parameter Configuration

As shown above, the configuration randomly kills pods in the blue namespace. Consider a scenario where a production cluster runs a company's web platform as multiple deployments with multiple services on Kubernetes, and the user wants to know how an application behaves (latency, response time, and so on) when some pods in a deployment fail under different thresholds; podScenarios can be used to evaluate each of those cases. The sample configuration above kills pods that are part of a deployment, based on the sampling size provided (3).

As seen below, PowerfulSeal starts killing pods according to the sample size of 3 given above, repeatedly killing a set of 3 pods in a deployment running 6 replicas.
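For reference, an equivalent policy file would look roughly like the sketch below; the deployment name, namespace, and timings are illustrative rather than taken from the screenshots:

config:
  minSecondsBetweenRuns: 30
  maxSecondsBetweenRuns: 60
podScenarios:
  - name: "kill 3 of 6 replicas"
    match:
      - deployment:
          name: "web"          # hypothetical deployment with 6 replicas
          namespace: "blue"
    filters:
      - property:
          name: "state"
          value: "Running"
      - randomSample:
          size: 3              # sampling size of 3, as configured in the UI
    actions:
      - kill:
          probability: 1
          force: true

Because randomSample picks a fresh set of 3 running pods on every iteration, half of the replicas are taken out each cycle while the deployment controller keeps recreating them.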

A more sophisticated configuration can also be provided, one that runs only during a specific time window so that it does not disrupt the production environment, using the podScenarios and nodeScenarios described in the previous section. All the metrics can be exported to Prometheus, and users can define alerts to keep the chaos tests under control.
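As a sketch of what such an alert could look like, the rule below is not copy-paste ready: the metric name is an assumption, so check the names your PowerfulSeal version actually exports from its Prometheus collector.

groups:
  - name: chaos-tests
    rules:
      - alert: ChaosKillRateTooHigh
        # NOTE: "seal_pod_kills_total" is an assumed metric name; verify it
        # against the metrics exposed by your PowerfulSeal version.
        expr: increase(seal_pod_kills_total[10m]) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PowerfulSeal is killing pods faster than expected"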

Grafana Visualization — PowerfulSeal Activity

Users can combine other open-source tools with the available chaos engineering tools to extend their functionality. Below are some applications that can be used alongside PowerfulSeal to broaden the test matrix.

Goldpinger

Goldpinger is a debugging tool for Kubernetes that tests and displays connectivity between the nodes in a cluster. Users can run Goldpinger alongside PowerfulSeal to check connectivity and gather metrics during nodeScenarios. For example, to test the high availability of an application running across a distributed cluster, a user can use PowerfulSeal's nodeScenarios to kill a process or shut down a node during a specific period of time and gather metrics from Goldpinger, which constantly monitors the nodes, as sketched below.
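A minimal nodeScenario for that kind of experiment might look like the following sketch; the match, sample size, and wait time are illustrative, and Goldpinger's metrics will show which node-to-node paths degrade while it runs:

config:
  minSecondsBetweenRuns: 300
  maxSecondsBetweenRuns: 600
nodeScenarios:
  - name: "stop one node while Goldpinger watches"
    match:
      - property:
          name: "state"
          value: "UP"
    filters:
      - randomSample:
          size: 1              # take down a single node at a time
    actions:
      - stop:
          force: false
      - wait:
          seconds: 120         # give Goldpinger time to record the gap
      - start: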

Goldpinger makes calls between its instances for visibility and alerting. It runs as a DaemonSet on Kubernetes and produces Prometheus metrics that can be scraped, visualized and alerted on.

GoldPinger

Goldpinger runs as a pod on every Kubernetes node and uses the Kubernetes API to discover its peer instances, which it then pings.
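A minimal DaemonSet for Goldpinger might look like the sketch below, modeled on the upstream example; the image tag is an assumption, and the ServiceAccount plus the RBAC it needs to list pods are omitted for brevity:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: goldpinger
  labels:
    app: goldpinger
spec:
  selector:
    matchLabels:
      app: goldpinger
  template:
    metadata:
      labels:
        app: goldpinger
    spec:
      serviceAccountName: goldpinger        # needs permission to list pods
      containers:
        - name: goldpinger
          image: bloomberg/goldpinger:v3.0.0   # assumed tag; pick a current one
          env:
            - name: HOST
              value: "0.0.0.0"
            - name: PORT
              value: "8080"
            - name: POD_IP                  # lets the instance identify itself to peers
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
          ports:
            - containerPort: 8080
              name: http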

Status metrics can be exported to Prometheus for visualization, as shown below:

Goldpinger Visualization
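To get those metrics into Prometheus, a scrape job along the following lines can be used; this is a sketch that assumes Prometheus runs with Kubernetes service discovery and that the Goldpinger pods carry the app=goldpinger label used above:

scrape_configs:
  - job_name: "goldpinger"
    kubernetes_sd_configs:
      - role: pod              # discover all pods via the Kubernetes API
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: goldpinger      # keep only the Goldpinger pods
        action: keep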

Locust

Locust is a load testing tool that lets users run distributed load tests against their deployments. It supports a distributed mode with one master and multiple slave nodes.

Locust can run on Kubernetes, for example with the workers deployed as a DaemonSet.
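A sketch of the worker side is shown below. It assumes a ConfigMap named locust-scripts holding the locustfile (shown further down), a Service named locust-master in front of the master, and a pre-1.0 Locust image to match the HttpLocust API used later; adapt the image tag and flags to the version in use.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: locust-worker
spec:
  selector:
    matchLabels:
      app: locust-worker
  template:
    metadata:
      labels:
        app: locust-worker
    spec:
      containers:
        - name: locust
          image: locustio/locust:0.13.5      # assumed tag; newer versions use --worker
          args:
            - "-f"
            - "/locust/locustfile.py"
            - "--slave"                      # pre-1.0 flag for a worker process
            - "--master-host=locust-master"
          volumeMounts:
            - name: locust-scripts
              mountPath: /locust
      volumes:
        - name: locust-scripts
          configMap:
            name: locust-scripts             # hypothetical ConfigMap with locustfile.py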

A sample use case: a user runs a large three-tier application on Kubernetes with multiple replicas and wants to load test the web servers under a faulty condition in which some of the replicas go down, in order to estimate how well the application withstands a disaster scenario. Here PowerfulSeal kills pods and nodes while Locust continuously runs the load test, based on the number of users to simulate and the spawn rate.

For example, for an application with a client tier and an application tier, Locust can distribute requests across target paths such as /login and /metrics. There are many other load generation packages available, including JMeter, Gatling, and Tsung, any one of which might better suit your project's needs.

The locustfile can be provided as a ConfigMap to the containers running on Kubernetes:

data:
  locustfile.py: |
    from locust import HttpLocust, TaskSet, task

    class UserTasks(TaskSet):
        @task
        def index(self):
            self.client.get("/")

        @task
        def stats(self):
            self.client.get("/stats/requests")

    class WebsiteUser(HttpLocust):
        task_set = UserTasks

The attacked target host can be provided through the same configuration:

ATTACKED_HOST: http://locust-master:8089
Load Testing with Locust
Locust Load Tests Visualization

Building the most effective system requires experimentation. Chaos engineering lets users rehearse specific failure scenarios that could happen at any time while a product or service is live. Gaining insight into system problems beforehand also creates a better production environment. Systems design is another area that benefits from chaos engineering, since the experiments produce enough data to show what a fault-tolerant system should look like. Everyone will know what to look for in the future and which systems might be vulnerable. So chaos is not bad in all respects!