An Approach To Automating Application Resiliency Testing With Kubernetes
If you’ve never failed, you’ve never lived
Applied to micro-services, and without too much pessimism, this common saying also holds: failures are an intrinsic characteristic of any distributed system. Hundreds of independent entities trying to communicate and collaborate to achieve a greater good… at least one of them is bound to fail.
Trying to avoid all of these failures is wishful thinking, and it can add complexity that directly impacts an architecture and its cost, and in particular limits its capacity to evolve, be tested, and be maintained.
As always, a balance is required: even if we obviously ought to limit failures, notably by adding redundancy, accepting them and aiming for resiliency opens up a different set of approaches and opportunities when building an architecture.
Once the different types of possible failures are understood, and the system is designed and built to deal with them, it's only natural to test it before experiencing them in production.
This article focuses on this last aspect and shows an approach to automate the way resiliency can be tested at the micro-service level.
Reliability and Resiliency
One important aspect of any system is its reliability:
Dependability, or reliability, describes the ability of a system or component to function under stated conditions for a specified period of time. Reliability is closely related to availability, which is typically described as the ability of a component or system to function at a specified moment or interval of time. — Wikipedia
Implicitly, it captures the system's capability to avoid failures and, as a result, to be available for a given proportion of the time: the famous x nines of availability.
However, with a system composed of multiple independent services, calculating and managing a global availability starts to get tricky. This old post from Netflix is helpful to get a sense of the bigger picture.
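As a back-of-the-envelope illustration (the numbers here are purely hypothetical): if a single user request fans out to 30 micro-services, each available 99.9% of the time, the composite availability of that request path is roughly 0.999^30 ≈ 97%, which already translates into more than ten days of cumulative unavailability per year.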
There are potentially multiple sources of failures: a micro-service may run out of resources, fail under heavier loads, or simply misbehave because building a bug-free application is even harder in a distributed system.
Additionally, the platform (hardware, software, network) can never guarantee 100% availability, if only because the network itself is not fully reliable.
It becomes then important to look at the system through the lens of resiliency:
Resiliency is the ability of a system to gracefully handle and recover from failures. The nature of cloud hosting, where applications are often multi-tenant, use shared platform services, compete for resources and bandwidth, communicate over the Internet, and run on commodity hardware means there is an increased likelihood that both transient and more permanent faults will arise. Detecting failures, and recovering quickly and efficiently, is necessary to maintain resiliency. — Microsoft Azure
Although not every type of failure can be reproduced in a sandbox, their impact on the client can be simulated.
Ultimately, the goal is to be prepared. Fallback strategies, e.g. running in a degraded mode, can be put in place at the application level and subsequently need to be validated.
So let it break… but plan for it!
Without further ado, the rest of this article showcases how to simulate failures between independent (micro-)services in a Kubernetes cluster, and how to capture these scenarios through automated tests.
Why Kubernetes?
For anyone who has had to deal with integration issues, trying to figure out why a service that works in Dev or in a QA environment fails in production, minimizing these discrepancies by adopting a single deployment model is a no-brainer.
Since we also want to support one more type of test, one that is particularly sensitive to the environment topology, one of the requirements is to run it on the same deployment platform used in production, namely Kubernetes.
All the following examples are based on a deployment composed of 2 identical Go micro-services exposing the same /status REST endpoint: service 1 has a dependency on service 2, such that when it receives a GET request, it forwards it to service 2 and aggregates the response before returning a global status.
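To give a sense of the behavior, here is a minimal sketch of what such a handler might look like. It is not the exact code from the repository linked below: SERVICE_NAME is a made-up variable, while DEPENDENCY_NAME is the one set by the helm command that follows.

package main

import (
	"encoding/json"
	"net/http"
	"os"
	"time"
)

// Status mirrors the JSON payload returned by the /status endpoint.
type Status struct {
	Status       string   `json:"status"`
	Name         string   `json:"name"`
	Dependencies []Status `json:"dependencies,omitempty"`
}

func main() {
	dependency := os.Getenv("DEPENDENCY_NAME") // e.g. http://service2-go-service.resiliency-testing:8080
	name := os.Getenv("SERVICE_NAME")          // made up for this sketch, e.g. "service1"
	client := &http.Client{Timeout: 5 * time.Second}

	http.HandleFunc("/status", func(w http.ResponseWriter, r *http.Request) {
		status := Status{Status: "OK", Name: name}

		if dependency != "" {
			// Forward the request to the dependency; if anything goes wrong,
			// report it as UNKNOWN instead of failing the whole request.
			dep := Status{Status: "UNKNOWN", Name: dependency}
			if resp, err := client.Get(dependency + "/status"); err == nil {
				_ = json.NewDecoder(resp.Body).Decode(&dep)
				resp.Body.Close()
			}
			status.Dependencies = append(status.Dependencies, dep)
		}

		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(status)
	})

	_ = http.ListenAndServe(":8080", nil)
}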
The code of this sample micro-service is provided here and both deployments can be set up using the provided helm chart:
$ helm install chart/ --name service1 --namespace resiliency-testing --set DEPENDENCY_NAME=http://service2-go-service.resiliency-testing:8080
$ helm install chart/ --name service2 --namespace resiliency-testing
It creates the following Kubernetes resources:
$ kubectl get pods,services,ingresses -n resiliency-testing

NAME                                       READY   STATUS
pod/service1-go-service-7b4bc7c444-mnfvz   1/1     Running
pod/service2-go-service-6dff85ff9c-8rzg6   1/1     Running

NAME                          TYPE       PORT(S)
service/service1-go-service   NodePort   8080:32564/TCP
service/service2-go-service   NodePort   8080:30829/TCP

NAME                                      HOSTS
ingress.extensions/service1-go-service    service1.resiliency-testing.com
ingress.extensions/service2-go-service    service2.resiliency-testing.com
To make it simple:
- there is one pod per micro-service,
- a micro-service is accessible internally through its dedicated K8s service,
- the K8s service redirects to the pod,
- an ingress allows the clients outside the cluster to interact with a micro-service through a hostname by redirecting to the K8s service.
For this request:
curl -X GET http://service1.resiliency-testing.com/status
a healthy system would return:
{
"status": "OK",
"name": "service1",
"dependencies": [
{
"status": "OK",
"name": "service2"
}
]
}
Organize the resistance with ToxiProxy
What we are looking for, first, is a way to tamper with network communications, and one that will later also work inside a Kubernetes cluster.
Here comes ToxiProxy, an L4 proxy from Shopify.
Toxiproxy is a framework for simulating network conditions. It’s made specifically to work in testing, CI and development environments, supporting deterministic tampering with connections, but with support for randomized chaos and customization. Toxiproxy is the tool you need to prove with tests that your application doesn’t have single points of failure.
ToxiProxy dynamically opens ports and forwards any incoming TCP traffic to target destinations. Sitting between a service and its external dependencies, it is the perfect place to simulate failures.
On top of that, ToxiProxy exposes a set of RESTful APIs that make it easy to dynamically create proxies between services and simulate issues. Quite a convenience once the initial deployment is done!
ResiProxy: a Kubernetes companion
Unfortunately, although it can be packaged as a Docker container, ToxiProxy does not work out of the box once deployed to Kubernetes: port forwarding in a Kubernetes cluster is not just about opening a port at the app level.
An app running in a container (running in a pod) is accessible in the cluster through a Kubernetes service. This service has to open an incoming port to target the port opened by the app and exposed by the container/pod.
Each time a new port is opened by ToxiProxy, it needs to be mapped to an associated k8s service’s port. More info here.
To do this job, we came up with ResiProxy: a lightweight ToxiProxy K8s companion written in Go that proxies the calls to ToxiProxy. Here is the git repository.
It only intercepts the REST admin calls to ToxiProxy, executes the K8s-related operations if needed, then forwards the calls to ToxiProxy for completion.
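To give an idea of the kind of operation ResiProxy automates, here is a rough sketch (not ResiProxy's actual code) of how a new port can be appended to the ToxiProxy K8s service with a recent version of client-go; the service name and namespace are the ones used in the deployment below.

package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// exposePort appends a port to an existing K8s service so that the port
// freshly opened by ToxiProxy inside the pod becomes reachable from the
// rest of the cluster.
func exposePort(ctx context.Context, clientset *kubernetes.Clientset, namespace, serviceName string, port int32) error {
	svc, err := clientset.CoreV1().Services(namespace).Get(ctx, serviceName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	svc.Spec.Ports = append(svc.Spec.Ports, corev1.ServicePort{
		Name:       fmt.Sprintf("proxy-%d", port),
		Port:       port,
		TargetPort: intstr.FromInt(int(port)),
		Protocol:   corev1.ProtocolTCP,
	})
	_, err = clientset.CoreV1().Services(namespace).Update(ctx, svc, metav1.UpdateOptions{})
	return err
}

func main() {
	// Running inside the cluster, the in-cluster configuration is enough.
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)
	if err := exposePort(context.Background(), clientset, "resiliency-testing", "resiproxy-toxiproxy", 8081); err != nil {
		panic(err)
	}
}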
A helm chart is provided to deploy both ResiProxy and ToxiProxy:
helm install chart/ --namespace resiliency-testing --name resiproxy
It creates one additional pod containing the ResiProxy and ToxiProxy containers, the 2 associated services, and an ingress to access ResiProxy (and through it configure ToxiProxy):
$ kubectl get pods,services,ingresses -n resiliency-testing

NAME                                       READY   STATUS
pod/resiproxy-resiproxy-7dd7867984-sjk8r   2/2     Running
pod/service1-go-service-56f6b67c6b-czjjg   1/1     Running
pod/service2-go-service-6dff85ff9c-8rzg6   1/1     Running

NAME                          TYPE       PORT(S)
service/resiproxy-resiproxy   NodePort   8080:32342/TCP
service/resiproxy-toxiproxy   NodePort   8474:31974/TCP
service/service1-go-service   NodePort   8080:31666/TCP
service/service2-go-service   NodePort   8080:30829/TCP

NAME                                      HOSTS
ingress.extensions/resiproxy-resiproxy    resiproxy.resiliency-testing.com
ingress.extensions/service1-go-service    service1.resiliency-testing.com
ingress.extensions/service2-go-service    service2.resiliency-testing.com
An external client that wants to configure ToxiProxy will use the ResiProxy ingress. A micro-service that wants to talk to a dependency will then use the ToxiProxy K8s service and the port previously created for that dependency.
By default, this service “exposes” only one port, the one where ToxiProxy listens for REST calls and to which ResiProxy forwards the configuration requests: 8474.
$ kubectl describe service resiproxy-toxiproxy -n resiliency-testing

Name: resiproxy-toxiproxy
Namespace: resiliency-testing
Labels: app=resiproxy
chart=resiproxy-0.0.1
heritage=Tiller
release=resiproxy
Annotations: <none>
Selector: app=resiproxy,release=resiproxy
Type: NodePort
IP: xxx.xxx.xxx.xxx
Port: http-toxiproxy 8474/TCP
TargetPort: 8474/TCP
NodePort: http-toxiproxy 31974/TCP
Endpoints: xxx.xxx.xxx.xxx:8474
Session Affinity: None
External Traffic Policy: Cluster
Events: <none>
To configure the bridge between service 1 and service 2, as described above, we need to use ResiProxy.
curl -X POST http://resiproxy.resiliency-testing.com/proxies \
-H 'Content-Type: application/json' \
-d '{
"name": "proxy_service2",
"listen": "[::]:8081",
"upstream": "service2-go-service.resiliency-testing:8080",
"enabled": true
}'
Now every request sent to the port 8081 of ToxiProxy will be forwarded to the port 8080 of service 2. And as expected, a new port is now opened at the ToxiProxy’s K8s service level:
$ kubectl describe service resiproxy-toxiproxy -n resiliency-testing

Name: resiproxy-toxiproxy
Namespace: resiliency-testing
Labels: app=resiproxy
chart=resiproxy-0.0.1
heritage=Tiller
release=resiproxy
Annotations: <none>
Selector: app=resiproxy,release=resiproxy
Type: NodePort
IP: xxx.xxx.xxx.xxx
Port: http-toxiproxy 8474/TCP
TargetPort: 8474/TCP
NodePort: http-toxiproxy 31974/TCP
Endpoints: xxx.xxx.xxx.xxx:8474
Port: 8081 8081/TCP
TargetPort: 8081/TCP
NodePort: 8081 30795/TCP
Endpoints: xxx.xxx.xxx.xxx:8081
Session Affinity: None
External Traffic Policy: Cluster
Events: <none>
Then we can redeploy service 1 so that it points to the new port opened by ToxiProxy, which redirects to service 2 (instead of pointing directly to service 2 as before):
$ helm upgrade service1 chart/ --namespace resiliency-testing --set DEPENDENCY_NAME=http://resiproxy-toxiproxy.resiliency-testing:8081
To sum up, in the resiliency-testing deployment model, service 1 now reaches service 2 through ToxiProxy, which is itself configured from outside the cluster through ResiProxy.
Happy vs Alternate path
It’s interesting to notice that after this initial deployment, the proxy is enabled by default and forwards any request initiated from service 1 to service 2:
{
"name": "proxy_service2",
"listen": "[::]:8081",
"upstream": "service2-go-service.resiliency-testing:8080",
"enabled": true,
"toxics": []
}
From the client side, testing the happy path is strictly identical to our first test.
Testing an alternate path, e.g. one where service 2 is not reachable, is astonishingly simple and requires just one prior REST request to disable the proxy.
curl -X POST \
http://resiproxy.resiliency-testing.com/proxies/proxy_service2 \
-H 'Content-Type: application/json' \
-d '{
"name": "proxy_service2",
"listen": "[::]:8081",
"upstream": "service2-go-service.resiliency-testing:8080",
"enabled": false
}'
The same REST request as before
curl -X GET http://service1.resiliency-testing.com/status
now returns a totally different result:
{
"status": "OK",
"name": "service1",
"dependencies": [
{
"status": "UNKNOWN",
"name": "http://resiproxy-toxiproxy.resiliency-testing:8081"
}
]
}
Testing this scenario verifies whether and how this failure was planned for: for example, by retrying, by returning a predefined error, or, as in this case, by returning a degraded partial response.
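As an illustration of the first option, a small retry wrapper around the dependency call could look like the sketch below. It is not part of the sample service, which simply reports the dependency as UNKNOWN on the first failure.

package resiliency

import (
	"fmt"
	"net/http"
	"time"
)

// retryGet calls a dependency up to `attempts` times before giving up,
// backing off a little more between each attempt. The caller is
// responsible for closing the body of a successful response.
func retryGet(client *http.Client, url string, attempts int) (*http.Response, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		resp, err := client.Get(url)
		if err == nil && resp.StatusCode < 500 {
			return resp, nil
		}
		if err != nil {
			lastErr = err
		} else {
			resp.Body.Close()
			lastErr = fmt.Errorf("unexpected status %d", resp.StatusCode)
		}
		time.Sleep(time.Duration(i+1) * 200 * time.Millisecond)
	}
	return nil, lastErr
}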
Keep calm and under control with Karate
Karate is the only open-source tool to combine API test-automation, mocks and performance-testing into a single, unified framework. The BDD syntax popularized by Cucumber is language-neutral, and easy for even non-programmers. Besides powerful JSON & XML assertions, you can run tests in parallel for speed — which is critical for HTTP API testing.
You can easily build (or re-use) complex request payloads, and dynamically construct more requests from response data. The payload and schema validation engine can perform a ‘smart compare’ (deep-equals) of two JSON or XML documents, and you can even ignore dynamic values where needed.
Test execution and report generation feels like any standard Java project. But there’s also a stand-alone executable for teams not comfortable with Java. Just write tests in a simple, readable syntax — carefully designed for HTTP, JSON, GraphQL and XML.
Karate is the tool we use to automate all the tests that, until now, we were running manually. This way we can easily integrate them into our CI/CD pipeline and increase our confidence in the overall robustness of our system in production.
This last part gives an overview of the Karate DSL by capturing all the previous examples in individual automated scenarios.
For the sake of readability, the first task is to isolate into 2 separate scenarios the REST calls used to:
1. create the proxy
Feature: Create service2 proxy: does not fail if the proxy already exists

Scenario:
* def validStatus = [201, 409]
* def proxy =
"""
{
"name": "proxy_service2",
"listen": "[::]:8081",
"upstream": "service2-go-service.resiliency-testing:8080",
"enabled": true
}
"""
Given url 'http://resiproxy.resiliency-testing.com/proxies'
And request proxy
When method post
Then match validStatus contains responseStatus
2. enable/disable it
Feature: Enable/Disable service2 proxy

Scenario:
* def enabled = __arg.enabled
* def proxy =
"""
{
"name": "proxy_service2",
"listen": "[::]:8081",
"upstream": "service2-go-service.resiliency-testing:8080",
"enabled": "#(enabled)",
}
"""
Given url 'http://resiproxy.resiliency-testing.com/proxies/proxy_service2'
And request proxy
When method post
Then status 200
Then we can define a scenario to capture the happy path that we were previously manually testing with the curl command:
Background:
* url 'http://service1.resiliency-testing.com'
* call read('failures/service2-delete.feature')
* call read('failures/service2-create.feature')

Scenario: Retrieve status when service 2 is available
* call read('failures/service2-enable.feature') { enabled: true }
Given path 'status'
When method get
Then status 200
Then match response ==
"""
{
"status":"OK",
"name":"service1",
"dependencies":[
{
"status":"OK",
"name":"service2"
}
]
}
"""
Finally, the alternate path when service 2 is unavailable can also be defined in a different scenario:
Scenario: Retrieve status when service 2 is not available
* call read('failures/service2-enable.feature') { enabled: false }
Given path 'status'
When method get
Then status 200
Then match response ==
"""
{
"status":"OK",
"name":"service1",
"dependencies":[
{
"status":"UNKNOWN",
"name":"#ignore"
}
]
}
While most of the time we only want to play with network latency or simulate unavailability, ToxiProxy supports a much wider range of failure conditions through what it calls toxics.
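As an example, ToxiProxy also ships a Go client (github.com/Shopify/toxiproxy/client). Adding a latency toxic to the proxy created earlier could look like the sketch below, assuming ResiProxy forwards the /toxics endpoints unchanged; the toxic name and values are arbitrary.

package main

import (
	toxiproxy "github.com/Shopify/toxiproxy/client"
)

func main() {
	// Point the client at the ResiProxy ingress, which relays the
	// administration calls to ToxiProxy.
	client := toxiproxy.NewClient("resiproxy.resiliency-testing.com")

	proxy, err := client.Proxy("proxy_service2")
	if err != nil {
		panic(err)
	}

	// Add 2 seconds of latency (with 100ms of jitter) to every response
	// coming back from service 2.
	if _, err := proxy.AddToxic("latency_down", "latency", "downstream", 1.0, toxiproxy.Attributes{
		"latency": 2000,
		"jitter":  100,
	}); err != nil {
		panic(err)
	}
}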
Karate provides a clear, readable BDD syntax (Given/When/Then) to capture these tests and prevent future regressions.
The complete code for these tests is available in this git repository.
Conclusion
This post was not about:
- describing the different types of failures that could happen in a distributed system,
- building a resilient application,
- detailing the different approaches for testing resiliency,
- or how later bringing observability to monitor, detect, and understand the failures and the strategies in place.
All of these would be good subjects for follow-up posts, although some of this content has already been widely covered elsewhere, starting with the excellent book Release It!.
Looking globally at the resiliency aspect of an architecture creates different opportunities, like providing a different user experience, scaling and supporting heavier loads, and generally, building a more robust system.
This post was about giving the right tools to a developer to focus on this aspect and to simply test it with Kubernetes, in addition to regular unit tests.
Another advantage is that someone else, like a QA team, can later look at what was done, check that the different pieces work together, and verify that they play nicely with outside systems.
ResiProxy/ToxiProxy and Karate are new tools to add to your testing toolbox. They help isolate these types of tests from the others and emphasize their importance.
You can look back at all the code listed throughout this post in the different git repositories: ResiProxy, the sample go micro-service and the Karate tests.