Chaos Engineering: Chaos Testing Your HTTP Microservices

Failing To Succeed And Succeeding At Failing

Andy Macdonald
Netflix’s Chaos Monkey is mostly responsible for popularising the concept of Chaos Engineering.

TL;DR: Your microservices are vulnerable to unexpected failure if the services they depend on fail in some way (and you don’t handle it). Fault-test your HTTP microservices using a “Chaos Proxy”.

Here’s one I made earlier: https://github.com/clusterfk/chaos-proxy

Chaos Engineering — What Is It?

Chaos Engineering is a great idea: build an automated tool that randomly attempts to break a system in some way, ultimately so you can learn how the system behaves in such situations. You can then use your newfound knowledge to make the system more fault-tolerant the next time those failure conditions occur for real.

What Is A Chaos Proxy?

A Chaos Proxy is a service that your microservices can connect to.

It routes traffic to the real destination microservices and relays their responses back to the caller, but it does so in a very unreliable way.

Through the proxy, requests are randomly delayed and/or fail in unexpected ways, all for the sole purpose of helping you understand how your microservice responds to these failure conditions.
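
To make this concrete, below is a minimal sketch of the idea in plain Java, using only the JDK’s built-in HttpServer and HttpClient: it forwards GET requests to a destination service, randomly delays them, and randomly fails about 20% of them with a 500. The port and destination URL are illustrative, and this is a toy, not how ClusterFk Chaos Proxy itself is implemented.

import com.sun.net.httpserver.HttpServer;

import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.ThreadLocalRandom;

public class TinyChaosProxy {

    // Illustrative values: the proxy listens on 8080 and forwards to a
    // hypothetical destination service on 8098.
    private static final String DESTINATION = "http://localhost:8098";
    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/", exchange -> {
            try {
                // Randomly delay every request by up to five seconds.
                Thread.sleep(ThreadLocalRandom.current().nextInt(5000));

                // Randomly fail roughly 20% of requests with a 500.
                if (ThreadLocalRandom.current().nextInt(100) < 20) {
                    exchange.sendResponseHeaders(500, -1);
                    return;
                }

                // Otherwise forward the GET request and relay the real response.
                HttpRequest request = HttpRequest.newBuilder()
                        .uri(URI.create(DESTINATION + exchange.getRequestURI()))
                        .build();
                HttpResponse<byte[]> response =
                        CLIENT.send(request, HttpResponse.BodyHandlers.ofByteArray());
                byte[] body = response.body();
                exchange.sendResponseHeaders(response.statusCode(),
                        body.length == 0 ? -1 : body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                exchange.close();
            }
        });
        server.start();
        System.out.println("Chaos proxy listening on http://localhost:8080");
    }
}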


Why Would Anyone Want An Unreliable HTTP Proxy?

Everything fails eventually. Everything.

Accept it and embrace failure. Design for failure. Succeed at failing.

Microservices often communicate with other services via REST and HTTP. How do your microservices cope when the services they depend on inevitably fail in some unpredictable way?

Courtesy of: https://blog.algorithmia.com/introduction-to-microservices/

Your microservices are vulnerable to unexpected failure if the services they depend on fail (and you haven’t accounted for the failure or defined how your service should behave).


Why Is This Useful?

Recently I was investigating a JDBC connection leak in a microservice.

With modern frameworks abstracting away JDBC operations, connection leaks shouldn’t really happen these days, but alas there was a connection leak.

Courtesy of: https://jsherz.com/leak/memory/connection/database/graphing/plotly/python/linux/2017/02/16/finding-connection-leak.html

I wanted to assess how resilient the microservice (A) was to failures and delays in another microservice (B) that it depended upon.

I needed a way to simulate periodic failures and delays in microservice ‘B’ while I performed requests and automated regression tests locally against microservice ‘A’.

I could access microservice ‘B’ in a remote environment, but because of various constraints, I couldn’t spin ‘B’ up locally to modify it to emit failures.

I couldn’t really find an existing tool that was lightweight, reasonably easy to set up, and that did what I needed.

After some fiddling around, the first iteration of ClusterFk Chaos Proxy was born!

Thanks to ClusterFk Chaos Proxy, I was able to identify that with sufficiently delayed responses from microservice ‘B’, the JDBC connections in microservice ‘A’ would stack up and stick around for as long as the HTTP request was active — even if the JDBC transaction had actually long since committed.
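
In simplified form, the failure mode looked something like the following hypothetical handler. This is a reconstruction for illustration only: raw JDBC and an in-memory H2 database (assumed on the classpath) stand in for the real framework-managed code.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class LeakyHandler {

    // Hypothetical request handler in microservice 'A'.
    static void handleRequest() throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:testdb");
             Statement statement = conn.createStatement()) {
            conn.setAutoCommit(false);
            statement.executeUpdate("CREATE TABLE IF NOT EXISTS audit(id INT)");
            conn.commit(); // the JDBC transaction is finished here...

            // ...but the connection stays checked out while the request
            // thread blocks on a slow call to microservice 'B'. Delay B's
            // responses (e.g. via the chaos proxy) and connections pile up,
            // one per active HTTP request.
            HttpClient client = HttpClient.newHttpClient();
            client.send(HttpRequest.newBuilder()
                            .uri(URI.create("http://localhost:8080/b-endpoint"))
                            .build(),
                    HttpResponse.BodyHandlers.discarding());
        } // only here does the connection go back to the driver/pool
    }

    public static void main(String[] args) throws Exception {
        handleRequest();
    }
}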

With the cause known, this opened up a range of possible solutions to the issue (and an easy way to test their effectiveness through the chaos proxy), e.g.:

  • Implement a controlled timeout on requests from ‘A’ to ‘B’ (sketched after this list).
  • Time out idle JDBC connections and return them to the connection pool.
  • Make elements of the processing asynchronous so the request thread exits sooner.
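
As a sketch of the first option: the JDK’s HttpClient supports both a connect timeout and a per-request timeout, so a call from ‘A’ to ‘B’ with a controlled timeout might look like the following (the endpoint is hypothetical).

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class ResilientClient {

    public static void main(String[] args) throws Exception {
        // Fail fast if 'B' is slow to accept the connection...
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))
                .build();

        // ...and fail fast if 'B' accepts but is slow to respond.
        // Hypothetical endpoint; point it at 'B' (or at the chaos proxy).
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/b-endpoint"))
                .timeout(Duration.ofSeconds(5))
                .build();

        // On a breached timeout this throws java.net.http.HttpTimeoutException:
        // a controlled failure, rather than a request thread (and its JDBC
        // connection) pinned open indefinitely.
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
    }
}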

ClusterFk Chaos Proxy

https://github.com/clusterfk/chaos-proxy

The premise is simple:

  • Configure your locally running service-under-test to point at the Chaos Proxy, and configure the Chaos Proxy to point at the real, running destination service.
  • Switch on ClusterFk Chaos Proxy and configure a “chaos strategy”.
  • Use your microservice (fire requests at it).
  • Watch the world burn (by monitoring the logs or the application’s behaviour).
  • Optional — Learn from the chaos and implement changes to improve the resilience of your microservice.
  • Repeat.

When I first put the Chaos Proxy together, I wasn’t really aware that the concept of a chaos proxy already existed, but I decided to finish the first iteration off anyway.


Getting Started

ClusterFk Chaos Proxy is on DockerHub. To install it, simply:

docker pull andymacdonald/clusterf-chaos-proxy

Then configure a docker-compose file with the destination service’s details. For example, if your ‘B’ service runs on http://10.0.0.231:8098:

version: "3.7"
services:
  user-service-chaos-proxy:
    image: andymacdonald/clusterf-chaos-proxy
    environment:
      JAVA_OPTS: "-Dchaos.strategy=RANDOM_HAVOC -Ddestination.hostProtocolAndPort=http://10.0.0.231:8098"
    ports:
      - "8080:8080"

Configure a chaos strategy as per the project’s README.md:

NO_CHAOS - Request is simply passed through
DELAY_RESPONSE - Requests are delayed but successful (configurable delay)
INTERNAL_SERVER_ERROR - Requests return with 500 INTERNAL SERVER ERROR
BAD_REQUEST - Requests return with 400 BAD REQUEST
RANDOM_HAVOC - Requests generally succeed, but randomly fail with random HTTP status codes and random delays

Then simply:

docker-compose up

Once the application is up, you can point the microservice(s) you want to test at your ClusterFk Chaos Proxy instances (instead of the real destination services). Then just fire up the microservice and start testing and using it.
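
Before wiring up the real service, you can also sanity-check the proxy by firing a few requests at it directly and watching the mix of statuses and delays come back. A minimal sketch (the /users endpoint on the proxied service is hypothetical):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ProxySmokeTest {

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Hypothetical endpoint on the destination service, via the proxy.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/users"))
                .build();

        // With RANDOM_HAVOC configured, expect a mix of 200s, random error
        // codes, and noticeably delayed responses.
        for (int i = 0; i < 10; i++) {
            long start = System.currentTimeMillis();
            HttpResponse<Void> response =
                    client.send(request, HttpResponse.BodyHandlers.discarding());
            System.out.printf("status=%d took=%dms%n",
                    response.statusCode(), System.currentTimeMillis() - start);
        }
    }
}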

Depending on the strategy you’ve picked, the proxy will apply it to the requests you send through it.

Probably the most useful strategies are RANDOM_HAVOC and DELAY_RESPONSE — but you still might find the others useful.

More features will be added in the future with more configurable options!


Suggestions

I’d appreciate it if you’d give some feedback on the project and let me know if you find it useful.

Thanks for reading! 😃

Hopefully, you’ve enjoyed this article and the introduction to the concept of a Chaos Proxy.

Although I’ve used my own personal project here, the concept is incredibly simple to implement. Feel free to take my project and fork it or just make your own implementation!
