Published in Better Practices

Learn how your Kubernetes clusters respond to failure using Gremlin and Grafana

Building resilient APIs with chaos engineering

What happens when one of your dependencies fails?

What is chaos engineering?

[Chaos engineering] incentivizes engineers to build their services to anticipate that some servers will suddenly go away… so they have to build their services to be redundant, highly available, and fault tolerant.

- Casey Rosenthal, CEO at Verica

Why worry about something that isn’t going to happen?

- HBO miniseries “Chernobyl”

Why look for trouble?

Who is responsible for chaos engineering?

I started doing [chaos engineering] so I would get woken up less in the middle of the night and better understand my software.

It boils down to who gets paged — if that’s an SRE or Ops team, they have the most incentive to start doing this work and making their lives better.

- Kolton Andrus, CEO at Gremlin

  • Specialized roles — Site Reliability Engineers (SRE), Production Engineers (PE)
  • Functional teams — DevOps, Test and Quality Assurance (QA), Research and Development (R&D)
  • Domain knowledge experts — Traffic, database, data, storage

How can you start a chaos program?

Perhaps aggregate bits and pieces from different [resilience engineering] frameworks that appeal to you, and then create a practice around it. You’ll likely be the first person to create a similar practice in your particular context.

I wish the best of luck to you in that undertaking, but I wouldn’t wager that you get it right on your first try. Or your second.

- Casey Rosenthal, CEO at Verica

How do you run a chaos experiment?

Make improvements, and automate the experiments to run continuously

When a company measures their critical services, APIs are often considered second-class citizens.

But APIs are a core part of an organization’s infrastructure, and not understanding their weaknesses can lead to performance issues and downtime.

- Tammy Butow, Principal Site Reliability Engineer at Gremlin

A Postman recipe for creating chaos with Amazon EKS and Gremlin

  • Trigger: Gremlin is a failure-as-a-service platform that offers a free version with limited attack types. We’ll use Postman with the Gremlin API to trigger our attacks. Spoiler alert: this lets us easily automate these chaos tests in our continuous integration pipeline.
  • Target: We’ve previously talked about deploying scalable apps with Docker and Kubernetes. Amazon EKS is a managed Kubernetes service that runs on AWS. It ain’t cheap. If you’re already running on hosts, containers, or another cloud platform, swap out EKS for your own target.
  • Observability: We’ll use Prometheus as our time-series database and Grafana to visualize the effects of our attacks. Both are open source and have a free version. If you’re already using something else for steady-state monitoring, swap out Prometheus and Grafana for your own toolset.

Set up Gremlin and create a Kubernetes cluster on EKS

1) Sample app deployed in EKS, and 2) Gremlin installed using Kubernetes Dashboard

Set up Grafana and Prometheus

  • Step 0 — Create a GKE cluster (skip)
  • Step 1 — Lots and lots and lots of yaml configuration
  • Step 2 — Configure your cluster settings on Grafana (skip)
1) Add a datasource of Prometheus type, and 2) import dashboard 3131 for an overview of all nodes in your cluster
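
Once Prometheus is collecting metrics, the same data Grafana charts can also be read from a script, which is handy for checking steady state before and after an attack. The following is a minimal sketch, assuming Prometheus is reachable at a base URL you supply (`http://localhost:9090` in the usage note is a placeholder) and using the built-in `up` metric as the example query. The helper names are hypothetical and not part of the original setup.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(payload: dict) -> dict:
    """Map each instance label to its current sample value."""
    if payload.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {payload.get('error')}")
    values = {}
    for sample in payload["data"]["result"]:
        instance = sample["metric"].get("instance", "unknown")
        _timestamp, value = sample["value"]  # value arrives as a string
        values[instance] = float(value)
    return values


def fetch_metric(base_url: str, promql: str, timeout: float = 5.0) -> dict:
    """Run the query against a live Prometheus server (performs network I/O)."""
    with urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return parse_instant_vector(json.load(resp))
```

For example, `fetch_metric("http://localhost:9090", "up")` would return a dictionary keyed by instance, where a value of 1.0 means the scrape target was reachable at query time.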

Programmatically manage your chaos experiments

Import the Postman template, which includes requests to:
  1. Get a list of our active containers
  2. Shut down a specified container
  3. Verify the health of our app
  4. Stop the attack (if you need to)
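
The four steps above can also be sketched outside Postman as plain request descriptions. This is an illustration under stated assumptions, not the contents of the Postman template: the base URL, the `/containers` and `/attacks` paths, and the attack payload shape reflect our understanding of the Gremlin API and should be checked against Gremlin's API reference. `ApiCall` and the helper function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed base URL for the Gremlin REST API; confirm against Gremlin's docs.
GREMLIN_API = "https://api.gremlin.com/v1"


@dataclass(frozen=True)
class ApiCall:
    """One HTTP request in the experiment, described as plain data."""
    method: str
    url: str
    body: Optional[dict] = None


def list_containers() -> ApiCall:
    """Step 1: get a list of active containers (assumed endpoint)."""
    return ApiCall("GET", f"{GREMLIN_API}/containers")


def shutdown_container(container_id: str, delay_minutes: int = 1) -> ApiCall:
    """Step 2: shut down one specific container (assumed payload shape)."""
    body = {
        "command": {"type": "shutdown", "args": ["-d", str(delay_minutes)]},
        "target": {"type": "Exact", "containers": {"ids": [container_id]}},
    }
    return ApiCall("POST", f"{GREMLIN_API}/attacks/new", body)


def check_health(app_url: str) -> ApiCall:
    """Step 3: verify the app still responds; app_url is your own endpoint."""
    return ApiCall("GET", app_url)


def halt_attack(attack_id: str) -> ApiCall:
    """Step 4: stop a running attack by its identifier (assumed endpoint)."""
    return ApiCall("DELETE", f"{GREMLIN_API}/attacks/{attack_id}")
```

Keeping each request as data makes the sequence easy to review before anything is sent. Any HTTP client can then execute the calls, adding your Gremlin credentials to the Gremlin requests only.
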
1) Shut down a container using the Gremlin API, and 2) see the effects in your browser
1) Stop the attack if we see a 500 Internal Server Error in Postman, and 2) run our chaos tests automatically in the collection runner
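
The abort rule in this step (stop the attack as soon as the app returns a server error) can be expressed as a small polling loop. This is a rough sketch, not the logic shipped in the Postman collection: `get_status` and `halt` are hypothetical callables you would wire to your health check and to the request that stops the attack, and the default number of checks and interval are arbitrary.

```python
import time
from typing import Callable


def watch_and_halt(
    get_status: Callable[[], int],
    halt: Callable[[], None],
    checks: int = 5,
    interval_seconds: float = 2.0,
) -> bool:
    """Poll the app's health; halt the attack on the first 5xx response.

    Returns True if the attack was halted, False if every check passed.
    """
    for attempt in range(checks):
        if get_status() >= 500:
            halt()  # stop the experiment before it does more damage
            return True
        if attempt < checks - 1:
            time.sleep(interval_seconds)  # wait before the next health check
    return False
```

Passing the health check and the halt action in as functions keeps the loop independent of any particular HTTP client, so the same rule can run in a scheduled job or a continuous integration step.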

A final thought about chaos engineering

The biggest limitation in the fear of delivering software faster is the focus on adding more pre-release testing.

Chaos engineering is all about building trust in our resiliency and mean time to recovery. In time, we have less fear that any one change will bring down our products and when issues do occur, we are practiced in triaging and deploying fixes faster, building confidence that we aren’t fragile.

- Abby Bangser, Platform Test Engineer at MOO
