Hardening CockroachDB — Chaos Engineering with Gremlin

Tammy Butow
Published in Chaos Engineering · Sep 26, 2020

Gremlin is a simple, safe and secure service for performing Chaos Engineering experiments through a SaaS-based platform. CockroachDB is an elastic, indestructible SQL database for developers building modern applications.

This tutorial will teach you how to do Chaos Engineering on CockroachDB using Gremlin.

This tutorial shows:

  • How to install CockroachDB
  • How to install Gremlin
  • How to practice Chaos Engineering on CockroachDB — specific use cases and examples

Chaos Engineering Hypothesis

For the purposes of this tutorial, we will run Chaos Engineering experiments on CockroachDB. We will focus on a specific set of use cases that we have crafted into Gremlin Scenarios, including understanding clock skew constraints and what happens when our primaries and replicas fail. We will utilize resource, network and state Chaos Engineering attacks.

Known failure modes of CockroachDB:

CockroachDB requires moderate levels of clock synchronization to preserve data consistency. For this reason, when a node detects that its clock is out of sync with at least half of the other nodes in the cluster by 80% of the maximum offset allowed (500ms by default), it spontaneously shuts down. This avoids the risk of consistency anomalies, but it’s best to prevent clocks from drifting too far in the first place by running clock synchronization software on each node.

ntpd should keep offsets in the single-digit milliseconds, so that software is featured here, but other methods of clock synchronization are suitable as well.
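If you want to confirm that clock synchronization is actually working on each node before running any experiments, here is a minimal check sketch, assuming Ubuntu nodes with ntpd installed from apt (low offset and jitter values in the ntpq output indicate healthy sync):

sudo apt-get install -y ntp        # install ntpd
ntpq -p                            # list NTP peers; check the offset and jitter columns
timedatectl                        # confirm "System clock synchronized: yes"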

To ensure we correctly understand and can handle this failure mode we’ll run a Gremlin Time Travel Scenario to change the clock time.

What other failure modes should we investigate?

  1. Monitoring & Alerting — validate your monitoring and alerting for CockroachDB. For example, are you alerted when there is an issue? Do you measure time to promote a new replica and alert when promotion takes too long (a useful SLO/SLI example)? Do you know if QPS (queries per second) has slowed down? Do you measure and alert on QPS (also a useful SLO/SLI)? A quick check sketch follows this list.
  2. Failover — if one node fails, does the load balancer redirect client traffic to available nodes?
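As a starting point for those checks, here is a small sketch that probes each node's readiness endpoint and pulls a raw query counter from CockroachDB's Prometheus endpoint. The NODE_IP placeholders are whatever addresses your nodes use, and sql_query_count is my assumption for a metric you could derive QPS from; verify the name against your cluster's /_status/vars output.

for node in NODE_IP_1 NODE_IP_2 NODE_IP_3; do
  curl -sf --max-time 5 "http://$node:8080/health?ready=1" > /dev/null || echo "$node failed readiness check"
  curl -s "http://$node:8080/_status/vars" | grep "^sql_query_count"    # raw counter; sample it over time to get QPS
done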

The CockroachDB team explains:

Despite CockroachDB’s various built-in safeguards against failure, it is critical to actively monitor the overall health and performance of a cluster running in production and to create alerting rules that promptly send notifications when there are events that require investigation or intervention.

For details about available monitoring options and the most important events and metrics to alert on, see Monitoring and Alerting.

Prerequisites

Step 1 — Install CockroachDB

In this step, you’ll install CockroachDB. First, create 3 instances within a VPC so that you can enable and utilize private networking.

wget -qO- https://binaries.cockroachdb.com/cockroach-v20.1.6.linux-amd64.tgz | tar xvz
cp -i cockroach-v20.1.6.linux-amd64/cockroach /usr/local/bin/
cockroach version

These are the details for my nodes:

cockroach-loadbalancer = 138.197.49.52 (public IP)
cockroach_01 = 104.131.99.215 (public IP) and 10.108.0.2 (private IP)
cockroach_02 = 159.203.127.47 (public IP) and 10.108.0.3 (private IP)
cockroach_03 = 64.225.18.75 (public IP) and 10.108.0.4 (private IP)

To start node 01 (cockroach_01) I run the following:

cockroach start --insecure --background --advertise-host=10.108.0.2

Now go to http://104.131.99.215:8080 to view your CockroachDB admin portal; it will look like the screen below:

Next, let’s add the two additional nodes we created to our cluster. For cockroach_02, in my case this is:

cockroach start --insecure --background \
--advertise-host=10.108.0.3 \
--join=10.108.0.2:26257

For cockroach_03, in my case this is:

cockroach start --insecure --background --advertise-host=10.108.0.4 --join=10.108.0.2:26257

Now we have our 3-node CockroachDB cluster set up and ready to use:
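To confirm that all three nodes have joined before moving on, you can ask any node for the cluster’s view of membership. A quick check, assuming the insecure cluster started above:

cockroach node status --insecure --host=10.108.0.2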

Step 2 — Setup Load Balancing

While we are setting up load balancing, let’s pay attention to the different failure modes we should prepare for. I’m using DigitalOcean, and you can see how to do this on DO below.

  • Set forwarding rules to route TCP traffic from the load balancer’s port 26257 to port 26257 on the node Droplets.
  • Configure health checks to use HTTP port 8080 and path /health?ready=1. This health endpoint ensures that load balancers do not direct traffic to nodes that are live but not ready to receive requests.
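You can hit that readiness endpoint yourself to see exactly what the load balancer sees. A quick probe against one of the nodes set up earlier; a node that is live but not yet ready to accept SQL connections returns a non-200 status:

curl -si "http://104.131.99.215:8080/health?ready=1" | head -1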

Now we’ve got our load balancer created and it’s up and running:

Step 3 — Create a database table

We’ll create a table, insert some data, and then check that the data replicates across all nodes:

CREATE TABLE users (
    id UUID NOT NULL DEFAULT gen_random_uuid(),
    city STRING NOT NULL,
    name STRING NULL,
    address STRING NULL,
    credit_card STRING NULL,
    CONSTRAINT "primary" PRIMARY KEY (city ASC, id ASC),
    FAMILY "primary" (id, city, name, address, credit_card)
);

INSERT INTO users (name, city) VALUES ('Petee', 'new york'), ('Eric', 'seattle'), ('Dan', 'seattle');

SELECT * FROM users;

Check all three nodes and you will see that the table and its data are now stored on all of them.
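One way to check this from the command line rather than the admin UI is to query a different node and then look at where the table’s ranges are replicated. A sketch, assuming the table was created in the default database of the insecure cluster above:

cockroach sql --insecure --host=10.108.0.3 --execute="SELECT * FROM users;"
cockroach sql --insecure --host=10.108.0.2 --execute="SHOW RANGES FROM TABLE users;"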

Step 4 — Monitor the cluster

We’ve set up Datadog to monitor the cluster using the Ubuntu agent. Below is the default dashboard for cockroach_01.
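If you are wiring this up yourself, a minimal sketch of the Datadog CockroachDB check looks like the following. The config path and the prometheus_url key are the agent defaults at the time of writing, so treat them as assumptions and verify against the integration docs for your agent version:

sudo tee /etc/datadog-agent/conf.d/cockroachdb.d/conf.yaml <<'EOF'
init_config:
instances:
  - prometheus_url: http://localhost:8080/_status/vars
EOF
sudo systemctl restart datadog-agent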

Step 5 — Install Gremlin

Now we’re ready to start doing Chaos Engineering. Let’s install Gremlin on each of our nodes:

echo "deb https://deb.gremlin.com/ release non-free" | sudo tee /etc/apt/sources.list.d/gremlin.list
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 9CDB294B29A5B1E2E00C24C022E8EF3461A50EF6
sudo apt-get update && sudo apt-get install -y gremlin gremlind
gremlin init

To finish installing your agent you’ll need to grab your Team ID and Secret Key from app.gremlin.com/settings/team as shown in the image below:
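If you would rather not run gremlin init interactively on each node, the agent can also read its credentials from a config file. This is a sketch assuming the default /etc/gremlin/config.yaml location and the team_id / team_secret keys described in the Gremlin docs; the placeholder values are the Team ID and Secret Key you copied above:

sudo tee /etc/gremlin/config.yaml <<'EOF'
team_id: YOUR_TEAM_ID
team_secret: YOUR_TEAM_SECRET
EOF
sudo systemctl restart gremlind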

Once you have installed the Gremlin agent on your CockroachDB nodes, they will appear in the Gremlin Clients list as shown below:

Step 6 — Prepare to run your Gremlin Scenarios on CockroachDB

We will be running the following Gremlin Scenarios on CockroachDB:

  • CockroachDB Replica Blackhole — make one node unavailable
  • CockroachDB Replica Clock Skew — change the clock time for one node
  • CockroachDB Validate Monitoring & Alerting — ensure your monitoring works continuously and you don’t experience issues during different failure modes

Step 6.1 — Gremlin Scenario: CockroachDB Replica Blackhole

In this scenario, we’ll be running a Gremlin Blackhole attack to determine the impact on CockroachDB.
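While the Blackhole Scenario runs, it helps to watch the cluster from a machine that is not under attack. Here is a small observation sketch; ATTACKED_NODE_IP is a placeholder for whichever node you target, and the second command goes through the load balancer set up in Step 2:

curl -sf --max-time 5 "http://ATTACKED_NODE_IP:8080/health?ready=1" > /dev/null || echo "attacked node unreachable"
cockroach sql --insecure --host=138.197.49.52:26257 --execute="SELECT count(*) FROM users;"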

SUCCESS 🤯

This was a rather large-scope failure for me: I was unable to use the node, experienced monitoring data loss, and could not SSH into it. The node was still part of the cluster even though I was unable to reach it. I didn’t see any alerts in the CockroachDB monitoring admin page or within Datadog, so as a next step I should resolve that. When I halted the Blackhole, I was once again able to use the node. What happens when you do this on your CockroachDB cluster?

The monitoring dashboard below shows the loss in monitoring data during the Blackhole attack:

Next, I filled in the Scenario results page to record the impact and critical action items going forward:

Step 6.2 — CockroachDB Replica Clock Skew

In this scenario, we’ll be running a Gremlin Time Travel attack to determine the impact on CockroachDB. Since the CockroachDB documentation explains that a node will shut down if its clock skews too far from the rest of the cluster, we expect this to be the outcome of the experiment. Here is the Scenario:

Be sure to block NTP when running this Time Travel attack:

Next, click to run the Scenario and see the result of your Chaos Engineering work. I noticed an error message appear in the terminal of the node whose clock I had changed (cockroach_03).

FATAL ERROR 😬

I then noticed a few unexpected things happened:

  • All nodes were unresponsive
  • All nodes were listed as “Dead Nodes” in the CockroachDB admin screen
  • All nodes later disappeared from the CockroachDB admin screen and I saw a spinning wheel for each of the nodes

Record the results of your Gremlin Scenario: what did you experience? Was your result the same as mine? Did your entire cluster die?
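When recording what happened, it can help to capture the skew from the nodes’ own point of view. This is a sketch assuming root SSH access to the public IPs from Step 1; the clock_offset grep only works while a node is still serving its /_status/vars endpoint, and the metric name is my assumption:

for ip in 104.131.99.215 159.203.127.47 64.225.18.75; do echo -n "$ip: "; ssh root@$ip date -u; done
curl -s "http://104.131.99.215:8080/_status/vars" | grep clock_offset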

This Datadog monitoring screen shows how cockroach_01 also died and was no longer up and running:

Important actions to take to improve reliability

  1. Synchronize clocks — learn about Google’s NTP service, which handles “smearing” the leap second (see the sketch after this list). If you’re not using this service, look into client-side smearing.
  2. Monitoring & Alerting — validate your monitoring and alerting for CockroachDB. For example, are you alerted when there is an issue? Do you measure time to promote a new replica and alert when promotion takes too long (a useful SLO/SLI example)? Do you know if QPS (queries per second) has slowed down? Do you measure and alert on QPS (also a useful SLO/SLI)?
  3. Load balancer — does your load balancer spread client traffic across nodes? Does it prevent any one node from being overwhelmed by requests? Does it improve overall cluster performance (QPS)?
  4. Failover — if one node fails, does the load balancer redirect client traffic to available nodes?
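Here is that sketch: a minimal ntpd configuration pointing at Google’s public NTP service, which smears the leap second. It replaces the stock Ubuntu ntp.conf, so treat it as a starting point rather than a complete config:

sudo tee /etc/ntp.conf <<'EOF'
server time1.google.com iburst
server time2.google.com iburst
server time3.google.com iburst
server time4.google.com iburst
EOF
sudo systemctl restart ntp
ntpq -p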

Conclusion

This tutorial has explored how to perform Chaos Engineering experiments on CockroachDB using Gremlin. We learned how to use Gremlin to practice Chaos Engineering and identified important questions to ask about the failure modes we should be prepared for.
