Chaos Testing

Surya Bhagvat
Harness Engineering
4 min read · May 14, 2021

We want to provide an update on our chaos testing, in which we induced failures in different components of our production system to test the resiliency and dependencies of the microservices that make up the Harness infrastructure. This blog post is the first in a series on chaos testing, and we are planning more for our upcoming maintenance windows.

We wanted to cover a few things as part of this testing:

  1. What happens if the internal traffic to Redis that was blocked by the firewall, causing one of our incidents, were to surface again? Would the firewall block the traffic and cause another incident?
  2. What happens if TimescaleDB, which we primarily use for custom dashboards, cloud costs, and analytics data, were to go down? Our second incident happened because NAT allocation errors, combined with TimescaleDB being part of the core service pod’s health check, brought down the system.
  3. What happens when the MongoDB replica sets go into routine maintenance and replica set elections take place? Would the application be resilient enough to tolerate these rolling updates?

In addition to the above, we wanted to test what happens if we bring down specific services of a module (Cloud Cost Management and Continuous Verification). What impact would that have on the core Harness Continuous Deployment product?

Firewall Incident

If we go back to the incident we had with the firewall blocking traffic (https://medium.com/harness-engineering/learning-from-our-recent-production-issues-20973ff47701), it had to do with an anomalous packet flowing through the wire that matched the signature of a possible OpenSSL HeartBleed rule set. This was internal traffic between our Manager service pod and the GCP Managed Memorystore (Redis).

We worked with the team and whitelisted the internal CIDR range of our Kubernetes cluster so that internal traffic does not get wrongly classified and blocked. We have also changed the policies so that, by default, internal traffic between pods, or between the pods and the GCP managed services and the Mongo database, is not blocked automatically but instead alerts us on Slack. Since we turned on alerting mode, we notice from time to time that some packet signatures still get incorrectly classified against the OpenSSL HeartBleed rule set, but the traffic is not blocked.
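
As a rough illustration of how this kind of change can be verified during a chaos window, a simple connectivity probe can be run from a pod inside the cluster to confirm that internal traffic to Memorystore is flowing and only alerting, not being dropped. This is a minimal sketch, not our actual tooling; the host and IP below are placeholders.

```python
# Illustrative connectivity probe for GCP Memorystore (Redis), run from a pod
# inside the cluster. The host/port are hypothetical placeholders.
import time
import redis

REDIS_HOST = "10.0.0.3"   # hypothetical Memorystore internal IP
REDIS_PORT = 6379

def probe_redis(retries: int = 3, delay_seconds: float = 2.0) -> bool:
    """Return True if Redis answers PING; retry a few times before giving up."""
    client = redis.Redis(host=REDIS_HOST, port=REDIS_PORT, socket_timeout=2)
    for attempt in range(1, retries + 1):
        try:
            if client.ping():
                print(f"attempt {attempt}: Redis reachable")
                return True
        except redis.exceptions.RedisError as exc:
            print(f"attempt {attempt}: Redis unreachable: {exc}")
            time.sleep(delay_seconds)
    return False

if __name__ == "__main__":
    if not probe_redis():
        raise SystemExit("internal traffic to Memorystore appears blocked")
```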

Inducing failures in connecting to TimescaleDB

As mentioned, we use TimescaleDB for custom dashboards, cloud costs, and analytics data. TimescaleDB is hosted in-house on GCP, and we were using Cloud NAT for the pods to communicate with Timescale over an external static IP. We assign the static external IPs for Cloud NAT manually since we whitelist them with one of our internal products. Because of the NAT allocation errors, pod connectivity to Timescale dropped. Since the code had recently added TimescaleDB to the health check and probes for the core Manager pod, the probes failed, resulting in the incident. We made two changes:

  1. We removed Timescale connectivity from the critical path of the probes (a sketch of this idea follows the list below).
  2. We moved to VPC peering and a static internal IP for the pods to communicate with Timescale, taking Cloud NAT out of the equation.
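
To illustrate the first change: the idea is that the liveness/readiness probes should only fail on dependencies that are truly critical to the core product, while an analytics store like Timescale is reported as degraded without failing the probe. The sketch below is a hypothetical example, not how our Manager service is actually written, and the check functions are placeholders.

```python
# Minimal sketch of a health endpoint that keeps TimescaleDB out of the
# probe's critical path. The check functions are hypothetical placeholders.
from flask import Flask, jsonify

app = Flask(__name__)

def check_mongo() -> bool:
    """Critical dependency: the probe should fail if this is down."""
    return True  # placeholder for a real MongoDB ping

def check_timescale() -> bool:
    """Non-critical dependency: report status, but never fail the probe."""
    return False  # placeholder for a real TimescaleDB ping

@app.route("/health")
def health():
    critical_ok = check_mongo()
    status = {
        "mongo": "up" if critical_ok else "down",
        "timescale": "up" if check_timescale() else "degraded",
    }
    # Only the critical dependencies decide the HTTP status the probe sees.
    return jsonify(status), (200 if critical_ok else 503)

if __name__ == "__main__":
    app.run(port=8080)
```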

As part of chaos testing, we brought TimescaleDB down by removing the firewall rule that lets pods communicate with Timescale. As expected, this meant the Harness platform’s custom dashboards and cloud costs did not show up, but it didn’t bring down the core functionality of the Continuous Deployment product. We have a read-only DR Timescale instance in us-west2, and when we pointed our services to this DR instance, we found a bug in the code where it could not connect to the DR instance; we are addressing it.
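
For context, the failover pattern this test exercised looks roughly like the sketch below: try the primary Timescale endpoint first, then fall back to the read-only DR instance. This is an illustrative Python example rather than our actual code, and the hostnames and credentials are placeholders.

```python
# Illustrative primary-then-DR connection fallback for TimescaleDB (Postgres).
# Hostnames and credentials are placeholders, not our real endpoints.
import psycopg2

PRIMARY = {"host": "timescale-primary.internal", "dbname": "harness",
           "user": "app", "password": "secret", "connect_timeout": 5}
DR_READ_ONLY = {"host": "timescale-dr-uswest2.internal", "dbname": "harness",
                "user": "app", "password": "secret", "connect_timeout": 5}

def connect_with_fallback():
    """Try the primary first; fall back to the read-only DR instance."""
    for label, params in (("primary", PRIMARY), ("dr", DR_READ_ONLY)):
        try:
            conn = psycopg2.connect(**params)
            print(f"connected to {label} Timescale instance")
            return conn, label
        except psycopg2.OperationalError as exc:
            print(f"could not connect to {label}: {exc}")
    raise RuntimeError("no Timescale instance reachable")
```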

MongoDB maintenance and elections

We currently use the Atlas managed MongoDB database as the primary database for the Harness platform. As part of regular maintenance, including OS security patches and other software updates, MongoDB replica sets use elections to decide which set member will become the primary node. We tested to make sure that the Harness platform can tolerate these elections as well as the failover.
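
Tolerating an election generally comes down to using a driver configured for retryable reads and writes and absorbing the brief window while a new primary is elected. The sketch below uses PyMongo purely as an illustration; the connection string and collection names are placeholders, and this is not our production code.

```python
# Illustrative PyMongo setup that tolerates replica set elections.
# The connection string and collection names are hypothetical placeholders.
import time
from pymongo import MongoClient
from pymongo.errors import AutoReconnect

client = MongoClient(
    "mongodb+srv://cluster0.example.mongodb.net/harness",  # placeholder URI
    retryWrites=True,            # retry a write once after a failover
    retryReads=True,             # retry a read once after a failover
    serverSelectionTimeoutMS=10000,
)

def insert_with_retry(doc, attempts: int = 3):
    """Retry briefly while a new primary is being elected."""
    coll = client["harness"]["audit_events"]
    for attempt in range(1, attempts + 1):
        try:
            return coll.insert_one(doc)
        except AutoReconnect as exc:
            print(f"attempt {attempt}: primary unavailable: {exc}")
            time.sleep(2)
    raise RuntimeError("write failed across an election window")
```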

Scaling down the Cloud Cost Management pods

The Harness Cloud Cost Management module provides visibility into cloud costs. This module consists of two core services, the event service and the batch processing service. These two services monitor your infrastructure costs and plan your budgets. As part of this chaos testing, we first scaled the event service pods down to 0 (a sketch of how a deployment can be scaled to zero follows the list below). The actual behavior was in line with our expectations.

  • The Harness Continuous Deployment product continued to work as expected, with no issues in deployments.
  • For existing customers, the cloud cost management dashboards continued to be functional.
  • New clusters added for cost analysis stopped receiving any events.
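
For reference, scaling a service’s pods to zero (and back) can be done through the Kubernetes API. The sketch below uses the official Kubernetes Python client; the deployment name and namespace are placeholders rather than our real ones.

```python
# Illustrative use of the Kubernetes Python client to scale a deployment
# to zero replicas and back. Deployment name and namespace are placeholders.
from kubernetes import client, config

def scale_deployment(name: str, namespace: str, replicas: int) -> None:
    """Patch the deployment's scale subresource to the desired replica count."""
    config.load_kube_config()  # or config.load_incluster_config() inside a pod
    apps = client.AppsV1Api()
    body = {"spec": {"replicas": replicas}}
    apps.patch_namespaced_deployment_scale(name, namespace, body)
    print(f"scaled {namespace}/{name} to {replicas} replicas")

# Scale down for the chaos test, then restore afterwards.
scale_deployment("event-service", "harness", 0)   # hypothetical names
# ... observe behavior ...
scale_deployment("event-service", "harness", 1)
```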

Next, we scaled the batch processing pods down to 0. In this case as well, the behavior was as expected.

  • The Harness Continuous Deployment product continued to work as expected, with no issues in deployments.
  • For existing customers, the cloud cost management dashboards continued to be functional.
  • Only the billing data generation was delayed.

Scaling down the Continuous Verification pods

The Harness Continuous Verification module monitors your application for anomalies. This module consists of two core services, the verification service and the learning engine. As part of this chaos testing, we first scaled the verification service pods down to 0. The actual behavior was in line with our expectations.

  • The Harness Continuous Deployment product continued to work as expected, with no issues in deployments, as long as the deployment had no verification step.
  • Data collection stopped for workflows with a verification step and for Service Guard.
  • Upon scaling the services back up, things went back to a working state and data collection resumed.

Next, we scaled the learning engine pods down to 0. In this case as well, the behavior was as expected.

  • The Harness Continuous Deployment product continued to work as expected, with no issues in deployments, as long as the deployment had no verification step.
  • Analysis stopped happening for Service Guard.
  • Upon scaling the services back up, things went back to a working state and analysis resumed.
