0830 — Harness production incident due to Database Nodes going down

Published in Harness Engineering · 4 min read · Aug 31, 2022

We want to share the details of the production incident that impacted our customers [in Prod-2 cluster] on 08/30 between 6:10 PM and 7:00 PM PT. The incident brought down all back-end services. We apologize to our impacted customers and would like to share the incident details with full transparency.

Root cause

As part of the effort to tune unoptimized queries on a large database [Mongo] collection, we had to create a new index and drop an existing one on this collection. This resulted in the secondary database nodes going down, impacting all our services in our Prod-2 cluster. Please find below additional details from our MongoDB service provider.

MongoDB version 4.2 has a known defect, described under "Dropping an Index during Index Replication". The recommendation is to avoid dropping an index on a collection while any index build is still being replicated on a secondary. Otherwise, the two index operations conflict.

The mongod logs show this behavior: the command to drop the index was issued while the index build was still being replicated on the secondary.

This behavior is addressed in MongoDB version 4.4 and later.

Timeline

We had a few alerts triggered in our Prod-2 cluster on 08/30 between 6 AM and 12 PM PT. These alerts were related to unoptimized queries hitting our production database on one of our largest collections. The engineers and the Database Reliability Team quickly identified the cause and came up with a solution: add an index better suited to the query shape, and drop an existing index to ensure the newly created index gets used.
We usually perform all our deployments between 4 and 5 PM PT, and we decided to perform the index operations around the same time. Because this is one of our largest collections, the index operation took close to an hour.
Around 6:00 PM PT, we got alerts that a few of our pods were restarting. At this point, there was no cause for concern, as a pod's health check sometimes fails when the DB is overloaded, which was the case here.
Around 6:01 PM PT, we got the alert on MongoDB that the secondary host was down, but the primary was still up and serving traffic. This happened because of the index creation and subsequent drop described above.
Between 6:01 PM and 6:10 PM PT, SREs and DBREs were looking into the service health dashboards and realized that most of our pods had gone into CrashLoopBackOff.
Around 6:10 PM PT, we noticed that only the Mongo primary node was up and that the secondaries in US-West1 and US-West2 were down. This triggered the Harness incident response protocol.
Between 6:10 PM and 6:49 PM PT, we raised a P1 incident with MongoDB support. They told us that the secondary nodes would come back up only once the replication lag with the primary cleared and the index build completed across the secondary nodes.
Between 6:50 PM and 6:58 PM PT, we got the alert that our secondary nodes were back up.
Between 6:58 PM and 7:05 PM PT, we also started seeing the pods restart successfully. All our queued delegate tasks were being processed, and to speed up this processing, we increased the replica count for some of the pods. At this point, our site was functional.
After thoroughly running our sanity pipelines, we updated our status page around 7:17 PM PT to reflect that the site was functional.
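
The recovery condition MongoDB support described (replication lag clearing between the primary and the secondaries) can be sketched as a small check. This is an illustration only, not the tooling we used during the incident. It assumes member documents shaped like the `members` array of `replSetGetStatus` output (fields `name`, `stateStr`, `optimeDate`); the function names, the 10-second threshold, and the host names in the sample are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, List, Optional


def replication_lag(members: List[Dict[str, Any]]) -> Dict[str, Optional[timedelta]]:
    """Return each non-primary member's lag behind the primary.

    A value of None means the member reported no optimeDate, so its lag is unknown.
    """
    primaries = [m for m in members if m.get("stateStr") == "PRIMARY"]
    if len(primaries) != 1:
        raise ValueError("expected exactly one PRIMARY member")
    primary_optime = primaries[0]["optimeDate"]

    lags: Dict[str, Optional[timedelta]] = {}
    for member in members:
        if member.get("stateStr") == "PRIMARY":
            continue
        optime = member.get("optimeDate")
        lags[member["name"]] = primary_optime - optime if optime is not None else None
    return lags


def lag_cleared(members: List[Dict[str, Any]],
                threshold: timedelta = timedelta(seconds=10)) -> bool:
    """True only if every non-primary member has a known lag within the threshold."""
    return all(
        lag is not None and lag <= threshold
        for lag in replication_lag(members).values()
    )


if __name__ == "__main__":
    # Made-up sample data; host names and timestamps are illustrative only.
    sample_members = [
        {"name": "node-a:27017", "stateStr": "PRIMARY",
         "optimeDate": datetime(2022, 8, 30, 18, 30, 0)},
        {"name": "node-b:27017", "stateStr": "SECONDARY",
         "optimeDate": datetime(2022, 8, 30, 18, 29, 15)},
    ]
    print(replication_lag(sample_members))
    print(lag_cleared(sample_members))
```

Treating an unknown lag as "not cleared" is a deliberately conservative choice: a node that reports nothing should not be counted as caught up.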

Remediation

Having the right database indexes is integral to ensuring our platform serves customers with low latency. At the same time, indexes need to change as new queries are introduced into the system. SREs and DBREs will define a process for handling these situations and reducing the risk when new index builds need to be deployed.
Our MongoDB service provider recommends upgrading MongoDB to 4.4, where this defect is addressed. We plan to upgrade MongoDB to 4.4 before the end of Q3.
The MongoDB team also recommended that, while we are still running 4.2, we not run the dropIndexes command while an index is being built on secondary nodes. We will follow this practice and monitor the MongoDB secondary node logs to check the status of index builds before dropping any index.
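
The "check before dropping" practice could be sketched as a guard like the one below. This is a minimal illustration, not our actual process. It assumes you have already collected the in-progress operations from each node (for example, the `inprog` array returned by `currentOp`) and that index builds can be recognized either by a `createIndexes` command or by a message starting with "Index Build", which mirrors the filter MongoDB documents for active indexing operations. The function names and node names are hypothetical.

```python
import re
from typing import Any, Dict, List

# Assumption: index builds report a progress message that starts with "Index Build".
_INDEX_BUILD_MSG = re.compile(r"^Index Build")


def is_index_build(op: Dict[str, Any]) -> bool:
    """Return True if an in-progress operation looks like an index build."""
    command = op.get("command") or {}
    if op.get("op") == "command" and "createIndexes" in command:
        return True
    msg = op.get("msg") or ""
    return op.get("op") == "none" and bool(_INDEX_BUILD_MSG.match(msg))


def nodes_still_building(inprog_by_node: Dict[str, List[Dict[str, Any]]]) -> List[str]:
    """Return the sorted names of nodes reporting at least one index build."""
    return sorted(
        node
        for node, ops in inprog_by_node.items()
        if any(is_index_build(op) for op in ops)
    )


def safe_to_drop_index(inprog_by_node: Dict[str, List[Dict[str, Any]]]) -> bool:
    """True only when no node reports an in-progress index build."""
    return not nodes_still_building(inprog_by_node)


if __name__ == "__main__":
    # Made-up sample data; node names and operations are illustrative only.
    sample = {
        "primary": [{"op": "query", "command": {"find": "tasks"}}],
        "secondary-1": [{"op": "none", "msg": "Index Build: scanning collection"}],
        "secondary-2": [],
    }
    print(nodes_still_building(sample))
    print(safe_to_drop_index(sample))
```

A deployment script could refuse to issue dropIndexes until `safe_to_drop_index` returns True for every node in the replica set, including the secondaries.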

Written by Surya Bhagvat