Production updates

Surya Bhagvat
Harness Engineering
Apr 9, 2021 · 4 min read

We want to share details of the production incident that impacted customers in our Prod-1 cluster, the steps we took to mitigate it, and the processes we are putting in place to address these kinds of incidents.

Timeline

On 04/08/2021, at around 8:40 AM PT, we received the first alert from our monitoring tools that app.harness.io had stopped responding for customers whose accounts are in the Prod-1 cluster. Once the alert was triggered, the SREs were paged, and we immediately began looking at the critical metrics to find out what had happened.

Application and Infrastructure metrics

We noticed that the application average response time on the Prod-1 cluster, which usually hovers around 200 ms, reached up to 10,000 ms during this time window.

We did a quick analysis of the infrastructure metrics, which cover our Kubernetes services, Atlas-hosted MongoDB databases, and in-house TimescaleDB. We noticed that one of our key services, Manager, was failing its readiness check in the Prod-1 cluster and was not taking traffic. At this point we knew the health check was failing, and we turned our attention to MongoDB. We noticed that disk IOPS were higher than usual, and that this pattern had started right after a new production deployment that finished around 1:10 AM PT on 04/08/2021.

The Mongo node that went down recovered around 8:58 AM, and from an operational perspective, things went back to normal on app.harness.io.

What exactly caused this incident

Once we brought the services back, we started looking at the changes that went in with that deployment. On the Mongo dashboards, we noticed that one query was doing a full table scan against a MongoDB collection called logAnalysisRecords, which is primarily used in our CV (Continuous Verification) module. This query was thrashing the MongoDB CPU and IOPS, making other Mongo queries slower.
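For illustration only, here is a minimal sketch (in Python with pymongo, using a placeholder connection string and a hypothetical filter field, since the actual query shape is not shown in this post) of how a query against logAnalysisRecords can be checked for a full collection scan with explain():

```python
# Sketch only: the URI, database name, and filter field are illustrative,
# not the actual Harness query. Requires pymongo and a reachable MongoDB.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # placeholder URI
coll = client["harness"]["logAnalysisRecords"]      # collection named in this post

# Ask MongoDB how it would execute the query instead of running it at full cost.
plan = coll.find({"cvConfigId": "some-id"}).explain()  # hypothetical filter field

winning = plan["queryPlanner"]["winningPlan"]
# A "COLLSCAN" stage anywhere in the winning plan means no index was used,
# i.e. the kind of full collection scan that caused the load here.
print("COLLSCAN" in str(winning))
```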

As part of our daily deployments, we have an automated mechanism that handles the addition and removal of database indexes. One of this automated process's recommendations was to drop an index on the logAnalysisRecords collection. In the Prod-1 cluster, this collection had around 600K records, versus about 140K records in the Prod-2 cluster. This was a human error on our part: we should have analyzed the impact of dropping the recommended index before doing so in production.

The intention behind dropping the indexes was to move away from the deprecated way of maintaining Mongo indexes. Since the new way of creating an index didn't support the exact key order of the old ones, we changed them. During this exercise, we also found indexes that the Mongo dashboard reported as not used at all, and we deleted some of them to further improve performance. The information from Atlas turned out to be incorrect, and one of those indexes was actually in use. Because the collection is huge and each record is also large, queries against it fell back to table scans, and we ended up with an outage. This didn't manifest in the Prod-2 cluster because the number of records there is about a quarter of what is in Prod-1, and the load is relatively low compared to Prod-1.
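One way to cross-check dashboard metrics before dropping an index is to read MongoDB's own per-index access counters via the $indexStats aggregation stage. A minimal sketch, again with pymongo and the collection named above (the connection details are placeholders):

```python
# Sketch only: reads MongoDB's $indexStats counters as a sanity check before
# deciding an index is safe to drop.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # placeholder URI
coll = client["harness"]["logAnalysisRecords"]

for stat in coll.aggregate([{"$indexStats": {}}]):
    # "accesses.ops" counts operations that used the index since the counter
    # was last reset (e.g. on a node restart), so a zero here is only a hint,
    # not proof that the index is unused.
    print(stat["name"], stat["accesses"]["ops"], "ops since", stat["accesses"]["since"])
```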

Once app.harness.io was operational, we noticed that the query performing the table scan continued to show up in the Mongo logs. This particular query against the logAnalysisRecords collection was coming from our verification service microservice. Around 9 AM, we flipped the switch to disable the verification service component that causes this query to be invoked. We also decided to scale down the verification service so that the site could continue to operate without issues. This ensured that the full-table-scan query stopped and DB health returned to a fully functional state. Around 11:20 AM PT, we recreated the index that had been dropped because of human error, scaled the verification service back up, and re-enabled the service guard, bringing all modules back to their functional state.
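For completeness, recreating a dropped index is a single call. The sketch below uses pymongo with a hypothetical key pattern and name, since the post does not show the actual index definition:

```python
# Sketch only: the key pattern and index name are placeholders, not the real
# index that was restored on logAnalysisRecords.
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")   # placeholder URI
coll = client["harness"]["logAnalysisRecords"]

# Rebuilding the index lets the query planner stop falling back to COLLSCAN.
coll.create_index(
    [("cvConfigId", ASCENDING), ("createdAt", ASCENDING)],  # hypothetical keys
    name="cvConfigId_1_createdAt_1",
)
```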

Improvements to our deployment process

  1. Going forward, any index changes to our databases need to be reviewed and approved by the respective team lead and the DBA.
  2. Once QE signs off on a release, we will run an automated job that publishes the indexes to be created or dropped during the deployment, as sketched below. These changes will be provided to the DBA, and the deployment will proceed only after the DBA signs off on them.
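As an illustration of item 2, here is a minimal sketch of a pre-deployment job that diffs the indexes a release expects against what already exists in the database and emits the planned creations and drops for DBA review. The collection, connection string, and index names are placeholders, not our actual tooling:

```python
# Sketch only: produces a report of index changes for DBA sign-off.
# "desired" would come from the release's index definitions; values here are
# placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # placeholder URI
coll = client["harness"]["logAnalysisRecords"]

desired = {"_id_", "cvConfigId_1_createdAt_1"}           # hypothetical target set
existing = {ix["name"] for ix in coll.list_indexes()}    # what the DB has today

print("Indexes to create:", sorted(desired - existing))
print("Indexes to drop:  ", sorted(existing - desired))
# Nothing is applied here; the output goes to the DBA for review before deploy.
```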
