1130 - Harness production incident due to third-party vendor misconfiguration

Surya Bhagvat
Harness Engineering
Dec 5, 2022

We want to share the details of an incident in which pipelines could not advance after a step completed. It impacted deployments and builds in the Prod-1 and Prod-2 clusters between 8:28 AM and 9:48 AM PT on Nov 30th, 2022. The affected modules were Next Gen Continuous Delivery, Continuous Integration, Service Reliability Management, Feature Flags, and Security Testing Orchestration. Harness Current Gen modules were not affected.

Root cause

The Harness pipeline service relies on a third-party in-memory database provider. Human error by the vendor's personnel led to a rollout of the wrong configuration, which caused the Harness pipeline service to fail. The vendor had initiated a project to replace the self-signed server certificate with a certificate signed by GlobalSign across their fleet, and executed the first step for some of the non-TLS-enabled database clusters. Harness clusters were added to that batch by mistake, which resulted in an outage because the client did not trust the new certificate.
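To make the failure mode concrete, here is a minimal sketch of how a TLS client decides which server certificates to trust, and how a client could be set up to accept both the old and the new issuer during a certificate rotation. This is not the actual Harness client code. The function name `build_client_context` and the CA file path in the comment are hypothetical, and the sketch assumes the new issuer's root is present in the system trust store.

```python
import ssl
from typing import Iterable


def build_client_context(
    extra_ca_files: Iterable[str] = (),
    include_system_roots: bool = True,
) -> ssl.SSLContext:
    """Build a client-side TLS context that verifies the server certificate.

    A client that only loads a private (self-signed) CA will reject a server
    certificate issued by a public CA, and vice versa. Loading both sets of
    roots lets the client accept either one while a rotation is in progress.
    """
    # PROTOCOL_TLS_CLIENT turns on certificate verification and hostname
    # checking by default.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)

    if include_system_roots:
        # Public CA roots from the operating system trust store.
        ctx.load_default_certs()

    for ca_file in extra_ca_files:
        # Additional roots, for example the vendor's old self-signed CA.
        ctx.load_verify_locations(cafile=ca_file)

    return ctx


# Hypothetical usage during a rotation window (the path is an assumption):
# ctx = build_client_context(["/path/to/old-self-signed-ca.pem"])
```

With a context like this, a server certificate chained to either root would pass verification, so the order in which client and server are updated matters less.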

Remediation

The vendor reverted their incorrect config changes by rolling back the server certificate across the Harness clusters.

Timeline

  • 8:28 AM - The first alert fired, and we triggered PagerDuty.
  • 8:38 AM - Status page updated.
  • 8:46 AM - We identified that the issue was related to a third-party in-memory database and opened a ticket with the vendor.
  • 8:47 AM - While waiting on the vendor, Harness engineering tried different config changes and debugging to see whether we could address the issue.
  • 9:12 AM - Harness-side config changes failed to solve the problem.
  • 9:24 AM - The vendor joined the troubleshooting call.
  • 9:37 AM - The vendor reverted their incorrect changes, and Harness services started to recover.
  • 9:48 AM - Sanity checks passed. Issue resolved.
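The timestamps above can be turned into elapsed minutes since the first alert, which makes the gaps easier to compare. The sketch below is only an illustration: the short event labels and the helper name `minutes_since_start` are our own shorthand, not part of any Harness tooling.

```python
from datetime import datetime

# Times are taken from the timeline above (all PT, Nov 30th, 2022).
# The labels are shortened for illustration.
TIMELINE = [
    ("8:28 AM", "first alert fired"),
    ("8:38 AM", "status page updated"),
    ("8:46 AM", "vendor ticket opened"),
    ("8:47 AM", "Harness-side debugging started"),
    ("9:12 AM", "Harness-side changes failed"),
    ("9:24 AM", "vendor joined call"),
    ("9:37 AM", "vendor reverted changes"),
    ("9:48 AM", "issue resolved"),
]


def minutes_since_start(timeline):
    """Return (label, minutes elapsed since the first entry) for each event."""
    fmt = "%I:%M %p"  # 12-hour clock with AM/PM
    start = datetime.strptime(timeline[0][0], fmt)
    result = []
    for clock_time, label in timeline:
        elapsed = datetime.strptime(clock_time, fmt) - start
        result.append((label, int(elapsed.total_seconds() // 60)))
    return result


for label, minutes in minutes_since_start(TIMELINE):
    print(f"+{minutes:>2} min  {label}")
```

Read this way, the incident lasted 80 minutes from the first alert to resolution. The vendor joined the call 56 minutes in, and recovery began 13 minutes after that.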

Action Items

  • Harness is working with the third-party vendor to improve their support SLA times.
  • Harness is re-evaluating its architecture to reduce its dependence on this third-party provider. This work is already underway.
