Login issues in app3.harness.io

Surya Bhagvat
Harness Engineering
3 min read · Sep 23, 2022

We want to share the details of the login issues that impacted some of our customers in the Prod-3 cluster on 09/22 between 5:30 PM and 7:30 PM PT. Users were not able to log in to https://app3.harness.io.

Root cause

Harness has traditionally used a deployment model similar to Canary, though it is better thought of as a reverse Canary. For some of our services we use versioned deployments: we deploy version x+1, route traffic to it, and simultaneously keep version x around until its pending tasks complete. We use K8S namespaces to differentiate the versions. A cron job runs in the background, reads an annotation on each namespace that records its expiration time, and deletes the older namespace once that time is reached. Setting this expiration time is one of the steps in our deployment pipelines.
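The cleanup logic described above can be sketched as a pure function: given each namespace's annotations and the current time, decide which namespaces the cron job should delete. The annotation key and namespace names here are illustrative, not the internal ones Harness actually uses.

```python
from datetime import datetime, timezone

# Hypothetical annotation key; the real key used by the cron job is internal.
EXPIRY_ANNOTATION = "harness.io/expiry"

def namespaces_to_delete(namespaces, now):
    """Return the names of namespaces whose expiry annotation has passed.

    `namespaces` maps a namespace name to its annotation dict. A namespace
    with no expiry annotation (e.g. a static one) is never deleted.
    """
    expired = []
    for name, annotations in namespaces.items():
        expiry = annotations.get(EXPIRY_ANNOTATION)
        if expiry is None:
            continue  # static namespaces should carry no expiry
        if datetime.fromisoformat(expiry) <= now:
            expired.append(name)
    return expired

# Example: the old versioned namespace has expired, the new one has not,
# and the static namespace carries no expiry annotation at all.
now = datetime(2022, 9, 22, 17, 30, tzinfo=timezone.utc)
print(namespaces_to_delete({
    "svc-v41": {"harness.io/expiry": "2022-09-22T17:00:00+00:00"},
    "svc-v42": {"harness.io/expiry": "2022-09-23T17:00:00+00:00"},
    "svc-static": {},
}, now))
```

The key invariant for this incident is the `None` branch: a static namespace is safe only as long as nothing ever writes an expiry annotation onto it.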

Culturally, Harness is transitioning to a model where engineers are responsible for writing, testing, and deploying their services into production. As part of this effort, we moved 95% of our services away from the reverse-canary model described above to rolling deployments to keep things simple. These services no longer live in a versioned namespace but in a static one. During this migration, the SRE team modified the pipelines for rolling deployments but did not remove the step that sets the namespace expiration.

Our weekly deployment completed around 4:30 PM PT, and the deployment pipeline incorrectly set the static namespace to expire an hour later, around 5:30 PM PT. When it expired, all the pods in that namespace were scaled down to 0. Because the login service depends on some of those services, it could not fetch the required data, and users could not log in. We identified the root cause, addressed the underlying issue, and logins started working again around 7:30 PM PT.

Timeline

  • 4:30 PM PT — Deployment to Prod-3 completed. We ran the post-deployment sanity pipelines, and everything passed.
  • 5:30 PM PT — The namespace expired, causing some of the pods to scale down to 0 and leaving users unable to log in. The rest of the services were not impacted.
  • 5:30 PM — 7:30 PM PT — Debugging and remediation for the above.

Remediation

  • We missed a few things; the first was not removing the namespace-expiration step from the modified pipelines. We will review all our pipelines to ensure there are no bugs in their steps and stages.
  • We had an alert in place that fires when all the pods behind a K8S service scale down to zero. It was configured to ignore this condition for versioned namespaces, so we did not receive any alerts. We have modified the alert to take this case into consideration.
  • We are enhancing our Synthetic Jobs this weekend to exercise additional scenarios for our NextGen services so that we get immediate alerts if our automated periodic testing fails.
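The alerting fix in the second bullet can be sketched as a predicate over a service's pod count. This is a minimal sketch under assumptions, not the actual alert rule: we assume the old rule exempted versioned namespaces wholesale, and the fix narrows the exemption to versioned namespaces that are legitimately draining after expiry.

```python
def pods_all_down_alert(ready_pods, is_versioned_namespace, past_expiry):
    """Return True when the all-pods-down alert should fire for a service.

    Illustrative version of the fixed rule: fire whenever all pods behind
    the service are gone, unless the namespace is an old version that was
    expected to drain after passing its expiry.
    """
    if ready_pods > 0:
        return False
    if is_versioned_namespace and past_expiry:
        return False  # expected scale-down of an expired old version
    return True

# The incident case: a static namespace drained to zero now alerts.
print(pods_all_down_alert(0, is_versioned_namespace=False, past_expiry=False))  # True
```

The design point is that suppression should encode *why* zero pods is acceptable (an expired version draining), not merely *where* it happened (any versioned namespace).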

Uptime/Availability

We apply a weightage-based calculation for our uptime/availability (see https://medium.com/harness-engineering/computing-uptime-for-harness-modules-29be8ae9f622). This incident primarily impacted login for our Current Gen CD customers in Prod-3, which accounts for a 20% weightage.

Users could not log in for two hours (7,200 seconds); applying the 20% weightage gives 1,440 seconds of weighted downtime. So for the week of 09/19–09/25, our uptime for Current Gen CD in the Prod-3 cluster is 99.76%.
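The arithmetic above works out as follows, using the figures stated in this post:

```python
WEEK_SECONDS = 7 * 24 * 3600   # 604,800 seconds in the week of 09/19-09/25
outage_seconds = 2 * 3600      # users could not log in for two hours
weight = 0.20                  # Current Gen CD weightage for Prod-3

weighted_downtime = outage_seconds * weight               # 1,440 seconds
uptime_pct = (1 - weighted_downtime / WEEK_SECONDS) * 100

print(f"{weighted_downtime:.0f}s weighted downtime, uptime {uptime_pct:.2f}%")
# prints "1440s weighted downtime, uptime 99.76%"
```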
