Harness Production Deployment Process

Surya Bhagvat
Harness Engineering
7 min read · Dec 11, 2020


In the previous blog post, we covered the deployment architecture for the Harness product. This post talks about how we do our daily deployments using our own Harness Continuous Delivery platform. One of the principles behind continuous deployment is to make frequent, automated releases of the product. This post covers how the production operations team at Harness does the daily deployments once the deployment artifacts are in place, meaning the artifacts have already been thoroughly tested in our QA environments. The QA sign-off to production operations is briefly covered below.

QA sign-off process

Before we deploy any of the microservices to production, the prerequisite step is a proper QA sign-off. We will not cover all the steps in the release process; that's a matter for another blog post. Briefly, the release branch is cut from master four days a week (Monday through Thursday morning IST) and deployed to QA for regression testing, where it is soaked for 24 hours. After regression testing and sign-off, the build is released for deployment to production, which is scheduled four days a week (Tuesday through Friday morning IST).

The QA sign-off page covers the new builds that need to be deployed and the changeset, which is essentially the list of JIRA tickets going in as part of the build. It also notes anything the operations team needs to be aware of as part of the deployment, including enabling or disabling any feature flags.
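For illustration only, a sign-off entry might carry information shaped roughly like the following; the service names, versions, ticket IDs, and flag below are made up, not real Harness data.

```python
# Purely illustrative shape of a sign-off entry; all values are made up.
signoff = {
    "builds": {"manager": "1.0.71000", "ui": "1.0.8000"},  # new builds to deploy
    "changeset": ["HAR-12345", "HAR-12399"],               # JIRA tickets in the build
    "ops_notes": "Enable the new search feature flag after deployment",
    "feature_flags": {"NEW_SEARCH": True},                 # flags to toggle
}
```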

Deployment pipelines

The first step toward automated deployments is defining a production-grade deployment pipeline. As covered in the earlier blog post on our deployment architecture, we use GKE for the Harness SaaS offering. We have defined automated production pipelines composed of various stages and steps. Some of the steps are listed below; a sketch of the notification and maintenance-mode steps follows the list.

  • Notify various Slack channels when the deployment starts and when it is complete.
  • Put our PagerDuty alerting into maintenance mode during the deployment window.
  • Collect automated approvals from the right set of people before the deployment can proceed.
  • Deploy the individual services into our production primary Kubernetes cluster.
  • Trigger various other deployments, including the failover cluster and our UAT environment.
  • Finally, mark the deployment as complete.
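As a rough illustration of the first two steps, the sketch below posts a Slack notification and opens a PagerDuty maintenance window. This is not the actual Harness pipeline step implementation; the webhook URL, API token, service ID, and contact email are placeholders.

```python
"""Minimal sketch of the notification and maintenance-mode steps, assuming a
Slack incoming webhook and the PagerDuty REST API. All identifiers are placeholders."""
from datetime import datetime, timedelta, timezone

import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder
PAGERDUTY_TOKEN = "<pagerduty-api-token>"                           # placeholder
PAGERDUTY_SERVICE_ID = "<service-id>"                               # placeholder
PAGERDUTY_FROM_EMAIL = "ops@example.com"                            # placeholder


def notify_slack(message: str) -> None:
    """Post a deployment status message to the deployments channel."""
    requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)


def open_pagerduty_maintenance(minutes: int = 60) -> None:
    """Open a PagerDuty maintenance window so alerts are suppressed
    for the affected service during the deployment."""
    start = datetime.now(timezone.utc)
    end = start + timedelta(minutes=minutes)
    requests.post(
        "https://api.pagerduty.com/maintenance_windows",
        headers={
            "Authorization": f"Token token={PAGERDUTY_TOKEN}",
            "Accept": "application/vnd.pagerduty+json;version=2",
            "Content-Type": "application/json",
            "From": PAGERDUTY_FROM_EMAIL,
        },
        json={
            "maintenance_window": {
                "type": "maintenance_window",
                "start_time": start.isoformat(),
                "end_time": end.isoformat(),
                "description": "Daily production deployment",
                "services": [{"id": PAGERDUTY_SERVICE_ID, "type": "service_reference"}],
            }
        },
        timeout=10,
    )


if __name__ == "__main__":
    notify_slack("Production deployment starting")
    open_pagerduty_maintenance()
```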

If any of the above steps results in an error, the pipeline automatically rolls back.

Our production pipeline in detail

Shown below is our daily production deployment pipeline. The first few steps are around notifications and maintenance. The Triggers step automates downstream deployments; we primarily use triggers to deploy to our other environments.
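As a purely hypothetical sketch, invoking such a trigger from outside the pipeline could look like the following; the webhook URL and payload are placeholders, not the actual trigger configuration, which comes from the trigger's own generated webhook details.

```python
"""Hypothetical sketch of kicking off a downstream deployment (failover
cluster, UAT) via a webhook-style trigger. URL and payload are placeholders."""
import requests

TRIGGER_WEBHOOK_URL = "https://app.harness.io/api/webhooks/<trigger-token>?accountId=<account-id>"  # placeholder

# The payload shape depends on how the trigger is configured; a common pattern
# is to pass along the artifact build number that was just deployed.
payload = {"buildNumber": "<build-number>"}  # placeholder field name

response = requests.post(TRIGGER_WEBHOOK_URL, json=payload, timeout=30)
response.raise_for_status()
print("Triggered downstream deployment:", response.status_code)
```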

The next couple of screenshots show the continuation of the pipeline and the deployment of our services. Not all services are covered here, but one of the core services of Harness's product is our manager service, which handles all tasks, communications with delegates, and background jobs. The deployment of the service itself consists of various steps. For the manager service, we do a blue/green deployment: we bring the new version of the pods up entirely in a new namespace without sending them any traffic. When we run the Set Primary stage, the new pods start to receive traffic, and it is a hard cutover of all new inbound traffic. The only exception is that tasks already running keep running on the old version. There is never a time when a new request goes to either version randomly, which is a property of rolling deployments. We use a true rolling deployment for the other services (UI, verification, learning engine, etc.). A minimal sketch of the hard-cutover idea appears after the screenshots below.

Manager Deployment
Set Primary Stage
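To illustrate the hard-cutover idea only (this is not the actual Set Primary implementation, which runs as a pipeline stage as shown above), the sketch below repoints a Kubernetes Service selector from the old pods to the new ones in a single step; the namespace, service, and label names are placeholders.

```python
"""Generic illustration of a blue/green hard cutover on Kubernetes, assuming
two Deployments labeled by color behind one Service. Not Harness's internal
implementation; all names are placeholders."""
from kubernetes import client, config


def set_primary(namespace: str, service_name: str, new_color: str) -> None:
    """Repoint the stable Service at the newly deployed pods.

    All new inbound traffic cuts over at once; in-flight work on the
    old pods keeps running until they are scaled down later.
    """
    config.load_kube_config()
    v1 = client.CoreV1Api()
    patch = {"spec": {"selector": {"app": "manager", "color": new_color}}}
    v1.patch_namespaced_service(name=service_name, namespace=namespace, body=patch)


if __name__ == "__main__":
    set_primary(namespace="harness-prod", service_name="manager", new_color="green")
```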

Altogether, we deploy around 7–8 services, and the whole deployment process takes about 40 minutes, with zero downtime. One of the interesting things we can do with our pipelines is to add a verification step at the end of the deployment that checks against the existing APM and observability tools to verify the deployment's performance. We have these verification steps as part of the QA pipeline; in production, we don't define this step explicitly. Instead, using our Continuous Verification product, we perform the sanity checks described below.

Deployment is complete: now come the sanity checks

We use our Continuous Verification product to detect any anomalies after the deployment. Continuous Verification wires into your APM backend and can detect anomalies around your core business transactions. We use it in conjunction with AppDynamics to detect anomalies in business transaction response times and with Google Stackdriver to detect anomalies in our error logs. If Continuous Verification detects anomalies in these business transactions or error logs, usually within 5–10 minutes, we get on a call, figure out what's going on, and decide whether there is a need to roll back.
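Conceptually, the check boils down to comparing a post-deployment window against a pre-deployment baseline. The sketch below is a simplified stand-in for what Continuous Verification does against AppDynamics and Stackdriver; the metric values and thresholds are illustrative only.

```python
"""Hypothetical post-deployment sanity check: compare response time and error
rate after the deployment against a pre-deployment baseline. The numbers and
thresholds below are made up for illustration."""
from dataclasses import dataclass


@dataclass
class WindowStats:
    p95_response_ms: float
    error_count: int
    request_count: int


def is_regressed(baseline: WindowStats, current: WindowStats,
                 latency_tolerance: float = 1.5,
                 error_rate_tolerance: float = 2.0) -> bool:
    """Flag the deployment if latency or error rate jumped well past the baseline."""
    latency_bad = current.p95_response_ms > baseline.p95_response_ms * latency_tolerance
    baseline_rate = baseline.error_count / max(baseline.request_count, 1)
    current_rate = current.error_count / max(current.request_count, 1)
    errors_bad = current_rate > baseline_rate * error_rate_tolerance
    return latency_bad or errors_bad


# Example values, made up for illustration.
baseline = WindowStats(p95_response_ms=180.0, error_count=12, request_count=50_000)
after_deploy = WindowStats(p95_response_ms=210.0, error_count=15, request_count=48_000)
if is_regressed(baseline, after_deploy):
    print("Anomaly detected: page the team and consider a rollback")
else:
    print("No anomalies in the first window after deployment")
```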

The other sanity check is a pipeline that exercises the Harness CD platform functionality around deploying to various providers such as Kubernetes, ECS, PCF, and so on. If the pipeline passes within its expected duration, we are good; otherwise, we debug further and figure out whether there is a need to roll back.
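In outline, this check is simply "did the sanity pipeline succeed within its expected duration". The sketch below assumes a hypothetical get_pipeline_status() helper standing in for however the pipeline execution is queried; it is not a real Harness client call.

```python
"""Sketch of the 'sanity pipeline must pass within its expected duration' check.
get_pipeline_status() is a hypothetical stand-in, not a real API."""
import time


def get_pipeline_status(execution_id: str) -> str:
    """Stand-in: return 'RUNNING', 'SUCCESS', or 'FAILED' for the execution."""
    raise NotImplementedError("replace with your pipeline API client")


def sanity_pipeline_passed(execution_id: str, expected_minutes: int = 30) -> bool:
    """Poll the sanity pipeline and require success within its expected duration."""
    deadline = time.time() + expected_minutes * 60
    while time.time() < deadline:
        status = get_pipeline_status(execution_id)
        if status == "SUCCESS":
            return True
        if status == "FAILED":
            return False
        time.sleep(60)  # check once a minute
    return False  # did not finish in the expected window: debug further
```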

Database aspect of the deployment

One of the challenges with daily deployments is database indexes and schema changes. We have automated our database-related changes as part of the daily deployment process. We use MongoDB Atlas as our backend database. With every deployment, there are either database migrations involved or indexes being created or dropped on collections.

The term database migration in this context means any schema-related change applied to the collections in that database instance. Production operations has a mechanism to get notified if the code contains any database migrations as part of the deployment. We make sure the migrations are thoroughly documented, including the impact of rolling back the deployment.
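As a hypothetical example of such a migration, documented together with its rollback impact, consider a simple field rename; the connection string, database, collection, and field names below are made up and not from an actual Harness migration.

```python
"""Illustrative MongoDB migration with its rollback documented alongside.
All names are hypothetical."""
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<atlas-cluster>/")  # placeholder connection string
db = client["harness"]                                  # placeholder database name


def migrate_up() -> None:
    """Forward migration: rename a field on every document in the collection."""
    db["applications"].update_many({}, {"$rename": {"appName": "name"}})


def migrate_down() -> None:
    """Rollback impact: reverse the rename so a rolled-back build that still
    reads the old field keeps working."""
    db["applications"].update_many({}, {"$rename": {"name": "appName"}})
```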

The second aspect of database changes is the addition and removal of indexes. Our deployment pipeline includes a step where the code looks for annotations on the Java classes in that deployment and figures out whether new indexes need to be created or existing indexes need to be dropped. This is one of the automated pipeline steps where production operations gets involved and looks through the indexes to ensure that none are being created or dropped on the hottest or largest collections. If that is the case, we either push the deployment to a time window when there is less traffic on the site or, in rare cases, let the deployment finish and create the indexes later when there is less traffic. A sketch of this reconciliation step is shown below.
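The sketch below illustrates the reconciliation idea: compare the indexes the code declares against what already exists in MongoDB, and flag very large collections for manual scheduling instead of building indexes during the deployment. The collection name, index spec, and size threshold are illustrative; the real pipeline step derives the desired indexes from the Java class annotations.

```python
"""Sketch of reconciling desired indexes with existing ones, deferring index
builds on very large collections. All names and thresholds are illustrative."""
from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb+srv://<atlas-cluster>/")  # placeholder
db = client["harness"]                                  # placeholder

# Desired indexes per collection, as the annotation scan might report them.
desired_indexes = {
    "workflowExecutions": [
        ("accountId_1_createdAt_1", [("accountId", ASCENDING), ("createdAt", ASCENDING)]),
    ],
}

LARGE_COLLECTION_DOCS = 10_000_000  # illustrative threshold for a "huge" collection

for coll_name, indexes in desired_indexes.items():
    coll = db[coll_name]
    existing = set(coll.index_information().keys())
    for index_name, keys in indexes:
        if index_name in existing:
            continue  # already present, nothing to do
        if coll.estimated_document_count() > LARGE_COLLECTION_DOCS:
            # Hot/large collection: defer to a low-traffic window instead of
            # building the index during the deployment.
            print(f"Deferring index {index_name} on {coll_name}")
            continue
        coll.create_index(keys, name=index_name)
```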

Roll-back or Fix Forward

One of the critical decisions in any deployment is whether to roll back or fix forward. Rolling back the code is a business decision, and it needs to be weighed against the impact on our customers. We usually roll back in the following scenarios:

  • If the deployment has a performance impact across the product line.
  • If the deployment has introduced regressions for most of our customers, making the product unusable from the customer's perspective.
  • In the current quarter, we had to roll back once to fix a P1 issue introduced with the deployment.

As mentioned before, every rollback decision is carefully considered. If we decide to roll back, we roll back to the previous build and follow the same deployment process described above.

Frequency of our deployment(s)

We usually do our deployments between 5 PM and 5 AM PT. Here is some data around our deployments that you may find interesting:

  • In our last quarter, we did around 31 deployments, 12 of them hotfix deployments. We usually do hotfix deployments when regressions have been introduced in the code or when there is functionality we want to get into our customers' hands a little earlier than usual.
  • In the current quarter, we are moving toward a daily deployment model (Mon–Thu) and have completed around 27 deployments, 10 of them hotfix deployments.

From a production operations standpoint, we prefer daily deployments. Our production pipelines are robust and automated, and we use the same pipelines for daily deployments and for rolling back a deployment. We track the changes that went in as part of each deployment, including when the deployment completed. This helps us identify regressions or performance-impacting bugs much more quickly, and we can either roll back or fix forward as described above.

Further reading

In this blog post, we covered how the production operations team at Harness does our daily deployments. For further reading and in-depth information on the Harness CI/CD platform and how Harness CD works, see https://harness.io/continuous-delivery/. You can also try the blog post by Ravi Lachhman on deploying an example application using our CI and CD platform: https://harness.io/2020/08/harness-ci-and-harness-cd/.
