Nearly four years ago, Hootsuite ran on a single monolithic codebase deployed on a fleet of AWS EC2 instances. Now, this functionality has been split up and delegated to over 120 services, which are deployed as containers on Kubernetes. Improving the tooling needed to build and deploy such a large number of services has come with a lot of challenges. Over the years, robust build and deploy scaffolding has been put in place to make it easy to generate and ship new services to production.
The core of this scaffolding is the Hootsuite service skeleton. Given a set of inputs, the service skeleton will generate both the basic application (Scala/Go) code and the build and deployment scaffolding needed to run a service. These generated files are then committed to a GitHub repository dedicated to that service; each service has its own repository. After committing the generated files, the service can be built and deployed to each environment. To date, over 120 services have been generated using this process.
The build and deployment scaffolding needed to run a service has evolved over time. This post documents the problems with coupling application code to build and deployment code. It then describes how GitOps can be leveraged to decouple these two separate concerns.
What We Were Doing
Historically, in order to build and deploy a service at Hootsuite, several files were committed to the service repository alongside the application source code:
- A Jenkinsfile
- A Makefile
- Helm chart values files
The Jenkinsfile, written in Groovy, contained stages for building, testing and deploying the service. A Jenkins job was created and configured to use this Jenkinsfile as its configuration. The service repository was configured with a webhook to send push events to Jenkins to trigger this job. If a push event was for a change to the master branch, Jenkins would queue a full build and deployment of the service. The actual logic for building and deploying the service was defined in the Makefile.
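The trigger rule described above can be sketched as a small predicate. This is an illustration only: the function name is hypothetical, and it assumes the standard GitHub push event payload, in which the `ref` field holds the full ref name.

```python
def should_deploy(push_event: dict, deploy_branch: str = "master") -> bool:
    """Return True when a push event targets the branch that triggers a full build and deployment."""
    # GitHub push payloads carry the full ref, for example "refs/heads/master".
    return push_event.get("ref") == f"refs/heads/{deploy_branch}"
```

Pushes to any other branch would fall through this check and would not queue a deployment.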
The Makefile included targets that would:
- Build the application binary
- Run unit tests
- Build a container with the application binary
- Push the container image to a Docker registry
- Generate a Kubernetes manifest from Helm chart templates
- Apply the manifest to each applicable cluster
The Helm chart values were defined in a set of files. One file included values applicable to all environments, and an additional file per environment included values specific to that environment. The values in these files would be consumed by an internally maintained service Helm chart. The service Helm chart included the templates needed to generate Kubernetes manifests and is described in more detail below.
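To make the layering of values files concrete, here is a minimal sketch of how a shared file and an environment file could combine, with the environment file winning on conflicts. The keys and numbers are made up for illustration, and the function only approximates Helm's behaviour when several values files are supplied; Helm has additional rules (such as removing a key that is set to null) that are not modelled here.

```python
from copy import deepcopy


def merge_values(base: dict, override: dict) -> dict:
    """Recursively merge override into a copy of base; override wins on conflicts."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical contents of the shared file and of one environment file.
common = {"replicaCount": 2, "resources": {"requests": {"cpu": "100m", "memory": "128Mi"}}}
production = {"replicaCount": 6, "resources": {"requests": {"cpu": "500m"}}}

effective = merge_values(common, production)
```

In this example the production file overrides the replica count and the CPU request, while the memory request is inherited from the shared file.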
The Compute Platform team at Hootsuite maintained two variations of the service Helm chart for deploying services — one for REST services and another for gRPC services. Each of these Helm charts included the resources needed to run a service on Kubernetes at Hootsuite. The chart type and chart version used for a service were hardcoded in the Makefile. The Helm chart values committed alongside a service were used to template out the Kubernetes resources included in the chart. When a service was deployed, the Makefile deploy target would use the chart type, version and values to generate a Kubernetes manifest.
What Wasn’t Working
The core problem with this setup was that the maintainers of a service’s deployment configuration (e.g., the Helm charts) were not the same people as the maintainers of the service’s application code.
The only way for the Compute Platform team to release a new Helm chart version was to manually update the chart version and values in each service repository. These updates would then trigger a build and deployment of each service, which would propagate the deployment configuration changes to each cluster.
One of the most common problems the team encountered when making these updates was infrequently built services, which would often have broken builds. In the best case, the Compute Platform team would need to debug the build pipeline to successfully push the deployment configuration updates to all clusters. In other cases, services would have unreleased, broken code in the master branch. When this happened, the team would need to fix the broken code, then chase down the service owners to validate these fixes. Neither of these concerns was related to deployment configuration, yet they repeatedly needed to be addressed before deployment configuration changes could be shipped.
Because each service repository’s ownership was shared by the Compute Platform team and a development team, these updates were almost never a straightforward process. Even trivial deployment configuration updates could sometimes take months to propagate to all services.
Another problem was that the service skeleton had evolved over the years. Each service was a reflection of the state that the skeleton had been in when the service was first generated. This meant that each service could potentially have dramatic differences in layout and configuration. The differences from service to service made it impossible to leverage automation to roll out changes to build and deployment scaffolding.
As time went on, it became clear that the shared ownership of service repositories was causing unnecessary friction for the Compute Platform team. The team struggled to roll out changes to deployment configuration with the existing setup; modifying 120 services by hand simply did not scale. In order to roll out new Kubernetes features and version upgrades, it was critical to be able to make frequent changes to the deployment configuration. By isolating deployment code, the Compute Platform team could take stronger ownership of the deployment of services on Kubernetes. This would make it possible to quickly iterate on deployment configuration and code without impacting development teams. The team decided to investigate options for decoupling deployment configuration from application code.
What We Tried
In summer 2019, the Production Operations and Delivery teams at Hootsuite held a hackathon. One of the projects used a tool that was relatively new at the time: Weaveworks Flux. The hackathon project prototyped a basic Flux setup and used it to deploy a service that had been generated from the service skeleton. Flux was a new type of tool, one used for implementing a new paradigm known as GitOps. Though it was new, GitOps was based on accepted and proven concepts in the DevOps world: infrastructure as code and pull-based configuration management. Hootsuite had been successfully using similar tools, such as Terraform and Atlantis, for years.
The core concept of GitOps is simple — deployment configuration is defined as code. Applied to Kubernetes, it means that the state of a cluster is defined in source control. With GitOps, changes to cluster state, like rolling out new versions of services, are made by committing a change to source control. The code that defines the cluster state is stored independently from any application code, typically in a completely separate repository.
The hackathon prototype made clear that a GitOps workflow could solve the problem with build and deployment code being tightly coupled to application code. A GitOps best practice is for services’ deployment configuration to live in a completely separate repository from the application code. Extracting the services’ deployment configuration to separate repositories could help the Compute Platform team with taking stronger ownership of the deployment configuration. Using automation, Helm chart versions and values in these separate repositories could be modified programmatically and propagated to the clusters. Broken build pipelines and unreleased code would no longer block deployment configuration changes at Hootsuite.
Encouraged by the success of the Flux prototype, the Compute Platform team decided to move forward with implementing a GitOps workflow at Hootsuite. The first step in this process was to investigate the available GitOps tools. At the time, there were two leading technologies in the GitOps space — Weaveworks Flux and ArgoCD. In November 2019, it was announced that these projects would be merging into a new project known as Argo Flux. While the team eventually wanted to use Argo Flux, a decision on whether to use Flux or ArgoCD needed to be made in the meantime.
In the interest of due diligence, an ArgoCD prototype was set up. The goal was to compare the ArgoCD prototype to the Flux prototype from the hackathon. Like the Flux prototype, the ArgoCD prototype would deploy a service using deployment configuration pulled from GitHub. Once both prototypes were running side by side, a detailed comparison of the two projects could be made.
Initially, the team was leaning towards Flux. Flux had some desirable features that ArgoCD did not have. Flux could watch Docker registries for new images and automatically deploy them. It could also automatically discover new services (defined as a FluxHelmRelease or HelmRelease) to deploy in the GitHub repository that it had been configured to poll.
In the end, however, the team decided to go with ArgoCD. ArgoCD could support multiple repositories per instance, which meant that it would be possible to keep deployment configuration in multiple repositories. This would allow access to deployment configuration to be restricted to certain teams. ArgoCD could also be integrated with Okta. By integrating with Okta and mapping Okta Groups to ArgoCD Projects and Roles, access could be limited to specific user groups. This made it possible to grant access to the tool to a wider audience. Finally, ArgoCD supported the option to “manually sync” changes on GitHub to cluster state.
Setting Up ArgoCD
The team set out to integrate ArgoCD into the deployment workflow at Hootsuite. First, the team set up ArgoCD on each Kubernetes cluster. ArgoCD provides base manifests, but Hootsuite’s implementation would require changes to these manifests. Kustomize was used to patch and add Hootsuite-specific changes to the base manifests. Some of these changes included:
- Limiting the default RBAC permissions granted to ArgoCD. Kubernetes tools often ship with broad RBAC permissions that need to be scoped down to what is actually needed. ArgoCD’s permissions were scoped so that ArgoCD could only deploy changes to certain namespaces. The ability to perform operations on certain types of resources was also limited.
- Configuring Artifactory access (needed to pull down the service Helm chart) and GitHub access (needed to pull down deployment repositories).
- Configuring an integration with Okta to grant access to the ArgoCD CLI/UI to developers. ArgoCD Roles were created and mapped to Okta groups to further restrict the actions that different groups of developers could take using the ArgoCD CLI/UI.
- Setting up an ArgoCD Project for ArgoCD Applications to belong to. A role for Jenkins was created for this Project that granted permissions to get, list and sync Applications belonging to the Project. An auth token was then generated for this role to make ArgoCD API calls from Jenkins.
In addition to the above customizations, the team also had to contend with a few bugs that were found while setting up ArgoCD. After filing issues with the ArgoCD project on GitHub, the team decided to contribute fixes for several of them. The ArgoCD maintainers were quick to provide feedback and to release the changes contributed by the team. They deserve a shout out for their ongoing work to provide an excellent tool and for making it easy to contribute back to the project!
What Our GitOps Workflow Looked Like
The challenge with GitOps is “gluing” the pieces together. With over 120 services, each with its own pipeline, it is important to provide a build and deployment workflow that is easy to configure. It is also important to roll out changes to this workflow iteratively to reduce developer friction.
The first step towards providing such an experience was solving the problem of how to register each service as an ArgoCD Application to allow the service to be deployed on each cluster. The ArgoCD Application is a Kubernetes custom resource that points to a source for reading deployment configuration and to a destination for where the deployment configuration should be applied on a cluster. An ArgoCD Application and a service typically have a one-to-one mapping. Manually registering over 120 services as Applications with ArgoCD would be tedious. However, a pattern known as “App of Apps” could be used to make this easier. Following this pattern, a single “root” ArgoCD Application resource is created, pointing to a GitHub repository that contains all ArgoCD Applications that need to be applied to the cluster. ArgoCD is then able to create these Applications programmatically by syncing to the parent repository.
Employing the App of Apps pattern, the team added a Kustomization to the ArgoCD installation manifests to create the root ArgoCD Application. The root Application was configured to use ArgoCD’s automatic sync functionality. Automatic sync is a feature used to apply any changes made to the repository referenced by the root Application automatically — no manual intervention is required.
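As a rough illustration of what such a root Application could contain, the sketch below builds the manifest as a Python dictionary. The field names follow ArgoCD's `Application` resource, but the application name, namespace, project, and repository URL are placeholders rather than Hootsuite's actual values.

```python
def root_application(repo_url: str, path: str = ".", revision: str = "HEAD") -> dict:
    """Build a root Application that points at the App of Apps repository."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        # Name, namespace and project are assumptions made for this sketch.
        "metadata": {"name": "root", "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {"repoURL": repo_url, "path": path, "targetRevision": revision},
            # The root Application manages resources on the cluster ArgoCD runs in.
            "destination": {"server": "https://kubernetes.default.svc", "namespace": "argocd"},
            # An "automated" sync policy applies repository changes without a manual sync.
            "syncPolicy": {"automated": {}},
        },
    }


manifest = root_application("https://github.example.com/deployments/app-of-apps.git")
```

Leaving `syncPolicy` out of a child Application would keep its syncs manual, which matters when a pipeline is expected to decide when a sync happens.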
The number of services at Hootsuite made populating the App of Apps repository a daunting task. Existing and future services would need an ArgoCD Application resource for each ArgoCD-managed cluster that the service would be deployed to. Rather than manually create these resources, the Compute Platform team built a CLI tool to help with this task. The hs-deployments tool was built to create ArgoCD Application resources and commit them to the App of Apps repository.
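The internals of hs-deployments are not described here, but the core of such a tool might look like the following sketch. It generates one Application per environment for a service and maps each one to a file path in the App of Apps repository. The naming scheme, directory layout, project name, and values file names are all assumptions, and serialising the result to YAML is left out to keep the example dependency-free.

```python
from typing import Dict, Sequence


def application_for(service: str, environment: str, org_url: str) -> dict:
    """Build an Application that deploys one service from its deployment repository."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": f"{service}-{environment}", "namespace": "argocd"},
        "spec": {
            "project": "services",  # hypothetical ArgoCD Project name
            "source": {
                "repoURL": f"{org_url}/{service}.git",
                "path": ".",
                "targetRevision": "HEAD",
                # Shared values first, then the environment-specific overrides.
                "helm": {"valueFiles": ["values.yaml", f"values-{environment}.yaml"]},
            },
            # Assumes each cluster runs its own ArgoCD and deploys to itself.
            "destination": {"server": "https://kubernetes.default.svc", "namespace": service},
        },
    }


def applications_for_service(service: str, environments: Sequence[str],
                             org_url: str) -> Dict[str, dict]:
    """Map a path in the App of Apps repository to the Application stored there."""
    return {
        f"{env}/{service}.yaml": application_for(service, env, org_url)
        for env in environments
    }
```

A call such as `applications_for_service("billing", ["staging", "production"], "https://github.example.com/deployments")` would produce two entries, one per environment directory.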
A new GitHub organization named “deployments” was created to host the App of Apps repository. Each service’s deployment configuration would be hosted in its own repository underneath the new deployments organization. This would allow for a one-to-one mapping between deployment and application code repositories, which would make it easy to scope down access and to audit changes to deployment configuration on a per-service basis. The deployments organization was configured with a webhook pointing to an ArgoCD endpoint. This would allow ArgoCD to immediately recognize any changes made to deployment configuration repositories, rather than waiting for its next poll interval.
In order to migrate to the new ArgoCD workflow, a deployment configuration repository was created for each existing service. To use the ArgoCD workflow out of the box with new services going forward, the service skeleton was modified to automatically create a deployment configuration repository when a new service was generated. The deployment configuration repository contained the Helm chart type, version and values that had historically been committed to the application code repository.
Compute Platform team ownership of the deployment repositories was established by adding a GitHub CODEOWNERS file. To support development workflows, developers were permitted to make changes to environment-specific Helm chart values without review by the Compute Platform team. These values could be used to make changes such as adding environment variables, adjusting resource requests/limits, and modifying the replica count for a deployment. Any other changes to the deployment configuration would require review from the Compute Platform team.
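The review policy can be modelled with a short function. This sketch captures the rule itself, not GitHub's CODEOWNERS matching semantics, and the file naming convention (`values-<environment>.yaml`) is an assumption.

```python
from fnmatch import fnmatch
from typing import Sequence

# Hypothetical layout: environment-specific values files that developers may edit freely.
DEVELOPER_OWNED = ("values-*.yaml",)


def needs_platform_review(changed_paths: Sequence[str]) -> bool:
    """Return True if any changed file falls outside the developer-owned patterns."""
    return any(
        not any(fnmatch(path, pattern) for pattern in DEVELOPER_OWNED)
        for path in changed_paths
    )
```

Under this rule, a change that touches only `values-staging.yaml` needs no platform review, while a change that also touches the shared values file or the chart version does.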
One of the challenges with decoupling deployment configuration from application code was where application configuration should live. Since application configuration is inherently coupled to application code, it was sensible to keep it in the application code repository. Local development workflows also depended on the application configuration being present in the application code repository. Keeping the application configuration here would reduce friction with developers.
However, a limitation of Helm is that templates can only refer to files that exist in the chart directory. The service Helm charts created a ConfigMap resource which included the application’s configuration. The Deployment resource for the service mounted the application configuration from the ConfigMap into the application container. In order to create the ConfigMap resources with the application configuration, the application configuration needed to be committed to the deployment code repository. To work around this limitation, it was decided that the build and deploy pipeline would copy application configuration files from the application code repository to the deployment repository. The pipeline would copy and commit these files alongside a new Docker image tag whenever the application code was built.
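The copy step might be sketched as follows. The directory name `config`, the file `image-tag.json`, and its structure are placeholders chosen for the example; the actual pipeline's file layout is not described in this post, and the commit itself is left to a later step.

```python
import json
import shutil
from pathlib import Path
from typing import List


def stage_release(app_repo: Path, deploy_repo: Path, image_tag: str,
                  config_dir: str = "config") -> List[str]:
    """Copy application configuration into the deployment repository and record the image tag.

    Returns the repository-relative paths that a later step would commit.
    """
    source = app_repo / config_dir
    target = deploy_repo / config_dir
    if target.exists():
        shutil.rmtree(target)  # drop files that no longer exist in the application repository
    shutil.copytree(source, target)

    tag_file = deploy_repo / "image-tag.json"
    tag_file.write_text(json.dumps({"image": {"tag": image_tag}}) + "\n")

    staged = sorted(p.relative_to(deploy_repo).as_posix()
                    for p in target.rglob("*") if p.is_file())
    return staged + ["image-tag.json"]
```

Replacing the whole directory rather than copying file by file keeps the deployment repository an exact mirror of the application's configuration at the commit that was built.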
Shifting deployment configuration out of application code repositories required significant changes to the existing build and deploy pipeline. The existing deployment logic, which was contained in each application code repository’s Makefile, was moved to a new Jenkins Shared Library function. Having a single source of truth for the deployment logic would make it easier to iterate and ship changes. To support this kind of iterative development, the Build team developed a new versioning strategy for the Jenkins Shared Library. Previously, the Jenkins Shared Library had been pinned to a released patch version. Going forward, the library would be pinned to a major version branch. Any backwards compatible change would be merged to the major version branch, which would automatically propagate to all services the next time they were built.
The new Jenkins Shared Library function was almost identical to the old library function used to build and deploy services. The key difference was how the deploy steps were defined. A single boolean variable indicated whether or not a service should be deployed to an environment. When this variable was set to true, the ArgoCD workflow would be executed. This workflow would:
- Commit the application configuration files and the tag of the newly built Docker image to the deployment configuration repository.
- Wait for ArgoCD to register the corresponding Application as “OutOfSync” (i.e., its live state on the cluster did not match what was in GitHub).
- Issue ArgoCD a sync command for the Application.
- Wait for the changes to deploy and for ArgoCD to report back that the Application was healthy.
- If ArgoCD reported the Application as healthy, the same flow would be triggered for the next environment the service was to be deployed on.
- If ArgoCD reported the Application as unhealthy, the pipeline would fail.
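The steps above can be sketched as a loop over environments. The real workflow lived in a Jenkins Shared Library; this Python version only illustrates the control flow. The `argo` client and `commit_release` callback are hypothetical stand-ins, the Application naming scheme is assumed, and treating a `Degraded` health status as the unhealthy case is a simplification.

```python
import time
from typing import Callable, Sequence


def wait_until(read: Callable[[], str], wanted: str, failed: Sequence[str] = (),
               timeout_s: float = 600.0, poll_s: float = 5.0,
               sleep: Callable[[float], None] = time.sleep) -> None:
    """Poll read() until it returns wanted; raise on a failed state or on timeout."""
    waited = 0.0
    while True:
        state = read()
        if state == wanted:
            return
        if state in failed:
            raise RuntimeError(f"reached failed state {state!r}")
        if waited >= timeout_s:
            raise TimeoutError(f"timed out waiting for {wanted!r}; last state was {state!r}")
        sleep(poll_s)
        waited += poll_s


def promote(service: str, environments: Sequence[str],
            commit_release: Callable[[str], None], argo,
            sleep: Callable[[float], None] = time.sleep) -> None:
    """Roll a release through the environments in order, stopping at the first failure.

    `argo` is a hypothetical client exposing sync_status(app), health_status(app)
    and sync(app); `commit_release(env)` stands in for the commit to the
    deployment configuration repository.
    """
    for env in environments:
        app = f"{service}-{env}"  # hypothetical Application naming scheme
        commit_release(env)
        wait_until(lambda: argo.sync_status(app), "OutOfSync", sleep=sleep)
        argo.sync(app)
        wait_until(lambda: argo.health_status(app), "Healthy",
                   failed=("Degraded",), sleep=sleep)
```

Because an exception aborts the loop, a failure in one environment prevents the release from reaching any later environment.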
To facilitate rolling out changes to the deployment configuration, the same pipeline was used. If the change-set only included deployment configuration, the pipeline would skip building the application and would not commit application configuration or a new image tag to the deployment repository. Instead, the pipeline would skip ahead to issuing sync commands and waiting for the deployment to be rolled out successfully to each environment.
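One way to picture the two paths through the pipeline is as a list of planned steps. The step names below are invented for illustration, and the sketch assumes the pipeline already knows whether the triggering change touched only deployment configuration.

```python
from typing import List, Sequence


def plan_pipeline(deployment_only: bool, environments: Sequence[str]) -> List[str]:
    """List the pipeline steps for a full release or a deployment-configuration-only change."""
    steps = [] if deployment_only else ["build", "test", "push-image"]
    for env in environments:
        if not deployment_only:
            # Only a full release commits application configuration and a new image tag.
            steps.append(f"commit-release:{env}")
        steps += [f"sync:{env}", f"wait-healthy:{env}"]
    return steps
```

Both paths share the sync and health-check steps, which is what allows a single pipeline to serve application releases and configuration rollouts alike.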
Positive Side Effects We Experienced
While the core goal of this project was to decouple deployment configuration from application code, adopting ArgoCD and GitOps also provided several other benefits:
- ArgoCD paves the way for easier rollbacks. Out of the box, ArgoCD supports commands that make it possible to roll back to an earlier version of the deployment configuration. It is also possible to accomplish a rollback by using a Git revert in the deployment repository.
- Using ArgoCD to manage deployments removes the need for Jenkins to have direct access to each Kubernetes cluster. Jenkins required privileged access to create and update Kubernetes resources for service deployments. While ArgoCD also has privileged access, it can be better scoped using ArgoCD Projects and RBAC.
- The desired state of each cluster is now declarative: the contents of each cluster are declared on GitHub. Historically, this state was the result of imperative deployments. For example, say the following needed to be answered: What should be deployed on each cluster? What version of an application should be on each cluster? Previously, the only way to answer these questions would be to check the Jenkins console or to examine the cluster state itself. With ArgoCD, this can be accomplished by generating a diff between the configuration on GitHub and the configuration on the cluster.
- Because cluster state is declarative, changes to it are easily auditable. It is easy to review history in GitHub to see when a change was made, who made it and why.
- The likelihood of service outages or degradations is reduced by requiring a review from the Compute Platform team on changes to deployment configuration code.
- Disaster recovery is simpler. If a cluster enters a failure state that cannot be recovered, a replacement cluster can be brought up and bootstrapped with ArgoCD. ArgoCD can then be used to sync all services that should be there for the given environment to the new cluster. Previously, backups could be used for recovery, but there is always a risk that they are slightly out of date depending on when they were taken.
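The idea of comparing declared state with live state, mentioned in the list above, can be shown with a toy example. ArgoCD computes its diff over full Kubernetes manifests; the sketch below reduces the problem to a mapping of service name to version, and the service names and versions are invented.

```python
def diff_state(declared: dict, live: dict) -> dict:
    """Report every service whose live version differs from the declared version."""
    report = {}
    for service in sorted(declared.keys() | live.keys()):
        want, have = declared.get(service), live.get(service)
        if want != have:
            report[service] = {"declared": want, "live": have}
    return report


# Hypothetical state: what Git declares versus what the cluster is running.
declared = {"billing": "1.4.2", "profile": "2.0.0", "search": "0.9.1"}
live = {"billing": "1.4.2", "profile": "1.9.7", "legacy-api": "3.1.0"}

drift = diff_state(declared, live)
```

Here `drift` flags `profile` as out of date, `search` as missing from the cluster, and `legacy-api` as running without being declared.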
Having decoupled application code and deployment code, it was now possible to automate upgrading all existing services to use a new version of the service Helm charts. Soon after the migration to ArgoCD, a new version of the Helm charts needed to be rolled out to patch a sidecar container used for secrets management. Previously, this change would have taken months to roll out. With ArgoCD, it took a day.
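A programmatic chart upgrade of this kind could be as simple as the following sketch. It assumes each deployment repository's configuration has already been loaded into a dictionary with `chart` and `version` keys; those key names, the chart names, and the versions are all hypothetical, and committing the result back to each repository is out of scope.

```python
from typing import Dict, List


def bump_chart_version(deploy_configs: Dict[str, dict], chart: str,
                       new_version: str) -> List[str]:
    """Set the chart version for every service using the given chart.

    Updates the configs in place and returns the names of the repositories that changed.
    """
    changed = []
    for repo, config in sorted(deploy_configs.items()):
        if config.get("chart") == chart and config.get("version") != new_version:
            config["version"] = new_version
            changed.append(repo)
    return changed


configs = {
    "billing": {"chart": "rest-service", "version": "3.1.0"},
    "profile": {"chart": "grpc-service", "version": "2.4.0"},
    "search": {"chart": "rest-service", "version": "3.2.0"},
}
updated = bump_chart_version(configs, "rest-service", "3.2.0")
```

Only `billing` changes in this example: `search` is already on the target version and `profile` uses a different chart.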
It was incredibly satisfying for the migration to be validated in such an obvious and measurable way. The team is now looking forward to using the time savings from ArgoCD and GitOps to make further improvements to the reliability and extensibility of the Kubernetes platform at Hootsuite.