How we improved our CI/CD pipelines over the years

Felipe Peiter
Wildlife Studios Tech Blog
10 min read · Dec 18, 2020

There isn't a perfect CI/CD solution, and this shouldn't come as a surprise if you have experience in the matter.

There are several use cases and different maturity levels of the software development lifecycle amongst developers. A quick search for CI/CD solutions can be overwhelming, as there are many tools and approaches, each with its own advantages and disadvantages.

At Wildlife, our developers are responsible for the full Software Development Lifecycle: building products, configuring their own CI/CD pipelines, deploying the applications, and being on-call for what they built.

Having ownership of the full SDLC helped us achieve our goal of continuing to grow without a co-dependent relationship between teams, and it has worked pretty well for several years.

As the company started to grow, developers began creating solutions to share amongst themselves, simplifying the steps needed to deploy their applications. Development teams adopted tools to simplify their deployments, but the level of Kubernetes knowledge varied a lot between teams.

When you start to grow fast

This led us to have a few different deployment tools over the years, like the DEIS PaaS for teams that wanted more abstraction and Jenkins for teams that wanted more control over their Kubernetes deployments.

When our SRE team was formed, we had developers shipping code with DEIS, Jenkins, Gitlab pipelines, Heroku, and even manually. Although these solutions worked great for a long time, we started having a few problems. Jenkins had several custom plugins that blocked us from upgrading it. DEIS was discontinued by its community and couldn't cover every aspect you would want in a CI/CD solution.

Running applications worked fine, as their pipelines were already solid and properly configured, but developers started struggling to set up new pipelines. This was a red flag for us, as the initial time to set up a new application kept increasing.

At that time, we had a few Kubernetes clusters and some deployments on EC2 instances, but we were creating more and more Kubernetes clusters, migrating applications off EC2 instances, and integrations were tough. At first, we thought about how to be the least disruptive, as learning a new process involves a learning curve that could slow us down.

We also researched how to update our current solutions, which sounded more difficult every time we looked, as we were running really old versions of them. This brought us to the whiteboard and generated the following questions:

  • Is it worth updating our CI/CD tools? How much time would it cost us?
  • How can we avoid developers having to write complicated Jenkinsfiles? Can we create some predefined templates?
  • Should we create a DEIS fork and start maintaining it ourselves, as it was no longer maintained by its original owners?
  • How can we manage integrations with Kubernetes clusters?
  • How can we create a predictable path to deploying new applications? How can we migrate old pipelines to this new standard?

Those questions led us to one of our core values:

We innovate with research

Having many doubts and no answers made us start researching how other companies tackled this issue. We started studying the code of our Jenkins plugins and of DEIS, and researching new tools that could be adopted.

We noticed that upgrading Jenkins in place was impossible: we would have to create a new Jenkins deployment from scratch and rewrite every pipeline to the new standards, an immense amount of work. So it made sense to start exploring other options as well.

On the SRE team, we rely heavily on creating Proofs of Concept, writing internal Call-For-Comments, and, after deciding on an approach, writing Architecture Decision Records. This research period led us to create several PoCs, generate multiple discussions, and present multiple resources, alongside reports of their usage.

At that time, we researched several candidate solutions.

After the research, we liked the GitOps approach and had to decide between FluxCD and ArgoCD. The GitOps principles are greatly summarised by WeaveCloud:

  • The entire system described declaratively
  • The canonical desired system state versioned in Git
  • Approved changes that can be automatically applied to the system
  • Software agents to ensure correctness and alert on divergence

At Wildlife, we strongly rely on feedback from our peers, so, with both PoCs running, we reached out to a few developers and had them set up a few pipelines for their applications to see which tool they felt was easier to use.

After this brief validation, we decided that FluxCD was better for our use case, improving the developer experience when shipping new code.

How Flux Works

The Flux GitOps approach relies on having a declarative representation of your Kubernetes manifests inside a Git repository. Flux will then sync with that repository, generate the manifests, and apply them to the desired namespace. This flow is represented below:

Flux GitOps workflow

Instead of interacting directly with Kubernetes, developers push code to a Git repository, validate the manifests with a pipeline, and then merge it to the desired branch, which Flux watches. Whenever the changes are merged, Flux applies the new configuration. Developers still need to set up a pipeline to test, lint, and build the application, then push the images to the respective Docker repository.
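A minimal sketch of that validation step (the job name, image, and path are hypothetical) just renders the manifests in CI before the merge, so broken YAML never reaches the branch Flux is watching:

validate-manifests:
  stage: test
  image: registry.example.com/tools/kustomize:latest    # any image with kustomize available
  script:
    - kustomize build ./deploy > /dev/null               # fails the job if the manifests don't render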

One thing we noticed quickly was that Flux wouldn't scale as a single deployment, as it would have to sync with thousands of Git repositories, generate the manifests for those applications, and apply them.

To solve this, we decided to use Flux in a multi-tenant approach, as having thousands of resources managed by the same pod could increase sync times between the Git repos and what's actually deployed on the Kubernetes clusters.

This approach has a centralized Flux daemon responsible for deploying multiple other Flux daemons into the applications' namespaces. Those namespace-scoped daemons then manage only the application resources, not a cluster-wide set of resources. The architecture is exemplified below, extracted from the Flux documentation:

Multi-tenant Flux architecture

This also enabled us to create namespace-bound policies, where a team's Flux daemon can only manage resources inside its own namespace, with cluster-wide resources being managed by the cluster-admin Flux.
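As a rough illustration (the namespace and names are hypothetical, not our exact manifests), that policy boils down to binding the namespaced Flux's service account to a Role that only exists inside the team's namespace:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: flux
  namespace: team-a                 # hypothetical team namespace
rules:
  - apiGroups: ["*"]                # full control, but only within this namespace
    resources: ["*"]
    verbs: ["*"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: flux
  namespace: team-a
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: flux
subjects:
  - kind: ServiceAccount
    name: flux                      # the namespaced Flux daemon's service account
    namespace: team-a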

Another advantage of this approach is that it allowed us to monitor only the cluster-admin Flux, with the teams monitoring their own stack, which reduced the alert noise of the SRE team.

The default Flux behavior is to watch a Git repository, pointed at a path and branch, read all manifests declared in the repository, and apply them with a simple kubectl apply -f -.

This behavior can be overridden using a custom file called .flux.yaml, where you specify the actions Flux should execute to generate the manifests, instead of simply applying existing ones. The apply process remains the same, but we can use this approach to run tools like Kustomize and generate manifests on the fly, instead of relying on plain static ones. A simple example, extracted from the Flux docs, looks like this:

version: 1 # must be `1`
patchUpdated:
  generators:
    - command: kustomize build .
  patchFile: flux-patch.yaml

A more detailed explanation can be found here. The command key under generators can be used to specify any program for Flux to run to generate the manifests; in our case, that would be Helm and Kustomize.

There is no perfect solution

One thing we found when using Flux is that you need to set up a Deploy Key in your application repositories so it can clone them and push commits to them. Those keys are generated whenever a namespaced Flux is created, and registering them could be a manual step, which is okay for a few namespaces, but it does not scale to more than a hundred.

Our approach was to develop the Gitlab Flux Controller, a Kubernetes controller that searches for Flux deploy keys and pushes them to the respective repositories.

We also rely heavily on Helm and Kustomize for templating our Kubernetes applications. We like to offer developers tools they are used to, and FluxCD had support for both. For managing Helm releases, Flux provides the Helm Operator, which manages HelmRelease CRDs and could further decrease the complexity for developers, but there was a catch.

Whenever you sync a Helm deployment using the Helm Operator, the only resource that Flux manages is the HelmRelease custom resource itself. This means manual changes to the other Kubernetes resources are not overridden, which can cause the declared Git state to drift from the actual running configuration.

With that in mind, we decided to use helm template to generate the manifests, then use Flux to apply the generated manifests as plain manifests. This ensures that the resources are always the same as declared in Git.

When trying to use helm template as a generator command for Flux, we noticed that helm wasn't present on the Flux image, so we built a custom Flux image with the Helm binary added to it. Our Flux configuration files then started to become big and confusing, as each new namespace required a new generator command entry. To understand why, let's look at an example of a cluster-admin Flux repository:

Example Cluster Admin Repository

So, as you can imagine, our .flux.yaml file started looking like this:
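(The snippet below is an illustrative reconstruction; the namespace and chart names are made up, but the real file repeated this pattern once per namespace.)

version: 1
patchUpdated:
  generators:
    # one helm template entry per namespace, added by hand every time a team onboarded
    - command: helm template team-a ./charts/generic-app --namespace team-a --values ./values/team-a.yaml
    - command: helm template team-b ./charts/generic-app --namespace team-b --values ./values/team-b.yaml
    - command: helm template team-c ./charts/generic-app --namespace team-c --values ./values/team-c.yaml
    # plus plain Kustomize overlays for cluster-wide resources
    - command: kustomize build ./cluster-admin
  patchFile: flux-patch.yaml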

Which doesn’t seem like a good approach at all. Besides that, we also had to manage the Namespace creation, which wasn’t included in the default chart.

To handle that, we decided to create a helper program that could discover the Helm charts in a repository and render them without further configuration, also handling the Namespace creation.

We created helm-generate, which uses the Helm SDK to handle manifest creation using Helm functionalities. Having this tool helped us simplify our Flux configurations to something as simple as below:
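(The invocation below is illustrative; check the helm-generate documentation for the actual flags.)

version: 1
patchUpdated:
  generators:
    # helm-generate walks the repository, discovers every chart, renders it,
    # and takes care of the Namespace creation
    - command: helm-generate .
  patchFile: flux-patch.yaml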

With those details set up, we then started looking into how to monitor Flux and found the Fluxcloud project, which sends Slack messages whenever a resource is updated. After integrating Fluxcloud with Slack, we decided to create an exporter to send events to Datadog, our monitoring tool of choice here at Wildlife, but after sending a PR and waiting for several days, we discovered that the project was abandoned.

We contacted the Flux community and, after some back and forth, they made us the new official maintainers, as our fork was evolving the project and continuing its development.

The new official CI/CD tool: Gitlab Pipelines

After getting to a great Flux workflow, we moved on to fixing our CI/CD problem. There were many different options running in the company and we needed to push a single one for our backend applications. Gitlab is one of the most used CI/CD tools today, offering multiple features and even native Kubernetes integrations.

Standardizing pipelines can be really hard, as different programming languages and solutions require different configurations. To help developers build and standardize their pipelines, we've provided a set of pipeline templates that can be included as an external file. Instead of creating a complex job to log in to ECR, for example, developers can include our template job with something like below:
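(The project path and template job name below are hypothetical; the point is that the included file carries all the ECR login and push logic, so the application's .gitlab-ci.yml stays small.)

include:
  - project: "sre/pipeline-templates"     # hypothetical shared templates repository
    ref: master
    file: "/jobs/docker-ecr.yml"

build:
  extends: .ecr_login                     # template job handling the ECR authentication
  stage: build
  script:
    - make build                          # Makefile target that builds and pushes the image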

You can also notice the usage of make for the build step. One approach to standardizing pipelines is the usage of Makefiles with predefined steps. Those predefined targets help us translate complicated steps into simple commands such as make lint, make cover, and make tests, enabling us to create predefined stages and jobs that rely only on Makefiles.
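As a sketch of how far this can go (the job names are illustrative), the shared templates can define whole stages that only delegate to Makefile targets, so they work for any language:

.lint:
  stage: test
  script:
    - make lint

.tests:
  stage: test
  script:
    - make tests

.cover:
  stage: test
  script:
    - make cover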

Flux Automatic Updates and Gitlab

One of the greatest features of Flux is its ability to automatically update images by scanning the Docker repositories and updating image tags based on specific rules. This is done by adding an annotation to enable the automation: fluxcd.io/automated: "true". This enables the default behavior of updating images based on timestamp: the newest image always gets deployed. You can also apply constraints, like regex, glob, or semver rules.

For our pipelines, developers mostly use two approaches: having staging/master branches, or using SemVer to version production images.

For the first approach, whenever something is merged to staging/master, the Docker image is built and tagged; Flux then notices the new image in the repository and applies the change.

For the second, the production image is only built whenever a developer pushes a SemVer tag, creating a Git release. We usually use the automation constraints to ensure that only minor and patch versions are updated automatically, for example with the annotation fluxcd.io/tag.container: semver:~1, which allows any version within major 1.
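Putting the two annotations together, a Deployment that opts into automated updates constrained to major version 1 could look roughly like this (the application name and registry are hypothetical):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: team-a
  annotations:
    fluxcd.io/automated: "true"        # let Flux update the image automatically
    fluxcd.io/tag.my-app: semver:~1    # only follow new 1.x.y tags for the my-app container
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:1.0.0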

How to abstract Helm

One of the things we also decided to do was to create a generic Helm chart that can deploy a simple stateless application. This chart was then versioned and made available to all developers.

Having good documentation and a set of good examples helped developers with no knowledge of Kubernetes deploy applications by simply filling in a values.yaml file, instead of needing to understand what a Deployment is, for example. This is important in a company with many developers who don't interact with backend systems, as it enables them to focus on their relevant work instead of losing weeks trying to deploy an application, since Kubernetes has a harsh learning curve.
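As an illustration (these key names are hypothetical and not the chart's real interface), a developer's values.yaml for such a generic chart can be as small as this:

image:
  repository: 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app   # hypothetical ECR repository
  tag: 1.2.3
replicaCount: 3
service:
  port: 8080
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 256Mi
env:
  LOG_LEVEL: info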

What's next?

Reaching a point where developers can deploy an application from scratch with only a few MRs hasn't been easy, as tools usually don't abstract enough of the needed steps. Building several tools and making customizations helped us make the process less painful and easier to predict, as we now know what to expect from any pipeline.

Removing complicated Jenkinsfiles, using Makefiles to create templated pipelines, and using FluxCD plus the generic app chart have drastically improved our developers' experience. Although we migrated several applications to this new model, we haven't managed to migrate every single application yet and there is still much to do:

  • Migrate every remaining app to the new GitlabCI + FluxCD approach.
  • Enable FluxCD integration with Flagger for Blue/Green and Canary deployments
  • Migrate Flux v1 to the GitOps ToolKit

Soon we'll have some news about those next steps, and I might bring you updates in a new article. For now, I hope this one helps you make better decisions when deploying your applications. Good luck!
