
Effectively scaling your Gitlab pipeline configuration in microservice environments

Marc Rooding
inganalytics.com/inganalytics
May 20, 2020


Gitlab offers a very intuitive and easy way to configure a CI & CD pipeline for a project using the .gitlab-ci.yml file. This works fine for a single repository, but what if your application consists of several repositories that all have to share a similar configuration? Or perhaps you’re using Gitlab on an organisational scale, with different teams building completely different products. Those teams most likely maintain their own repositories and, thus, their own pipeline configuration.

This article describes some patterns and gotchas in setting up re-usable and pluggable pipeline scripts to reduce re-engineering and avoid copy-pasted pipeline configuration that’s bound to become outdated.

So, what’s the problem?

At WBAA, we build several independent AI-driven products. These projects are all contained in their own Gitlab group and often consist of at least 5 repositories. The path of least resistance when creating a new repository is to copy-paste a pipeline definition from another repository that uses the same runtime. Oh, so we have to create another Scala microservice? Sure, I’ll just grab the pipeline from this one, update a few references here and there, and be done with it.

But what if you need to update a specific pipeline step that is used in all your repositories? Or what if your company decides to enforce specific steps to be included in the pipeline? Can you imagine the tedious, manual and error-prone process that follows to get multiple teams to update a dozen or more repositories with an identical change?

I think it’s clear that the flexibility of having a .gitlab-ci.yml file per repository has its advantages but not without severe drawbacks when scaling beyond a single repository.

I know the solution!

What I’ve seen happen as a result of the growing pains described above is that teams often make drastic decisions that do not come without a technical or operational cost.

One example of this is deployments. “Let’s not duplicate all our environment configuration over multiple repositories; let’s just create a separate repository that does all the deployments.” The result is that your repository, hopefully containing an isolated service, can no longer manage its own deployment. Those feature branch deployments to your development environment? Gone! So, you’ve merged a feature branch and a new version has been tagged and pushed to the Docker repository. Now you get to update a version reference in your deployment repository to kick off another pipeline that does the actual deployment. In my opinion, the advantages of this level of abstraction do not outweigh the disadvantages.

Another example that I’ve seen happen is the push to create a mono repository. If we put all the code together, the problem magically disappears! However tempting it sounds, this will certainly lead to its own challenges, and the topic alone deserves a separate article to do the comparison justice.

Tackling the pain points of a single pipeline configuration file

Can we find a compromise in which we benefit from all the glorious things split repositories bring us while addressing some of the pain points mentioned above?

The .gitlab-ci.yml YAML syntax offers us a few powerful techniques that we can use to reduce boilerplate and promote re-use. The first one is the ability to include other YAML files in your pipeline using an include block. It allows you to include files from the same repository, from a different repository on the same Gitlab instance, or from a remote location over HTTP.
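For illustration, the three flavours could look somewhat like this (the template paths and URL are placeholders, not actual repositories):

include:
  # A file from the same repository
  - local: '/templates/generate-version.yml'
  # A file from a different project on the same Gitlab instance
  - project: 'engineering-tools/pipeline-scripts'
    file: 'generate-version.yml'
  # A remote file fetched over HTTP(S)
  - remote: 'https://example.com/pipeline-scripts/generate-version.yml'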

With the include technique, you can extract specific CI/CD steps into separate files and let projects that require that step include the file. We’ve used this technique to extract versioning, Helm chart generation, container inspection, security scanning and more.

Let’s look at an example: version generation. In a previous blog post, I’ve shown you how you can set up semantic versioning in Gitlab. The first step of this was generating a unique version for the running pipeline. We would like to provide this logic to every repository as the first step of its pipeline.

We’ve set up a separate repository which contains a generate-version.yml file with the following content:

variables:
  VERSION_STRATEGY: semver
  VERSION_FILE: version.sh

version:
  stage: .pre
  tags: [ docker-tag ]
  image: $ARTIFACTORY_URL/$ARTIFACTORY_DOCKER_REGISTRY/engineering-tools/version-update:1.0.0
  script:
    - python3 /version-update/version-update.py --no-release --out-file=$VERSION_FILE
  artifacts:
    paths:
      - $VERSION_FILE
  except: [ master ]

There’s nothing too special about this, except for the use of the .pre stage. A few versions ago, the .pre and .post stages were introduced to Gitlab. These allow you to define pipeline steps that will always run before or after the stages defined in the .gitlab-ci.yml file where this generate-version.yml is included. This is a powerful technique but, again, not without its limits. The major drawback that I see is that it only allows you to run steps in parallel: any step that you define to belong to .pre or .post will run in parallel within that stage. That’s fine for simple, independent tasks, but more complex tasks, like security scanning, often involve running consecutive steps. We’ll get back to this later on.
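To illustrate the parallelism, consider two hypothetical preparation jobs; both land in the same .pre stage, and there is no way to order one before the other:

prepare-tooling:
  stage: .pre
  script:
    - echo "runs in parallel with prepare-config"

prepare-config:
  stage: .pre
  script:
    - echo "runs in parallel with prepare-tooling"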

Now, if we want to use the step that we defined above, all we need to do is include it in our .gitlab-ci.yml file:

include:
  - project: 'engineering-tools/pipeline-scripts'
    file: 'generate-version.yml'

Since we’re using the .pre stage in the generate-version.yml, there’s no need to define an additional stage in the .gitlab-ci.yml.

What about including consecutive steps?

As mentioned before, you cannot utilise .pre and .post for running consecutive steps. Any step defined in either of those stages will run in parallel. Let’s take Helm chart generation as an example. Our setup consists of 3 stages that we want to run consecutively:

  1. generate-helm-chart
  2. validate-helm-chart
  3. build-helm-chart

Let’s assume that we’ve put the definition of these stages into a helm.yml file.

In the most rudimentary form, this file would look somewhat like this:

generate-chart-yaml:
  stage: generate-helm-chart
  ...

lint-chart:
  stage: validate-helm-chart
  ...

render-chart:
  stage: validate-helm-chart
  ...

build-chart:
  stage: build-helm-chart
  ...

When we’d like to use this Helm setup, all we need to do is include the file, as shown before, and then define the 3 stages that we’re referencing in the helm.yml:

include:
  - project: 'engineering-tools/pipeline-scripts'
    file: 'helm.yml'

stages:
  - build
  - test
  - generate-helm-chart
  - validate-helm-chart
  - build-helm-chart

The upside of this method is that, by looking at the .gitlab-ci.yml definition, it’s abundantly clear which stages are available in this pipeline. If you forget to define a stage that’s being used in one of the included files, Gitlab will even return an error to indicate that there’s a stage missing.

However, the upside of this method is also immediately the downside. For every included file, you’ll have to know which stages it uses, and where to use those stages in your pipeline definition. Ideally, Gitlab would support including this file without us having to define the stage definitions into the .gitlab-ci.yml. Essentially, allowing us to merge stages across includes. This has been an outstanding request with Gitlab for a long time, and it seems to be quite a difficult problem to tackle.

The biggest risk with how Gitlab supports this right now is that when someone makes a change to one of the stages used in an included file, any project that includes that file will break and will need to be manually updated. To be on the safe side, if you’re investing heavily in includes, it would be good practice to introduce versioning in the filename whenever you introduce a breaking change.
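A sketch of what that could look like, using a hypothetical v2 of the versioning file; pinning the include to a tag or commit via ref achieves a similar effect:

include:
  - project: 'engineering-tools/pipeline-scripts'
    file: 'generate-version-v2.yml'  # breaking change published as a new file
  # alternatively, pin the include to a specific tag or commit:
  # - project: 'engineering-tools/pipeline-scripts'
  #   ref: 'v2.0.0'
  #   file: 'generate-version.yml'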

Using variables in your includes

One of the strengths of using includes is that you can generalise a concept to work for a multitude of repositories. While such a concept can be a hardcoded set of commands to execute, it will often require information about the project or the context it runs in. For this, just like in your .gitlab-ci.yml, you can utilise variables.
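For example, the generate-version.yml shown earlier ships defaults in its variables block, which the including .gitlab-ci.yml can override, since values in the main file take precedence over included ones. A minimal sketch, assuming a hypothetical calver strategy:

include:
  - project: 'engineering-tools/pipeline-scripts'
    file: 'generate-version.yml'

variables:
  VERSION_STRATEGY: calver  # overrides the 'semver' default from the included file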

Unfortunately, there’s no mechanism in Gitlab to inform developers about environment variables that an included script uses. Environment variables can be set in different places: the docker runner, Gitlab group variables, any included YAML file or the .gitlab-ci.yml file itself. Therefore, it would be impossible for Gitlab to inform the user about a missing environment variable that an included file requires. However, what it could do is inform the user about a missing environment variable during the execution of the pipeline step.

Currently, if you’re using an environment variable in one of your steps, and that variable is unset, the command will behave abnormally, or error out. However, as a developer, you’re not immediately aware of why it failed. You’ll have to inspect which environment variables are set during the step execution to figure out what was missing.

To mitigate this issue as much as possible, we provide a list of required and optional environment variables with each YAML file that we provide for re-use. This will at least make integration of the YAML file clearer to other developers.
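On top of documenting them, a step can also verify its required variables itself and fail with a clear message before doing any real work. A minimal sketch using shell parameter expansion (the checks are our own addition, not a Gitlab feature):

version:
  stage: .pre
  script:
    # ${VAR:?message} aborts the shell with the message when VAR is unset or empty
    - ': "${VERSION_STRATEGY:?VERSION_STRATEGY must be set by the including project}"'
    - ': "${VERSION_FILE:?VERSION_FILE must be set by the including project}"'
    - python3 /version-update/version-update.py --no-release --out-file=$VERSION_FILE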

As mentioned before, the risk of breaking all the implementing repositories by introducing breaking changes also applies here.

Beware of anchors

In case you’re not familiar with anchors in Gitlab, it’s a YAML feature that allows you to easily duplicate content across your pipeline definition. The official documentation shows the following example, in which an anchor called job_definition is used by two steps, test1 and test2.

.job_template: &job_definition  # Hidden key that defines an anchor named 'job_definition'
  image: ruby:2.6
  services:
    - postgres
    - redis

test1:
  <<: *job_definition  # Merge the contents of the 'job_definition' alias
  script:
    - test1 project

test2:
  <<: *job_definition  # Merge the contents of the 'job_definition' alias
  script:
    - test2 project

This is a useful feature for deduplication, but it is not supported when using the includes feature. If you’re interested in defining re-usable blocks that do work across file includes, take a look at the extends functionality.
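As a rough sketch, the anchor example above could be rewritten with extends, which does survive includes:

.job_template:  # Hidden job that other jobs can extend, even across included files
  image: ruby:2.6
  services:
    - postgres
    - redis

test1:
  extends: .job_template
  script:
    - test1 project

test2:
  extends: .job_template
  script:
    - test2 project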

Configuring a required pipeline

Remember how we spoke about the ability to enforce specific pipeline steps being desirable? Not too long ago, Gitlab introduced a feature (in Gitlab Premium only) that allows Gitlab administrators to enforce a required pipeline for all projects. While the idea is nice, the implementation lacks usability.

Imagine a Gitlab instance with 100 different repositories: what kind of pipeline would you end up extracting and setting as the required pipeline? Would we add Docker container security scanning? What if the repository is a documentation repository and doesn’t contain a Docker container at all?

The solution here is to build your pipeline steps in such a way that they detect whether or not they need to run, and simply exit with a successful status code in case they don’t. Of course, this would work, but you’d still end up with all your pipeline steps being context-aware, and most of them executing (although briefly) without the need to. Last but not least, project-specific pipelines would no longer reflect only the steps that a project actually requires; they would also include all the green-marked steps that were simply skipped for being redundant.
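As an illustration, a context-aware scanning step could look somewhat like this (the scan command itself is a placeholder):

container-scanning:
  stage: test
  script:
    # Exit successfully when this project doesn't build a Docker image at all
    - |
      if [ ! -f Dockerfile ]; then
        echo "No Dockerfile found, skipping container scan"
        exit 0
      fi
    - ./run-container-scan.sh  # placeholder for the actual scan command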

There are ongoing efforts to improve project-level compliance support in Gitlab, so let’s hope that one day a good balance will be found between enforcing required steps and the granularity of (automatically) selecting the projects they apply to.

Project templates

The last feature that I’d like to highlight is the use of project templates (Gitlab premium). With this feature, you can define multiple Gitlab pipeline templates that projects can re-use. This will allow you to effortlessly define a standard pipeline, for example, per runtime. You could have a default pipeline for all Java, Python, or other runtime services.

While this is a useful feature to further reduce code duplication, from a compliance perspective it unfortunately does not guarantee that every repository actually uses one of the available project templates.

Conclusion

As you’re probably aware, no solution is perfect. Often, it’s a matter of finding the way of working that strikes the perfect balance between benefits and drawbacks.

If you’re working in a microservice environment, or you’ve got multiple teams using the same Gitlab instance, you should now be familiar with a few techniques to help improve engineering effectiveness and reduce pipeline duplication.

The combination of the techniques described above can be powerful and has allowed us to extract a lot of common pipeline steps to be re-used by multiple product development teams.
