CI/CD Pipeline Design Principles — CRAFTS

Kay Ren Fa
10 min read · Aug 8, 2023

As DevOps Engineers, working on CI/CD pipelines is pretty much our bread-and-butter responsibility. On most occasions, engineers don't have the opportunity to build one from scratch; they usually inherit a pipeline that has been fine-tuned over time.

Luckily for me, I have had the opportunity to design pipelines numerous times over the course of my career. I have learned from many mistakes and come up with a couple of guiding principles for designing and working on pipelines.

In this article, I will not be covering any technical implementation (there are plenty widely available); instead, I will share the principles that I have picked up.

To give more context on why I am covering this topic: a couple of months back, my organization enforced an organization-wide standardisation on the same set of CI/CD orchestration tooling. My team was using a mixed bag of TeamCity and Bamboo for CI/CD pipelines and Bitbucket for SCM, and we had to migrate them all to GitLab. That is when we got our hands dirty again.

I have years of experience working with Jenkins, developing declarative pipelines in Groovy with shared libraries for Tanzu, OpenShift, Android and iOS pipelines. Jenkins used to be the popular choice for CI/CD, but it has fallen short in capability and plugin support compared to the newer tools on the market.

Nevertheless, GitLab was rather new to me, and I had to go through its documentation to understand its functionality. We successfully completed the migration of all pipelines within two months with few roadblocks. I won't go into detail about what was done to make it a success; rather, I want to share the thought process, which applies regardless of whether you use GitLab or other tools.

The Components

Generally, a pipeline comprises these components, and each plays a part in how you design your pipeline.

  1. CI/CD Orchestration tool
  2. Source Code Management (SCM)
  3. Branching Strategy
  4. Runners/Agents
  5. Deployment Strategy

CI/CD Orchestration tool

Each tool has its own way of being used. It is important to study how the tool expects you to use it and to do a couple of proofs of concept before solidifying your design.

For instance, Jenkins is decoupled from the SCM, so it expects you to manually set up a project whenever there is a new repository (of course, there are GitHub and other integrations that solve some of these pain points).

GitLab and GitHub, on the other hand, keep the pipeline code in the repository and load it automatically.

GitLab has a myriad of rules, such as triggering a pipeline only when certain files have been modified. This greatly changes how you can design your pipeline, especially towards a monorepo design.
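
As a rough illustration (the job name and paths here are hypothetical), a GitLab CI job can be restricted to run only when files under a particular directory change, which is the building block for monorepo-style pipelines:

    build-frontend:
      stage: build
      rules:
        # run this job only when something under frontend/ has changed
        - changes:
            - frontend/**/*
      script:
        - npm ci
        - npm run build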

Another point worth mentioning is that each tool has its own way of supporting test reports, such as JUnit results. Jenkins can publish HTML reports to view unit test results, whereas GitLab ingests JUnit XML output from test stages and displays it in its own format.
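
For example, a test job can expose its JUnit XML output to GitLab so the results show up in the merge request and pipeline views (the report path below is illustrative):

    unit-test:
      stage: test
      script:
        - mvn test
      artifacts:
        when: always              # keep the report even if tests fail
        reports:
          junit: target/surefire-reports/TEST-*.xml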

Each tool has its own syntax for developing pipelines, such as Groovy or .gitlab-ci.yml and so on. In the case of Jenkins, that requires you to pick up a scripting language and the Jenkins DSL.

SCM

While there are tools that combine the orchestration tool with the SCM, some are standalone, like Gerrit or older Bitbucket versions.

The SCM plays a part as well: different SCMs support different merging strategies for pull requests, or different webhooks to the orchestration tool.

I generally prefer tools that combine SCM and orchestration, as they integrate well within their own ecosystem without the need for constant polling, so you get near-instantaneous invocation of the pipeline.

Branching Strategy

Branching strategy shapes how the team develops and operates, and how the pipeline runs.

Before the migration, my application was maintained by two teams with somewhat similar branching strategies. They developed on feature branches and merged to a development branch; the development branch was deployed to the CI environment to run a series of regression tests before the commit was promoted to the master branch, promoted again after it was deployed to the staging environment, and lastly promoted to the release branch. In between these promotions, code was merged or cherry-picked, either automatically or manually.

This migration gave me the opportunity to correct their branching strategy to solely trunk-based development. There was slight resistance at the start over how "bad/untested" code or features might leak, or that they couldn't "manipulate" the branches anymore, but it was good that we moved forward.

Trunk-based development should still be the de facto standard because we should keep branching as lean as possible, without the need to manage or manipulate branches, and any "bugs" should be fixed on the trunk as early as possible. Feature toggles play a very important role in switching off broken or unready code.

Runners

All pipelines have to run on some server. There are two forms of runners: dedicated runners and container-based runners.

Dedicated runners are VMs that are used by all pipelines and reuse the same workspace. They are generally the less favoured way of running pipelines because you need to ensure the necessary tools are installed on the VM, and since these VMs are snowflakes in nature, there might be slight disparities in tool versions. Some orchestration tools allow multiple builds on the same runner, which introduces more flakiness because the jobs share CPU and memory within the same VM. While dedicated runners may be faster by reusing the same workspace without re-pulling the whole codebase, they are prone to "manipulation": one may SSH into the server and modify the workspace, which can introduce false-positive successes and false-negative failures in builds. A shallow clone can also be used to work around the cost of a full git clone.

Container-based runners are a cleaner and more consistent approach, as each job starts from the clean slate of the image. Most importantly, teams should follow good container practices by properly tagging images rather than using latest.
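
In practice, that simply means every job declares an explicit image tag; the tag below is just an illustrative example:

    unit-test:
      # pin the job image to an explicit version instead of the mutable "latest"
      image: maven:3.9-eclipse-temurin-17
      script:
        - mvn test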

One of the key factors behind my team hitting few roadblocks was that I manage my own agents rather than using shared runners. Since I set up my own Kubernetes executors on GitLab, I can easily diagnose issues with kubectl top/describe/exec/port-forward.

Deployment Strategy

There are many deployment strategies, and it is best to consider what really suits your team and product. Do consider this series of questions:

  1. Can you afford downtime?
  2. Can you afford to run a lot more resources for a smooth transition, such as twice the amount of compute?
  3. Is your team developing the product with backward compatibility, such as having v1 and v2 APIs?
  4. Does your team accept a small amount of failure in order to test on a small subset of the audience?

Each deployment strategy will affect the way you code your pipeline. For complicated strategies like canary deployment, at a bare minimum you will need two stages (a sketch follows the list):

  1. Deploy the application and route X% of the traffic to the new version.
  2. Halt the deployment for verification before you roll back or fully switch the application to the new version.

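A minimal GitLab CI sketch of those two stages might look like this; the route-traffic.sh helper is hypothetical and stands in for whatever traffic-shifting mechanism you use (service mesh, ingress weights, and so on):

    stages:
      - canary
      - promote

    deploy-canary:
      stage: canary
      script:
        # shift a small percentage of traffic to the new version
        - ./route-traffic.sh --version "$CI_COMMIT_SHORT_SHA" --weight 10

    promote:
      stage: promote
      when: manual                # pipeline pauses here for verification
      script:
        # a rollback job would look similar, shifting the weight back to 0
        - ./route-traffic.sh --version "$CI_COMMIT_SHORT_SHA" --weight 100
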
A good list of deployment strategies can be found here.

There can be other factors in play, such as the application tech stack, but I won't be covering that.

As mentioned, there are some guiding principles I abide by when designing pipelines, captured in the acronym CRAFTS.

  1. Completeness
  2. Rapid
  3. Audit-ability
  4. Feedback
  5. Traceability
  6. Simplicity

Completeness

While some of these points talk about the whole pipeline, this one focuses solely on deployment, as it is the ultimate goal of CI/CD. So what constitutes a complete deployment?

Ideally, it should be a whole new clean slate instead of incremental. This reduces the chances of missing parts and ensures rollbacks are consistent too.

Imagine a deployment strategy that unzips an artifact and copies the contents to a directory.

Sounds completely safe, but what is the risk here? Copying only overwrites files; it doesn't remove files that are no longer supposed to be there as per the commit. These "extra" files may also break your application. Hence, it introduces a risk of failure during deployment, and it is not complete per se.
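
One way to make such a file-copy deployment complete is to remove anything on the target that is not in the new artifact, for example with rsync's delete option (the paths and host below are hypothetical):

    deploy:
      stage: deploy
      script:
        - unzip -o app.zip -d build/
        # --delete removes files on the target that no longer exist in the artifact
        - rsync -a --delete build/ deploy@app-server:/opt/myapp/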

Container deployments, on the other hand, are generally complete, as the filesystem is "reset" each time.

Rapid

It is a no-brainer that everyone wants fast feedback from the pipeline. Various tools allow parallelism in build stages, which speeds up testing; others allow caching to avoid re-building or re-downloading dependencies. These are good techniques to avoid repeating time-consuming stages.

One example would be Docker build caches, which save the trouble of rebuilding all layers and only start from the layer where changes were made.
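
A small sketch of dependency caching in GitLab CI (the cache path and commands assume a Node.js project, purely for illustration):

    test:
      stage: test
      cache:
        key: $CI_COMMIT_REF_SLUG
        paths:
          - .npm/                   # keep the npm download cache between runs
      script:
        - npm ci --cache .npm --prefer-offline
        - npm test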

Another method to make pipelines rapid is to reduce lead time between stages. That means having as few manual processes and human interventions as possible, such as parameterizing certain values by hand. A good pipeline should be as automated as possible, with few stop points, and should abort immediately whenever there is a failure. This also ties back to the fourth point on feedback, so that the team can rapidly react and make the necessary changes.

Teams should strike a balance between the types of tests, as too many tests may slow things down and too few may result in insufficient test coverage. Developers should also regularly revisit their test cases to remove deprecated test scenarios.

While rapid feedback is the goal we want to achieve, certain tests are incredibly long, such as load testing and DAST. Generally you want to park them outside the main pipeline as schedule-based jobs. They are good to have, but generally not so mission critical that they should bar anyone from merging code. Imagine a scenario where these tests take two hours; can you imagine how waiting for them to complete before proceeding would affect overall team velocity?
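
In GitLab, for instance, such jobs can be limited to scheduled pipelines so they never block a merge; the scanner wrapper below is hypothetical:

    dast-scan:
      stage: test
      rules:
        # only run as part of a nightly scheduled pipeline, never on merge requests
        - if: $CI_PIPELINE_SOURCE == "schedule"
      script:
        - ./run-dast.sh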

Audit-ability

I suppose this is not a commonly thought-of point, but I feel it is very important. A good pipeline should have controls in place to prevent people from exploiting loopholes, though there is no shame in admitting I have done this numerous times myself.

It is important to ensure that there are checks or gatekeepers in place to prevent any loopholes from being exploited.

For instance, only code from the trunk, where that particular commit has gone through all code scans with no failures, should be deployable to production. The reason is that not every commit on the trunk passes the scans, so the trunk might be a minefield of faulty code. You have to pick the right commit that is safe to use after a code merge, and checks should be in place to prevent the wrong code from going into production.
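
A minimal sketch of such a gate in GitLab CI, assuming the scan stages run earlier in the same pipeline so a failed scan blocks this job from ever becoming runnable (the deploy script is hypothetical):

    deploy-production:
      stage: deploy
      rules:
        # only commits on the protected default branch (the trunk) may reach production
        - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
          when: manual
      script:
        - ./deploy.sh production "$CI_COMMIT_SHORT_SHA"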

You may also need to "lock" the pipeline files, and/or print out which commit of the pipeline code you are using if it lives in an external repository. This is to ensure there is no temporary switch to a "malicious" commit that could allow bad actors to disable some of the important checks.
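
If the pipeline code lives in a separate repository, GitLab's include can be pinned to an immutable ref rather than a branch, so the version in use is explicit; the project and file names here are made up:

    include:
      - project: platform/pipeline-templates    # hypothetical shared pipeline repo
        ref: v1.4.0                              # pin to a tag, not a moving branch
        file: /templates/deploy.yml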

Feedback

Feedback plays an important role in letting developers know what went wrong in their code or deployment. There are two forms of feedback.

First, jobs should fail whenever testing, such as static code analysis or unit tests, finds something wrong. Tests should not be flaky, which leads developers to rerun jobs hoping they will pass the next time. The pipeline should always fail a stage when something is wrong and prevent subsequent stages, such as deployment, from proceeding.

The other form of feedback is to send notifications via Slack, Telegram, email, or some form of dashboard. This is extremely useful to get the team to look into issues immediately.
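
As a rough sketch, a job in GitLab's reserved .post stage can post to a chat webhook only when an earlier job has failed; SLACK_WEBHOOK_URL is assumed to be a masked CI/CD variable:

    notify-failure:
      stage: .post
      rules:
        - when: on_failure            # run only if an earlier stage failed
      script:
        - >
          curl -X POST -H 'Content-type: application/json'
          --data "{\"text\": \"Pipeline failed: $CI_PROJECT_NAME / $CI_COMMIT_REF_NAME - $CI_PIPELINE_URL\"}"
          "$SLACK_WEBHOOK_URL"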

Traceability

As a best practice, deployments should reuse the same artifacts across all environments; it is the configuration that sets each environment apart.

Quite importantly, there should be a way to determine whether an artifact is tied to a particular commit and build, and has not been tampered with. This can be done by signing the artifact after the build is complete, for example with Cosign.
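
A hedged sketch of what that could look like in a GitLab job, assuming the image was pushed earlier in the pipeline and the Cosign key pair is stored as protected CI/CD variables:

    sign-image:
      stage: sign
      script:
        # sign the exact image built for this commit
        - cosign sign --key env://COSIGN_PRIVATE_KEY "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"
        # anyone can later verify the artifact against the public key
        - cosign verify --key cosign.pub "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"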

You can also make it impossible to overwrite existing artifacts, for example with ECR immutable tags or Nexus repositories that restrict overwriting files. These write-once features ensure no artifact can be fabricated from another source.

Naming the artifact based on the commit ID or git tag makes it easy to trace back to the version of the code that produced the files. Certain SCMs also associate the pipeline build with the commit ID, which further enhances the traceability of input and output.

Simplicity

A well-designed pipeline is intuitive to its users. It should be simple enough for people to understand and use, yet provide enough flexibility for whatever purpose they need.

One such flexibility is deploying feature branch changes to an environment for review. Ideally we should do most of our testing locally, but certain scenarios may not be achievable through local testing or with limited compute power, so you may still need to deploy to an environment to perform end-to-end testing. Giving the team the flexibility to deploy on an as-needed basis, ideally into their own namespace, further empowers the developers.
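
In GitLab this maps naturally to a manual review job per merge request, deploying each branch into its own namespace; the Helm chart and naming below are illustrative:

    deploy-review:
      stage: deploy
      rules:
        - if: $CI_PIPELINE_SOURCE == "merge_request_event"
          when: manual                    # developers deploy their branch on demand
      environment:
        name: review/$CI_COMMIT_REF_SLUG
      script:
        - >
          helm upgrade --install "app-$CI_COMMIT_REF_SLUG" ./chart
          --namespace "$CI_COMMIT_REF_SLUG" --create-namespace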

One of my team's needs is to be able to apply a hotfix whenever necessary. That means having a backdoor to deploy to production without going through the usual series of tests. Having a hotfix branch is not in line with trunk-based development practice, but I understand the challenge of transitioning directly to trunk-based development when the team doesn't deploy the trunk to production frequently enough. Hence, I made it a "break-glass" scenario for whenever we have to do one.

Simplicity can also be drilled all the way down to the error message whenever a job fails. As pipeline development may happen alongside feature development, the pipeline may fail unexpectedly due to new changes, and developers may be quick to judge it as a pipeline issue rather than a code issue. Pipeline code is usually a bunch of bash commands that developers aren't familiar with, so these messages play an important part in helping them understand what is happening, rather than seeing a random exit 1 error.

As a closing note, developing a pipeline is no different from application design. There is no silver bullet for how a pipeline should be written, as every team is bound by different constraints such as team culture, tech stack, and CI/CD tools, but the general principles shared above do come into play. I hope some of these points bring value to your work.
