Running project-specific CI/CD pipelines for a monorepo in AWS

Nathan Derave · Published in datamindedbe · 7 min read · Mar 19, 2022

Enabling Continuous Integration and Continuous Deployment (CI/CD) for your team is a critical task that heavily impacts delivery velocity as well as overall code quality and operational capabilities.

Continuous integration and continuous deployment have been more formally defined by Martin Fowler in dedicated articles (CI and CD) as follows:

Continuous Integration (CI) is a software development practice where members of a team integrate their work frequently, usually each person integrates at least daily — leading to multiple integrations per day. Each integration is verified by an automated build (including test) to detect integration errors as quickly as possible.

Continuous Delivery (CD) is a software development discipline where you build software in such a way that the software can be released to production at any time.

A well-implemented CI/CD capability will ultimately allow your team to release changes to your software product multiple times a day. This article will illustrate how a CI/CD pipeline can be implemented on AWS using CodeBuild and CodePipeline in the specific case of a monorepo.

CI/CD enables your team to deliver software like cars from automated factories (credits Lenny Kuhne)

🤨 What’s a monorepo and why does it make CI/CD implementation harder?

As explained on Wikipedia, in version control systems, a monorepo (“mono” meaning “single” and “repo” being short for “repository”) is a software development strategy where the code for many projects is stored in the same repository. Instead of storing each project (or component of your stack) in a dedicated Git repository, some teams prefer to store them all in a single repository where each project is a subfolder of the monorepo.

Polyrepo vs. monorepo setup.

This has various advantages, like the ability to change multiple components of your stack in a single atomic commit, facilitate large-scale refactoring, or make the whole codebase more visible and easier to discover. In an era where microservices are gaining more traction every day, a monorepo can help your team be more productive.

Of course, the monorepo approach also has its drawbacks, like the difficulty of restricting access to portions of the codebase, Git performance degrading as the codebase grows, or the CI/CD pipeline setup becoming more challenging. At this point, you’ve got it: this article will focus on this last drawback, the CI/CD pipeline setup.

The main challenge behind CI/CD pipelines for a monorepo comes from the fact that a unified changelog makes it harder to resolve which specific pipeline (component-specific by definition) should be triggered for a given change (which can potentially impact multiple components at the same time).

For example, let’s say two given changes in a branch impact the frontend and the basket API components. With polyrepos, there isn’t any complexity: those two changes will be made against different repositories, and a change in a repository always triggers the same pipelines (the CI and CD of that repo/project). We have a 1:1 relationship.

Code changes in polyrepos are easy to map with associated CI/CD pipelines.

By contrast, with a monorepo, those two changes live in the same Git log history (naturally, since it’s a single repo), but we still need to be able to resolve which components’ CI/CD pipelines need to be triggered. We have a 1:N relationship. This means that, for monorepos, our CI/CD system must implement a change filtering feature allowing us to define rules like:

  • For changes in frontend/** → Trigger frontend master pipeline or frontend PR pipeline.
  • For changes in backend/** → Trigger backend master pipeline or backend PR pipeline.
  • For changes in basket_api/** → Trigger basket API master pipeline or basket API PR pipeline.
  • etc.

We need to be able to use the change metadata (mainly the paths of the changed files) to determine the appropriate CI/CD pipelines to trigger.
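To make this concrete, here’s a minimal sketch of such a filtering rule set in Python (the component prefixes and pipeline names are illustrative):

```python
# Illustrative mapping of component path prefixes to their pipelines.
PIPELINE_RULES = {
    "frontend/": {"pr": "frontend-pr", "master": "frontend-master"},
    "backend/": {"pr": "backend-pr", "master": "backend-master"},
    "basket_api/": {"pr": "basket-api-pr", "master": "basket-api-master"},
}

def pipelines_to_trigger(changed_files, event_type):
    """Map the paths of changed files to the set of pipelines to trigger."""
    triggered = set()
    for prefix, pipelines in PIPELINE_RULES.items():
        if any(path.startswith(prefix) for path in changed_files):
            triggered.add(pipelines[event_type])
    return triggered

# A single commit touching two components triggers two pipelines (1:N).
print(pipelines_to_trigger(["frontend/src/App.js", "basket_api/api.py"], "pr"))
# -> {'frontend-pr', 'basket-api-pr'}
```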

For monorepos, we need to be able to use the change information to trigger the right CI/CD pipelines.

Some CI/CD providers like Azure DevOps, GitHub Actions, or CircleCI implement such a path filtering feature. AWS CodeBuild offers a FILE_PATH webhook filter allowing you to define file path patterns that trigger CodeBuild jobs (see the sketch after the list below). However (and sadly), this approach lacks a few critical features:

  • You can’t automatically terminate a build job that is already running. On top of wasting resources, this can lead to strange race conditions, like builds canceling each other.
  • It doesn’t allow you to create manual approval steps (e.g. to promote your project deployment from one environment to another).
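For reference, here’s roughly what the native path filtering looks like when configuring a CodeBuild webhook with boto3 (the project name is illustrative); the limitations above still apply:

```python
import boto3

codebuild = boto3.client("codebuild")

# Only trigger the (illustrative) frontend-pr project when a pull request
# event touches files under frontend/.
codebuild.create_webhook(
    projectName="frontend-pr",
    filterGroups=[
        [
            {"type": "EVENT", "pattern": "PULL_REQUEST_CREATED, PULL_REQUEST_UPDATED"},
            {"type": "FILE_PATH", "pattern": "^frontend/.*"},
        ]
    ],
)
```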

AWS CodePipeline copes with most of these limitations but can only monitor a single branch (most likely master), making it suited for CD specifically (and not for the CI of pull request changes). Neither CodeBuild nor CodePipeline natively provides the full set of features needed to make them fit for monorepo setups.

Feels kinda frustrating, heh?

🛠 Implementing flexible CI/CD triggering logic with AWS Lambda functions.

Fortunately, even if we can’t rely on AWS CodeBuild and CodePipeline alone, we can still implement a fairly simple (yet flexible and powerful) Lambda function that will take care of triggering the right build jobs for us (and/or canceling redundant ones).

The idea, shared in this AWS DevOps blog post, is to use a Lambda function as a middle layer between our monorepo and the CI/CD build jobs available in our AWS account. By connecting our monorepo to the Lambda through a webhook (configured on the repository’s side), the function can receive all the events happening in the repository. Based on the metadata of those events (e.g. file diffs) and our implemented logic (e.g. ignored files), it can trigger the appropriate CI/CD jobs. And the cherry on the cake: this solution also accommodates regular repositories.

A flexible CI/CD architecture based on Lambda functions and AWS CodeBuild/CodePipeline.

It all starts with a developer pushing changes to a feature branch. This change triggers a configured repository webhook that sends a payload to the API Gateway in front of our CI/CD manager Lambda function. The latter authenticates the call (e.g. using an API key or IP whitelisting) and forwards it to the CI/CD manager Lambda function itself. The Lambda function analyzes the content of the payload to extract a few interesting pieces of information, like:

  • What kind of event is this? Pull request, merge to master, something else?
  • What are the impacted components?
  • The same, but once the ignored files are filtered out?
  • For the given impacted components, what are the resources the Lambda needs to trigger? With what reference version (for repository cloning)?

All those questions can be answered directly by the Lambda implementation, and the answers are used to trigger the appropriate build resources through the boto3 clients: CodeBuild for pull request changes and CodePipeline for merges to master.
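As a minimal sketch (the event type names and the shape of the component configuration are assumptions, matching the configuration files described below), that triggering and cancellation logic could look like this:

```python
import boto3

codebuild = boto3.client("codebuild")
codepipeline = boto3.client("codepipeline")

def trigger_component(event_type, component_config, source_version):
    """Start the right CI/CD resource for one impacted component (sketch)."""
    pipelines = component_config["pipelines"]
    if event_type == "pull_request":
        # CI of pull request changes: run the component's CodeBuild project
        # against the exact commit received in the webhook payload.
        codebuild.start_build(
            projectName=pipelines["pr-codebuild"],
            sourceVersion=source_version,
        )
    elif event_type == "merge_to_master":
        # CD after a merge to master: release through CodePipeline.
        codepipeline.start_pipeline_execution(name=pipelines["master-codepipeline"])

def cancel_redundant_builds(project_name, latest_commit):
    """Stop in-progress builds of superseded commits for a project (sketch)."""
    build_ids = codebuild.list_builds_for_project(projectName=project_name)["ids"]
    if not build_ids:
        return
    for build in codebuild.batch_get_builds(ids=build_ids[:20])["builds"]:
        if build["buildStatus"] == "IN_PROGRESS" and build.get("sourceVersion") != latest_commit:
            codebuild.stop_build(id=build["id"])
```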

In order to map deployed CodeBuild/CodePipeline resources to changed-files events, the Lambda can rely on config files living in a dedicated S3 bucket. The idea is to have one configuration file per project/component instead of centralized configuration files maintained by a single team. The sum of all those files represents the complete list of CI/CD resources available. By operating this way, the set of triggerable CI/CD pipelines known by the Lambda function can be extended by any entity able to push a configuration file to the Lambda’s S3 bucket. This heavily improves the ability of development teams to self-service their CI/CD pipelines, and the overall scalability of the system.
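Here’s a short sketch of how the Lambda could assemble its pipeline registry from that bucket at invocation time (the bucket name is illustrative):

```python
import json

import boto3

s3 = boto3.client("s3")
CONFIG_BUCKET = "cicd-manager-configs"  # illustrative bucket name

def load_component_configs():
    """Read every per-component config file and merge them into one registry."""
    registry = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=CONFIG_BUCKET):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=CONFIG_BUCKET, Key=obj["Key"])["Body"]
            registry.append(json.loads(body.read()))
    return registry
```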

The configurations in question can take various formats, like .yaml or .json, as long as the Lambda implements the right methods to read them. Here’s an example of what the configuration could look like for a given component in the .json format (the exact values are illustrative):
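```json
{
  "component_path": "basket_api",
  "repository": "my-org/my-monorepo",
  "pipelines": {
    "pr-codebuild": "basket-api-pr",
    "master-codepipeline": "basket-api-master"
  }
}
```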

The configuration example above is dedicated to the basket-api. You can observe multiple keys in this .json, like component_path, repository, or pipelines. component_path refers to the path of the component relative to the repository’s root. Based on the repository’s event and the associated changes diff, these keys are used to trigger the pr-codebuild and master-codepipeline resources listed under the pipelines key. pr-codebuild and master-codepipeline are keywords used by the Lambda function.

🚀 Concluding on this design.

Continuous Integration and Continuous Deployment are capabilities that high-performing software development teams can no longer do without. This article illustrated how a CI/CD pipeline can be implemented on AWS using CodeBuild and CodePipeline in the specific case of a monorepo.

We’ve seen that implementing proper CI/CD pipelines in the context of monorepos can quickly become challenging if the automation provider you’re using doesn’t have the right set of features available natively.

That’s the case for AWS CodeBuild/CodePipeline which, even though each provides some monorepo-friendly features, don’t fully support such setups natively. Thanks to their versatility, AWS Lambda functions allow us to extend the features of CodeBuild/CodePipeline in an elegant and easy-to-maintain way.

👀 Okay, and now?

This blog post mainly presented how this modular CI/CD design can fulfill the requirements we may have for monorepos (and polyrepos too!). Cool, but that doesn’t show you the code, right? No worries, we built and tested this solution for real. In a follow-up article, we’ll answer the few questions that haven’t been answered (yet):

  • How to gracefully implement and operate the Lambda CI/CD manager?
  • How can we allow developers to self-service the deployment of their CI/CD stack? How can we make that easy and repeatable?

See you there 👋 !

I work at Data Minded, an independent Belgian data engineering consultancy. Reach out if you’re interested in working with us!

