Improving our Serverless CI/CD up to 60% in a Couple of Simple Steps

Source: https://unsplash.com/photos/NcTQ602gKLI

Context

At SSENSE, we invest ample time not only creating new internal products, but also improving our technology stack and its related processes to stay on the cutting edge. We leverage serverless technologies to iterate faster on our product development, which lets us focus more on delivering business value rather than on the scalability of our services. Currently, our engineering teams develop some of our products and solutions using the Serverless Framework and TypeScript, which enable us to rapidly develop Minimum Viable Products (MVPs), reduce our feedback loop times, concentrate on delivering business value, and easily manage Amazon Web Services (AWS) resources and deployments.
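For readers unfamiliar with this stack, here is a minimal sketch of what one of these services looks like: a TypeScript Lambda handler that the Serverless Framework wires up to an HTTP event. The function name, payload shape, and path parameter are hypothetical, not taken from our codebase.

// handler.ts — hypothetical, minimal Lambda handler in TypeScript
import { APIGatewayProxyEvent, APIGatewayProxyResult } from 'aws-lambda';

export const getItem = async (
  event: APIGatewayProxyEvent
): Promise<APIGatewayProxyResult> => {
  // Path parameter name is illustrative only.
  const itemId = event.pathParameters?.id;

  return {
    statusCode: 200,
    body: JSON.stringify({ itemId }),
  };
};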

We are currently developing an inventory management solution that follows Domain-Driven Design (DDD), but even though concepts and data persistence were isolated per Bounded Context (BC), the CI/CD pipeline was not. In our application stack, we managed multiple bounded contexts deployed to different resources, yet organized as a monolithic application, defeating the whole purpose of microservices and DDD.

The solution that we came up with at the time was to make the necessary changes to guarantee every bounded context would have a separate:

  • CI/CD Pipeline (Changed)
  • Application Stack (Changed)
  • API (Already using this structure)
Example of our CI/CD Structure

By using this structure, every bounded context would manage its own set of resources to be created in AWS, execute the tests that corresponded to it, and deploy to QA and production. Importantly, only the pipelines for the bounded contexts that were modified would run: for example, if we had three bounded contexts and only two contained changes, only the pipelines for those two would be triggered, in parallel (see the sketch below). A couple of weeks after separating the BCs in our CI/CD pipeline, we started to notice a worrying trend.
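As a rough illustration of how this kind of selective triggering can work, here is a minimal shell sketch that decides which bounded context pipelines to start. The directory layout, branch name, and trigger script are assumptions for the example, not our actual pipeline code.

#!/usr/bin/env bash
# Hypothetical change-detection step: trigger one pipeline per modified bounded context.
# Assumes each bounded context lives under src/<bc-name>/ — illustrative layout only.
set -euo pipefail

changed_bcs=$(git diff --name-only origin/main...HEAD \
  | grep '^src/' \
  | cut -d '/' -f 2 \
  | sort -u)

for bc in $changed_bcs; do
  echo "Triggering pipeline for bounded context: $bc"
  # ./trigger-pipeline.sh "$bc"   # placeholder for the real CI trigger
done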

Problem

The trend we noticed was that every time we added a change to one or more of our bounded contexts, the Lead Time for Changes of our CI/CD pipeline increased by minutes. At some point, every check we needed to run on our PRs took about 20 minutes, from the moment we pushed our changes until the last check completed. And it took about 35 minutes from the moment we merged a PR to the main branch until it was deployed to production.

This trend got to unacceptable levels, and it impacted two of the four key metrics mentioned in the Accelerate book:

  • Lead Time for Changes — As previously mentioned, our delivery time from the moment we pushed to the main branch until our code was deployed to production was around 35 minutes in total.
  • Mean Time to Restore — If we had a critical bug in production (something that never happened, fortunately), our time to restore to a previous stable version was at risk of getting severely impacted, which also would have negative effects on the business.

Investigation

We took on the challenge of investigating the root cause and applying all possible fixes to improve our CI/CD pipeline times within one sprint. The goal was to reduce the observed ~35 minutes per run by at least 50%.

In our first step of discovering what was happening in our pipeline, we needed to identify the bottlenecks and the most problematic steps, timewise. Thanks to the magic of DataDog metrics and its integration with our CI/CD pipelines, we were able to quickly get very detailed information about every step running in a pipeline’s execution, but also about what was happening in the very bones of it.

After jumping in and analyzing the information provided by DataDog, we were able to identify the steps that were consuming the most time and resources per bounded context.

  • The longest Docker image build step took around 10 minutes.
  • Our test execution steps (unit and functional) took about 5 minutes per test execution.
  • Our bundling + deployment step took ~10 minutes per environment.

With these problematic steps identified, and the times provided, we were ready to jump in and fix each individual step.

Docker Problem

As I mentioned before, we identified that the longest execution time of our Docker image build step was about 10 minutes; the shortest was about 2 minutes. There was an intriguing escalation of time per bounded context: the more bounded contexts we had in the pipeline, the longer the build took for some of them.

This symptom had us perplexed for a moment, and got us thinking for a good chunk of time about what was happening, until we found an interesting pattern. For context, to build every single bounded context we use a single Dockerfile with a multi-stage build, to make sure we only package what we really need in each image. I know this is not ideal, but we have to deal with some legacy concerns too ¯\_(ツ)_/¯.

The pattern we noticed with our multi-stage Docker build was that the further down the Dockerfile a stage was defined, the more layers were built for it, including layers belonging to stages it did not depend on. What do I mean by that? Here’s a small example. We have a Dockerfile similar to this:

## BOUNDED CONTEXT IMAGE BUILDING STEPS

# Base stage shared by all bounded contexts
FROM node:14 as base_bc
WORKDIR /code
# Add layers common to the bounded contexts

# XXXX BC build
FROM base_bc as base_bc1
# Add layers specific to this bounded context

# YYYY BC build
FROM base_bc as base_bc2
# Add layers specific to this bounded context

When running BC1 and BC2 in parallel, we noticed that BC1 was building its Docker image just as expected, but BC2 was using layers meant to be used by BC1 only.

The reason for this behavior is that, without Docker BuildKit, the legacy builder runs every stage of a multi-stage Dockerfile up to the requested target, instead of building only the stages the target actually depends on. BuildKit was enabled by default locally, but not in our CI/CD pipeline.

We applied the fix by enabling BuildKit in our CI/CD executions, using the environment variable `DOCKER_BUILDKIT=1`.
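To make the fix concrete, this is roughly what the per-bounded-context build commands look like with BuildKit enabled; the image names and tags are placeholders, not our real ones.

# Build each bounded context image from its own target stage, with BuildKit enabled.
# With BuildKit, only the stages the requested target depends on are built.
DOCKER_BUILDKIT=1 docker build --target base_bc1 -t bc1-image .
DOCKER_BUILDKIT=1 docker build --target base_bc2 -t bc2-image .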

After this minor config fix, our Docker build steps ran in at most 2 minutes per bounded context, including the one that previously took up to 10 minutes. A massive win for a minor fix. 🥳

Testing Execution Problem

The next steps to improve in our CI/CD pipeline were both test executions (unit and functional) for every bounded context. Our first attempt to reduce the test execution time was to optimize the tests themselves, which we did, but without the expected results: a noticeable improvement of about 1 minute, reducing the execution time by 20%. Any performance improvement is good, but we knew we could improve it much further.

After taking a deeper look into the situation, we noticed that compiling the tests plus the source from TypeScript to executable JavaScript took a massive amount of time; our benchmarking identified at least ~3 minutes to compile the code plus the tests. We determined that the main culprit was one of our packages, ts-jest, which transformed our test suites.

After a team discussion about our alternatives to improve the transformation step, we decided to try esbuild, a recent tool for JavaScript bundling written in Go that has been gaining a lot of popularity because it is advertised to improve the efficiency and performance of your bundling process by 10x to 100x. Worth mentioning, it is on the SSENSE tech radar to assess.

After migrating our test transformer to use esbuild, with a package called esbuild-jest, by doing:

module.exports = {
  testEnvironment: 'node',
  transform: {
    '^.+\\.tsx?$': 'esbuild-jest', // Only line changed
  },
  collectCoverageFrom: ['./src/**/*.ts'],
  collectCoverage: true,
  coverageReporters: ['lcov', 'json', 'html', 'text', 'text-summary'],
  testPathIgnorePatterns: ['/node_modules/', '<rootDir>/.build/'],
};

And configuring babel in the following way:

{
  "presets": [
    "@babel/preset-typescript"
  ],
  "plugins": [
    "babel-plugin-transform-typescript-metadata",
    [
      "@babel/plugin-proposal-decorators",
      { "legacy": true }
    ],
    [
      "@babel/plugin-proposal-class-properties",
      { "loose": true }
    ]
  ]
}

Our test steps started to perform extraordinarily well, going from ~4 minutes to between 30 seconds and 1 minute. Small changes make a big difference. 🎉

Building + Deployment Problem

The Serverless deploy command has two main tasks:

  1. Compiles our code from TypeScript to JavaScript and bundles it into a deployable Lambda
  2. Deploys our Serverless application as a CloudFormation stack in AWS by deploying all the functions + resources specified in the serverless.yml file (sketched below)
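For context, here is a stripped-down sketch of what such a serverless.yml can look like; the service name, function, handler path, and events are illustrative, not our actual configuration.

# serverless.yml — hypothetical, stripped-down example
service: inventory-bc1            # illustrative service name

provider:
  name: aws
  runtime: nodejs14.x

plugins:
  - serverless-webpack            # compiles and bundles the TypeScript code

functions:
  getItem:
    handler: src/handlers/getItem.handler   # illustrative handler path
    events:
      - http:
          path: items/{id}
          method: get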

One of the bottlenecks we found in this command was the compilation step, taking up to 60% of the time. We mainly use a Serverless plugin called serverless-webpack, which helps us compile our code using ts-loader and bundle it “easily.”

We then saw another opportunity to use esbuild again; however, we didn’t want to fully move from webpack to esbuild, since webpack helps us tree-shake most of the code and dependencies to keep the lambda bundle as small as possible. Still, we wanted to have the performance improvements from esbuild.

After looking for solutions online, mostly on GitHub, we found a library that would give us the best of both worlds without too much effort and while minimizing risk. The library is called esbuild-loader, and it allows projects using webpack to leverage esbuild and still get some of the performance improvements of the bundler itself. After migrating our config to this library, we went from 10 minutes to around 4 minutes per deployed environment.
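As an illustration of the kind of change involved, here is a minimal webpack rule swapping ts-loader for esbuild-loader. The exact options depend on the esbuild-loader version, so treat this as a sketch rather than our exact configuration.

// webpack.config.js — minimal sketch of the loader swap (not our full config)
module.exports = {
  target: 'node',
  module: {
    rules: [
      {
        test: /\.tsx?$/,
        // Previously: loader: 'ts-loader'
        loader: 'esbuild-loader',
        options: {
          loader: 'ts',       // treat matched files as TypeScript
          target: 'node14',   // match the Lambda Node.js runtime
        },
        exclude: /node_modules/,
      },
    ],
  },
};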

Conclusion

As you can see, small changes can have a significant impact. After measuring all of our time improvements, we reduced our Lead Time For Changes from 30 minutes to 10 minutes, a time reduction of more than 65%. We feel that there’s still room for further optimizations, but for the 2-week sprint of work, these were great improvements.

Editorial reviews by Catherine Heim & Pablo Martinez.

Want to work with us? Click here to see all open positions at SSENSE!
