Optimising GHA Test Pipelines: A Guide to Efficient Caching

ydrako · Published in tech-gwi
Jan 15, 2024 · 4 min read

Introduction

In the realm of Microfrontends, where each repository functions as a powerhouse in itself, it is more important than ever to establish reliable and swift pipelines for building code and executing test suites. This article explores the optimisation of GitHub Actions (GHA) test pipelines by leveraging caching strategies, specifically the caching of node_modules and yarn packages.

The Challenge

As software repositories grow in complexity, the installation of dependencies becomes a bottleneck in continuous integration pipelines. Most pull requests involve unchanged dependencies, making it inefficient to download and install them repeatedly. The primary objective is to reduce the installation time of modules from the typical 5–15 minutes to a matter of seconds.

Caching Strategy

To achieve node_modules caching, it is important to consider the following two folders.

  1. yarn_cache: This directory functions as a repository for yarn packages, ensuring their swift availability on the cache server. The "prefer-offline" flag is employed to prioritise cache utilisation during installations (as sketched after this list).
  2. node_modules_cache: Dedicated to storing the installed packages and their dependencies for each unique "yarn.lock" file.
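
As a minimal sketch (the concrete paths are illustrative and mirror the NFS setup shown in the next section), this is roughly how the two folders come into play during an installation:

# Sketch only: paths are illustrative, matching the layout described below.
# 1. yarn_cache: Yarn downloads package tarballs here and reuses them on later runs.
export YARN_CACHE_FOLDER="/mnt/cache/<org>/<repo>/cache"

# 2. node_modules_cache: one installed node_modules tree per unique yarn.lock.
checksum="$(sha1sum yarn.lock | awk '{print $1}')"
ln -s "/mnt/cache/<org>/<repo>/lock_node_modules/${checksum}/node_modules" node_modules

# --prefer-offline tells Yarn to use tarballs already present in YARN_CACHE_FOLDER
# instead of downloading them from the registry again.
yarn install --frozen-lockfile --prefer-offline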

Infrastructure Setup

We use an NFS share as the cache: each repository within the organisation has its own designated folder, keeping caching isolated per project.

First of all, the cache is mounted in the container:

container:
  image: <custom_image>:latest
  volumes:
    - /mnt/nfs-ci-cache:/mnt/cache
env:
  YARN_CACHE_FOLDER: "/mnt/cache/${{ github.repository }}/cache"
  NODE_PATH: "/mnt/cache/${{ github.repository }}/lock_node_modules"
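
With this mount in place, the NFS share ends up organised roughly as follows (the layout is inferred from the two environment variables above and the checksum logic shown in the next section):

/mnt/nfs-ci-cache/<org>/<repo>/
├── cache/                      # YARN_CACHE_FOLDER: downloaded yarn tarballs
└── lock_node_modules/          # NODE_PATH: one folder per yarn.lock checksum
    └── <sha1 of yarn.lock>/
        └── node_modules/       # installed dependency tree for that lockfile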

Caching Initialisation

The checkout step employs the sparse-checkout option to fetch only the essential files, "package.json" and "yarn.lock". For big repositories this step can save a lot of time.

- uses: actions/checkout@v4
  with:
    ref: ${{ github.event.pull_request.head.ref }}
    sparse-checkout: |
      package.json
      yarn.lock

Before proceeding with installation, a quick check determines whether a cache already exists for the given yarn.lock file. If not, the script initiates the creation process:

- name: Install node_modules for new yarn.lock
  id: yarnInstall
  shell: bash
  run: |
    set -x
    export checksum="$(sha1sum yarn.lock | awk '{print $1}')"
    if [ -d "${NODE_PATH}/${checksum}/node_modules" ]; then
      echo "Cache folder already exists for yarn.lock: ${NODE_PATH}/${checksum}/node_modules"
    else
      echo "Creating cache folder for yarn.lock: ${NODE_PATH}/${checksum}"
      echo "started=true" >> $GITHUB_OUTPUT
      mkdir -p ${NODE_PATH}/${checksum}/node_modules
      ln -s ${NODE_PATH}/${checksum}/node_modules node_modules
      yarn install --frozen-lockfile --prefer-offline
      echo "Done!"
    fi

Notice that when the cache-creation procedure starts, we echo a started=true variable into GITHUB_OUTPUT; the cleanup step below uses it to detect that this run was populating the cache.

In case of a failure during this step, a failsafe mechanism is in place to archive the potentially corrupted cache, allowing for a fresh start.

- name: Delete residues on failure
  if: ${{ always() && (steps.yarnInstall.outputs.started == 'true' && !(steps.yarnInstall.outcome == 'success')) }}
  shell: bash
  run: |
    set -x
    export checksum="$(sha1sum yarn.lock | awk '{print $1}')"
    if [ -d "${NODE_PATH}/${checksum}/node_modules" ]; then
      echo "Cache folder will be deleted for safe measure"
      mv ${NODE_PATH}/${checksum} ${NODE_PATH}/${checksum}_archive${{ github.run_number }}
      echo "Done!"
    fi

Utilising the Cache

In order to use the generated cache, we have to link the cached modules and ensure the setup is correct before building the code and running our test scripts.

- name: Link node_modules
  run: |
    set -x
    export checksum="$(sha1sum yarn.lock | awk '{print $1}')"
    ln -s ${NODE_PATH}/${checksum}/node_modules node_modules

- name: Yarn Install (NFS Cache)
  run: |
    yarn global add serve && yarn install --frozen-lockfile --prefer-offline

Putting everything together

To use everything in our pipelines, we can move these steps into reusable workflows and composite actions; a sketch of such a workflow follows the first screenshot below. In the following image we can see an example of cache population for a new "yarn.lock" file.

A snapshot from GHA. We can see the different steps that ran to populate the cache: checking out the necessary files took 1 second, and installing everything took 8 minutes and 34 seconds.
Cache population (1st run)
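
As a hedged sketch of the reusable-workflow idea (the file name, trigger, and runner label are assumptions, not taken from the original pipeline), the cache-population steps above could be grouped like this:

# .github/workflows/prepare-cache.yml (hypothetical file name for the reusable workflow)
name: prepare-the-cache

on:
  workflow_call:

jobs:
  prepare-the-cache:
    runs-on: ubuntu-latest  # assumption; in practice likely a self-hosted runner with NFS access
    container:
      image: <custom_image>:latest
      volumes:
        - /mnt/nfs-ci-cache:/mnt/cache
    env:
      YARN_CACHE_FOLDER: "/mnt/cache/${{ github.repository }}/cache"
      NODE_PATH: "/mnt/cache/${{ github.repository }}/lock_node_modules"
    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ github.event.pull_request.head.ref }}
          sparse-checkout: |
            package.json
            yarn.lock
      # ... followed by the "Install node_modules for new yarn.lock" and
      # "Delete residues on failure" steps shown earlier ...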

When the pipeline is triggered again, the install node_modules step finds the existing cached folder and continues with the test execution.

Cache reuse

The prepare-the-cache job ensures that everything is set up correctly so we can then proceed with our test execution (the caller side is sketched after the screenshot):

Test pipeline
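
On the caller side, a test pipeline might wire the jobs together roughly like this (the workflow path and job names are illustrative):

# Hypothetical caller: populate the cache first, then run the tests against it.
jobs:
  prepare-the-cache:
    uses: ./.github/workflows/prepare-cache.yml  # the reusable workflow sketched above

  tests:
    needs: prepare-the-cache  # only start once the cache for this yarn.lock exists
    runs-on: ubuntu-latest    # assumption; likely a self-hosted runner with NFS access
    container:
      image: <custom_image>:latest
      volumes:
        - /mnt/nfs-ci-cache:/mnt/cache
    env:
      YARN_CACHE_FOLDER: "/mnt/cache/${{ github.repository }}/cache"
      NODE_PATH: "/mnt/cache/${{ github.repository }}/lock_node_modules"
    steps:
      - uses: actions/checkout@v4
      # ... the "Link node_modules" and "Yarn Install (NFS Cache)" steps from above,
      # followed by the actual test commands ...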

Let’s have a look now inside a test job:

Here are the steps of a test job. The "Link node_modules" step took less than a second and the installation took 2 seconds.
Test job that uses the cache

The linking and installation process takes just a few seconds. It is important to highlight that if we were to install everything from scratch in all of the parallel jobs we run, we would need considerably more storage, memory, and CPU.

Conclusion

By strategically implementing caching at key stages of the GitHub Actions workflow, we have significantly reduced dependency installation times, streamlining the continuous integration process. This approach ensures that Microfrontend repositories can be developed and tested with greater efficiency and speed.
