Continuous Integration and Deployment for Clinical Bioinformatics Pipelines
First, a brief disclaimer
In 2021, I finished my specialization in Agile Project Management at Cesar School, a well-known and traditional private technology university located in Recife, Pernambuco. My capstone thesis was on Continuous Improvement for Clinical Bioinformatics Pipelines Development.
In summary, I studied and reviewed our bioinformatics software development process, which resulted in a collection of experience notes, proposed artifacts, and systems for optimizing and accelerating our pipeline build process without compromising our test results. Using the PDCA method (Plan-Do-Check-Act cycle), one of the stages where we saw potential for a significant positive change was the build, test, and deployment workflows of the pipelines. By applying continuous integration services, we could build and test our pipelines in an automated and iterative manner.
A few months later, our team took on the challenge and implemented some of the concepts I had brought to them. In this post I present our strategy for the bioinformatics development lifecycle under the CI system we implemented at Varsomics, Hospital Israelita Albert Einstein.
The full paper is available at this link. (A special mention to my professor and advisor at the time, Juliano Borges!)
Introduction
CI/CD, short for Continuous Integration and Continuous Deployment/Delivery, is a software engineering practice aimed at automating and monitoring code from integration through the testing and deployment stages. The CI/CD chain involves several stages, each designed to execute processes that contribute to a smoother, less error-prone codebase. These stages ensure that code modifications are integrated, tested, and deployed effectively.
Continuous Integration (CI) streamlines frequent code modifications by automating essential tasks like code reviews, testing, and linting. This automation enables developers to integrate changes more frequently and catch errors early, thereby maintaining high code quality and speeding up the development cycle.
Following CI, Continuous Deployment/Delivery (CD) automates the integration of code changes into the deployment environment. This phase ensures that each new code modification is deployable without manual intervention.
Implementing CI/CD chains often involves leveraging code versioning systems such as GitLab or GitHub to complement and manage the development process effectively.
In our bioinformatics workflows, integrated with source control systems such as Git, the CI/CD pipeline enables efficient development, testing, version management, and deployment of workflow updates. Whenever code changes are committed, the pipeline automatically builds, tests, and deploys the new workflow version. In the next section we present our previous development workflow and the one we adopted after CI/CD integration.
Our integration/deployment workflow
Before CI, our tests were handled manually and builds were performed only when developers remembered to run them. Essentially, from the moment a change was incorporated into the workflow until its deployment to production, the following steps were necessary:
1. Open a PR with a modification to the development branch in the workflow repository.
2. If approved, create a development release of that workflow, tag the version release, and manually generate assets for the release (main WDL and the ZIP of dependencies).
3. Update the workflow registered in the development admin of the application (Varstation/VarsMetagen) to this new version and run the test samples.
4. Once validated, merge accumulated changes from development to staging (homologation).
5. Create a staging release, tag the version release and manually generate assets for it.
6. Update the workflow registered in the staging admin of the application to this new version and run the tests.
7. At the end of a sprint, merge staging to production.
8. Update or create a new version in the production admin.
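To make the manual burden of steps 2 and 5 concrete, here is a minimal sketch of what generating the release assets by hand looks like: copying the main WDL and zipping its dependencies. File and directory names are placeholders, not our actual repository layout.

```shell
# Hedged sketch of manual asset generation: produce the two release
# artifacts (main WDL and a ZIP of its dependencies) by hand.
set -eu
mkdir -p imports release
printf 'version 1.0\n' > main.wdl           # stand-in for the real workflow
printf 'version 1.0\n' > imports/tasks.wdl  # stand-in dependency

cp main.wdl release/
# Bundle the dependency directory into the imports ZIP
python3 -m zipfile -c release/imports.zip imports/
ls release
```

Doing this by hand for every release, in every environment, is exactly the toil the CI pipeline later removed.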
Although functional, this workflow required bioinformaticians to spend significant time generating artifacts, tagging releases in multiple environments, and registering them in the application, resulting in delivery delays at the end of the sprint for both the bioinformatics and development teams. With so many manual steps, the process also increased the risk of human error.
Multiple deployment environments
Maintaining separate environments is a good practice for developing, building, and testing any bioinformatics pipeline. The goal is threefold: ensure that a stable version can be used in production; allow end-users to validate a new release without any impact on the version in production or the version under development; and enable developers to add new functionality and modify the code without affecting end-users who are validating a new version or using the current one in production.
Therefore, three deployment environments are used: development (dev), pre-production validation (hom), and production (prod).
Each environment is also associated with a branching model, which we explain in the next section.
Version control and branching model
Each environment has an associated git branch. Depending on the context and the step of the development workflow, the following branches on the remote repository are used:
- Dev — all approved PRs are merged into the dev branch, which reflects in this environment.
- Homolog — from time to time, the maintainer merges the latest commits from the dev branch to the homolog branch, reflecting in this environment.
- Prod — periodically, the maintainer merges the latest commits from the homolog branch to the main branch, reflecting in this environment.
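The promotion flow between branches can be sketched with plain git commands. The branch names match ours; everything else (the throwaway repository, file names, commit messages) is illustrative only.

```shell
# Throwaway-repo sketch of the promotion flow: dev -> homolog -> main (prod).
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git checkout -qb main
git config user.email "ci@example.com"
git config user.name "CI"
printf 'version 1.0\n' > workflow.wdl
git add workflow.wdl
git commit -qm "initial workflow"
git branch homolog
git branch dev

# An approved PR lands a change on the dev branch
git checkout -q dev
printf '# new task\n' >> workflow.wdl
git commit -qam "feat: add task"

# The maintainer promotes dev -> homolog, then homolog -> main (prod)
git checkout -q homolog
git merge -q --no-ff dev -m "promote dev to homolog"
git checkout -q main
git merge -q --no-ff homolog -m "promote homolog to prod"
```

The `--no-ff` merges keep an explicit record of each promotion, which is useful when auditing what reached each environment and when.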
According to international guidelines for clinical bioinformatics pipelines, a minimal version control policy is mandatory. Version control should include semantic versioning of the deployed instance of a pipeline as a whole. Every deployment, including an update to the production pipeline, should be semantically versioned (e.g., v1.2.2 to v1.8.1).
In our scenario, we adopted semantic versioning (SemVer) to record the pipeline’s changes. SemVer helps manage the pipeline’s evolution and facilitates error tracing.
The meaning of each part of the version number is as follows:
- Major version (X): A change in the major version number indicates a significant change to the software that may not be backward compatible. For example, a major version change might involve a change in the API, a new feature that is not backward compatible, or a significant performance improvement. Examples that lead to a new MAJOR version: changing the name of an input or output; adding or removing inputs or outputs; making inputs or outputs optional or mandatory; changing the structure of objects (struct); adding or removing runtime parameters and metadata for integration; changing the version of the WDL specification (e.g., from 1.0 to 1.1).
- Minor version (Y): A change in the minor version number indicates a new feature or improvement that is backward compatible with previous versions. For example, a minor version change might involve a performance improvement or a new feature that does not change the API. Examples that lead to a new MINOR version: adding or removing tasks; updating task or program versions; adding, removing, or updating task parameters; changing pipeline flows and optimizations (scatter); modifying command lines (command); changing runtime parameters (CPU, memory, disk); adding annotation databases.
- Patch version (Z): A change in the patch version number indicates a bug fix that is backward compatible with previous versions. For example, a patch version change might involve a fix for a security vulnerability, a crash bug, or a typo. Examples that lead to a new PATCH version: adding code comments and documentation; reformatting files; adding or changing metadata (meta, parameter_meta); bug fixes; updating annotation database versions.
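Before tagging a release, a CI job can cheaply enforce this policy by rejecting tags that do not match the vX.Y.Z shape. The regex and tag values below are assumptions for illustration, not our exact validation rule.

```shell
# Hedged sketch: validate candidate release tags against the vX.Y.Z
# SemVer policy described above before creating a release.
set -eu
is_semver() {
  echo "$1" | grep -Eq '^v[0-9]+\.[0-9]+\.[0-9]+$'
}

for tag in v1.2.2 v1.8.1 v1.2 release-3; do
  if is_semver "$tag"; then
    echo "$tag: valid"
  else
    echo "$tag: rejected"
  fi
done
```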
Our current CI/CD pipeline
Our approach to the bioinformatics release workflow focuses on automating asset creation with CI/CD and using AWS S3 as a storage gateway. Each workflow has a CI/CD pipeline that creates assets and uploads them to S3. To run routines, the application can download assets directly from S3 instead of fetching them from the GitHub repository.
Bioinformaticians can also download them for test executions in the development environment or for routines, projects, and validations in the production environment. An established standard for organizing these workflows in S3 must be followed to ensure a continuous development flow and, simultaneously, security in maintaining production versions.
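One possible S3 layout following this idea is sketched below, simulated locally with directories (plain `mkdir`/`cp` stand in for `aws s3 cp`). The bucket name, workflow name, and tag are placeholders, not our actual convention.

```shell
# Local simulation of the S3 artifact layout: mutable paths for dev,
# immutable per-release paths for prod.
set -eu
BUCKET=bucket-sim            # stands in for s3://<artifact-bucket>
WORKFLOW=somatic-panel       # placeholder workflow name
TAG=v1.3.0                   # placeholder release tag
printf 'version 1.0\n' > main.wdl
: > imports.zip

# dev/homolog: a single path per workflow, overwritten on each build
mkdir -p "$BUCKET/dev/$WORKFLOW"
cp main.wdl imports.zip "$BUCKET/dev/$WORKFLOW/"

# prod: one path per release tag, never overwritten, preserving history
mkdir -p "$BUCKET/prod/$WORKFLOW/$TAG"
cp main.wdl imports.zip "$BUCKET/prod/$WORKFLOW/$TAG/"

find "$BUCKET" -type f | sort
```

Keeping the asset names generic (`main.wdl`, `imports.zip`) means the application only needs the path prefix to fetch a release.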
Here are some recommendations that worked for us:
- We use GitHub Actions for our CI/CD process development. It provides simple templates for building many automation tasks such as testing, building, and deploying code, thereby reducing manual errors and speeding up the development cycle.
- We created a folder structure in which we evaluate the branch being merged or pushed and the target environment. These inputs trigger specific actions that create or update the S3 build artifacts. In the development and homologation environments we replace the old path with the new one, adding only metadata with the branch’s main tag in the name. In production, we need to maintain version history, so we create new S3 paths containing the workflow, branch/release, and asset. Our assets have generic names such as main.wdl and imports.zip, which simplifies artifact delivery to our final application.
- We use many pre-built third-party actions for trivial tasks such as automatic repository checkout, release creation on GitHub, and even copying/uploading data to AWS S3.
- Watch out for credentials and sensitive information written in your CI/CD pipeline code; exposed credentials are a serious security risk for your organization. Use alternatives such as GitHub Secrets or, for external tools, their security best practices. In our CI pipeline we hide all sensitive information that could be exposed in the code.
- Finally, we use some Python packages to help handle our WDL-related tasks, such as wdl-packager (packages a WDL and its imports into a zip file) and miniwdl lint commands.
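Putting the recommendations above together, a build-and-deploy job might look roughly like the following GitHub Actions workflow. This is an illustrative sketch only: the action versions, secret names, region, and bucket path are assumptions, not our exact configuration.

```yaml
# Illustrative sketch of a lint/package/upload workflow, not the real one.
name: build-and-deploy-assets
on:
  push:
    branches: [dev, homolog, main]

jobs:
  package:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Lint and package the WDL
        run: |
          pip install miniwdl wdl-packager
          miniwdl check main.wdl
          wdl-packager main.wdl

      # GitHub Secrets keep the keys out of the pipeline code
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1

      - name: Upload assets to S3
        run: aws s3 cp main.wdl s3://<artifact-bucket>/${{ github.ref_name }}/main.wdl
```

Because the workflow triggers on pushes to dev, homolog, and main, `github.ref_name` lets one job serve all three environments.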
Case Study: VarsTests
In 2023, our QA team raised a persistent pain point in their pipeline test process. Many releases were published every week, and for each one they had to run the updated bioinformatics pipelines with the validation samples to verify the results (testing reproducibility and precision and recall metrics) and check the analysis application for errors that could emerge from the pipeline outputs.
Our biodevops team came to the rescue with a solution to automate this process: Vars-Tests.
The job to be done: A simple tool for automating the process of running our pipelines with specific versions along with the validation samples required for each one.
The tool came to life in a couple of sprints. Using the GitHub interface, the QA team can select the pipeline and the release tag, then run the pipeline as a GitHub Action. It saved many hours for our team, who previously needed to run all these steps manually.
It also shows how low-code strategies can provide a simple interface and solve the problem without having to design and develop a complete web application.
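A manually triggered workflow of this kind can be expressed with GitHub Actions' `workflow_dispatch` inputs, which is what renders the form in the GitHub interface. The fragment below is a hypothetical sketch: the input names, workflow name, and run command are placeholders, not the actual Vars-Tests definition.

```yaml
# Hypothetical sketch of a manually triggered validation workflow.
name: vars-tests
on:
  workflow_dispatch:
    inputs:
      pipeline:
        description: Pipeline to test
        required: true
      release_tag:
        description: Release tag to run against the validation samples
        required: true

jobs:
  run-validation:
    runs-on: ubuntu-latest
    steps:
      # Placeholder for fetching the tagged assets and launching the run
      - name: Run pipeline on validation samples
        run: echo "would run ${{ inputs.pipeline }} at ${{ inputs.release_tag }}"
```

The `workflow_dispatch` trigger is the low-code piece: GitHub renders the inputs as a form, so QA never touches a terminal.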
Conclusions
In this post I presented the challenges and recommendations for bringing CI/CD tools into our bioinformatics development process. By using GitHub Actions and automation tools, we reduced the time from pipeline creation to production deployment. Many benefits emerged from our approach: S3 integration, versioning, convenience for our bioinformatics team (the whole process is now automated by a simple PR merge in the GitHub repository), easier synchronization of bioinformatics development with application development, reduced risk of human error when generating releases, and easy access to workflows from within the AWS environment: a bioinformatician does not need to worry about cloning a repository inside an instance or sending files to S3, just download them directly from the bucket.
Acknowledgment
This project came to life in 2022/2023 at one of our internal workshops. The idea grew and stood out as a potential solution in 2023, when our bioinformatics teams took on the challenge of studying and putting these tools into action. I would like to thank the entire Varsomics bioinformatics team, with special mention to Giovanna Bloise, Filipe Dezordi, and Gabriel Hideki, who took a further step by using the CI/CD workshops and tutorials to build our CI/CD for bioinformatics pipeline release workflows.
Thanks also to our Varsomics product and QA teams, who contributed many ideas and suggestions for Vars-Tests.