Continuous Integration and Deployment for Clinical Bioinformatics Pipelines
First, a brief disclaimer
In 2021, I finished my specialization in Agile Project Management at Cesar School, a well-known and traditional private technology university located in Recife, Pernambuco. My capstone thesis was on Continuous Improvement for Clinical Bioinformatics Pipelines Development.
In summary, I studied and reviewed our bioinformatics software development process, which resulted in a collection of experience notes, proposed artifacts, and systems for optimizing and accelerating our pipeline build process without compromising our test results. Using the PDCA method (Plan-Do-Check-Act cycle), one of the stages where we saw potential for a significant positive change was the build, test, and deployment workflows of the pipelines. By applying continuous integration services, we could build and test our pipelines in an automated and iterative manner.
A few months later, our team took on the challenge and implemented some of the concepts I had brought to them. In this post I present our strategy for the bioinformatics development lifecycle under the CI system we implemented at Varsomics, Hospital Israelita Albert Einstein.
The full paper is available at this link. (A special mention to my professor and advisor at the time, Juliano Borges!)
Introduction
CI/CD, short for Continuous Integration and Continuous Deployment/Delivery, is a software engineering practice aimed at automating and monitoring code from integration through the testing and deployment stages. The CI/CD chain involves several stages, each designed to execute processes that contribute to a smoother, less error-prone codebase. These stages ensure that code modifications are integrated, tested, and deployed effectively.
Continuous Integration (CI) streamlines frequent code modifications by automating essential tasks like code reviews, testing, and linting. This automation enables developers to integrate changes more frequently and catch errors early, thereby maintaining high code quality and speeding up the development cycle.
Following CI, Continuous Deployment/Delivery (CD) automates the integration of code changes into the deployment environment. This phase ensures that each new code modification is deployable without manual intervention.
Implementing CI/CD chains often involves leveraging code versioning systems such as GitLab or GitHub to complement and manage the development process effectively.
In our bioinformatics workflows, integrated with source control systems such as Git, the CI/CD pipeline enables efficient development, testing, version management, and deployment of workflow updates. Whenever code changes are committed, the pipeline automatically builds, tests, and deploys the new workflow version. In the next section we present our previous development workflow and the one we adopted after CI/CD integration.
Our integration/deployment workflow
Before CI, our tests were handled manually and builds were performed only when developers remembered to run them. Essentially, from the moment a change was incorporated into the workflow until its deployment to production, the following steps were necessary:
1. Open a PR with a modification to the development branch in the workflow repository.
2. If approved, create a development release of that workflow, tag the version release, and manually generate assets for the release (main WDL and the ZIP of dependencies).
3. Update the workflow registered in the development admin of the application (Varstation/VarsMetagen) to this new version and run the test samples.
4. Once validated, merge accumulated changes from development to staging (homologation).
5. Create a staging release, tag the version release and manually generate assets for it.
6. Update the workflow registered in the staging admin of the application to this new version and run the tests.
7. At the end of a sprint, merge staging to production.
8. Update or create a new version in the production admin.
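To make the manual burden of steps 2 and 5 concrete, here is a minimal sketch of what generating the release assets by hand looks like: copying the main WDL and zipping its dependencies. File and directory names are placeholders, not our actual repository layout.

```shell
# Hedged sketch of manual asset generation: produce the two release
# artifacts (main WDL and a ZIP of its dependencies) by hand.
set -eu
mkdir -p imports release
printf 'version 1.0\n' > main.wdl           # stand-in for the real workflow
printf 'version 1.0\n' > imports/tasks.wdl  # stand-in dependency

cp main.wdl release/
# Bundle the dependency directory into the imports ZIP
python3 -m zipfile -c release/imports.zip imports/
ls release
```

Doing this by hand for every release, in every environment, is exactly the toil the CI pipeline later removed.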
Although functional, this workflow required bioinformaticians to spend significant time generating artifacts, tagging releases in multiple environments, and registering them in the application, resulting in delivery delays at the end of the sprint for both the bioinformatics and development teams. With so many manual steps, the process also increased the risk of human error.
Multiple deployment environments
Maintaining separate environments is a good practice for developing, building, and testing any bioinformatics pipeline. The goal is threefold: ensure that a stable version can be used in production; allow end-users to validate a new release without any impact on the version in production or the version under development; and enable developers to add new functionality and modify the code without affecting end-users who are validating a new version or using the current one in production.
Therefore, three deployment environments are used: development (dev), pre-production validation (hom), and production (prod).
Each environment is also associated with a branching model, which we explain in the next section.
Version control and branching model
Each environment has an associated git branch. Depending on the context and the step of the development workflow, the following branches on the remote repository are used:
- Dev — all approved PRs are merged into the dev branch, which reflects in this environment.
- Homolog — from time to time, the maintainer merges the latest commits from the dev branch to the homolog branch, reflecting in this environment.
- Prod — periodically, the maintainer merges the latest commits from the homolog branch to the main branch, reflecting in this environment.
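The promotion flow between branches can be sketched with plain git commands. The branch names match ours; everything else (the throwaway repository, file names, commit messages) is illustrative only.

```shell
# Throwaway-repo sketch of the promotion flow: dev -> homolog -> main (prod).
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git checkout -qb main
git config user.email "ci@example.com"
git config user.name "CI"
printf 'version 1.0\n' > workflow.wdl
git add workflow.wdl
git commit -qm "initial workflow"
git branch homolog
git branch dev

# An approved PR lands a change on the dev branch
git checkout -q dev
printf '# new task\n' >> workflow.wdl
git commit -qam "feat: add task"

# The maintainer promotes dev -> homolog, then homolog -> main (prod)
git checkout -q homolog
git merge -q --no-ff dev -m "promote dev to homolog"
git checkout -q main
git merge -q --no-ff homolog -m "promote homolog to prod"
```

The `--no-ff` merges keep an explicit record of each promotion, which is useful when auditing what reached each environment and when.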
According to international guidelines for clinical bioinformatics pipelines, a minimal version control policy is mandatory. Version control should include semantic versioning of the deployed instance of a pipeline as a whole. Every deployment, including an update to the production pipeline, should be semantically versioned (e.g., v1.2.2 to v1.8.1).
In our scenario, we adopted semantic versioning (SemVer) to record the pipeline’s changes. SemVer helps manage the pipeline’s evolution and facilitates error tracing.
The meaning of each part of the version number is as follows:
- Major version (X): A change in the major version number indicates a significant change to the software that may not be backward compatible. For example, a major version change might involve a change in the API, a new feature that is not backward compatible, or a significant performance improvement. Examples that lead to a new MAJOR version: changing the name of an input or output; adding or removing inputs or outputs; making inputs or outputs optional or mandatory; changing the structure of objects (struct); adding or removing runtime parameters and metadata for integration; changing the version of the WDL specification (e.g., from 1.0 to 1.1).
- Minor version (Y): A change in the minor version number indicates a new feature or improvement that is backward compatible with previous versions. For example, a minor version change might involve a performance improvement or a new feature that does not change the API. Examples that lead to a new MINOR version: adding or removing tasks; updating task or program versions; adding, removing, or updating task parameters; changing pipeline flows and optimizations (scatter); modifying command lines (command); changing runtime parameters (CPU, memory, disk); adding annotation databases.
- Patch version (Z): A change in the patch version number indicates a bug fix that is backward compatible with previous versions. For example, a patch version change might involve a fix for a security vulnerability, a crash bug, or a typo. Examples that lead to a new PATCH version: adding code comments and documentation; reformatting files; adding or changing metadata (meta, parameter_meta); bug fixes; updating annotation database versions.
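Before tagging a release, a CI job can cheaply enforce this policy by rejecting tags that do not match the vX.Y.Z shape. The regex and tag values below are assumptions for illustration, not our exact validation rule.

```shell
# Hedged sketch: validate candidate release tags against the vX.Y.Z
# SemVer policy described above before creating a release.
set -eu
is_semver() {
  echo "$1" | grep -Eq '^v[0-9]+\.[0-9]+\.[0-9]+$'
}

for tag in v1.2.2 v1.8.1 v1.2 release-3; do
  if is_semver "$tag"; then
    echo "$tag: valid"
  else
    echo "$tag: rejected"
  fi
done
```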
Our current CI/CD pipeline
Our approach to the bioinformatics release workflow focuses on automating asset creation with CI/CD and using AWS S3 as a storage gateway. Each workflow has a CI/CD pipeline that creates assets and uploads them to S3. To run routines, the application can download assets directly from S3 instead of fetching them from the GitHub repository.
Bioinformaticians can also download them for test executions in the development environment or for routines, projects, and validations in the production environment. An established standard for organizing these workflows in S3 must be followed to ensure a continuous development flow and, simultaneously, security in maintaining production versions.
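One possible S3 layout following this idea is sketched below, simulated locally with directories (plain `mkdir`/`cp` stand in for `aws s3 cp`). The bucket name, workflow name, and tag are placeholders, not our actual convention.

```shell
# Local simulation of the S3 artifact layout: mutable paths for dev,
# immutable per-release paths for prod.
set -eu
BUCKET=bucket-sim            # stands in for s3://<artifact-bucket>
WORKFLOW=somatic-panel       # placeholder workflow name
TAG=v1.3.0                   # placeholder release tag
printf 'version 1.0\n' > main.wdl
: > imports.zip

# dev/homolog: a single path per workflow, overwritten on each build
mkdir -p "$BUCKET/dev/$WORKFLOW"
cp main.wdl imports.zip "$BUCKET/dev/$WORKFLOW/"

# prod: one path per release tag, never overwritten, preserving history
mkdir -p "$BUCKET/prod/$WORKFLOW/$TAG"
cp main.wdl imports.zip "$BUCKET/prod/$WORKFLOW/$TAG/"

find "$BUCKET" -type f | sort
```

Keeping the asset names generic (`main.wdl`, `imports.zip`) means the application only needs the path prefix to fetch a release.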
Here are some recommendations that worked for us:
- We use GitHub Actions for our CI/CD process development. It provides simple templates for building many automation tasks such as testing, building, and deploying code, thereby reducing manual errors and speeding up the development cycle.
- We created a folder structure in which we evaluate the branch being merged or pushed and the target environment. These inputs trigger specific actions that create or update the S3 build artifacts. In the development and homologation environments we replace the old path with the new one, adding only metadata with the branch’s main tag in the name. In production, we need to maintain version history, so we create new S3 paths containing the workflow, branch/release, and asset. Our assets have generic names such as main.wdl and imports.zip, which simplifies artifact delivery to our final application.
- We use many pre-built third-party actions for trivial tasks such as automatic repository checkout, release creation on GitHub, and even copying/uploading data to AWS S3.
- Watch out for credentials and sensitive information written in your CI/CD pipeline code; exposed credentials are a serious security risk for your organization. Use alternatives such as GitHub Secrets or, for external tools, their security best practices. In our CI pipeline we hide all sensitive information that could be exposed in the code.
- Finally, we use some Python packages to help handle our WDL-related tasks, such as wdl-packager (packages a WDL and its imports into a zip file) and miniwdl lint commands.
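Putting the recommendations above together, a build-and-deploy job might look roughly like the following GitHub Actions workflow. This is an illustrative sketch only: the action versions, secret names, region, and bucket path are assumptions, not our exact configuration.

```yaml
# Illustrative sketch of a lint/package/upload workflow, not the real one.
name: build-and-deploy-assets
on:
  push:
    branches: [dev, homolog, main]

jobs:
  package:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Lint and package the WDL
        run: |
          pip install miniwdl wdl-packager
          miniwdl check main.wdl
          wdl-packager main.wdl

      # GitHub Secrets keep the keys out of the pipeline code
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1

      - name: Upload assets to S3
        run: aws s3 cp main.wdl s3://<artifact-bucket>/${{ github.ref_name }}/main.wdl
```

Because the workflow triggers on pushes to dev, homolog, and main, `github.ref_name` lets one job serve all three environments.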
Case Study: VarsTests
In 2023, our QA team raised a persistent pain point in their pipeline test process. Many releases were published every week, and for each one they had to run the updated bioinformatics pipelines with the validation samples to verify the results (testing reproducibility and precision and recall metrics) and check the analysis application for errors that could emerge from the pipeline outputs.
Our biodevops team came to the rescue with a solution to automate this process: Vars-Tests.
The job to be done: A simple tool for automating the process of running our pipelines with specific versions along with the validation samples required for each one.
The tool came to life in a couple of sprints. Using the GitHub interface, the QA team can select the pipeline and the release tag, then run the pipeline as a GitHub Action. It saved many hours for our team, who previously needed to run all these steps manually.
It also shows how low-code strategies can provide a simple interface and solve the problem without having to design and develop a complete web application.
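A manually triggered workflow of this kind can be expressed with GitHub Actions' `workflow_dispatch` inputs, which is what renders the form in the GitHub interface. The fragment below is a hypothetical sketch: the input names, workflow name, and run command are placeholders, not the actual Vars-Tests definition.

```yaml
# Hypothetical sketch of a manually triggered validation workflow.
name: vars-tests
on:
  workflow_dispatch:
    inputs:
      pipeline:
        description: Pipeline to test
        required: true
      release_tag:
        description: Release tag to run against the validation samples
        required: true

jobs:
  run-validation:
    runs-on: ubuntu-latest
    steps:
      # Placeholder for fetching the tagged assets and launching the run
      - name: Run pipeline on validation samples
        run: echo "would run ${{ inputs.pipeline }} at ${{ inputs.release_tag }}"
```

The `workflow_dispatch` trigger is the low-code piece: GitHub renders the inputs as a form, so QA never touches a terminal.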
Conclusions
In this post I presented the challenges and recommendations for bringing CI/CD tools into our bioinformatics development process. By using GitHub Actions and automation tools, we reduced the time from pipeline creation to production deployment. Many benefits emerged from our approach: S3 integration, versioning, convenience for our bioinformatics team (the whole process is now automated by a simple PR merge in the GitHub repository), easier synchronization of bioinformatics development with application development, reduced risk of human error when generating releases, and easy access to workflows from within the AWS environment: a bioinformatician does not need to worry about cloning a repository inside an instance or sending files to S3, just download them directly from the bucket.
Acknowledgment
This project came to life in 2022/2023 at one of our internal workshops. The idea grew and stood out as a potential solution in 2023, when our bioinformatics teams took on the challenge of studying and putting these tools into action. I would like to thank the entire Varsomics bioinformatics team, with special mention to Giovanna Bloise, Filipe Dezordi, and Gabriel Hideki, who took a further step by using the CI/CD workshops and tutorials to build our CI/CD for bioinformatics pipeline release workflows.
Thanks also to our Varsomics product and QA teams, who contributed many ideas and suggestions for Vars-Tests.