Rocket Joint: Our tool that integrates the AWS cloud, Joint Genotyping, performance, and large-scale sequencing management

Marcel Caraciolo
genomics-healthcare-systems
Jul 1, 2024

During my journey at Hospital Albert Einstein and Varsomics, I had the opportunity to manage the bioinformatics team dedicated to the Rare Genomes Project.

The project is one of the first large-scale sequencing efforts in Brazil to compile a cohort dataset of human genetic variation for rare diseases, and it seeks to address the challenge of diagnosing rare diseases in Brazil by leveraging Whole Genome Sequencing (WGS) technology. Over 8,000 genomes from rare disease patients and individuals at risk for hereditary cancer syndromes will be sequenced at Hospital Israelita Albert Einstein in São Paulo by 2026. The project also aims to update standard operational protocols for data collection, sample processing, and bioinformatics analysis, while creating a phenotypic database and publishing scientific articles with the results. This initiative not only contributes to precision medicine efforts but also has the potential to reduce healthcare costs, increase diagnosis rates, and improve access to healthcare services, ultimately enhancing the quality of life of Brazilians, particularly those served by the Brazilian national Universal Healthcare System (SUS).

At the time of this post, the Rare Genomes Project is in its third phase. During the first phase, our team focused on building and validating wet-lab protocols and bioinformatics pipelines. The second and third phases put all workflows into production to process every sample sequenced at our facilities.

When we reached thousands of processed samples, our research team wanted to explore the full variant call set to perform ancestry studies, rare variant analyses, and genotype/phenotype association discovery. The team aimed to develop disease cohort-level catalogs of genetic variants.

The well-known GATK Best Practices for variant calling and discovery at population scale. Our task was to provide a pipeline that builds a cohort variant analysis dataset, enabling researchers to query rare variants. Source: Broad Institute

Introduction to the problem

There are several tools designed to perform the aggregation of call sets from large cohorts. One example is GATK and its Joint Genotyping workflow, which has set the standard for many years. Joint genotyping is used in genomic studies to analyze genetic variations across multiple individuals. It handles merging data from multiple samples to perform genotype determination while improving the accuracy and recall of individual genotyping results.

However, the GATK Joint Genotyping tool only works for variant sets called by GATK and does not apply to the output of other variant callers, such as Illumina’s DRAGEN germline pipelines. Another challenge for many existing tools is efficient iterative analysis, i.e., adding new samples and their variants to an existing call set.

How does GATK Joint Genotyping work?

The GVCF workflow. This joint genotyping method allows researchers to process samples independently before combining them into the full cohort, making the strategy suitable for large cohorts or studies involving many samples. Source: Broad GATK Workshop Tutorial

As the picture above shows, the GATK HaplotypeCaller (HC) pipeline is used to call germline variants on each sample individually, following the GATK Best Practices. This per-sample analysis generates an intermediate file in the genomic variant call format (gVCF).

gVCF stands for Genomic VCF. A gVCF is a kind of VCF, so the basic format specification is the same as for a regular VCF (see the spec documentation here), but a Genomic VCF contains extra information. We will not take a deeper look at the specs, but the key difference is that a gVCF has records for all sites, including positions with homozygous-reference calls (GT=0/0), whereas a regular VCF contains only sites with alternate allele calls (GT=0/1 or 1/1). The goal is to have every genomic site in the file, ensuring that all variants from all individuals included in a joint analysis are represented. The records in a gVCF include an accurate estimate of how confident we are that a site is homozygous-reference or not. This estimate is generated by the variant caller’s built-in reference model.
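To make that difference concrete, here is a minimal sketch with made-up records (positions and genotypes are illustrative, not from our data) showing which records exist only in a gVCF:

```python
# Illustrative gVCF-style records: (CHROM, POS, REF, ALT, GT).
# In GATK/DRAGEN gVCFs, homozygous-reference stretches are represented as
# blocks whose only ALT is the symbolic <NON_REF> allele.
gvcf_records = [
    ("chr1", 10439, "A", "<NON_REF>", "0/0"),    # hom-ref block: gVCF only
    ("chr1", 10492, "C", "T,<NON_REF>", "0/1"),  # variant call: also in a VCF
    ("chr1", 10583, "G", "A,<NON_REF>", "1/1"),  # variant call: also in a VCF
]

def is_reference_block(alt: str) -> bool:
    """A record whose only ALT is <NON_REF> is a homozygous-reference block;
    a regular VCF would simply omit this site."""
    return alt == "<NON_REF>"

for chrom, pos, ref, alt, gt in gvcf_records:
    kind = "reference block (gVCF only)" if is_reference_block(alt) else "variant record"
    print(f"{chrom}:{pos}\t{ref}>{alt}\tGT={gt}\t{kind}")
```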

gVCF files were developed to store sequencing information for both variant and non-variant positions, which is required for human clinical applications. The format allows genotypes, annotations, and other information to be represented across all sites in the genome in a compact way. Source: Broad GATK Tutorials

GATK Joint Genotyping issues

Naturally, we tried to use GATK Joint Genotyping to build our cohort-level call set. But we faced several compatibility issues with the outputs of the DRAGEN germline variant caller that we use in production, such as missing header information. Moreover, the GATK Joint Genotyping process is composed of many steps, which means more resource consumption (time and memory). When dealing with large cohorts, processing costs are a paramount concern, especially in settings where new samples keep joining the project.

Therefore, we decided to look for other strategies. We soon found out that Illumina’s DRAGEN team was also working on its own tool: the DRAGEN gVCF Genotyper.

Introducing DRAGEN's approach for aggregating large cohort call sets

DRAGEN gVCF Genotyper is Illumina’s solution to aggregate small germline variants at population scale. The input to gVCF Genotyper is a set of gVCF files produced by the DRAGEN germline variant caller, whereas the output is a multi-sample VCF file (msVCF) with genotypes for all variants detected in the cohort. The output also contains some metrics (for example, genotype quality and DRAGEN’s hard filters) that can be used for quality control and variant filtering.

DRAGEN gVCF Genotyper offers an iterative workflow to aggregate new samples into an existing cohort. This allows users to incrementally aggregate new batches of samples with existing batches, without having to redo the analysis from scratch across all samples every time new samples become available.

Overview of the step-by-step mode and its intermediate file formats. Source: https://www.illumina.com/science/genomics-research/articles/gVCF-Genotyper.html

The DRAGEN gVCF Genotyper employs a systematic workflow, depicted in the figure above, designed to handle the computational demands of processing large-scale genomic datasets. The workflow distributes compute tasks efficiently across multiple nodes, making it scalable to hundreds of thousands of samples.

The first step of the workflow involves aggregating GVCF files from a batch of samples into a cohort file and a census file. The cohort file consolidates variant data from multiple samples into a compact format like a multi-sample GVCF, while the census file contains summary statistics for all variants and blocks of homozygous reference genotypes per sample in the cohort. This step is designed to handle large numbers of samples by processing them in manageable batches, typically around 1000 samples per batch.

Following the aggregation of batch census files, the next step is to merge them into a single global census file. This process, known as census aggregation, scales effectively to manage thousands of batches, ensuring comprehensive coverage and accuracy in variant statistics across the entire dataset.

Once the global census file is established, it serves as a foundational resource for generating multi-sample VCF (msVCF) files for each batch of samples. This final step, msVCF generation, utilizes both the global census file and the per-batch cohort and census files to produce comprehensive VCF files containing genotype information for all samples in each batch.

To enhance parallel processing across multiple compute nodes, the workflow allows users to partition the genome into shards of equal size. Each shard can then be processed independently on different nodes using the iterative gVCF Genotyper, optimizing computational efficiency while maintaining data integrity. This approach ensures that the workflow remains scalable and adaptable to varying computational resources and dataset sizes.
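As a mental model of how these steps compose, here is a minimal Python sketch of the orchestration logic. The run_* helpers are hypothetical stand-ins for the actual DRAGEN gVCF Genotyper invocations (gVCF aggregation, census aggregation, msVCF generation); names, paths, and return values are assumptions for illustration only.

```python
from itertools import islice

def chunked(items, size):
    """Yield successive fixed-size batches from a list of sample gVCF paths."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch

def run_gvcf_aggregation(batch_id, gvcfs, shard):
    # Would invoke DRAGEN to build the per-batch cohort and census files.
    return {"cohort": f"{shard}/batch{batch_id}.cohort.gz",
            "census": f"{shard}/batch{batch_id}.census.gz"}

def run_census_aggregation(census_files, shard):
    # Would merge all per-batch census files into one global census per shard.
    return f"{shard}/global.census.gz"

def run_msvcf_generation(batch_files, global_census, shard):
    # Would produce the per-batch msVCF using the batch cohort/census files
    # together with the global census file.
    return batch_files["cohort"].replace(".cohort.gz", ".msvcf.gz")

def aggregate_cohort(sample_gvcfs, shards, batch_size=1000):
    batches = list(chunked(sample_gvcfs, batch_size))

    # Step 1: per batch and per genomic shard, build cohort + census files.
    per_batch = {(b, shard): run_gvcf_aggregation(b, gvcfs, shard)
                 for b, gvcfs in enumerate(batches) for shard in shards}

    # Step 2: merge the per-batch census files into one global census per shard.
    global_census = {shard: run_census_aggregation(
        [per_batch[(b, shard)]["census"] for b in range(len(batches))], shard)
        for shard in shards}

    # Step 3: generate a per-batch msVCF for each shard using the global census.
    return [run_msvcf_generation(per_batch[(b, shard)], global_census[shard], shard)
            for b in range(len(batches)) for shard in shards]

# Example: 2,000 samples, 3 genomic shards -> 2 batches x 3 shards = 6 msVCFs.
msvcfs = aggregate_cohort([f"sample_{i}.gvcf.gz" for i in range(2000)],
                          ["shard_001", "shard_002", "shard_003"])
print(len(msvcfs))
```

In our setting, each of those calls corresponds to a WDL task submitted to Cromwell on AWS Batch, so the nested loops above become scattered cloud jobs rather than a local Python loop.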

Overall, the iterative nature of this workflow, combined with its ability to update and expand with new sample batches seamlessly, facilitates ongoing genomic analysis at scale. As new batches of samples are added, the workflow dynamically incorporates them through incremental updates to cohort and census files.

The figure shows the steps that are required to add a new batch of samples to an existing cohort. For the gVCF files in the new batch, only the gVCF aggregation must be executed to build a new per-batch cohort and census file. After this, the workflow will execute census aggregation to build the new global census file that is needed to update the msVCF files for each batch in msVCF generation. Source: https://www.illumina.com/science/genomics-research/articles/gVCF-Genotyper.html

The challenges with DRAGEN gVCF Genotyper

We had a few options for running DRAGEN gVCF Genotyper:

a) use an on-premises DRAGEN server;

b) use Illumina Connected Analytics (ICA), Illumina’s cloud computing platform; or

c) use our current cloud computing infrastructure with custom scripts.

The first option has a clear limitation: it is feasible only for cohorts of moderate size. The second option would generate extra costs from integrating with yet another cloud computing platform, plus the associated data governance and management overhead.

We decided to tackle the third option, since we had the infrastructure (AWS), a team of talented bioinformatics specialists, and tools already in place (WDL/Cromwell + AWS Batch). The objective was to develop our own tool to encapsulate the DRAGEN gVCF Genotyper, automate all the steps, and produce the output files required for efficient and scalable aggregation of genomic variants. That is how Rocket Joint came to life.

Meet Rocket Joint!

Rocket-Joint (RJ) is an integrative tool that merges thousands of samples in gVCF format in the cloud in parallel, providing sample management and results organization with simple commands. Written in Python 3 and WDL, it wraps Illumina’s DRAGEN gVCF Genotyper as the main merging tool. The AWS cloud environment supports hundreds of parallel processes, each handling the ideal number of genomic sample shards to reduce virtual machine costs and improve processing time. The resulting files are organized into structured folders in an AWS S3 bucket that the tool tracks automatically. These folders hold all the information required for the next addition of samples, including the final msVCF files.

Rocket-Joint reads from AWS S3 buckets the multiple gVCF sample files that must be aggregated. We identify each aggregation round with an output_prefix, allowing us to track the history of aggregation rounds. Rocket-Joint tasks are written in WDL scripts that are submitted to Cromwell. The scheduler then orchestrates the steps required by the DRAGEN gVCF Genotyper, collects the results, and organizes all outputs into an AWS S3 folder structure. Rocket-Joint also produces a JSON file containing all the metadata, with the census file location, the cohort files, and other information needed for future batch additions to the newly produced cohort. Through Rocket-Joint, our team can also delete old files to keep storage requirements under control. Image by: Lívia Moura.
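Conceptually, that metadata file looks something like the snippet below. All keys and paths here are hypothetical, chosen only to illustrate the idea of recording everything the next aggregation round needs; they are not Rocket-Joint's actual schema.

```python
import json

# Hypothetical example of the per-round metadata Rocket-Joint could record so
# that a future batch addition knows where the existing artifacts live.
round_metadata = {
    "output_prefix": "rgp_round_003",
    "s3_root": "s3://example-bucket/rocket-joint/rgp_round_003/",
    "previous_round": "rgp_round_002",
    "n_samples": 4808,
    "n_batches": 10,
    "n_shards": 100,
    "global_census_files": ["shards/shard_001/global.census.gz"],    # one per shard
    "batch_cohort_files": ["batches/batch_01/shard_001.cohort.gz"],  # per batch/shard
    "msvcf_files": ["msvcf/shard_001.msvcf.gz"],                     # final outputs
}
print(json.dumps(round_metadata, indent=2))
```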

Rocket Joint behind the scenes

Rocket-Joint uses a split-up strategy to process the aggregation in parallel and to keep the resulting intermediate outputs from growing too large to handle. Let’s break this pipeline into four steps.

The first step is to split the batch of samples (for instance, 1,000 samples) from the input file list into smaller batches (for example, 10 batches of 100 samples). This parameter can be adjusted to find optimal resource usage. In our use cases, we believe an initial batch of 1,000 samples is indeed the optimal size in terms of resource usage and compute speed for this job.

The second step is to split each genome into n chunks, or “shards”, by genomic position, so that each region can be processed in parallel. Genomic shards are truncated at chromosome boundaries, which imposes a lower bound on the number of shards (at least one shard per chromosome). The shard size was determined empirically, by testing several sizes and evaluating the elapsed processing time. In our example, we set 100 shards. This workflow results in (number of sample batches) × (number of genomic shards) census and cohort files.
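A minimal sketch of that splitting step, assuming we only know the chromosome lengths (the lengths below are rounded GRCh38-like values, and the exact sharding logic inside Rocket-Joint may differ):

```python
# Split the genome into roughly equal-sized shards without crossing a
# chromosome boundary; chromosome lengths here are illustrative only.
CHROM_LENGTHS = {"chr1": 248_956_422, "chr2": 242_193_529, "chr21": 46_709_983}

def make_shards(chrom_lengths, n_shards=100):
    genome_size = sum(chrom_lengths.values())
    target = max(1, genome_size // n_shards)     # ideal shard size in bp
    shards = []
    for chrom, length in chrom_lengths.items():
        start = 0
        while start < length:
            end = min(start + target, length)    # truncate at the chromosome end
            shards.append((chrom, start + 1, end))
            start = end
    return shards

shards = make_shards(CHROM_LENGTHS, n_shards=100)
print(f"{len(shards)} shards; first: {shards[0]}")
```

Because shards cannot span chromosomes, the final shard count can end up slightly above the requested n_shards, and it can never fall below the number of chromosomes.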

The next step is the census aggregation, and this is a tricky task because of the sheer number of output files. Rocket Joint tracks all of these files and reorganizes them so they can be aggregated into one global census file per genomic shard (100 global census files in our example), along with the per-batch msVCF files. The final step combines the msVCFs across batches, merging all the per-batch msVCF files into a single msVCF containing all samples.
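The file tracking is essentially a regrouping problem: collect every per-(batch, shard) census path and group it by shard, so that one census-aggregation job can be launched per shard. A toy sketch (the path layout is made up for illustration):

```python
from collections import defaultdict

# Hypothetical per-(batch, shard) census outputs from the previous step:
# 10 batches x 100 shards = 1,000 files.
census_files = [f"batch_{b:02d}/shard_{s:03d}.census.gz"
                for b in range(10) for s in range(100)]

# Regroup by shard so census aggregation runs once per genomic shard,
# consuming that shard's census file from every batch.
by_shard = defaultdict(list)
for path in census_files:
    shard = path.split("/")[1].split(".")[0]     # e.g. "shard_042"
    by_shard[shard].append(path)

print(f"{len(by_shard)} census-aggregation jobs, "
      f"each over {len(by_shard['shard_000'])} batch census files")
```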

Diagram of the main Rocket Joint WDL task. It is based on the split-up strategy, with multiple sample batches and genomic shards within each batch. This strategy allows parallel processing, speeding up the whole pipeline. Automating the calls to the DRAGEN gVCF Genotyper iterative mode, together with the file tracking that reorganizes the produced results, gave us even less computing time. Image by: Lívia Moura

Case study scenario

As I mentioned in the introduction, the Brazilian Rare Genomes Project research team asked us to compile a catalog of human genetic variation for the whole-genome samples. We had to perform joint genotyping of 4,808 whole-genome samples (about 24 TB of data) with 100 genomic shards.

Rocket Joint (RJ) took less than 20 hours to complete the task, consuming about a thousand dollars’ worth of resources, including data storage and processing. Some time later, we added 1,138 more samples, which only required reprocessing the genotypes of the previously merged samples. It took about the same amount of time (20 hours) to process the new samples and obtain the new outputs, containing the variant calls from both the previous and new samples. This showed that Rocket-Joint saves time and processing resources in sample-addition scenarios.

In both cases, we generated 100 msVCF files that were imported into Hail 0.2.78 running on AWS Elastic MapReduce (EMR) clusters. Hail is an open-source downstream tool for genomic data exploration.

Through Hail, we performed sample- and variant-level quality control: we removed samples and variants with a call rate < 85%, variants with a transition/transversion (Ti/Tv) ratio < 1.9, variants not passing DRAGEN’s default hard filters, and samples showing cryptic relatedness with other samples in the dataset, detected via kinship coefficient calculations. Following these steps, 4,724 samples and over 130 million variants were retained in the final Hail MatrixTable dataset. About 89.4 million variants had an allele frequency ≤ 0.001. We then converted the Hail dataset into parquet files, which are queryable through AWS Athena or EMR clusters running Jupyter notebooks with PySpark kernels. This catalog is available for our research team to answer questions about rare candidate disease-causing variants among the large number of variants detected.
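For readers unfamiliar with Hail, the sketch below condenses that kind of QC. It is not our production notebook: the paths and the kinship threshold are assumptions, and the Ti/Tv filter and VEP annotation steps are omitted.

```python
import hail as hl

hl.init()  # on EMR, Hail runs on top of the cluster's Spark context

# Import the per-shard msVCF files (illustrative S3 path).
mt = hl.import_vcf("s3://example-bucket/msvcf/*.vcf.gz",
                   reference_genome="GRCh38", force_bgz=True)

# Variant-level QC: keep variants passing DRAGEN's hard filters (empty FILTER
# set in Hail) with call rate >= 85%.
mt = hl.variant_qc(mt)
mt = mt.filter_rows((hl.len(mt.filters) == 0) & (mt.variant_qc.call_rate >= 0.85))

# Sample-level QC: call rate >= 85%.
mt = hl.sample_qc(mt)
mt = mt.filter_cols(mt.sample_qc.call_rate >= 0.85)

# Cryptic relatedness: estimate pairwise kinship and drop a minimal set of
# samples so that no related pair remains (0.125 is an assumed threshold).
rel = hl.pc_relate(mt.GT, min_individual_maf=0.01, k=10, statistics="kin")
related_pairs = rel.filter(rel.kin > 0.125)
to_remove = hl.maximal_independent_set(related_pairs.i, related_pairs.j, keep=False)
mt = mt.filter_cols(hl.is_defined(to_remove[mt.col_key]), keep=False)

# Flatten genotype entries to a Spark DataFrame and write parquet files that
# Athena / PySpark notebooks can query downstream.
mt.entries().to_spark().write.parquet("s3://example-bucket/variant-catalog/")
```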

The workflow above illustrates our downstream pipeline using the merged VCF output from Rocket Joint. In our case study, we ingest the output msVCF into Hail to perform variant- and sample-level QC, combine it with functional annotations from VEP, and output parquet files, which our genomics data lake ETL pipeline makes available through our AWS Athena query tool. Image by: Lívia Moura and Antonio Coelho

Conclusions

Joint genotyping is one of the strategies for producing accurate genotypes from vast raw genome call sets. In our lab, the bioinformatics team was responsible for processing thousands of genome samples from the Rare Genomes Project, a public healthcare project focused on diagnosing rare diseases in Brazil by leveraging Whole Genome Sequencing (WGS) technology. During the project, our research team asked us to build a unified dataset with all variant calls. We developed the Rocket-Joint tool to streamline the aggregation of multiple samples in gVCF format, merged in the AWS cloud in parallel using Illumina’s DRAGEN gVCF Genotyper. Our tool performed joint genotyping of more than 4,000 whole-genome samples (about 24 TB of data) in 20 hours. The main benefits were running on our own cloud, AWS (instead of purchasing local servers or licensing yet another private cloud platform); automating the whole process, including the split-up strategy, which involves many steps and multiple output files; and file tracking management, which keeps all the information required for the next addition of samples, including the final joint msVCF files. The project was successful, and we delivered over 130 million variants stored in parquet files, gathered by ETL pipelines into our genomic data lake, where they can be efficiently queried by our research team.

Acknowledgment

I would like to acknowledge all current and past members from my Bioinformatics team, especially the Brazilian Rare Genomes Project Bioinformatics squad. I had the opportunity to lead outstanding people: Antonio Coelho, Catarina Gomes, Gustavo Oliveira, Luciana Souto, Rafael Albuquerque, Rafael Guedes and Rodrigo Barreiro. My special thanks go to the main developer of the Rocket-Joint tool, Lívia Moura, who took the challenge from scratch, and put lots of effort into experiments, reports, technical presentations and many other activities regarding the development of this tool. Finally, I am very grateful to my two leaders: Murilo Cervato (the VarsOmics manager at the time) and João Bosco Oliveira (the Main Researcher and leader of the Rare Genomes Project), since both delegated this research task to our team. Many thanks also to all other VarsOmics squads (Development, DevOps, BioDevOps, Support and Data Science).


Entrepreneur, Product Manager and Bioinformatics Specialist at Genomika Diagnósticos. Piano hobbyist, runner for passion, and Lego Architecture lover.