High-performance genetic datastore on AWS S3 using Parquet and Arrow

Tulasi Paradarami, Sr. Engineering Manager at 23andMe

23andMe Engineering
Feb 8, 2021


Introduction

In bioinformatics, the Variant Call Format (VCF) is a popular text file format for storing genetic variation data, and it is the standard output of popular imputation tools like minimac. It's unambiguous, it's flexible, and it supports arbitrary metadata. A drawback of the VCF format, however, is that it's a text file and requires a lot of storage space, even with compression. In this post, we'll discuss the benefits of using columnar storage formats such as Apache Parquet for storing genetic data and share the results from experiments comparing the performance of these two file formats.

For reference, here are definitions of some of the terms commonly used in this blog post:

  • Dosage: the predicted dosage of the non-reference allele given the data available; it will always have a value between 0 and 2 (reference)
  • Marker: also called a single nucleotide polymorphism (or SNP, pronounced “snip”); markers are variations in a single DNA nucleotide between people

File Formats

The choice of the file format has a significant impact on the overall performance and maintainability of the data system. This has led to the development of a number of open source solutions for storing data efficiently. Some common storage formats are JSON, Apache Avro, Apache Parquet, Apache ORC, Apache Arrow, and of course the age-old delimited text file formats like CSV. There are tradeoffs involved with each of these file formats around flexibility, software compatibility, efficiency, and performance.

Parquet

Traditionally, data is stored in a DBMS using the n-ary storage model (NSM), where the bytes associated with a given row are stored contiguously, in row-major order, within data pages. This is the default storage format in a number of popular RDBMS engines. While this layout works well for queries that retrieve all or most of the columns for a given predicate, it is not well suited for queries that access only a small subset of the columns. The decomposition storage model (DSM) is an alternative to NSM in which data is organized and stored in a columnar format. Both NSM and DSM suffer from certain limitations. Partition Attributes Across (PAX) is a hybrid storage model that addresses the limitations of both. Parquet uses this hybrid model, combining the horizontal and vertical partitioning strategies of NSM and DSM so that data is organized into row groups and column chunks. The NSM, DSM, and PAX models can be visualized as follows:

Figure 1. Visual representation of storage models

A column-major layout optimizes disk I/O, since this orientation allows scanning and retrieving only the relevant parts of the data pages for analytical workloads. Fewer disk seeks and better compression (because the data within a single column or column chunk is homogeneous in type) lead to better query performance compared to a row-major layout. Furthermore, a columnar format also supports column-specific compression and/or encoding techniques that optimize I/O even further. Apache Parquet is an open-source columnar storage format that was originally created for the Hadoop ecosystem but has since evolved into a standard for high-performance data I/O.
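As a concrete illustration of column-specific compression and encoding, here is a minimal sketch (not from the original post; the column names, codec choices, and output path are assumptions) showing how PyArrow can apply a different codec and dictionary encoding per column when writing a Parquet file:

```python
# Minimal sketch (illustrative): per-column compression and dictionary
# encoding when writing Parquet with PyArrow.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "sample_id": [101, 102, 103, 104],
    "batch": ["b1", "b1", "b2", "b2"],   # low-cardinality metadata column
    "rs12345": [0.0, 1.0, 2.0, 0.5],     # hypothetical marker dosages
})

pq.write_table(
    table,
    "example.parquet",
    compression={"sample_id": "snappy", "batch": "snappy", "rs12345": "zstd"},
    use_dictionary=["batch"],            # dictionary-encode only the batch column
)
```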

Arrow

Apache Arrow is an in-memory columnar serialization standard designed for highly efficient analytical operations, enabling O(1) random access to data and zero-copy reads. Arrow trades off mutability in favor of constant-time random access and data locality for read-heavy workloads. Because the core Arrow libraries are implemented in C++, computations can run outside the Python GIL and take full advantage of multiple cores. Arrow specifies a standardized, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytical operations on modern hardware, and PyArrow provides the Python bindings.

Figure 2. In-memory buffers: Arrow's columnar layout compared with a traditional memory layout (source: https://arrow.apache.org/)
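To make the constant-time access and zero-copy properties concrete, here is a small illustrative sketch (not from the original post; the values are made up):

```python
# Illustrative sketch: O(1) element access on an Arrow array and a zero-copy
# NumPy view over the same underlying buffer.
import pyarrow as pa

dosages = pa.array([0.0, 1.0, 2.0, 0.5], type=pa.float32())

print(dosages[2])  # constant-time random access to a single element

# Primitive arrays without nulls can be viewed as NumPy arrays without copying.
np_view = dosages.to_numpy(zero_copy_only=True)
print(np_view.mean())
```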

Variant Call Format (VCF)

The Variant Call Format (VCF) specification is used in bioinformatics for storing gene sequence variations, typically in a compressed text file. According to the specification, a VCF file consists of meta-information lines, a header line, and data lines. Compressed VCF files are indexed to allow fast random access to variants within a range of positions. The complete version 4.2 specification is listed in the references.

Storage Schema

For scalability and maintainability, VCF files are chunked by markers (typically, the markers associated with one chromosome are stored in one file) and batched by samples. A common access pattern in genomic analysis retrieves genetic data for all samples at a given marker, and executing such a query requires scanning every batch of VCF files for a given chunk. For databases with millions of customers, this query is slow because a large number of batches must be scanned to retrieve data for all customers at a given marker. Indexing the VCF files improves query performance, but the fundamental constraint remains: data in a VCF file is stored in a row-major layout, which is not optimal for fast genomic analysis.

Transposing the VCF file schema such that samples are stored as rows and markers as columns brings it closer to a traditional database table schema that allows for appending/inserting new samples into the table as they are processed by the data pipelines. A natural extension of this schema is a column-major format, similar to Parquet. In practice, data in Parquet files is organized as column chunks and row groups such that genetic data for a given marker and a “set” of samples is stored as contiguous bytes. This format is optimized for the access pattern “retrieve all samples for a marker”.
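A hedged sketch of what writing such a file might look like with PyArrow is shown below; the column names, sizes, and output path are assumptions, deliberately scaled down from the production files described later:

```python
# Sketch of the transposed schema: samples as rows, markers as columns.
# row_group_size controls how many samples form one row group, so the dosages
# for a given marker and a "set" of samples land in one contiguous column chunk.
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

n_samples, n_markers = 10_000, 100   # production files hold ~1M samples x 1000 markers

columns = {"sample_id": pa.array(np.arange(n_samples, dtype=np.int64))}
for i in range(n_markers):
    # Random dosages in [0, 2] stand in for real imputed values.
    columns[f"marker_{i}"] = pa.array(np.random.uniform(0, 2, n_samples).astype(np.float32))

table = pa.table(columns)
pq.write_table(table, "chunk_0.parquet", row_group_size=1_000)
```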

Experiments

In this section, we share the experiments we performed to compare the performance of queries running against compressed VCF and Parquet files. Each VCF file has dosages for 1000 samples and over 200,000 markers and a Parquet file has dosages for 1 million samples and 1000 markers. In other words, a VCF file has 200 million dosages while a Parquet file has 1 billion dosages. For a database with roughly 10 million samples and 58 million markers, compressed VCF files take up roughly 600 TB of disk space compared to just 200 TB for the Parquet files.

The following experiments were run on a c3.8xlarge AWS EC2 instance with 32 vCPUs, 60 GB of RAM, and up to 10 Gigabit of network throughput. Before executing queries against the VCF files, a tabix index was created using Pysam to enable fast random access to markers. Pysam is a Python module for reading, manipulating, and writing genomic data sets, including VCF files. An example of creating the tabix index and sample code for loading dosages from the VCF files are provided below:

An example of creating a tabix index file
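The original embedded snippet is not reproduced here; the following is a minimal sketch of how a tabix index can be built with pysam (the file path is an illustrative assumption):

```python
import pysam

# Builds chr1.dose.vcf.gz.tbi alongside the bgzip-compressed VCF, enabling
# fast random access by chromosome and position.
pysam.tabix_index("chr1.dose.vcf.gz", preset="vcf", force=True)
```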
Sample code for loading dosages from VCF files
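Again as a hedged sketch rather than the original code: dosages for a marker can be pulled from an indexed VCF with pysam along these lines (the path, the genomic region, and the per-sample DS dosage field follow common minimac output conventions and are assumptions here):

```python
import pysam

vcf = pysam.VariantFile("chr1.dose.vcf.gz")  # picks up the .tbi index automatically

# Fetch the records overlapping an (illustrative) genomic interval.
for record in vcf.fetch("chr1", 1_000_000, 1_000_100):
    # One dosage per sample for this marker, read from the DS FORMAT field.
    dosages = [sample["DS"] for sample in record.samples.values()]
    print(record.id, len(dosages))
```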

To retrieve dosages for 1 million samples at a given marker, it took about 2.5 seconds using the VCF files and only 0.03 seconds to get the same data from the Parquet files, nearly 80x faster. We also ran experiments to compare the performance of queries against Parquet files stored in S3, accessed with s3fs and PyArrow. In this case, the time taken to query the S3-based Parquet file was 3.5 seconds. Queries retrieving 10 markers similarly performed very well against the Parquet files, as shown in Figure 3.

Figure 3. Comparison of query performance between VCF and Parquet for 1 and 10 columns

To further compare VCF and Parquet files, additional experiments to retrieve 100 and 1000 columns were performed and, in each case, Parquet outperformed VCF. The following table shows the results from these experiments:

Table 1. Results from comparing the performance of queries against VCF and Parquet

We observed that the performance of reading Parquet files with s3fs and PyArrow is significantly affected by the default_block_size property set when opening an S3 connection. This property configures the number of bytes to read ahead during a seek() before closing and reopening the S3 HTTP connection. The default value of 5 MB is too small for large files. Setting this property to a higher value means most of the file contents are prefetched and cached, cutting down the number of times an S3 connection is closed and reopened. Since opening and closing connections are fairly expensive operations, reading the majority of the file contents and discarding what is not needed is cheaper than incrementally prefetching small blocks; this is the recommended access pattern when working with object stores like S3. With a larger block size, scanning the Parquet file with 1000 markers using PyArrow/s3fs (without multithreading) is over 1000x faster than with the VCF files. An example of loading 100 columns into a PyArrow table directly from S3 is provided below.

An example of loading 100 columns to a PyArrow table directly from S3
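The original snippet is not reproduced here; the sketch below shows the general shape, with the bucket/key, the column names, and the 256 MB block size as assumptions:

```python
import s3fs
import pyarrow.parquet as pq

# A large read-ahead block size keeps the S3 HTTP connection open across reads.
fs = s3fs.S3FileSystem(default_block_size=256 * 1024 * 1024)

columns = [f"marker_{i}" for i in range(100)]  # the 100 marker columns to retrieve
with fs.open("my-bucket/dosages/chunk_0.parquet", "rb") as f:
    table = pq.ParquetFile(f).read(columns=columns, use_threads=True)

print(table.num_rows, table.num_columns)
```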

In addition to the block size, enabling the use_threads option in PyArrow cut the query execution time from 15 seconds to 8.6 seconds for 1000 markers. It is also worth noting that, when selecting all columns in a Parquet file, PyArrow performs better if the columns argument is skipped entirely rather than listing every column. This approach further cut the overall query execution time from 8.6 seconds to 6.5 seconds. Finally, for the sake of completeness, query performance was also analyzed when reading Parquet files from the local disk. As shown in the table above, queries are extremely fast when reading from local disk: reading 1000 markers with 1 million samples took less than 1 second. An example of loading all columns of a Parquet file from local disk into a PyArrow table is provided below:

An example of loading all columns in a Parquet file from local disk to a PyArrow table
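A minimal sketch of the local-disk read (the file path is an assumption); note that no columns argument is passed when all columns are wanted:

```python
import pyarrow.parquet as pq

# Omitting `columns` reads every column; use_threads parallelizes the decode.
table = pq.ParquetFile("/data/dosages/chunk_0.parquet").read(use_threads=True)
print(table.shape)  # (num_rows, num_columns)
```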

Recommendations

  • When reading from S3, file open and close are very expensive operations, so set a high value for default_block_size in s3fs.S3FileSystem. This is especially important when reading large files; experiment and derive the number of bytes to read ahead empirically
  • When reading all the columns of a Parquet file, do not list the column names in pyarrow.parquet.ParquetFile.read
  • PyArrow provides multi-threading for free; take advantage of it by setting use_threads=True in pyarrow.parquet.ParquetFile.read

Limitations

The experiments were executed on a compute-optimized AWS instance (c3.8xlarge), and the actual performance of the queries can vary significantly with instance type. The effects of different encodings and compression codecs in Parquet, and of the GIL, were not studied as part of these experiments. A performance evaluation of pysam against other Python VCF readers (e.g., PyVCF, vcfpy, cyvcf2) is also not part of this analysis.

Conclusions

The availability of high-performance networks, combined with efficient disk and memory storage formats like Parquet and Arrow, provides a strong framework for developing applications that require large-scale genomic analysis. We’re happily using Parquet and Arrow as part of the common framework for developing high-performance data services at 23andMe.

References

  1. A Storage Model to Bridge the Processor/Memory Speed Gap
  2. The Variant Call Format (VCF) Version 4.2 Specification
  3. Minimac4 Documentation
  4. IGSR: The International Genome Sample Resource

About the Author

Tulasi Paradarami is a Sr. Engineering Manager at 23andMe, leading the big data engineering teams. He is excited about helping teams develop solutions for fast genetic analysis.

23andMe is looking for a talented Tech Lead to join our Big Data team! Apply today!
