Scaling Genetic Variant Search, Part I: Postgres

Matt Tieman
Life and Tech @ LifeOmic
5 min read · Jun 19, 2020

The LifeOmic Platform enables users to search across and combine data from multiple domains, e.g. genomics, EHRs and imaging. Of these domains, the one with the greatest volume of data and number of features is genetic variants. The Platform supports sequencing results up to and including whole genome sequencing, and this data typically arrives as gigabyte-sized VCFs (Variant Call Format files).

A genetic variant is essentially a single difference between an individual’s genome and a reference genome. Each person’s genome typically has millions of variants. Among those millions, a few thousand have known links to disease. With knowledge of a person’s genetic variants, a physician can administer more targeted therapies and researchers can develop novel treatment protocols.

Variants have dozens of features, which the LifeOmic Platform pulls from several public sources and combines into a comprehensive model of each variant. This detailed view makes it possible to administer treatments targeted at very specific genetic mutations. However, enabling that discovery requires searching across tens of millions of variants using any combination of features.

In the Beginning

At LifeOmic we are all-in on AWS and decided early on to use AWS RDS to manage a cluster running Aurora Postgres as our variant database. The database was deployed as a two-node cluster: one write instance and one read replica. The table schema was very simple: one table containing all the variant data, one row per variant, and one table for the samples (the patient tissue that was sequenced).

The variants table contained 29 columns, which resulted in 27 single-column b-tree indexes to keep queries against any searchable feature performant. We had attempted to further normalize the table, but found that joins between even a few of the normalized tables seriously degraded query performance. The denormalized variants table worked well for most of the early queries we needed to answer.
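For illustration, a heavily abbreviated sketch of that early layout might have looked something like the following (column names are placeholders, not our actual schema):

```sql
-- Abbreviated sketch of the early, denormalized layout.
-- Column names are illustrative placeholders.
CREATE TABLE sample (
    sample_id   text PRIMARY KEY,
    project_id  text NOT NULL
);

CREATE TABLE variant (
    variant_id   text PRIMARY KEY,
    sample_id    text NOT NULL REFERENCES sample (sample_id),
    chrom        text,
    pos          bigint,
    ref          text,
    alt          text,
    gene         text,
    impact       text,
    allele_freq  numeric
    -- ... roughly 29 searchable columns in all
);

-- One single-column b-tree index per searchable feature.
CREATE INDEX variant_gene_idx   ON variant (gene);
CREATE INDEX variant_chrom_idx  ON variant (chrom);
CREATE INDEX variant_impact_idx ON variant (impact);
-- ... and so on for the remaining searchable columns.
```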

Enter the GIN

As more variant features were added to our Subject Search and Omics Explorer views, more column combinations were requested in variant queries. While Postgres can combine multiple single-column indexes with a bitmap index scan, we still saw query times in the tens of seconds.

After experimenting with additional multi-column b-tree indexes covering the most common column combinations, we found success with a GIN index. Moving the majority of our columns into a GIN and keeping b-trees for the small subset with well-established query patterns dropped our query times back down to sub-second.
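As a rough illustration of the idea (this is a sketch, not our exact DDL), one way to put many attributes behind a single GIN is to fold them into a jsonb column and index that:

```sql
-- Sketch: collect the long tail of searchable attributes into a jsonb
-- column and cover it with a single GIN index.
ALTER TABLE variant ADD COLUMN attrs jsonb;

CREATE INDEX variant_attrs_gin ON variant USING gin (attrs jsonb_path_ops);

-- A containment query over any combination of attributes can now be
-- served by the one GIN index.
SELECT *
FROM   variant
WHERE  attrs @> '{"impact": "HIGH", "clin_sig": "pathogenic"}';
```

The btree_gin extension is another way to get a GIN over plain scalar columns; either route trades write speed for reads that can mix and match columns freely, which is exactly the trade-off that surfaces in the next section.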

Time to Break It Up

Not long after rolling out the new Postgres indexes, another issue began to emerge. The volume of variant data being ingested into the platform was quickly increasing, and while the GIN index greatly increased our read performance, it hindered our writes. Updates to GIN indexes are much slower than to b-tree or even GiST indexes. The Postgres documentation even recommends dropping and recreating a GIN index instead of performing a large update against it.

To increase write performance, we decided to partition the variants table. The single table that once contained all variants across all user projects was partitioned by project id. After partitioning, we were able to update project-specific partitions in parallel, with each partition hosting a GIN of greatly reduced size. Partitioning also improved query performance: since variant queries are scoped to a single project, each query only needs to touch a single partition.

Figure: Partitioning the original variant table by project-id
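Roughly, the new shape can be sketched with declarative list partitioning (project ids here are made up, and this is an illustration rather than our exact DDL):

```sql
-- Sketch: list-partition the variant table by project id.
CREATE TABLE variant (
    project_id  text  NOT NULL,
    variant_id  text  NOT NULL,
    attrs       jsonb,
    -- ... the well-known scalar columns ...
    PRIMARY KEY (project_id, variant_id)
) PARTITION BY LIST (project_id);

-- One partition per project, each carrying its own, much smaller GIN.
CREATE TABLE variant_project_a PARTITION OF variant FOR VALUES IN ('project-a');
CREATE TABLE variant_project_b PARTITION OF variant FOR VALUES IN ('project-b');

CREATE INDEX ON variant USING gin (attrs jsonb_path_ops);

-- Queries are always scoped to a single project, so the planner prunes
-- every partition except the relevant one.
SELECT count(*)
FROM   variant
WHERE  project_id = 'project-a'
  AND  attrs @> '{"clin_sig": "pathogenic"}';
```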

Rebuilding Instead of Updating

Despite the gains in write performance after partitioning the variants table, we still had issues inserting large numbers of rows. Inserting a few hundred variants performed acceptably, typically completing in under an hour. However, when users performed large uploads containing thousands or even millions of variants, our writes would frequently fail, timing out on busy waits. Occasionally the Postgres write instance would become so CPU-saturated that it triggered a failover to the read instance.

Taking a cue from the Postgres documentation on GIN indexes, we decided that if the size of a variant update exceeded a certain threshold, the table would be rebuilt rather than updated. When a table needed to be rebuilt, all of its rows would be inserted into a new table with no indexes. Only after all the data was loaded would we issue the commands to build the indexes. Finally, after the indexes were built, we would mark the new table as active and deprecate the old one.

Figure: Rebuilding a table rather than updating
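A minimal sketch of that flow, reusing the illustrative names from above (the threshold check itself lives in application code, and incoming_variants_staging is a hypothetical staging table for an upload):

```sql
-- Sketch: rebuild one project's partition instead of updating it in place.
-- 1. Create an empty, index-free copy of the partition.
CREATE TABLE variant_project_a_new
    (LIKE variant_project_a INCLUDING DEFAULTS);

-- 2. Load every row, existing plus newly ingested, while no indexes
--    exist, so inserts skip GIN maintenance entirely.
INSERT INTO variant_project_a_new SELECT * FROM variant_project_a;
INSERT INTO variant_project_a_new SELECT * FROM incoming_variants_staging;  -- hypothetical staging table

-- 3. Build the indexes only once all rows are in place.
CREATE INDEX ON variant_project_a_new USING gin (attrs jsonb_path_ops);

-- 4. Mark the rebuilt table active and deprecate the old one.
BEGIN;
ALTER TABLE variant DETACH PARTITION variant_project_a;
ALTER TABLE variant_project_a     RENAME TO variant_project_a_old;
ALTER TABLE variant_project_a_new RENAME TO variant_project_a;
ALTER TABLE variant ATTACH PARTITION variant_project_a
    FOR VALUES IN ('project-a');
COMMIT;
DROP TABLE variant_project_a_old;
```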

The Only Prescription is More Partitions

With each project’s variants persisted to separate partition tables and tables being rebuilt conditionally, both reads and writes were performing well. However, a handful of partitions continued to grow at an alarming rate. As these projects reached tens of millions of variants each, read and write performance began to degrade.

Following similar steps to when we decomposed the original multi-tenant variant table, we broke each of the large partition tables into smaller shard tables. Unlike the original partitioning of the multi-tenant table, however, these shard tables would be partitioned by each variant’s unique identifier, its variant-id, a composite value built from multiple identifying characteristics. This sharding strategy ensures an even distribution of rows across the shard tables. Additionally, we could scale a single project’s partition independently of the others by rebuilding it with more shard tables.

Figure: Partition the project tables further using variant-id to distribute rows
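Assuming Postgres hash partitioning (the post doesn’t name the exact mechanism, so this is a sketch), one large project’s partition might be decomposed like this, with an illustrative shard count of four:

```sql
-- Sketch: sub-partition a large project's partition by hashing the
-- composite variant_id, spreading rows evenly across the shards.
CREATE TABLE variant_project_a PARTITION OF variant
    FOR VALUES IN ('project-a')
    PARTITION BY HASH (variant_id);

-- Four shards here; a hot project can later be rebuilt with more.
CREATE TABLE variant_project_a_s0 PARTITION OF variant_project_a
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE variant_project_a_s1 PARTITION OF variant_project_a
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE variant_project_a_s2 PARTITION OF variant_project_a
    FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE variant_project_a_s3 PARTITION OF variant_project_a
    FOR VALUES WITH (MODULUS 4, REMAINDER 3);

-- The GIN defined on the parent table in the earlier sketch cascades
-- down to every shard, so each shard carries its own small index.
```

Because the shard count is just part of the rebuild, a project that outgrows its current layout can be rebuilt with a larger modulus without touching any other project’s partitions.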

And in the End

After multiple indexing and schema changes, we arrived at a solution that enabled performant queries using any combination of columns and could scale projects independently as whole genome results arrived. However, the experience forced us to start considering alternatives to Postgres. In Part II, I will discuss how we ultimately decided to migrate off of Postgres.
