TileDB in genomics: getting started with TileDB-VCF to store and query genomic variant data

Karl Sebby
Published in truwl · 8 min read · May 24, 2022

TileDB describes itself as “The Universal Database” for all data, and its website cites use cases in finance, LiDAR, imaging, and genomics, among others. In genomics, TileDB has been used for single-cell data as the database that backs the Chan Zuckerberg Initiative’s cellxgene project. The most prominent usage, though, is in population genomics. Major adopters of TileDB for storing population-level genomic data include Helix and Rady Children’s Institute for Genomic Medicine (RCIGM). In this video, Dr. Stephen Kingsmore of RCIGM describes how TileDB solves two major problems for them: the n+1 problem and the complex query problem. There is a lot to learn about TileDB (what it is and how it works), but the point of this post is to rapidly get you set up and using it with a small dataset and some simple queries.

TileDB, the name of both the database library and the company that supports it, has an enterprise cloud product, but the main embeddable C++ library, TileDB Embedded, is open source and distributed under the MIT license. With TileDB Embedded you can create your own database arrays, attributes, and schemas, but TileDB has also built a package called TileDB-VCF (also MIT-licensed) on top of TileDB Embedded that is ready to go for working with VCF data, and that is what I’ll use here.

Installing

TileDB has multiple installation options, including building from source, a Docker container, and conda. I hit an error on my first try when installing from source using CMake, but installing with conda went without a hitch, so that is what I’ll use. I also want to create and interact with the database with Python, so I’ll install the Python API too. I’m working on an Ubuntu 18.04.3 LTS machine that I manage with Vagrant (corresponding file content shown below), but any Linux/Unix machine with conda installed should work fine. Here’s how to set up your machine and install conda:

1. Update package indexes:
sudo apt-get update

2. Install additional packages:
sudo apt-get install -y python3-pip libbz2-dev liblzma-dev autoconf tree automake zlib1g-dev

Not all these packages are needed when using conda, but I included them so you’re set up to install from source as well.

3. To install conda, download the Miniconda installer and run it following directions at https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html

Alternate setup with Vagrant

I’ve automated the above steps with Vagrant using the following Vagrantfile and bash script which installs conda. If you already have conda installed, skip this.

Here’s the Vagrantfile.

# Vagrantfile
Vagrant.configure("2") do |config|
  config.vm.box = "hashicorp/bionic64"
  config.vm.provider "virtualbox" do |vb|
    vb.memory = "4096"
    vb.name = "TileDB"
  end
  config.vm.provision "shell", inline: <<-SHELL
    # This will run as root
    apt-get update
    apt-get install -y python3-pip libbz2-dev liblzma-dev autoconf tree automake zlib1g-dev
    # This will run as user vagrant.
    su vagrant -c '/vagrant/user_provision.sh'
  SHELL
end

And here’s the bash script it calls

#! /bin/bash
# user_provision.sh
# Commands to run as user 'vagrant' when provisioning vm.

# some local variables
minicondash=Miniconda3-latest-Linux-x86_64.sh
installpath=/home/vagrant/bin

# check if user bin directory is there.
if [[ ! -d $installpath ]]; then
    mkdir -p $installpath
fi

# check if conda is installed (shouldn't be, unless re-provisioning)
if conda --version 2> /dev/null; then
    echo "conda is already installed"
else
    cd /vagrant
    if [[ ! -f $minicondash ]]; then
        echo Downloading miniconda installation script
        wget -q http://repo.continuum.io/miniconda/$minicondash
    fi
    chmod +x $minicondash
    echo installing miniconda
    ./$minicondash -b -p $installpath/miniconda
fi

cat >> /home/vagrant/.bashrc << END
PATH=$PATH:$installpath/miniconda/bin
END

$installpath/miniconda/bin/conda init bash
source /home/vagrant/.bashrc

Once you have conda installed run conda init bash if you haven’t already. This allows you to use the conda activate command.

  • Now make a new conda environment: conda create -n tiledbvcf-py
  • Activate your new environment: conda activate tiledbvcf-py
  • Install tiledbvcf-py: conda install -c conda-forge -c bioconda -c tiledb tiledbvcf-py
  • Check that the install worked. You should now have two ways of interacting with TileDB-VCF: the CLI and the Python API.
    - Check the CLI: tiledbvcf --version
    - Check the Python API: python -c "import tiledbvcf; print(tiledbvcf.version)"

If these commands worked, congrats, you’re ready to start using TileDB-VCF.

Adding some samples

Creating a dataset

Okay. Time to create a database and put something in it. First you need to create a dataset, and the TileDB documentation does a good job of showing how. You can use Python or the CLI here. I’ll use Python. I first make a Python file, create_dataset.py, and add the following code. This creates a blank dataset and will overwrite any existing dataset with the same URI, so I’m living dangerously here.

# create_dataset.py
import tiledbvcf
from pathlib import Path
import shutil

"""
Creates an empty dataset. Deletes dataset and recreates if it already exists.
"""
uri = "my_first_tiledb_vcf_dataset"

# remove any existing dataset at this uri before creating a new one
dataset_path = Path.cwd() / uri
if dataset_path.exists():
    shutil.rmtree(dataset_path)

ds = tiledbvcf.Dataset(uri, mode="w")
ds.create_dataset()

Run this code with python -m create_dataset.

You should now have a new directory titled my_first_tiledb_vcf_dataset. There are a bunch of directories and files in there now. You can see the structure with tree my_first_tiledb_vcf_dataset.
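If you don’t have the tree utility installed, a few lines of standard-library Python can print a similar layout. This is a generic sketch (print_tree is my own hypothetical helper, not part of TileDB); point it at the dataset directory created above.

```python
from pathlib import Path

def print_tree(root, prefix=""):
    """Recursively print a directory's contents, similar to the `tree` CLI."""
    for entry in sorted(Path(root).iterdir(), key=lambda p: p.name):
        print(f"{prefix}{entry.name}")
        if entry.is_dir():
            print_tree(entry, prefix + "  ")

# Usage (assumes the dataset from create_dataset.py exists):
# print_tree("my_first_tiledb_vcf_dataset")
```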

Getting some data and adding it

Now we’re ready to add some data, but we need to get some first.

I put three VCF files (created with the DRAGEN pipeline on the Genome in a Bottle [GIAB] Ashkenazi Jewish trio fastq samples with the GRCh38 reference) and their corresponding index files on Google Drive, which can be downloaded with this link (~1.3GB). Once you download and unzip that file (the files within should remain gzipped), you can add them to your dataset. This can be done one at a time, with a wildcard pattern, or by passing a text file with a list of the sample VCF file names. I’ll use a wildcard pattern and the tiledbvcf CLI:

tiledbvcf store --uri my_first_tiledb_vcf_dataset \
data/GIAB_vcfs/HG00?.hard-filtered.vcf.gz

Note that the command will fail if the corresponding index files (e.g. HG002.hard-filtered.vcf.gz.tbi) are not present in the same directory. This step might take a few minutes.
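The same ingestion can be done from Python with the dataset’s ingest_samples method. A sketch, assuming tiledbvcf is installed and the dataset from create_dataset.py exists; index_path is just a hypothetical helper showing the index-file naming convention the note above refers to.

```python
def index_path(vcf_path):
    # Tabix index files sit next to the VCF with a .tbi suffix
    return vcf_path + ".tbi"

def ingest(uri, vcf_paths):
    # Requires the tiledbvcf conda package; Python equivalent
    # of `tiledbvcf store`
    import tiledbvcf
    ds = tiledbvcf.Dataset(uri, mode="w")
    ds.ingest_samples(vcf_paths)

# Usage:
# vcfs = [f"data/GIAB_vcfs/HG00{i}.hard-filtered.vcf.gz" for i in (2, 3, 4)]
# ingest("my_first_tiledb_vcf_dataset", vcfs)
```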

Querying

Now that you’ve ingested some data, you can do some queries. You’ll likely want to make some Python scripts for your favorite queries but we’ll just do a few from the CLI for now. You can get the names of the samples with

tiledbvcf list -u my_first_tiledb_vcf_dataset

which returns

HG002
HG003
HG004

Hurray. All three samples made it. You can get some stats about the dataset with

tiledbvcf stat -u my_first_tiledb_vcf_dataset/

which returns

Statistics for dataset 'my_first_tiledb_vcf_dataset/':
- Version: 4
- Tile capacity: 10,000
- Anchor gap: 1,000
- Number of samples: 3
- Extracted attributes: none

Let’s look at what fields are in the VCF files in the first place. There are a bunch of comment/metadata lines in VCFs before the tab-separated data starts. The tabulated data begins with a column header row whose first column is named #CHROM. We can use zcat and grep to find where that data starts, what the column headers are, and what the data looks like:

zcat HG002.hard-filtered.vcf.gz | grep '#CHROM' -n -A 5

which shows the column headers (in row 3419) and 4 rows of data.

Column headers in VCF file.
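The same inspection can be done in pure Python. This is a small sketch: vcf_columns is my own helper that scans VCF lines for the #CHROM header row, and the embedded snippet stands in for a real (gzipped) file.

```python
import gzip

def vcf_columns(lines):
    """Return the column headers from the #CHROM line of a VCF."""
    for line in lines:
        if line.startswith("#CHROM"):
            return line.lstrip("#").rstrip("\n").split("\t")
    return None

# A couple of header/data lines standing in for a real file:
sample_vcf = """##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tHG002
chr1\t10146\t.\tAC\tA\t36.0\tPASS\t.\tGT\t0/1
"""

print(vcf_columns(sample_vcf.splitlines(keepends=True)))
# → ['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT', 'HG002']

# For the real file, stream it with gzip instead:
# with gzip.open("HG002.hard-filtered.vcf.gz", "rt") as fh:
#     print(vcf_columns(fh))
```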

Okay. Now let’s query the data in the database instead of reading straight from the files. You can get subsets of the data with the export subcommand, which allows you to specify samples, fields, and regions. I almost always start with the --count-only option, which returns only the number of records in the query, to keep tens of thousands of lines from being printed to my screen. This query returns the number of records in the first 1,000,000 base pairs of chromosome 2:

tiledbvcf export \
--uri my_first_tiledb_vcf_dataset/ \
--sample-names HG002 \
--regions chr2:1-1000000 \
--count-only

This tells me that there are 1949 records in this region of chromosome 2 for subject HG002, who is the son of the trio; HG003 is the father and HG004 is the mother. I can do the same thing for the other subjects by replacing the sample name.
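The Python API offers the same count through the dataset’s count method. A sketch, assuming tiledbvcf is installed and the dataset from the earlier steps exists; build_region is just a hypothetical helper for formatting region strings.

```python
def build_region(contig, start, end):
    # Format a region string the way tiledbvcf expects: contig:start-end
    return f"{contig}:{start}-{end}"

def count_records(uri, sample, region):
    # Requires the tiledbvcf package and the ingested dataset;
    # Python equivalent of `tiledbvcf export --count-only`
    import tiledbvcf
    ds = tiledbvcf.Dataset(uri, mode="r")
    return ds.count(samples=[sample], regions=[region])

# Usage:
# count_records("my_first_tiledb_vcf_dataset", "HG002",
#               build_region("chr2", 1, 1_000_000))
```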

Now let’s find a reasonably sized query to take a look at. To see output I need to remove the --count-only option and add the --output-format and --tsv-fields arguments so I get some records with fields returned. I’ll query the region of the Von Willebrand Factor A Domain Containing 1 (VWA1) gene on chromosome 1 and return the reference, alternate, and position of the ‘variants’ in TSV format (you can also export in BCF or VCF formats).

tiledbvcf export \
--uri my_first_tiledb_vcf_dataset/ \
--sample-names HG002 \
--output-format t \
--tsv-fields CHR,REF,ALT,POS \
--regions chr1:1435690-1442882

One thing to note is that the field names in --tsv-fields do not necessarily match the field names in the VCF header; here we had to use CHR, not CHROM. You can see the appropriate field names by looking at the help: tiledbvcf export -h. The --regions value, however, does need to match how chromosomes are named in the VCF file. If you were to try the query from the TileDB documentation with these VCFs, --regions 1:113409605-113475691,1:113500000-113600000, you would get zero results because it will not find a chromosome named ‘1’.
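In the Python API, the counterpart to export is the dataset’s read method, which returns a pandas DataFrame. Note that the Python attribute names (e.g. contig, pos_start, alleles) are yet another naming scheme, different from both the VCF header and the TSV field names. A sketch, assuming tiledbvcf is installed and the dataset exists; normalize_region is my own helper guarding against the dash issue above.

```python
def normalize_region(region):
    # Only plain hyphens are valid in contig:start-end region strings;
    # replace en-dashes pasted from formatted documentation
    return region.replace("\u2013", "-")

def read_variants(uri, samples, regions):
    # Requires tiledbvcf; returns a pandas DataFrame of matching records
    import tiledbvcf
    ds = tiledbvcf.Dataset(uri, mode="r")
    return ds.read(
        attrs=["sample_name", "contig", "pos_start", "alleles"],
        samples=samples,
        regions=regions,
    )

# Usage:
# df = read_variants("my_first_tiledb_vcf_dataset", ["HG002"],
#                    [normalize_region("chr1:1435690-1442882")])
```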

With the above query we see 2 records: an SNV and an insertion.

Let’s see if any of the parents have these by adding them to the sample-names: --sample-names HG002,HG003,HG004 . This gives us

All three subjects have the insertion, but only the son and mother have the SNV. If you have your own large dataset you can look at the frequencies of these variants across different populations, but I’ll just take a look on gnomAD, being sure to use gnomAD v3.1.2, which uses GRCh38 coordinates. From gnomAD I can see the C->G variant is in an intron and has an allele frequency of about 33% both in the general population and in the Ashkenazi Jewish population. The insertion is in the 3' UTR and is present in ~98% of alleles in the general population and over 99% in the Ashkenazi Jewish population, so it’s not surprising that all 3 subjects have it. In this case, the absence of the variant call would be a real variation from ‘normal’.

Allele abundances from gnomAD.

Conclusions

Hopefully, that’s enough to get you started and trying some different things. We created a database, put some data in it, and queried it. If I left out something major, please let me know. This was a simple case with only 3 samples and some simple queries. TileDB-VCF’s strengths are really supposed to appear as the size of the dataset increases into thousands and tens of thousands of samples and you run more complex queries. I haven’t done this yet. Regions can also be specified with BED files instead of ranges on the command line. If you want to try it, you can find a diverse set of predefined regions from the GIAB ftp site. Maybe give the MHC region a try.

Disclosure: I know several members of the TileDB team and their Series A investors and am cheering for their success as a company.
