Variant Filtering of Genome VCF Files with Snowflake Utilizing the Registry of Open Data on AWS
Introduction
In this article, I will explain how to import Variant Call Format (VCF) files, generated from genomic data such as whole-genome sequencing data produced by next-generation sequencers, into Snowflake for analysis. The data published in the Registry of Open Data on AWS lives in S3, so it is easy to integrate as an external stage in Snowflake. This blog post focuses primarily on that method.
What is the Registry of Open Data on AWS?
The Registry of Open Data on AWS, provided by Amazon Web Services (AWS), is a catalog of open datasets freely accessible to researchers, data scientists, and developers. The registry spans many fields, making it easy to discover and use data, and it includes a significant amount of data related to bioinformatics.
In this blog post, we will use two datasets featured in this registry, the DRAGEN data from the 1000 Genomes Project and the ClinVar data prepared by NIH, to register external stages for the public data and annotate VCF data. We will then attempt variant filtering using SQL.
Registering External Stages of Public Data and Annotating VCF Data
The general flow of the analysis is as follows:
- Register an external stage for the S3 bucket containing the DRAGEN 1000-Genomes project genomic data from the Registry of Open Data on AWS.
- Ingest the data into Snowflake tables.
- Register an external stage for the S3 bucket containing the ClinVar annotation data and annotate the variants.
First, we will register the DRAGEN 1000-Genomes genomic data as an external stage so Snowflake can read it. Next, we will load the staged data into Snowflake tables, enabling filtering and analysis. Finally, we will import ClinVar’s annotation data into Snowflake and annotate the genomic data.
Registering DRAGEN’s S3 as an External Stage
To begin, we register the URI of DRAGEN’s S3 bucket (s3://1000genomes-dragen-3.7.6/data/individuals/hg38-graph-based) as a Snowflake stage. Registering the S3 bucket as an external stage enables direct access to the data in S3 from Snowflake. Using the registered external stage, we can execute SQL queries to examine the datasets stored in S3; the files can be listed with a query on the DIRECTORY table function.
-- Register the public DRAGEN bucket as an external stage with a directory table
create or replace stage dragen_all
directory = (enable = true)
url = 's3://1000genomes-dragen-3.7.6/data/individuals/hg38-graph-based'
file_format = (type = CSV compression = AUTO)
;
-- List the hard-filtered VCFs for samples HG00110 through HG00117
SELECT * FROM DIRECTORY( @dragen_all )
where relative_path rlike '/HG0011[0-7].*.hard-filtered.vcf.gz'
;
Executing the query lists the files matching the specified pattern. In this case, we will work with VCF files for eight genomes containing pre-filtered variants.
Importing DRAGEN into Snowflake Table and Executing Queries
We will define a table to hold the DRAGEN data. Because Snowflake stores tables in an optimized format, the data can be inspected quickly and efficiently through SQL.
create or replace table GENOTYPES_BY_SAMPLE (
  CHROM varchar,
  POS integer,
  ID varchar,
  REF varchar,
  ALT array,
  QUAL integer,
  FILTER varchar,
  INFO variant,
  SAMPLE_ID varchar,
  VALS variant,
  ALLELE1 varchar,
  ALLELE2 varchar,
  FILENAME varchar
);
insert into GENOTYPES_BY_SAMPLE (
  CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO,
  SAMPLE_ID, VALS, ALLELE1, ALLELE2, FILENAME )
with file_list as (
  SELECT file_url, relative_path FROM DIRECTORY( @dragen_all )
  where relative_path rlike '/HG0011[0-7].*.hard-filtered.vcf.gz'
  order by random()  -- shuffle the files so ingestion work is spread evenly
)
select
  replace(CHROM, 'chr', ''),  -- strip the 'chr' prefix to match ClinVar naming
  POS,
  ID,
  REF,
  split(ALT, ','),            -- ALT can hold multiple comma-separated alleles
  QUAL,
  FILTER,
  INFO,
  SAMPLEID,
  SAMPLEVALS,
  allele1(ref, split(ALT, ','), SAMPLEVALS:GT::varchar) as ALLELE1,
  allele2(ref, split(ALT, ','), SAMPLEVALS:GT::varchar) as ALLELE2,
  split_part(relative_path, '/', 3)  -- record which file each row came from
from file_list,
  table(ingest_vcf(BUILD_SCOPED_FILE_URL(@dragen_all, relative_path), 0)) vcf
;
*The ingest_vcf function is the UDTF introduced in [this article](https://zenn.dev/t_koreeda/articles/b4c2519e4824c9).
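For context, allele1 and allele2 resolve the genotype (GT) field, such as 0/1, into concrete allele strings. Their actual definitions are in the linked article; purely as an illustration of the idea, a hypothetical helper could look like this:
-- Hypothetical sketch, not the linked article's definition:
-- GT index 0 refers to REF; index n (n >= 1) refers to ALT[n-1].
create or replace function allele1_sketch(ref varchar, alt array, gt varchar)
returns varchar
as
$$
case split_part(replace(gt, '|', '/'), '/', 1)
  when '0' then ref
  when '.' then null
  else get(alt, split_part(replace(gt, '|', '/'), '/', 1)::integer - 1)::varchar
end
$$;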
After loading, it is worth summarizing the distribution of variant call depth (DP) per sample, for example the 5th, 10th, 50th, 90th, and 95th percentiles of this important quality metric.
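A minimal sketch of such a check, assuming the ingestion UDTF exposes the per-sample read depth as VALS:DP:
-- Per-sample DP percentiles (assumes VALS:DP holds the FORMAT/DP value)
select sample_id,
  approx_percentile(vals:DP::number, 0.05) as dp_p05,
  approx_percentile(vals:DP::number, 0.10) as dp_p10,
  approx_percentile(vals:DP::number, 0.50) as dp_p50,
  approx_percentile(vals:DP::number, 0.90) as dp_p90,
  approx_percentile(vals:DP::number, 0.95) as dp_p95
from genotypes_by_sample
group by sample_id
order by sample_id;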
Registering an External Stage with ClinVar in S3
We will use ClinVar data for annotation; it too is available through the Registry of Open Data on AWS.
-- Register the ClinVar Parquet data as an external stage
create or replace stage clinvar
url = 's3://aws-roda-hcls-datalake/clinvar_summary_variants'
file_format = (type = PARQUET)
;
create or replace table CLINVAR (
  chrom varchar(2),
  pos number,
  ref varchar,
  alt varchar,
  ALLELEID number,
  chromosomeaccession varchar,
  CLNSIG varchar,
  clinsigsimple number,
  cytogenetic varchar,
  geneid number,
  genesymbol varchar,
  guidelines varchar,
  hgnc_id varchar,
  lastevaluated varchar,
  name varchar,
  numbersubmitters number,
  origin varchar,
  originsimple varchar,
  otherids varchar,
  CLNDISB array,
  CLNDN array,
  rcvaccession array,
  CLNREVSTAT varchar,
  RS number,
  startpos number,
  stoppos number,
  submittercategories number,
  testedingtr varchar,
  type varchar,
  variationid number,
  full_annotation variant
);
insert into CLINVAR
select
  $1:chromosome::varchar(2) chrom,
  $1:positionvcf::number pos,
  $1:referenceallelevcf::varchar ref,
  $1:alternateallelevcf::varchar alt,
  $1:alleleid::number ALLELEID,
  $1:chromosomeaccession::varchar chromosomeaccession,
  $1:clinicalsignificance::varchar CLNSIG,
  $1:clinsigsimple::number clinsigsimple,
  $1:cytogenetic::varchar cytogenetic,
  $1:geneid::number geneid,
  $1:genesymbol::varchar genesymbol,
  $1:guidelines::varchar guidelines,
  $1:hgnc_id::varchar hgnc_id,
  $1:lastevaluated::varchar lastevaluated,
  $1:name::varchar name,
  $1:numbersubmitters::number numbersubmitters,
  $1:origin::varchar origin,
  $1:originsimple::varchar originsimple,
  $1:otherids::varchar otherids,
  split($1:phenotypeids::varchar, '|') CLNDISB,   -- pipe-delimited in the source
  split($1:phenotypelist::varchar, '|') CLNDN,
  split($1:rcvaccession::varchar, '|') rcvaccession,
  $1:reviewstatus::varchar CLNREVSTAT,
  $1:rsid_dbsnp::number RS,
  $1:start::number startpos,
  $1:stop::number stoppos,
  $1:submittercategories::number submittercategories,
  $1:testedingtr::varchar testedingtr,
  $1:type::varchar type,
  $1:variationid::number variationid,
  $1::variant full_annotation
FROM '@clinvar/variant_summary/'
where $1:assembly = 'GRCh38'   -- keep only GRCh38 rows to match DRAGEN
order by chrom, pos
;
Variant Filtering in ClinVar
Let’s try variant filtering with ClinVar. We will examine all positions related to hereditary colon cancer. Note that you do not need to specify the chromosome or position explicitly; the filtering is applied implicitly through the join to ClinVar.
select g.chrom, g.pos, g.ref, allele1, allele2, count (sample_id)
from genotypes_by_sample g
join clinvar c
on c.chrom = g.chrom and c.pos = g.pos and c.ref = g.ref
where
array_contains('Hereditary nonpolyposis colorectal neoplasms'::variant, CLNDN)
group by 1,2,3,4,5
order by 1,2,5
;
Because the CLNDN column in ClinVar can contain multiple values, use Snowflake’s ARRAY_CONTAINS function in the filter condition to specify the phenotype of interest. Next, you can narrow the results further by keeping only rows where either ALLELE1 or ALLELE2 of the variant call matches the ClinVar alternate allele.
select g.chrom, g.pos, g.ref, c.alt clinvar_alt, genesymbol, allele1, allele2, count (sample_id)
from genotypes_by_sample g
join clinvar c
on c.chrom = g.chrom and c.pos = g.pos and c.ref = g.ref
and ((Allele1= c.alt) or (Allele2= c.alt))
where
array_contains('Hereditary nonpolyposis colorectal neoplasms'::variant, CLNDN)
group by 1,2,3,4,5,6,7
order by 1,2,5
;
Benefits of Variant Analysis in Snowflake
Having tried variant analysis on VCF files in Snowflake, I found several benefits worth summarizing.
Management of Filtering Workflows
Variant analysis typically generates large intermediate files such as CSVs or VCFs. These files are generated at each step of data transformation, filtering, and analysis, often having different formats and structures. By using Snowflake, you can save all these intermediate files as tables for centralized management. Storing the data in table format makes it easier for visualization, search, filtering, and analysis, streamlining data management. Additionally, organizing data based on Snowflake’s schema helps maintain data integrity and quality.
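As a small illustration, an intermediate filtering step can be persisted as a table instead of a new file (table and column names follow the examples above; the thresholds are purely illustrative):
-- Persist an intermediate filtering step as a table (illustrative thresholds)
create or replace table pass_variants as
select *
from genotypes_by_sample
where filter = 'PASS'
  and qual >= 30;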
Convenient Data Filtering with SQL
Using Snowflake allows you to perform data filtering through SQL. SQL is a widely used standard language for database management systems that makes operations such as selection, filtering, aggregation, and joining straightforward. Not having to learn dedicated tools such as SnpSift is another advantage.
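For example, a quick per-chromosome variant count per sample is a one-liner, with no special-purpose tooling required:
-- Count variants per sample and chromosome
select sample_id, chrom, count(*) as n_variants
from genotypes_by_sample
group by sample_id, chrom
order by sample_id, chrom;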
Utilizing Storage Compression Benefits on Snowflake
Snowflake supports not only structured data but also unstructured data, including VCF files. Storing VCF data in Snowflake tables lets you benefit from Snowflake’s storage compression, potentially reducing costs compared to keeping the raw files on services like S3.
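If you want to check the effect yourself, Snowflake reports post-compression storage per table; one way to look it up, assuming you have access to the ACCOUNT_USAGE share:
-- ACTIVE_BYTES is measured after Snowflake's compression
select table_name, active_bytes
from snowflake.account_usage.table_storage_metrics
where table_name in ('GENOTYPES_BY_SAMPLE', 'CLINVAR');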
[Learn more about Snowflake’s unstructured data support](https://docs.snowflake.com/en/user-guide/unstructured-intro)
In Conclusion
By incorporating open data such as the Registry of Open Data on AWS into Snowflake, you can expand the scope of your analyses. Let’s enrich our analyses by making effective use of public databases. In this post we used samples whose variants had already been called, but with Snowpark Container Services available in Snowflake, it is also possible to start from earlier stages, such as mapping to the reference genome, alignment (with tools like BWA, Bowtie2, or STAR), and variant calling (GATK, Samtools/BCFtools) from SAM/BAM files. We will continue to verify this process periodically.