Variant Filtering of Genome VCF Files with Snowflake Utilizing the Registry of Open Data on AWS
Introduction
In this article, I will explain how to import Variant Call Format (VCF) files, generated from genomic data such as whole-genome sequencing data produced by next-generation sequencers, into Snowflake for analysis. The data published in the Registry of Open Data on AWS lives in S3, so it is easy to integrate as an external stage in Snowflake. This blog post focuses primarily on that method.
What is the Registry of Open Data on AWS?
The Registry of Open Data on AWS, provided by Amazon Web Services (AWS), is a catalog of open datasets freely accessible to researchers, data scientists, and developers. The registry spans many fields, making it easy to discover and use data, and it includes a significant amount of data related to bioinformatics.
In this blog post, we will use two datasets featured in this registry, the DRAGEN data from the 1000 Genomes Project and the ClinVar data prepared by NIH, to register external stages for the public data and annotate VCF data. We will then attempt variant filtering using SQL.
Registering External Stages of Public Data and Annotating VCF Data
The general flow of the analysis is as follows:
- Register an external stage for the S3 bucket containing the DRAGEN 1000-Genomes project genomic data from the Registry of Open Data on AWS.
- Ingest the data into Snowflake tables.
- Register an external stage for the S3 bucket containing the ClinVar annotation data and annotate the variants.
First, we will register the DRAGEN 1000-Genomes genomic data as an external stage so Snowflake can read it. Next, we will load the staged data into Snowflake tables, enabling filtering and analysis. Finally, we will import ClinVar’s annotation data into Snowflake and annotate the genomic data.
Registering DRAGEN’s S3 as an External Stage
To begin, we register the URI of DRAGEN’s S3 bucket (s3://1000genomes-dragen-3.7.6/data/individuals/hg38-graph-based) as a Snowflake stage. Registering the S3 bucket as an external stage enables direct access to the data in S3 from Snowflake. Using the registered external stage, we can execute SQL queries to examine the datasets stored in S3; the files can be listed with a query on the DIRECTORY table function.
-- Register the public DRAGEN bucket as an external stage with a directory table
create or replace stage dragen_all
directory = (enable = true)
url = 's3://1000genomes-dragen-3.7.6/data/individuals/hg38-graph-based'
file_format = (type = CSV compression = AUTO)
;
-- List the hard-filtered VCFs for samples HG00110 through HG00117
SELECT * FROM DIRECTORY( @dragen_all )
where relative_path rlike '/HG0011[0-7].*.hard-filtered.vcf.gz'
;
Executing the query lists the files matching the specified pattern. In this case, we will work with VCF files for eight genomes containing pre-filtered variants.
Importing DRAGEN into Snowflake Table and Executing Queries
We will define a table to hold the DRAGEN data. Because Snowflake stores tables in an optimized format, the data can be inspected quickly and efficiently through SQL.
create or replace table GENOTYPES_BY_SAMPLE (
  CHROM varchar,
  POS integer,
  ID varchar,
  REF varchar,
  ALT array,
  QUAL integer,
  FILTER varchar,
  INFO variant,
  SAMPLE_ID varchar,
  VALS variant,
  ALLELE1 varchar,
  ALLELE2 varchar,
  FILENAME varchar
);
insert into GENOTYPES_BY_SAMPLE (
  CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO,
  SAMPLE_ID, VALS, ALLELE1, ALLELE2, FILENAME )
with file_list as (
  SELECT file_url, relative_path FROM DIRECTORY( @dragen_all )
  where relative_path rlike '/HG0011[0-7].*.hard-filtered.vcf.gz'
  order by random()  -- shuffle the files so ingestion work is spread evenly
)
select
  replace(CHROM, 'chr', ''),  -- strip the 'chr' prefix to match ClinVar naming
  POS,
  ID,
  REF,
  split(ALT, ','),            -- ALT can hold multiple comma-separated alleles
  QUAL,
  FILTER,
  INFO,
  SAMPLEID,
  SAMPLEVALS,
  allele1(ref, split(ALT, ','), SAMPLEVALS:GT::varchar) as ALLELE1,
  allele2(ref, split(ALT, ','), SAMPLEVALS:GT::varchar) as ALLELE2,
  split_part(relative_path, '/', 3)  -- record which file each row came from
from file_list,
  table(ingest_vcf(BUILD_SCOPED_FILE_URL(@dragen_all, relative_path), 0)) vcf
;
*The ingest_vcf function is the UDTF introduced in [this article](https://zenn.dev/t_koreeda/articles/b4c2519e4824c9).
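For context, allele1 and allele2 resolve the genotype (GT) field, such as 0/1, into concrete allele strings. Their actual definitions are in the linked article; purely as an illustration of the idea, a hypothetical helper could look like this:
-- Hypothetical sketch, not the linked article's definition:
-- GT index 0 refers to REF; index n (n >= 1) refers to ALT[n-1].
create or replace function allele1_sketch(ref varchar, alt array, gt varchar)
returns varchar
as
$$
case split_part(replace(gt, '|', '/'), '/', 1)
  when '0' then ref
  when '.' then null
  else get(alt, split_part(replace(gt, '|', '/'), '/', 1)::integer - 1)::varchar
end
$$;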
After loading, it is worth summarizing the distribution of variant call depth (DP) per sample, for example the 5th, 10th, 50th, 90th, and 95th percentiles of this important quality metric.
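A minimal sketch of such a check, assuming the ingestion UDTF exposes the per-sample read depth as VALS:DP:
-- Per-sample DP percentiles (assumes VALS:DP holds the FORMAT/DP value)
select sample_id,
  approx_percentile(vals:DP::number, 0.05) as dp_p05,
  approx_percentile(vals:DP::number, 0.10) as dp_p10,
  approx_percentile(vals:DP::number, 0.50) as dp_p50,
  approx_percentile(vals:DP::number, 0.90) as dp_p90,
  approx_percentile(vals:DP::number, 0.95) as dp_p95
from genotypes_by_sample
group by sample_id
order by sample_id;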
Registering an External Stage with ClinVar in S3
We will use ClinVar data for annotation; it too is available through the Registry of Open Data on AWS.
-- Register the ClinVar Parquet data as an external stage
create or replace stage clinvar
url = 's3://aws-roda-hcls-datalake/clinvar_summary_variants'
file_format = (type = PARQUET)
;
create or replace table CLINVAR (
  chrom varchar(2),
  pos number,
  ref varchar,
  alt varchar,
  ALLELEID number,
  chromosomeaccession varchar,
  CLNSIG varchar,
  clinsigsimple number,
  cytogenetic varchar,
  geneid number,
  genesymbol varchar,
  guidelines varchar,
  hgnc_id varchar,
  lastevaluated varchar,
  name varchar,
  numbersubmitters number,
  origin varchar,
  originsimple varchar,
  otherids varchar,
  CLNDISB array,
  CLNDN array,
  rcvaccession array,
  CLNREVSTAT varchar,
  RS number,
  startpos number,
  stoppos number,
  submittercategories number,
  testedingtr varchar,
  type varchar,
  variationid number,
  full_annotation variant
);
insert into CLINVAR
select
  $1:chromosome::varchar(2) chrom,
  $1:positionvcf::number pos,
  $1:referenceallelevcf::varchar ref,
  $1:alternateallelevcf::varchar alt,
  $1:alleleid::number ALLELEID,
  $1:chromosomeaccession::varchar chromosomeaccession,
  $1:clinicalsignificance::varchar CLNSIG,
  $1:clinsigsimple::number clinsigsimple,
  $1:cytogenetic::varchar cytogenetic,
  $1:geneid::number geneid,
  $1:genesymbol::varchar genesymbol,
  $1:guidelines::varchar guidelines,
  $1:hgnc_id::varchar hgnc_id,
  $1:lastevaluated::varchar lastevaluated,
  $1:name::varchar name,
  $1:numbersubmitters::number numbersubmitters,
  $1:origin::varchar origin,
  $1:originsimple::varchar originsimple,
  $1:otherids::varchar otherids,
  split($1:phenotypeids::varchar, '|') CLNDISB,   -- pipe-delimited in the source
  split($1:phenotypelist::varchar, '|') CLNDN,
  split($1:rcvaccession::varchar, '|') rcvaccession,
  $1:reviewstatus::varchar CLNREVSTAT,
  $1:rsid_dbsnp::number RS,
  $1:start::number startpos,
  $1:stop::number stoppos,
  $1:submittercategories::number submittercategories,
  $1:testedingtr::varchar testedingtr,
  $1:type::varchar type,
  $1:variationid::number variationid,
  $1::variant full_annotation
FROM '@clinvar/variant_summary/'
where $1:assembly = 'GRCh38'   -- keep only GRCh38 rows to match DRAGEN
order by chrom, pos
;
Variant Filtering in ClinVar
Let’s try variant filtering with ClinVar. We will examine all positions related to hereditary colon cancer. Note that you do not need to specify the chromosome or position explicitly; the filtering is applied implicitly through the join to ClinVar.
select g.chrom, g.pos, g.ref, allele1, allele2, count (sample_id)
from genotypes_by_sample g
join clinvar c
on c.chrom = g.chrom and c.pos = g.pos and c.ref = g.ref
where
array_contains('Hereditary nonpolyposis colorectal neoplasms'::variant, CLNDN)
group by 1,2,3,4,5
order by 1,2,5
;
Because the CLNDN column in ClinVar can contain multiple values, use Snowflake’s ARRAY_CONTAINS function in the filter condition to specify the phenotype of interest. Next, you can narrow the results further by keeping only rows where either ALLELE1 or ALLELE2 of the variant call matches the ClinVar alternate allele.
select g.chrom, g.pos, g.ref, c.alt clinvar_alt, genesymbol, allele1, allele2, count (sample_id)
from genotypes_by_sample g
join clinvar c
on c.chrom = g.chrom and c.pos = g.pos and c.ref = g.ref
and ((Allele1= c.alt) or (Allele2= c.alt))
where
array_contains('Hereditary nonpolyposis colorectal neoplasms'::variant, CLNDN)
group by 1,2,3,4,5,6,7
order by 1,2,5
;
Benefits of Variant Analysis in Snowflake
Having tried variant analysis on VCF files in Snowflake, I found several benefits worth summarizing.
Management of Filtering Workflows
Variant analysis typically generates large intermediate files such as CSVs or VCFs. These files are generated at each step of data transformation, filtering, and analysis, often having different formats and structures. By using Snowflake, you can save all these intermediate files as tables for centralized management. Storing the data in table format makes it easier for visualization, search, filtering, and analysis, streamlining data management. Additionally, organizing data based on Snowflake’s schema helps maintain data integrity and quality.
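As a small illustration, an intermediate filtering step can be persisted as a table instead of a new file (table and column names follow the examples above; the thresholds are purely illustrative):
-- Persist an intermediate filtering step as a table (illustrative thresholds)
create or replace table pass_variants as
select *
from genotypes_by_sample
where filter = 'PASS'
  and qual >= 30;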
Convenient Data Filtering with SQL
Using Snowflake allows you to perform data filtering through SQL. SQL is a widely used standard language for database management systems that makes operations such as selection, filtering, aggregation, and joining straightforward. Not having to learn dedicated tools such as SnpSift is another advantage.
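For example, a quick per-chromosome variant count per sample is a one-liner, with no special-purpose tooling required:
-- Count variants per sample and chromosome
select sample_id, chrom, count(*) as n_variants
from genotypes_by_sample
group by sample_id, chrom
order by sample_id, chrom;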
Utilizing Storage Compression Benefits on Snowflake
Snowflake supports not only structured data but also unstructured data, including VCF files. Storing VCF data in Snowflake tables lets you benefit from Snowflake’s storage compression, potentially reducing costs compared to keeping the raw files on services like S3.
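If you want to check the effect yourself, Snowflake reports post-compression storage per table; one way to look it up, assuming you have access to the ACCOUNT_USAGE share:
-- ACTIVE_BYTES is measured after Snowflake's compression
select table_name, active_bytes
from snowflake.account_usage.table_storage_metrics
where table_name in ('GENOTYPES_BY_SAMPLE', 'CLINVAR');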
[Learn more about Snowflake’s unstructured data support](https://docs.snowflake.com/en/user-guide/unstructured-intro)
In Conclusion
By incorporating open data such as the Registry of Open Data on AWS into Snowflake, you can expand the scope of your analyses. Let’s enrich our analyses by making effective use of public databases. In this post we used samples whose variants had already been called, but with Snowpark Container Services available in Snowflake, it is also possible to start from earlier stages, such as mapping to the reference genome, alignment (with tools like BWA, Bowtie2, or STAR), and variant calling (GATK, Samtools/BCFtools) from SAM/BAM files. We will continue to verify this process periodically.