DNA Sequencing of 1000 Cannabis Strains publicly available in Google BigQuery

Published in

Google Cloud - Community

6 min readMar 31, 2017

Applying standard bioinformatics analyses to newly sequenced Cannabis strains via Google BigQuery could help modernize our understanding of the biology and evolutionary history of the species.

Today I’m excited to announce the release of a new open data set as part of Google Cloud’s BigQuery Public Datasets project, and first genomics dataset to be made available on Google Cloud (with more on the way).

In October 2016, Phylos Bioscience released a genomic open dataset of approximately 850 strains of Cannabis via the Open Cannabis Project. In combination with other genomics datasets made available by Medicinal Genomics, Michigan State University, NCBI, Sunrise Medicinal, University of Calgary, University of Toronto, and Yunnan Academy of Agricultural Sciences, the total amount of publicly available data exceeds 1,000 samples taken from nearly as many unique strains. I’ve summarized the available data for nearly 1000 strains and made them publicly available in the BigQuery genomics_cannabis dataset. Examples for how to explore this dataset are available in the accompanying dataset description.

Why Cannabis?

Copious literature is available describing the importance of Cannabis as an agricultural species. Briefly, it is an important source of fibers and nutritional and medicinal compounds (used to treat: pain, nausea, anxiety, and inflammation as the cannabinoid compounds mimic compounds our bodies naturally produce, called endocannabinoids, and grows quickly in a range of environments.

The Marihuana Tax Act of 1937 effectively criminalized possession or transfer of Cannabis under US federal law. Federal criminalization of Cannabis predates the 1953 discovery of the DNA double helix by James Watson and Francis Crick. This discovery gave rise to molecular biology, which is concerned with characterizing the functions of genes within biological systems. Because of this 1937 criminalization, relatively little is known about the genetics of Cannabis compared to other important domesticated plants such as maize, rice, wheat, and soy, and therefore Cannabis is lagging in hardiness and yield optimization. Therefore, an opportunity exists to apply standard bioinformatics analyses to the newly sequenced Cannabis strains to modernize our understanding of the biology and evolutionary history of the species.

The Open Cannabis Project is releasing documents into the public domain in the form of DNA sequences to foster such innovation. Contributors to the Open Cannabis Project have released data in the format produced by the sequencer, which is available on Google Cloud. However, as further analysis of the data were not made available by project contributors, I’ve performed additional de-facto standard analyses of the dataset as a convenience for users. Specifically, today we are publishing the following analysis results:

Sequence alignments of all samples to Cannatonic that can be used to create a higher quality Cannabis genome assembly, and
Detected genetic variants from (1). A genetic variant is an atomic unit of biodiversity.

See the Methods section below to learn specifically how the data were prepared.

Why Now? Genome Sequencing: Individuals and Populations

The genome sequence of a single individual is useful for establishing the evolutionary history of that individual’s species, but is not useful for developing applications that incorporate that species (e.g. agricultural, industrial, and medical applications). Sequencing a population of related individuals measures the biodiversity within the species, and enables these applications. While some Cannabis genomic data has been publicly available for several years, the recent release of data for a large population of strains from Phylos Bioscience makes it a good time to begin deeper bioinformatic analysis of the available data.

To understand why this is true and more completely understand the next phase of the genomics era, one needs to understand more about the underlying biology, and why genome sequencing is important.

Although the price/performance ratio of DNA sequencing technologies has improved dramatically, this advance hasn’t led to smaller projects at lower cost; conversely, many project budgets are growing as the benefits of producing DNA sequence are realized. Why is this happening?

Genome sequencing projects can be broken into two broad categories:

De novo genome sequencing, in which an individual of an as-yet-unsequenced species genome is sequenced (resulting in a reference genome), and
Resequencing, in which a de novo genome has been produced, and in which a new individual (query) genome is sequenced and compared relative to that existing (reference) genome.

An increasing proportion of medically and economically relevant genomes have already had de novo genome sequences produced. So why are we resequencing large numbers of additional individuals?

In order to answer questions about the mechanisms of biological systems, such as:

Why are some humans resistant to communicable disease, while others contract it?
Why do some diseases follow family trees / are inherited?
Why do some plants produce higher yields than others, given identical growing conditions?

We need to know the genetic makeup of each individual so that we can establish a causal link between an individual organism’s

Genomic state, and
Phenotypic state
In the context of the environment in which it lives

Resequencing individuals and measuring the differences relative to a reference genomes lets us do just that. A group of individuals is a called a population, and as mentioned previously, project budgets are not decreasing because the real value of genomic data comes from measuring the biodiversity of a species relative to the reference genome. Population genomics is the next phase of genomics in which we can measure population biodiversity.

Next steps

Several next actions are possible now that the Cannabis genome data is available. Beyond simple descriptive analyses of strain relationships, more detailed descriptive work is needed, including:

Producing a more highly refined assembly of the genome itself. The Cannatonic assembly used to produce today’s results is an early draft. A better assembly allows higher precision in all subsequent use of the data.
Improving annotations of the genome. The Cannabis genus has many related species for which reference genomes have already been produced. These related data can be carried over to Cannabis to help annotate the genome — for example, identifying the location of genes and suggesting what the functional characteristics of those genes are.
Associating paired genomic/phenotypic observations and improving breeding and domestication processes. The data presented here are genomic only. No phenotypic data were made available as part of the public data releases used. One next step is to associate phenotypic data, such as a strain’s average time-to-flower, grams of flower yield, or biomass at X days. This allows promotion of genomic variants to QTLs with methods like GBLUP. This allows breeders to accelerate domestication of the plant using the QTLs in marker assisted selection with the plant equivalent of EPDs.
Reducing supply chain risk. Identifying specific genetic signatures associated with strains enables genetic identification of plant tissue. This in turn enables improvements in supply chain quality control.

Each of these next steps requires further bioinformatic analyses, all of which can be performed on Google Cloud, and many of which are greatly simplified by using the prepared data published to BigQuery today.

In summary: Cannabis is an important agricultural and medicinal plant. The recent release of data for a large population of Cannabis strains marks now as a good time to begin deeper bioinformatic analyses. Google Cloud provides high-performance tools to work with these data, such as BigQuery and Dataflow, including highly specialized tools via the Genomics API for working with population-scale genomic data.

Methods

FastQ data files were downloaded from National Center for Biotechnology Information’s Sequence Read Archive (NCBI SRA) with sra-tools for nine projects with Cannabis DNA read data available.

At the same time, reference genomes (for strains Cannabis sativa subsp. Cannatonic, LA Confidential, Chemdawg 91, and Purple Kush) were also downloaded, and all further analysis used the Cannatonic strain as a reference (GenBank MNPR01) for analysis. SRA FastQ primary data are available in Google Cloud Storage at gs://gcs-public-data — genomics/cannabis in the SRA-projects folder.

DNA reads for each sample were aligned to MNPR01 using the BWA mem algorithm. Aligned reads are available via the Google Genomics API as dataset ID 918853309083001239, and an additional duplicated subset of only transcriptome data as dataset ID 94241232795910911.

Aligned reads were then processed using the Freebayes algorithm to identify locations of possible genetic variants for each individual strain relative to the reference. Variant calls were made per-sample, and variant calls for samples are available via the Google Genomics API under dataset ID 17527604790083478309.

Finally, variants were exported to the BigQuery 1000 Cannabis Genomes: Public Dataset.

DNA Sequencing of 1000 Cannabis Strains publicly available in Google BigQuery

Written by Allen Day