CNVs (Copy Number Variants)— Context, detection methods and exploratory data analysis with Python

Pubudu samarakoon

Published in

Into the genomics

12 min readJan 24, 2019

In this article, I’ll be providing a detailed view on Copy Number Variants (CNVs) in three main sections.

Introduction: In the first section, I’ll discuss what a CNV is and why it is essential to study these variants.
CNV detection methods: Here, I’ll provide an overview of how to identify CNVs using experimental and computational techniques.
Visualisation of CNVs using python: Today, computational CNV prediction is becoming a popular variant detection method. Therefore in this third section, I’ll demonstrate how to analyse and visualise computationally predicted CNVs to help identify systematic biases in results.

1. What are genetic variants?… What are CNVs?… Why study CNVs?…

1.1 Variants in the human genome

These are the genetic differences between individuals, within and among populations. There are three broad categories of genetic variants: sequence variants-changes in DNA sequence, structural alterations, and modifications in chromosome number.

1.2 Copy Number Variant (CNV)

CNVs are a type of structural alterations that affect genomic regions ranging from one kilobase (Kb) to several megabases (Mb). These large-scale genetic variants can occur throughout the genome and alter the structure. For example, in the human genome, there are 23 pairs of chromosomes. Two chromosomes in these 23 pairs are from each parent. Therefore historically, regions in a genome of a healthy individual were thought to occur in two copies — one copy from the mother and the other from the father. However, with the identification of copy number variants, scientists found out that the number of copies of genomic regions can vary between individuals in a population. Therefore due to CNVs, some individuals may have more (or fewer) copies of genomic regions compared to others in a population.

CNVs can change the number of copies of a specific region either by deleting or duplicating these regions. These CNVs are known as deletions (losses) or duplications (gains).

Deletion (loss) or duplication (gain) of the chromosomal region — C

Although CNVs are primarily known as the variants that change the number of copies, inversions — another type of CNVs, only change the orientation of genomic regions (not the number of copies). However, the vast majority of methods used in current CNV detection protocols were developed to detect deletions and duplications. Therefore this article is mainly focussed on deletions and duplications.

1.3 Why study CNVs?

CNVs can cause a broad range of diseases including Mendelian diseases (single-gene diseases), complex diseases (diseases due to many genes) and cancer. These diseases can occur in different ways. For example, deletion of an entire gene or part of a gene can disrupt the function of a gene and cause diseases. CNVs that affect multiple genes can also cause diseases by changing the structure of these genes (e.g. fusion genes). Moreover, CNVs cause diseases by affecting genomic interactions that are critical to a gene function (e.g. deleting or duplicating enhancers or promoters that could affect their interactions). Additionally, by altering the number of copies of genes, CNVs can change the level of their products and cause diseases.

2. CNV detection methods

2.1 Classical methods used to detect CNVs

Since the advent of genetic testing in clinical laboratories, various techniques have been introduced to detect CNVs. Comparative Genomic Hybridization (CGH) was one of the earliest techniques used to detect genome-wide CNVs (CNVs spread throughout the genome). CGH uses two different fluorescent stains to label test and reference (normal) DNA samples. Then these differentially labelled samples are co-hybridised to normal human metaphase chromosomes. Next, changes in the fluorescence intensity ratios between test-reference samples are used to identify the regions of chromosomal losses and gains. Chromosomal losses are the regions with decreased intensities (test-specific fluorescence), while chromosomal gains are the regions with increased intensities. CGH enabled the detection of genome-wide CNVs, yet the resolution is limited to CNVs > 5Mb. Therefore, the CGH method was further improved using microarrays to detect genome-wide CNVs in high-resolution.

Ref: DOI: 10.1002/9780470015902.a0002651.pub2

Microarray-based methods implement the same principle used in CGH techniques. For example, they use differently labelled test and reference DNA samples as starting materials and detect CNV regions by examining changes in fluorescence intensity ratios (between test-reference samples). However, in contrast to normal metaphase chromosomes used in CGH, microarrays contain probes printed on glass slides. These probes are spread throughout the genome and can hybridise to specific DNA segments. First generation microarray platforms used few thousand probes (large-insert clones — bacterial artificial chromosomes) and had a resolution of 1Mb-3Mb. Next wave of arrays contained millions of short probes to improve the CNV detection resolution. There are two main types of microarrays — SNP-arrays and CGH-arrays (aCGH). SNP-arrays use probes that target single nucleotide changes (polymorphisms) — SNPs to tag CNVs. aCGH uses ~60bp long oligonucleotide probes as array elements. Both these arrays can contain 1-4 million probes to detect CNVs in high-resolution (e.g. 1Kb-5Kb resolution).

Microarrays offer a reliable, efficient and scalable method to detect CNVs in both clinical and research setting. Today, commercial services like Agilent, Nimblegen, Affymetrix and Illumina provide a broad range of array platforms for various purposes. Additionally, it is possible to custom design arrays for scientists with specific requirements. However, the sensitivity and precision of these platforms vary greatly depending on their design (the spread of probes across the genome). Therefore differences in CNV detection sensitivities in different platforms is considered as the main drawback of these arrays. For example, a study has shown that large CNVs (>1Mb) were detected with high accuracy even in arrays with different probe distributions, but these arrays have shown variable efficiencies in detecting short CNVs. Scientists have developed next-generation sequencing-based methods to predict CNVs to address this issue.

2.2 CNV prediction using next-generation- sequencing (NGS) based methods

Next Generation Sequencing (NGS)

NGS allows to sequence DNA and RNA rapidly and cost-effectively than the previously used Sanger sequencing method. NGS has revolutionised the field of genomics and molecular biology. NGS is also known as high-throughput- sequencing and represents many different technologies. Illumina sequencing by synthesis is the most commonly used technology today.

NGS Terminology
Fragmentation:
Typical sequencing experiment involves fragmentation of the genome into millions of molecules. These DNA molecules are known as DNA fragments. Then these fragments are sequenced to produce a set of reads.
Read:
A read is a sequence of base pairs corresponding to all or part of a single DNA fragment.
Read mapping or alignment:
After generating a set of reads, these are aligned to the reference genome of a target organism. For example, if the sequencing process generates a set of reads from a patient genome, these are aligned (or mapped) to the human reference genome.
Depth/Coverage (Depth-of-coverage):
Fragmentation and sequencing process generates sets of overlapping reads covering the entire genome or exome. Then the mapping process aligns these reads to the reference genome so that each set of reads maps to its’ relevant genomic region. Coverage (or depth) is the number of reads that overlap with a specific region in the reference genome. In other words, it the number of time a region in the reference genome is covered during the sequencing process.
Read mapping or alignment process provides various measures that could help detect sequencing artefacts. Coverage calculation step uses these measures to filter-out sequencing artefacts. For example, mapping process considers reads with the same chromosomal start and end positions (100% overlap) as sequencing artefacts (duplicated reads) and reads that map to more than one genomic regions as low-quality reads. Coverage calculation step removes all the duplicated and low-quality reads and counts the number of unique reads that map to a given region in the reference genome. Deep sequencing refers to the general concept of aiming for a high number of unique reads of each region of a sequence.

https://www.illumina.com/documents/products/illumina_sequencing_introduction.pdf

With the introduction of NGS, whole genomes could be rapidly characterised at single base pair resolution. The continuous development of these technologies lead to the rapid decrease in sequencing cost, and therefore massive amounts of sequence data were generated with ever-decreasing effort. In 2010–2011, when whole-genome sequencing was not available at an affordable price, exome sequencing was introduced as a cost-effective approach that sequence coding regions of the human genome.

Whole-genome sequencing is the process of determining the entire DNA sequence of an organism’s genome.
Exome sequencing is the process of determining the complete sequence of all the protein-coding regions in a genome.

When exome sequencing was first introduced, it was estimated that ~85% of clinically relevant (disease-relevant) variants were in exonic regions. While this figure is likely an overestimate due to the scarcity of data on disease causality, scientists still expect that the majority of causal variants to be in the exome. Since the introduction, exome sequencing has been a useful genetic diagnostic test that revolutionised the detection of a broad range of disease-relevant variants (novel, de novo — present for the first time in one family member and rare — less than 1% of the population variants).

Predicting CNVs in exome sequence data

Since the introduction of exome sequencing technologies, various CNV prediction programs have been developed to detect CNVs in exomes. These programs use different strategies to predict CNVs. Today, depth-of-coverage based method is the most commonly used strategy in predicting CNVs in exomes.

In NGS, copy number (CN) variable genomic regions (CN≠2) generate higher or lower read counts compared to copy number neutral regions (CNV=2). Here, duplicated regions generate higher read counts (higher coverage) compared to CN neutral regions while deleted regions generate lower read counts (lower coverage). Therefore when calling CNVs, prediction programs expect that the coverage is proportional to the copy number of the region.

In depth-of-coverage based CNV prediction, first, a control dataset is created from a set of target exomes. This control dataset represents the depth-of-coverage distribution in a copy number neutral sample. Then the prediction program compares each exome to the control and looks for regions with increased or decreased coverage. Duplications (amplification) are the regions with increased coverage and deletions are the regions with decreased coverage.

Schematic representation of CNV prediction in exomes

With the increasing popularity of exome sequencing as a diagnostic tool, a large number of detection programs have been developed to call CNVs in exomes. Word-cloud in this figure lists commonly used prediction programs (font size represents the google citation count). These programs have different strengths and weaknesses, and therefore selecting a program has become a challenging task. I have published scientific articles discussing these matters and presenting a way to improve the disease-relevant CNV detection. Following is a list of critical points discussed in these articles.

CNV prediction programs are better at predicting long CNVs (containing 4 or more CNVs)
These programs show a high false positive rate when predicting short CNVs
ExCopyDepth is optimised to predict short CNVs with high accuracy (we have used this to predict disease-causing single exon heterozygous deletions)
When detecting rare CNVs, ExomeDepth showed the highest sensitivity and CONIFER showed the lowest false positive rate
cnvScan can be used effectively to reduce false positive counts based on CNV quality scores
cnvScan provides a broad range of information that can improve the functional and clinical interpretation of predicted CNVs
Computationally predicted CNVs need to be validated using a relevant microarray when detecting a disease-causing variant

3. Exploratory CNV analysis with python

After predicting, annotating and filtering CNVs from a group of patients, we manually examine variants containing genes that are relevant to a particular disease. These disease-relevant genes are commonly known as candidate genes. However, before the manual examination, we need to analyse CNVs further and identify various biases in the results. This step is also useful to prioritise the most relevant gene set in the dataset especially when a long list of candidate genes is not available.

Access UCSC add custom track and upload the bed file as shown here

Here we’ll discuss how to visualise CNVs using publically available ExAC dataset. You can access the FTP site and download the “exac-final.autosome-1pct-sq60-qc-prot-coding.cnv.bed” file. This file lists all the high-quality deletions and duplications identified in the ExAC project. Since this file is not a tab-delimited file, we can either write a script or use UCSC table browser to extract ExAC deletions and duplications.

CNV analysis with python

This GitHubGist provides a detailed description of how to explore quality-scores, length and genes in ExAC deletions and duplications. The main steps in this script are as follows.

Read ExAC deletions and duplications into a pandas dataframe. Then, delete columns that are not relevant to the analysis.

exac = pd.concat([pd.read_table("ExAC_deletions.bed", names=['chr', 'start', 'end', 'population', 'score', 'strand', 'x1', 'x2', 'x3'], skiprows=1).assign(cns="del"),
 pd.read_table("ExAC_duplications.bed", names=['chr', 'start', 'end', 'population', 'score', 'strand', 'x1', 'x2', 'x3'], skiprows=1).assign(cns="dup")]) ## In [2]exac = exac.drop(columns=['x1', 'x2', 'x3']) ## In [4]

Here we read ExAC_deletions.bed and ExAC_duplications.bed files into two pandas dataframes and merge them using pandas concat function.

Visualise deletion and duplication counts in the dataset

We can use the merged dataframe from the previous stage to count and visualise the number of the deletion and duplication in each population (with seaborn countplot). As you can see, the majority of the CNVs are from Non-Finnish Europeans (ExAC-NFE) and duplications are overrepresented in each population.

Analysis of CNV quality-score

CNV quality score provides statistical support for the prediction. In other words, this provides a measure of how likely a CNV called incorrectly. Different prediction programs use different strategies to calculate CNV scores. For example, ExomeDepth and ExCopyDepth use log-likelihood ratio, ExomeCopy uses log odds ratio, XHMM use Phread scaled likelihood and CoNIFER do not report the score (normalised SVD score). After calling CNVs using the preferred program, we need to analyse the quality of the predictions to study systematic biases in the results.

Above figure shows the distribution of ExAC deletions and duplications using density plot, cumulative distribution plots and box plot. We can use these plots to study various aspects of quality scores of the CNVs in ExAC dataset.

Study the distribution of CNV length

Similar to CNV quality scores, we can study the distribution of CNV length using seaborn plotting functions. CNV length varies from ~100bp — ~35mb. Therefore CNV length density plots are highly skewed (In[21]). We can avoid that by plotting the length in log10 scale (e.g. In [26] and In [30]).

Visualise the distribution of CNVs after binning them based on their quality scores and length

Binning CNVs based on quality scores:

$ score_bins = list(range(min(exac['score']), max(exac['score']+10), 10 ))
$ score_bins
> [60, 70, 80, 90, 100]$ exac['s_bin'] = pd.cut(exac['score'], bins=score_bins, labels=score_bins[:-1])

Binning CNVs based on length:

$ len_bins = list(range(int(min(exac['Len_inKb'])), 10000, 1000 ))
$ len_bins[:5]
> [0, 1000, 2000, 3000, 4000]$ exac['l_bin'] = pd.cut(exac['Len_inKb'], bins=len_bins, labels=len_bins[:-1])

Figures show CNV quality score and length distribution in different bins. Here, CNV length (log 10 scale) do not vary in different quality score categories, but quality score distribution varies greatly when CNVs are in different length categories. For example, CNV score varies from 60–100 when deletions and duplications are in 0–10kb bin. However, duplications are overrepresented in bins over 7000kb (7mb) and have high CNV scores.

Identification of top 25 most affected genes in the dataset

Finally, we can identify the most affected genes in the CNV dataset. Here we identify the genes internal to each CNV and count the number of times each gene appear in CNV calls. These gene counts are then used to select the top 25 most affected genes in the CNV dataset.

We can follow the steps discussed in my post on “Annotating gene coordinates and gene lists — python way” to identify the genes in the dataset. Then we can use seaborn barplot to visualize most affected genes.

After exploratory analysis of CNV predictions, we can study the functional and clinical effects of these variants using annotations provided by programs like cnvScan. In my scientific article on cnvScan, I have provided a detailed account of techniques used to interpret CNVs.

PyScripts

Jupyter notebook with the python script