Coverage analysis from the command line
Concepts and simple tools to evaluate the sequence coverage of a short reads dataset aligned against a reference genome.
Sequence coverage (or sequencing depth) refers to the number of reads that include a specific nucleotide of a reference genome. In the screenshot below a small region of the Human genome is shown in a genome browser, where the alignment of a re-sequencing experiment was loaded. Each gray arrow in the main area of the window represents a single read, while the highlighted area is a coverage plot, summarizing for each nucleotide how many times was sequenced in this experiment.
If the sequencing was performed using paired ends or mate pairs, it’s possible to physical coverage is the number of times a base is read or spanned by paired ends or mate paired reads¹. The picture below shows the difference.
Physical coverage plays a pivotal role in detecting mis-assembled regions: even if the sequence coverage track shows no interruptions, its possible that the physical coverage one will, thus hinting at a possible chimeric sequence:
The above example, from my PhD thesis, was obtained in the early days of NGS when mixing technologies (Roche 454 for producing contigs, thanks to the longest reads and SOLiD Mate Paired library for scaffolding) was as common as painful².
Visualizing coverage from a BAM file
As shown in the introduction, IGV is a simple and effective tool to inspect an alignment file. You need the reference genome used to perform the alignment (load it via Genome → Load from file…) and the alignment file in BAM format (sorted and indexed).
The major limitation is that you can view up to 50 kbp with IGV, if you zoom out you’ll stop seeing aligments. This make IGV perfect for detailed inspection of small regions (you’ll see the alignment read by read!) but less useful for broader inspection.
Typical case: which regions have a very low (or no) coverage? Which ones have a very high coverage?
Extracting coverage information with samtools
There is a samtools subprogram, called depth, that calculates the sequence coverage at each position³:
samtools depth -a FILE.bam > FILE.txt
The output is a tabular three columns table: chromosome name, position, coverage. Here ten lines from an experiment performed on E. coli:
Bedtools genome coverage
Bedtools has an even better option to calculate coverage, as it produce a standard BED file.
bedtools genomecov -bga FILE.bam > FILE.coverage.bed
The output is a standard BED file: a tabular file with these columns: chromosome, start, end, coverage. The difference with the previous output is that adjacent bases with the same coverage will be collapsed in the same interval. Again, we can see an example:
If we want to quickly detect the zero coverage regions, we can exploit the fact that the last column indicates the coverage:
bedtools genomecov -bga FILE.bam | grep -w "0$" > FILE.0X.bed
Both samtools and bedtools can extract coverage information from a BAM file. Using text manipulation tools (like grep or awk) we can further filter the output they produce to select regions of interest.
Getting the regions covered within a coverage range
A quick tool to produce a coverage profile is covtobed
⁴, that takes a sorted BAM file as input and will produce a BED file, optionally specifying to only prints those intervals having a coverage within a range (or only the regions with no coverage).
To install the tool you can use Miniconda:
conda install -y -c bioconda covtobed
An then the usage is as simple as:
covtobed -m 0 -M 1 input.bam > not_covered.bed
References and footnotes
¹ Meyerson (2010) doi:10.1038/nrg2841
² Telatin (2012)
³ The -a
switch impose to print the coverage also when it’s null (0X).
⁴ Birolo G, Telatin A, (2020), Covtobed — a simple and fast tool to extract coverage tracks from BAM files (JOSS)