Coverage analysis from the command line

Concepts and simple tools to evaluate the sequence coverage of a short reads dataset aligned against a reference genome.

Published in

#!/ngs/sh

4 min readMar 13, 2018

Sequence coverage (or sequencing depth) refers to the number of reads that include a specific nucleotide of a reference genome. In the screenshot below a small region of the Human genome is shown in a genome browser, where the alignment of a re-sequencing experiment was loaded. Each gray arrow in the main area of the window represents a single read, while the highlighted area is a coverage plot, summarizing for each nucleotide how many times was sequenced in this experiment.

A screenshot of IGV showing a 0.5 kbp genome region. A BAM file with reads aligned is loaded as a track, and IGV will automatically plot a coverage track on top of the alignments.

If the sequencing was performed using paired ends or mate pairs, it’s possible to physical coverage is the number of times a base is read or spanned by paired ends or mate paired reads¹. The picture below shows the difference.

Sequence coverage (“cov”) and physical coverage (“phy”) shown for two positions of the genome (green bar).

Physical coverage plays a pivotal role in detecting mis-assembled regions: even if the sequence coverage track shows no interruptions, its possible that the physical coverage one will, thus hinting at a possible chimeric sequence:

An example of a **chimeric contig** identified plotting the *physical coverage track of a mate paired library.*

The above example, from my PhD thesis, was obtained in the early days of NGS when mixing technologies (Roche 454 for producing contigs, thanks to the longest reads and SOLiD Mate Paired library for scaffolding) was as common as painful².

Visualizing coverage from a BAM file

As shown in the introduction, IGV is a simple and effective tool to inspect an alignment file. You need the reference genome used to perform the alignment (load it via Genome → Load from file…) and the alignment file in BAM format (sorted and indexed).

The major limitation is that you can view up to 50 kbp with IGV, if you zoom out you’ll stop seeing aligments. This make IGV perfect for detailed inspection of small regions (you’ll see the alignment read by read!) but less useful for broader inspection.

Typical case: which regions have a very low (or no) coverage? Which ones have a very high coverage?

Extracting coverage information with samtools

There is a samtools subprogram, called depth, that calculates the sequence coverage at each position³:

samtools depth -a FILE.bam > FILE.txt

The output is a tabular three columns table: chromosome name, position, coverage. Here ten lines from an experiment performed on E. coli:

Bedtools genome coverage

Bedtools has an even better option to calculate coverage, as it produce a standard BED file.

bedtools genomecov -bga FILE.bam > FILE.coverage.bed

The output is a standard BED file: a tabular file with these columns: chromosome, start, end, coverage. The difference with the previous output is that adjacent bases with the same coverage will be collapsed in the same interval. Again, we can see an example:

If we want to quickly detect the zero coverage regions, we can exploit the fact that the last column indicates the coverage:

bedtools genomecov -bga FILE.bam | grep -w "0$" > FILE.0X.bed

Both samtools and bedtools can extract coverage information from a BAM file. Using text manipulation tools (like grep or awk) we can further filter the output they produce to select regions of interest.

Getting the regions covered within a coverage range

A BED file produced from the coverage profile of a BAM file

A quick tool to produce a coverage profile is covtobed⁴, that takes a sorted BAM file as input and will produce a BED file, optionally specifying to only prints those intervals having a coverage within a range (or only the regions with no coverage).

To install the tool you can use Miniconda:

conda install -y -c bioconda covtobed

An then the usage is as simple as:

covtobed -m 0 -M 1 input.bam > not_covered.bed

References and footnotes

¹ Meyerson (2010) doi:10.1038/nrg2841

² Telatin (2012)

³ The -a switch impose to print the coverage also when it’s null (0X).

⁴ Birolo G, Telatin A, (2020), Covtobed — a simple and fast tool to extract coverage tracks from BAM files (JOSS)