Exome Part 2: Pair-end Sequence Assembly 

With BWA, Samtools, & Picard

The following post is a continuation of a series comparing sequence assembly strategies, the previous post in the series can be found @PetriDishTalk.

Burrows-Wheeler transform has, since the turn of the millennium, become a staple of trusted sequence assembly, while used in many data compression methods it has seen significant utility in NGS algorithms.

Great explanation of Burrows-Wheeler transform from Wikipedia

The Burrows-Wheeler Alignment Tool (BWA) will often show up in publications, and is included as part of the Broad Institute’s “Best Practices.” Thus making it a prudent choice for any alignment strategy. With BWA we must construct an index of our reference genome. Some careful thought should be given as to which reference one uses, as it will determine compatibility with downstream algorithms for variant detection, annotation, and other processes.

$bwa index -a bwtsw reference.fasta

The Genome Reference Consortium is a good place to get references, but they may not be the most compatible with the downstream tools one decides to use, making other choices i.e. UCSC hg19, more applicable.

Building the BWA reference index can take up to several hours

Note: on Mac OS X 10.8.2 the newest BWA releases, bwa-0.6.2, will build an index without error however this index will contain inconsistancies and missing files which will hinder analysis downstream. Consequentially, it is recommended on up-to-date OS X using bwa-0.5.9. Once the index is built, sequence alignment indices of our raw fastq files need to be built, fastq file structure is discussed in the previous post in this series.

$bwa aln reference.fasta short_pair_1.fastq > short_pair_1.sai
$bwa aln reference.fasta short_pair_2.fastq > short_pair_2.sai

We can now combine our reference index, the two alignment indices, and our two pair-end raw fastq files to create a single sequence alignment map.

$bwa sampe reference.fasta short_pair_1.sai short_pair_2.sai
short_pair_1.fastq short_pair_1.fastq > short_pair.sam

This step is the most resource/time intensive and the final output is an unsorted sam file, which must be converted to binary and then sorted using samtools as discussed in the previous post. Importantly, there are some quirks of a BWA alignment that must be dealt with using Picard. Firstly, BWA uses “*” as a flag for unaligned reads when “0″ is preferred/required. Second, mate-pair info must also be adjusted. Third, we may have to add read headers to our file; all of this can be done with Picard.

$java -Xmx[memory]g -jar ValidateSamFile.jar I=[.bam or .sam]
#displays MAPQ flag error, mate-pair error, and header error

$java -Xmx[memory]g -jar FixMateInformation.jar I=[.bam] O=[.bam] VALIDATION_STRINGENCY=LENIENT
#requires significant scratch disk space, recommend utilizing $ -Djava.io.tmpdir = /scratch_directory

$java -Xmx[memory]g -jar ValidateSamFile.jar I=[fixed.bam] IGNORE=INVALID_MAPPING_QUALITY
#validate file accepting MAPQ flag “*”

$java -Xmx[memory]g -jar AddOrReplaceReadGroups.jar I=[.bam] O=[.bam]
LB=anything PL=illumina PU=anything SM=anything

We will have to re-sort the Picard output one last time, and all of this will be worth it, as we will be left with a high quality assembly that meets stringent quality standards and interfaces well with variant discovery & annotation tools present in well-tested packages, i.e. the GATK.

This post is the part of a set providing initial documentation of a systematic comparison of various pipelines with a wide range of algorithms and strategies.