Common Alignment Tools

A small summary of common alignment tools used by bioinformaticians and how to get started

Image by Arek Socha from Pixabay

Alignment tools are central to bioinformatics. From assembly to database search and similarity calculation, you have probably come across some kind of aligner in your line of work. In this article, I will talk about the common alignment tools that I have been using.

BLAST

BLAST stands for Basic Local Alignment Search Tool. As the name suggests, it is mostly used for search. Local alignment is useful when we want to find where one sequence, or part of it, is contained in another, which is exactly the search use case.
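To make the idea of local alignment concrete, here is a minimal sketch of the classic Smith-Waterman dynamic programme. It is not what BLAST runs internally (BLAST is a heuristic that avoids filling the full matrix), and the function name and the scoring values (match, mismatch, gap) are assumptions picked only for illustration.

```python
def smith_waterman(query, target, match=2, mismatch=-1, gap=-2):
    """Return (best_score, query_span, target_span) of the best local alignment.

    Spans are 0-based, half-open (start, end) coordinates.
    Scoring values are illustrative assumptions, not BLAST defaults.
    """
    rows, cols = len(query) + 1, len(target) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        # Substitution score for aligning query[i-1] with target[j-1].
        return match if query[i - 1] == target[j - 1] else mismatch

    # Fill the matrix; scores never go below zero, which makes it "local".
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score reaches zero.
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sub(i, j):
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, (i, best_pos[0]), (j, best_pos[1])


score, query_span, target_span = smith_waterman("ACGT", "TTACGTTT")
print(score, query_span, target_span)
```

The short query is found inside the longer target, and the returned spans say where. That is the "containment" question that a search tool needs to answer.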

Applications

BLAST is mostly used as a database search tool because it is fast and its sensitivity can be tuned through parameters. The most common service built on BLAST is the NCBI search service (https://blast.ncbi.nlm.nih.gov/Blast.cgi). In research work, BLAST comes in handy when we want to perform taxonomic annotation or to label sequences with annotations from database sequences, such as plasmid origin or coding and non-coding regions, by searching against known strains. In the context of a database search, BLAST is extremely fast. However, when complete base-wise alignments of many reads against a reference are needed, it is better to switch to an aligner built for that task, such as BWA-MEM or Minimap2.

Algorithm Overview

  1. Remove low-complexity regions (tandem repeats and N bases for DNA) from the query sequence
  2. List the k-mers (words) of the query sequence (k=11 for nucleotides, k=3 for proteins). For proteins, also list the possible matching words for each k-mer and score them using a substitution matrix such as BLOSUM62
  3. Keep the high-scoring words from step 2, as decided by a specified threshold
  4. Scan the database for exact matches to these high-scoring words
  5. Extend each match outwards in both directions until the accumulated score starts to drop; extensions that score well enough are kept as high-scoring segment pairs (HSPs)

The algorithm is simple to explain and fast over a large search space. The database index usually contains k-mers with k=11 for nucleotide sequences and k=3 for protein sequences.
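The seed-and-extend part of the steps above can be sketched in a few lines of Python. This is a toy version under stated assumptions: it uses exact k-mer seeds only (no neighbourhood words or substitution matrix), ungapped extension, and made-up values for k, the match and mismatch scores, the drop-off limit and the reporting threshold. All function names are hypothetical and none of this is BLAST's actual code.

```python
from collections import defaultdict


def build_index(db, k):
    """Map every k-mer in the database to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in db.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def extend(query, target, q_pos, t_pos, k, match=1, mismatch=-2, x_drop=3):
    """Extend an exact k-mer hit without gaps in both directions.

    Extension in a direction stops once the running score falls more than
    x_drop below the best score seen so far.
    Returns (score, query_start, target_start, length).
    """
    # Extend to the right of the seed.
    score = best = k * match
    best_q_end = q_pos + k
    i, j = q_pos + k, t_pos + k
    while i < len(query) and j < len(target):
        score += match if query[i] == target[j] else mismatch
        i, j = i + 1, j + 1
        if score > best:
            best, best_q_end = score, i
        elif best - score > x_drop:
            break
    right_len = best_q_end - q_pos  # seed plus accepted right extension

    # Extend to the left of the seed, starting from the best score so far.
    score, best_left, steps = best, 0, 0
    i, j = q_pos - 1, t_pos - 1
    while i >= 0 and j >= 0:
        score += match if query[i] == target[j] else mismatch
        steps += 1
        i, j = i - 1, j - 1
        if score > best:
            best, best_left = score, steps
        elif best - score > x_drop:
            break
    return best, q_pos - best_left, t_pos - best_left, best_left + right_len


def find_hsps(query, db, k=4, min_score=6):
    """Return sorted (name, query_start, target_start, length, score) tuples."""
    index = build_index(db, k)
    hsps = set()  # different seeds often extend to the same segment pair
    for q_pos in range(len(query) - k + 1):
        for name, t_pos in index.get(query[q_pos:q_pos + k], []):
            score, q_start, t_start, length = extend(query, db[name], q_pos, t_pos, k)
            if score >= min_score:
                hsps.add((name, q_start, t_start, length, score))
    return sorted(hsps)


database = {"s1": "TTTTACGTACGGTTTT", "s2": "GGGGGGGG"}
print(find_hsps("ACGTACGG", database))
```

Most of the speed comes from the index lookup: only database positions that share an exact word with the query are ever extended.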

Installation

Compiled binaries or source files can be downloaded from here. To compile from source:

cd c++
./configure
cd ReleaseMT/build
make all_r

More information can be found here. See here for how to build your own database from sequence files.

BWA

BWA stands for Burrows-Wheeler Aligner. It is named after the Burrows-Wheeler Transform (BWT), which rearranges a sequence so that it is easy to compress and, with a little extra bookkeeping, to search. This is the key idea behind the popular alignment tool BWA-MEM, which indexes the reference with an FM-index built on the BWT and aligns reads against that index. You can read more on Heng Li’s GitHub.
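To give a feel for how a Burrows-Wheeler transform supports searching, here is a small sketch that builds the transform naively by sorting suffixes, then counts exact occurrences of a pattern with backward search. The function names are hypothetical. The construction is deliberately simple and would not scale to a genome; real tools such as BWA use far more compact and efficient data structures.

```python
from collections import Counter


def bwt(text):
    """Return (bwt_string, suffix_array) for text with a '$' terminator appended."""
    text += "$"  # '$' sorts before A, C, G, T and marks the end of the text
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT is the character preceding each suffix in sorted order.
    return "".join(text[i - 1] for i in suffix_array), suffix_array


def count_occurrences(bwt_string, pattern):
    """Count exact occurrences of pattern using backward search on the BWT."""
    # first_row_start[c] = number of characters in the text that are smaller than c.
    counts = Counter(bwt_string)
    first_row_start, total = {}, 0
    for ch in sorted(counts):
        first_row_start[ch] = total
        total += counts[ch]

    def occ(ch, i):
        # Occurrences of ch in bwt_string[:i]; a real FM-index precomputes this.
        return bwt_string[:i].count(ch)

    lo, hi = 0, len(bwt_string)  # current range of matching suffixes
    for ch in reversed(pattern):
        if ch not in first_row_start:
            return 0
        lo = first_row_start[ch] + occ(ch, lo)
        hi = first_row_start[ch] + occ(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


transformed, _ = bwt("ACGTACGT")
print(transformed)
print(count_occurrences(transformed, "ACG"))
```

Note that the search walks the pattern from its last character to its first, narrowing a range of suffixes at each step, and never scans the original text.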

Applications

BWA-MEM is commonly used for aligning short reads to reference genomes. This is a key step in reference-based assembly, for example of the human genome.

Installation

git clone https://github.com/lh3/bwa.git
cd bwa; make

Using BWA-MEM takes several steps. First, you need to index the reference genome. The tool is designed for short reads, and you can use either paired-end or single-end reads. The following commands are from the GitHub page, for your reference.

./bwa index ref.fa
./bwa mem ref.fa read-se.fq.gz | gzip -3 > aln-se.sam.gz
./bwa mem ref.fa read1.fq read2.fq | gzip -3 > aln-pe.sam.gz
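The commands above write their alignments in SAM format. As a quick sanity check, a sketch like the following can count how many alignment records are mapped. It relies on the SAM convention that the second column is a bit flag and that bit 0x4 marks an unmapped read. The function name and the two sample records are made up for illustration.

```python
def count_mapped(sam_lines):
    """Return (mapped, unmapped) counts of alignment records in SAM text lines."""
    mapped = unmapped = 0
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # skip header lines and blank lines
        flag = int(line.split("\t")[1])  # column 2 is the bitwise FLAG
        if flag & 0x4:  # bit 0x4 set means the read is unmapped
            unmapped += 1
        else:
            mapped += 1
    return mapped, unmapped


# Made-up records purely for illustration.
example = [
    "@HD\tVN:1.6",
    "r1\t0\tref\t1\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
print(count_mapped(example))
```

For the gzipped output above, the same function could be fed a file opened with Python's gzip.open("aln-se.sam.gz", "rt"). Keep in mind that this counts records, so a read with secondary or supplementary alignments is counted more than once.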

Minimap2

Minimap2 is my favourite alignment tool; it is very fast and versatile. It is robust to noise and handles much longer sequences. A few of the common use cases are as follows.

Applications

  1. Aligning long noisy reads to reference genomes
  2. Aligning reads to contigs to compute base-wise coverage
  3. Aligning all reads against themselves as a preliminary step for assembly and read correction
  4. Aligning reads to the assembly graph
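For the coverage use case, once the reads are aligned, the base-wise coverage of a contig is just the number of alignments overlapping each position. Here is a minimal sketch, assuming the alignments have already been reduced to 0-based, half-open (start, end) intervals on the contig. The function name and the input format are assumptions, not something Minimap2 produces directly.

```python
def base_coverage(contig_length, alignments):
    """Return a list with the number of alignments covering each base of a contig.

    alignments is an iterable of 0-based, half-open (start, end) intervals.
    """
    # Difference array: +1 where an alignment starts, -1 just past its end.
    diff = [0] * (contig_length + 1)
    for start, end in alignments:
        diff[start] += 1
        diff[end] -= 1

    coverage, depth = [], 0
    for change in diff[:contig_length]:
        depth += change  # running sum gives the depth at this base
        coverage.append(depth)
    return coverage


print(base_coverage(6, [(0, 4), (2, 6)]))
```

The difference array keeps the work proportional to the contig length plus the number of alignments, rather than touching every base of every alignment.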

Algorithm Overview

The algorithm is based on the idea of minimizers. A minimizer is the smallest k-mer in a window of w consecutive k-mers, under some fixed ordering (lexicographic in the simplest case). Because only minimizers are indexed and compared rather than every k-mer, the algorithm is fast, at a slight cost in sensitivity.

Minimap2 computes (k, w) minimizers for all the reference and query sequences. Matching minimizers that occur below a certain frequency in the set of references are called seeds and are used to anchor the alignment.
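Here is a small sketch of how (k, w) minimizers and seeds can be computed. It uses plain lexicographic ordering and a single strand, and the frequency cut-off max_occ is a made-up parameter. The function names are hypothetical, and an actual implementation such as Minimap2 differs in details like strand handling and k-mer ordering, so treat this as an illustration of the idea only.

```python
from collections import defaultdict


def minimizers(seq, k, w):
    """Return sorted (position, k-mer) pairs: the smallest k-mer of each window of w k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()  # neighbouring windows often pick the same minimizer
    for start in range(len(kmers) - w + 1):
        # Smallest k-mer in the window; ties go to the leftmost position.
        pos = min(range(start, start + w), key=lambda i: (kmers[i], i))
        selected.add((pos, kmers[pos]))
    return sorted(selected)


def seeds(reference, query, k, w, max_occ=2):
    """Return (query_pos, reference_pos) pairs of matching, not too frequent minimizers."""
    ref_index = defaultdict(list)
    for pos, kmer in minimizers(reference, k, w):
        ref_index[kmer].append(pos)
    return [
        (q_pos, r_pos)
        for q_pos, kmer in minimizers(query, k, w)
        for r_pos in ref_index.get(kmer, [])
        if len(ref_index[kmer]) <= max_occ  # skip minimizers that are too frequent
    ]


print(minimizers("GATTACA", k=3, w=2))
print(seeds("CCGATTACAGG", "GATTACA", k=3, w=2))
```

In this toy example every seed has the same offset between query and reference position, which is the kind of consistent chain an aligner looks for before doing base-level alignment.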

In my experience, the alignment may not be sensitive enough in certain scenarios, such as when I tried to align reads against reads. That is reasonable and is mentioned on the GitHub page. It is wise to use another aligner for such scenarios.

Installation

git clone https://github.com/lh3/minimap2
cd minimap2 && make

For extended use cases, you can refer to the original GitHub page.

I hope this will help someone who is a bit new to the field of bioinformatics. Thanks for reading.

I will introduce a few multiple sequence alignment and visualization tools in a future article.

Cheers!
