How to Use BLAST in Your Research

How to use Biopython-based BLAST search to support research work


NCBI Blast Page

BLAST stands for Basic Local Alignment Search Tool, one of the most popular sequence search tools. A few example use cases are as follows.

  • Searching against a bacterial reference database
  • Searching for sequences in PLSDB (a curated plasmid database)

Why do we need BLAST when there are many faster aligners? The main reasons are BLAST’s sensitivity and its ability to index sequences into a database for reuse. Furthermore, BLAST is widely used on bacterial genomes, partly because of its k-mer based indexing approach.

You can find a brief introduction and installation guide here.

Latest binaries are available here.

Installing BioPython

You can check the docs here. However, you can readily install Biopython with either pip (pip install biopython) or conda (conda install -c conda-forge biopython).

Reading FASTA data

One of the first steps in analysing biological sequences is reading FASTA-formatted sequences. For this, we can use the Biopython SeqIO API.
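Here is a minimal sketch of reading FASTA records with SeqIO. To keep it runnable as-is, it parses a small in-memory FASTA string whose two records are made up for illustration; for a real file you would pass the file path to SeqIO.parse instead.

```python
from io import StringIO

from Bio import SeqIO

# Tiny in-memory FASTA so the example runs as-is.
# Both records are hypothetical and only serve as an illustration.
example_fasta = StringIO(
    ">seq1 hypothetical example record\n"
    "ATGCGTACGTTAGC\n"
    ">seq2 another hypothetical record\n"
    "GGGCCCTTTAAA\n"
)

records = []
# For a real file, pass its path instead, e.g. SeqIO.parse("sequences.fasta", "fasta")
for record in SeqIO.parse(example_fasta, "fasta"):
    print(record.id)           # sequence identifier (first word of the header)
    print(record.description)  # full header text
    print(len(record))         # length of the sequence
    print(record.seq[:50])     # first 50 characters of the sequence
    records.append(record)
```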

The above code will iterate over each FASTA record in the file. The print statements output the sequence ID, the description text, the length of the sequence record and the first 50 characters of the sequence, respectively. Here is a sample output for the first iteration of the FASTA file.

Using BLAST

Image by mohamed Hassan from Pixabay

Before using BLAST, we must prepare a database containing the FASTA sequences to search. For this, let’s consider a scenario where you need to search against bacterial genomes.

Assume I have followed the steps here and downloaded all the bacterial sequences from NCBI. The download needs some post-processing, which I will skip in this article since you might want to search against some other set of sequences. Now I have a file called bacterial.fasta and want to create a database from it. You can use the following command to create the nucleotide database.
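As a sketch, the database can also be built from Python by calling the makeblastdb tool that ships with BLAST+. The database prefix "bacterial" and both helper function names are my own assumptions for illustration; the flags -in, -dbtype and -out are standard makeblastdb options.

```python
import shutil
import subprocess


def build_makeblastdb_command(fasta_path, db_name):
    """Return the makeblastdb argument list for a nucleotide database."""
    return [
        "makeblastdb",
        "-in", fasta_path,   # input FASTA file
        "-dbtype", "nucl",   # build a nucleotide database
        "-out", db_name,     # prefix for the generated database files (assumed name)
    ]


def make_nucleotide_db(fasta_path, db_name):
    """Run makeblastdb; raises CalledProcessError if the tool reports a failure."""
    if shutil.which("makeblastdb") is None:
        raise FileNotFoundError("makeblastdb not found on PATH; install BLAST+ first")
    subprocess.run(build_makeblastdb_command(fasta_path, db_name), check=True)


# Equivalent shell command for the file used in this article.
command = build_makeblastdb_command("bacterial.fasta", "bacterial")
print(" ".join(command))

# To actually build the database (requires BLAST+ and the FASTA file):
# make_nucleotide_db("bacterial.fasta", "bacterial")
```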

This will create the following database files:

The numbers at the end keep counting up as the number of sequences increases. BLAST uses this mechanism to split the index into chunks for better RAM usage.

Once you do this, you should be able to use the blast commands as you normally would on the command line. Now let’s try to use this database in our Python program for analytics.

Using BLAST in BioPython

http://biopython.org/DIST/docs/tutorial/Tutorial.html

As the first step, you should import the following dependencies from biopython.

Now let’s build a function that’d search a given sequence in our BLAST database.
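The following is a minimal sketch of such a function, including the imports it needs. It runs blastn through subprocess and parses the XML output with Biopython's NCBIXML module. I went with subprocess because, as far as I know, the command-line wrappers in Bio.Blast.Applications are deprecated in recent Biopython releases. The function names, the default database prefix "bacterial" and the returned tuple layout are assumptions made for illustration.

```python
import subprocess
from io import StringIO

from Bio.Blast import NCBIXML


def build_blastn_command(fasta_path, db_path, num_alignments=1, perc_identity=95):
    """Return the blastn argument list for an XML-formatted search."""
    return [
        "blastn",
        "-query", fasta_path,                    # FASTA file holding the query sequence(s)
        "-db", db_path,                          # database prefix created by makeblastdb
        "-outfmt", "5",                          # XML output, readable by NCBIXML
        "-num_alignments", str(num_alignments),  # keep only the top alignment(s)
        "-perc_identity", str(perc_identity),    # minimum percent identity
    ]


def search_sequence(fasta_path, db_path="bacterial", num_alignments=1, perc_identity=95):
    """Run blastn and return (query, hit title, identities, alignment length) tuples."""
    command = build_blastn_command(fasta_path, db_path, num_alignments, perc_identity)
    result = subprocess.run(command, capture_output=True, text=True, check=True)

    hits = []
    # One BLAST record is produced per query sequence in the FASTA file.
    for blast_record in NCBIXML.parse(StringIO(result.stdout)):
        for alignment in blast_record.alignments:
            for hsp in alignment.hsps:
                hits.append(
                    (blast_record.query, alignment.title, hsp.identities, hsp.align_length)
                )
    return hits


# Example usage (requires BLAST+ on PATH and an existing database):
# for query, title, identities, length in search_sequence("contig.fasta"):
#     print(query, title, identities / length)
```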

Note that I have passed the fasta_path, but you can simply edit this to pass the sequence itself. I have also limited the number of alignments to 1 and set the minimum percentage identity to 95%. These parameters can be changed to fit your requirements.

How this might help you in research

I have been using this to classify contig fragments into discrete bacterial bins using a bacterial BLAST database. So the first thing I’d say is that this makes it a lot easier to obtain ground truth.

As a practice, I keep bacterial genomes indexed in a BLAST database. Furthermore, PLSDB provides an off-the-shelf BLAST database that lets you search its curated plasmid collection.

The list could extend depending on your line of research. Note that I have completely omitted the protein and gene search topics.

I believe this could be a helpful article for budding bioinformaticians. Happy reading. Cheers! :)
