How to Use BLAST in Your Research
How to use bio-python based blast search to support research work
BLAST stands for Basic Local Alignment Search Tool which is one of the most popular search tools. Few example use cases are as follows.
- Search against a bacterial reference database
- Searching sequences in the PLASDB (Curated Plasmid Database)
Why do we need blast when there are many more fast aligners?. The main reasons are the sensitivity of BLAST and its ability to have sequences indexed for re-use. Furthermore, BLAST is mostly used on bacterial genomes due to its k-mer based indexing approach.
You can find a brief introduction and installation guide here;
Latest binaries are available here.
Installing BioPython
You can check the docs here. However, you can readily use one of the following commands to install biopython.
pip install biopython#ORconda install -c conda-forge biopython
Reading FASTA data
One of the baby steps in analysing biological sequences is reading the FASTA formatted sequences. For this, we can use biopython SeqIO API.
from Bio import SeqIOfor record in SeqIO.parse("PATH", "fasta"):
print(record.id)
print(record.description)
print(len(record.seq))
print(record.seq[:50])
The above code will iterate each of the FASTA record in the file. The print commands will output sequence id, description text, length of sequence record and first 50 characters of the sequence respectively. Here is a sample output for the first iteration of the FASTA file.
>>>
NC_009937.1
NC_009937.1 Azorhizobium caulinodans ORS 571, complete sequence
5369772
CATCCAGAGCGGCGGGGTCTTCTGCGGGAAGCCGTCCAGCACGTTCAGAA
Using BLAST
Before using BLAST we must prepare a database containing FASTA sequences to search. For this let’s consider a scenario where you’ll need to do a search against the Bacterial Genomes.
Assume I follow steps here, and have downloaded all the bacterial sequences from NCBI. For this, you should do some post-processing. For this article, let me skip that since your requirement might be searching against some other set of sequences. Now I have a file called bacterial.fasta
and want to create the database. You can use the following command to create the nucleotide database.
makeblastdb -in bacteria.fasta -title "Bacteria" -dbtype nucl
This will create the following database files;
Chroms.fasta.nal
Chroms.fasta.00.nhr
Chroms.fasta.00.nin
Chroms.fasta.00.nsq
Chroms.fasta.01.nhr
Chroms.fasta.01.nin
Chroms.fasta.01.nsq
These number at the end could continue to count as the number of sequences increase. This is a mechanism used by BLAST to chunk up the index for better RAM usage.
Once you do this, you should be able to use blast command as you’d do over the command line. Now let’s try to use this database in our python program for analytics.
Using BLAST in BioPython
As the first step, you should import the following dependencies from biopython.
from Bio.Blast.Applications import NcbiblastnCommandline
from Bio.Blast import NCBIXML, Record
from glob import glob
from Bio import SeqIO
from io import StringIO
import os
Now let’s build a function that’d search a given sequence in our BLAST database.
def search(query_path, database_path):
query_string = open(query_path).read() per_id = 95 blastn_cline = NcbiblastnCommandline(
db=database_path,
evalue=0.001,
outfmt=5,
perc_identity=per_id,
max_target_seqs=1,
num_threads=8) out, err = blastn_cline(stdin=query_string)
io_result = StringIO(out)
blast_records = list(NCBIXML.parse(io_result)) for blast_record in blast_records:
if len(blast_record.alignments) == 0:
continue
else:
alignment = blast_record.alignments[0]
title = alignment.title
length = alignment.length
query_id = blast_record.query
query_length = blast_record.query_letters print(title)
print(length)
print(query_id)
print(query_length)
Note that I have passed the fasta_path, but you can simply edit this to pass the sequence itself. However, I have limited the number of alignments to 1 and set the percentage identity to be 95 or 95%. These parameters can be changed to fit the requirement.
How this might help you in research
I have been using this to classify contig fragments into discrete bacterial bins using a bacterial BLAST database. So the first thing I’d say is think makes it a lot easy to obtain ground truth.
As a practice, I have bacterial genomes indexed to a BLAST database. Furthermore, PLASDB has compiled a BLAST database out of the shelf for you to search in a curated plasmid database.
The list could extend depending on your line of research. Note that I have completely omitted the protein and gene search topic.
I believe this could be a helpful article for budding bioinformatics. Happy reading. Cheers! :)