Biopython to Retrieving Fasta Sequences From Ncbi Databases

The Biopython Project is an open-source collection of non-commercial Python tools for computational biology and bioinformatics. It contains a set of modules for different biological tasks, which include: sequence annotations, parsing bioinformatics file formats (FASTA, GenBank, Clustalw etc.), retrieving data from biological databases (NCBI, ExPASy), running bioinformatics tools from Python scripts by wrapper functions (Clustal, BLAST, SAMtools etc.), workingwith phylogenetics trees, 3D structures etc. In this tutorial we will show some basic functions of Biopython and how to run them on the InsideDNA platform. It won’t be hard; however, you need to have a basic knowledge of Python syntax in order to use it. For more detailed description of Biopython project tools follow the link: http://biopython.org/DIST/docs/tutorial/Tutorial.html. In this tutorial we will fetch biological (protein) sequences from NCBI databases, parse them and convert into different file formats.

Firstly, we need to open the terminal window by clicking “SHOW TERMINAL” button.

Then type the following commands “mkdir biopython” and “cd biopythonto” create and move into working directory.

Now we need to download amino acid sequences of tumor protein p53 from mammalian species (you can use any other combinations). Web interface for Entrez search system can be used, however, since we want to have a practice in Biopython it is better to write the python script. Type “nano main.py” (or any another text editor) to write our python script (nano manual: https://www.nano-editor.org/dist/v2.2/nano.html). Copy the following lines (which we are going to discuss now) one by one, save and exit the text editor (Cntrl + X).

from Bio import Entrez

This line imports Bio.Entrez module from Biopython (Bio). We will use several functions from this module to access the data from NCBI and get some information about available databases. All the following python commands you can run locally in the python command line (python) or you can create a new script with all these commands written one after another and run it locally (python main.py) or submit the script as a task:

isub -c 2 -r 7.5 -t [task_id] -e 'python /data/[your_username]/biopython/main.py'

In this tutorial we will write the output to standard output, so you can read it by the following command:

less ../tasks_out/[task_id]-[…].stdout and less ../tasks_err/[task_id]-[…].stderr for standard error file;

However you can write the output into another file (remember to specify the full path to the output file).

Entrez.email = "your@email" databases = Entrez.einfo()

print(Entrez.read(databases))

With Entrez.einfo() function we can download the list of NCBI databases in XML format. To make this list easier to read we use Entrez.read() function. Also you need to specify your e-mail.

For each of these NCBI databases we can get additional information about the following features:

db = Entrez.einfo(db="gene") descr = Entrez.read(db) print(descr["DbInfo"].keys()) print(descr["DbInfo"]["Count"])

print(descr["DbInfo"]["LastUpdate"])

Now after choosing relevant database let’s search for p53 proteins in the NCBI Protein database:

handle = Entrez.esearch(db="protein", term="tp53[gene] NOT partial AND mammals[ORGN]", retmax='100')

We are looking for “tp53” term in the gene names among mammals. To make our search results more relevant we also look for the terms without “PARTIAL” word in any field. The syntax is similar to the one which is used in Entrez search engine. “retmax” specifies the maximal number of findings. We are interested in the list of identifiers of our findings, so we create a string which contains all identifiers separated by comma and give it as an input for Entrez.efetch function. This function is responsible for downloading all corresponding entries.

handle2 = Entrez.efetch(db="protein", id=idlist, rettype="gb", retmode="text")

We download this data in GenBank bioinformatics format (rettype = “gb”) and output it as a text file (retmode=”text”) (not XML).

Now we can parse entries one by one (in the loop) with read() and write the information you are interested in, for example:

from Bio import SeqIO for seq_record in SeqIO.parse(handle2, "gb"): print(seq_record.id + ‘_’ + seq_record.annotations["source"])

print(len(seq_record)+ ‘_’ + seq_record.seq)

In this piece of code we import Bio.SeqIO module and use its functions for sequence objects annotation. SeqIO objects have an id (seq_record.id), annotation of the object with some metadata (seq_record.annotations) and sequence itself (seq_record.seq). We can call for GenBank fields (“source”, “date”, “topology” etc.) by using them as a keys for seq_record.annotations dictionary and print the length of amino acid sequence by len(seq_record) command.

We can save our SeqIO object in any reasonable format (FASTA in this example)

from Bio import SeqIO fafile = open('/data/[user_name]/biopython/tp53.fa', 'a') species = [] for seq_record in SeqIO.parse(handle2, "gb"): if len(seq_record.seq) < 500 and len(seq_record.seq) > 300: if seq_record.annotations["source"] and not seq_record.annotations["source"] in species: seq_record.id = seq_record.id + '_' + seq_record.annotations["source"] SeqIO.write(seq_record, fafile, "fasta")

species.append(seq_record.annotations["source"])

In this piece of code we iterate through all entries and filter out those which were already observed and those with unexpected length of sequence (probably, more than 1 protein in the entrie). We change the sequence identifier (add the specie name) to make it more convenient for further analysis and write this data with new identifiers in the FASTA file tp53.fa.

If you want to write annotation object in one file format or another (FASTA or GenBank) it is quite easy to switch them by the following commands:

SeqIO.write(SeqIO.parse("genbank_file.gbk", "genbank"), out_fasta, "fasta") or SeqIO.write(SeqIO.parse("fasta_file.fa", "fasta"), out_genbank, "genbank")

Finally, we can use Bio.ExPASy module to search and access data in ExPASy database. The commands are different than ones of Bio.Entrez module, however, the syntax is similar. In the nutshell one can use the following code to download data from SwissProt database:

from Bio import ExPASy from Bio import SwissProt records = [] for accession in accession_list: handle = ExPASy.get_sprot_raw(accession_list) record = SwissProt.read(handle)

records.append(record)


Originally published at insidedna.me.