Obtaining sequences via BLAST in fasta format with geneCoverage2fasta

One of the common burdens for evolutionary biologists working on phylogeny reconstruction is supplementing newly sequenced data with sequences already available in GenBank. Here we present the second tool in a pipeline that automates large-scale phylogeny reconstruction. The tool is called geneCoverage2fasta, and it automatically retrieves the most represented sequences from the GenBank database via BLAST, in fasta format.

For simplicity, let’s consider the following case: you have sequenced the 5.8S ribosomal RNA and 18S ribosomal RNA products for several species from the plant family Balanophoraceae, and you would now like to:

  • evaluate which other Balanophoraceae species have been sequenced and for which genes
  • download these sequences from GenBank in fasta format and combine them with the de novo sequences
  • align the sequences and prepare a combined gene matrix for all obtained genes
  • reconstruct a phylogeny for the entire Balanophoraceae family

To keep this tutorial short, we cover only step 2 here; steps 1, 3 and 4 are discussed in other tutorials (so subscribe to our newsletter and check the Tutorial page roll).

Download sequences from GenBank in fasta format via BLAST

From the previous tutorial on the geneCoverage tool, we learned that the most frequently sequenced gene products are 18S ribosomal RNA, 28S ribosomal RNA and 5.8S ribosomal RNA. We have also prepared a small tab-separated text file (click to download) in which we specified reference sequences and gene product names. We need the reference sequence IDs because we are going to search the NCBI database for similar sequences using the BLAST algorithm.

Why is it not a good idea to search for genes or species by their names?

Because in many cases exactly the same gene or gene product is named differently by different submitters to the NCBI database. Consider 5.8S ribosomal RNA, for example: you can find it labelled as 5.8S rRNA, 5.8s rRNA, small subunit RNA, and so on. The only reliable way to obtain fasta sequences for the same gene or gene product is therefore to BLAST against a reference sequence.
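To see how quickly a name search breaks down, here is a small illustrative Python sketch that uses only the label variants mentioned above. The helper `exact_name_hits` is hypothetical and stands in for any exact, case-sensitive name lookup; it is not part of any NCBI tool.

```python
# Labels the text above lists for the same 5.8S gene product.
labels = [
    "5.8S ribosomal RNA",
    "5.8S rRNA",
    "5.8s rRNA",
    "small subunit RNA",
]


def exact_name_hits(query, names):
    """Return the labels that an exact, case-sensitive name search would find."""
    return [name for name in names if name == query]


hits = exact_name_hits("5.8S ribosomal RNA", labels)
missed = [name for name in labels if name not in hits]
print(f"found {len(hits)} of {len(labels)} labels; missed: {missed}")
```

Only one of the four labels matches, which is why a similarity search against a reference sequence is the safer route.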

We are also better off specifying a limiting taxon range, because we don’t want to download sequences that are similar enough to the reference but belong to a taxonomic group we are not interested in. Limiting the range also reduces the size of the downloaded files.
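For readers who want to try a taxon-limited BLAST search outside the InsideDNA interface, the following is a minimal sketch, not a description of how geneCoverage2fasta works internally. It assumes Biopython is installed and uses its `NCBIWWW.qblast` function with an Entrez query to restrict hits to one taxon. The helper names `taxon_filter` and `blast_within_taxon`, the `blastn` program and the `nt` database are illustrative choices.

```python
def taxon_filter(taxon):
    """Build an Entrez query term that limits hits to one taxon."""
    return f"{taxon}[Organism]"


def blast_within_taxon(reference_id, taxon, max_hits=50):
    """Submit a remote blastn search restricted to `taxon`.

    Assumption: Biopython is installed. The import is kept inside the
    function so the rest of the sketch runs without it.
    """
    from Bio.Blast import NCBIWWW  # third-party package: Biopython

    return NCBIWWW.qblast(
        "blastn",                          # nucleotide query vs nucleotide database
        "nt",                              # assumed target database
        reference_id,                      # accession of the reference sequence
        entrez_query=taxon_filter(taxon),  # keep only hits from this taxon
        hitlist_size=max_hits,
        format_type="XML",
    )


print(taxon_filter("Balanophoraceae"))
```

Calling `blast_within_taxon` requires network access and a real reference accession, so only the query-building step is executed here.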

1. Prepare a tab-separated file

In the previous tutorial, we evaluated the most covered genes and prepared a small tab-separated file that lists reference sequences and gene product names.
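A file of this kind can be read with a few lines of Python. The sketch below assumes two tab-separated columns, a reference sequence ID followed by a gene product name; that column order is an assumption, and the IDs are placeholders rather than real accession numbers. The function name `read_reference_table` is hypothetical.

```python
import csv
import io


def read_reference_table(handle):
    """Parse tab-separated lines into (reference_id, gene_product) pairs."""
    pairs = []
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2 or not row[0].strip():
            continue  # skip blank or incomplete lines
        pairs.append((row[0].strip(), row[1].strip()))
    return pairs


# Placeholder IDs: hypothetical values, not real accession numbers.
example = io.StringIO(
    "REFERENCE_ID_1\t18S ribosomal RNA\n"
    "REFERENCE_ID_2\t28S ribosomal RNA\n"
    "\n"
    "REFERENCE_ID_3\t5.8S ribosomal RNA\n"
)
table = read_reference_table(example)
for reference_id, gene_product in table:
    print(reference_id, gene_product, sep=" -> ")
```

To read a real file instead of the in-memory example, open it with `open("Balan_id.txt", newline="")` and pass the handle to the same function.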

2. Select the geneCoverage2fasta tool in My Tools

Log in to the InsideDNA application and navigate to the Balanophoraceae project in the My Tools section. There, click the Run tool button for the geneCoverage2fasta tool.

3. Specify settings for geneCoverage2fasta

In Tool settings, specify the following:

1) Enter Balanophoraceae as the reference taxon range to limit your BLAST search (we want only species from this group).

2) For the input tab-delimited file, click Browse, navigate to the folder Balanophoraceae_project and select the file Balan_id.txt.

If you don’t see this file, you may have forgotten to upload it in the File Manager (FM). You can always open an extra browser tab with FM, upload the file there and then click the Refresh button in miniFM; the file will then appear.

This way you never lose your current Tool settings if you have forgotten to upload or create something.

3) Specify an output directory to store the resulting files, and a prefix for the output files (there will be several). We will make a new folder called blast2fasta inside the Balanophoraceae_project folder. To create it, click the Plus button and make a new folder.

For the prefix, specify Balanophoraceae.

4) Select a queue, for instance the Small queue.

5) Click the Preview button, verify that the settings look as expected, and then click the Submit button.

4. Monitoring task progress

Monitor the progress of your task just as you did in the previous tutorial (tutorial 1). The task will finish in a couple of minutes; until then it is listed in the Running group.

Once done, the task moves to the Completed group, and we can verify that nothing went wrong by looking at the error log in the right panel.

5. Obtaining the files

Now let’s move to the File Manager (FM). Click on Files and navigate to the Balanophoraceae_project directory. Here you will see all the files associated with the task. There are 10 files in total: 5 files for each gene product (18S and 28S-5.8S).

Let’s first look at what is inside the files by clicking the preview button on the right.

6. Understanding the output

We have 5 files for each gene:

*.txt: a text file listing all species, with their reference IDs, that were placed into the fasta file

*_skip.txt: a text file listing all species, with their reference IDs, that were excluded from the fasta file (mostly duplicate species)

*.fasta: a fasta file containing all the sequences retrieved from NCBI, ready to be aligned

*.gb: a gb file containing the original, unprocessed dataset in GenBank format

*.xml: an xml file containing the original, unprocessed dataset in XML format
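If you download the output and want to sort it programmatically, the suffixes listed above are enough to tell the files apart. The following sketch is illustrative only: `classify_output` is a hypothetical helper, and the filenames are made-up examples built from the Balanophoraceae prefix, since the exact names the tool produces are not shown here.

```python
def classify_output(filename):
    """Map an output filename to its role, based on the suffixes listed above."""
    # "_skip.txt" must be checked before ".txt", since both end in ".txt".
    suffix_roles = [
        ("_skip.txt", "excluded species list"),
        (".txt", "included species list"),
        (".fasta", "sequences for alignment"),
        (".gb", "raw GenBank records"),
        (".xml", "raw XML records"),
    ]
    for suffix, role in suffix_roles:
        if filename.endswith(suffix):
            return role
    return "unknown"


# Hypothetical filenames, used only to exercise the function.
names = [
    "Balanophoraceae_18S.txt",
    "Balanophoraceae_18S_skip.txt",
    "Balanophoraceae_18S.fasta",
    "Balanophoraceae_18S.gb",
    "Balanophoraceae_18S.xml",
]
roles = {name: classify_output(name) for name in names}
for name, role in roles.items():
    print(f"{name}: {role}")
```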

We are mostly interested in the fasta file, because we can now align it and then merge it with other genes or gene products to obtain a large DNA matrix ready for phylogeny reconstruction.
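Before alignment, the downloaded sequences still have to be combined with your de novo sequences. One way to do this is sketched below in plain Python, under the assumption that both sets are ordinary fasta files and that a record is identified by its header line. The function names and the toy sequences are hypothetical, and keeping the first copy of a repeated header is a design choice of this sketch, not documented tool behaviour.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a fasta-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_fasta(*handles):
    """Combine records from several handles, keeping the first copy of each header."""
    merged = {}
    for handle in handles:
        for header, sequence in read_fasta(handle):
            merged.setdefault(header, sequence)
    return merged


def write_fasta(records, handle, width=60):
    """Write records as fasta, wrapping sequence lines at `width` characters."""
    for header, sequence in records.items():
        handle.write(f">{header}\n")
        for start in range(0, len(sequence), width):
            handle.write(sequence[start:start + width] + "\n")


# Toy records: hypothetical names and sequences, for illustration only.
downloaded = io.StringIO(">species_A\nACGT\nACGT\n>species_B\nGGCC\n")
de_novo = io.StringIO(">species_C\nTTAA\n>species_A\nCCCC\n")

merged = merge_fasta(downloaded, de_novo)
combined = io.StringIO()
write_fasta(merged, combined)
print(combined.getvalue())
```

Because the downloaded file is passed first, its copy of `species_A` wins over the de novo one; swap the argument order if your own sequences should take priority.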