Aligning multiple sequences with MAFFT for reconstruction of large phylogenies
When researchers need to reconstruct a relatively large phylogeny from multiple genes (e.g. sequenced de novo and obtained from the NCBI database), there are typically two important steps: multiple sequence alignment, and merging the genes into a single large matrix while preserving both species order and alignment order. Here we present a simple way to align sequences for multiple genes and combine the aligned genes into a coherent DNA matrix for phylogeny reconstruction.
For simplicity, let’s consider the following case: you have sequenced several species from the Balanophoraceae plant family for 5.8S ribosomal RNA and 18S ribosomal RNA products and would now like to:
- evaluate which other Balanophoraceae species have been sequenced and for which genes
- download these sequences from GenBank in fasta format and combine them with the de novo sequences
- align the sequences and prepare a combined gene matrix for all obtained genes
- reconstruct a phylogeny for the entire Balanophoraceae family
We have obtained fasta files for 18S ribosomal RNA, 28S ribosomal RNA and 5.8S ribosomal RNA with the help of the geneCoverage and geneCoverage2fasta scripts. Now we need to align the sequences and prepare a combined gene matrix for all obtained genes.
Align multiple sequences with MAFFT
We are now going to align the sequences in both files we obtained in the earlier steps (tutorial 1 and tutorial 2). If you have not done these tutorials, download all the fasta sequences here. We will use MAFFT, one of the most popular tools for sequence alignment. MAFFT offers various algorithms tailored to sequences that differ in length, DNA composition, number of indels, etc.
1. Add MAFFT to your Balanophoraceae project.
Log in to the InsideDNA application (or sign up if you have not done so yet) and navigate to the Balanophoraceae project in the My Tools tab. If you followed tutorial 1 and tutorial 2, two tools are already present there: geneCoverage and geneCoverage2fasta.
If you haven’t done the previous tutorials, create a new project by clicking on + Add new project and naming it Balanophoraceae.
Now, type MAFFT into the search field. Click on the add button and choose the Balanophoraceae project in the dropdown list.
The MAFFT tool should appear in your project.
2. Initialize a task
Now we are going to initialize the MAFFT tool for multiple sequence alignment. First, click on the Run tool button. This opens the Tool Settings menu for MAFFT. Here you need to specify the Task name, the tool parameters and the queue. Then you preview the task and submit it.
Specify a task name that will be easy to recognize later on, for instance Balan_18s_mafft, as we are first going to align 18S ribosomal RNA. Next, select the input file, which is the fasta file we obtained with geneCoverage2fasta in Tutorial 2. Click on Browse, navigate into the Balanophoraceae_project/blast2fasta folder and select the Balanophoraceae_rRNA18S.fasta file.
Choose a name for the aligned output file, for example Balanophoraceae_18S_align.fasta. I have also created a separate folder, aligned, to store the MAFFT output. You can leave the rest of the parameters at their defaults.
Select a queue that is not very powerful, since we have a relatively small dataset, and click on the Task preview button to verify the submission.
Then click on Submit, and choose to Stay on the current settings.
Now we are going to submit the second task, for the 28S-5.8S ribosomal RNA fasta (our second gene product). Simply modify the settings as follows:
For Task name — change to Balan_2858s_mafft
For Input file — select Balanophoraceae_58S_align.fasta
For Output file — Balanophoraceae_2858s_align.fasta
Submit this task and go to Task monitoring.
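For readers who also have MAFFT installed on their own machine, the same alignment can be sketched from a script. This is only an illustration under assumptions: it expects a mafft executable on the PATH, and it uses the --auto option, which lets MAFFT pick a strategy for the dataset and may not match the defaults applied by the InsideDNA form. The function names are hypothetical.

```python
import subprocess


def build_mafft_command(input_fasta):
    """Return the argument list for a MAFFT run with automatic strategy choice.

    Assumption: --auto is used here for illustration; the InsideDNA
    defaults may differ.
    """
    return ["mafft", "--auto", input_fasta]


def run_mafft(input_fasta, output_fasta):
    """Align input_fasta and write the alignment to output_fasta.

    MAFFT prints the alignment to standard output, so the output
    stream is redirected into the target file.
    """
    with open(output_fasta, "w") as out:
        subprocess.run(build_mafft_command(input_fasta), stdout=out, check=True)
```

Calling run_mafft("Balanophoraceae_rRNA18S.fasta", "Balanophoraceae_18S_align.fasta") would then correspond to the first task above.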
3. Monitoring task progress.
Just as in the previous tutorials, monitor the progress of your tasks. They should finish within a couple of minutes. Once done, they are moved to the Completed group, and you can verify that nothing went wrong by looking at the error log in the right panel. If you encounter any errors, please submit a bug report.
4. Obtaining the files
Now let’s move to the File Manager (FM). Click on Files and navigate into the Balanophoraceae_project directory. Here you will see all files associated with the tasks, along with the newly aligned fasta files. Preview the contents of a file by clicking on the preview button on the right.
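Beyond a visual preview, a downloaded alignment can be sanity-checked with a few lines of Python: in an aligned fasta file every sequence should have the same length, gaps included. The sketch below is a minimal parser written for illustration (the function names are hypothetical), not part of InsideDNA or MAFFT.

```python
def read_fasta(lines):
    """Parse fasta lines into a dict mapping header -> sequence."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            # Sequences may be wrapped over several lines; join them.
            records[name] += line
    return records


def alignment_length(records):
    """Return the shared sequence length; raise if lengths differ or no records exist."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError(f"expected one common length, found: {sorted(lengths)}")
    return lengths.pop()
```

With a local copy of the file, alignment_length(read_fasta(open("Balanophoraceae_18S_align.fasta"))) returns the number of alignment columns, or raises an error if the file is not a proper alignment.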
5. Combining aligned files into a single matrix
We are going to use SequenceMatrix to combine sequences of different genes into a single matrix. It is only available as a graphical interface tool, so you will need to install it locally.
Download the aligned fasta files to your local machine.
Install SequenceMatrix, then launch the SequenceMatrix GUI by clicking on the jar file.
Select Import in the main menu and click on Add sequences.
Select the Balanophoraceae_18S_align.fasta file. When asked to recode external gaps, say Yes to all.
The first dataset is now loaded.
Now, add the Balanophoraceae_2858s_align.fasta file in the same way. SequenceMatrix merges the two fasta files by species name, and you will see that the total length for certain species has increased.
Now, let’s export the matrix into a format from which a phylogenetic tree can be reconstructed. Choose the RAxML format and name the file Balanophoraceae_src.phy
Save the file in your working folder. You now have a complete, aligned matrix of sequences for multiple genes, which can be used directly to build a phylogeny.
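To make clear what the merge and export in this step amount to, here is a rough Python sketch of the same idea. It is not how SequenceMatrix is implemented. It assumes each gene alignment is already loaded as a dictionary from species name to aligned sequence, that names match exactly between genes, and that a species missing from a gene is filled with "?" characters (the fill character is an assumption). The output is a simple relaxed PHYLIP layout: a header with the number of taxa and sites, then one name and sequence per line.

```python
def concatenate_alignments(gene_alignments, missing="?"):
    """Join per-gene alignments into one matrix keyed by species name.

    gene_alignments: list of dicts (species -> aligned sequence), one per gene.
    Species absent from a gene are padded with `missing` over that gene's length.
    """
    taxa = sorted({taxon for gene in gene_alignments for taxon in gene})
    matrix = {taxon: "" for taxon in taxa}
    for gene in gene_alignments:
        lengths = {len(seq) for seq in gene.values()}
        if len(lengths) != 1:
            raise ValueError("each gene must be aligned to a single length")
        gene_length = lengths.pop()
        for taxon in taxa:
            matrix[taxon] += gene.get(taxon, missing * gene_length)
    return matrix


def to_relaxed_phylip(matrix):
    """Format the matrix as relaxed PHYLIP text.

    Spaces in names are replaced with underscores, because the name and
    the sequence are separated by whitespace on each line.
    """
    n_sites = len(next(iter(matrix.values())))
    lines = [f"{len(matrix)} {n_sites}"]
    for taxon, seq in matrix.items():
        lines.append(f"{taxon.replace(' ', '_')} {seq}")
    return "\n".join(lines) + "\n"
```

The order of the genes in the input list fixes the order of the alignment blocks in the matrix, and sorting the names keeps the species order stable between runs.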