Multiple Sequence Alignment using Clustal

Mar 18, 2024

In molecular biology, computational biology, and bioinformatics, multiple sequence alignment (MSA) is a powerful tool. It allows researchers to compare DNA, RNA, and protein sequences, revealing their similarities and variations. The explosion of sequence data from next-generation sequencing technologies has made MSA even more crucial. However, aligning these massive datasets is computationally demanding. One way to cope is to parallelize existing MSA algorithms, distributing the data across multiple computers in clusters or server farms. This parallel approach is key to unlocking the full potential of MSA for large-scale sequence analysis.

Multiple sequence alignment offers a big advantage over comparing just two sequences at a time. It gives us more insight by revealing patterns and motifs that are hidden when we only look at pairs of sequences. This method helps us find important amino acid residues in proteins and understand their functions better. It’s also key for doing things like figuring out how different species are related (phylogenetic analysis), predicting the shapes of proteins, and even designing better PCR primers for experiments.

Dynamic programming, the method used for pairwise alignment, can in theory be extended to multiple sequences. However, the computing time and memory it requires grow exponentially with the number of sequences. Because of this, full dynamic programming isn’t practical for datasets containing more than about ten sequences. In practice, heuristic approaches are most often used.
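
To make the exponential growth concrete, here is a rough sketch that counts the cells in a full dynamic programming lattice. It assumes N sequences that all have the same length L, so the lattice has (L + 1)^N cells and each cell looks back at 2^N - 1 neighbours; the function names are made up for illustration.

```python
def dp_cells(num_sequences: int, length: int) -> int:
    """Cells in the full dynamic-programming lattice for equal-length sequences."""
    return (length + 1) ** num_sequences


def predecessors_per_cell(num_sequences: int) -> int:
    """Neighbouring cells each cell depends on: every non-empty subset of sequences can advance."""
    return 2 ** num_sequences - 1


# Print how quickly the lattice grows for sequences of length 100.
for n in (2, 3, 5, 10):
    print(n, dp_cells(n, 100), predecessors_per_cell(n))
```

Going from two to three sequences of length 100 already takes the lattice from about ten thousand cells to about a million, which is why heuristics take over so quickly.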

In multiple sequence alignment, the sequences are arranged to maximize the matching of residues according to a scoring function. This function, based on the sum of pairs (SP), adds up the scores of all pairs of sequences in an alignment. Each column is scored by summing the scores of every pair of characters in that column, whether they form a match, a mismatch, or a gap. Most alignment algorithms aim to maximize the SP score.
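
The sum-of-pairs idea can be sketched in a few lines of Python. This is only an illustration: the match, mismatch, and gap values below are toy numbers chosen for readability, not a real substitution matrix, and the three short sequences are invented.

```python
from itertools import combinations

# Toy scoring values (assumptions for illustration, not BLOSUM or PAM scores).
MATCH, MISMATCH, GAP = 1, -1, -2


def pair_score(a: str, b: str) -> int:
    """Score one pair of characters taken from the same alignment column."""
    if a == "-" and b == "-":
        return 0  # two gaps facing each other are not penalized here
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH


def column_score(column) -> int:
    """Sum the scores of every pair of characters in one column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def sp_score(alignment) -> int:
    """Sum-of-pairs score of a whole alignment (rows must have equal length)."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return sum(column_score(col) for col in zip(*alignment))


alignment = ["GKN", "G-N", "AKN"]
print(sp_score(alignment))
```

With these toy values the three columns score -1, -3, and 3, so the alignment as a whole scores -1.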

For example, given a multiple alignment of three sequences, the sum of scores is calculated as the sum of the similarity scores of every pair of sequences at each position. The scoring is based on the BLOSUM62 matrix. If the total score for the alignment is 5, the alignment is 2^5 = 32 times more likely to occur among homologous sequences than by random chance.

Jalview is a piece of bioinformatics software that is used to look at and edit multiple sequence alignments. By allowing visualization and in-depth analysis, Jalview helps unlock insights into functional similarities and differences across vast amounts of biological data.

There are two main approaches in multiple sequence alignment: exhaustive and heuristic. Exhaustive methods consider all possible alignments to find the optimal solution, ensuring thoroughness but often requiring significant computational resources, making them impractical for large datasets. On the other hand, heuristic approaches employ strategies to quickly find reasonably good alignments, sacrificing optimality for efficiency. These methods employ techniques like progressive alignment or iterative refinement to rapidly generate alignments that are close to optimal, making them suitable for large-scale sequence data analysis.

Clustal is a multiple alignment program accessible both online and as a standalone tool. The stand-alone program, which runs on UNIX and Macintosh, has two variants, ClustalW and ClustalX. The W version provides a simple text-based interface and the X version provides a more user-friendly graphical interface.

Clustal Omega (https://www.ebi.ac.uk/jdispatcher/msa/clustalo) is a newer alignment program utilizing seeded guide trees and HMM profile-profile techniques to align three or more sequences.
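
Besides the web form, Clustal Omega can be run locally through its `clustalo` command-line program. The sketch below wraps a typical call from Python. It assumes `clustalo` is installed and on the PATH; the file names are placeholders and the helper function names are invented for this example.

```python
import shutil
import subprocess


def build_clustalo_command(in_fasta: str, out_file: str, threads: int = 1) -> list:
    """Assemble a clustalo call that writes a Clustal-format alignment."""
    return [
        "clustalo",
        "-i", in_fasta,           # unaligned input sequences (FASTA)
        "-o", out_file,           # where the alignment is written
        "--outfmt=clu",           # Clustal output format
        f"--threads={threads}",   # number of processors to use
        "--force",                # overwrite the output file if it exists
    ]


def run_clustalo(in_fasta: str, out_file: str, threads: int = 1) -> None:
    """Run clustalo, failing early if the executable is not available."""
    if shutil.which("clustalo") is None:
        raise FileNotFoundError("clustalo executable not found on PATH")
    subprocess.run(build_clustalo_command(in_fasta, out_file, threads), check=True)


# Placeholder file names; call run_clustalo(...) with real paths to execute.
print(" ".join(build_clustalo_command("sequences.fasta", "aligned.aln", threads=4)))
```

Keeping the command construction separate from the execution makes it easy to inspect or log the exact call before running it on a large dataset.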

One key feature of Clustal is its adaptable use of substitution matrices. Rather than applying a single matrix throughout, Clustal chooses among scoring matrices according to how similar the sequences being aligned are. The choice is determined by the evolutionary distances calculated from the guide tree. For closely related sequences, Clustal uses the BLOSUM62 or PAM120 matrix, while for more divergent ones it switches to BLOSUM45 or PAM250.

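
A minimal sketch of this kind of distance-dependent matrix choice might look as follows. The matrix names come from the paragraph above, but the single cutoff of 0.5 and the function name are assumptions made for illustration; they are not Clustal's actual thresholds.

```python
# Matrix names are taken from the text; the cutoff value is hypothetical.
MATRICES = {
    "BLOSUM": {"close": "BLOSUM62", "divergent": "BLOSUM45"},
    "PAM": {"close": "PAM120", "divergent": "PAM250"},
}


def choose_matrix(distance: float, series: str = "BLOSUM", cutoff: float = 0.5) -> str:
    """Pick a substitution matrix from a guide-tree distance between 0 and 1."""
    if not 0.0 <= distance <= 1.0:
        raise ValueError("distance must be between 0 and 1")
    key = "close" if distance < cutoff else "divergent"
    return MATRICES[series][key]


print(choose_matrix(0.1))          # closely related pair
print(choose_matrix(0.8, "PAM"))   # divergent pair
```
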
Clustal also offers adjustable gap penalties, allowing for more insertions and deletions in non-conserved regions while minimizing them in conserved areas. For example, penalties for gaps near hydrophobic residues are higher compared to those near hydrophilic or glycine residues commonly found in loop regions. Additionally, gaps occurring close to each other may face increased penalties compared to those in isolated positions.

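
The idea of position-dependent gap penalties can be illustrated with a small sketch. Everything numeric here is an assumption: the residue groupings, the multipliers 1.5, 0.5, and 2.0, and the function name are invented to show the direction of each adjustment, and they do not reproduce Clustal's real parameters.

```python
# Residue groupings and multipliers below are illustrative assumptions.
HYDROPHOBIC = set("AILMFVW")
LOOP_FRIENDLY = set("GPSNDQEKR")  # hydrophilic residues plus glycine


def gap_open_penalty(base: float, residue: str, near_existing_gap: bool = False) -> float:
    """Scale a base gap-opening penalty according to the local context."""
    factor = 1.0
    if residue in HYDROPHOBIC:
        factor *= 1.5   # discourage gaps in likely buried, conserved regions
    elif residue in LOOP_FRIENDLY:
        factor *= 0.5   # tolerate gaps in likely loop regions
    if near_existing_gap:
        factor *= 2.0   # discourage gaps that crowd close together
    return base * factor


print(gap_open_penalty(10.0, "L"))
print(gap_open_penalty(10.0, "G"))
print(gap_open_penalty(10.0, "T", near_existing_gap=True))
```
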
The program also assigns weights to sequences to make the alignment of divergent sequences, particularly those with less than 25% identity, more reliable. The weights reduce the influence of redundant or closely related sequences, preventing them from dominating the alignment. Each sequence’s weight is determined by its position on the guide tree: branch lengths along its path are shared among the sequences that have those branches in common. These weights are then used to adjust alignment scores, reducing the impact of common characters and increasing the importance of rare ones.

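
One way to picture tree-based weighting is the sketch below. It assumes a simple scheme in which each branch length is divided equally among the sequences descending from that branch, and a sequence's weight is the sum of its shares from the root down. The `Node` class, the tree, and the branch lengths are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    length: float                      # length of the branch above this node
    name: str = ""                     # set for leaves (sequences) only
    children: list = field(default_factory=list)


def leaf_count(node: Node) -> int:
    """Number of sequences below (or at) this node."""
    if not node.children:
        return 1
    return sum(leaf_count(child) for child in node.children)


def sequence_weights(node: Node, inherited: float = 0.0, weights=None) -> dict:
    """Give each leaf the sum of branch lengths on its path, each shared among its descendants."""
    if weights is None:
        weights = {}
    share = inherited + node.length / leaf_count(node)
    if not node.children:
        weights[node.name] = share
    else:
        for child in node.children:
            sequence_weights(child, share, weights)
    return weights


# Hypothetical guide tree: seqA and seqB are close relatives, seqC is distant.
tree = Node(0.0, children=[
    Node(0.2, children=[Node(0.1, "seqA"), Node(0.1, "seqB")]),
    Node(0.5, "seqC"),
])
print(sequence_weights(tree))
```

In this toy tree the two close relatives end up with a weight of about 0.2 each, while the lone divergent sequence gets 0.5, so the pair cannot outvote it.
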
Clustal’s progressive method has limitations. Because it is a global alignment approach, it struggles when comparing sequences of different lengths. Its affine gap penalties restrict long gaps, which can affect accuracy. Additionally, the order in which sequences are added influences the final alignment, and once gaps are introduced they are fixed, so early errors cannot be corrected later. These limitations can result in suboptimal alignments, especially with divergent sequences. Newer algorithms have been developed to address these issues.

T-Coffee is one such algorithm. It performs progressive sequence alignment, but with a twist: it conducts both global and local pairwise alignments for all pairs of sequences in a query. For the global pairwise alignments, it uses the Clustal program. The local pairwise alignments are generated by the Lalign program, from which the ten top-scoring alignments are selected. The collection of local and global alignments is pooled to form a library, and the consistency of the alignments is then evaluated. For every pair of residues in a pair of sequences, a consistency score is calculated from both the global and local alignments. Each pairwise alignment is further aligned with a possible third sequence, and the result is used to refine the original pairwise alignment based on a consistency criterion, in a process known as library extension. Based on the refined pairwise alignments, a distance matrix is built to derive a guide tree, which is then used to direct a full multiple alignment using the progressive approach.
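
Library extension can be sketched roughly as follows. The sketch assumes the library stores a weight for each aligned residue pair, and that support through a third sequence counts as the smaller of the two weights along the indirect path, added to any direct weight. The residue labels, the weights, and the function names are made up for illustration and are not taken from the T-Coffee code.

```python
from collections import defaultdict


def make_library(pairs):
    """Build a symmetric lookup: residue -> {aligned partner residue: weight}."""
    library = defaultdict(dict)
    for x, y, weight in pairs:
        library[x][y] = weight
        library[y][x] = weight
    return library


def extended_weight(library, x, y):
    """Direct weight of (x, y) plus support from residues aligned to both x and y."""
    direct = library[x].get(y, 0)
    indirect = sum(
        min(w_xz, library[z][y])           # weakest link on the path x - z - y
        for z, w_xz in library[x].items()
        if z != y and y in library[z]
    )
    return direct + indirect


# Residues are (sequence name, position); weights are hypothetical.
library = make_library([
    (("A", 3), ("B", 3), 88),    # seen in the A-B pairwise alignment
    (("A", 3), ("C", 3), 77),    # seen in the A-C pairwise alignment
    (("C", 3), ("B", 3), 100),   # seen in the B-C pairwise alignment
    (("A", 1), ("C", 1), 77),
    (("C", 1), ("B", 2), 100),   # A1 and B2 are linked only through C1
])
print(extended_weight(library, ("A", 3), ("B", 3)))
print(extended_weight(library, ("A", 1), ("B", 2)))
```

The first pair is supported both directly and through sequence C, so its extended weight is 88 + 77 = 165. The second pair never appears together in a pairwise alignment, yet it still gains a weight of 77 from the path through C.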

Clustal plays a pivotal role in various bioinformatics applications:

Sequence Alignment: Clustal is primarily used for aligning multiple sequences of DNA, RNA, or protein to identify similarities and differences between them. It is crucial for understanding evolutionary relationships, identifying functional domains, and predicting protein structures.
Phylogenetic Analysis: By aligning sequences, Clustal provides the basis for constructing phylogenetic trees, which depict the evolutionary relationships between organisms or sequences. These trees are vital for studying the evolutionary history and taxonomy of species, as well as for predicting gene functions.
Structural Bioinformatics: Clustal alignments are often used as inputs for predicting protein structures and analyzing their functions. Alignments provide valuable insights into conserved regions, functional motifs, and structurally important residues, aiding in protein structure prediction and drug design.
Comparative Genomics: Clustal is utilized in comparative genomics studies to compare sequences from different species or strains. By aligning genomes or transcriptomes, researchers can identify conserved regions, gene families, and genetic variations, shedding light on evolutionary processes and functional differences between organisms.
Functional Annotation: Clustal alignments are instrumental in transferring annotations from sequences with known functions to newly sequenced or poorly characterized ones. By aligning unknown sequences with annotated ones, researchers can infer their functions based on conserved domains, motifs, or sequence similarities, facilitating gene annotation and functional characterization.
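
As a small illustration of how an alignment feeds into phylogenetic analysis, the sketch below turns aligned sequences into a pairwise distance matrix of the kind that tree-building methods take as input. It uses the simple p-distance (the fraction of differing positions, skipping columns with gaps), which is an assumption chosen for brevity; the sequence names and sequences are invented.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Fraction of differing positions, ignoring columns where either row has a gap."""
    compared = differing = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        differing += x != y
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differing / compared


def distance_matrix(aligned: dict) -> dict:
    """Pairwise p-distances keyed by (name1, name2)."""
    return {
        (n1, n2): p_distance(aligned[n1], aligned[n2])
        for n1, n2 in combinations(aligned, 2)
    }


# Invented, already-aligned sequences of equal length.
aligned = {
    "human": "ACGT-ACGT",
    "chimp": "ACGTTACGA",
    "mouse": "ACCT-ATGA",
}
for pair, dist in distance_matrix(aligned).items():
    print(pair, dist)
```

Here the human and chimp rows differ at one of eight compared positions (0.125), while human and mouse differ at three (0.375), so a distance-based tree would group human with chimp first.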

Future research in multiple sequence alignment using Clustal focuses on enhancing speed and accuracy for large-scale datasets. This includes developing parallel computing techniques, refining algorithms to handle diverse sequences better, and integrating machine learning for automated alignment optimization, supported by cloud computing technologies. These advancements will further empower bioinformatics studies and genomic analysis.

Clustal remains an important tool in bioinformatics, offering practical approaches for multiple sequence alignment of protein and nucleotide sequences. Its user-friendly interface and robust algorithms make it accessible to researchers. Through Clustal, scientists can study the information encoded in DNA, RNA, and protein sequences, finding evolutionary relationships, predicting protein structures, and annotating gene functions. As technology advances and data volumes increase, the importance of efficient alignment algorithms like Clustal only continues to grow, supported by its continued development and integration into bioinformatics workflows.

Written by RoopamSeal