UPGMA Method: Designing a Phylogenetic Tree

Jan 28, 2019

A phylogenetic tree (AKA cladogram) is a diagrammatic representation of the evolutionary relatedness between various organisms, or at least of our hypothesis about it. It is an important tool for understanding life as a whole and how various traits came about in living things. Of course, we can’t travel back in time to witness exactly how species diverged. However, we can rely on an important clue, present in all organisms today, that reveals much about their evolutionary past: the genome.

Life as we know it today is very diverse, ranging from bacteria to monkeys to insects, but one thing all living things have in common is the genetic code. The differences between organisms are reflected in their genomes: two closely related organisms have more similar genomes than two distantly related ones. Therefore, we can reason that if the genomes of species A and B are very similar, they share a recent common ancestor.

Notice that until now, we have been using relative terms like “very similar” or “distantly related”. Such vague wording is not enough to reach a conclusion, because it gives us no standard for comparing relatedness reliably enough to build a cladogram. To resolve this issue, we need a quantifiable degree of relatedness: some measure that tells us exactly how similar or different two organisms are.

In assessing the relatedness between two organisms, there are a variety of factors to consider, such as morphology, comparative anatomy, and embryology. However, to consistently arrive at an easily quantifiable value, the sequences of genes shared between organisms are frequently compared. Let’s take a brief look at some of the theory behind this comparison. Different organisms carry genes of common origin (i.e. genes inherited from the same common ancestor), which usually serve the same purpose; examples include the gene for hemoglobin and TBX6. Over time, if two populations do not interbreed, speciation occurs, and differences in these shared genes begin to accumulate. Using sequence alignment, the exact number of differences between any two organisms can then be determined. From this premise, we can infer that the more differences two organisms have in their sequences, the less evolutionarily related they are, and vice versa.
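To make "number of differences" concrete, here is a minimal Python sketch that counts mismatched positions between two sequences. It assumes the sequences have already been aligned to the same length, and the function name `count_differences` and the two short sequences are made up purely for illustration; real alignment tools handle gaps and scoring in far more detail.

```python
def count_differences(seq_a: str, seq_b: str) -> int:
    """Count positions at which two aligned, equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must already be aligned to the same length")
    # Compare the sequences position by position and count the mismatches.
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


# Made-up sequences, for illustration only (not real gene data).
print(count_differences("GATTACA", "GACTATA"))  # 2
```

Running this for every pair of organisms would give the kind of pairwise difference table that UPGMA takes as its input.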

At this point, our problem becomes fairly simple: given the differences between a set of organisms, we must create a phylogenetic tree. There are a variety of ways of making a cladogram from this data, one of which is the Unweighted Pair Group Method with Arithmetic Mean (UPGMA). This approach can be boiled down to three simple steps: 1) Find the two organisms with the fewest differences. 2) Group them together as one cluster and recalculate differences. 3) Repeat steps 1–2 until the tree is complete. Let’s go into a bit more detail.

1. Find the two clusters with the fewest differences

You’ll need to start with a table like the one above, which lists the number of differences between each pair of organisms in your starting set. Notice that there are some blank cells; these values are omitted because they are duplicates of other values in the table. For example, the cell in row “Penguin”, column “Horse” has the same value as the cell in row “Horse”, column “Penguin”. In the table above, the smallest difference between two clusters is 1, which occurs between Horse and Donkey. (The term cluster refers to a given row or column in the table. At first, each cluster contains only one organism, but this will change once we start grouping organisms together.)
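As a rough sketch of this step in Python, the table can be stored as a dictionary keyed by unordered pairs, so each value appears only once, just like the blank cells in the table. The function name `closest_pair` is an illustrative choice, and every number below is hypothetical except the Horse-Donkey difference of 1 mentioned above.

```python
# Pairwise differences keyed by unordered pairs (frozensets).
# Hypothetical values: only Horse-Donkey = 1 comes from the article.
distances = {
    frozenset({"Horse", "Donkey"}): 1,
    frozenset({"Horse", "Chicken"}): 11,
    frozenset({"Horse", "Penguin"}): 13,
    frozenset({"Donkey", "Chicken"}): 10,
    frozenset({"Donkey", "Penguin"}): 12,
    frozenset({"Chicken", "Penguin"}): 3,
}


def closest_pair(distances):
    """Return the pair of clusters with the smallest difference, and that value."""
    pair = min(distances, key=distances.get)
    return tuple(sorted(pair)), distances[pair]


print(closest_pair(distances))  # (('Donkey', 'Horse'), 1)
```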

2. Group them together as one cluster and recalculate differences

First, we replace the Horse cluster and the Donkey cluster with a single cluster containing both. To fill in the row and column for this new cluster, we average the two original values for each remaining cluster. For instance, the value at the intersection of (Horse, Donkey) and Penguin is the average of the Horse-Penguin value and the Donkey-Penguin value. Do the same for every other intersection involving the new cluster to fill in the new table. Intersections that do not involve the merged cluster, such as Chicken and Penguin, stay the same. One caveat for later rounds: when the clusters being merged contain different numbers of organisms, UPGMA weights each value by the size of its cluster, which is the same as averaging over every pair of individual organisms; a plain average of the two values in that case is what WPGMA does instead.
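The merge step might look something like the Python sketch below. It uses the size-weighted average that standard UPGMA calls for; when both clusters hold a single organism, as Horse and Donkey do here, this is simply the plain average of the two values. The function name `merge_clusters`, the `sizes` bookkeeping, and all numbers except Horse-Donkey = 1 are assumptions made for illustration.

```python
def merge_clusters(distances, sizes, a, b):
    """Merge clusters a and b; return the updated table, sizes, and new name."""
    merged = f"({a},{b})"
    others = [c for c in sizes if c not in (a, b)]
    # Values that do not involve a or b are carried over unchanged.
    new_distances = {p: d for p, d in distances.items()
                     if a not in p and b not in p}
    for c in others:
        d_a = distances[frozenset({a, c})]
        d_b = distances[frozenset({b, c})]
        # Size-weighted average; equals (d_a + d_b) / 2 for two single organisms.
        new_distances[frozenset({merged, c})] = (
            (sizes[a] * d_a + sizes[b] * d_b) / (sizes[a] + sizes[b])
        )
    new_sizes = {c: sizes[c] for c in others}
    new_sizes[merged] = sizes[a] + sizes[b]
    return new_distances, new_sizes, merged


# Hypothetical values: only Horse-Donkey = 1 comes from the article.
distances = {
    frozenset({"Horse", "Donkey"}): 1,
    frozenset({"Horse", "Chicken"}): 11,
    frozenset({"Horse", "Penguin"}): 13,
    frozenset({"Donkey", "Chicken"}): 10,
    frozenset({"Donkey", "Penguin"}): 12,
    frozenset({"Chicken", "Penguin"}): 3,
}
sizes = {"Horse": 1, "Donkey": 1, "Chicken": 1, "Penguin": 1}

new_distances, new_sizes, merged = merge_clusters(distances, sizes, "Horse", "Donkey")
print(new_distances[frozenset({merged, "Penguin"})])     # 12.5
print(new_distances[frozenset({"Chicken", "Penguin"})])  # 3
```

Tracking cluster sizes costs one extra dictionary, but it keeps the averages correct once clusters of different sizes start merging.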

3. Repeat steps 1–2 until the tree is complete

Notice that after completing these steps, we have reduced the total number of clusters by one, since Horse and Donkey have been combined. Each repetition merges two more clusters, so small clusters are gradually absorbed into larger ones. Eventually, after the last two clusters are combined, a single cluster containing every organism remains, and the order in which the clusters were merged gives us our tree. Take a look at the tree below to see the final result.
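Putting the three steps together, a compact and deliberately simplified Python sketch of the whole loop could look like this. It is not the program linked in this article; the function name `upgma`, the nested-parentheses output, and all numbers except Horse-Donkey = 1 are illustrative assumptions, and branch lengths are left out to keep it short.

```python
def upgma(names, distances):
    """Cluster `names` with UPGMA; return (nested tree string, list of merges)."""
    sizes = {name: 1 for name in names}
    dist = dict(distances)
    merges = []  # (cluster_a, cluster_b, difference at which they were joined)
    while len(sizes) > 1:
        # Step 1: find the two clusters with the fewest differences.
        pair = min(dist, key=dist.get)
        a, b = sorted(pair)
        merges.append((a, b, dist[pair]))
        # Step 2: group them and recalculate differences to the other clusters.
        merged = f"({a},{b})"
        others = [c for c in sizes if c not in pair]
        new_dist = {p: d for p, d in dist.items() if a not in p and b not in p}
        for c in others:
            d_a = dist[frozenset({a, c})]
            d_b = dist[frozenset({b, c})]
            new_dist[frozenset({merged, c})] = (
                (sizes[a] * d_a + sizes[b] * d_b) / (sizes[a] + sizes[b])
            )
        new_sizes = {c: sizes[c] for c in others}
        new_sizes[merged] = sizes[a] + sizes[b]
        # Step 3: repeat with the smaller table.
        sizes, dist = new_sizes, new_dist
    return next(iter(sizes)), merges


# Hypothetical values: only Horse-Donkey = 1 comes from the article.
distances = {
    frozenset({"Horse", "Donkey"}): 1,
    frozenset({"Horse", "Chicken"}): 11,
    frozenset({"Horse", "Penguin"}): 13,
    frozenset({"Donkey", "Chicken"}): 10,
    frozenset({"Donkey", "Penguin"}): 12,
    frozenset({"Chicken", "Penguin"}): 3,
}

tree, merges = upgma(["Horse", "Donkey", "Chicken", "Penguin"], distances)
print(tree)  # ((Chicken,Penguin),(Donkey,Horse))
for a, b, d in merges:
    print(f"joined {a} and {b} at difference {d}")
```

With these made-up numbers, Horse and Donkey are joined first, then Chicken and Penguin, and finally the two pairs are joined at an average difference of 11.5.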

If you’re using the UPGMA method, you might want to consider using the following program to create your tree: https://github.com/SRavit1/UPGMA. All you have to do is input the names of the organisms you’re working with and the differences between each organism, and the program will output the tree, complete with distances on each branch. In fact, the example cladogram shown above was created using this program.

As mentioned earlier, there are a whole host of alternatives to UPGMA that you can use when making a phylogenetic tree, including Weighted Pair Group Method using Arithmetic Mean (WPGMA), Neighbor Joining Method (NJ), Weighted Neighbor Joining (Weighbor), Fitch-Margoliash (FM), Minimum Evolution (ME), Maximum Parsimony (MP), and Maximum Likelihood (ML). (You can learn about all these in the first link in Further reading.) In any case, hopefully this article has enlightened you about UPGMA and cladograms in general.

Written by Ravit Sharma