(Generalized) Procrustes analysis with Python/NumPy

Olga Kravchenko
6 min readJun 17, 2018

--

One of the great discoveries that I came across when I got into stats is that people in the field sometimes have a great sense of humor. Granted, I assumed the opposite based solely on the field’s stereotypical representation, so learn from my mistakes, kids! Procrustes analysis is named after a bandit from ancient Greek mythology that would hunt people down, and adjust their sizes to his iron bed by either cutting their limbs off or stretching them. The algorithm that implements it does something similar to a set of shapes. The general idea is to analyze the statistical shape variation, aka to find a degree of similarity between different shapes in a set. If there’s a reference shape present, and all the other shapes are compared to it, we have ourselves what is referred to as classical Procrustes analysis. If what we’re dealing with is just a bunch of shapes floating around, and there’s no info suggesting a good candidate to be chosen for reference shape, the problem is referred to as generalized Procrustes analysis. In general, a term “shape” is used loosely, and the method can be interpolated from explicit shapes to abstract such as sets of numbers that need to be compared. However I came across the method in computer vision course, so I’ll be mostly talking about it in that context.

Let’s say we have ourselves five random triangles scattered across 2d plane, each vertex defined by an (x,y) coordinate. To compare them, they first need to be aligned to each other as closely as possible (hence Procrustes, get it? Get it?). In more practical terms, we need to find an optimal scale, rotation and translation that would make all of the triangles overlap w.r.t. either an explicitly selected reference landmark, or a mean landmark that is calculated in one way or the other. More formally, for classical Procrustes the steps are:

  1. Select a reference shape
  2. For each of the remaining shapes, calculate scale, rotation and translation that would align it to a reference shape as closely as possible
  3. Calculate Procrustes distance by first calculating an SSD for each point w.r.t a reference point, then summing those and taking a square root of the sum

For generalized Procrustes analysis there’s no reference shape to begin with, so what all the shaped are compared against is a mean shape that is chosen arbitrarily and then iteratively improved. The algorithm goes like:

  1. Select a random shape as a mean shape (usually the first shape in the set is taken)
  2. Align all the other shapes to it
  3. Calculate a new mean shape, compare it to the old shape (e.g. by subtracting one from the other)
  4. If the difference in mean shapes is above some margin, align the new mean to the old mean and return to step 2.

The second algorithm usually converges in a couple iterations.

Now to the practical part. For the examples here I will assume that each shape is represented by a 2n x 1 vector with with coordinates (x1,y1, x2, y2,…xn, yn). In principle it is not necessary and any format of data will do, however there are some operations that make use of NumPy functionality for which this format optimal. All the visualizations will be done in OpenCV. Let’s create and display five triangles, just to see what we will be working with.

Important: Procrustes analysis aligns shapes point by point, so it is necessary to preserve the same relative order of vertices for each triangle. Here, for the first four triangles I started with the right angle, went up the longer tangent, and went back down. For for the last triangle I went base left — up — base right.

The code below has some simple helper functions that I will be using throughout the post. The create_test_set function does not offer much flexibility, however should you test the algorithm for yourself, I assume you’ll have a set of data to work with, so I did not think spending effort on customizing was worth it.

This should produce an image that looks something like this:

Yay, colors!

Mind that OpenCV locates the original in the upper left corner and increments values of columns (x) and rows (y) from left to right.

As you can see, what we’re working with is four variants of the same shape, and one additional one that does not really line up with anything.

Exhibit A. Procrustes analysis.

First step is just translating all the points to be centered at the origin, which is done simply by subtracting the mean of the axis from every value of those axis. This will save a lot of pain when finding the rotation angle and the scale.

Now, to scale and rotation. In the manual that I’ve used, they give no interpretation for the formulas that are used to calculate scale factor and rotation angle. They work though, so I call this magic! Here’s how you do it:

Scaling is done simply by dividing all the values in the scale vector by a scale factor calculated above. To rotate a vector by an angle theta it is multiplied by a rotation matrix:

Mind that the rotation is assumed to be performed around the origin. In case you want to rotate around an arbitrary point (x,y), just subtract it from the vector, perform rotation and then add it back.

And it looks like we’re done. Time to pile all of it together and see the beautiful triangular caterpillar emerge from a less beautiful chrysalis that also happens to be triangular!

Now, if this whole thing works, the triangles should be neatly aligned on top of each other. I will be translating all the shapes back to the position of reference landmark for visualization.

Aaand looks a-okay from here. The first 4 shapes are overimposed beautifully. The result on the last one is not that impressive, but this is what this algorithm is for: detecting deviations in shape.

Spot the purple deviant

Now, as I’ve said before, Procrustes analysis is used to analyze the difference in shapes, and Procrustes distance is used as a measure. Here’s a formula to calculate it:

And in Pytnon:

Now that this is done, let’s examine exhibit B, generalized Procrustes analysis. Which is basically the same thing we’ve just done with an extra for loop. The algorithm is outlined (way) above, so imma just leave you with the code:

I was lazy with the NumPy array initialization in the very beginning and used Python lists instead, and that came back to haunt me in this function. The code is a bit bumpy in places because I need to convert the lists to conveniently calculate means. I plotted the result to have fun with multicolored shapes one last time

Well, the colors could have been better. Damn you, (pseudo) random number generator!

So the odd triangle, the blue sheep, if you will, has now influenced (well, not terribly, but still) the shape of the rest of triangles. To see more dramatic changes, you can play around with the locations of vertices of triangles.

Wikipedia tells me that generalized Procrustes analysis is used to compare the results of surveys, interviews, and pretty much anything where there needs to be a “fair comparison” when accounting for some common factors. In fact, computer vision seems to have stolen it from statistics to use in a very narrow application domain.

So there. Hopefully, this was helpful and/or mildly entertaining. Either way, thanks for reading and happy shapes mutilation!

--

--