Affine Transformation — why 3D matrix for a 2D transformation

Junfeng Zhang
5 min read · Jan 15, 2023


Assumption: you know the basics of Linear Transformation by matrix multiplication. If not, this video from 3Blue1Brown is a great intro.

Have you ever wondered why Affine Transformation typically uses a 3x3 matrix to transform a 2D image? It looks confusing at first, but it is actually a brilliant idea!

If you search around for articles on the topic, you can see that 3x3 matrices are used to perform transformations (scaling, translation, rotation, shearing) on images, which are 2D!

Credit: Cmglee at https://en.wikipedia.org/wiki/Affine_transformation

We know that the location of each pixel in a 2D image can be represented with a 2D vector [x,y], and the image can be linearly transformed with a 2x2 matrix. So questions naturally arise. Where do the 3x3 matrices come from? 3x3 matrices are not even compatible with 2D vectors! Why are we using a 3x3 matrix when it seems that a 2x2 can do the same?

In this article, I will answer the questions below:

  • Why does Affine Transformation use a 3x3 matrix to transform a 2D image? and
  • more confusing yet, why is the shape of the input transformation matrix in OpenCV’s Affine Transformation function cv2.warpAffine() 2 x 3?

In linear transformation, a 2x2 matrix is used to do scaling, shearing, and rotating on a 2D vector [x,y], which is also exactly what Affine Transformation does. See what’s missing? Translation! Multiplying a 2D vector by a 2x2 matrix cannot achieve translation.

In the linear transformations scaling, shearing, and rotating, the basis vectors all stay at the same origin (0,0) before and after the transformation. That means the point (0,0) never changes location. To translate the image to a different location, you need to add a vector after the matrix transformation. Therefore, the general expression for Affine Transformation is q = Ap + b, which expands to

[q₁]   [a₁₁ a₁₂] [p₁]   [b₁]
[q₂] = [a₂₁ a₂₂] [p₂] + [b₂]

[p₁, p₂] can be understood as the original location of one pixel of an image. [q₁, q₂] is the new location after the transformation.

When vector b [b₁, b₂] is [0,0], there is no translation. In this case, a 2 x 2 matrix A is indeed sufficient. When b is not [0,0], the image moves to a different location (aka translation). b₁ and b₂ here determine the new location. Specifically, b₁ determines how much the location moves along the x axis, and b₂ determines how much it moves along the y axis. This looks a bit cumbersome, doesn’t it? It is a 2-step calculation: matrix multiplication + vector addition.
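As a minimal sketch in NumPy (the rotation and translation values here are made up for illustration, not taken from the article), the two-step calculation looks like this:

```python
import numpy as np

# Illustrative values: rotate a point 90° counterclockwise, then translate it.
A = np.array([[0.0, -1.0],
              [1.0,  0.0]])   # 2x2 linear part (the rotation)
b = np.array([3.0, 1.0])      # translation vector

p = np.array([1.0, 0.0])      # original pixel location [p1, p2]
q = A @ p + b                 # two steps: matrix multiply, then vector add
print(q)                      # [3. 2.]
```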

KEY QUESTION: what if we can perform the entire transformation with just one matrix multiplication?

That’s exactly what a 3x3 matrix can do: it combines the multiplication by a 2x2 matrix and the addition of a 2D vector into a single multiplication by a 3x3 matrix!

Here is how it works. The original points on the 2D plane are padded with a 1 in the third axis, becoming (p₁, p₂, 1). This makes them points in 3D space, with the value of p₃ always equal to 1. So the 2D image is still a 2D image; it has just been lifted into a 3D space. To visualize it:

When we shear the cube along the z axis, the image does not change shape or size, but it moves to a different location from the perspective of the x and y axes! That’s exactly how the 3x3 matrix transformation helps!

What happens to the transformation matrix then? It is expanded into this:

[q₁]   [a₁₁ a₁₂ a₁₃] [p₁]
[q₂] = [a₂₁ a₂₂ a₂₃] [p₂]
[1 ]   [ 0   0   1 ] [1 ]

The key points here:

  • The last row of the transformation matrix A, [0,0,1], makes sure that the points after transformation are still on the same z=1 plane.
  • a₁₃ and a₂₃ determine how much the image is shifted along the x and y axes respectively. They are exactly the same as b₁ and b₂ in the 2D vector above.
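This claim is easy to check in NumPy. Below is a small sketch (the rotation and translation values are made up for illustration) showing that packing A and b into one 3x3 matrix and multiplying once gives the same result as computing Ap + b in two steps:

```python
import numpy as np

# Illustrative values, not from the article: a 90° rotation plus a translation.
A = np.array([[0.0, -1.0],
              [1.0,  0.0]])
b = np.array([3.0, 1.0])

M = np.eye(3)
M[:2, :2] = A    # a11..a22: the linear part (scale/shear/rotate)
M[:2, 2] = b     # a13, a23: the translation
# The last row of M stays [0, 0, 1], keeping points on the z = 1 plane.

p = np.array([1.0, 0.0])
p_h = np.append(p, 1.0)   # pad with 1 -> (p1, p2, 1)
q_h = M @ p_h             # one matrix multiplication

print(q_h[:2])            # [3. 2.], the same as A @ p + b
```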

It is worth noting the two 0s in positions a₃₁ and a₃₂. They stay as 0 in Affine Transformation. If either of them is nonzero, the transformed points leave the z=1 plane, and projecting them back onto it (dividing by the third coordinate) changes the shape of the image. In that case, it is no longer an Affine Transformation. That is why in OpenCV, the input transformation matrix for the Affine Transformation function cv2.warpAffine() is a 2 x 3 matrix, which only has 2 rows, as the third row [0, 0, 1] never changes!
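As a quick sketch of the 2 x 3 convention (the matrix values are made up, a pure translation; the cv2 call is shown commented out, with hypothetical names img, w, h, so the snippet stays self-contained):

```python
import numpy as np

# A full 3x3 affine matrix; illustrative values for a pure translation.
M3 = np.array([[1.0, 0.0, 50.0],
               [0.0, 1.0, 20.0],
               [0.0, 0.0,  1.0]])

# cv2.warpAffine expects only the first two rows,
# since the third row is always [0, 0, 1].
M23 = M3[:2, :]
print(M23.shape)  # (2, 3)

# With OpenCV installed, the call would look like this
# (img, w, h are hypothetical):
# import cv2
# shifted = cv2.warpAffine(img, M23.astype(np.float32), (w, h))
```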

OK, a recap:

  1. Why does Affine Transformation typically use a 3x3 matrix to transform a 2D image?
    To save computation steps and for elegance (in my opinion): it combines a two-step calculation into one matrix multiplication.
  2. Why is the input transformation matrix in OpenCV’s Affine Transformation function cv2.warpAffine() 2 x 3 (hopefully not confusing anymore)?
    The third row of the transformation matrix always stays [0,0,1], so there is no need to specify it in the function input.

Let me know if the explanation makes sense.

