Different implementations of the ubiquitous convolution

Analyze Im2Col, Winograd, Strassen, FFT based 2D convolution implementations & understand advancements made by DeepMind’s AlphaTensor in this domain.

Sundar Raman P
8 min read · Nov 23, 2022
Under the Hood

Throughout the Deep Learning (DL) field, nn.Conv2d is used everywhere to build convolution layers in PyTorch, usually without much thought about how they are implemented under the hood. In this article, we will gain some insight into different convolution implementations (a naive nested for-loop, Im2Col, Winograd, Strassen and FFT algorithms) and infer their pros and cons from the latencies incurred on an N1 CPU and a T4 GPU. We will also relate Strassen's algorithm to DeepMind's recent computing breakthrough, AlphaTensor.

Naive convolution: Nested for-loop

Let’s start with a naive implementation for 2D convolution. We’ll use a simple 3x3 kernel with a 5x5 input matrix (with 1 channel):

An example of a simple 2D convolution

In the naive implementation, we simply traverse or slide through the input matrix using a “kernel-window”. For each window, we do simple element-wise multiplication with the kernel and sum up all the values. Finally, before returning the result we add the bias term to each element of the output. Here is an excellent video by 3Blue1Brown to intuitively understand convolution in general.

import numpy as np

def conv2d(input, kernel, bias):
    kernel_shape = kernel.shape[0]
    output_shape = input.shape[0] - kernel_shape + 1
    result = np.zeros((output_shape, output_shape))

    # Slide the kernel window over every valid position of the input
    for row in range(output_shape):
        for col in range(output_shape):
            window = input[row: row + kernel_shape, col: col + kernel_shape]
            result[row, col] = np.sum(np.multiply(kernel, window))
    return result + bias
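As a quick sanity check, here is a small usage sketch (with hypothetical values) comparing the naive implementation against PyTorch's F.conv2d, which computes the same cross-correlation that DL frameworks call convolution:

import numpy as np
import torch
import torch.nn.functional as F

# Hypothetical example values; the bias is a single scalar here
x = np.random.rand(5, 5).astype(np.float32)
k = np.random.rand(3, 3).astype(np.float32)
b = 0.5

ours = conv2d(x, k, b)
theirs = F.conv2d(torch.from_numpy(x)[None, None],
                  torch.from_numpy(k)[None, None],
                  bias=torch.tensor([b])).squeeze().numpy()
assert np.allclose(ours, theirs, atol=1e-5)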

For an N×N input and a K×K kernel, the two for-loops run O(N²) times and each window multiplication costs O(K²), giving O(N²K²) work overall. If we had a huge network like ResNet50 with hundreds of convolutions and thousands of large input matrices, naive convolution would be an absolutely terrible idea.

However, PyTorch's own implementation scales very well with the input size, and it deserves credit for making so much possible. Let's look at more efficient ways of performing the same operation.

Im2Col/GEMM

Im2Col, which stands for Image to Column, vectorizes the window-by-window multiplication we saw before in order to speed it up. We vectorize by taking each window, flattening it out and stacking the flattened windows as columns of a matrix. Now, if we flatten the kernel into a row vector and perform a matrix multiplication between the two, we get the exact same result after reshaping the output. The image below, with its color coding, beautifully depicts this.

Flattening of input, kernel matrices
Reshaping result after matrix multiplication to obtain required output

Let’s try it out:

import numpy as np

def im2col(input, kernel):
    kernel_shape = kernel.shape[0]
    output_shape = input.shape[0] - kernel_shape + 1
    rows = []
    # Flatten every kernel-sized window, then stack the flattened windows as columns
    for row in range(output_shape):
        for col in range(output_shape):
            window = input[row: row + kernel_shape, col: col + kernel_shape]
            rows.append(window.flatten())

    return np.transpose(np.array(rows))
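To see the full Im2Col/GEMM pipeline end to end, here is a short usage sketch (using the conv2d and im2col functions defined above) that performs the whole convolution as a single matrix multiplication:

import numpy as np

x = np.random.rand(5, 5)
k = np.random.rand(3, 3)

cols = im2col(x, k)                        # (9, 9) for a 5x5 input and 3x3 kernel
out = (k.flatten() @ cols).reshape(3, 3)   # one GEMM, then reshape to the output grid

assert np.allclose(out, conv2d(x, k, bias=0))   # same result as the naive version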

Vectorizing helps cast the problem as a matrix multiplication, but it is the optimized matrix-algebra libraries on modern CPUs and GPUs, such as BLAS (Basic Linear Algebra Subprograms), that let the code take advantage of hardware acceleration. Can we do any better?

The answer is …

Yes. Using knowledge of how arithmetic and logic units are implemented in hardware (for example, how addition and multiplication are realized), we can gain a further speed-up with Winograd's smart convolution technique!

Winograd’s Algorithm

As is always the case, there is no free lunch here. Winograd's convolution approach trades the number of multiplications against the number of additions needed to obtain the convolution result. Since a multiplication takes more time and power in hardware than an addition, it helps immensely if we can reduce the number of multiplications at the cost of a few extra additions. The picture below illustrates the technique:

Winograd Convolution saving us some multiplications at the cost of additions!

Instead of computing the result directly, which would require 6 MULs and 4 ADDs in this example, Winograd's method first computes m1, m2, m3, m4 using 4 MULs and 4 ADDs, and then combines them into the convolution output with 2 extra ADDs. This cuts the computationally expensive MULs by a factor of 1.5x, which is very significant when the input and kernel sizes are big (as they often are)!

One observation we can make here is that (g0+g1+g2)/2 and (g0-g1+g2)/2 need not be recomputed for every convolution window because the kernel remains the same; hence we can pre-compute and cache them, especially for inference.
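Here is a minimal NumPy sketch of the 1D F(2,3) Winograd transform, my own small example of the idea rather than the exact figure: it produces two convolution outputs from a four-element window using 4 multiplications instead of 6, with the two filter-dependent terms pre-computed as noted above.

import numpy as np

def winograd_f2_3(d, g):
    """Winograd F(2,3): two outputs of a 1D convolution of a 3-tap
    filter g over a 4-element input window d, using 4 MULs instead of 6."""
    # These two terms depend only on the filter, so they can be
    # pre-computed once and reused for every input window
    g_plus = (g[0] + g[1] + g[2]) / 2
    g_minus = (g[0] - g[1] + g[2]) / 2

    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * g_plus
    m3 = (d[2] - d[1]) * g_minus
    m4 = (d[1] - d[3]) * g[2]

    return np.array([m1 + m2 + m3, m2 - m3 - m4])

# Quick check against the direct sliding-window computation
d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([0.5, 1.0, -1.0])
direct = np.array([d[0:3] @ g, d[1:4] @ g])
assert np.allclose(winograd_f2_3(d, g), direct)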

Strassen’s Algorithm

The algorithm is named after the mathematician Volker Strassen, who stunned the community by showing that better algorithms for matrix multiplication do exist, overturning the long-held belief that the standard method was optimal. He found that a 2×2 matrix multiplication can be performed with 7 MULs instead of the 8 MULs required by the usual method. Let's see how it's done through the explanation below.

a. Tensor T representing the multiplication of two 2×2 matrices. Tensor entries equal to 1 are depicted in purple, and 0 entries are semi-transparent. The tensor specifies which entries from the input matrices to read, and where to write the result. For example, as c1 = a1·b1 + a2·b3, tensor entries located at (a1, b1, c1) and (a2, b3, c1) are set to 1. b. Strassen's algorithm for multiplying 2×2 matrices using 7 multiplications. c. Strassen's algorithm in tensor factor representation. The stacked factors U, V and W (green, purple and yellow, respectively) provide a rank-7 decomposition of T. The correspondence between arithmetic operations (b) and factors (c) is shown using the aforementioned colors.

Let a, b be the two 2×2 input matrices we want to multiply. Consider the expressions m1, m2, …, m7 shown in the picture. Each of them involves exactly one product, and their sums are used to evaluate c, the output matrix. It is a good exercise for the reader to verify the algorithm by evaluating the final c1, c2, c3, c4 expressions by hand. Of course, the saving in multiplications again comes at the cost of some extra additions, which is fine since additions are much cheaper to compute.
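For the curious, here is a small NumPy sketch of Strassen's 2×2 recipe (indices follow the article's row-major a1..a4, b1..b4, c1..c4 convention), checked against NumPy's own matrix product:

import numpy as np

def strassen_2x2(A, B):
    """Multiply two 2x2 matrices with 7 multiplications (Strassen, 1969)."""
    a1, a2, a3, a4 = A.ravel()
    b1, b2, b3, b4 = B.ravel()

    # 7 products of sums of entries of A and B
    m1 = (a1 + a4) * (b1 + b4)
    m2 = (a3 + a4) * b1
    m3 = a1 * (b2 - b4)
    m4 = a4 * (b3 - b1)
    m5 = (a1 + a2) * b4
    m6 = (a3 - a1) * (b1 + b2)
    m7 = (a2 - a4) * (b3 + b4)

    # Outputs are sums of the m's, no further multiplications needed
    c1 = m1 + m4 - m5 + m7
    c2 = m3 + m5
    c3 = m2 + m4
    c4 = m1 - m2 + m3 + m6
    return np.array([[c1, c2], [c3, c4]])

A = np.random.rand(2, 2)
B = np.random.rand(2, 2)
assert np.allclose(strassen_2x2(A, B), A @ B)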

The same algorithm can be formally represented as a sum of rank-1 tensors, as shown in the right half of the image. Consider the U, V, W tensors in the picture. The columns of U and V specify which elements of a and b, respectively, are summed before each multiplication, and the corresponding sums are then multiplied together. For example, look at the first column of U and V: U1=[1,0,0,1] means (a1+a4), V1=[1,0,0,1] means (b1+b4), and their product gives m1=(a1+a4)*(b1+b4). The rows of W specify which m's are summed to obtain each entry of c. We introduce this notation because DeepMind's AlphaTensor uses Reinforcement Learning (RL) to further optimize matrix multiplication (and hence convolution) by casting the search for the U, V, W tensors as a game.
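To make the tensor view concrete, the short sketch below (my own construction, written out from the formulas above rather than copied from the figure) builds the 4×4×4 matrix-multiplication tensor T, writes Strassen's factors as U, V, W matrices with one column per multiplication, and verifies that the sum of the seven rank-1 tensors reconstructs T exactly:

import numpy as np

# Matrix-multiplication tensor for 2x2 matrices:
# T[i, j, k] = 1 iff c_k receives the product a_i * b_j (row-major indexing)
T = np.zeros((4, 4, 4))
for r in range(2):
    for c in range(2):
        for k in range(2):
            T[2 * r + k, 2 * k + c, 2 * r + c] = 1

# Strassen's rank-7 factors: column t of U and V encodes which entries of a and b
# are summed before multiplication m_t, and column t of W says which c's receive m_t
U = np.array([[1, 0, 1, 0, 1, -1, 0],
              [0, 0, 0, 0, 1, 0, 1],
              [0, 1, 0, 0, 0, 1, 0],
              [1, 1, 0, 1, 0, 0, -1]])
V = np.array([[1, 1, 0, -1, 0, 1, 0],
              [0, 0, 1, 0, 0, 1, 0],
              [0, 0, 0, 1, 0, 0, 1],
              [1, 0, -1, 0, 1, 0, 1]])
W = np.array([[1, 0, 0, 1, -1, 0, 1],
              [0, 0, 1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0, 0, 0],
              [1, -1, 1, 0, 0, 1, 0]])

# Summing the 7 rank-1 tensors u_t ⊗ v_t ⊗ w_t reconstructs T exactly
assert np.allclose(np.einsum('it,jt,kt->ijk', U, V, W), T)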

Fast Fourier Transform (FFT)

Based on the convolution theorem (a simple yet profound mathematical result), convolution in the time/space domain is multiplication in the frequency domain. Convolution is an expensive, involved operation, and its equivalence with simple multiplication in another (frequency) domain is a marvel. The FFT is an easy-to-implement, reversible (through the IFFT, or inverse FFT) and fast way to convert space-domain signals to the frequency domain. Here is a breathtaking video by Veritasium on what the FFT is and how it transformed the world.

Example of using Fourier transforms for convolution

As shown in the example above, we first convert the input image and kernel into the frequency domain through the FFT (F), then perform their point-wise multiplication, and finally apply the IFFT to the result to obtain the convolution output. This seemingly long, 3-step process is actually very fast and is used under the hood by many frameworks to perform efficient convolution. The following code sketch further expounds the explanation above.

import torch.nn.functional as F
from torch.fft import rfftn, irfftn

def fft_conv2d(input, kernel):
    # Zero-pad the kernel to the input's spatial size
    kernel_p = F.pad(kernel, [0, input.size(-1) - kernel.size(-1),
                              0, input.size(-2) - kernel.size(-2)])
    # FFT over the spatial dimensions
    input_ft = rfftn(input, dim=(-2, -1))
    kernel_ft = rfftn(kernel_p, dim=(-2, -1))
    # Point-wise multiply (single channel here; conjugating the kernel gives
    # the cross-correlation that DL frameworks call convolution) and invert
    output = irfftn(input_ft * kernel_ft.conj(), s=input.shape[-2:], dim=(-2, -1))
    # Crop to the "valid" output size: input - kernel + 1
    return output[..., :input.size(-2) - kernel.size(-2) + 1,
                       :input.size(-1) - kernel.size(-1) + 1].contiguous()
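A quick way to convince ourselves the frequency-domain route is correct is to compare it with PyTorch's direct convolution on a small hypothetical input:

import torch

x = torch.rand(1, 1, 5, 5)
k = torch.rand(1, 1, 3, 3)

fft_out = fft_conv2d(x, k)
direct = torch.nn.functional.conv2d(x, k)
assert torch.allclose(fft_out, direct, atol=1e-5)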

Experiments & Analysis

Mean latency over several hundred runs was measured for each of the convolution methods discussed, across different input and kernel sizes, using Colab on an N1 CPU and a T4 GPU. The observations were as follows:

Naive convolution:

  • GPU performance is far worse than the CPU's, since a single GPU core is not as capable as a CPU core (in terms of both memory and compute features)
  • Performance degrades with increasing input size much faster than for the other methods

Im2Col:

  • Very close to FFT’s performance in many cases
  • Performs best when kernel is small as transformation costs are minimal
  • Changing from unequal, non-unity strides to equal strides changes Im2Col’s performance by 100X

Winograd’s:

  • Performs best for small kernels but degrades for larger inputs as the transform terms (the g-functions) become more and more unwieldy

Strassen’s:

  • Strassen's construction as presented applies to 2×2 multiplications (larger matrices require recursive blocking), so it wasn't used in the experiments with bigger inputs!

FFT:

  • FFT seems to be the best of all, despite performing multiple conversions (FFT/IFFT), due to the complexity reduction from O(N²K²) to O(N²log N) (where N is the input size and K the kernel size)
  • FFT on GPU seems to be slightly faster than on CPU as FFT/IFFT is inherently parallelizable

DeepMind’s Conv2D Baby: AlphaTensor

DeepMind's recent works have been the talk of the town. Relying on the same core idea of playing games at a super-human level using RL, they have come a long way, with recent progress in the sciences (biology, physics) and in computing itself through AlphaTensor: DeepMind's AI for discovering efficient matrix-multiplication algorithms (which in turn helps convolution).

As discussed under Strassen's algorithm, the game AlphaTensor plays is that of finding the columns of U, V, W one at a time. Since the target tensor representing matrix multiplication is known in advance, the agent keeps subtracting the rank-1 tensor formed by its chosen columns from the residual, and it is penalized until the residual reaches zero. This ensures that any algorithm it comes up with always yields a verifiably correct result.
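As a rough illustration of the game mechanics (a simplified sketch that ignores details such as modular arithmetic and change of basis), each move subtracts one rank-1 tensor from the residual:

import numpy as np

def tensor_game_step(residual, u, v, w):
    """One simplified move: subtract the rank-1 tensor u⊗v⊗w from the residual.
    The episode ends when the residual is all zeros; fewer moves means fewer
    multiplications in the discovered algorithm."""
    return residual - np.einsum('i,j,k->ijk', u, v, w)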

AlphaTensor RL agent’s training pipeline

The sooner it reaches a zero residual, the fewer multiplications the resulting algorithm needs, since the columns of U, V, W correspond to the multiplications, as we saw under Strassen's algorithm. The figure above depicts AlphaTensor's training pipeline, which uses a lot of smart techniques like change of basis, synthetic demonstrations and a memory buffer, which will be topics for future posts :) Refer to DeepMind's paper and blog to learn more about AlphaTensor.

One example highlighted in the blog concerns the multiplication of 4x5 and 5x5 matrices: the standard algorithm takes 100 multiplications, human ingenuity had brought this down to 80, and AlphaTensor found algorithms that do the same operation using just 76 multiplications!

AlphaTensor shows how machine learning can go beyond human intuition in discovering things like mathematical algorithms. Further research into applying machine learning in these areas could lead to breakthroughs in computational efficiency or in pure mathematics.

Conclusion

In this article, we discussed a few major types of convolution implementations, starting from the naive approach and exploring Im2Col, Winograd's, Strassen's and FFT-based methods. We also analyzed, through latency measurements on an N1 CPU and a T4 GPU, when each one is most efficient and when it underperforms. We further saw that cutting-edge research is still happening to speed up matrix multiplication (or convolution in an equivalent domain), something we often take for granted. It seems the very nn.Conv2d engine we have built is finding new ways to evolve itself. It is a very exciting time to be alive and be a part of this field!

