Understanding ‘Winograd Fast Convolution’

Deepak Mangla
Jul 18, 2019


Winograd vs. cuDNN. Image source: Intel AI

Deep learning thrives on speed. Faster training enables the construction of larger and more complex networks. We always want faster networks: to detect pedestrians sooner in autonomous cars, to run networks on resource-constrained embedded devices, and for countless other reasons. In CNN architectures, most of the time is consumed by convolution layers. Today we will talk about the Winograd algorithm, which can reduce the number of floating-point multiplications by a factor of 2.25x. Reference: http://arxiv.org/abs/1509.09308

Before we start talking about Winograd, I expect you understand how convolutions are usually implemented in deep learning libraries. They are not implemented the way we visualize convolutions in our mind: a naive sliding-window convolution is slow because it makes poor use of the CPU cache and locality of reference. Instead, we convert the convolution operation into a matrix multiplication. Let's see how it's done.

Suppose we have an input f of size 4, with elements (f0, f1, f2, f3), and a filter g of size 3, with elements (g0, g1, g2).

Then, using the im2col technique, we convert the input into a matrix of overlapping windows, so that the convolution becomes a matrix multiplication.
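
For our small example, the im2col form works out to the following product (each row is one sliding-window position of f):

```latex
\begin{bmatrix} f_0 & f_1 & f_2 \\ f_1 & f_2 & f_3 \end{bmatrix}
\begin{bmatrix} g_0 \\ g_1 \\ g_2 \end{bmatrix}
=
\begin{bmatrix} f_0 g_0 + f_1 g_1 + f_2 g_2 \\ f_1 g_0 + f_2 g_1 + f_3 g_2 \end{bmatrix}
```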

Don’t freak out just now. It may feel like we have increased memory consumption unnecessarily, but now we can perform the matrix multiplication using BLAS libraries like cuBLAS (GPU) or Intel MKL (CPU), which are extremely optimized for matrix multiplications.
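
As a minimal sketch of the idea (the input and filter values here are arbitrary, and NumPy's matmul stands in for the BLAS call), the whole convolution reduces to one GEMM:

```python
import numpy as np

f = np.array([1.0, 2.0, 3.0, 4.0])   # input of size 4
g = np.array([0.5, 1.0, -1.0])       # filter of size 3

# im2col: stack the two overlapping windows of f as rows of a matrix
F = np.stack([f[0:3], f[1:4]])        # shape (2, 3)

y = F @ g                             # one matrix multiplication (a BLAS GEMM)
print(y)                              # the two convolution outputs
```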

If you want to understand this deeply, check out https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/making_faster.html

But we are humans, we always strive for better. So, how can we further increase the speed? This is where Winograd comes into the picture. Instead of computing the two dot products directly, the Winograd approach expresses the two outputs in terms of four intermediate products m1, m2, m3, m4.
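
Written out with our f and g, these are the F(2, 3) minimal filtering formulas from the Lavin and Gray paper:

```latex
\begin{bmatrix} f_0 & f_1 & f_2 \\ f_1 & f_2 & f_3 \end{bmatrix}
\begin{bmatrix} g_0 \\ g_1 \\ g_2 \end{bmatrix}
=
\begin{bmatrix} m_1 + m_2 + m_3 \\ m_2 - m_3 - m_4 \end{bmatrix},
\qquad
\begin{aligned}
m_1 &= (f_0 - f_2)\, g_0 \\
m_2 &= (f_1 + f_2)\, \tfrac{g_0 + g_1 + g_2}{2} \\
m_3 &= (f_2 - f_1)\, \tfrac{g_0 - g_1 + g_2}{2} \\
m_4 &= (f_1 - f_3)\, g_2
\end{aligned}
```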

This way we can find the values of m1, m2, m3, m4 and use them to compute the convolution instead of the dot products. One observation we can make here is that (g0 + g1 + g2) / 2 and (g0 - g1 + g2) / 2 do not need to be recalculated at each convolution operation, because the filter stays the same. We can compute them once per filter during training, and keep them precomputed during inference. Now, we will need:

4 ADD and 4 MUL operations to calculate m1, m2, m3, m4, plus 4 ADD operations to combine them into the result. With the normal dot products we would be doing 6 MUL operations instead of 4. This reduces the computationally expensive MUL operations by a factor of 1.5x, which is very significant.
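
Here is a minimal sketch of this 1D F(2, 3) computation in NumPy (function and variable names are my own), checked against the plain im2col dot product from before:

```python
import numpy as np

def winograd_f2_3(f, g):
    """Two outputs of a 1D convolution (size-4 input, size-3 filter)
    using 4 multiplications instead of 6."""
    # Filter-side terms: depend only on g, so they can be precomputed once per filter.
    g_sum  = (g[0] + g[1] + g[2]) / 2
    g_diff = (g[0] - g[1] + g[2]) / 2

    m1 = (f[0] - f[2]) * g[0]
    m2 = (f[1] + f[2]) * g_sum
    m3 = (f[2] - f[1]) * g_diff
    m4 = (f[1] - f[3]) * g[2]

    return np.array([m1 + m2 + m3, m2 - m3 - m4])

f = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([0.5, 1.0, -1.0])

direct = np.stack([f[0:3], f[1:4]]) @ g   # im2col matrix multiplication
print(winograd_f2_3(f, g), direct)        # both give the same two outputs
```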

In the above example we computed what the paper calls F(2, 3): two outputs of a size-3 filter, computed from an input tile of size 4. A minimal 1D algorithm F(m, r) is nested with itself to obtain a minimal 2D algorithm, F(m x m, r x r). If we do this with a 4x4 input tile and a 3x3 filter, i.e. F(2x2, 3x3), we get 4 outputs from 4*4 = 16 MULs, versus 2*2*9 = 36 MULs for the normal convolution. That reduces MULs by a factor of 2.25x, which is kind of crazy!!
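
As a sketch of that nested 2D case, the paper writes the 2x2 output tile as A^T [(G g G^T) ⊙ (B^T d B)] A, where ⊙ is the element-wise product and B^T, G, A^T are the F(2x2, 3x3) transform matrices from the paper. The element-wise product in the middle is exactly the 16 multiplications (function and variable names below are my own):

```python
import numpy as np

# Transform matrices for F(2x2, 3x3), as given in Lavin & Gray (2015).
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=float)
G   = np.array([[1.0,  0.0, 0.0],
                [0.5,  0.5, 0.5],
                [0.5, -0.5, 0.5],
                [0.0,  0.0, 1.0]])
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)

def winograd_f2x2_3x3(d, g):
    """2x2 output tile from a 4x4 input tile d and a 3x3 filter g,
    using 4x4 = 16 element-wise multiplications."""
    U = G @ g @ G.T          # transformed filter (precomputable once per filter)
    V = B_T @ d @ B_T.T      # transformed input tile
    M = U * V                # the 16 MULs
    return A_T @ M @ A_T.T   # 2x2 output tile

d = np.arange(16, dtype=float).reshape(4, 4)   # example 4x4 input tile
g = np.array([[1., 0., -1.],
              [2., 1.,  0.],
              [0., 1.,  1.]])

# Direct sliding-window computation for comparison: 2*2*9 = 36 MULs.
direct = np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)]
                   for i in range(2)])
print(winograd_f2x2_3x3(d, g))
print(direct)
```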

Conclusion

I hope this article helped you appreciate how much optimization goes on behind the libraries you use, and gave you a good intuition for how Winograd works. In the next post, we will further discuss Nesting Minimal Filtering Algorithms, where we will see how Winograd is implemented for different kernel sizes.
