Convolution: Image Filters, CNNs and Examples in Python & Pytorch
Introduction
Two-dimensional (2D) convolution is well known in digital image processing, where it is used to apply filters that blur an image, enhance sharpness, assist in edge detection, and so on. With the advent of Convolutional Neural Networks (CNNs), convolutions gained renewed interest. Convolutions are based on the idea of taking a filter, also called a kernel, and sliding it over an input image to produce an output image. This story gives a brief explanation of convolution using visual examples and code snippets in Python that show how to implement a simple convolution.
Convolution
Convolution is a linear operator widely used in signal processing that, from two given functions, results in a third that measures the sum of the product of these functions along the domain implied by their superposition. In digital image processing in particular, convolution is a mathematical method for combining two images to produce a third image. Typically, one of the two combined images is not an image itself, but a matrix whose size and values determine the nature of the effect of the convolution process; this matrix is called a filter or kernel. The basic idea is to align the kernel over each pixel of the image and multiply and sum its values over the pixel and its local neighbours. For those more interested in mathematics, the formula for convolution in two dimensions is given by Equation 1 [1]:
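A standard form of the discrete 2D convolution, written with the symbols defined next, is sketched below. The output name I' is an assumption introduced here, and the exact notation used in [1] may differ:

```latex
I'(u, v) = \sum_{i=-\infty}^{\infty} \sum_{j=-\infty}^{\infty} I(u - i,\, v - j)\, H(i, j)
```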
where I and H are two-dimensional functions, (u, v) are the coordinates of the pixel in the image and (i, j) are the coordinates of the kernel element. Equation 2 is a form more closely related to the focus of this story [2]:
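One way to write this finite-kernel form, consistent with the symbols defined next, is sketched below. Here I (the input image) and G (the output image) are labels assumed for illustration, and [2] may use different notation:

```latex
G(x, y) = \sum_{i=-N}^{N} \sum_{j=-N}^{N} F(i, j)\, I(x - i,\, y - j)
```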
where F is the filter, or kernel, which has an odd number of elements per side and is represented by a (2N+1) x (2N+1) matrix, (x, y) are the coordinates of the pixel in the image and (i, j) are the coordinates of the kernel element. The animation in Figure 1 illustrates this process on a 7x7 image. The image on the left is the original image and the one on the right is the result of the applied convolution. Note that the pixel resulting from the multiplication and addition operations is stored in the new image, while the original image pixels remain unchanged. I have borrowed some of the following figures and animations from another story of mine on Binarisation of documents using the U-Net.
Figure 1: Convolution of a 7x7 image with a stride of 1. Note that the resulting image (right) is smaller than the original image (left), in this case 5x5 instead of 7x7. Animation by author.
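The sliding-window process in the animation can be sketched in a few lines of NumPy. This is a minimal illustration, not the code behind the figure: the 7x7 input values and the 3x3 averaging kernel are assumptions chosen only to reproduce the 7x7 to 5x5 size change, and the kernel is applied without flipping.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` with no padding and stride 1.

    The result is written to a new array; `image` is never modified.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=float)
    for y in range(out_h):
        for x in range(out_w):
            # Multiply the kernel by the window under it and sum the products
            window = image[y:y + kh, x:x + kw]
            out[y, x] = np.sum(window * kernel)
    return out

image = np.arange(49, dtype=float).reshape(7, 7)  # hypothetical 7x7 input
kernel = np.ones((3, 3)) / 9.0                    # hypothetical 3x3 mean kernel
result = convolve2d_valid(image, kernel)
print(result.shape)  # (5, 5)
```

Each output value comes from one position of the kernel, which is why a 3x3 kernel trims one pixel from every side of the image.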
The most common use of convolutions in digital image processing is the implementation of filters for edge detection, blurring and noise reduction. Although the effect of convolutions in intermediate layers in CNNs is well known, it can be made even clearer by showing their effect as digital filters.
Figure 2 shows the convolutions applied to the left image with different kernels, first a blur and then Sobel edge detection. These two convolutions were applied to a greyscale version of the original image. In CNNs, convolution is performed separately for each RGB channel, which is not common in image processing as it leads to an undesired result, like the one on the right in Figure 2. The two matrices in the bottom row of Figure 2 represent the kernels used, and during convolution the target pixel in the image must match the centre of the kernel (red square). Convolutions are part of the implementation of various digital image processing filters such as blurring, edge detection (Sobel and Laplacian), etc. The kernel size directly affects the final result, as shown in the following example (Figure 3) where two kernels with the same element values but different sizes, one 3x3 and the other 5x5, are applied to the same image.
The images in Figure 3 were created using the code in Listing 1 below. It’s not the best way to implement convolutions but it’s useful for illustrating the process from scratch.
################################# Listing 1 ###############################
import cv2
import numpy as np

blur_3x3 = np.array((
    [1, 1, 1],
    [1, 1, 1],
    [1, 1, 1]), dtype="int")

blur_5x5 = np.array((
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1]), dtype="int")

def conv(img_in):
    # Load the image and convert it to greyscale with values in [0, 1]
    img = cv2.imread(img_in)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) / 255.

    height = img.shape[0]
    width = img.shape[1]

    # Output image; border pixels the kernel centre cannot reach keep the value 1
    img_res = np.zeros([height, width, 3], dtype=np.uint8)
    img_res.fill(1)

    kernel = blur_3x3  # or blur_5x5
    height_k = kernel.shape[0]
    width_k = kernel.shape[1]
    factor = height_k * width_k  # normalisation: number of kernel elements

    # Distance from the kernel border to its centre
    init_x = kernel.shape[1] // 2
    init_y = kernel.shape[0] // 2

    for x in range(init_x, width - init_x):
        for y in range(init_y, height - init_y):
            resulting_pixel = 0
            # Multiply and sum the kernel over the pixel and its neighbours
            for xk in range(width_k):
                for yk in range(height_k):
                    resulting_pixel += kernel[yk, xk] * img[y + (yk - init_y),
                                                            x + (xk - init_x)]
            img_res[y, x] = int((resulting_pixel / factor) * 255)

    cv2.imshow('', img_res)
    cv2.waitKey(0)
The image in Figure 4 will be used as the input image in the various examples that follow in this story.
Some more common kernels are shown in Figure 5. Depending on the intended use, the values and their type (integer, float) may vary. For example, a 3x3 blur kernel might contain 1/9 in every element, which removes the need for the final division by 9. The images next to each filter are representations commonly seen in visualisations of weights in CNNs. Figure 6 shows examples of the application of these filters.
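As a rough illustration of how different kernel values change the result, the sketch below applies three textbook kernels to a tiny synthetic image with one vertical edge. The kernel values are the usual textbook ones and are an assumption here; they are not necessarily the exact values shown in Figure 5, and `apply_kernel` is a hypothetical helper.

```python
import numpy as np

def apply_kernel(image, kernel):
    """Apply `kernel` to `image` with no padding and stride 1 (no kernel flip)."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

# Blur kernel already normalised: every element is 1/9, so no final division
blur = np.full((3, 3), 1.0 / 9.0)
# Horizontal Sobel kernel (responds to vertical edges)
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
# 4-neighbour Laplacian kernel
laplacian = np.array([[0, 1, 0],
                      [1, -4, 1],
                      [0, 1, 0]], dtype=float)

# Synthetic 5x6 image: dark on the left, bright on the right
image = np.zeros((5, 6))
image[:, 3:] = 1.0

blurred = apply_kernel(image, blur)
edges = apply_kernel(image, sobel_x)
second_derivative = apply_kernel(image, laplacian)
print(edges[0])  # [0. 4. 4. 0.]
```

The blur spreads the edge over several pixels, while the Sobel kernel responds only where the intensity changes.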
Padding and stride are two important concepts in convolutional neural networks (CNNs) that affect the size of the output feature maps.
Padding: as can be seen from the examples in Figure 3, each resulting image is smaller than the original by an amount related to the kernel size; the larger the kernel, the farther its centre is from the edge of the image. To produce an output the same size as the input, we pad the edges with extra pixels. That way, as the kernel slides, its centre can be placed on the original border pixels while the rest of the kernel extends over the extra pixels beyond the border.
Figure 7 shows the original image used as a reference for the edge-fill (padding) examples. Four coloured dots have been added at the corners to help demonstrate the difference between the methods.
Figure 7: Original image. Four coloured dots (red, green, blue and gray) have been added at the corners to help demonstrate the difference in each method. Image by author.
Figure 8: Padding (top left) with BORDER_CONSTANT (in yellow), (top right) with BORDER_REFLECT, (bottom left) with BORDER_REPLICATE and (bottom right) with BORDER_WRAP. Image by author.
Figure 8 shows four methods of padding using OpenCV’s copyMakeBorder function. Which method to choose depends on the problem at hand. Remember that we apply padding so that the resulting image has the same dimensions as the input image and so that the pixels at the edges of the image are included in the result.
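To see what each border mode does to the pixel values, the sketch below pads a short row of numbers. It uses NumPy's `np.pad` rather than OpenCV so that it runs without an image; the mapping of NumPy modes to the OpenCV border types in the comments is my own pairing based on their documented behaviour.

```python
import numpy as np

row = np.array([1, 2, 3, 4])
pad = 2  # number of extra values added on each side

# Analogue of BORDER_CONSTANT: fill with a fixed value (0 here)
constant = np.pad(row, pad, mode="constant", constant_values=0)
# Analogue of BORDER_REFLECT: mirror the values, repeating the edge value
reflect = np.pad(row, pad, mode="symmetric")
# Analogue of BORDER_REPLICATE: repeat the edge value
replicate = np.pad(row, pad, mode="edge")
# Analogue of BORDER_WRAP: continue with values from the opposite side
wrap = np.pad(row, pad, mode="wrap")

print(constant)   # [0 0 1 2 3 4 0 0]
print(reflect)    # [2 1 1 2 3 4 4 3]
print(replicate)  # [1 1 1 2 3 4 4 4]
print(wrap)       # [3 4 1 2 3 4 1 2]
```

The same modes work on 2D arrays, where the padding is applied to rows and columns alike.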
Figure 9 shows an animation similar to Figure 1. The difference is that here the input image, the one on the left, has dimensions 5x5 and a border of zeros was added around it, generating a 7x7 image. The image on the right, the result of the applied convolution, now has the same dimensions as the original image, 5x5.
Figure 9: Convolution with padding (a border with zeros around the image). Note that the resulting image (right) has the same dimensions as the original image (5x5) without the border added (left). The green rectangle represents the kernel and the red the target pixel. Animation by author.
Stride: the stride is the number of pixels by which the kernel window shifts over the input array at each step. A stride of 1 means the kernel visits every pixel in the image without skipping any. A stride of 2 means consecutive kernel positions are two pixels apart. The larger the stride, the smaller the resulting image. In Figure 1 and Figure 9 the stride is equal to 1. Figure 10 shows the difference in kernel movement with stride=2, resulting in a 3x3 image.
Figure 10: Convolution with padding (a border with zeros around the image) and stride=2. Note that the resulting image (right) has smaller dimensions (3x3) than the original image (7x7). The green rectangle represents the kernel and the red the target pixel. Animation by author.
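The effect of padding and stride on the output size can be reproduced with a small NumPy sketch. The function name and the all-ones input and kernel are assumptions made for illustration; only the sizes (5x5 input, one pixel of zero padding, 3x3 kernel) follow the animations.

```python
import numpy as np

def convolve2d(image, kernel, stride=1, padding=0):
    """Sliding-window convolution with zero padding and a configurable stride."""
    padded = np.pad(image, padding, mode="constant")
    kh, kw = kernel.shape
    out_h = (padded.shape[0] - kh) // stride + 1
    out_w = (padded.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            # The window's top-left corner moves `stride` pixels per step
            ys, xs = y * stride, x * stride
            out[y, x] = np.sum(padded[ys:ys + kh, xs:xs + kw] * kernel)
    return out

image = np.ones((5, 5))
kernel = np.ones((3, 3))

same = convolve2d(image, kernel, stride=1, padding=1)     # as in Figure 9
strided = convolve2d(image, kernel, stride=2, padding=1)  # as in Figure 10
print(same.shape, strided.shape)  # (5, 5) (3, 3)
```

With stride=2 the kernel simply skips every other position, so the strided result equals every second row and column of the stride-1 result.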
The relationship between the input size, kernel size, padding, and stride can be expressed using the following formula, where the division is rounded down:
output_size = floor((input_size - kernel_size + 2 * padding) / stride) + 1
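The formula can be checked against the three animations with a one-line helper. The function name is hypothetical, and integer (floor) division is assumed for cases where the stride does not divide evenly.

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Output size along one dimension of a convolution."""
    return (input_size - kernel_size + 2 * padding) // stride + 1

# Figure 1: 7x7 input, 3x3 kernel, no padding, stride 1
print(conv_output_size(7, 3))                       # 5
# Figure 9: 5x5 input, 3x3 kernel, padding 1, stride 1
print(conv_output_size(5, 3, padding=1))            # 5
# Figure 10: 5x5 input, 3x3 kernel, padding 1, stride 2
print(conv_output_size(5, 3, padding=1, stride=2))  # 3
```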
In convolutional neural networks (CNNs), the convolution step is a fundamental operation that applies a set of filters (kernels) to an input image or feature map to extract important features, each filter producing a feature map. The number of resulting feature maps corresponds to the number of filters, as shown in Figure 11.
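A minimal NumPy sketch of this idea follows. The sizes (a 3-channel 8x8 input and four 3x3 filters) and the random values are assumptions chosen only to show that the number of output feature maps equals the number of filters; `conv_layer` is a hypothetical helper, not a library function.

```python
import numpy as np

def conv_layer(x, filters):
    """Apply a bank of filters to a multi-channel input (no padding, stride 1).

    x:       array of shape (channels, height, width)
    filters: array of shape (n_filters, channels, k, k)
    returns: array of shape (n_filters, height - k + 1, width - k + 1)
    """
    n_filters, _, kh, kw = filters.shape
    out_h = x.shape[1] - kh + 1
    out_w = x.shape[2] - kw + 1
    out = np.zeros((n_filters, out_h, out_w))
    for f in range(n_filters):
        for row in range(out_h):
            for col in range(out_w):
                # Each filter spans all input channels and yields one value
                window = x[:, row:row + kh, col:col + kw]
                out[f, row, col] = np.sum(window * filters[f])
    return out

rng = np.random.default_rng(0)
x = rng.random((3, 8, 8))           # e.g. an RGB image, 8x8 pixels
filters = rng.random((4, 3, 3, 3))  # four 3x3 filters over 3 channels
feature_maps = conv_layer(x, filters)
print(feature_maps.shape)  # (4, 6, 6)
```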
1×1 Convolution
One of the disadvantages of deep convolutional networks is that the number of feature maps often increases with the depth of the network. This can lead to a significant increase in the number of parameters and in computational complexity, especially when large filter sizes such as 5x5 and 7x7 are used. One solution is to use a 1×1 filter to reduce the depth, that is, the number of feature maps. This technique was introduced in the Network in Network paper by Lin et al. [3] and is used extensively in Google’s Inception architecture [4].
In a 1x1 convolution, the input is convolved with filters of size 1x1, usually with no padding (a padding of zero) and a stride of 1. It is effectively a 1x1xC operation, where C is the number of input channels or feature maps: each output value combines the C values at a single spatial position and does not involve neighbouring pixels within the same channel. The 1x1 convolution is applied like a 2D convolution, from left to right and top to bottom with a stride of 1 and no need for padding, resulting in a feature map with the same width and height as the input. On its own, the 1x1 convolution is a linear combination of the channels, but non-linearity can be added by applying an activation function to the output.
The animation in Figure 12 shows an example of a 1x1x7 convolution applied to a 3x3x7 activation map, resulting in a 3x3x1 activation map.
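Because a 1x1 convolution only mixes channels, it can be sketched as a dot product over the channel axis. The random activation map and weights below are assumptions used purely for illustration; the shapes follow the 3x3x7 example.

```python
import numpy as np

rng = np.random.default_rng(0)
activation = rng.random((3, 3, 7))  # height x width x channels
weights = rng.random(7)             # a single 1x1x7 filter

# At every (row, col) position, take the weighted sum of the 7 channel values
output = np.tensordot(activation, weights, axes=([2], [0]))[..., np.newaxis]
print(output.shape)  # (3, 3, 1)

# Optional non-linearity applied to the linear combination (ReLU here)
activated = np.maximum(output, 0.0)
```

Using several such filters instead of one would give several output channels, which is how the depth of a feature map is reduced or expanded.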
See here for a detailed explanation of the 1x1 convolution.
3D Convolution
In all the previous considerations and examples, convolution has been applied to images or matrices with two dimensions, but the same idea works for three-dimensional matrices. In the case of a three-dimensional matrix, as is often the case with convolutions in CNN-based models, the kernel will also be a 3D matrix. In the animation in Figure 13, the kernel is a cube of dimensions 3x3x3 that traverses another cube of dimensions 5x5x5. As with convolution in two dimensions, the result is a single value at each iteration. Since stride=1 and no padding was applied in this example, the resulting cube has dimensions 3x3x3.
The same padding and stride concepts apply for three-dimensional convolutions.
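The 5x5x5 example can be sketched by adding one more loop to the 2D case. The all-ones volume and kernel are assumptions that make the result easy to check by hand: every output value is the number of kernel elements, 27.

```python
import numpy as np

def convolve3d_valid(volume, kernel):
    """Slide a 3D kernel through a 3D volume (no padding, stride 1)."""
    kd, kh, kw = kernel.shape
    out_d = volume.shape[0] - kd + 1
    out_h = volume.shape[1] - kh + 1
    out_w = volume.shape[2] - kw + 1
    out = np.zeros((out_d, out_h, out_w))
    for z in range(out_d):
        for y in range(out_h):
            for x in range(out_w):
                block = volume[z:z + kd, y:y + kh, x:x + kw]
                out[z, y, x] = np.sum(block * kernel)
    return out

volume = np.ones((5, 5, 5))
kernel = np.ones((3, 3, 3))
result = convolve3d_valid(volume, kernel)
print(result.shape)  # (3, 3, 3)
```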
Visualisation of Filters in Pytorch
Neural networks are usually initialised with random values. In the case of CNNs, these initial values are the filter elements (kernels). As training progresses, these filters take on an “organised” aspect, and it becomes possible to recognise a certain pattern in each of them. This difference between untrained and trained kernels can be seen in Figure 14.
As the training evolves and the random aspect of the kernels is abandoned, the feature maps also change, as they are the result of convolution by these kernels. It is possible to observe (Figure 15) similarities in the feature maps resulting from the trained kernels with the results of applying the filters shown in Figure 6 (Sobel, emboss, Laplacian, etc.).
Figure 16 shows the behaviour of the feature maps as the convolutional layers get deeper.
Figure 16 keeps the same size for all feature maps for better visualisation, but in reality there is a decrease in dimensions as shown in Figure 17.
Transposed Convolution
The transposed convolutional layer, whose purpose is essentially to increase the dimensions (height and width) of its input, is often incorrectly called a deconvolutional layer. A deconvolution layer would reverse the operation of a convolution layer and recover the original input. The transposed convolution reverses the convolution not in terms of values but in terms of dimensions. Transposed convolutions can be considered standard convolutions applied to a modified input feature map.
To modify the input image or feature map, rows and columns of zeros are inserted between the existing values of the input. The number of rows/columns inserted between neighbouring values is stride-1 (stride and padding also apply here). In the following example (Figure 18) stride=2 and padding=1.
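The zero-insertion view can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: a square kernel, padding no larger than kernel_size - 1, a border of kernel_size - padding - 1 zeros around the zero-inserted input, and a kernel rotated by 180° before the stride-1 sliding window. The function name and the sample values are hypothetical.

```python
import numpy as np

def transposed_conv2d(x, kernel, stride=1, padding=0):
    """Transposed convolution of a 2D array via zero insertion."""
    k = kernel.shape[0]  # square kernel assumed
    h, w = x.shape
    # Step 1: insert (stride - 1) zeros between neighbouring input values
    dilated = np.zeros(((h - 1) * stride + 1, (w - 1) * stride + 1))
    dilated[::stride, ::stride] = x
    # Step 2: surround the result with (k - padding - 1) zeros on every side
    dilated = np.pad(dilated, k - padding - 1, mode="constant")
    # Step 3: ordinary stride-1 sliding window with the kernel rotated 180 degrees
    flipped = kernel[::-1, ::-1]
    out_h = dilated.shape[0] - k + 1
    out_w = dilated.shape[1] - k + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(dilated[i:i + k, j:j + k] * flipped)
    return out

x = np.arange(1.0, 10.0).reshape(3, 3)       # hypothetical 3x3 input
kernel = np.arange(1.0, 10.0).reshape(3, 3)  # hypothetical 3x3 kernel
y = transposed_conv2d(x, kernel, stride=2, padding=1)
print(y.shape)  # (5, 5)
```

The output size follows (input_size - 1) * stride - 2 * padding + kernel_size, so a 3x3 input grows to 5x5 with stride=2 and padding=1.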
The code in Listing 2 shows a very simple convolutional autoencoder (trained on the STL10 dataset), just to illustrate the use of transposed convolutions. After training for a thousand epochs, I obtained a very inefficient blur filter.
################################# Listing 2 ###############################
import torch.nn as nn

class ConvAutoEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: strided convolutions reduce height and width
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, (3, 3), stride=(3, 3), padding=(1, 1)),
            nn.ReLU(True),
            nn.Conv2d(16, 32, (3, 3), stride=(3, 3), padding=(1, 1)),
            nn.ReLU(True),
            nn.Conv2d(32, 64, (3, 3), stride=(2, 2), padding=(1, 1)),
            nn.ReLU(True)
        )
        # Decoder: transposed convolutions increase height and width again
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, (3, 3), stride=(2, 2), padding=(1, 1)),
            nn.ReLU(True),
            nn.ConvTranspose2d(32, 16, (3, 3), stride=(3, 3), padding=(1, 1)),
            nn.ReLU(True),
            nn.ConvTranspose2d(16, 1, (3, 3), stride=(3, 3), padding=(1, 1)),
            nn.Tanh()
        )

    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x
The evolution of a feature map through the layers can be seen in Figure 19.
These two stories (here and here) provide further explanations of transposed convolutions.
Convolution x Correlation
Convolution and correlation are mathematical operations that combine two functions or signals and are similar in concept. The fundamental difference between the two is the treatment of the kernel.
- In convolution, the filter/kernel is flipped (rotated 180 degrees) before the operation is performed. The flipped filter is then slid over the input signal, and at each position, an element-wise multiplication is performed between the filter and the corresponding portion of the input.
- In correlation, the filter/kernel is not flipped before the operation. It is directly slid over the input signal, and at each position, an element-wise multiplication is performed between the filter and the corresponding portion of the input.
In both cases the resulting products are summed up to obtain a single value. The process is repeated for each position, generating the output signal.
If the filter is symmetrical (unchanged by a 180° rotation), the results of correlation and convolution are the same. In this story, however, we can assume that we are applying a convolution in which the kernel has previously been rotated 180°.
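The difference can be checked numerically with a short NumPy sketch. The helper names, the 5x5 ramp image and the two kernels are assumptions chosen so that one kernel is symmetrical (a box blur) and the other is not (a horizontal Sobel kernel).

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Slide the kernel over the image as-is (no flip), no padding, stride 1."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

def convolve2d_valid(image, kernel):
    """Convolution: rotate the kernel by 180 degrees, then correlate."""
    return correlate2d_valid(image, kernel[::-1, ::-1])

image = np.arange(25, dtype=float).reshape(5, 5)
box = np.ones((3, 3)) / 9.0                  # symmetrical kernel
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)  # not symmetrical

same_for_box = np.allclose(correlate2d_valid(image, box),
                           convolve2d_valid(image, box))
corr_sobel = correlate2d_valid(image, sobel_x)
conv_sobel = convolve2d_valid(image, sobel_x)
print(same_for_box)                         # True
print(corr_sobel[0, 0], conv_sobel[0, 0])   # 8.0 -8.0
```

Rotating the Sobel kernel by 180° negates it, so the two operations give results of opposite sign, while the box blur is unaffected.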
Final considerations
This story briefly discusses the following topics related to convolution:
- Concept of convolution
- Difference between greyscale and colour images
- Elements of convolution: kernel, stride, padding
- Two-dimensional convolution in greyscale and colour images
- Illustration of different kernels when using filters
- 1x1 convolution
- Example of three-dimensional convolution using an animation
- Transposed convolution with a toy example in Pytorch
- Difference between convolution and correlation
Code with examples is available on GitHub.
References
[1] Gonzalez, R. C., Woods, R. E., Digital Image Processing. 4th edition. Pearson Prentice Hall, 2017.
[2] Burger, W. and Burge, M. J., Digital Image Processing: An Algorithmic Introduction using Java. Springer, 1st edition, 2008.
[3] Lin, M., Chen, Q. and Yan, S., Network in Network. CoRR, abs/1312.4400, 2013.
[4] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A., Going Deeper with Convolutions, arXiv:1409.4842v1 [cs.CV], 2014.