Speaking generally about convolutional neural networks

Sajjad Shumaly · Published in Analytics Vidhya · 7 min read · Sep 17, 2021

Last month, I gave a presentation to a group of physicists on how CNNs work. I kept it as simple as I could and received positive feedback, so I decided to share it with a larger community.

1. General terms

1.1. Briefly about pixels

Look at the picture below: an image with the very common size of 800 by 800 pixels, which comes to a total of 640,000 pixels. Even this simple image consists of more than half a million pixels.

Fig.1. Common image dimension

If we look more closely at a part of the image (Fig.2), it is clear that each object is made up of a number of pixels working together in different colors. Each pixel itself is a combination of three colors, red, green, and blue, each with an intensity from 0 to 255.

Fig.2. An image and its pixels

In each pixel, various colors can be produced by combining different intensities, from 0 to 255, of red, green, and blue. In Figure 3, shape 1 shows a state in which only the red light is on at maximum intensity and the other colors are off, so the entire pixel is completely red. Similarly, shapes 2, 3, and 4 show how to create purple, black, and white.

Fig.3. Produce different colors by pixels
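To make this concrete, here is a minimal NumPy sketch of those four pixel values (NumPy is my choice here, not something the article uses; I also assume purple comes from mixing red and blue):

```python
import numpy as np

# Each pixel is an (R, G, B) triple with intensities from 0 to 255.
red    = np.array([255,   0,   0], dtype=np.uint8)  # shape 1: only red, at full intensity
purple = np.array([255,   0, 255], dtype=np.uint8)  # shape 2: red and blue mixed
black  = np.array([  0,   0,   0], dtype=np.uint8)  # shape 3: all channels off
white  = np.array([255, 255, 255], dtype=np.uint8)  # shape 4: all channels at maximum

print(red, purple, black, white)
```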

So, when we talk about a colorful image, behind the scenes there are three numerical matrices representing the intensities of red, green, and blue (Fig.4).

Fig.4. Three channels of colors
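As a small illustrative sketch (the 2×2 image and its values are made up), a color image is literally a stack of three matrices, one per channel:

```python
import numpy as np

# A tiny 2x2 color image: behind the scenes it is three 2x2 matrices,
# one each for red, green, and blue.
image = np.zeros((2, 2, 3), dtype=np.uint8)
image[0, 0] = [255, 0, 0]      # red pixel
image[0, 1] = [255, 0, 255]    # purple pixel
image[1, 0] = [0, 0, 0]        # black pixel
image[1, 1] = [255, 255, 255]  # white pixel

# Split the stack back into its three intensity matrices.
red_channel   = image[..., 0]
green_channel = image[..., 1]
blue_channel  = image[..., 2]
print(red_channel)  # the "red" matrix on its own
```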

But when we talk about black and white images, we have just one matrix and there is no need for further dimensions, because in a black and white image the intensities of red, green, and blue are equal to each other (Fig.5). From here on we will talk only about black and white images to keep things simple.

Fig.5. Black and white image matrix
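A short sketch of that idea; averaging the channels is one common convention for producing the single grayscale matrix, not something the article specifies:

```python
import numpy as np

def to_grayscale(rgb_image: np.ndarray) -> np.ndarray:
    """Collapse the three color matrices into one.

    In a black and white image the three channels carry the same
    intensity, so a single matrix is enough; averaging the channels
    is a simple way to get that matrix from a color image.
    """
    return rgb_image.mean(axis=-1).astype(np.uint8)

# gray = to_grayscale(image)  # shape (H, W) instead of (H, W, 3)
```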

Now it is clear how pixels work, and we know there is always a matrix of numbers behind every image (Fig.6) that represents the light intensity of the monitor's pixels.

Fig.6. An image and its intensity matrix

1.2. Briefly about kernels

What is a kernel? Wikipedia says: “In image processing, a kernel, convolution matrix, or mask is a small matrix used for blurring, sharpening, embossing, edge detection, and more.” In general, kernels make it possible to manipulate a picture. In Figure 7, we generate an image (the converted image) from an original image using a kernel: each number of the kernel is multiplied by the corresponding number in the first patch of the original image, and the sum of those products becomes the first number of the converted image. Figures 8, 9, and 10 repeat the same operation, moving the kernel until it has covered every part of the original image (a small NumPy sketch follows the step figures below).

Fig.7. The kernel functionality _ step1
Fig.8. The kernel functionality _ step2
Fig.9. The kernel functionality _ step3
Fig.10. The kernel functionality _ step4
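Here is the NumPy sketch promised above; the 4×4 image and 2×2 kernel values are invented for illustration, but the sliding multiply-and-sum is exactly the operation of Figs. 7-10:

```python
import numpy as np

# A made-up original image and kernel, just to show the mechanics.
image = np.array([[1, 2, 0, 1],
                  [3, 1, 1, 0],
                  [0, 2, 2, 1],
                  [1, 0, 1, 3]])
kernel = np.array([[ 1, 0],
                   [-1, 1]])

kh, kw = kernel.shape
out_h = image.shape[0] - kh + 1
out_w = image.shape[1] - kw + 1
converted = np.zeros((out_h, out_w))

for i in range(out_h):          # move the kernel down the image
    for j in range(out_w):      # move the kernel across the image
        patch = image[i:i + kh, j:j + kw]
        # Multiply element-wise and sum: one number of the converted image.
        converted[i, j] = np.sum(patch * kernel)

print(converted)
```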

You can see an example of kernel movement over an image in figure 11.

Fig.11. Example of movement of the kernel over the original image

Figure 12 is the converted version of the Figure 11 image. On the right side of Fig.12 there is a vertical kernel: in other words, all numbers in its first column are equal to 1, in the second column to 0, and in the last to -1. Using this kernel, vertical lines can be seen accurately while horizontal lines are removed. That is why you cannot see the horizontal line inside the red box.

Fig.12. The functionality of the vertical kernel

Everything is the same for Figure 13, but this kernel is horizontal, so it is a good option for studying the horizontal lines of the original image accurately. Again, you cannot see the vertical line inside the red box (both kernels are sketched in code after Fig.13).

Fig.13. The functionality of the horizontal kernel
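A hedged sketch of both kernels; SciPy's correlate2d slides the kernel over the image just as in the figures (SciPy and the random sample image are my assumptions, and I take the horizontal kernel to be the transpose of the vertical one):

```python
import numpy as np
from scipy.signal import correlate2d

# The vertical kernel from Fig.12: first column 1, second 0, last -1.
vertical_kernel = np.array([[1, 0, -1],
                            [1, 0, -1],
                            [1, 0, -1]])

# The horizontal kernel from Fig.13: the same pattern, transposed.
horizontal_kernel = vertical_kernel.T

# Any grayscale image as a 2D array would work here.
gray = np.random.randint(0, 256, size=(64, 64)).astype(float)

# correlate2d slides the kernel over the image without the flip
# that formal convolution adds, matching the step-by-step figures.
vertical_edges = correlate2d(gray, vertical_kernel, mode="valid")
horizontal_edges = correlate2d(gray, horizontal_kernel, mode="valid")
```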

Using different kernels, it is possible to generate various images that highlight different aspects of the original. In the three images below, you can see more examples of different kernels. If you are interested in making your own kernels and trying them on your images, you can use this website: https://setosa.io/ev/image-kernels/

Fig.14. Example of the kernel abilities
Fig.15. Example of the kernel abilities
Fig.16. Example of the kernel abilities
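For reference, here are a few well-known kernels of this kind as NumPy arrays (these are the standard blur, sharpen, and emboss kernels; the exact kernels in Figs. 14-16 may differ):

```python
import numpy as np

blur = np.ones((3, 3)) / 9.0          # box blur: average of the 3x3 neighborhood
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]])    # emphasizes the center pixel over its neighbors
emboss = np.array([[-2, -1, 0],
                   [-1,  1, 1],
                   [ 0,  1, 2]])      # gives edges a raised, 3D look
```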

2. Convolutional Neural Network

In traditional approaches, image processing researchers, mostly electrical engineers, studied different kernel combinations to find the best ones for their goals. This was very time- and energy-consuming, and certainly not all effective kernel combinations could be identified. One of the most important capabilities of a CNN is that it can automatically check thousands or millions of kernel combinations against a defined goal and use the best ones. Figure 17 shows the different parts of a CNN model. The first part is the feature extractor: it takes an image, searches for the best combination of kernels, and extracts the best features for the final goal. The flatten layer is the output of the feature extraction part; you can think of it as a box of numbers representing the key features of the input image. So the feature extraction part of the CNN converts the input image into a box of meaningful numbers, and then a simple classifier, such as a fully connected neural network, analyzes those numbers and makes the decision (a rough Keras sketch follows Fig.17).

Fig.17. Convolutional neural network parts
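Here is the rough Keras sketch mentioned above, mirroring the two parts of Fig.17; the layer sizes, input shape, and two-class output are illustrative assumptions, not the article's model:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    # Feature extractor: the model learns its own kernels here.
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(128, 128, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    # Flatten: the "box of numbers" describing the input image.
    layers.Flatten(),
    # Classifier: a simple neural network deciding from those numbers.
    layers.Dense(64, activation="relu"),
    layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```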

The first famous CNN model, AlexNet, was introduced in 2012 (Fig.18) and won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). With this model, the error rate improved by more than 10 percentage points compared with the previous year, and, more importantly, it became the basis for further leaps in image processing. The second leap happened in 2015, when Microsoft introduced its model, ResNet. This model was more accurate than human eyes, and thanks to some innovations in its architecture it was able to use a much larger number of layers (152) than the previous winner (GoogLeNet, with 22 layers).

VGG16, introduced in 2014, is an accurate model with a simple architecture that is easy to implement. These features make it a good option for research and scientific work in other areas (a loading example follows Fig.18).

Fig.18. ILSVRC error rate by year
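Here is the loading example promised above; VGG16 ships with Keras, and include_top=False drops the original classifier so the feature extractor can be reused (the input shape is an assumption):

```python
from tensorflow.keras.applications import VGG16

# Load VGG16 pretrained on ImageNet, keeping only the feature extractor.
base_model = VGG16(weights="imagenet", include_top=False,
                   input_shape=(224, 224, 3))
base_model.summary()
```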

The use of deep models has always had two major problems: the need for large datasets and the difficulty of interpreting how they work. As Figure 19 shows, deep models can reach very accurate results, but they need a large amount of data; otherwise, machine learning algorithms or classical statistics can be more accurate and reasonable. However, transfer learning answers the need for large datasets in CNN models very well. This ability makes CNNs very powerful and useful even in comparison with other types of deep models.

Fig.19. Different methods based on the amount of data

Suppose, as in Figure 20, a CNN model is developed to identify the faces of a large number of people. In this case, the first layers of the model are responsible for identifying colors, lines, and generally simple entities. The middle layers identify more complex entities, such as facial components like the eyes, ears, and lips. The final layers, in this example, recognize combinations of facial components, considering all of them simultaneously.

When you start another project, for example in the field of car model identification, you can reuse the first layers of the model developed for face identification. These layers are responsible for identifying simple entities such as lines and colors, and, as it happens, teaching them is a big part of the problem. So the first layers of existing models, sometimes trained on millions of images, can be used in new problems and can dramatically increase the performance of your new model (a minimal sketch follows Fig.20).

Fig.20. Different layers of the CNN model
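And the minimal transfer-learning sketch promised above, reusing VGG16's pretrained layers for a hypothetical new two-class problem; the classifier head and its sizes are my assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.applications import VGG16

# Reuse the pretrained feature extractor and freeze it, so the layers
# that already know lines, colors, and simple shapes stay as they are.
base = VGG16(weights="imagenet", include_top=False,
             input_shape=(224, 224, 3))
base.trainable = False

# Attach a fresh classifier for the new task (e.g. two classes).
model = keras.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=5)  # trains only the new head
```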

Finally, you can watch a video in which I play Super Mario using a CNN model that I developed with transfer learning. The video is a practical, simple example of a two-class classification problem, so there are just two classes: raising my eyebrows means jump once and then run continuously; frowning means stop and don't move.

https://youtu.be/_h15_uBuL70
