How do computers perceive images?

Kerem Aydın
5 min read · Jun 19, 2023


Image from Unsplash

As humans, we perceive vision through our eyes, which capture photons, while our brain builds a picture of what we see every day. Computers, however, can only work with numbers, so they first need to convert an image into an array of numbers. An image is split into pixels, very small squares each holding a single color. If we put enough pixels together, we can create any image; the more pixels there are, the more detail the image has. But there are many colors, so how does the computer tell them apart? Simple: as we learned in kindergarten, we can build almost any color by mixing three primary colors, red, green and blue. Therefore each pixel contains a tuple of three numbers, RGB.

RGB Color Channels[1]

R represents red, G represents green and B represents blue. Each number is always between 0 and 255. So, for example, what is the value for pure red? As you might guess, it is (255, 0, 0).

We have talked about pixels; now we can move on to the whole image. An image has three dimensions: height, width and color channels. Height and width are the number of pixels along the y and x axes, respectively. What we call the resolution of an image is simply the number of pixels in those two directions, and it directly affects the quality of the image.

9x9x3 Image
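
As a quick sketch of how such a 9x9x3 image looks in code (the values here are made up, not taken from the figure):

```python
import numpy as np

# A hypothetical 9x9 RGB image: height x width x color channels,
# each value an integer from 0 to 255.
image = np.zeros((9, 9, 3), dtype=np.uint8)

# Set the top-left pixel to pure red: (R, G, B) = (255, 0, 0).
image[0, 0] = (255, 0, 0)

print(image.shape)   # (9, 9, 3)
print(image[0, 0])   # [255   0   0]
```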

Now that we understand how computers perceive images, the next question is: how does a computer extract meaning from an image? This is where deep learning comes in. In the first paragraphs we saw the computer mimicking the human sensory system; from here on we will look at the cognitive system.

Human vision involves our sensory and cognitive systems[2]

The simplest way to extract semantics is to flatten the image and feed it through dense neural networks. This can work for small images, but it breaks down as the resolution increases: the number of trainable weights in the dense layers grows very quickly with image size. And we have not even mentioned the biggest problem yet. When we flatten the image, we lose vital information: the spatial position of the pixels.

Flattening a 5 x 5 image
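
A small sketch of flattening and of how quickly the weight count of a single dense layer grows (the 128-unit layer size is an arbitrary choice for illustration):

```python
import numpy as np

# Flattening throws away the 2-D layout of the pixels.
image = np.arange(25).reshape(5, 5)   # a toy 5x5 single-channel image
flat = image.flatten()                # shape (25,): just a long row of numbers

# A dense layer connecting every input pixel to every unit needs
# (inputs x units) weights, so the count grows quickly with resolution:
for side in (5, 28, 224):
    inputs = side * side * 3          # an RGB image, flattened
    units = 128                       # an arbitrary hidden-layer size
    print(f"{side}x{side} image -> {inputs * units:,} weights")
```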

Let’s say we want to classify an image in which a cat sits on the right side. Even after we train a dense network to recognize that the image contains a cat, if we then feed it an image with a cat on the far left, it will probably get it wrong, because nothing it learned carries over across positions: dense layers have no weight sharing. These problems kept deep learning behind humans in computer vision.

All of this changed after the CNNs attacked. Well, not exactly :D. CNNs were introduced back in 1989 with the LeNet architecture by Yann LeCun, but their popularity only came after AlexNet in 2012. Why is that?

There are many reasons for this. CPUs have always been insufficient for training deep learning models, and they still are. GPUs, on the other hand, had been improving thanks to gaming, and after the authors of AlexNet used them for training neural networks, the approach caught on. This sped up training immensely and made training deep networks accessible to far more people. Another reason AlexNet changed the game in computer vision is its use of the ReLU activation and regularization techniques. Because ReLU does not saturate, it converges much faster than other activation functions, and regularization keeps the larger network from overfitting. Lastly, thanks to the speed of GPUs, the authors were able to train much deeper architectures.
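
For reference, ReLU itself is just a thresholding function; a one-line NumPy sketch:

```python
import numpy as np

# ReLU passes positive values through and zeroes out the rest,
# so its gradient does not vanish (saturate) for positive inputs.
def relu(x):
    return np.maximum(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))  # [0.  0.  0.  1.5 3. ]
```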

How do convolutional neural networks work?

The next question is: “How do convolutional neural networks work?” The building blocks of a CNN are:

  • filter
  • stride
  • padding
  • pooling

If we understand what these do, then it will be easy to explain CNNs.

1. Filters

Filters are matrices that perform the convolution operation on the image. Convolution is basically the sum of the elementwise multiplication of two same-sized matrices. The filter size is a hyperparameter and is usually small, e.g. 1x1, 2x2 or 3x3. Let’s say we have a 5x5 image and a 3x3 filter; the first convolution step would be:

Convolution Process

100 x 1 + 20 x 1 + 20 x 1 + 10 x 2 + 1 x 60 = 220
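
The exact numbers from the figure are not reproduced here, but a minimal NumPy sketch of the operation (with a made-up image and filter) looks like this:

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Naive 2-D convolution as used in CNNs (technically cross-correlation).

    A sketch for illustration only: assumes square, single-channel inputs,
    no bias and no optimization. `padding` adds a frame of zeros around the image.
    """
    if padding > 0:
        image = np.pad(image, padding, mode="constant")
    n, f = image.shape[0], kernel.shape[0]
    out_size = (n - f) // stride + 1
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            window = image[i * stride:i * stride + f,
                           j * stride:j * stride + f]
            out[i, j] = np.sum(window * kernel)  # elementwise multiply, then sum
    return out

image = np.random.randint(0, 256, size=(5, 5))  # a made-up 5x5 image
kernel = np.ones((3, 3))                        # a made-up 3x3 filter
print(conv2d(image, kernel).shape)              # (3, 3): a 3x3 filter fits 9 positions
```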

2. Stride

This convolution is applied over the whole image in a sliding-window fashion, moving from left to right like a typewriter. The stride is how many pixels the filter slides after each convolution step. The default stride is 1.

Convolution process with stride 1
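
Reusing the conv2d sketch from the filter section, a larger stride simply skips positions, so the output shrinks:

```python
# Same made-up 5x5 image and 3x3 filter as above:
print(conv2d(image, kernel, stride=1).shape)  # (3, 3)
print(conv2d(image, kernel, stride=2).shape)  # (2, 2): the filter skips every other position
```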

3. Padding

If we want the output to have the same shape as the input, we can use padding. Padding is a frame of zeros added around the image. It is also useful when the filter size and stride do not fit the input image exactly, as in the following example:

Convolution with padding
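
Continuing the same sketch, padding lets the output keep the input’s shape. The general relation for an n x n input, f x f filter, padding p and stride s is out = (n + 2p - f) / s + 1, rounded down:

```python
# With a frame of zeros (padding=1), the 5x5 image stays 5x5 after a 3x3 filter:
print(conv2d(image, kernel, stride=1, padding=0).shape)  # (3, 3)
print(conv2d(image, kernel, stride=1, padding=1).shape)  # (5, 5)
```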

4. Pooling

Pooling is usually the last step of a convolutional block in a convolutional neural network (CNN); it reduces the resolution and filters out fine details. It operates with a sliding window, similar to convolution, but its purpose is simply to decrease the resolution.

Pooling Process

Pooling can be performed in two ways: max pooling or average pooling. In max pooling, the maximum value within each window is kept, while in average pooling the average of the window is taken. Both methods reduce the resolution and discard some detail in the data.
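
A minimal sketch of both variants, again with made-up numbers:

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Naive pooling sketch: slide a size x size window over the input and
    keep either the maximum or the average of each window."""
    n = feature_map.shape[0]
    out_size = (n - size) // stride + 1
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

feature_map = np.random.rand(4, 4)              # a made-up 4x4 feature map
print(pool2d(feature_map, mode="max"))          # 2x2: the maximum of each 2x2 window
print(pool2d(feature_map, mode="average"))      # 2x2: the average of each 2x2 window
```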

Conclusion

Computer Vision CNN Structure[3]

By combining all of these tools, computer vision models have matched, and on some benchmarks surpassed, human performance. Together, these building blocks let the computer mimic both the human sensory and cognitive systems.
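
As a closing sketch, here is how these building blocks typically stack up in Keras (the layer sizes, input shape and 10-class output are arbitrary choices for illustration, not from the article):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Convolution + ReLU, pooling, then flatten + dense for classification.
model = keras.Sequential([
    layers.Input(shape=(32, 32, 3)),                                   # 32x32 RGB images
    layers.Conv2D(16, kernel_size=3, padding="same", activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Conv2D(32, kernel_size=3, padding="same", activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),                            # e.g. 10 classes
])
model.summary()
```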

References

[1] https://www.color-meanings.com/what-color-red-green-make-mixed/

[2] Lakshmanan, V. et al., “Practical Machine Learning for Computer Vision”, 2021

[3] https://www.analyticsvidhya.com/blog/2022/01/convolutional-neural-network-an-overview/

[4] Géron, A., “Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems”, 2023
