Deep Learning for Computer Vision

Chanaka
6 min read · Jun 16, 2024


What is Computer Vision?

Computer vision is a field of artificial intelligence (AI) and computer science that focuses on enabling computers to interpret and make decisions based on visual data from the world.

It involves the development of algorithms and techniques to allow machines to understand and analyze images and videos, mimicking the human visual system. Here are some key aspects of computer vision:

1. Image Classification: This is the task of “assigning a label or class to an image from a predefined set of categories”. For example, given an image, the system could classify it as a ‘cat’, ‘dog’, ‘car’, etc. This is typically done using convolutional neural networks (CNNs), which are highly effective for such tasks.

2. Object Detection: This involves “identifying and locating objects within an image”. The “goal is to draw bounding boxes around objects and classify them”. For example, detecting multiple objects like people, cars, and traffic signs in a single image. Techniques like YOLO (You Only Look Once) and R-CNN (Region-based Convolutional Neural Networks) are commonly used.

3. Semantic Segmentation: This process involves “classifying each pixel in an image into a category”. It’s more granular than object detection, as it provides detailed information about the shapes and boundaries of objects. For example, segmenting an image of a street scene where each pixel is classified as road, sidewalk, building, etc.

4. Action Recognition: This is the task of “identifying and understanding the actions being performed in a video”. For example, recognizing actions like walking, running, jumping, or waving. It involves analyzing both spatial and temporal information from a sequence of frames.

5. Clustering: In the context of computer vision, clustering involves “grouping a set of images into clusters based on their visual features”. Each cluster should contain images that are visually similar to each other. This can be useful for tasks like organizing large image datasets, image retrieval, and unsupervised learning.

6. 3D Reconstruction: This involves “constructing a three-dimensional model of an object or a scene from a set of two-dimensional images”. This can be done using multiple views of the same scene (stereo vision) or by moving a single camera around the scene (structure from motion). This technique is used in various applications such as virtual reality, augmented reality, and robotics.

What is Convolution?

  • Convolution is a fundamental technique in image processing.
  • It is a mathematical operation used to apply a filter or kernel to an image, producing a filtered version of the original image.
  • This operation is called convolution because it involves “sliding” the filter over the image, element-wise multiplying the values of the filter and the image at each location, and summing the results to create a new value for the output image.
  • This process is repeated for every pixel in the image, effectively enhancing features, detecting edges, or applying other transformations based on the filter used.
Copyrights: Author Gabriel Rennó

A convolution is a powerful tool for image processing because it allows us to extract information from images in a highly efficient manner. For example, we can use convolution to:

  • Detect edges in an image
  • Sharpen blurry images
  • Smooth out noisy images

In general, convolution is used to apply a wide range of transformations to images, including blurring, sharpening, edge detection, and many others.
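The “slide, multiply element-wise, sum” procedure described above can be sketched in a few lines of NumPy. This is a minimal “valid” convolution (no padding, so the output shrinks), and, like most deep-learning libraries, it skips the kernel flip, so strictly speaking it computes cross-correlation:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image, multiply element-wise, and sum.

    'Valid' mode: no padding, so the output shrinks by (kernel_size - 1)
    in each dimension.
    """
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

# Toy image: dark on the left, bright on the right.
image = np.array([
    [0, 0, 0, 10, 10, 10],
    [0, 0, 0, 10, 10, 10],
    [0, 0, 0, 10, 10, 10],
    [0, 0, 0, 10, 10, 10],
], dtype=float)

# Sobel-like kernel that responds to vertical edges.
edge_kernel = np.array([
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
], dtype=float)

edges = convolve2d(image, edge_kernel)
print(edges)  # large values exactly where the dark/bright boundary lies
```

Swapping in a different kernel (an averaging kernel for blurring, a Laplacian for sharpening) applies a different transformation with the same sliding mechanism.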

Why are we learning about convolution here?

Before the advent of Convolutional Neural Networks (CNNs), image classification required manually crafting image features before feeding them into an image classifier, such as a Support Vector Machine (SVM). With CNNs, this process is automated: the network learns the features itself, and even the classification step is handled by the neural network.

Before CNNs: hand-crafted features fed into a classifier (e.g., an SVM)

After CNNs: feature extraction and classification learned end-to-end

History of CNN architectures for Computer Vision

  • LeNet (1998)
  • AlexNet (2012): the most popular example, and the first CNN architecture to achieve breakthrough success in the computer vision domain
  • VGGNet (2014)

Anatomy of a CNN (Example LeNet)

How LeNet identifies an image through successive layers of feature extraction built from convolutions.
ReLU is an activation function.

Activation functions used in Convolutions: https://en.wikipedia.org/wiki/Activation_function

Here, the loss function refers to the cost function we use to calculate the error.
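To make the anatomy concrete, here is a small sketch that traces how the spatial dimensions shrink through LeNet's convolution and subsampling layers, assuming the layer sizes from the original 1998 LeNet-5 paper (a 32×32 grayscale input):

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output spatial size of a conv/pool layer: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

size = 32                                   # input: 32x32 grayscale image
size = conv_out(size, kernel=5)             # C1: 6 feature maps, 28x28
size = conv_out(size, kernel=2, stride=2)   # S2: subsampling, 14x14
size = conv_out(size, kernel=5)             # C3: 16 feature maps, 10x10
size = conv_out(size, kernel=2, stride=2)   # S4: subsampling, 5x5
flat = 16 * size * size                     # flatten: 16 * 5 * 5 = 400
# ...followed by fully connected layers 400 -> 120 -> 84 -> 10
print(size, flat)  # 5 400
```

Each convolution extracts features; each subsampling (pooling) step halves the resolution, so the fully connected layers at the end see a compact 400-value summary of the image.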

Steps of Training a CNN

  • If the loss is high, we can’t expect good accuracy.
  • Therefore, we have to gradually train the neural network to reduce the loss (optimization).
  • Training a CNN (or any artificial neural network) means minimizing the loss function over multiple iterations of training.

Gradient Descent

One way we can minimize the loss function (cost function) is gradient descent.
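As a minimal sketch of the idea, here is gradient descent minimizing a toy one-dimensional loss L(w) = (w - 3)^2, whose gradient is 2(w - 3); each step moves the weight a small amount against the gradient:

```python
def gradient_descent(w0, lr=0.1, steps=100):
    """Minimize L(w) = (w - 3)^2 by repeatedly stepping against its gradient."""
    w = w0
    for _ in range(steps):
        grad = 2 * (w - 3)   # derivative of the loss at the current w
        w = w - lr * grad    # update rule: w <- w - learning_rate * gradient
    return w

w = gradient_descent(w0=0.0)
print(round(w, 4))  # 3.0, the minimum of the loss
```

In a real CNN, w is not one number but millions of weights, and the gradient is computed by backpropagation, yet the update rule is exactly this.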

How long can training a CNN take?

It can take days, even with a very powerful computer.

Therefore, if we have a very large dataset, we don’t feed all the training data at once.

Instead, we randomly select some images from the dataset and train the CNN on them; this is called training with a small “mini-batch” at a time.
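The shuffle-and-slice procedure above can be sketched with the standard library alone; the "dataset" here is just placeholder integers standing in for images:

```python
import random

def minibatches(dataset, batch_size):
    """Shuffle the dataset, then yield it one mini-batch at a time.

    One full pass over all the mini-batches is one training epoch.
    """
    indices = list(range(len(dataset)))
    random.shuffle(indices)  # random order so batches differ each epoch
    for start in range(0, len(indices), batch_size):
        yield [dataset[i] for i in indices[start:start + batch_size]]

data = list(range(10))  # hypothetical dataset of 10 "images"
batches = list(minibatches(data, batch_size=4))
print([len(b) for b in batches])  # [4, 4, 2] -- the last batch may be smaller
```

Each mini-batch produces one gradient estimate and one weight update, so the network improves many times per epoch instead of once.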

Code Example:
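The embedded code from the original post isn't preserved here. As a stand-in, this hedged sketch shows the training structure described above (shuffle into mini-batches, forward pass, loss gradient, gradient-descent update) on a tiny logistic-regression model instead of a full CNN; the toy data and hyperparameters are made up for illustration:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(data, lr=0.5, epochs=50, batch_size=4):
    """Mini-batch gradient descent on the cross-entropy loss."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        random.shuffle(data)                      # fresh mini-batches each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grad_w = grad_b = 0.0
            for x, y in batch:
                p = sigmoid(w * x + b)            # forward pass
                grad_w += (p - y) * x             # gradient of the loss w.r.t. w
                grad_b += (p - y)                 # gradient of the loss w.r.t. b
            w -= lr * grad_w / len(batch)         # gradient-descent update
            b -= lr * grad_b / len(batch)
    return w, b

# Hypothetical toy data: classify whether x is non-negative.
random.seed(0)
data = [(x / 10.0, 1 if x >= 0 else 0) for x in range(-10, 10)]
w, b = train(data)
accuracy = sum((sigmoid(w * x + b) > 0.5) == (y == 1) for x, y in data) / len(data)
print(f"accuracy: {accuracy:.2f}")
```

A real CNN training loop has the same shape; only the model, the forward pass, and the gradient computation (backpropagation through the convolutional layers) are more elaborate.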


🌐 Follow me on LinkedIn: https://www.linkedin.com/in/chanakadev/

👨‍💻 Follow me on GitHub: https://github.com/ChanakaDev
