Computer Vision Basics

Kashish oberoi
Published in Computer Vision, Jun 12, 2020

Object detection, the task of localizing objects in image frames, has been a focus of the deep learning and computer vision communities for decades. It plays a crucial role in the autonomous vehicle industry, and the underlying algorithms can also be used for image classification, which has applications almost everywhere.

Basics

To start with the basics of computer vision, we first need to know about convolution filters and image transformations.
1. Vertical Edge Detection Filter: As the name suggests, this filter detects vertical edges in an image. It is important when you want to detect something with clear vertical lines, such as boxes, sudoku grids, tables, and graphs.
Examples: the kernel [[1,0,-1],[1,0,-1],[1,0,-1]], as well as the Sobel and Scharr filters, which weight the center row more heavily.

2. Padding: Padding adds pixels around the borders of an image. It is useful for controlling the output size so that it fits the input size expected by the next layer. It avoids the shrinking problem, where every convolution makes the image smaller, and it reduces the loss of information at the edges.

3. Strided Convolution: The stride is the number of pixels the filter shifts across the input image at each step. With a stride greater than 1, the convolution extracts information from each block of pixels and condenses it into a smaller output for further processing.
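The three ideas above can be tried out together in a few lines of NumPy. This is only a minimal sketch: the function name conv2d, the toy 6x6 image, and the zero-padding choice are assumptions made for illustration, and the loop-based implementation is written for clarity rather than speed.

```python
import numpy as np

def conv2d(image, kernel, padding=0, stride=1):
    """Naive 2-D convolution (technically cross-correlation, which is
    what deep learning libraries usually compute)."""
    if padding > 0:
        # Zero padding: add `padding` pixels of zeros on every border
        image = np.pad(image, pad_width=padding, mode="constant", constant_values=0)
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride  # top-left corner of the current window
            out[i, j] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

# Toy image (assumed): bright left half, dark right half, so there is
# exactly one vertical edge in the middle.
image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)
vertical_edge = np.array([[1, 0, -1]] * 3, dtype=float)

edges = conv2d(image, vertical_edge)               # no padding: 6x6 shrinks to 4x4
same = conv2d(image, vertical_edge, padding=1)     # padding of 1 keeps the 6x6 size
strided = conv2d(image, vertical_edge, stride=2)   # stride of 2 gives a 2x2 output

print(edges)
print(same.shape, strided.shape)
```

The large values in the middle columns of edges mark the vertical edge. In general, for an n x n input, an f x f filter, padding p, and stride s, the output side length is floor((n + 2p - f) / s) + 1.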

A very useful animated view of the convolution operation as used in deep learning:

https://github.com/vdumoulin/conv_arithmetic

Convolutional Neural Network

Wikipedia Definition: In deep learning, a convolutional neural network (CNN, or ConvNet) is a class of deep neural networks, most commonly applied to analyzing visual imagery. The name “convolutional neural network” indicates that the network employs a mathematical operation called convolution. Convolution is a specialized kind of linear operation. Convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers.

Consider a typical example. We take an image as input and extract information from all of its channels, increasing the depth of the representation as we go. This is feature extraction, and the weights of the convolutions are learned during the training phase. Ultimately, we end up with flattened 1-D data, which is fed into an artificial neural network that classifies the object.

Let us look at the terminology:
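To make the idea of "increasing the depth" concrete, here is a small sketch of one convolution layer applied to a 3-channel image. The sizes (an 8x8 RGB image, four 3x3 filters) and the name conv_layer are assumptions chosen for illustration, and the filter weights are random here, whereas in a real network they would be learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))                # height, width, channels (assumed RGB)
filters = rng.standard_normal((4, 3, 3, 3))  # 4 filters, each 3x3 across 3 channels

def conv_layer(x, w):
    """Apply every filter in w to x (no padding, stride 1)."""
    n_filters, kh, kw, _ = w.shape
    out_h = x.shape[0] - kh + 1
    out_w = x.shape[1] - kw + 1
    out = np.zeros((out_h, out_w, n_filters))
    for k in range(n_filters):
        for i in range(out_h):
            for j in range(out_w):
                # Each filter looks at all input channels at once
                out[i, j, k] = np.sum(x[i:i + kh, j:j + kw, :] * w[k])
    return out

features = conv_layer(image, filters)  # depth grows from 3 channels to 4 feature maps
flat = features.reshape(-1)            # flattened 1-D data for the classifier

print(image.shape, features.shape, flat.shape)
```

Each filter produces one feature map, so the output depth equals the number of filters, not the number of input channels.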

ARTIFICIAL NEURAL NETWORK: An artificial neural network is an interconnected group of nodes, inspired by a simplification of neurons in a brain. In the usual diagram, each circular node represents an artificial neuron and each arrow represents a connection from the output of one artificial neuron to the input of another.

RELU: In the context of Artificial Neural Networks, the rectifier is an activation function defined as the positive part of its argument:

f(x) = max(0, x)

where x is the input to a neuron.

POOLING: Pooling slides a window over the input with a given stride, as described above, and replaces each window with a single value. The result is a new matrix with reduced dimensions. Examples include max pooling, min pooling, and average pooling.
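Both the rectifier and pooling are short enough to write out by hand. The following is a minimal sketch, assuming a 2x2 window with stride 2 for max pooling; the function names and the sample feature map are made up for illustration.

```python
import numpy as np

def relu(x):
    """Rectifier: keep the positive part, replace negatives with 0."""
    return np.maximum(0, x)

def max_pool2d(x, size=2, stride=2):
    """Replace each size x size window with its maximum value."""
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = x[r:r + size, c:c + size].max()
    return out

feature_map = np.array([[ 1, -2,  3,  0],
                        [-1,  5, -3,  2],
                        [ 4, -6,  7, -8],
                        [ 0,  2, -1,  1]], dtype=float)

activated = relu(feature_map)    # negatives become 0
pooled = max_pool2d(activated)   # 4x4 -> 2x2
print(pooled)
# [[5. 3.]
#  [4. 7.]]
```

Swapping .max() for .min() or .mean() in the inner loop gives min pooling or average pooling.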

SOFTMAX: The softmax function, also known as softargmax or normalized exponential function, is a function that takes as input a vector z of K real numbers, and normalizes it into a probability distribution consisting of K probabilities proportional to the exponentials of the input numbers. It is usually used in classification problems, where it gives the probability of each class.
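As a quick illustration, softmax can be sketched in NumPy as follows. The input scores are invented for the example, and subtracting the maximum before exponentiating is a common numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax(z):
    """Turn a vector of K real scores into K probabilities."""
    z = np.asarray(z, dtype=float)
    shifted = z - np.max(z)   # avoids overflow in exp; result is unchanged
    exps = np.exp(shifted)
    return exps / exps.sum()

scores = [2.0, 1.0, 0.1]      # hypothetical raw scores for 3 classes
probs = softmax(scores)
print(probs, probs.sum())
```

The largest score always gets the largest probability, and the probabilities sum to 1.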

FULLY CONNECTED LAYERS: Fully connected layers are an essential component of Convolutional Neural Networks (CNNs), which have proven very successful at recognizing and classifying images for computer vision. In a fully connected layer, every node is connected to every node in the next layer.
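In matrix form, a fully connected layer is just a weight matrix times the input vector plus a bias. Here is a minimal sketch, assuming 144 flattened features and 10 output classes; both sizes and the random weights are placeholders for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
flat = rng.random(144)                      # flattened features (size assumed)
W = rng.standard_normal((10, 144)) * 0.01   # one row of weights per output node
b = np.zeros(10)                            # one bias per output node

# Every output node sees every input value: that is "fully connected"
logits = W @ flat + b

print(logits.shape)          # (10,)
print(W.size + b.size)       # 1450 parameters in this layer
```

These raw outputs are typically passed through softmax to obtain class probabilities.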

A Comprehensive Guide to Convolutional Neural Networks — the ELI5 way

Why Convolution?

1. Parameter Sharing: A feature detector (such as a vertical edge detector) that’s useful in one part of the image is probably useful in another part of the image.

2. Sparsity of connections: In each layer, each output value depends on only a small number of inputs. This reduces the number of parameters and the complexity of the model.
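A bit of arithmetic shows how much these two properties save. The numbers below are assumptions picked for illustration: a 32x32x3 input, mapped to a 28x28x6 output either by six 5x5 filters or by a single fully connected layer.

```python
# Assumed layer sizes (illustrative only)
in_h, in_w, in_c = 32, 32, 3
f, n_filters = 5, 6
out_h, out_w = in_h - f + 1, in_w - f + 1   # 28 x 28, no padding, stride 1

n_inputs = in_h * in_w * in_c               # 3072 input values
n_outputs = out_h * out_w * n_filters       # 4704 output values

# Convolution: each filter has f*f*in_c weights plus 1 bias, shared
# across every position in the image (parameter sharing).
conv_params = (f * f * in_c + 1) * n_filters

# Fully connected: every output has its own weight for every input.
fc_params = n_inputs * n_outputs + n_outputs

# Sparsity: inputs that a single output value depends on
conv_inputs_per_output = f * f * in_c
fc_inputs_per_output = n_inputs

print(conv_params)              # 456
print(fc_params)                # 14455392
print(conv_inputs_per_output)   # 75
print(fc_inputs_per_output)     # 3072
```

Under these assumptions the convolution layer needs a few hundred parameters where the fully connected layer needs over 14 million.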
