Computer Vision: Basic Building Blocks and a Step-by-Step Beginner's Guide to Model Building

Keerthana Durai
Jun 12 · 7 min read

Computer vision is growing in popularity! It enables computers to see, analyze, and process objects in images and videos, much the way humans do. Object detection is widely used in robotics, text extraction, self-driving cars, theft detection, satellite analytics, and more. In this article, we'll focus on image basics, the building blocks of a convolutional neural network, and a step-by-step approach to model building.

Image Basics and Manipulations

Pixels are the smallest elements of a digital image. Pixel values range from 0 to 255, where 0 represents black and 255 represents white.

Figure 1: Image in pixel format

Images come in different channel formats, such as RGB (Red-Green-Blue), BGR (Blue-Green-Red), and grayscale. Image filtering is used to transform images with operations like sharpening, smoothing, and edge enhancement. Some common image transformation techniques are rotation, mirroring, identity, reflection, and scaling.

In the piece of code above, scikit-image is used to read an image and display each of the RGB channels as a separate image.
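The original code is shown only as a screenshot; a minimal sketch of the same idea, using scikit-image's bundled `astronaut` sample image in place of the article's own file, could look like this:

```python
import numpy as np
from skimage import data

# Load a sample RGB image bundled with scikit-image
# (the original article reads its own image file instead).
img = data.astronaut()          # shape: (512, 512, 3), dtype uint8

# Split into the three colour channels.
red, green, blue = img[:, :, 0], img[:, :, 1], img[:, :, 2]

# To visualise one channel in isolation, zero out the other two.
red_only = np.zeros_like(img)
red_only[:, :, 0] = red
```

Each channel is just a 2D array of pixel intensities; zeroing the other channels lets you display one channel as a colour image.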

Feature Extraction from Images

This can be solved either with OpenCV techniques like the SIFT and HOG methods, or with convolution. Here we'll focus on convolution.

Convolution is the process of extracting features from an image with the use of a kernel. The process works somewhat like the human retina, which perceives a 2D image layer by layer. Convolution provides good accuracy, but it is computationally expensive.

Figure 2: Convolution Operation


A kernel is a matrix that moves over the input data by the stride value and performs a dot product with each sub-region of the input. Kernels come in different types, such as identity, edge detection, and sharpen, which are shown in Figure 3.

Convolutions with different kernels are used for image transformations, but the edge-detection kernel is predominantly used to extract high-level features like edges from the image.

Figure 3: Convolved image after applying different types of kernels.
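The convolution operation from Figure 2 can be sketched in plain NumPy; this naive loop is written for clarity, not speed:

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """'Valid' convolution of a 2-D image with a kernel."""
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Dot product of the kernel with the current sub-region.
            region = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(region * kernel)
    return out

image = np.array([[3, 0, 1, 2, 7],
                  [1, 5, 8, 9, 3],
                  [2, 7, 2, 5, 1],
                  [0, 1, 3, 1, 7],
                  [4, 2, 1, 6, 2]], dtype=float)

# A classic 3x3 edge-detection (Laplacian-style) kernel.
edge_kernel = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]], dtype=float)

print(convolve2d(image, edge_kernel))   # 3x3 feature map
```

A 5×5 input convolved with a 3×3 kernel at stride 1 yields a 3×3 output, which matches the output-shape formula discussed later.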


Rather than applying ever more kernels, a pooling layer is applied to the model to downsample the feature maps. With the typical 2×2 window and stride of 2, it reduces the size of the input data by half.

Pooling doesn't change the depth of the input, but it does reduce the length and width.

Benefits of pooling layer:

- Reduces the number of parameters and computations in the network
- Makes the learned features approximately invariant to small translations of the input
- Helps control overfitting

Types of pooling layers:

- Max pooling, which keeps the maximum value in each window
- Average pooling, which keeps the average of each window

Figure 4: Pooling
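Max pooling, the most common variant, can be sketched in NumPy like this:

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """2x2 max pooling: keep the largest value in each window,
    halving the height and width while leaving depth untouched."""
    h = (feature_map.shape[0] - size) // stride + 1
    w = (feature_map.shape[1] - size) // stride + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            window = feature_map[i*stride:i*stride+size,
                                 j*stride:j*stride+size]
            out[i, j] = window.max()
    return out

fm = np.array([[1, 3, 2, 4],
               [5, 6, 7, 8],
               [3, 2, 1, 0],
               [1, 2, 3, 4]], dtype=float)

print(max_pool(fm))   # → [[6. 8.] [3. 4.]]
```

Replacing `window.max()` with `window.mean()` turns this into average pooling.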


Padding is the process of adding zeros to the input data symmetrically, as shown in Figure 5. For instance, if you're training an autoencoder, the output image should have the same size as the input; this is where padding comes into play. Padding adds "extra space" around the input data to avoid the loss of spatial dimensions.

Figure 5: Padding
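A quick sketch of symmetric zero padding with NumPy's `np.pad`:

```python
import numpy as np

x = np.ones((4, 4))              # a 4x4 input

# Symmetrically pad one ring of zeros around the input, so that a
# 3x3 convolution with stride 1 preserves the 4x4 size:
# output = (W - K + 2P)/S + 1 = (4 - 3 + 2*1)/1 + 1 = 4
padded = np.pad(x, pad_width=1, mode='constant', constant_values=0)
print(padded.shape)              # (6, 6)
```

Framework layers do the same thing internally when you ask for "same" padding.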

Fully Connected Layer

Figure 6: Fully connected Layer

What is Fully connected layer?

It is also known as a feed-forward neural network. Fully connected layers are those layers where all the inputs from one layer are connected to every activation unit of the next layer. It has three kinds of layers: the input layer represents the dimensions of the input vector, the hidden layers take a set of weighted inputs and produce output through activation functions, and the output layer represents the output of the neural network.

The output from the final pooling or convolutional layer is the input to the fully connected layer. These outputs are flattened, that is, all their values are unrolled into a vector, and then fed into the fully connected layer.
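A fully connected layer boils down to a flatten, a weighted sum, and an activation; a minimal NumPy sketch, where the shapes and random weights are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend this 4x4x8 feature map came from the last conv/pool layer.
feature_map = rng.standard_normal((4, 4, 8))
x = feature_map.flatten()               # unroll into a vector, shape (128,)

# One fully connected layer: every input connects to every unit.
W = rng.standard_normal((10, x.size))   # 10 output units
b = np.zeros(10)
z = W @ x + b                           # weighted sum
output = np.maximum(0, z)               # ReLU activation

print(output.shape)                     # (10,)
```

Frameworks like Keras wrap exactly this computation in their `Flatten` and `Dense` layers.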

Activation Function

An activation function determines whether a neuron should be activated or not by calculating the weighted sum of its inputs. It also applies non-linearity, which keeps the network's output from collapsing into a linear function, i.e. a polynomial of degree one.

Types of Activation functions

1. Sigmoid Function:

The sigmoid function is used in the output layer of deep learning models for predicting probability-based outputs. The sigmoid function is represented as:

sigmoid(z) = 1 / (1 + e^(-z))

2. Tanh Function:

Tanh ranges between -1 and 1; it is a smoother, zero-centered function. The tanh function is represented as:

tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z))

3. Rectified Linear Unit (ReLU) Function:

It is one of the most popular activation functions, and in hidden layers it often performs better than sigmoid and tanh. Because its gradient is 1 for all positive inputs, it helps mitigate the vanishing gradient problem. The ReLU function performs a threshold operation on each input element: all values less than zero are set to zero. The ReLU function is represented as:

ReLU(z) = max(0, z)

4. SoftMax Function:

SoftMax outputs range between 0 and 1, with the sum of the probabilities equal to 1. This function is used as the final layer in multi-class classification problems. The SoftMax function is represented as follows:

softmax(z_i) = e^(z_i) / Σ_j e^(z_j)
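The four functions above can be written in a few lines of NumPy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))   # values in (0, 1)
print(tanh(z))      # values in (-1, 1)
print(relu(z))      # [0. 0. 2.]
print(softmax(z))   # non-negative values summing to 1
```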

Convolutional Neural Network Architecture

Figure 6: Convolutional Neural Network Architecture

How to compute the output shape of a CNN?

The output shape of a CNN layer is calculated by the formula [(W−K+2P)/S]+1, where W = input size, K = kernel size, P = padding, and S = stride.
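The formula translates directly into code:

```python
def conv_output_size(W, K, P=0, S=1):
    """[(W - K + 2P) / S] + 1 for one spatial dimension."""
    return (W - K + 2 * P) // S + 1

# A 28x28 MNIST image through a 3x3 kernel, no padding, stride 1:
print(conv_output_size(28, 3))        # 26
# The same kernel with padding 1 preserves the size:
print(conv_output_size(28, 3, P=1))   # 28
# 2x2 pooling with stride 2 halves it:
print(conv_output_size(26, 2, S=2))   # 13
```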

In summary, Figure 6 shows the input image passing through alternating convolution and pooling layers, after which the feature maps are flattened and fed through fully connected layers to the final softmax output.

Problem solving with computer vision

We start with the MNIST dataset. MNIST is one of the most common datasets used for image classification; it contains 60,000 training images and 10,000 test images collected from American Census Bureau employees and American high school students. TensorFlow and Keras allow us to download and import the MNIST dataset directly.

The code below shows how to import the MNIST dataset and how to shuffle and split the data into train and test sets.

How to install TensorFlow?

!pip install tensorflow

Figure 7
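The loading step shown in Figure 7 is roughly the following (Keras downloads and caches the data on first run, already shuffled and split into train and test sets):

```python
import tensorflow as tf

# Keras ships a loader that downloads and caches MNIST.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

print(x_train.shape, y_train.shape)   # (60000, 28, 28) (60000,)
print(x_test.shape, y_test.shape)     # (10000, 28, 28) (10000,)
```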

Next, visualize data from both the train and test sets.

Figure 8

Before feeding the data to the model, it needs to be preprocessed, as shown in Figure 9.

Figure 9

One-hot encoding converts a class vector to a binary class matrix. Since we are dealing with a multi-class classification dataset, the labels need to be converted to categorical, as done in step 8. Now the data is ready to be fed into the model.
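A sketch of the preprocessing and one-hot encoding steps on a tiny stand-in batch; the real code in Figure 9 applies the same operations to the 28×28 MNIST arrays:

```python
import numpy as np
from tensorflow.keras.utils import to_categorical

# A tiny stand-in batch: one 2x2 8-bit grayscale "image" (values 0-255).
x = np.array([[[0, 128], [255, 64]]], dtype='uint8')   # shape (1, 2, 2)
y = np.array([3])                                      # class labels 0-9

# Scale pixel values to [0, 1] and add a channel dimension.
x = x.astype('float32') / 255.0
x = x.reshape(x.shape[0], 2, 2, 1)

# One-hot encode: class vector -> binary class matrix.
y = to_categorical(y, num_classes=10)
print(y)   # [[0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]]
```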

To start model building, import the Sequential model from keras.models and the layers from keras.layers, as shown in step 9.

Figure 10
Figure 11
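The model definition from steps 9 and 10 might look something like the sketch below; the filter counts and dense-layer width are assumptions, since the original code appears only as screenshots:

```python
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# A small CNN for 28x28 grayscale digits: two conv+pool blocks,
# then flatten into fully connected layers.
model = Sequential([
    Input(shape=(28, 28, 1)),
    Conv2D(32, kernel_size=(3, 3), activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    Flatten(),
    Dense(128, activation='relu'),
    Dense(10, activation='softmax'),   # 10 digit classes
])

model.summary()
```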

Compile the model with categorical_crossentropy as the loss function, accuracy as the metric, and Adam as the optimizer.

Fit the model with x, y, a batch size of 32, epochs set to 5 (i.e. the model trains for 5 cycles), and a validation split of 0.2 (i.e. an 80:20 train/validation split).

Figure 12
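A sketch of the compile-and-fit step: a tiny dense model and random data stand in for the real CNN and MNIST here so the example runs in seconds, but the `compile` and `fit` arguments mirror the article's settings:

```python
import numpy as np
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense

# Minimal stand-in model (the real one is the CNN defined earlier).
model = Sequential([
    Input(shape=(28, 28, 1)),
    Flatten(),
    Dense(10, activation='softmax'),
])
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# Random stand-in data: 64 "images" with random one-hot labels.
x = np.random.rand(64, 28, 28, 1).astype('float32')
y = np.eye(10)[np.random.randint(0, 10, size=64)]

# batch_size=32, epochs=5, validation_split=0.2 (an 80:20 split),
# as in the article.
history = model.fit(x, y, batch_size=32, epochs=5,
                    validation_split=0.2, verbose=0)
```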

Train and test accuracy are above 95%. The predicted output is shown in step 13.

Figure 13


I would love to hear some feedback on my first work. Thank you for your time!

