Basics of Deep Neural Networks in Vision — Part 1

Shubham Agnihotri · Published in Analytics Vidhya · Jul 18, 2020

This blog is an introduction to the basics of Deep Neural Networks. There is a lot of buzz around AI; have you ever wondered how it works? I will be covering AI in Vision in depth, so let's get started with the basics. This blog will give you an intuition about kernels, channels, and the training of DNNs.


So let's get started…

The basic components of a Deep Neural Network (DNN) are kernels and channels. Before jumping into them, I will briefly explain the process a vision AI system follows.

The Process (In brief)

In vision, the input data are images, which are fed to the neural network. The initial few kernels act as filters that identify the basic building blocks of the images, i.e. edges and gradients.

Edges and Gradients

The next few kernels filter textures and patterns,

Textures and Patterns

Then the kernels filter parts of objects,

Parts of Object

Then they filter objects.

Objects

The images you see above are the feature maps, i.e. channels, produced as the output of the filter operations. In the image above, at the bottom-right corner, there is a dog feature map; if a dog is present in the input image, this channel will confirm it. So let's understand what channels and kernels are.

Channels

Each channel is a container of specific information, and containers are objects for holding things. Let's take the example of television:

TV Channels

Each TV channel broadcasts shows of a specific kind. You would never expect news on Cartoon Network or cartoons/anime on CNN. Each channel is a container that stores hours of video and information.

In a DNN, channels are likewise containers, and what they contain are feature maps. From the images in 'The Process', we saw channels of edges and gradients, textures and patterns, parts of objects, and objects.
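To make this concrete, here is a minimal sketch (assuming PyTorch, which this post does not prescribe) showing that a convolution layer with 16 filters produces 16 channels, i.e. one feature map per filter:

```python
import torch
import torch.nn as nn

x = torch.rand(1, 3, 64, 64)   # a batch of one RGB image: 3 input channels
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)

channels = conv(x)             # 16 filters -> 16 output channels (feature maps)
print(channels.shape)          # torch.Size([1, 16, 62, 62])
```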

Kernel

3x3 Kernels

The kernel is the component that extracts features for us. A kernel can also be called a filter or a feature extractor. Its task is to take an input, extract features from it, and store them in feature maps called channels. In the image above, layer 1 is the input layer; every square in layer 1 is a pixel of the image and acts as a neuron. The grid in green is the kernel, which slides over the input channel, i.e. the image, to filter out edges and gradients. The output of the kernel applied to layer 1 is stored in layer 2. On layer 2, a new kernel is run, whose output is stored in layer 3.
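Here is a toy NumPy sketch of that filtering operation (my own illustrative example, not from the post): a 3x3 vertical-edge kernel slides over a tiny image, and its responses are written into a feature map.

```python
import numpy as np

# A toy 6x6 grayscale "image": bright left half, dark right half
image = np.zeros((6, 6))
image[:, :3] = 1.0

# A 3x3 vertical-edge kernel (Sobel-like values)
kernel = np.array([[1, 0, -1],
                   [2, 0, -2],
                   [1, 0, -1]], dtype=float)

# Slide the kernel over the image (cross-correlation, stride 1,
# "valid" positions only — as convolution is done in most DL frameworks)
h, w = image.shape
k = kernel.shape[0]
feature_map = np.zeros((h - k + 1, w - k + 1))
for i in range(h - k + 1):
    for j in range(w - k + 1):
        patch = image[i:i + k, j:j + k]
        feature_map[i, j] = np.sum(patch * kernel)

# Strong responses appear exactly where the bright/dark edge lies
print(feature_map)  # each row reads [0. 4. 4. 0.]
```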

You might be wondering what the kernel values should be so that the kernel can filter out the desired object. I will cover training in brief here (and in depth in Part 2).

Kernel Initialization

Kernel values are assigned randomly from a set of random numbers. When training from scratch, we cannot have kernel values of all 0s or all 1s, or anything other than random values. The reasons are as follows:

Kernels are filters that work over the input image and give out channels as output. If all values are 0, every convolution operation (discussed in Part 2) will produce 0. Take the example of filtering tea: all 0s means the filter has no openings to separate the tea from the tea leaves, so when we pour our tea through that filter, we get nothing.

And if we initialize all kernel values with 1, we will have identical kernels giving identical channels as output. The task of the kernels is to separate features, whereas identical filters produce identical results. Continuing the example: if the filter used to separate tea from tea leaves is fully open (no mesh), it gives the same unfiltered tea back.
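A short sketch of both failure modes (again assuming PyTorch; the framework choice is mine):

```python
import torch
import torch.nn as nn

x = torch.rand(1, 1, 5, 5)  # a random 5x5 single-channel "image"
conv = nn.Conv2d(1, 4, kernel_size=3, bias=False)

# All-zero kernels: every output is zero — nothing gets filtered
nn.init.zeros_(conv.weight)
print(conv(x).abs().sum())                   # tensor(0.)

# All-one kernels: all 4 output channels are identical copies
nn.init.ones_(conv.weight)
out = conv(x)
print(torch.allclose(out[0, 0], out[0, 1]))  # True

# Random init (the default idea): channels differ, so each
# kernel can learn to extract a different feature
nn.init.kaiming_normal_(conv.weight)
out = conv(x)
print(torch.allclose(out[0, 0], out[0, 1]))  # False (almost surely)
```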

Receptive Field

This is one of the most important concepts in DNNs with respect to vision.

Receptive Field Concept

In the image above, a 3x3 kernel is applied to a 5x5 image, giving a 3x3 channel; another 3x3 kernel is then applied to that, giving a 1x1 output. So two 3x3 kernels are used to reach a global receptive field of 5x5. The pixel in green knows what is happening in all the pixels in light purple, so its global receptive field is 5x5, whereas its local receptive field is 3x3: it directly sees only the pixels in yellow, but indirectly knows what is happening in the purple pixels.

The receptive field can be defined as the region of the input image that the DNN is looking at. For complex datasets, the minimum receptive field required for good results is at least the size of the image: the DNN should have information from all the pixel values to take an appropriate decision.

For example, if we want to detect flies, which are very small and can be anywhere in the frame, the DNN should be able to scan the whole image for the fly.

Reaching the Full Receptive Field Using 3x3 Kernels

As seen above, a global receptive field of 5x5 can be reached using one 5x5 kernel or two 3x3 kernels. In the real world, images are large, ranging from 50x50 to 1024x1024 or beyond, so convolution layers alone won't be used; we will have max pooling and other layers that help reach the desired receptive field.
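A quick way to check this arithmetic is the standard receptive-field recurrence (rf grows by (k - 1) * jump per layer, and each stride multiplies the jump). A minimal sketch, with layer configurations of my own choosing:

```python
def receptive_field(layers):
    """Global receptive field of a stack of conv/pool layers.

    layers: list of (kernel_size, stride) tuples, in order.
    """
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump  # each layer widens the field
        jump *= s             # strides make later layers widen it faster
    return rf

# Two 3x3 convs (stride 1) reach the same 5x5 field as one 5x5 conv
print(receptive_field([(3, 1), (3, 1)]))  # 5
print(receptive_field([(5, 1)]))          # 5

# A 2x2 max pool doubles the jump, so later 3x3 convs grow the field faster
print(receptive_field([(3, 1), (3, 1), (2, 2), (3, 1), (3, 1)]))  # 14
```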

Kernel Size

Kernels can be of various sizes: in images we have 2D or 3D kernels of any desired size, such as 5x5 or 100x100. It can be anything. But a kernel size of 3x3 is preferable over any other, for the following reasons:

  1. Reaching the desired receptive field with 3x3 kernels instead of bigger kernels reduces the parameter count significantly (see the sketch after this list).
    For example, a 5x5 kernel has 25 weights, so viewing all the pixels in a 5x5 region with one 5x5 kernel costs 25 parameters, whereas two stacked 3x3 kernels cover the same 5x5 receptive field with only 9 + 9 = 18 parameters. Once we move to bigger networks over complicated images, the difference becomes large.
  2. Odd-sized kernels have a well-defined centre pixel, which lets us form symmetric shapes like triangles more easily; this is tough with even-sized kernels.
  3. GPUs and their libraries are heavily optimized for 3x3 convolutions, so processing is faster.
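To check the parameter arithmetic from point 1, here is a minimal sketch (assuming PyTorch; the channel counts are my own illustrative choices):

```python
import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

# One 5x5 conv vs. two stacked 3x3 convs, single channel, no bias
one_5x5 = nn.Conv2d(1, 1, kernel_size=5, bias=False)
two_3x3 = nn.Sequential(nn.Conv2d(1, 1, 3, bias=False),
                        nn.Conv2d(1, 1, 3, bias=False))

print(n_params(one_5x5))  # 25
print(n_params(two_3x3))  # 18

# With realistic channel counts the gap widens, e.g. 64 -> 64 channels:
print(n_params(nn.Conv2d(64, 64, 5, bias=False)))            # 102400
print(n_params(nn.Sequential(nn.Conv2d(64, 64, 3, bias=False),
                             nn.Conv2d(64, 64, 3, bias=False))))  # 73728
```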

The Training (In Brief)

All DNNs use backpropagation in training. Say the network is meant to identify dogs in images. The kernels are initially randomly initialized, so they will not be able to identify dogs; for all the wrong results, we calculate a loss and update the weights of the neurons according to that loss. The updated weights are then used to do the same task, and this process repeats until a target accuracy is reached or the number of epochs (total training loops) is exhausted. An in-depth treatment will come in later parts.
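Here is a minimal sketch of that loop (assuming PyTorch; the tiny model and fake dog/no-dog data are placeholders of my own, not the post's setup):

```python
import torch
import torch.nn as nn

# A tiny placeholder CNN for binary dog / no-dog classification
model = nn.Sequential(
    nn.Conv2d(3, 8, 3), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(8, 2),
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

images = torch.rand(4, 3, 32, 32)    # fake batch of 4 RGB images
labels = torch.randint(0, 2, (4,))   # fake dog / no-dog labels

for epoch in range(5):               # repeat until accuracy/epochs are exhausted
    logits = model(images)           # forward pass with current weights
    loss = loss_fn(logits, labels)   # how wrong were we?
    optimizer.zero_grad()
    loss.backward()                  # backpropagation: gradients of the loss
    optimizer.step()                 # update weights according to the loss
    print(epoch, loss.item())
```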

References

  1. Distill (distill.pub)

If you liked the article, please give it a clap. Feel free to comment below in case of doubts.
