Google Inception Architecture: Won the ImageNet Challenge in 2014

Deep Learning and Computer Vision: From Basic Implementation to Efficient Methods

Hey guys! This tutorial focuses on the implementation of computer vision algorithms and discusses in depth the nuances of the models involved. Read till the end: it includes tips and tricks for implementing efficient CNNs.

We'll learn about a lot of these concepts using an image classification example. BTW, here is a list of hundreds of open datasets for image classification; you can choose any one that interests you.

Let’s get started with a fundamental question: why do we need deep learning to enhance computer vision? To answer that, let’s think of images as huge arrays of numbers (tensors). The data in an 8-bit image is represented as an array of pixels, where each pixel is a value between 0 and 255. How do we make sense of this raw numerical data?

Pixel value in Images
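To make this representation concrete, here is a tiny sketch of reading an image as an 8-bit array, assuming NumPy and Pillow are available (`example.jpg` is a placeholder file name):

```python
import numpy as np
from PIL import Image

img = Image.open('example.jpg').convert('L')  # placeholder file; 'L' = 8-bit grayscale
pixels = np.array(img)                        # 2D array of values in [0, 255]
print(pixels.shape, pixels.dtype, pixels.min(), pixels.max())
```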

The process of making sense of these numbers is called feature extraction, and the elegance of Signal Processing and Digital Image Processing is that algorithms have been designed to extract features like edges and contours. There are several algorithms, like SIFT and SURF, that extract features from images. These features are the key descriptors of an image, and they serve as the data fed to Machine Learning algorithms for the creation of hypothesis functions (models), which can then be tested on test data.

On the other hand, Deep Learning simplifies feature extraction through the operation of convolution. Convolution is a mathematical operation that maps out an energy function, a measure of similarity between two signals, or in our case images. So, when we convolve white light with a blue filter, the resultant energy spectrum is that of blue light. Hence the term Convolutional Neural Networks: networks where feature extraction is done via convolution.
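To see the operation itself, here is a minimal NumPy sketch of a "valid" 2D convolution, using a Sobel-like kernel as the filter (the helper name and array sizes are illustrative assumptions):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the (flipped) kernel over the image and sum element-wise products."""
    kh, kw = kernel.shape
    h, w = image.shape
    flipped = kernel[::-1, ::-1]  # true convolution flips the kernel
                                  # (deep learning libraries usually skip the
                                  #  flip and compute cross-correlation)
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * flipped)
    return out

image = np.random.randint(0, 256, size=(5, 5)).astype(float)
edge_kernel = np.array([[-1.0, 0.0, 1.0],
                        [-2.0, 0.0, 2.0],
                        [-1.0, 0.0, 1.0]])  # Sobel-like vertical-edge filter
print(conv2d_valid(image, edge_kernel))     # (3, 3) response map
```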

Convolutional Neural Networks

Let’s start off with CNNs. This is part of the data-driven approach, by which we refer to the dependence of Deep Learning frameworks on large amounts of data. The data-driven approach is an efficient way to make a dumb model clever: the more data it is exposed to, the better it gets.

ImageNet Architecture

The first few conv layers extract low-level features like edges. The deeper conv layers extract more complicated features like faces, digits, etc., that is, the objects of interest. This statement is an overgeneralization, but at a broad level it holds.

Let’s understand each step in detail.

Convolution: We have already seen what convolution is. Initially, the filters are initialized randomly, e.g., from a Gaussian distribution. Each filter learns to respond to certain patterns, and the filters learn richer patterns as the network gets deeper. Now you may ask: how can the same filters get different insights from the same data? To answer that, let’s talk about pooling.
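As a minimal sketch of what randomly initialized filters look like in Keras (the filter count, kernel size, and input shape below are illustrative assumptions):

```python
from tensorflow.keras import layers, initializers

# 32 filters of size 3x3, weights drawn from a Gaussian when the layer is built
conv = layers.Conv2D(
    filters=32,
    kernel_size=(3, 3),
    activation='relu',
    kernel_initializer=initializers.RandomNormal(mean=0.0, stddev=0.05),
    input_shape=(224, 224, 3),  # example ImageNet-sized RGB input
)
```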

Max Pooling with Stride 2

Pooling: Well, pooling is a sub-sampling technique. Pooling is used to reduce the spatial dimensions of the feature maps produced by convolution. There are two common types: max pooling and average pooling. The GIF above illustrates max pooling; guess what stride is from it…
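Here is a minimal NumPy sketch of the operation in the GIF, 2x2 max pooling with stride 2 (the helper name is hypothetical):

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2: keep the largest value in each window."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % 2, :w - w % 2]      # drop odd edge rows/cols
    windows = trimmed.reshape(h // 2, 2, w // 2, 2)     # group into 2x2 windows
    return windows.max(axis=(1, 3))

fmap = np.array([[1, 3, 2, 1],
                 [4, 6, 5, 2],
                 [7, 2, 9, 8],
                 [1, 0, 3, 4]], dtype=float)
print(max_pool_2x2(fmap))  # [[6. 5.] [7. 9.]]
```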

Batch Normalization: We add this layer to normalize the features. Technically, batch normalization normalizes the output of the previous activation layer (initially, the input layer) by subtracting the batch mean and dividing by the batch standard deviation. This makes the model more robust and helps it learn effectively. Intuitively, it also acts as a regularizer, helping prevent overfitting!
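A minimal NumPy sketch of the computation just described; `gamma` and `beta` stand in for the learnable scale and shift parameters that a real Batch Norm layer trains:

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift."""
    mean = x.mean(axis=0)                  # per-feature batch mean
    var = x.var(axis=0)                    # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

batch = np.random.randn(32, 64) * 5 + 3    # 32 samples, 64 features, shifted and scaled
normed = batch_norm(batch)
print(normed.mean(axis=0)[:3], normed.std(axis=0)[:3])  # ~0 and ~1
```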

Dropout: This is another regularization technique, popular before Batch Norm. The way it works is that individual nodes are either dropped out of the net with probability 1-p or kept with probability p, so that a reduced (thinned) network is left; incoming and outgoing edges to a dropped-out node are also removed. Since a different subset of nodes is dropped at each training step, the model ends up learning redundant representations, which again prevents overfitting.

Srivastava, Nitish, et al. “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, Journal of Machine Learning Research, 2014.
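A minimal sketch of (inverted) dropout in NumPy; the rescaling by 1/p keeps the expected activation unchanged, so nothing special is needed at inference time:

```python
import numpy as np

def dropout(x, p_keep=0.8, training=True):
    """Inverted dropout: zero units with probability 1 - p_keep, rescale the rest."""
    if not training:
        return x                                   # no-op at inference time
    mask = (np.random.rand(*x.shape) < p_keep)     # keep each unit with prob p_keep
    return x * mask / p_keep                       # rescale to preserve expected value

activations = np.ones((2, 8))
print(dropout(activations, p_keep=0.5))  # roughly half the units zeroed, survivors doubled
```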

Zero-Padding: This helps prevent the loss of spatial dimensions during convolution, so for very deep networks we usually prefer it. The zeros don’t add to the energy quotient during convolution, and they keep the output dimensions at the required level.
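In Keras this is the difference between `padding='valid'` and `padding='same'`; a small sketch (the input and filter sizes are illustrative):

```python
from tensorflow.keras import Input, Model, layers

# 'valid' shrinks the feature map; 'same' zero-pads to preserve height and width
inp = Input(shape=(28, 28, 1))
valid = layers.Conv2D(8, (3, 3), padding='valid')(inp)  # -> (26, 26, 8)
same = layers.Conv2D(8, (3, 3), padding='same')(inp)    # -> (28, 28, 8)
print(Model(inp, valid).output_shape, Model(inp, same).output_shape)
```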

Fully Connected Networks: The output from the convolutional layers represents high-level features in the data. While that output could be flattened and connected directly to the output layer, adding a fully-connected layer is a way of learning non-linear combinations of these features. Essentially, the convolutional layers provide a meaningful, low-dimensional, and somewhat invariant feature space, and the fully-connected layer learns a non-linear function, induced by the activation functions, in that space. This part is essentially a classic artificial neural network.
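Putting these pieces together, here is a minimal sketch of a small CNN in Keras; the layer sizes, the 64x64 RGB input, and the 10-class output are illustrative assumptions, not a reference architecture:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), padding='same', activation='relu',
                  input_shape=(64, 64, 3)),   # feature extraction, zero-padded
    layers.BatchNormalization(),              # normalize activations
    layers.MaxPooling2D((2, 2)),              # sub-sample with stride 2
    layers.Conv2D(64, (3, 3), padding='same', activation='relu'),
    layers.BatchNormalization(),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),     # fully connected: non-linear feature mixing
    layers.Dropout(0.5),                      # regularization
    layers.Dense(10, activation='softmax'),   # e.g., 10 classes
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()
```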

We shall implement a CNN and check the effects of batch norm… Also, what do we do when we have very few images in our dataset? We use transfer learning: a process where we take pre-trained weights and train only the last few layers on our dataset.

Let’s get started: the code is available at the Dataturks official GitHub account. A PyTorch implementation of the same is also available on their website.

An image classification dataset on Dataturks.

Let’s start with the dataset. To start annotating images, you can use Dataturks’ tools. Once done annotating, you can export the images into Keras format, which essentially means keeping the images of each class in its own folder. Download the JSON format file from Dataturks. Believe me! It’s super easy to use.

Here is a super interesting dataset you can try: classified images of the faces of around 70 characters from Game of Thrones.

Once you have the dataset… run the following code to start training. The dataset is imported using the flow_from_directory method in Keras.
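A minimal sketch of that import step, assuming the Keras-format folder layout described above (the `data/train` path, image size, and batch size are placeholders, and `model` is the sketch from earlier):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Assumed directory layout: data/train/<class_name>/*.jpg
datagen = ImageDataGenerator(rescale=1.0 / 255, validation_split=0.2)

train_data = datagen.flow_from_directory(
    'data/train',              # placeholder path; point this at your dataset folder
    target_size=(64, 64),      # resize images to match the model's input shape
    batch_size=32,
    class_mode='categorical',
    subset='training',
)
val_data = datagen.flow_from_directory(
    'data/train', target_size=(64, 64), batch_size=32,
    class_mode='categorical', subset='validation',
)

model.fit(train_data, validation_data=val_data, epochs=10)
```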

My friend Sameer has implemented a CNN classifier in both Keras and TensorFlow. Check out his blog here.

Next, if our dataset is small, we can use transfer learning. Let’s get started…
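A minimal transfer-learning sketch in Keras, assuming an ImageNet-pretrained VGG16 base; the choice of base model, the head sizes, and the 70-class output (matching the Game of Thrones dataset above) are illustrative:

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Load the ImageNet-pretrained convolutional base without its original classifier head
base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
base.trainable = False                       # freeze the pre-trained weights

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation='relu'),    # new, randomly initialized head
    layers.Dropout(0.5),
    layers.Dense(70, activation='softmax'),  # e.g., ~70 Game of Thrones characters
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```

Only the new head is trained at first; once it converges, some of the later base layers can be unfrozen and fine-tuned with a smaller learning rate.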

  1. Representation of data
  2. Evaluation
  3. Optimization

#DeepHacks :P

  1. What to do when we have less data? If possible, use the Bing / Google image search APIs to grow the dataset. If the dataset is very specific, try creating one in a controlled environment, by which I mean ambient lighting conditions, a good-resolution camera, etc. Use the Dataturks annotation tool to label each image in the dataset quickly. The dataset decides the quality of your output. Transfer learning is the best available solution to the problem of less data.
  2. Learning Rates: How do we decide which learning rate is best? The best technique I have come across so far is comparing loss versus learning rate: run one epoch on your dataset for each candidate learning rate, compare the resulting losses, plot the graph, and analyze it (see the sketch after this list). Wherever the rate of change of the loss is highest, the corresponding learning rate is the optimum one, usually between 1e-3 and 1e-5… When using pre-trained models, the learning rate must be smaller for the weights being fine-tuned than for the (randomly initialized) weights of the new linear classifier. This is because we expect the ConvNet weights to be relatively good already, so we don’t wish to distort them too quickly or too much.
  3. Weights Initialization: The initialization of weights makes a huge impact on the firing of neurons. Xavier initialization works well with tanh/sigmoid activations; with the ReLU activation function, He initialization, a ReLU-adapted variant of Xavier, is generally preferred.
  4. Implementations in Keras and PyTorch are relatively simple, but in my experience training is slower compared to TensorFlow. I would personally suggest TensorFlow to grasp the concepts and improve your coding skills.
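Here is the loss-versus-learning-rate sweep from tip 2 as a minimal sketch; it reuses the hypothetical `model` and `train_data` from the earlier sketches, and recompiles with a fresh optimizer per run so the comparisons stay fair:

```python
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.optimizers import Adam

# Train one short epoch per candidate rate and compare the resulting losses
candidate_lrs = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
losses = []
initial_weights = model.get_weights()       # snapshot so every run starts identically
for lr in candidate_lrs:
    model.set_weights(initial_weights)      # reset weights between runs
    model.compile(optimizer=Adam(learning_rate=lr),
                  loss='categorical_crossentropy', metrics=['accuracy'])
    history = model.fit(train_data, epochs=1, verbose=0)
    losses.append(history.history['loss'][-1])

plt.plot(np.log10(candidate_lrs), losses, marker='o')
plt.xlabel('log10(learning rate)')
plt.ylabel('loss after 1 epoch')
plt.show()  # pick the rate where the loss drops fastest
```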

Where next?

The above content focuses on image classification only, but there is more to computer vision than the classification task. The detection, segmentation, and localisation of classified objects are equally important, for example in self-driving cars. So I strongly suggest you go through R-CNN and its variants. This is a vast field with loads of efficient implementations. I shall talk about this in yet another blog…

This is the second story in the 5-part tutorial series… the first being this. Do let me know your feedback on this story, or any doubts or views, at my email: lalith@dataturks.com
