Towards Deep Learning

Dantin Kakkar
Oct 3, 2018


Contributors: Dantin Kakkar, Srimathi H

Image from https://github.com/InFoCusp/tf_cnnvis

This post is about our foray into image similarity detection, and covers the challenges we came across, how we approached some of these, and common gotchas to look out for. I will also try to include resources I found valuable throughout this post wherever appropriate. The principles involved here are reusable and may be extended to generalised deep learning/artificial intelligence problems.

Each section will cover a step or two of the process, from data collection to testing various approaches, to the approaches themselves and some of the reasoning behind them. Before we get into the details, here’s an overview of the “steps” we followed:

  1. Data Collection: This was inarguably one of the most important steps of the process. What mattered was not only the quality of the data we collected, but also how closely it resembled real-world data and whether we could generalise and predict on that data.
  2. Enhancement & Augmentation: The problem domain we were working on didn’t have as much data as we would have liked. As a result, we relied on a host of image enhancement and augmentation techniques to improve upon the dataset we had.
  3. Choosing the right framework: Today, developers have a plethora of options to choose from as far as deep learning frameworks are concerned, each of which offers its own benefits and has its own drawbacks. As such, selecting the right framework to work with had a significant impact on the speed with which we could prototype and test our approaches.
  4. Approaching the problem: This is an often ignored, but vital part of the pipeline. The approach one takes to a deep learning problem involves a lot of deliberation and has far-reaching effects on the solution itself.
  5. Tuning your model: Dos and Don’ts. One tends to pick up more from making mistakes here, but a little forethought goes a long way in helping realise those mistakes faster.

Data Collection & Augmentation

Image from https://dawn.cs.stanford.edu/2017/08/30/tanda/

Often, it is not an easy endeavour to obtain the right type of data required for your specific problem domain.

Staying within the image domain, let us consider an example problem: the classification of mobile phones. Given a photo of a mobile phone, either its front or its back, we need to classify with a reasonable degree of certainty whether it is an iOS- or Android-based phone.

It would be a wasteful exercise to search for a specific dataset, especially one with clearly labelled pictures of iPhones and Android-based handsets. It might, however, be a simpler exercise to retrieve a smaller dataset for this problem using techniques such as web scraping. This “smaller” dataset can be put to good use with an approach known as Transfer Learning. There are often pre-trained networks and generalised datasets available for problem domains, some of which may be found here. An example of such a dataset for image classification is ImageNet.
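As an illustration, here is a minimal sketch of transfer learning using MXNet Gluon’s model zoo (the framework we eventually settled on, discussed below). The two-class output layer and the hyper-parameter values are purely illustrative.

```python
from mxnet import gluon, init
from mxnet.gluon.model_zoo import vision

# A network pre-trained on ImageNet already "knows" low-level features
# such as edges, textures, and common object parts.
pretrained = vision.resnet18_v2(pretrained=True)

# Build an identical network, but with a fresh output layer sized for
# our own problem (here: 2 classes, iOS vs Android).
net = vision.resnet18_v2(classes=2)
net.features = pretrained.features      # reuse the learned feature extractor
net.output.initialize(init.Xavier())    # only the new layer starts from scratch

# Fine-tune on the small scraped dataset with a low learning rate, so the
# pre-trained weights are nudged rather than overwritten.
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.001})
```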

Sometimes having the data is not enough; maybe the data just isn’t generalised enough. In the case of images, the data could be of a quality/resolution that one cannot practically hope to have in a production system, or the images could all be taken from the same angle, and so forth. In such cases, we use data augmentation techniques to enhance the data we have, so as to enable our system to generalise better. Some common ways are adding noise to images and applying transformations such as skewing, flipping, and rotation. Intuitively, the more data you have to represent something, the better you should be able to generalise.
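For instance, here is a minimal augmentation sketch using Keras (one of the frameworks discussed below); the directory path and parameter values are illustrative, not tuned.

```python
from keras.preprocessing.image import ImageDataGenerator

# Each argument describes a random transformation applied on the fly,
# so the network rarely sees exactly the same image twice.
augmenter = ImageDataGenerator(
    rotation_range=20,       # random rotations of up to 20 degrees
    shear_range=0.2,         # skewing
    horizontal_flip=True,    # mirror images left/right
    width_shift_range=0.1,   # small horizontal translations
    height_shift_range=0.1,  # small vertical translations
    zoom_range=0.1,          # slight zoom in/out
)

# Stream augmented batches from a directory of labelled images.
train_batches = augmenter.flow_from_directory('data/train', target_size=(224, 224))
```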

Choosing the right framework

When working with deep learning projects, one has a number of good options to sift through before getting started, each with its own strengths. While we did not evaluate an exhaustive set of frameworks, we explored a few; here is why we chose these and how they compared:

  1. Keras: Keras is a high-level, abstracted deep learning framework which runs on top of TensorFlow, Theano, and a couple of other “lower-level” deep learning frameworks. A big win for us with Keras was its simplicity and the ease with which one can start working with it. We felt Keras would help us spend more time on solving the problem itself than on dealing with the how-to of using a framework.
  2. Caffe2: While we did explore a bit into Caffe2, we almost immediately discarded it since we didn’t feel it was documented well enough and would take more time to get accustomed to than we were prepared to dedicate to this task. It has in-built support for image classification problems and well-defined neural network layer types. Caffe2’s utility drops significantly outside convolutional networks, though.
  3. Apache MXNet: Apache MXNet is a powerful and versatile deep learning framework by the Apache Software Foundation that provides APIs in multiple languages including C++, Python, and R. While retaining the power of a lower-level framework, it provides high-level abstracted APIs that make working with neural networks a breeze. Additionally, the documentation and tutorials were excellent, and the framework was supported by Amazon SageMaker too.

One of our aims going into this was to pick a framework that was easy to get started with, but we also wanted the right balance: abstracted enough to be productive, yet not closing off the option of delving deeper when needed. For this purpose, we proceeded with Apache MXNet using the Gluon API.
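To give a feel for the level of abstraction Gluon offers, here is a quick sketch of a small network definition; the layer sizes are arbitrary.

```python
from mxnet import init
from mxnet.gluon import nn

# A small feed-forward network. Gluon keeps layer definitions close to
# Keras in brevity, while still exposing the lower-level MXNet engine.
net = nn.HybridSequential()
with net.name_scope():
    net.add(nn.Dense(128, activation='relu'))
    net.add(nn.Dense(64, activation='relu'))
    net.add(nn.Dense(2))                 # two output classes

net.initialize(init.Xavier())
net.hybridize()                          # compile into MXNet's symbolic graph for speed
```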

Approaching the problem

Conventional wisdom in the area of deep learning and machine learning endorses the 80/20 split between training and test datasets. While this works when the amount of data being dealt with is comparatively small, as the size of our data grows, a much smaller fraction needs to be held out for evaluation. It is suggested that a 98/1/1 split between training, validation, and test data is a better option than the 80/20 split for large datasets. There is strong reasoning behind using a separate validation set in addition to the test set, which is as follows:

At a higher level, we have a set of data resembling the production data the model has to actually work on. This data is our target, and our aim is to train a model that works best on it; the test set represents this data. The validation dataset is what we use to gauge the performance of one particular model while we continuously tweak its hyper-parameters (such as the learning rate, the optimiser, etc.). The test data then helps us differentiate between different models and approaches.
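As a concrete sketch, a 98/1/1 split can be produced with two calls to scikit-learn’s train_test_split, assuming `images` and `labels` are already loaded as arrays:

```python
from sklearn.model_selection import train_test_split

# Carve off 2% of the data first, then halve it into validation and test,
# giving a 98/1/1 split overall.
train_x, holdout_x, train_y, holdout_y = train_test_split(
    images, labels, test_size=0.02, stratify=labels, random_state=42)
val_x, test_x, val_y, test_y = train_test_split(
    holdout_x, holdout_y, test_size=0.5, stratify=holdout_y, random_state=42)
```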

Hence, our approach to the problem involves a variety of steps. A rough version of the same could be:

  • Determine the Bayes Error Rate: The Bayes error rate is the lowest possible error rate for a classifier, and is often approximated by human-level error. Estimating this value correctly helps us determine how to proceed while training the model itself. For instance, the Bayes error rate for differentiating bottles from headphones could be ~ 0%. However, the same for distinguishing between over-the-ear and on-ear headphones could be significantly higher.
  • Make sure the test data is representative of the target: In most cases it is a pure waste of time and effort to have a test set that doesn’t represent the data you will be dealing with in a real-world system. If you make sure your model’s test-set performance is indicative of how it will perform in production, the system becomes a whole lot easier to improve upon and test.
  • Validation/Dev and Test Set distribution: The validation and test sets should come from the same distribution of data, even if the training set itself is cobbled together from different distributions. The reason is that the effort spent tuning hyper-parameters to improve performance on the validation data should carry over to the test set as well.
  • Intuition: A very common mistake is to simply throw data at a network until it starts making some sense. It is important to have a realistic idea of how a problem is going to play out, and whether it is too complex to be dealt with by a single model. Sometimes it is a better option to break it down into simpler problems and create multiple steps to deal with each of those separately. Taking the same phone example we discussed earlier, say you want to find similar phones given the image of a phone. It might be a less daunting task to first classify the phone into a type, based on a factor such as form factor or manufacturer, and then perform image similarity detection on the subset of images we have. We hence break one task (image similarity detection on a database of phones) into two: classification based on some attribute, followed by image similarity detection on a limited subset, which might be easier to perform. A rough sketch of this two-stage idea follows this list.
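Below is a minimal sketch of such a two-stage pipeline. The `classifier`, `embedder`, and `catalogue` objects are hypothetical stand-ins for a trained type classifier, a feature-extraction network, and a pre-computed index of known phones.

```python
import numpy as np

def find_similar_phones(query_image, classifier, embedder, catalogue, top_k=5):
    """Stage 1: classify the phone type; stage 2: search only within that type.

    `catalogue` is assumed to map each phone type to a tuple of
    (embeddings, ids) for the phones we already know about.
    """
    phone_type = classifier(query_image)          # stage 1: narrow the search space
    embeddings, ids = catalogue[phone_type]

    query_vec = embedder(query_image)             # stage 2: similarity within the subset
    # Cosine similarity between the query and every phone of the same type.
    scores = embeddings @ query_vec / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    best = np.argsort(scores)[::-1][:top_k]
    return [ids[i] for i in best]
```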

Tuning your model

Image from https://docs.aws.amazon.com/machine-learning/latest/dg/model-fit-underfitting-vs-overfitting.html

Tuning your model is where it’s at. Often, tweaking hyper-parameters ends up giving far better results than one would imagine. Before we dive into this, there are two terms one should be familiar with:

  1. Bias: The bias of a model is a measure of its performance on the training data with respect to the Bayes error. For example, if the Bayes error is 1% (i.e., you can achieve 99% accuracy) and the training data accuracy is 87%, the bias is 12%.
  2. Variance: The variance of a model is the difference between its performance on the training set vis-a-vis the dev/validation set. For instance, if the performance on the training set is 87% and that on the dev/validation set is 85%, the variance is 2%.

There is an obvious tradeoff between bias and variance. When the bias is high, the model is said to be under-fitting, i.e., not fitting the data well enough. When the variance is high, the model is said to be overfitting, i.e., not generalising well enough and instead fitting to the individual data points rather than to the generalised function we are looking for.

Let us take the example of a deep learning problem where the Bayes error is 1%. A 3-layer neural network trained on the data achieves 91% accuracy on the training set and 77% accuracy on the dev/validation set. The bias in this case is (99 - 91 =) 8%, and the variance is (91 - 77 =) 14%. Since the variance is much higher than the bias, the model is overfitting. Some ways to avoid overfitting are training on more data, adding dropout, and so forth.

Suppose the bias is now 8% for a slightly tweaked model, but the variance is, say, 1%. The model is under-fitting, so we need to train it more, either by using a more complex network architecture/model or by training for longer.
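The arithmetic above is simple enough to capture in a few lines; here is a small helper that reproduces both scenarios (accuracies in percent):

```python
def diagnose(bayes_acc, train_acc, val_acc):
    """Rough bias/variance diagnosis from accuracies, in percent."""
    bias = bayes_acc - train_acc       # gap to the best achievable performance
    variance = train_acc - val_acc     # gap between training and validation
    if variance > bias:
        return bias, variance, 'overfitting: more data, dropout, regularisation'
    return bias, variance, 'under-fitting: bigger model or more training'

print(diagnose(99, 91, 77))   # bias 8, variance 14 -> overfitting (first example)
print(diagnose(99, 91, 90))   # bias 8, variance 1  -> under-fitting (second example)
```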

One might also perform an exhaustive “grid”-like search with various hyper-parameters to see what performs best.
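A grid search does not need a framework; a plain loop is often enough. In the sketch below, `train_and_evaluate` is a hypothetical helper standing in for your own training loop, and the grid values are only examples:

```python
from itertools import product

def train_and_evaluate(learning_rate, dropout, batch_size):
    # Hypothetical: train a model with these hyper-parameters and return
    # its validation accuracy. Replace the body with your training loop.
    return 0.0

grid = {
    'learning_rate': [0.1, 0.01, 0.001],
    'dropout': [0.2, 0.5],
    'batch_size': [32, 64],
}

results = []
for lr, drop, bs in product(*grid.values()):
    results.append(((lr, drop, bs), train_and_evaluate(lr, drop, bs)))

best_params, best_accuracy = max(results, key=lambda r: r[1])
```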

Often, the problem lies not within the values of the hyper-parameters of the neural network, but with the architecture of the neural network itself. In such a case, it might be prudent to consider other architectures or to try to take a look under the hood of the black box that is your neural network. One such tool that helps visualise what the convolutional layers are detecting is: https://github.com/InFoCusp/tf_cnnvis.

In the case of image-related problems, the most obvious solution is a classification model, but if this isn’t quite right for your use case, one might move towards similarity detection, feature encoding, and other approaches. To this end, there is plenty of new research detailing different techniques one might use to bolster results. Some examples from the domain of images include ResNets, Triplet Networks, Mask-RCNN, etc.
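As a flavour of what these approaches involve, here is a minimal numpy sketch of the loss behind triplet networks; it is a simplification for a single triplet, and the margin value is arbitrary:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on three embedding vectors.

    Pushes the anchor to be closer to the positive (an image of the same
    item) than to the negative (a different item) by at least `margin`.
    """
    pos_dist = np.sum((anchor - positive) ** 2)
    neg_dist = np.sum((anchor - negative) ** 2)
    return max(0.0, pos_dist - neg_dist + margin)
```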

In conclusion, it is easy to get lost in the rabbit hole of increasing network performance while completely forgetting the original goal. As with any project, it is pivotal to keep the original goals well-defined and always in mind while planning your next iteration or step.
