Deep Learning Personal Notes: Part 1 Lesson 1, Image Classification

Gerald Muriuki
Coinmonks
12 min read · Aug 6, 2018


This blog post series will be updated as I take a second pass through the fast.ai lessons. These are my personal notes, an attempt to understand things clearly and explain them well. Nothing new, just living up to this blog.

I will be renting Paperspace GPUs (graphics processing units) to train neural networks, as the cost seems quite affordable. Specifically, we will be using Nvidia GPUs, as they support CUDA.

fast.ai requires Python 3.

A brief recap of fast.ai

fast.ai is built on top of PyTorch, a deep learning library built by Facebook. The library incorporates the best practices and approaches from industry and research. Each time an interesting paper is published, they test it on a variety of datasets, tune it, and implement it in the library.

fast.ai's flexibility allows you to use these curated best practices as much or as little as you want. It makes it easy to write your own data augmentation, loss function, and network architecture. The philosophy of fast.ai is to learn things on an as-needed basis.

fast.ai is code-driven and uses a top-down approach, the "whole-game" approach of David Perkins, whereby students use neural nets right away and get results as soon as possible.

It then gradually peels back the layers to look at what is under the hood, rather than the bottom-up approach where you first learn all the building blocks and only at the end put everything together.

Let’s take on the Cats vs Dogs challenge.

In this competition, you'll write an algorithm to classify whether images contain either a dog or a cat. While it is easy for humans to tell dogs and cats apart, your computer will find it a little more challenging.

We are going to use convolutional neural networks to classify the images.

The imports
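
For reference, a sketch of what those imports look like with the fastai 0.7 library used in the course:

```python
# Wildcard imports in the style of the fast.ai course notebooks (fastai 0.7)
from fastai.imports import *
from fastai.transforms import *
from fastai.conv_learner import *
from fastai.model import *
from fastai.dataset import *
from fastai.sgdr import *
from fastai.plots import *
```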

The data path
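
A minimal sketch, assuming the dataset was downloaded to data/dogscats/ as in the lesson:

```python
PATH = "data/dogscats/"  # root folder of the dataset
sz = 224                 # side length, in pixels, that images are resized to
```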

This folder structure is the standard way to share image classification files. Each folder name tells you the label, e.g. cats or dogs.
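
A quick way to inspect that structure (a sketch; the exact file names will differ on your machine):

```python
os.listdir(PATH)             # ['train', 'valid', ...]
os.listdir(f'{PATH}valid')   # ['cats', 'dogs']
```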

Looking at a cat’s image

f’{PATH}valid/cats/{files[4]}’ — this is a Python 3.6 format string (f-string), which is a convenient way to format a string.

plotting the image
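
A sketch of reading and plotting one validation image with matplotlib (the index 4 is arbitrary):

```python
files = os.listdir(f'{PATH}valid/cats')           # file names of cat images
img = plt.imread(f'{PATH}valid/cats/{files[4]}')  # load as a numpy array
plt.imshow(img)                                   # display it
```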

What the raw data looks like

Let’s look at what the data looks like…

The image is a rank 3 tensor, i.e. a 3-dimensional array, with shape (374, 500, 3).

Values like (66, 38, 24) are the red, green and blue pixel intensities, each between 0 and 255. These numbers are a subset of what an image looks like inside a computer. We will take these numbers and use them to predict whether the image is a cat or a dog, by looking at lots of pictures.
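
A sketch of peeking at those raw values, using the img array loaded above:

```python
img.shape    # (374, 500, 3): height, width, RGB channels
img[:2, :2]  # a tiny corner of the image: raw 0-255 pixel values
```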

Let’s jump to training a model.

We’re going to use a pre-trained model, that is, a model created by someone else to solve a different problem.

Instead of building a model from scratch to solve a similar problem, we’ll use a model trained on ImageNet (1.2 million images and 1,000 classes). The model is a convolutional neural network (CNN).

What is a pre-trained model?

A pre-trained model is a model that has been previously trained on a dataset and contains the weights and biases that represent the features of whichever dataset it was trained on.

Learned features are often transferable to different data. For example, a model trained on a large dataset of bird images will contain learned features like edges or horizontal lines that would be transferable to your dataset.

Why use a pre-trained model?

Pre-trained models are beneficial to us for many reasons. By using a pre-trained model you are saving time. Someone else has already spent the time and compute resources to learn a lot of features and your model will likely benefit from it.

We will be using the resnet34 model.

Resnet34 is a version of the model that won the 2015 ImageNet competition. Other architectures in computer vision include AlexNet, VGG, GoogLeNet, ResNeXt, etc.

arch defines the architecture we will be using (resnet34)

data object contains the training and validation data.

ImageClassifierData.from_paths reads in images and their labels given as sub-folder names:

  • path: a root path of the data (used for storing trained models, precomputed values, etc)
  • bs: batch size. Default 64.
  • tfms: transformations (for data augmentations). e.g. output of tfms_from_model. Default 'None'
  • trn_name: a name of the folder that contains training images. Default 'train'
  • val_name: a name of the folder that contains validation images. Default 'valid'
  • test_name: a name of the folder that contains test images. Default 'None'
  • num_workers: number of workers. Default '8'

tfms=tfms_from_model(…) — transformations that get the data ready to pass to our model: normalization and resizing, for example.
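
Putting the pieces together, a minimal sketch of building the data object (fastai 0.7 API, with PATH and sz defined earlier):

```python
arch = resnet34                    # the pre-trained architecture to use
tfms = tfms_from_model(arch, sz)   # resizing/normalization for this model
data = ImageClassifierData.from_paths(PATH, tfms=tfms)  # training + validation data
```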

learn contains the model.

ConvLearner.pretrained:

  • f: arch. E.g., resnet34
  • data: previously defined data object
  • precompute: include/exclude precomputed activations. Default 'False' (precomputing caches the activations of the pre-trained layers so they only have to be computed once)

learn.fit trains/fits the model with a given learning rate and number of epochs. In this instance, it will do 3 epochs with a 0.01 learning rate, meaning it will look at each image three times in total.
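
A sketch of those remaining lines (same assumptions as the data sketch above):

```python
learn = ConvLearner.pretrained(arch, data, precompute=True)  # pre-trained resnet34 learner
learn.fit(0.01, 3)  # train: learning rate 0.01, 3 epochs
```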

trn_loss and val_loss are the values of the cross-entropy loss function.

How good is this model? Our model got an accuracy of 99%. Before the competition, the state of the art was 80%, and the competition itself produced a huge jump to 98.9% accuracy. Less than 4 years later, that result can be beaten in seconds.

So, by 2013 standards, you can get a Kaggle-winning image classifier in 17 seconds and 3 lines of code. But first it downloads a pre-trained model from the internet, which takes about 2 minutes. It then precomputes some caches, which takes about a minute and a half the first time you run the model. Subsequent training takes about 20 seconds.

Hence it is not accurate to say that deep learning always takes a lot of time, resources and data.

Analysing Results

Looking at the validation set labels, they are a bunch of 0s and 1s, where 0 represents cats and 1 represents dogs.
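
A sketch of inspecting those labels on the data object from earlier:

```python
data.val_y    # array([0, 0, ..., 1, 1]): label of each validation image
data.classes  # ['cats', 'dogs']: what the 0 and 1 indices mean
```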

Let’s look at the first ten predictions

The first column is cats, the second is dogs. The model returns the log of the predictions instead of probabilities. To get the probabilities, take the exp of the log.
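
A sketch of getting those predictions from the learner (np is numpy from the imports):

```python
log_preds = learn.predict()           # log probabilities on the validation set
log_preds[:10]                        # first ten rows: [log P(cat), log P(dog)]
preds = np.argmax(log_preds, axis=1)  # predicted class: 0 = cat, 1 = dog
probs = np.exp(log_preds[:, 1])       # probability of being a dog
```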

Printing some random correct images

A dog is 1 so anything greater than 0.5 is a dog and anything less is a cat.

Printing incorrect images

Most correct cats

Most correct dogs

Most incorrect cats (classified as dogs yet they are cats)

Most incorrect dogs (dogs that the machine actually thought were cats)

Most uncertain predictions, i.e. those closest to 0.5. The classifier does not really know what to do with these images.

The reason for looking at your data is to try to visualise what the model has built, so that you can take advantage of the things it is doing well and fix the things it is doing badly. In this case, we see that there are images that are not supposed to be here. We can also improve the model using data augmentation.

Assignment: grab some data of two or more different things, put them in different folders, and pass them through the three lines of code (the model). The best photos are the normal day-to-day photos people take. Note: satellite images, CT scans, microscope and pathology pictures won’t work for this code. (10 examples should do the trick.)

Image Classification Examples

Check out this image classifier that uses mouse movement images to detect fraudulent transactions.

Let us backtrack a little, what is Deep Learning?

Deep learning is a kind of machine learning, but it is not equal to machine learning. In machine learning we have to do most of the feature engineering ourselves; in deep learning we leave that to the neural network.

C-Path (Computational Pathologist) is an example of a traditional machine learning approach. Pathology slides of breast cancer biopsies were taken for the study. Then many pathologists consulted on ideas about what kinds of patterns or features might be associated with long-term survival. They then wrote specialist algorithms to calculate these features, ran them through logistic regression and predicted the survival rate. The algorithms outperformed pathologists. However, it took domain experts and computer experts many years of work to build.

Below are the properties that make the deep learning class of algorithms better:

Instead of building domain-specific functions with lots of feature engineering, you build an infinitely flexible function that can solve any problem, if you set the parameters of that function correctly. Then you establish a fast and scalable all-purpose way of setting the parameters of that function.

The underlying function that deep learning uses is called a neural network.

A neural net with one hidden layer

A neural network contains a number of linear layers interspersed with a number of non-linear layers. This gives us the universal approximation theorem, which says that this kind of function can solve any given problem to arbitrarily close accuracy, as long as you add enough parameters. This is what makes it an infinitely flexible function.
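
A minimal sketch of such a function in PyTorch (the layer sizes are arbitrary, just to show the linear/non-linear sandwich):

```python
import torch.nn as nn

# a neural net with one hidden layer: linear -> non-linearity -> linear
net = nn.Sequential(
    nn.Linear(2, 16),  # linear layer: 2 inputs -> 16 hidden units
    nn.ReLU(),         # non-linear layer
    nn.Linear(16, 1),  # linear layer: 16 hidden units -> 1 output
)
```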

To do all-purpose parameter fitting, we use gradient descent. For the different sets of parameters that we have, we need to ask how good they are at solving the problem at hand.

We then figure out better and better sets of parameters by following the loss function downwards, eventually ending up at a (local) minimum.

Finding the parameters has to be done in a reasonable amount of time, hence the use of GPUs.

As the diagram below shows, GPUs are roughly 10x faster than CPUs.

To make a neural net scalable and fast we need multiple hidden layers, hence the term “deep learning”.

Link to another example:

Diagnosing lung cancer

Things you could use deep learning for:

A key piece of the convolutional neural network is the convolution; this website has a visual explanation of the convolution. Also check out this interactive neural network book.

How to set the parameters

Let’s take a quadratic function and try to find the minimum.

You start by randomly picking a point x_n and calculating the gradient (the derivative) of your quadratic at that point. The gradient tells you which way is down and which way is up.

Take a small step in the downhill direction to create a new point x_{n+1}, and keep repeating the process. Each small step takes you closer to the minimum.

The size of this small step, denoted by ℓ, is called the learning rate.

With too high a step size you end up with divergence rather than convergence. If the learning rate is too small it will take too long to converge.

Divergence on the right, convergence on the left
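
A toy sketch of the whole procedure on a quadratic (the function, starting point and learning rate are arbitrary choices):

```python
def grad(x):
    return 2 * (x - 3)  # derivative of f(x) = (x - 3)**2, minimum at x = 3

x = 0.0   # x_0: a randomly picked starting point
lr = 0.1  # the learning rate
for _ in range(50):
    x = x - lr * grad(x)  # x_{n+1} = x_n - lr * f'(x_n)
print(x)  # close to 3; try lr = 1.1 to watch it diverge instead
```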

Combining linearity, non-linearity, convolution and gradient descent gives us a convolutional neural network.

Choosing a learning rate

learn.fit(0.01, 3)

0.01 is the learning rate. This is something you have to set.

The learning rate is how much you multiply your gradient by when taking each step of gradient descent. Setting the learning rate well is important so that you don’t diverge or take too long to converge. The learning rate affects model performance.

The method learn.lr_find() helps you find an optimal learning rate. It uses the technique developed in the 2015 paper Cyclical Learning Rates for Training Neural Networks, where we simply keep increasing the learning rate from a very small value, until the loss stops decreasing. We can plot the learning rate across batches to see what this looks like.

We first create a new learner, since we want to know how to set the learning rate for a new untrained model.
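
A sketch (fastai 0.7, same arch and data as before):

```python
learn = ConvLearner.pretrained(arch, data, precompute=True)  # fresh, untrained learner
lrf = learn.lr_find()  # raise the lr each mini-batch until the loss shoots up
```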

Our learn object contains an attribute sched that contains our learning rate scheduler, and has some convenient plotting functionality including this one:
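
For example:

```python
learn.sched.plot_lr()  # learning rate vs. iteration
```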

Note that in the previous plot, an iteration is one iteration (or mini-batch) of SGD (stochastic gradient descent). In one epoch there are (num_train_samples / batch_size) iterations of SGD.

Stochastic Gradient Descent — samples are selected randomly (or shuffled) instead of as a single group (as in standard gradient descent) or in the order they appear in the training set.

We can see the plot of loss versus learning rate to see where our loss stops decreasing:
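
Again a one-liner:

```python
learn.sched.plot()  # loss vs. learning rate
```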

Go back and look at the point where we saw the best improvement, then use that learning rate. The loss is still clearly improving at lr=1e-2 (0.01), so that’s what we use.

Note that the optimal learning rate can change as we train the model, so you may want to re-run this function from time to time.

Loss measures how accurate the model is: how far the prediction is from the target. We pick a point where the curve is still dropping quickly, not the minimum point, as the minimum may be where the learning rate has already jumped too far.

The learning rate finder does not run through 100% of an epoch; it stops as soon as it notices the loss getting worse, here at about 75%.

We then pick the learning rate where the loss is still clearly improving: in this case, the highest learning rate possible that is still improving.

Choosing the number of epochs

learn.fit(0.01, 3)

3 is the number of epochs.

An epoch is one complete presentation of the dataset to the model.

On each epoch we print the training loss, validation loss and accuracy.

You can run as many epochs as you like but the accuracy might start getting worse due to overfitting.

A few other considerations are the size of the model and the amount of data. If both are large, training will take a lot of time. Therefore you often choose the number of epochs based on the amount of time you have.

Tips and tricks for Jupyter notebooks

1. To check a function name you can’t remember, type the first letter and hit Tab once.

2. If you can’t remember what the arguments to a method are, hit Shift+Tab.

3. If you can’t remember what the function does, hit Shift+Tab twice. It brings up the documentation.

4. Typing ?? before a function brings up the source code.

5. Pressing H brings up the keyboard shortcuts. Try learning 4 or 5 of them a day and practice.

Always stop your Paperspace, Crestle or AWS instance after finishing your work; otherwise you will be charged.

Take your time and hang out in the forums.

Thanks for reading, I appreciate it! Follow @itsmuriuki.

Back to learning!
