Deep Learning 2: Part 1 Lesson 1

Jan 12, 2018

My personal notes from fast.ai course. These notes will continue to be updated and improved as I continue to review the course to “really” understand it. Much appreciation to Jeremy and Rachel who gave me this opportunity to learn.


Lesson 1

Getting started [0:00]:

In order to train a neural network, you will almost certainly need a Graphics Processing Unit (GPU), specifically an NVIDIA GPU, because it is the only kind that supports CUDA (the language and framework that nearly all deep learning libraries and practitioners use).
There are several ways to rent a GPU: Crestle [04:06], Paperspace [06:10]

Introduction to Jupyter Notebook and Dogs vs. Cats [12:39]

You can run a cell by selecting it and hitting shift+enter (you can hold down shift and hit enter multiple times to keep going down the cells), or you can click the Run button at the top. A cell can contain code, text, pictures, videos, etc.
Fast.ai requires Python 3

%reload_ext autoreload
%autoreload 2
%matplotlib inline

# This file contains all the main external libs we'll use
from fastai.imports import *

from fastai.transforms import *
from fastai.conv_learner import *
from fastai.model import *
from fastai.dataset import *
from fastai.sgdr import *
from fastai.plots import *

PATH = "data/dogscats/"
sz=224

First look at pictures [15:39]

!ls {PATH}
models	sample	test1  tmp  train  valid

! tells Jupyter to run the command in bash (shell) instead of Python
If you are not familiar with training sets and validation sets, check out the Practical Machine Learning class (or read Rachel’s blog)

!ls {PATH}valid
cats  dogs

files = !ls {PATH}valid/cats | head
files
['cat.10016.jpg',
 'cat.1001.jpg',
 'cat.10026.jpg',
 'cat.10048.jpg',
 'cat.10050.jpg',
 'cat.10064.jpg',
 'cat.10071.jpg',
 'cat.10091.jpg',
 'cat.10103.jpg',
 'cat.10104.jpg']

This folder structure is the most common way image classification datasets are shared and provided. Each folder name tells you the label (e.g. dogs or cats).

img = plt.imread(f'{PATH}valid/cats/{files[0]}')
plt.imshow(img);

f’{PATH}valid/cats/{files[0]}’ is a Python 3.6 format string (f-string), which is a convenient way to build a string from variables.
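To make the f-string concrete, here is a minimal sketch outside the notebook. PATH and files are filled in by hand with values copied from the output above, so nothing here touches the disk.

```python
# Values copied from the notebook output above; set by hand for illustration.
PATH = "data/dogscats/"
files = ["cat.10016.jpg", "cat.1001.jpg"]

# Expressions inside {} are evaluated and substituted into the string.
image_path = f"{PATH}valid/cats/{files[0]}"
print(image_path)  # data/dogscats/valid/cats/cat.10016.jpg

# The same result with str.format, which also works before Python 3.6.
same_path = "{}valid/cats/{}".format(PATH, files[0])
```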

img.shape
(198, 179, 3)

img[:4,:4]
array([[[ 29,  20,  23],
        [ 31,  22,  25],
        [ 34,  25,  28],
        [ 37,  28,  31]],

       [[ 60,  51,  54],
        [ 58,  49,  52],
        [ 56,  47,  50],
        [ 55,  46,  49]],

       [[ 93,  84,  87],
        [ 89,  80,  83],
        [ 85,  76,  79],
        [ 81,  72,  75]],

       [[104,  95,  98],
        [103,  94,  97],
        [102,  93,  96],
        [102,  93,  96]]], dtype=uint8)

img is a 3 dimensional array (a.k.a. rank 3 tensor)
The three items (e.g. [29, 20, 23]) represent Red, Green, and Blue pixel values between 0 and 255
The idea is to take these numbers and use them to predict whether those numbers represent a cat or a dog based on looking at lots of pictures of cats and dogs.
This dataset comes from a Kaggle competition, and when it was released (back in 2013) the state-of-the-art was 80% accurate.
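A small sketch of what "rank 3" means in practice, using NumPy. The array below is just the top-left 2x2 corner of the pixel values printed above, typed in by hand rather than loaded from the image file.

```python
import numpy as np

# Top-left 2x2 corner of the pixel values printed above:
# axis 0 = rows, axis 1 = columns, axis 2 = RGB channels.
img = np.array(
    [[[29, 20, 23], [31, 22, 25]],
     [[60, 51, 54], [58, 49, 52]]],
    dtype=np.uint8,
)

print(img.ndim)   # 3 axes, hence "rank 3"
print(img.shape)  # (2, 2, 3)

# One pixel is a length-3 vector of red, green, blue.
red, green, blue = img[0, 0]

# A single colour channel is a rank 2 array (a matrix).
red_channel = img[:, :, 0]
```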

Let’s train a model [20:21]

Here are the three lines of code necessary to train a model:

data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(resnet34, sz))
learn = ConvLearner.pretrained(resnet34, data, precompute=True)
learn.fit(0.01, 3)
[ 0.       0.04955  0.02605  0.98975]                         
[ 1.       0.03977  0.02916  0.99219]                         
[ 2.       0.03372  0.02929  0.98975]

This will do 3 epochs, which means it is going to look at the entire set of images three times.
Each row of the output has four numbers. The first (e.g. 0., 1.) is the epoch number.
The second and third are the value of the loss function (in this case the cross-entropy loss) for the training set and the validation set.
The last is the accuracy on the validation set.
We achieved ~99% (which would have won the Kaggle competition back in 2013) in 17 seconds with 3 lines of code! [21:49]
A lot of people assume that deep learning takes a huge amount of time, lots of resources, and lots of data — that, in general, is not true!
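For intuition about the loss and accuracy columns, here is a rough sketch of how cross-entropy loss and accuracy could be computed by hand. The probabilities and labels are made up, and the two helper functions are illustrative only, not the fast.ai implementation.

```python
import math

def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the correct class."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

def accuracy(probs, labels):
    """Fraction of rows whose highest-probability class matches the label."""
    hits = sum(1 for p, y in zip(probs, labels) if p.index(max(p)) == y)
    return hits / len(labels)

# Made-up predictions: each row is [pr(cat), pr(dog)]; 0 = cat, 1 = dog.
probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
labels = [0, 1, 1, 1]

loss = cross_entropy(probs, labels)
acc = accuracy(probs, labels)
print(loss, acc)
```

The third row is a confident mistake, so it contributes the largest share of the loss even though it only costs one example of accuracy. That is why loss and accuracy do not always move together.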

Fast.ai Library [22:24]

The library takes all of the best practices and approaches they can find. Each time a paper comes out that looks interesting, they test it out, and if it works well for a variety of datasets and they can figure out how to tune it, it gets implemented in the library.
Fast.ai curates all these best practices and packages them up for you, and most of the time, figures out the best way to handle things automatically.
Fast.ai sits on top of a library called PyTorch which is a really flexible deep learning, machine learning, and GPU computation library written by Facebook.
Most people are more familiar with TensorFlow than PyTorch, but most of the top researchers Jeremy knows nowadays have switched across to PyTorch.
Fast.ai is flexible enough that you can use all these curated best practices as much or as little as you want. It is easy to hook in at any point and write your own data augmentation, loss function, network architecture, etc., and we will learn all that in this course.

Analyzing results [24:21]

This is what the validation dataset label (think of it as the correct answers) looks like:

data.val_y
array([0, 0, 0, ..., 1, 1, 1])

What do these 0’s and 1’s represent?

data.classes
['cats', 'dogs']

data contains the validation and training data
learn contains the model

Let’s make predictions for the validation set (predictions are in log scale):

log_preds = learn.predict()
log_preds.shape
(2000, 2)

log_preds[:10]
array([[ -0.00002, -11.07446],
       [ -0.00138,  -6.58385],
       [ -0.00083,  -7.09025],
       [ -0.00029,  -8.13645],
       [ -0.00035,  -7.9663 ],
       [ -0.00029,  -8.15125],
       [ -0.00002, -10.82139],
       [ -0.00003, -10.33846],
       [ -0.00323,  -5.73731],
       [ -0.0001 ,  -9.21326]], dtype=float32)

Each row of the output holds the prediction for cats (first column) and the prediction for dogs (second column)

preds = np.argmax(log_preds, axis=1)  # from log probabilities to 0 or 1
probs = np.exp(log_preds[:,1])        # pr(dog)

In PyTorch and Fast.ai, most models return the log of the predictions rather than the probabilities themselves (we will learn why later in the course). For now, just know that to get probabilities, you have to do np.exp()

Make sure you familiarize yourself with numpy (np)
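As a quick check of the np.exp() step, here is a sketch that uses the first three rows of the log predictions printed above, typed in by hand. Because log is monotonic, taking argmax on the log values picks the same class as taking it on the probabilities.

```python
import numpy as np

# First three rows printed above: [log pr(cat), log pr(dog)].
log_preds = np.array([[-0.00002, -11.07446],
                      [-0.00138,  -6.58385],
                      [-0.00083,  -7.09025]])

probs_all = np.exp(log_preds)          # back to probabilities
preds = np.argmax(log_preds, axis=1)   # index of the larger column per row
pr_dog = probs_all[:, 1]

print(probs_all.sum(axis=1))  # each row sums to roughly 1
print(preds)                  # [0 0 0], i.e. all three predicted as cats
```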

# 1. A few correct labels at random
plot_val_with_title(rand_by_correct(True), "Correctly classified")

The number above the image is the probability of being a dog

# 2. A few incorrect labels at random
plot_val_with_title(rand_by_correct(False), "Incorrectly classified")

plot_val_with_title(most_by_correct(0, True), "Most correct cats")

plot_val_with_title(most_by_correct(1, True), "Most correct dogs")

More interestingly, here are the images the model thought were definitely dogs but turned out to be cats, and vice versa:

plot_val_with_title(most_by_correct(0, False), "Most incorrect cats")

plot_val_with_title(most_by_correct(1, False), "Most incorrect dogs")

most_uncertain = np.argsort(np.abs(probs -0.5))[:4]
plot_val_with_title(most_uncertain, "Most uncertain predictions")
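The most_uncertain line is worth unpacking. A probability near 0.5 means the model cannot decide, so sorting by distance from 0.5 puts the least confident images first. A sketch with made-up probabilities:

```python
import numpy as np

# Made-up dog probabilities for six validation images.
probs = np.array([0.99, 0.52, 0.03, 0.45, 0.70, 0.50])

# Distance from 0.5 measures confidence; argsort returns indices ordered
# from the smallest distance (least confident) to the largest.
distance = np.abs(probs - 0.5)
most_uncertain = np.argsort(distance)[:4]

print(most_uncertain)         # indices into the validation set
print(probs[most_uncertain])  # their probabilities, closest to 0.5 first
```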

Why is it important to look at these images? The first thing Jeremy does after he builds a model is to find a way to visualize what it has built. If he wants to make the model better, he needs to take advantage of the things it is doing well and fix the things it is doing badly.
In this case, we have learned something about the dataset itself which is that there are some images that are in here that probably should not be. But it is also clear that this model has room to improve (e.g. data augmentation — which we will learn later).
Now you are ready to build your own image classifier (for regular photos, maybe not CT scans)! For example, here is what one of the students did.
Check out this forum post for a different way of visualizing the results (e.g. when there are more than 2 categories, etc.)

Top-down vs Bottom-up [30:52]

Bottom-up: learn each building block you need, and eventually put them together

Hard to maintain motivation
Hard to know the “big picture”
Hard to know which pieces you’ll actually need

fast.ai: Get students using a neural net right away, getting results ASAP

Gradually peel back the layers, modify, look under the hood

Course Structure [33:53]

Image classifier with deep learning (with fewest lines of code)
Multi-label classification and different kinds of images (e.g. satellite images)
Structured data (e.g. sales forecasting): structured data is what comes from a database or spreadsheet
Language: NLP classifier (e.g. movie review classification)
Collaborative filtering (e.g. recommendation engine)
Generative language model: How to write your own Nietzsche philosophy from scratch character by character
Back to computer vision — not just recognize a cat photo, but find where the cat is in the photo (heat map) and also learn how to write our own architecture from scratch (ResNet)

Image Classifier Examples:

Image classification algorithms are useful for lots and lots of things.

For example, AlphaGo [42:20] looked at thousands and thousands of go boards, each labeled with whether that board ended up belonging to the winning or the losing player. So it learned an image classifier that could look at a go board and figure out whether it was a good or a bad one, which is the most important step in playing go well: knowing which move is better.
Another example: an earlier student created an image classifier of mouse movement images and used it to detect fraudulent transactions.

Deep Learning ≠ Machine Learning [44:26]

Deep learning is a kind of machine learning
Machine learning was invented by Arthur Samuel. In the late 50s, he got an IBM mainframe to play checkers better than he could by inventing machine learning. He had the mainframe play against itself lots of times and figure out which kinds of things led to victories, and used that to, in a way, write its own program. In 1962, Arthur Samuel said that one day, the vast majority of computer software would be written using this machine learning approach rather than written by hand.
C-Path (Computational Pathologist) [45:42] is an example of the traditional machine learning approach. Its authors took pathology slides of breast cancer biopsies and consulted many pathologists on ideas about what kinds of patterns or features might be associated with long-term survival. Then they wrote specialist algorithms to calculate these features, ran them through logistic regression, and predicted the survival rate. It outperformed pathologists, but it took domain experts and computer experts many years of work to build.

A better way [47:35]

A class of algorithms that has these three properties (an infinitely flexible function, all-purpose parameter fitting, and being fast and scalable) is deep learning.

Infinitely flexible function: Neural Network [48:43]

The underlying function that deep learning uses is called a neural network:

All you need to know for now is that it consists of a number of simple linear layers interspersed with a number of simple non-linear layers. When you intersperse these layers, the universal approximation theorem applies: this kind of function can approximate the solution to any given problem to arbitrarily close accuracy, as long as you add enough parameters.
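To make "linear layers interspersed with non-linear layers" concrete, here is a tiny hand-written network in plain Python. The weights are picked by hand (not trained), and ReLU is used as the non-linearity purely as an assumption for this sketch.

```python
def linear(x, weights, bias):
    """Linear layer: one weighted sum of the inputs per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(x):
    """Element-wise non-linearity: negative values become zero."""
    return [max(0.0, v) for v in x]

# Hand-picked weights: 2 inputs -> 2 hidden units -> 1 output.
w1, b1 = [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]
w2, b2 = [[1.0, 1.0]], [0.0]

def network(x):
    """linear -> non-linear -> linear, i.e. one hidden layer."""
    return linear(relu(linear(x, w1, b1)), w2, b2)[0]

# With these weights the network computes |x0 - x1|, a shape that
# no single linear layer can represent on its own.
print(network([3.0, 1.0]))
print(network([1.0, 3.0]))
```

Without the relu call in the middle, the two linear layers would collapse into a single linear function, which is why the non-linear layers matter.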

All purpose parameter fitting: Gradient Descent [49:39]

Fast and scalable: GPU [51:05]

The neural network example shown above has one hidden layer. Something we have learned in the past few years is that this kind of neural network is not fast or scalable unless we add multiple hidden layers, hence “deep” learning.

Putting all together [53:40]

Here are some of the examples:

Diagnosing lung cancer [56:55]

Other current applications:

Convolutional Neural Network [59:13]

Linear Layer

http://setosa.io/ev/image-kernels/
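Assuming the image-kernel demo at the link above is the operation meant here, a minimal sketch of it in plain Python looks like this. The image and kernel values are made up, and apply_kernel is a hypothetical helper, not a library function.

```python
def apply_kernel(image, kernel):
    """Slide a square kernel over a 2D image (no padding) and return
    the weighted sum of the covered pixels at each position."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [[sum(kernel[i][j] * image[r + i][c + j]
                 for i in range(k) for j in range(k))
             for c in range(out_w)]
            for r in range(out_h)]

# Made-up 4x4 grayscale image: dark top half, bright bottom half.
image = [[0, 0, 0, 0],
         [0, 0, 0, 0],
         [9, 9, 9, 9],
         [9, 9, 9, 9]]

# A kernel that responds where pixels get brighter from top to bottom.
bottom_edge = [[-1, -1, -1],
               [ 0,  0,  0],
               [ 1,  1,  1]]

edges = apply_kernel(image, bottom_edge)
flat = apply_kernel([[5, 5, 5], [5, 5, 5], [5, 5, 5]], bottom_edge)
print(edges)  # large values: the kernel found the dark-to-bright edge
print(flat)   # zero: a uniform patch has no edge
```

Each output value is just a weighted sum of nine inputs, which is what makes this a linear operation.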

Nonlinear Layer [01:02:12]

Neural networks and deep learning

In this chapter I give a simple and mostly visual explanation of the universality theorem. We'll go step by step…

neuralnetworksanddeeplearning.com

A combination of a linear layer followed by an element-wise nonlinear function allows us to create arbitrarily complex shapes; this is the essence of the universal approximation theorem.

How to set these parameters to solve problems [01:04:25]

Stochastic Gradient Descent: we take small steps down the hill. The step size is called the learning rate

If learning rate is too large, it will diverge instead of converge
If learning rate is too small, it will take forever
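The effect of the step size can be seen on a toy problem. The sketch below uses plain (not stochastic) gradient descent on f(x) = x**2, chosen only because its gradient 2*x is easy to write down; the three learning rates are arbitrary.

```python
def gradient_descent(lr, steps=20, start=5.0):
    """Minimise f(x) = x**2 by repeatedly stepping against the gradient 2*x."""
    x = start
    for _ in range(steps):
        x = x - lr * 2 * x
    return x

good = gradient_descent(lr=0.1)          # shrinks steadily towards the minimum at 0
too_large = gradient_descent(lr=1.1)     # overshoots further on every step
too_small = gradient_descent(lr=0.0001)  # barely moves in 20 steps

print(good, too_large, too_small)
```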

Visualizing and Understanding Convolutional Networks [01:08:27]

We started with something incredibly simple, but if we use it at a big enough scale, thanks to the universal approximation theorem and the use of multiple hidden layers in deep learning, we actually get very rich capabilities. This is actually what we used when we trained our dog vs. cat recognizer.

Dog vs. Cat Revisited — Choosing a learning rate [01:11:41]

learn.fit(0.01, 3)

The first number 0.01 is the learning rate.
The learning rate determines how quickly or how slowly you want to update the weights (or parameters). The learning rate is one of the most difficult parameters to set, because it significantly affects model performance.
The method learn.lr_find() helps you find an optimal learning rate. It uses the technique developed in the 2015 paper Cyclical Learning Rates for Training Neural Networks, where we simply keep increasing the learning rate from a very small value, until the loss stops decreasing. We can plot the learning rate across batches to see what this looks like.

learn = ConvLearner.pretrained(resnet34, data, precompute=True)
learn.lr_find()

Our learn object contains an attribute sched that contains our learning rate scheduler, and has some convenient plotting functionality including this one:

learn.sched.plot_lr()

Jeremy is currently experimenting with increasing the learning rate exponentially vs. linearly.

We can see the plot of loss versus learning rate to see where our loss stops decreasing:

learn.sched.plot()

We then pick the learning rate where the loss is still clearly improving — in this case 1e-2 (0.01)
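The idea behind the learning rate finder can be sketched on a toy loss. This is not the fast.ai code: the loss (w - 3)**2, the growth factor, and the "stop when the loss exceeds 4x the best seen" rule are all assumptions made for this illustration, and the function only shares its name with learn.lr_find().

```python
def lr_find(start_lr=1e-5, growth=1.3, max_steps=200):
    """Toy learning rate range test on the loss (w - 3)**2.

    Returns the (learning rate, loss) pairs recorded before the loss blew up.
    """
    w, lr = 0.0, start_lr
    best = float("inf")
    history = []
    for _ in range(max_steps):
        loss = (w - 3.0) ** 2
        if loss > 4 * best:          # loss has clearly started to get worse
            break
        best = min(best, loss)
        history.append((lr, loss))
        w -= lr * 2 * (w - 3.0)      # one gradient step at the current rate
        lr *= growth                 # raise the rate for the next "batch"
    return history

history = lr_find()
lrs = [lr for lr, _ in history]
losses = [loss for _, loss in history]
print(len(history), lrs[0], lrs[-1])
```

Plotting losses against lrs on a log axis gives the same kind of shape as learn.sched.plot(): flat while the rate is tiny, then falling, then rising once the rate is too large.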

Choosing number of epochs [1:18:49]

[ 0.       0.04955  0.02605  0.98975]                         
[ 1.       0.03977  0.02916  0.99219]                         
[ 2.       0.03372  0.02929  0.98975]

You can run as many epochs as you would like, but accuracy might start getting worse if you run for too long. This is called “overfitting” and we will learn more about it later.
Another consideration is the time available to you.
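One simple way to read the rows above is to look for the epoch with the best validation accuracy and to flag epochs where training loss fell while validation loss rose. This is a sketch only; with just three epochs the differences are small and may be noise.

```python
# The rows printed above: (epoch, training loss, validation loss, accuracy).
rows = [
    (0, 0.04955, 0.02605, 0.98975),
    (1, 0.03977, 0.02916, 0.99219),
    (2, 0.03372, 0.02929, 0.98975),
]

# Epoch with the highest validation accuracy.
best_epoch = max(rows, key=lambda r: r[3])[0]

# A common warning sign: training loss keeps falling while validation loss rises.
flagged = [cur[0] for prev, cur in zip(rows, rows[1:])
           if cur[1] < prev[1] and cur[2] > prev[2]]

print(best_epoch, flagged)
```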