Deep Learning 2: Part 1 Lesson 3

Hiromi Suenaga
19 min read · Jan 15, 2018


My personal notes from the fast.ai course. These notes will continue to be updated and improved as I continue to review the course to “really” understand it. Much appreciation to Jeremy and Rachel who gave me this opportunity to learn.

Lessons: 1234567891011121314

Lesson 3

Helpful materials created by students:

Where we go from here:

Review [08:24]:

Kaggle CLI : How to download data 1:

Kaggle CLI is a good tool to use when you are downloading from Kaggle. Because it is downloading data from Kaggle website (through screen scraping), it breaks when the website changes. When that happens, run pip install kaggle-cli --upgrade.

Then you can run:

$ kg download -u <username> -p <password> -c <competition>

Replace <username> and <password> with your credentials; <competition> is what follows /c/ in the competition URL. For example, to download the dog breed data, the command would look like:

$ kg download -u john.doe -p mypassword -c dog-breed-identification

Make sure you have clicked the Download button from your computer once and accepted the rules.

CurWget (Chrome extension): How to download data 2:

Quick Dogs vs. Cats [13:39]

from fastai.conv_learner import * 
PATH = 'data/dogscats/'
sz=224; bs=64

Often the notebook assumes that your data is in data folder. But maybe you want to put them somewhere else. In that case, you can use symbolic link (symlink for short):
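As a minimal sketch of the symlink idea, the snippet below creates a `data` link inside a notebook folder that points at a directory elsewhere on disk. All paths are hypothetical and live in a temporary directory so the example is safe to run; it assumes a Unix-like system (on the command line, `ln -s <target> data` does the same thing).

```python
import os
import tempfile

# Hypothetical layout, created under a temporary directory for illustration.
workdir = tempfile.mkdtemp()
real_root = os.path.join(workdir, "big_disk")            # where the data really lives
os.makedirs(os.path.join(real_root, "dogscats"))

notebook_dir = os.path.join(workdir, "notebooks")        # where the notebook runs
os.makedirs(notebook_dir)

# Create notebooks/data -> big_disk, so 'data/dogscats/' resolves from the notebook folder.
link = os.path.join(notebook_dir, "data")
os.symlink(real_root, link)

print(os.path.islink(link))
print(os.path.isdir(os.path.join(link, "dogscats")))
```

The notebook can then keep `PATH = 'data/dogscats/'` unchanged while the files sit wherever is convenient.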

Here is an end to end process to get a state of the art result for dogs vs. cats:

Quick Dogs v Cats

A little further analysis:

data = ImageClassifierData.from_paths(PATH, tfms= tfms, bs=bs, test_name='test')
  • from_paths : Indicates that subfolder names are the labels. If your train folder or valid folder has a different name, you can send trn_name and val_name argument.
  • test_name : If you want to submit to Kaggle competition, you will need to fill in the name of the folder where the test set is.
learn = ConvLearner.pretrained(resnet50, data)
  • Notice that we did not set precompute=True. It is just a shortcut which caches some of the intermediate steps that do not have to be recalculated each time. If you are at all confused about it, you can just leave it off.
  • Remember, when precompute=True, data augmentation does not work.
learn.unfreeze()
learn.bn_freeze(True)
%time learn.fit([1e-5, 1e-4, 1e-2], 1, cycle_len=1)
  • bn_freeze : If you are using a bigger, deeper model like ResNet50 or ResNext101 (anything with a number bigger than 34) on a dataset that is very similar to ImageNet (i.e. side-on photos of standard objects whose size is similar to ImageNet, between 200 and 500 pixels), you should add this line. We will learn more in the second half of the course, but it causes the batch normalization moving averages to not be updated.

How to use other libraries — Keras [20:02]

It is important to understand how to use libraries other than fastai. Keras is a good example to look at because, just as fastai sits on top of PyTorch, it sits on top of a variety of libraries such as TensorFlow, MXNet, CNTK, etc.

If you want to run the notebook, run pip install tensorflow-gpu keras

  1. Define data generators
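from keras.preprocessing.image import ImageDataGenerator

train_data_dir = f'{PATH}train'
validation_data_dir = f'{PATH}valid'
batch_size = 64  # same value as bs above

# Training generator: rescale pixels to [0, 1] and apply data augmentation
train_datagen = ImageDataGenerator(rescale=1. / 255,
    shear_range=0.2, zoom_range=0.2, horizontal_flip=True)
# Validation generator: rescale only, no augmentation
test_datagen = ImageDataGenerator(rescale=1. / 255)
train_generator = train_datagen.flow_from_directory(train_data_dir,
    target_size=(sz, sz),
    batch_size=batch_size, class_mode='binary')
validation_generator = test_datagen.flow_from_directory(validation_data_dir,
    shuffle=False,
    target_size=(sz, sz),
    batch_size=batch_size, class_mode='binary')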
  • The idea of train folder and validation folder with subfolders with the label names is commonly done, and Keras also does it.
  • Keras requires much more code and many more parameters to be set.
  • Rather than creating a single data object, in Keras you define a DataGenerator and specify what kind of data augmentation you want it to do and also what kind of normalization to do. In other words, in fastai we can just say “whatever ResNet50 requires, just do that for me please”, but in Keras you need to know what is expected. There is no standard set of augmentations.
  • You have to then create a validation data generator in which you are responsible to create a generator that does not have data augmentation. And you also have to tell it not to shuffle the dataset for validation because otherwise you cannot keep track of how well you are doing.

2. Create a model

base_model = ResNet50(weights='imagenet', include_top=False)
x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dense(1024, activation='relu')(x)
predictions = Dense(1, activation='sigmoid')(x)
  • The reason Jeremy used ResNet50 for Quick Dogs and Cats is that Keras does not have ResNet34. We want to compare apples to apples.
  • You cannot ask it to construct a model that is suitable for a particular dataset, so you have to do it by hand.
  • First you create a base model, then you construct layers you want to add on top of it.

3. Freeze layers and compile

model = Model(inputs=base_model.input, outputs=predictions)
for layer in base_model.layers: layer.trainable = False
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
  • Loop through layers and freeze them manually by calling layer.trainable=False
  • You need to compile a model
  • Pass the type of optimizer, loss, and metrics

4. Fit

model.fit_generator(train_generator, train_generator.n//batch_size,
epochs=3, workers=4, validation_data=validation_generator,
validation_steps=validation_generator.n // batch_size)
  • Keras expects to know how many batches there are per epoch.
  • workers : how many processors to use

5. Fine-tune: Unfreeze some layers, compile, then fit again

split_at = 140
for layer in model.layers[:split_at]: layer.trainable = False
for layer in model.layers[split_at:]: layer.trainable = True
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])

%%time
model.fit_generator(train_generator,
    train_generator.n // batch_size, epochs=1, workers=3,
    validation_data=validation_generator,
    validation_steps=validation_generator.n // batch_size)

PyTorch: If you want to deploy to mobile devices, PyTorch is still very early.

TensorFlow: If you want to convert things you learned in this class to TensorFlow, do more work with Keras; it would take a bit more work and it is hard to get the same level of results. Maybe there will be a TensorFlow compatible version of fastai in the future. We will see.

Create Submission file for Kaggle [32:45]

To create the submission files, we need two pieces of information:

  • data.classes : contains all the different classes
  • data.test_ds.fnames : test file names
log_preds, y = learn.TTA(is_test=True)
probs = np.exp(log_preds)

It is always a good idea to use TTA:

  • is_test=True : it will give you predictions on the test set rather than the validation set
  • By default, PyTorch models will give you back the log of the predictions, so you need to do np.exp(log_preds) to get the probability.
ds = pd.DataFrame(probs)
ds.columns = data.classes
  • Create Pandas DataFrame
  • Set the column name as data.classes
ds.insert(0, 'id', [o[5:-4] for o in data.test_ds.fnames])
  • Insert a new column at position zero named id. Remove first 5 and last 4 letters since we just need IDs (a file name looks like test/0042d6bf3e5f3700865886db32689436.jpg)
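To make the slicing concrete, here is a tiny sketch using the example file name quoted above: dropping the first 5 characters removes `test/` and dropping the last 4 removes `.jpg`.

```python
fname = 'test/0042d6bf3e5f3700865886db32689436.jpg'

# [5:-4] skips 'test/' at the start and '.jpg' at the end
image_id = fname[5:-4]
print(image_id)
```

This only works when the folder prefix and the extension have exactly these lengths, so a differently named test folder would need different offsets.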
SUBM = f'{PATH}sub/' 
os.makedirs(SUBM, exist_ok=True)
ds.to_csv(f'{SUBM}subm.gz', compression='gzip', index=False)
  • Now you can call ds.to_csv to create a CSV file and compression='gzip' will zip it up on the server.
  • You can use Kaggle CLI to submit from the server directly, or you can use FileLink which will give you a link to download the file from the server to your computer.

Individual prediction [39:32]

What if we want to run a single image through a model to get a prediction?

fn = data.val_ds.fnames[0]; fn
'train/001513dfcb2ffafc82cccf4d8bbaba97.jpg'
Image.open(PATH + fn)
  • We will pick a first file from the validation set.

This is the shortest way to get a prediction:

trn_tfms, val_tfms = tfms_from_model(arch, sz)
im = val_tfms(Image.open(PATH + fn))
preds = learn.predict_array(im[None])
  • Image must be transformed. tfms_from_model returns training transforms and validation transforms. In this case, we will use validation transform.
  • Everything that gets passed to or returned from a model is generally assumed to be in a mini-batch. Here we only have one image, but we have to turn that into a mini-batch of a single image. In other words, we need to create a tensor that is not just [rows, columns, channels] , but [number of images, rows, columns, channels].
  • im[None] : Numpy trick to add additional unit axis to the start.
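The effect of `im[None]` is easy to check in isolation. In this small sketch the array is a placeholder of zeros standing in for one transformed image; its shape (channels first, 224 by 224) is an assumption for illustration.

```python
import numpy as np

# Placeholder for a single transformed image (shape is an illustrative assumption)
im = np.zeros((3, 224, 224), dtype=np.float32)

# Indexing with None (same as np.newaxis) adds a unit axis at the start,
# turning one image into a mini-batch that contains one image
batch = im[None]

print(im.shape, batch.shape)
```

No data is copied; the result is a view of the same array with an extra leading dimension.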

Theory: What is actually going on behind the scenes with convolutional neural network [42:17]

  • We saw a little bit of theory in Lesson 1.
  • Convolution is something where we have a little matrix (nearly always 3x3 in deep learning), multiply every element of that matrix by the corresponding element of a 3x3 section of an image, and add them all together to get the result of that convolution at one point.
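The description above can be sketched in a few lines of NumPy. The 5x5 "image" and the kernel values are made up for illustration, and the loop version is written for clarity rather than speed.

```python
import numpy as np

image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image"
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])                 # made-up 3x3 filter

def conv_at(image, kernel, row, col):
    """Convolution result for the 3x3 patch whose top-left corner is (row, col)."""
    patch = image[row:row + 3, col:col + 3]
    return (patch * kernel).sum()                  # elementwise multiply, then add up

def conv2d_valid(image, kernel):
    """Slide the kernel over every position where it fits entirely inside the image."""
    h, w = image.shape
    out = np.empty((h - 2, w - 2))
    for r in range(h - 2):
        for c in range(w - 2):
            out[r, c] = conv_at(image, kernel, r, c)
    return out

result = conv2d_valid(image, kernel)
print(result)
```

Each output value is one "result of that convolution at one point"; a 5x5 input with a 3x3 kernel gives a 3x3 output.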

Otavio’s fantastic visualization (he created Word Lens):

Jeremy’s visualization: Spreadsheet [49:51]

  • This data is from MNIST
  • Activation: A number that is calculated by applying some kind of linear operation to some numbers in the input.
  • Rectified Linear Unit (ReLU): Throw away negative — i.e. MAX(0, x)
  • Filter/Kernel: A 3x3 slice of a 3D tensor you used for convolution
  • Tensor: Multidimensional array or matrix
  • Hidden Layer: A layer that is neither input nor output
  • Max pooling: A (2,2) max pooling will halve the resolution in both height and width — think of it as a summary
  • Fully connected layer: Give a weight to each and every single activation and calculate the sum product. Weight matrix is as big as the entire input.
  • Note: There are many things you can do after the max pooling layer. One of them is to do another max pool across the entire size. In older architectures or structured data, we do fully connected layer. Architecture that make heavy use of fully connected layers are prone to overfitting and are slower. ResNet and ResNext do not use very large fully connected layers.
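A few of the terms above (ReLU, 2x2 max pooling, and a fully connected layer as a sum product) can be sketched with NumPy. The activation values and weights are made up, and the pooling helper assumes the height and width are even.

```python
import numpy as np

def relu(x):
    # Throw away negatives: MAX(0, x)
    return np.maximum(0, x)

def max_pool_2x2(x):
    # Group the array into 2x2 blocks and keep the largest value in each block
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

acts = np.array([[ 1., -2.,  3.,  0.],
                 [-1.,  5., -3.,  2.],
                 [ 0., -4.,  6., -6.],
                 [ 7., -1., -2.,  1.]])   # made-up activations

rectified = relu(acts)
pooled = max_pool_2x2(rectified)          # resolution halves in both directions

# Fully connected layer producing one activation: one weight per input, then sum product
weights = np.full(pooled.shape, 0.5)      # made-up weights
fc_out = (pooled * weights).sum()

print(pooled)
print(fc_out)
```

The weight matrix has the same shape as the input it multiplies, which is why fully connected layers over large inputs carry so many parameters.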

Question: What happens if the input had 3 channels? [1:05:30] It will look something similar to the Conv1 layer which has 2 channels, so each filter has 2 channels. Pre-trained ImageNet models use 3 channels. If you have fewer than 3 channels, you can either duplicate one of the channels to make 3 or, if you have 2, take their average and use it as the third channel. If you have 4 channels, you could add an extra level to the convolutional kernel with all zeros.

What happens next? [1:08:47]

We have gotten as far as the fully connected layer (it does a classic matrix product). In the Excel sheet, there is one activation. If we want to work out which of the ten digits the input is, we actually want to calculate 10 numbers.

Let’s look at an example where we are trying to predict whether a picture is a cat, a dog, or a plane, or fish, or a building. Our goal is:

  1. Take output from the fully connected layer (no ReLU so there may be negatives)
  2. Calculate 5 numbers where each of them is between 0 and 1 and they add up to 1.

To do this, we need a different kind of activation function (a function applied to an activation).

Why do we need non-linearity? If you stack multiple linear layers, the result is still just a linear layer. By adding non-linear layers, we can fit arbitrarily complex shapes. The non-linear activation function we used was ReLU.
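The claim about stacked linear layers can be checked numerically. In this sketch the weight matrices are random and bias terms are left out for brevity; both are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))     # 4 inputs with 3 features each
W1 = rng.normal(size=(3, 5))    # first linear layer
W2 = rng.normal(size=(5, 2))    # second linear layer

# Two linear layers in a row ...
two_linear = (x @ W1) @ W2
# ... are exactly one linear layer whose weight matrix is W1 @ W2
one_linear = x @ (W1 @ W2)

# Putting a ReLU in between breaks that equivalence
with_relu = np.maximum(0, x @ W1) @ W2

print(np.allclose(two_linear, one_linear))
print(np.allclose(with_relu, two_linear))
```

Without the non-linearity, extra layers add parameters but no extra expressive power.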

Softmax [01:14:08]

Softmax only ever occurs in the final layer. It outputs numbers between 0 and 1, and they add up to 1. In theory, this is not strictly necessary: we could ask our neural net to learn a set of kernels which give probabilities that line up as closely as possible with what we want. In general with deep learning, if you can construct your architecture so that the desired characteristics are as easy to express as possible, you will end up with better models (they learn more quickly and with fewer parameters).

  1. Get rid of negatives by e^x because we cannot have negative probabilities. It also accentuates the value difference (2.85 : 4.08 → 17.25 : 59.03)

All the math that you need to be familiar with to do deep learning:

2. We then add up the exp column (182.75), and divide the e^x by the sum. The result will always be positive since we divided positive by positive. Each number will be between 0 and 1, and the total will be 1.
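The two steps above translate directly into NumPy. The five input values are made up (two of them echo the numbers quoted in step 1), so the printed probabilities will not match the spreadsheet exactly.

```python
import numpy as np

def softmax(x):
    exps = np.exp(x)           # step 1: e^x removes negatives and accentuates differences
    return exps / exps.sum()   # step 2: divide by the sum so everything adds up to 1

# Made-up outputs of the fully connected layer for 5 classes
outputs = np.array([-1.2, 2.85, 4.08, 0.3, -0.5])
probs = softmax(outputs)

print(probs)
print(probs.sum())
```

Production implementations usually subtract the maximum input before exponentiating to avoid overflow; that changes nothing mathematically and is omitted here to stay close to the two steps in the text.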

Question: What kind of activation function do we use if we want to classify the picture as cat and dog? [1:20:27] It so happens that we are going to do that right now. One reason we might want to do that is to do multi-label classification.

Planet Competition [01:20:54]

Notebook / Kaggle page

I would definitely recommend anthropomorphizing your activation functions. They have personalities. [1:22:21]

Softmax does not like predicting multiple things. It wants to pick one thing. The fastai library will automatically switch into multi-label mode if there is more than one label, so you do not have to do anything. But here is what happens behind the scenes:

from planet import f2

metrics = [f2]
f_model = resnet34
label_csv = f'{PATH}train_v2.csv'
n = len(list(open(label_csv)))-1  # number of rows, not counting the header
val_idxs = get_cv_idxs(n)

def get_data(sz):
    tfms = tfms_from_model(f_model, sz,
        aug_tfms=transforms_top_down, max_zoom=1.05)
    return ImageClassifierData.from_csv(PATH, 'train-jpg',
        label_csv, tfms=tfms, suffix='.jpg',
        val_idxs=val_idxs, test_name='test-jpg')

data = get_data(256)
  • Multi-label classification cannot be done with Keras style approach where subfolder is the name of the label. So we use from_csv
  • transforms_top_down : it does more than just a vertical flip. There are 8 possible symmetries for a square: it can be rotated through 0, 90, 180, 270 degrees and for each of those, it can be flipped (dihedral group of eight)
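The eight symmetries can be enumerated with NumPy. This is only a sketch of the idea, not the library's implementation; the helper name `dihedral` and the 3x3 test array are made up for illustration.

```python
import numpy as np

def dihedral(img, k):
    """k in 0..7: rotate by 90 * (k % 4) degrees, then flip left-right if k >= 4."""
    out = np.rot90(img, k % 4)
    return np.fliplr(out) if k >= 4 else out

img = np.arange(9).reshape(3, 3)          # toy image with no symmetry of its own
variants = [dihedral(img, k) for k in range(8)]

# All eight results are different arrangements of the same pixels
unique = {v.tobytes() for v in variants}
print(len(unique))
```

For satellite imagery there is no natural "up", so any of these eight versions is a plausible training image.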
x,y = next(iter(data.val_dl))
  • We had seen data.val_ds , test_ds, train_ds(ds: dataset) for which you can get an individual image by data.train_ds[0], for example.
  • dl is a data loader which will give you a mini-batch, specifically transformed mini-batch. With a data loader, you cannot ask for a particular mini-batch; you can only get back the next mini-batch. In Python, it is called “generator” or “iterator”. PyTorch really leverages modern Python methodologies.
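The "you can only ask for the next mini-batch" behaviour is what a plain Python generator gives you. The function below is a hypothetical stand-in for a data loader, with integers in place of images.

```python
def mini_batches(items, bs):
    """Yield consecutive mini-batches of size bs (the last one may be smaller)."""
    for start in range(0, len(items), bs):
        yield items[start:start + bs]

dl = mini_batches(list(range(10)), bs=4)

first = next(iter(dl))   # same pattern as next(iter(data.val_dl))
second = next(dl)        # there is no dl[1]; we can only move forward

print(first, second)
```

A generator keeps its position between calls, which is why the second `next` continues where the first one stopped.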

If you know Python well, PyTorch comes very naturally. If you don’t know Python well, PyTorch is a good reason to learn Python well.

  • x : a mini-batch of images, y : a mini-batch of labels.

If you are ever unsure what arguments a function takes, hit shift+tab.

list(zip(data.classes, y[0]))

[('agriculture', 1.0),
('artisinal_mine', 0.0),
('bare_ground', 0.0),
('blooming', 0.0),
('blow_down', 0.0),
('clear', 1.0),
('cloudy', 0.0),
('conventional_mine', 0.0),
('cultivation', 0.0),
('habitation', 0.0),
('haze', 0.0),
('partly_cloudy', 0.0),
('primary', 1.0),
('road', 0.0),
('selective_logging', 0.0),
('slash_burn', 1.0),
('water', 1.0)]

Behind the scenes, PyTorch and fastai are turning our labels into one-hot-encoded labels. If the actual label is dog, it will look like:

We take the difference between actuals and softmax , add them up to say how much error there is (i.e. loss function) [1:31:02].

One-hot-encoding is terribly inefficient for storing, so we will store an index value (single integer) rather than 0’s and 1’s for the target value (y) [1:31:21]. If you look at the y values for the dog breeds competition, you won’t actually see a big list of 1’s and 0’s, but you will see a single integer. Internally, PyTorch is converting the index to a one-hot-encoded vector (even though you will literally never see it). PyTorch has different loss functions for targets that are one-hot encoded and targets that are not, but these details are hidden by the library so you do not have to worry about it. The cool thing to realize is that we are doing exactly the same thing for both single-label classification and multi-label classification.
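As a small illustration of the relationship between stored indices and one-hot vectors, the sketch below expands integer labels into rows of 0's and 1's. The class list reuses the five categories from the softmax example, and the label values are made up.

```python
import numpy as np

classes = ['cat', 'dog', 'plane', 'fish', 'building']

# What is stored: one integer per image
y_idx = np.array([1, 0, 3])              # dog, cat, fish

# What it means: a row of the identity matrix for each index
one_hot = np.eye(len(classes))[y_idx]

print(one_hot)
```

Storing three integers instead of fifteen floats is the efficiency gain the paragraph refers to, and it grows with the number of classes.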

Question: Does it make sense to change the base of log for softmax? [01:32:55] No, changing the base is just a linear scaling, which a neural net can learn easily.
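One way to see why the base does not matter: since 2^x = e^(x · ln 2), a softmax in base 2 is the same as the usual softmax applied to inputs multiplied by the constant ln 2. The sketch below checks this numerically; the helper `softmax_base` and the input values are made up for illustration.

```python
import numpy as np

def softmax_base(x, base):
    """Softmax computed with an arbitrary base instead of e."""
    p = np.power(base, x)
    return p / p.sum()

x = np.array([2.85, 4.08, -1.0])   # made-up activations

base2 = softmax_base(x, 2.0)
# Same result from the ordinary base-e softmax on linearly rescaled inputs
base_e_scaled = softmax_base(x * np.log(2.0), np.e)

print(np.allclose(base2, base_e_scaled))
```

Because the layer before the softmax can absorb that constant factor into its weights, choosing a different base gives the network nothing new.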

  • *1.4 : The image was washed out, so making it more visible (“brightening it up a bit”). Images are just matrices of numbers, so we can do things like this.
  • It is good to experiment with images like this because these images are not at all like ImageNet. The vast majority of things you do involving convolutional neural nets will not actually be anything like ImageNet (medical imaging, classifying different kinds of steel tube, satellite images, etc.)
sz = 64
data = get_data(sz)
data = data.resize(int(sz*1.3), 'tmp')
  • We will not use sz=64 for cats and dogs competition because we started with pre-trained ImageNet network which starts off nearly perfect. If we re-trained the whole set with 64 by 64 images, we would destroy the weights that are already very good. Remember, most of ImageNet models are trained with 224 by 224 or 299 by 299 images.
  • There are no images in ImageNet that look like the one above, and only the first couple of layers are useful to us. So starting out with smaller images works well in this case.
learn = ConvLearner.pretrained(f_model, data, metrics=metrics)
lrf = learn.lr_find()
lr = 0.2
learn.fit(lr, 3, cycle_len=1, cycle_mult=2)
[ 0. 0.14882 0.13552 0.87878]
[ 1. 0.14237 0.13048 0.88251]
[ 2. 0.13675 0.12779 0.88796]
[ 3. 0.13528 0.12834 0.88419]
[ 4. 0.13428 0.12581 0.88879]
[ 5. 0.13237 0.12361 0.89141]
[ 6. 0.13179 0.12472 0.8896 ]
lrs = np.array([lr/9, lr/3, lr])
learn.unfreeze()
learn.fit(lrs, 3, cycle_len=1, cycle_mult=2)
[ 0. 0.12534 0.10926 0.90892]
[ 1. 0.12035 0.10086 0.91635]
[ 2. 0.11001 0.09792 0.91894]
[ 3. 0.1144 0.09972 0.91748]
[ 4. 0.11055 0.09617 0.92016]
[ 5. 0.10348 0.0935 0.92267]
[ 6. 0.10502 0.09345 0.92281]
  • [lr/9, lr/3, lr] — this is because the images are unlike ImageNet image and earlier layers are probably not as close to what they need to be.
sz = 128
learn.set_data(get_data(sz))
learn.freeze()
learn.fit(lr, 3, cycle_len=1, cycle_mult=2)
[ 0. 0.09729 0.09375 0.91885]
[ 1. 0.10118 0.09243 0.92075]
[ 2. 0.09805 0.09143 0.92235]
[ 3. 0.09834 0.09134 0.92263]
[ 4. 0.096 0.09046 0.9231 ]
[ 5. 0.09584 0.09035 0.92403]
[ 6. 0.09262 0.09059 0.92358]
learn.unfreeze()
learn.fit(lrs, 3, cycle_len=1, cycle_mult=2)
learn.save(f'{sz}')
[ 0. 0.09623 0.08693 0.92696]
[ 1. 0.09371 0.08621 0.92887]
[ 2. 0.08919 0.08296 0.93113]
[ 3. 0.09221 0.08579 0.92709]
[ 4. 0.08994 0.08575 0.92862]
[ 5. 0.08729 0.08248 0.93108]
[ 6. 0.08218 0.08315 0.92971]
sz = 256
learn.set_data(get_data(sz))
learn.freeze()
learn.fit(lr, 3, cycle_len=1, cycle_mult=2)
[ 0. 0.09161 0.08651 0.92712]
[ 1. 0.08933 0.08665 0.92677]
[ 2. 0.09125 0.08584 0.92719]
[ 3. 0.08732 0.08532 0.92812]
[ 4. 0.08736 0.08479 0.92854]
[ 5. 0.08807 0.08471 0.92835]
[ 6. 0.08942 0.08448 0.9289 ]
learn.unfreeze()
learn.fit(lrs, 3, cycle_len=1, cycle_mult=2)
learn.save(f'{sz}')
[ 0. 0.08932 0.08218 0.9324 ]
[ 1. 0.08654 0.08195 0.93313]
[ 2. 0.08468 0.08024 0.93391]
[ 3. 0.08596 0.08141 0.93287]
[ 4. 0.08211 0.08152 0.93401]
[ 5. 0.07971 0.08001 0.93377]
[ 6. 0.07928 0.0792 0.93554]
log_preds,y = learn.TTA()
preds = np.mean(np.exp(log_preds),0)

A couple of people have asked what this does [01:38:46]:

data = data.resize(int(sz*1.3), 'tmp')

When we specify what transforms to apply, we send a size:

tfms = tfms_from_model(f_model, sz,
aug_tfms=transforms_top_down, max_zoom=1.05)

One of the things the data loader does is resize the images on demand. This has nothing to do with data.resize. If the initial image is 1000 by 1000, reading that JPEG and resizing it to 64 by 64 takes more time than training the convolutional net. data.resize tells it that we will not use images bigger than sz*1.3, so it goes through once and creates new JPEGs of that size. Since images are rectangular, the new JPEGs are ones whose smallest edge is sz*1.3 (center-cropped). It will save you a lot of time.


Instead of accuracy, we used F-beta for this notebook. It is a way of weighing false negatives and false positives. The reason we are using it is that this particular Kaggle competition wants to use it. Take a look at planet.py to see how you can create your own metrics function. The metric is the last number printed in each row of the training output, e.g. [ 0. 0.08932 0.08218 0.9324 ]
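To give a feel for what such a metric computes, here is a minimal NumPy sketch of F-beta averaged over samples. It is not the code from planet.py: the function name, the fixed threshold of 0.2, and the small `eps` guard against division by zero are all assumptions made for illustration. It uses the standard definition F_beta = (1 + beta^2) · P · R / (beta^2 · P + R), where P is precision and R is recall.

```python
import numpy as np

def fbeta(preds, targs, beta=2.0, threshold=0.2, eps=1e-9):
    """Per-sample F-beta for multi-label predictions, averaged over the batch."""
    pred_pos = preds > threshold                    # labels we predict as present
    true_pos = targs > 0.5                          # labels that really are present
    tp = (pred_pos & true_pos).sum(axis=1)          # correctly predicted labels
    precision = tp / (pred_pos.sum(axis=1) + eps)
    recall = tp / (true_pos.sum(axis=1) + eps)
    b2 = beta ** 2
    scores = (1 + b2) * precision * recall / (b2 * precision + recall + eps)
    return scores.mean()

# One image, three labels: we predict labels 0 and 2, only label 0 is correct
preds = np.array([[0.9, 0.1, 0.8]])
targs = np.array([[1.0, 0.0, 0.0]])
score = fbeta(preds, targs)
print(score)
```

With beta = 2, recall counts more than precision, so a missed label (false negative) costs more than a spurious one (false positive).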

Activation function for multi-label classification [01:44:25]

Activation function for multi-label classification is called sigmoid.
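The difference from softmax is easiest to see side by side. In this sketch the three final-layer values are made up; sigmoid squashes each one to a value between 0 and 1 independently, whereas softmax forces them to compete for a total of 1.

```python
import numpy as np

def sigmoid(x):
    # e^x / (1 + e^x), written in the equivalent form 1 / (1 + e^-x)
    return 1 / (1 + np.exp(-x))

def softmax(x):
    e = np.exp(x)
    return e / e.sum()

acts = np.array([3.0, 2.5, -4.0])   # made-up final-layer activations

sig = sigmoid(acts)    # each label judged on its own
soft = softmax(acts)   # labels compete with each other

print(sig)
print(soft)
```

With sigmoid, the first two labels can both come out as likely at the same time, which is exactly what multi-label classification needs.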

Question: Why don’t we start training with differential learning rate rather than training the last layers alone? [01:50:30]

You can skip training just the last layer and go straight to differential learning rates, but you probably do not want to. Convolutional layers all contain pre-trained weights, so they are not random — for things that are close to ImageNet, they are really good; for things that are not close to ImageNet, they are better than nothing. All of our fully connected layers, however, are totally random. Therefore, you would always want to make the fully connected weights better than random by training them a bit first. Otherwise if you go straight to unfreeze, then you are actually going to be fiddling around with those early layer weights when the later ones are still random — which is probably not what you want.

Question: When you use the differential learning rates, do those three learning rates spread evenly across the layers? [01:55:35] We will talk more about this later in the course, but in the fastai library there is a concept of “layer groups”. In something like ResNet50, there are hundreds of layers and you probably do not want to write hundreds of learning rates, so the library decides for you how to split them, and the last one always refers to just the fully connected layers that we have randomly initialized and added.

Visualizing the layers [01:56:42]

OrderedDict([('input_shape', [-1, 3, 64, 64]),
             ('output_shape', [-1, 64, 32, 32]),
             ('trainable', False),
             ('nb_params', 9408)]),
OrderedDict([('input_shape', [-1, 64, 32, 32]),
             ('output_shape', [-1, 64, 32, 32]),
             ('trainable', False),
             ('nb_params', 128)]),
OrderedDict([('input_shape', [-1, 64, 32, 32]),
             ('output_shape', [-1, 64, 32, 32]),
             ('nb_params', 0)]),
OrderedDict([('input_shape', [-1, 64, 32, 32]),
             ('output_shape', [-1, 64, 16, 16]),
             ('nb_params', 0)]),
OrderedDict([('input_shape', [-1, 64, 16, 16]),
             ('output_shape', [-1, 64, 16, 16]),
             ('trainable', False),
             ('nb_params', 36864)])

  • ‘input_shape’, [-1, 3, 64, 64]: PyTorch lists the channels before the image size. Some GPU computations run faster in that order. The reordering is done behind the scenes by the transformation step.
  • -1 : indicates however big the batch size is. Keras uses None .
  • ‘output_shape’, [-1, 64, 32, 32] — 64 is the number of kernels
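The `nb_params` values in the summary can be reproduced with simple arithmetic. The sketch below assumes the standard ResNet layout (a 7x7 first convolution, 3x3 convolutions afterwards, no bias terms in the convolutions, and a scale plus a shift per channel in batch normalization); those details are not stated in the output above.

```python
def conv_params(in_channels, out_channels, kernel_size, bias=False):
    """Weights in a 2D convolution: one k x k x in_channels filter per output channel."""
    n = out_channels * in_channels * kernel_size * kernel_size
    return n + (out_channels if bias else 0)

first_conv = conv_params(3, 64, 7)    # 3 input channels -> 64 kernels of 7x7
batchnorm = 2 * 64                    # one scale and one shift per channel
later_conv = conv_params(64, 64, 3)   # 64 channels -> 64 kernels of 3x3

print(first_conv, batchnorm, later_conv)
```

The layers with `nb_params` of 0 in the summary (activation and pooling) have nothing to learn, which is consistent with this count.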

Question: The learning rate finder for a very small dataset returned a strange number and the plot was empty [01:58:57]. The learning rate finder goes through one mini-batch at a time. If you have a tiny dataset, there are just not enough mini-batches. So the trick is to make your batch size very small, like 4 or 8.

Structured Data [01:59:48]

There are two types of dataset we use in machine learning:

  • Unstructured — Audio, images, natural language text where all of the things inside an object are all the same kind of things — pixels, amplitude of waveform, or words.
  • Structured — Profit and loss statement, information about a Facebook user where each column is structurally quite different. “Structured” refers to columnar data as you might find in a database or a spreadsheet where different columns represent different kinds of things, and each row represents an observation.

Structured data is often ignored in academia because it is pretty hard to get published in fancy conference proceedings if you have a better logistics model. But it is the thing that makes the world go round, makes everybody money, and drives efficiency. We will not ignore it because we are doing practical deep learning, and Kaggle does not either because people put prize money up on Kaggle to solve real-world problems.

Rossmann Store Sale [02:02:42]


from fastai.structured import *
from fastai.column_data import *
np.set_printoptions(threshold=50, edgeitems=20)

  • fastai.structured : not PyTorch specific, and also used in the machine learning course for random forests with no PyTorch at all. It can be used on its own without any of the other parts of the fastai library.
  • fastai.column_data : allows us to do fastai and PyTorch stuff with columnar structured data.
  • For structured data you need to use Pandas a lot. Pandas is an attempt to replicate R’s data frames in Python (if you are not familiar with Pandas, here is a good book: Python for Data Analysis, 2nd Edition)

There is a lot of data pre-processing. This notebook contains the entire pipeline from the third place winner (Entity Embeddings of Categorical Variables). Data processing is not covered in this course, but it is covered in the machine learning course in some detail because feature engineering is very important.

Looking at CSV files

table_names = ['train', 'store', 'store_states', 'state_names', 
'googletrend', 'weather', 'test']
tables = [pd.read_csv(f'{PATH}{fname}.csv', low_memory=False) for fname in table_names]
for t in tables: display(t.head())
  • StoreType — you often get datasets where some columns contain “code”. It really does not matter what the code means. Stay away from learning too much about it and see what the data says first.

Joining tables

This is a relational dataset, and you have to join quite a few tables together, which is easy to do with Pandas’ merge:

def join_df(left, right, left_on, right_on=None, suffix='_y'):
    if right_on is None: right_on = left_on
    return left.merge(right, how='left', left_on=left_on,
                      right_on=right_on, suffixes=("", suffix))
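A quick usage sketch shows what the left join does. The two toy tables are made up for illustration (they only borrow column names from the dataset), and the function is repeated so the snippet runs on its own.

```python
import pandas as pd

def join_df(left, right, left_on, right_on=None, suffix='_y'):
    if right_on is None: right_on = left_on
    return left.merge(right, how='left', left_on=left_on,
                      right_on=right_on, suffixes=("", suffix))

# Made-up tables: store 3 has sales but no entry in the store table
sales = pd.DataFrame({'Store': [1, 2, 3], 'Sales': [100, 200, 300]})
stores = pd.DataFrame({'Store': [1, 2], 'StoreType': ['a', 'c']})

joined = join_df(sales, stores, 'Store')
print(joined)

# Rows from the left table are always kept; unmatched ones get NaN
print(joined['StoreType'].isnull().sum())
```

Because `how='left'` never drops rows from the left table, counting nulls in a joined column is a cheap way to check whether every row found a match.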

From the fastai library:

add_datepart(train, "Date", drop=False)
  • Take a date and pull out a bunch of columns such as “day of week”, “start of a quarter”, “month of year” and so on and add them all to the dataset.
  • Duration section will calculate things like how long until the next holiday, how long it has been since the last holiday, etc.
  • to_feather : Saves a Pandas’ data frame into a “feather” format which takes it as it sits in RAM and dumps it to the disk. So it is really really fast. Ecuadorian grocery competition has 350 million records, so you will care about how long it takes to save.
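The date expansion described above can be approximated with plain Pandas. This is a rough sketch of the idea, not the implementation of add_datepart; the new column names and the two example dates are assumptions chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['2015-07-31', '2015-10-01'])})

# Pull several features out of the single date column
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Dayofweek'] = df['Date'].dt.dayofweek            # Monday is 0
df['Is_quarter_start'] = df['Date'].dt.is_quarter_start

print(df)
```

Each derived column gives the model a direct handle on a pattern (weekly cycles, quarter boundaries) that would be hard to learn from a raw timestamp.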

Next week

  • split columns into two types: categorical and continuous. Categorical column will be represented as one hot encoding, and continuous column gets fed into fully connected layer as is.
  • categorical: store #1 and store #2 are not numerically related to each other. Similarly, day of week Monday (day 0) and Tuesday (day 1).
  • continuous: Things like distance in kilometers to the nearest competitor is a number we treat numerically.
  • ColumnarModelData
