Deep Learning 2: Part 2 Lesson 8

Hiromi Suenaga
Mar 25, 2018 · 26 min read

My personal notes from the course. These notes will continue to be updated and improved as I continue to review the course to “really” understand it. Much appreciation to Jeremy and Rachel who gave me this opportunity to learn.


Object Detection

Forum / Video / Notebook / Slides

1. Differentiable layer [02:11]

Yann LeCun has been promoting the idea that we should call this not “deep learning” but “differentiable programming”. All we did in part 1 was really about setting up a differentiable function and a loss function that describes how good the parameters are, then pressing go and letting it work. If you can configure a loss function that scores how well something is doing your task, and you have a reasonably flexible neural network architecture, you are done.

Yeah, Differentiable Programming is little more than a rebranding of the modern collection Deep Learning techniques, the same way Deep Learning was a rebranding of the modern incarnations of neural nets with more than two layers.

The important point is that people are now building a new kind of software by assembling networks of parameterized functional blocks and by training them from examples using some form of gradient-based optimization….It’s really very much like a regular program, except it’s parameterized, automatically differentiated, and trainable/optimizable.

- Yann LeCun, Director of FAIR

2. Transfer Learning [03:23]

Transfer learning is the most important single thing to be able to do to use deep learning effectively. You almost never would want to or need to start with random weights unless nobody had ever trained a model on a vaguely similar set of data with an even remotely connected kind of problem to solve as what you are doing — which almost never happens. Fastai library focuses on transfer learning which makes it different from other libraries. The basic idea of transfer learning is:

  • Given a network that does thing A, remove the last layer.
  • Replace it with a few random layers at the end
  • Fine-tune those layers to do thing B while taking advantage of the features that the original network learned
  • Then optionally fine tune the whole thing end-to-end and you now have something which probably uses orders of magnitude less data, is more accurate, and trains a lot faster.

3. Architecture design [05:17]

There is a pretty small range of architectures that generally works pretty well quite a lot of the time. We have been focusing on using CNN’s for generally fixed size ordered data, RNN’s for sequences that have some kind of state. We also fiddled around a tiny bit with activation functions — softmax if you have a single categorical outcome, or sigmoid if you have multiple outcomes. Some of the architecture design we will be studying in part 2 gets more interesting. Particularly this first session about object detection. But on the whole, we probably spend less time talking about architecture design than most courses or papers because it is generally not the hard bit.

4. Handling over-fitting [06:26]

The way Jeremy likes to build a model:

  • Create something that is definitely terribly over-parameterized which will massively overfit for sure, train it and make sure it does overfit. At that point, you’ve got a model that is capable of reflecting the training set. Then it is as simple as doing these things to reduce that overfitting.

If you don’t start with something that is overfitting, you are lost. So you start with something overfitting and to make it overfit less you can:

  • add more data
  • add more data augmentation
  • do things like more batch norm layers, dense nets, or various things that can handle less data.
  • add regularization like weight decay and dropout
  • finally (this is often the thing people do first, but it should be the thing you do last), reduce the complexity of your architecture: have fewer layers or fewer activations.

5. Embeddings [07:46]

We have talked quite a bit about embeddings, both for NLP and for the general idea that any kind of categorical data is something you can now model with neural nets. Just earlier this year, there were almost no examples of using tabular data in deep learning, but using neural nets for time series and tabular data analysis is becoming a more and more popular approach.

Part 1 really was all about introducing best practices in deep learning. We saw techniques which were mature enough that they definitely work reasonably reliably for practical real-world problems. Jeremy had researched and tuned them over quite a long period of time, came up with a sequence of steps, architectures, etc., and put them into the fastai library in a way that lets us do that quickly and easily.

Part 2 is cutting edge deep learning for coders, which means Jeremy often does not know the exact best parameters, architecture details, and so forth to solve a particular problem. We do not necessarily know if it’s going to solve a problem well enough to be practically useful. It almost certainly won’t be integrated well enough into fastai or any other library that you can just press a few buttons and it will start working. Jeremy is not going to teach it unless he is very confident that it either is now, or soon will be, a very practically useful technique. But it will often require a lot of tweaking and experimenting to get it to work on your particular problem, because we don’t know the details well enough to know how to make it work for every dataset or every example.

This means rather than Fastai and PyTorch being obscure black boxes which you just know these recipes for, you are going to learn the details of them well enough that you can customize them exactly the way you want, you can debug them, you can read the source code of them to see what’s happening. If you are not confident of object-oriented Python, then that is something you want to focus on studying during this course as we will not cover it in the class. But Jeremy will introduce some tools that he thinks are particularly helpful like the Python debugger, how to use your editor to jump through the code. In general, there will be a lot more detailed and specific code walkthroughs, coding technique discussions, as well as more detailed walkthroughs of papers.

Beware of sample code [13:20]! In code that academics have put up to go along with papers, or example code somebody else has written on GitHub, Jeremy nearly always finds some massive critical flaw, so be careful about taking code from online resources and be ready to do some debugging.

How to use notebooks [14:17]

Building your own box [16:50]
Reading papers [21:37]

Each week, we will be implementing a paper or two. On the left is an extract from the paper that implements adam (you have also seen adam as a single excel formula on a spreadsheet). In academic papers, people love to use Greek letters. They also hate to refactor, so you will often see a page long formula where when you look at it carefully you’ll realize the same sub equation appears 8 times. Academic papers are a bit weird, but in the end, it’s the way that the research community communicates their findings so we need to learn to read them. A great thing to do is to take a paper, put in the effort to understand it, then write a blog where you explain it in code and normal English. Lots of people who do that end up getting quite a following, end up getting some pretty great job offers and so forth because it is such a useful skill to be able to show that you can understand these papers, implement them in code, and explain them in English. It is very hard to read or understand something you cannot vocalize. So learn Greek letters!

More opportunities [25:29]
Part 2’s Topics [30:12]

Generative Models

In part 1, the output of our neural network was generally a number or a category, whereas the outputs of a lot of the things in part 2 are going to be a whole lot of things, like:

  • the top left and bottom right location of every object in an image along with what that object is
  • a complete picture with the class of every single pixel in that picture
  • an enhanced super resolution version of the input image
  • the entire original input paragraph translated into French

Vast majority of the data we will be looking at will be either text or image data.

We will be looking at some larger datasets, both in terms of the number of objects in the dataset and the size of each of those objects. For those of you who are working with limited computational resources, please don’t let that put you off. Feel free to replace it with something smaller and simpler. Jeremy actually wrote a large amount of the course with no internet (in Point Leo) on a 15-inch Surface Book. Pretty much all of this course works well on Windows on a laptop. You can always use smaller batch sizes or a cut-down version of the dataset. But if you have the resources, you will get better results with bigger datasets when they are available.

Two main differences from what we are used to:

1. We have multiple things that we are classifying.

This is not unheard of: we did that with the planet satellite data in part 1.

2. Bounding boxes around what we are classifying.

A bounding box has a very specific definition which is it’s a rectangle and the rectangle has the object entirely fitting within it but it is no bigger than it has to be.

Our job will be to take data that has been labeled in this way and, for data that is unlabeled, to generate the classes of the objects and a bounding box for each of them. One thing to note is that labeling this kind of data is generally more expensive [37:09]. For object detection datasets, annotators are given a list of object classes and asked to find every single instance of each class in a picture, along with where it is. In this case, why isn’t there a tree or jump labeled? Because for this particular dataset, they were not among the classes that annotators were asked to find, and are therefore not part of this particular problem.

We will get there in three stages:

  1. Classify the largest object in each image.
  2. Find the location of the largest object in each image.
  3. Finally, try to do both at the same time (i.e. label what it is and where it is for the largest object in the picture).

%matplotlib inline
%reload_ext autoreload
%autoreload 2
from fastai.conv_learner import *
from fastai.dataset import *
from pathlib import Path
import json
from PIL import ImageDraw, ImageFont
from matplotlib import patches, patheffects
# torch.cuda.set_device(1)

You may find a line torch.cuda.set_device(1) left behind which will give you an error if you only have one GPU. This is how you select a GPU when you have multiple, so just set it to zero or take out the line entirely.

There are a number of standard object detection datasets, just as ImageNet is a standard object classification dataset [41:12]. The classic ImageNet equivalent is Pascal VOC.

We will be looking at the Pascal VOC dataset. The download is quite slow, so you may prefer to download from this mirror. There are two different competition/research datasets, from 2007 and 2012. We’ll be using the 2007 version. You can use the larger 2012 version for better results, or even combine them [42:25] (but be careful to avoid data leakage between the validation sets if you do this).

Unlike previous lessons, we are using the python 3 standard library pathlib for our paths and file access. Note that it returns an OS-specific class (on Linux, PosixPath) so your output may look a little different [44:50]. Most libraries that take paths as input can take a pathlib object - although some (like cv2) can't, in which case you can use str() to convert it to a string.

Pathlib Cheat Sheet

PATH = Path('data/pascal')

A little bit about generator [43:23]:

A generator is something in Python 3 which you can iterate over:

  • for i in PATH.iterdir(): print(i)
  • [i for i in PATH.iterdir()] (list comprehension)
  • list(PATH.iterdir()) (turn a generator into a list)

The reason that things generally return generators is that if the directory had 10 million items in it, you don’t necessarily want a list 10 million items long. A generator lets you do things “lazily”.
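To make “lazily” concrete, here is a minimal sketch using a made-up generator function (numbered_items and the produced list are hypothetical names for illustration, not part of the course code). It stands in for a huge directory listing and records which items have actually been computed:

```python
from itertools import islice

produced = []  # records which items the generator has actually computed


def numbered_items(n):
    """Hypothetical stand-in for a very large directory listing."""
    for i in range(n):
        produced.append(i)
        yield f"item_{i}"


gen = numbered_items(10_000_000)     # returns immediately; no items exist yet
first_three = list(islice(gen, 3))   # only now are three items computed
```

Only the three requested items are ever produced, even though the generator could in principle yield ten million of them.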

As well as the images, there are also annotations: bounding boxes showing where each object is. These were hand labeled. The original versions were in XML [47:59], which is a little hard to work with nowadays, so we use the more recent JSON version, which you can download from this link.

You can see here how pathlib includes the ability to open files (amongst many other capabilities).

trn_j = json.load((PATH/'pascal_train2007.json').open())
trn_j.keys()
dict_keys(['images', 'type', 'annotations', 'categories'])

Here / is not “divided by” but a path slash [45:55]. PATH/ gets you children in that path. PATH/'pascal_train2007.json' returns a pathlib object which has an open method. This JSON file contains not the images but the bounding boxes and the classes of the objects.
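As a small self-contained sketch of the same pattern (the temporary directory and the file name toy_annotations.json are made up for illustration), the path slash, the open method, and str() conversion look like this:

```python
import json
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    ann_file = root / "toy_annotations.json"   # `/` joins path components
    ann_file.write_text(
        json.dumps({"images": [], "annotations": [], "categories": []})
    )

    with ann_file.open() as f:                 # Path objects can open themselves
        toy = json.load(f)

    keys = sorted(toy.keys())
    as_string = str(ann_file)                  # for libraries that need a plain string
```

The str() call at the end is the escape hatch mentioned earlier for libraries that cannot take a pathlib object.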

IMAGES,ANNOTATIONS,CATEGORIES = ['images', 'annotations', 'categories']
trn_j[IMAGES][:5]
[{'file_name': '000012.jpg', 'height': 333, 'id': 12, 'width': 500},
{'file_name': '000017.jpg', 'height': 364, 'id': 17, 'width': 480},
{'file_name': '000023.jpg', 'height': 500, 'id': 23, 'width': 334},
{'file_name': '000026.jpg', 'height': 333, 'id': 26, 'width': 500},
{'file_name': '000032.jpg', 'height': 281, 'id': 32, 'width': 500}]

  • bbox : column, row (of top left), width, height
  • image_id : you’d have to join this up with trn_j[IMAGES] (above) to find file_name etc.
  • category_id : see trn_j[CATEGORIES] (below)
  • segmentation : polygon segmentation (we will be using them)
  • ignore : we will ignore the ignore flags
  • iscrowd : specifies that it is a crowd of that object, not just one of them

trn_j[ANNOTATIONS][:2]
[{'area': 34104,
'bbox': [155, 96, 196, 174],
'category_id': 7,
'id': 1,
'ignore': 0,
'image_id': 12,
'iscrowd': 0,
'segmentation': [[155, 96, 155, 270, 351, 270, 351, 96]]},
{'area': 13110,
'bbox': [184, 61, 95, 138],
'category_id': 15,
'id': 2,
'ignore': 0,
'image_id': 17,
'iscrowd': 0,
'segmentation': [[184, 61, 184, 199, 279, 199, 279, 61]]}]
trn_j[CATEGORIES][:4]
[{'id': 1, 'name': 'aeroplane', 'supercategory': 'none'},
{'id': 2, 'name': 'bicycle', 'supercategory': 'none'},
{'id': 3, 'name': 'bird', 'supercategory': 'none'},
{'id': 4, 'name': 'boat', 'supercategory': 'none'}]
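
A quick way to check how the bbox field is laid out is to compare it with the segmentation polygon of the same annotation. The sketch below uses the first annotation printed above; polygon_to_bbox is a made-up helper name, and it assumes the polygon is stored as a flat list x0, y0, x1, y1, ... (an assumption, but one that is consistent with the numbers shown).

```python
def polygon_to_bbox(flat_polygon):
    """Tightest axis-aligned box around a flat [x0, y0, x1, y1, ...] polygon.

    Returns [x_min, y_min, width, height].
    """
    xs = flat_polygon[0::2]
    ys = flat_polygon[1::2]
    return [min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)]


# Values copied from the first annotation shown above.
segmentation = [155, 96, 155, 270, 351, 270, 351, 96]
bbox = [155, 96, 196, 174]

computed = polygon_to_bbox(segmentation)
```

The computed box matches the stored bbox, so the third value is the horizontal extent (width) and the fourth is the vertical extent (height).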

It’s helpful to use constants instead of strings, since we get tab-completion and don’t mistype.

FILE_NAME,ID,IMG_ID,CAT_ID,BBOX = 'file_name','id','image_id','category_id','bbox'
cats = dict((o[ID], o['name']) for o in trn_j[CATEGORIES])
trn_fns = dict((o[ID], o[FILE_NAME]) for o in trn_j[IMAGES])
trn_ids = [o[ID] for o in trn_j[IMAGES]]
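
These lookup dictionaries are what let you “join” an annotation to its file name and category name. A minimal sketch on a hand-made miniature of the JSON structure (the toy dict is invented for illustration; its values are copied from outputs shown in these notes):

```python
toy = {
    "images": [
        {"file_name": "000012.jpg", "id": 12},
        {"file_name": "000017.jpg", "id": 17},
    ],
    "annotations": [
        {"image_id": 12, "category_id": 7},
        {"image_id": 17, "category_id": 15},
    ],
    "categories": [
        {"id": 7, "name": "car"},
        {"id": 15, "name": "person"},
    ],
}

# Same idea as cats and trn_fns in the text: id -> name, id -> file name.
toy_cats = {o["id"]: o["name"] for o in toy["categories"]}
toy_fns = {o["id"]: o["file_name"] for o in toy["images"]}

# Resolve each annotation to (file name, category name) via the two lookups.
resolved = [
    (toy_fns[a["image_id"]], toy_cats[a["category_id"]])
    for a in toy["annotations"]
]
```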

Side Note: What people most comment on when they see Jeremy working in real time having seen his classes [51:21]:

“Wow, you actually don’t know what you are doing, do you”. 99% of the things he does don’t work, and the small percentage of things that do work end up here. He mentioned this because machine learning, and particularly deep learning, is incredibly frustrating [51:45]. In theory, you just define the correct loss function and a flexible enough architecture, press train, and you are done. But if that were actually all it took, then nothing would take any time. The problem is that at every step along the way, until it works, it doesn’t work: it goes straight to infinity, crashes with an incorrect tensor size, etc. He will endeavor to show you some debugging techniques as we go, but it is one of the hardest things to teach. The main thing it requires is tenacity. The difference between the people who are super effective and the ones who do not seem to go very far has never been about intellect. It’s always been about sticking with it, basically never giving up. It’s particularly important with this kind of deep learning because you don’t get a continuous reward cycle [53:04]. It’s a constant stream of doesn’t work, doesn’t work, doesn’t work, until eventually it does, so it’s kind of annoying.


Each image has a unique ID.

im0_d = trn_j[IMAGES][0]
im0_d[FILE_NAME],im0_d[ID]
('000012.jpg', 12)

A defaultdict is useful any time you want to have a default dictionary entry for new keys [55:05]. If you try and access a key that doesn’t exist, it magically makes itself exist and it sets itself equal to the return value of the function you specify (in this case lambda:[]).
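
A minimal sketch of that behavior next to a plain dict (the keys and values here are made up for illustration):

```python
import collections

anno = collections.defaultdict(lambda: [])   # same factory as in the text
anno[12].append("first box")                 # key 12 did not exist; it is created as []
anno[12].append("second box")
missing = anno[99]                           # merely reading a key also creates it

plain = {}
try:
    plain[12].append("first box")            # a plain dict raises instead
    raised = False
except KeyError:
    raised = True
```

Note that just reading anno[99] leaves an empty list behind under that key, which is worth remembering if you later count the keys.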

Here we create a dict from image IDs to a list of annotations (tuple of bounding box and class id).

We convert VOC’s height/width into top-left/bottom-right, and switch x/y coords to be consistent with numpy. If given datasets are in crappy formats, take a couple of moments to make things consistent and make them the way you want them to be [1:01:24]

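trn_anno = collections.defaultdict(lambda:[])
for o in trn_j[ANNOTATIONS]:
    if not o['ignore']:
        bb = o[BBOX]
        # [x, y, width, height] -> [top, left, bottom, right]
        bb = np.array([bb[1], bb[0], bb[3]+bb[1]-1, bb[2]+bb[0]-1])
        trn_anno[o['image_id']].append((bb, o['category_id']))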


Variable naming, coding style philosophy, etc [56:15-59:33]

example 1

  • [ 96, 155, 269, 350] : a bounding box [59:53]. As you see above, when we created the bounding box, we did a couple of things. The first is that we switched the x and y coordinates. The reason is that in the computer vision world, when you say “my screen is 640 by 480”, that is width by height, whereas in the math world, when you say “my array is 640 by 480”, it is rows by columns. So the Pillow image library tends to do things in width by height, or columns by rows, and numpy is the opposite way around. The second is that we describe the box by its top-left and bottom-right coordinates rather than by x, y, width, height.
  • 7 : class label / category
im_a = trn_anno[im0_d[ID]]; im_a
[(array([ 96, 155, 269, 350]), 7)]
im0_a = im_a[0]; im0_a
(array([ 96, 155, 269, 350]), 7)
cats[7]
'car'

example 2

trn_anno[17]
[(array([61, 184, 198, 278]), 15), (array([77, 89, 335, 402]), 13)]
cats[15],cats[13]
('person', 'horse')

Some libs take VOC format bounding boxes, so this lets us convert back when required [1:02:23]:

def bb_hw(a): return np.array([a[1],a[0],a[3]-a[1],a[2]-a[0]])
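
To see what the two conversions do to actual numbers, here is a sketch on plain Python lists rather than numpy arrays (to keep it dependency-free). hw_to_corners and bb_hw_exact are made-up names for illustration; bb_hw mirrors the definition above. One thing the example makes visible: the forward conversion subtracts 1, and bb_hw as written does not add it back, so the width and height come back one smaller than they started.

```python
def hw_to_corners(bb):
    """[x, y, width, height] -> [top, left, bottom, right], as in the loop above."""
    return [bb[1], bb[0], bb[3] + bb[1] - 1, bb[2] + bb[0] - 1]


def bb_hw(a):
    """[top, left, bottom, right] -> [x, y, width, height], as defined in the text."""
    return [a[1], a[0], a[3] - a[1], a[2] - a[0]]


def bb_hw_exact(a):
    """Hypothetical variant that adds the 1 back so the round trip is exact."""
    return [a[1], a[0], a[3] - a[1] + 1, a[2] - a[0] + 1]


voc = [155, 96, 196, 174]          # bbox of the first annotation
corners = hw_to_corners(voc)       # [96, 155, 269, 350]
back = bb_hw(corners)              # [155, 96, 195, 173]
back_exact = bb_hw_exact(corners)  # [155, 96, 196, 174]
```

For drawing rectangles a one-pixel difference is invisible, but it matters if you convert back and forth repeatedly or compare against the original annotations.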

We will use fastai’s open_image in order to display it:

im = open_image(IMG_PATH/im0_d[FILE_NAME])

You can use Visual Studio Code (vscode — open source editor that comes with recent versions of Anaconda, or can be installed separately), or most editors and IDEs, to find out all about the open_image function. vscode things to know:

  • Command palette (Ctrl-shift-p)
  • Select interpreter (for fastai env)
  • Select terminal shell
  • Go to symbol (Ctrl-t)
  • Find references (Shift-F12)
  • Go to definition (F12)
  • Go back (alt-left)
  • View documentation
  • Hide sidebar (Ctrl-b)
  • Zen mode (Ctrl-k,z)

If you are using PyCharm Professional Edition on Mac like I am:

  • Command palette (Shift-command-a)
  • Select interpreter (for fastai env) (Shift-command-a and then look for “interpreter”)
  • Select terminal shell (Option-F12 )
  • Go to symbol (Option-command-shift-n and type name of the class, function, etc. If it’s in camelcase or underscore separated, you can type in first few letters of each bit)
  • Find references (Option-F7), next occurrence (Option-command-⬇︎), previous occurrence (Option-command-⬆︎)
  • Go to definition (Command-b)
  • Go back (Option-command-⬅︎)
  • View documentation
  • Zen mode (Control-`-4–2 or search for “distraction free mode”)

Fastai uses OpenCV. TorchVision uses PyTorch tensors for data augmentations etc. A lot of people use Pillow PIL. Jeremy did a lot of testing of all of these and he found OpenCV was about 5 to 10 times faster than TorchVision. For the planet satellite image competition [1:11:55], TorchVision was so slow that they could only get 25% GPU utilization because they were doing a lot of data augmentation. Profiler showed that it was all in TorchVision.

Pillow is quite a bit faster, but it is not as fast as OpenCV and also is not nearly as thread-safe [1:12:19]. Python has this thing called the global interpreter lock (GIL), which means that two threads can’t do pythonic things at the same time. That makes Python a crappy language for modern programming, but we are stuck with it. OpenCV releases the GIL. One of the reasons the fastai library is so fast is that it does not use multiple processes for data augmentation like every other library does; it actually uses multiple threads. The reason it can use multiple threads is that it uses OpenCV. Unfortunately, OpenCV has an inscrutable API and its documentation is somewhat obtuse. That is why Jeremy tried to make it so that no one using fastai needs to know that it’s using OpenCV. You don’t need to know what flags to pass to open an image. You don’t need to know that if the reading fails, it doesn’t raise an exception: it silently returns None.

Don’t start using PyTorch for your data augmentation or start bringing in Pillow — you will find suddenly things slow down horribly or the multi-threading won’t work anymore. You should stick to using OpenCV for your processing [1:14:10]

Matplotlib is so named because it was originally a clone of Matlab’s plotting library. Unfortunately, Matlab’s plotting library is not great, but at that time, it was what everybody knew. At some point, the matplotlib folks realized that and added a second API which was an object-oriented API. Unfortunately, because nobody who originally learnt matplotlib learnt the OO API, they then taught the next generation of people the old Matlab style API. Now there are not many examples or tutorials that use the much better, easier to understand, and simpler OO API. Because plotting is so important in deep learning, one of the things we are going to learn in this class is how to use this API.

Trick 1: plt.subplots [1:16:00]

Matplotlib’s plt.subplots is a really useful wrapper for creating plots, regardless of whether you have more than one subplot. Note that Matplotlib has an optional object-oriented API which I think is much easier to understand and use (although few examples online use it!)

def show_img(im, figsize=None, ax=None):
    if not ax: fig,ax = plt.subplots(figsize=figsize)
    ax.imshow(im)
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
    return ax

It returns two things — you probably won’t care about the first one (Figure object), the second one is Axes object (or an array of them). Basically anywhere you used to say plt. something, you now say ax. something, and it will now do the plotting to that particular subplot. This is helpful when you want to plot multiple plots so you can compare next to each other.

Trick 2: Visible text regardless of background color [1:17:59]

A simple but rarely used trick for making text visible regardless of background is to use white text with a black outline, or vice versa. Here’s how to do it in matplotlib.

def draw_outline(o, lw):
    o.set_path_effects([patheffects.Stroke(
        linewidth=lw, foreground='black'), patheffects.Normal()])

Note that * in argument lists is the splat operator. In this case it's a little shortcut compared to writing out b[-2],b[-1].
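
A tiny sketch of the splat operator in isolation (rectangle is a toy stand-in for a drawing call, not a matplotlib function):

```python
def rectangle(xy, width, height):
    """Toy stand-in for a drawing call taking an anchor point, a width and a height."""
    return {"xy": tuple(xy), "width": width, "height": height}


b = [155, 96, 195, 173]

explicit = rectangle(b[:2], b[-2], b[-1])
splatted = rectangle(b[:2], *b[-2:])   # *b[-2:] unpacks into the two trailing arguments
```

Both calls pass exactly the same arguments; the splat version just avoids spelling out each element.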

def draw_rect(ax, b):
    patch = ax.add_patch(patches.Rectangle(b[:2], *b[-2:],
                         fill=False, edgecolor='white', lw=2))
    draw_outline(patch, 4)

def draw_text(ax, xy, txt, sz=14):
    text = ax.text(*xy, txt, verticalalignment='top', color='white',
                   fontsize=sz, weight='bold')
    draw_outline(text, 1)

ax = show_img(im)
b = bb_hw(im0_a[0])
draw_rect(ax, b)
draw_text(ax, b[:2], cats[im0_a[1]])

Packaging it all up [1:21:20]

def draw_im(im, ann):
    ax = show_img(im, figsize=(16,8))
    for b,c in ann:
        b = bb_hw(b)
        draw_rect(ax, b)
        draw_text(ax, b[:2], cats[c], sz=16)

def draw_idx(i):
    im_a = trn_anno[i]
    im = open_image(IMG_PATH/trn_fns[i])
    draw_im(im, im_a)

When you are working with a new dataset, getting to the point that you can rapidly explore it pays off.

Largest item classifier [1:22:57]

Rather than trying to solve everything at once, let’s make continual progress. We know how to find the biggest object in each image and classify it, so let’s start from there. Jeremy’s approach to Kaggle competitions is half an hour every day [1:24:00]. At the end of that half hour, submit something and try to make it a little bit better than yesterday’s.

The first thing we need to do is to go through each of the bounding boxes in an image and get the largest one. A lambda function is simply a way to define an anonymous function inline. Here we use it to describe how to sort the annotation for each image — by bounding box size (descending).

We subtract the upper left from the bottom right and multiply the values together (np.product) to get an area: lambda x: np.product(x[0][-2:]-x[0][:2]).

def get_lrg(b):
    if not b: raise Exception()
    b = sorted(b, key=lambda x: np.product(x[0][-2:]-x[0][:2]),
               reverse=True)
    return b[0]

Dictionary comprehension [1:27:04]

trn_lrg_anno = {a: get_lrg(b) for a,b in trn_anno.items()}

Now we have a dictionary from image id to a single bounding box — the largest for that image.
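
Here is the same idea as a dependency-free sketch on plain lists: sort each image's annotations by area with a lambda, take the first, and build the result with a dictionary comprehension. The names area, largest, and toy_anno are made up for illustration; the boxes are copied from the outputs shown earlier.

```python
def area(box):
    """Area of a [top, left, bottom, right] box, computed the same way as in the text."""
    top, left, bottom, right = box
    return (bottom - top) * (right - left)


def largest(annotations):
    """Return the (box, class_id) pair with the biggest box."""
    if not annotations:
        raise ValueError("no annotations for this image")
    return sorted(annotations, key=lambda x: area(x[0]), reverse=True)[0]


# Image 17 has two annotations (person = 15, horse = 13); image 12 has one (car = 7).
toy_anno = {
    12: [([96, 155, 269, 350], 7)],
    17: [([61, 184, 198, 278], 15), ([77, 89, 335, 402], 13)],
}

toy_largest = {img_id: largest(anns) for img_id, anns in toy_anno.items()}
```

For image 17 the horse box wins (258 x 313 against 137 x 94 for the person), so the person annotation is dropped from the largest-object dataset.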

b,c = trn_lrg_anno[23]
b = bb_hw(b)
ax = show_img(open_image(IMG_PATH/trn_fns[23]), figsize=(5,10))
draw_rect(ax, b)
draw_text(ax, b[:2], cats[c], sz=16)

You need to look at every stage when you have any kind of processing pipeline [1:28:01]. Assume that everything you do will be wrong the first time you do it.

CSV = PATH/'tmp/lrg.csv'

Often it’s easiest to simply create a CSV of the data you want to model, rather than trying to create a custom dataset [1:29:06]. Here we use Pandas to help us create a CSV of the image filename and class. columns=['fn','cat'] is there because a dictionary does not have a guaranteed order, and the order of the columns matters.

df = pd.DataFrame({'fn': [trn_fns[o] for o in trn_ids],
                   'cat': [cats[trn_lrg_anno[o][1]] for o in trn_ids]},
                  columns=['fn','cat'])
df.to_csv(CSV, index=False)
f_model = resnet34
sz = 224

From here it’s just like Dogs vs Cats!

tfms = tfms_from_model(f_model, sz, aug_tfms=transforms_side_on,
                       crop_type=CropType.NO)
md = ImageClassifierData.from_csv(PATH, JPEGS, CSV, tfms=tfms)

One thing that is different is crop_type. The default strategy for creating a 224 by 224 image in fastai is to first resize it so that the smallest side is 224, then to take a random square crop during training. During validation, we take the center crop unless we use data augmentation.

For bounding boxes, we do not want to do that because, unlike ImageNet where the thing we care about is pretty much in the middle and pretty big, a lot of the things in object detection are quite small and close to the edge. By setting crop_type to CropType.NO, it will not crop and therefore, to make the image square, it squishes it [1:32:09]. Generally speaking, a lot of computer vision models work a little bit better if you crop rather than squish, but they still work pretty well if you squish. In this case, we definitely do not want to crop, so this is perfectly fine.


You already know that inside a model data object, we have a bunch of things, including a training data loader and a training data set. The main thing to know about a data loader is that it is something you iterate over: each time you grab the next iteration of stuff from it, you get a mini-batch. The mini-batch you get is of whatever size you asked for, and by default the batch size is 64. In Python, the way you grab the next thing from an iterator is with next, as in next(md.trn_dl), but you can’t just do that. The reason is that you first need to say “start a new epoch now”. In general, not just in PyTorch but for any Python iterable, you need to say “start at the beginning of the sequence please”. The way you do that is to use iter(md.trn_dl), which will grab an iterator out of md.trn_dl. Specifically, as we will learn later, it means that this class has to have defined an __iter__ method which returns some different object which then has a __next__ method.
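
The iter/next mechanics can be sketched with a toy loader. ToyLoader and ToyLoaderIter are made-up classes for illustration, not the fastai or PyTorch implementation; they only show the split between an object with __iter__ and a separate object with __next__.

```python
class ToyLoader:
    """Hypothetical stand-in for a data loader: yields mini-batches of a list."""

    def __init__(self, items, bs):
        self.items, self.bs = items, bs

    def __iter__(self):
        # Each call starts a fresh pass ("epoch") and returns a new iterator object.
        return ToyLoaderIter(self)


class ToyLoaderIter:
    def __init__(self, loader):
        self.loader, self.pos = loader, 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.pos >= len(self.loader.items):
            raise StopIteration
        batch = self.loader.items[self.pos:self.pos + self.loader.bs]
        self.pos += self.loader.bs
        return batch


dl = ToyLoader(list(range(10)), bs=4)

try:
    next(dl)                     # the loader itself has no __next__ method
    direct_next_failed = False
except TypeError:
    direct_next_failed = True

first_batch = next(iter(dl))     # start an epoch, then grab one batch
all_batches = list(dl)           # list() and for-loops call iter() for you
```

Calling next directly on the loader fails, while next(iter(dl)) returns the first mini-batch of a fresh pass.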

If you want to grab just a single batch, you combine the two calls, x,y = next(iter(md.val_dl)), where x is the independent variable and y is the dependent variable.


We cannot send this straight to show_img [1:35:30]. For example, x is not a numpy array, it is not on the CPU, and the shape is all wrong (3x224x224). Furthermore, the values are not numbers between 0 and 1, because all of the standard ImageNet pre-trained models expect our data to have been normalized to have zero mean and a standard deviation of 1.

As you see, a whole bunch of things have been done to the input to get it ready to be passed to a pre-trained model. So we have a function called denorm, for denormalize, which also fixes up the dimension order etc. Denormalization depends on the transform [1:37:52], and the dataset knows what transform was used to create it, so that is why you have to call md.val_ds.denorm and pass it the mini-batch after turning it into a numpy array.
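
To make the two steps concrete (undoing the normalization and moving channels last), here is a rough numpy sketch. toy_denorm is a hypothetical stand-in, not fastai's implementation, and the mean and standard deviation are the commonly used ImageNet channel statistics, assumed here to be the ones the transform applied.

```python
import numpy as np

# Commonly used ImageNet channel statistics (RGB order); assumed for this sketch.
mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])


def toy_denorm(batch):
    """(N, 3, H, W) normalized batch -> (N, H, W, 3) image batch with values in [0, 1]."""
    channels_last = np.transpose(batch, (0, 2, 3, 1))
    return np.clip(channels_last * std + mean, 0, 1)


# Build a fake normalized batch from known pixel values so the round trip can be checked.
rng = np.random.default_rng(0)
pixels = rng.random((2, 4, 4, 3))                                # N, H, W, C in [0, 1)
normalized = np.transpose((pixels - mean) / std, (0, 3, 1, 2))   # N, C, H, W

restored = toy_denorm(normalized)
```

After the round trip the array is back in height x width x channel order with values between 0 and 1, which is the layout an image-plotting call expects.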

learn = ConvLearner.pretrained(f_model, md, metrics=[accuracy])
learn.opt_fn = optim.Adam

We intentionally remove the first few points and the last few points [1:38:54], because often the last few points shoot so high up towards infinity that you can’t see anything, so it is generally a good idea. But when you have very few mini-batches, it is not a good idea. When your LR finder graph looks like the one above, you can ask for more points on each end (you can also make your batch size really small):

learn.sched.plot(n_skip=5, n_skip_end=1)
lr = 2e-2
learn.fit(lr, 1, cycle_len=1)
epoch trn_loss val_loss accuracy
0 1.280753 0.604127 0.806941

Unfreeze a couple of layers:

lrs = np.array([lr/1000,lr/100,lr])
learn.freeze_to(-2)
learn.fit(lrs/5, 1, cycle_len=1)
epoch trn_loss val_loss accuracy
0 0.780925 0.575539 0.821064

Unfreeze the whole thing:

learn.unfreeze()
learn.fit(lrs/5, 1, cycle_len=2)
epoch trn_loss val_loss accuracy
0 0.676254 0.546998 0.834285
1 0.460609 0.533741 0.833233

Accuracy isn’t improving much — since many images have multiple different objects, it’s going to be impossible to be that accurate.

fig, axes = plt.subplots(3, 4, figsize=(12, 8))
for i,ax in enumerate(axes.flat):
    ima = md.val_ds.denorm(x)[i]
    b = md.classes[preds[i]]
    ax = show_img(ima, ax=ax)
    draw_text(ax, (0,0), b)

How to understand the unfamiliar code:

  • Run each line of code step by step, print out the inputs and outputs.

Method 1 [1:42:28]: You can take the contents of the loop, copy it, create a cell above it, paste it, un-indent it, set i=0 and put them all in separate cells.

Method 2 [1:43:04]: Use Python debugger

You can use the python debugger pdb to step through code.

  • pdb.set_trace() to set a breakpoint
  • %debug magic to trace an error (after the exception happened)

Commands you need to know:

  • h (help)
  • s (step into)
  • n (next line / step over — you can also hit enter)
  • c (continue to the next breakpoint)
  • u (up the call stack)
  • d (down the call stack)
  • p (print) — force print when there is a single letter variable that’s also a command.
  • l (list) — show the line above and below it
  • q (quit) — very important

Comment [1:49:10]: IPython.core.debugger (on the right below) makes it all pretty:

Creating a bounding box around the largest object may seem like something you haven’t done before, but actually it is totally something you have done before. We can create a regression rather than a classification neural net. A classification neural net is one that has a sigmoid or softmax output and uses a cross entropy, binary cross entropy, or negative log likelihood loss function. That is basically what makes it a classifier. If we don’t have the softmax or sigmoid at the end and we use mean squared error as the loss function, it is now a regression model, which predicts a continuous number rather than a category. We also know that we can have multiple outputs, like in the planet competition (multiple classification). What if we combine the two ideas and do a multiple column regression?

This is where you are thinking about it like differentiable programming. It is not like “how do I create a bounding box model?” but it is more like:

  • We need four numbers, therefore, we need a neural network with 4 activations
  • For loss function, what is a function that when it is lower means that the four numbers are better? Mean squared loss function!

That’s it. Let’s try it.
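
As a worked example of why mean squared error fits, here is a plain-Python sketch with one target box from earlier in these notes and two made-up predictions (mse, rough_guess, and closer_guess are illustrative names, not part of the course code):

```python
def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


target = [96, 155, 269, 350]        # a box from earlier in these notes
rough_guess = [80, 140, 280, 330]   # made-up prediction, far off
closer_guess = [95, 150, 270, 352]  # made-up prediction, nearly right

rough_loss = mse(rough_guess, target)    # (256 + 225 + 121 + 400) / 4 = 250.5
closer_loss = mse(closer_guess, target)  # (1 + 25 + 1 + 4) / 4 = 7.75
```

The loss drops as all four numbers move towards the target, which is exactly the property we asked for.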

Now we’ll try to find the bounding box of the largest object. This is simply a regression with 4 outputs, so we can use a CSV with multiple ‘labels’. If you remember from part 1, to do a multiple label classification, your multiple labels have to be space separated, and the file name is separated from them by a comma.

BB_CSV = PATH/'tmp/bb.csv'
bb = np.array([trn_lrg_anno[o][0] for o in trn_ids])
bbs = [' '.join(str(p) for p in o) for o in bb]
df = pd.DataFrame({'fn': [trn_fns[o] for o in trn_ids],
                   'bbox': bbs}, columns=['fn','bbox'])
df.to_csv(BB_CSV, index=False)

BB_CSV.open().readlines()[:5]
['fn,bbox\n',
 '000012.jpg,96 155 269 350\n',
 '000017.jpg,77 89 335 402\n',
 '000023.jpg,1 2 461 242\n',
 '000026.jpg,124 89 211 336\n']
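The same file layout can be produced and read back with the standard library alone. A sketch, with the two sample rows copied from the output above; `write_bbox_csv` and `read_bbox_csv` are illustrative names, not fastai functions.

```python
import csv
import io


def write_bbox_csv(rows, handle):
    """rows: iterable of (file_name, [four coordinates])."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["fn", "bbox"])
    for file_name, box in rows:
        # comma between the columns, spaces between the four labels
        writer.writerow([file_name, " ".join(str(v) for v in box)])


def read_bbox_csv(handle):
    """Return a list of (file_name, [four floats])."""
    reader = csv.DictReader(handle)
    return [(row["fn"], [float(v) for v in row["bbox"].split()]) for row in reader]


rows = [("000012.jpg", [96, 155, 269, 350]), ("000017.jpg", [77, 89, 335, 402])]
buffer = io.StringIO()
write_bbox_csv(rows, buffer)
text = buffer.getvalue()
parsed = read_bbox_csv(io.StringIO(text))
print(text)
print(parsed)
```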

Set continuous=True to tell fastai this is a regression problem, which means it won't one-hot encode the labels, and will use MSE as the default crit.

Note that we have to tell the transforms constructor that our labels are coordinates, so that it can handle the transforms correctly.

Also, we use CropType.NO because we want to ‘squish’ the rectangular images into squares, rather than center cropping, so that we don’t accidentally crop out some of the objects. (This is less of an issue in something like imagenet, where there is a single object to classify, and it’s generally large and centrally located).

tfms = tfms_from_model(f_model, sz, crop_type=CropType.NO,
                       tfm_y=TfmType.COORD)
md = ImageClassifierData.from_csv(PATH, JPEGS, BB_CSV, tfms=tfms,
                                  continuous=True)

We will look at TfmType.COORD next week, but for now, just realize that when we are doing scaling and data augmentation, that needs to happen to the bounding boxes, not just images.
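To see why the labels have to follow the image, here is a rough sketch in plain Python of what squishing does to a box, and of how a center crop could cut an object off. It assumes boxes are stored as [top, left, bottom, right] in pixels; the helper names and the example image size are invented for illustration and this is not how fastai implements its transforms.

```python
def squish_box(box, height, width, size):
    """Rescale a [top, left, bottom, right] box for a height x width image
    that is resized (squished) to size x size without cropping."""
    top, left, bottom, right = box
    return [top * size / height, left * size / width,
            bottom * size / height, right * size / width]


def center_crop_keeps_box(box, height, width):
    """Return True if the box lies fully inside the largest centered square."""
    side = min(height, width)
    crop_top = (height - side) / 2
    crop_left = (width - side) / 2
    top, left, bottom, right = box
    return (crop_top <= top and bottom <= crop_top + side and
            crop_left <= left and right <= crop_left + side)


# Hypothetical 400 x 500 (height x width) image with an object near the left edge.
box = [100, 10, 300, 250]
print(squish_box(box, 400, 500, 224))        # the box after squishing to 224 x 224
print(center_crop_keeps_box(box, 400, 500))  # would a center crop keep the whole box?
```

For this example the center crop removes the left 50 pixel columns, so part of the object is lost, while squishing keeps the whole box and only rescales its coordinates.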

b = bb_hw(to_np(y[0])); b
array([ 49., 0., 131., 205.], dtype=float32)

ax = show_img(ima)
draw_rect(ax, b)
draw_text(ax, b[:2], 'label')

fastai lets you use a custom_head to add your own module on top of a convnet, instead of the adaptive pooling and fully connected net which is added by default. In this case, we don't want to do any pooling, since we need to know the activations of each grid cell.

The final layer has 4 activations, one per bounding box coordinate. Our target is continuous, not categorical, so the MSE loss function used does not do any sigmoid or softmax to the module outputs.
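The layer sizes can be checked by hand. The sketch below assumes the convolutional part of ResNet34 ends in a 7x7 grid with 512 channels (the shape for 224x224 inputs) and that the linear layer has one bias per output, which is the PyTorch default; the helper names are illustrative.

```python
def flattened_size(channels, height, width):
    """Length of the vector obtained by flattening a channels x height x width map."""
    return channels * height * width


def linear_parameter_count(in_features, out_features):
    """Weights plus one bias per output for a fully connected layer."""
    return in_features * out_features + out_features


in_features = flattened_size(512, 7, 7)
n_params = linear_parameter_count(in_features, 4)
print(in_features)  # input size of the final linear layer
print(n_params)     # trainable parameters in that layer
```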

head_reg4 = nn.Sequential(Flatten(), nn.Linear(25088,4))
learn = ConvLearner.pretrained(f_model, md, custom_head=head_reg4)
learn.opt_fn = optim.Adam
learn.crit = nn.L1Loss()
  • Flatten() : Normally the previous layer has 7x7x512 in ResNet34, so flatten that out into a single vector of length 25088
  • L1Loss [1:58:22]: Rather than adding up the squared errors, add up the absolute values of the errors. It is normally what you want because adding up the squared errors really penalizes bad misses by too much. So L1Loss is generally better to work with.
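A small numeric sketch of the difference, in plain Python rather than PyTorch. Both functions average over the four coordinates (the default reduction for `nn.L1Loss` and `nn.MSELoss`); the two predictions are invented for illustration.

```python
def l1_loss(predicted, target):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


def mse_loss(predicted, target):
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


target = [96, 155, 269, 350]
small_miss = [101, 160, 274, 355]  # every coordinate off by 5 pixels
big_miss = [96, 155, 269, 450]     # three exact, one off by 100 pixels

for name, pred in [("small miss", small_miss), ("big miss", big_miss)]:
    print(name, l1_loss(pred, target), mse_loss(pred, target))
```

Going from the small miss to the big miss multiplies the L1 loss by 5 but the squared loss by 100, which is the "penalizes bad misses by too much" effect described above.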
lr = 2e-3
learn.fit(lr, 2, cycle_len=1, cycle_mult=2)
epoch      trn_loss   val_loss
0          49.523444  34.764141
1          36.864003  28.007317
2          30.925234  27.230705

lrs = np.array([lr/100,lr/10,lr])
learn.sched.plot(1)
learn.fit(lrs, 2, cycle_len=1, cycle_mult=2)
epoch      trn_loss   val_loss
0          25.616161  22.83597
1          21.812624  21.387115
2          17.867176  20.335539

learn.freeze_to(-3)
learn.fit(lrs, 1, cycle_len=2)
epoch      trn_loss   val_loss
0          16.571885  20.948696
1          15.072718  19.925312
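The number of rows in each log follows from the cycle arguments visible in the calls: as covered in part 1, each cycle (restart) is `cycle_mult` times as long as the previous one. A sketch of that arithmetic; `total_epochs` is an illustrative helper, not a fastai function.

```python
def total_epochs(n_cycles, cycle_len, cycle_mult=1):
    """Epochs run by n_cycles restarts, where cycle i lasts cycle_len * cycle_mult**i."""
    return sum(cycle_len * cycle_mult ** i for i in range(n_cycles))


print(total_epochs(2, cycle_len=1, cycle_mult=2))  # two cycles of 1 and 2 epochs
print(total_epochs(1, cycle_len=2))                # a single cycle of 2 epochs
```

This matches the three-row and two-row logs above.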

Validation loss is the mean of the absolute values of how many pixels the predictions were off by.

learn.save('reg4')
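x,y = next(iter(md.val_dl))
learn.model.eval()  # switch batchnorm/dropout to inference mode
preds = to_np(learn.model(VV(x)))
fig, axes = plt.subplots(3, 4, figsize=(12, 8))
for i,ax in enumerate(axes.flat):
    ima = md.val_ds.denorm(to_np(x))[i]  # de-normalize the i-th image of the batch
    b = bb_hw(preds[i])
    ax = show_img(ima, ax=ax)
    draw_rect(ax, b)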

We will revisit this next week. Before this class, if you were asked “do you know how to create a bounding box model?”, you might have said “no, nobody’s taught me that”. But the question actually is:

  • Can you create a model with 4 continuous outputs? Yes.
  • Can you create a loss function that is lower if those 4 outputs are near to 4 other numbers? Yes

Then you are done.

As you look further down the predictions, they start looking a bit crappy whenever there is more than one object in the image. This is not surprising. Overall, it did a pretty good job.
