Deep Learning 2: Part 2 Lesson 9

Hiromi Suenaga
Apr 1, 2018


My personal notes from fast.ai course. These notes will continue to be updated and improved as I continue to review the course to “really” understand it. Much appreciation to Jeremy and Rachel who gave me this opportunity to learn.

Lessons: 1 · 2 · 3 · 4 · 5 · 6 · 7 · 8 · 9 · 10 · 11 · 12 · 13 · 14

Links

Forum / Video

Review

From Last week:

  • Pathlib; JSON
  • Dictionary comprehensions
  • Defaultdict
  • How to jump around fastai source
  • matplotlib OO API
  • Lambda functions
  • Bounding box coordinates
  • Custom head; bounding box regression

From Part 1:

  • How to view model inputs from a DataLoader
  • How to view model outputs

Data Augmentation and Bounding Box [2:58]

Notebook

Awkward rough edges of fastai:
A classifier is anything whose dependent variable is categorical or binomial, as opposed to regression, where the dependent variable is continuous. The naming is a little confusing but will be sorted out in the future. Here, continuous is True because our dependent variable is the coordinates of the bounding box, so this is actually regressor data.

tfms = tfms_from_model(f_model, sz, crop_type=CropType.NO, 
aug_tfms=augs)
md = ImageClassifierData.from_csv(PATH, JPEGS, BB_CSV, tfms=tfms,
continuous=True, bs=4)

Let’s create some data augmentation [4:40]

augs = [RandomFlip(), 
RandomRotate(30),
RandomLighting(0.1,0.1)]

Normally, we use these shortcuts Jeremy created for us, but they are simply lists of random augmentations. But you can easily create your own (most if not all of them start with “Random”).

tfms = tfms_from_model(f_model, sz, crop_type=CropType.NO,
                       aug_tfms=augs)
md = ImageClassifierData.from_csv(PATH, JPEGS, BB_CSV, tfms=tfms,
                                  continuous=True, bs=4)
idx=3
fig,axes = plt.subplots(3,3, figsize=(9,9))
for i,ax in enumerate(axes.flat):
    x,y = next(iter(md.aug_dl))
    ima = md.val_ds.denorm(to_np(x))[idx]
    b = bb_hw(to_np(y[idx]))
    print(b)
    show_img(ima, ax=ax)
    draw_rect(ax, b)
[ 115. 63. 240. 311.]
[ 115. 63. 240. 311.]
[ 115. 63. 240. 311.]
[ 115. 63. 240. 311.]
[ 115. 63. 240. 311.]
[ 115. 63. 240. 311.]
[ 115. 63. 240. 311.]
[ 115. 63. 240. 311.]
[ 115. 63. 240. 311.]

As you can see, the image gets rotated and the lighting varies, but the bounding box does not move and ends up in the wrong spot [6:17]. This is the problem with data augmentation when your dependent variable is pixel values or is in some way connected to the independent variable: they need to be augmented together. As you can see from the bounding box coordinates [ 115. 63. 240. 311.], our image is 224 by 224, so it is neither scaled nor cropped. The dependent variable needs to go through all the same geometric transformations as the independent variable.

To do this [7:10], every transformation has an optional tfm_y parameter:

augs = [RandomFlip(tfm_y=TfmType.COORD),
RandomRotate(30, tfm_y=TfmType.COORD),
RandomLighting(0.1,0.1, tfm_y=TfmType.COORD)]
tfms = tfms_from_model(f_model, sz, crop_type=CropType.NO,
tfm_y=TfmType.COORD, aug_tfms=augs)
md = ImageClassifierData.from_csv(PATH, JPEGS, BB_CSV, tfms=tfms,
continuous=True, bs=4)

TfmType.COORD indicates that the y value represents coordinates. This needs to be added to all the augmentations as well as to tfms_from_model, which is responsible for cropping, zooming, resizing, padding, etc.

idx=3
fig,axes = plt.subplots(3,3, figsize=(9,9))
for i,ax in enumerate(axes.flat):
    x,y = next(iter(md.aug_dl))
    ima = md.val_ds.denorm(to_np(x))[idx]
    b = bb_hw(to_np(y[idx]))
    print(b)
    show_img(ima, ax=ax)
    draw_rect(ax, b)
[ 48. 34. 112. 188.]
[ 65. 36. 107. 185.]
[ 49. 27. 131. 195.]
[ 24. 18. 147. 204.]
[ 61. 34. 113. 188.]
[ 55. 31. 121. 191.]
[ 52. 19. 144. 203.]
[ 7. 0. 193. 222.]
[ 52. 38. 105. 182.]

Now the bounding box moves with the image and is in the right spot. You may notice that sometimes it looks odd, like the middle one in the bottom row. This is a constraint of the information we have: if the object occupied the corners of the original bounding box, your new bounding box needs to be bigger after the image rotates. So you must be careful not to do too high a rotation with bounding boxes, because there is not enough information for them to stay accurate. If we were doing polygons or segmentations, we would not have this problem.

This is why the box gets bigger.
tfm_y = TfmType.COORD
augs = [RandomFlip(tfm_y=tfm_y),
RandomRotate(3, p=0.5, tfm_y=tfm_y),
RandomLighting(0.05,0.05, tfm_y=tfm_y)]
tfms = tfms_from_model(f_model, sz, crop_type=CropType.NO,
tfm_y=tfm_y, aug_tfms=augs)
md = ImageClassifierData.from_csv(PATH, JPEGS, BB_CSV, tfms=tfms,
continuous=True)

So here, we do a maximum of 3 degrees of rotation to avoid this problem [9:14]. It also only rotates half of the time (p=0.5).

custom_head [9:34]

learn.summary() will run a small batch of data through the model and print out the size of the tensors at every layer. As you can see, right before the Flatten layer, the tensor has the shape 512 by 7 by 7. So if it were a rank 1 tensor (i.e. a single vector), its length would be 25088 (512 * 7 * 7), and that is why our custom head's input size is 25088. The output size is 4 since it is the bounding box coordinates.

head_reg4 = nn.Sequential(Flatten(), nn.Linear(25088,4))
learn = ConvLearner.pretrained(f_model, md, custom_head=head_reg4)
learn.opt_fn = optim.Adam
learn.crit = nn.L1Loss()

Single object detection [10:35]

Let’s combine the two to create something that can classify and localize the largest object in each image.

There are 3 things that we need to do to train a neural network:

  1. Data
  2. Architecture
  3. Loss Function

1. Providing Data

We need a ModelData object whose independent variable is the images, and whose dependent variable is a tuple of bounding box coordinates and class label. There are several ways to do this, but a particularly lazy and convenient way Jeremy came up with is to create two ModelData objects representing the two different dependent variables we want (one with the bounding box coordinates, one with the classes).

f_model=resnet34
sz=224
bs=64
val_idxs = get_cv_idxs(len(trn_fns))
tfms = tfms_from_model(f_model, sz, crop_type=CropType.NO,
tfm_y=TfmType.COORD, aug_tfms=augs)
md = ImageClassifierData.from_csv(PATH, JPEGS, BB_CSV, tfms=tfms,
continuous=True, val_idxs=val_idxs)
md2 = ImageClassifierData.from_csv(PATH, JPEGS, CSV,
tfms=tfms_from_model(f_model, sz))

A dataset can be anything with __len__ and __getitem__. Here's a dataset that adds a 2nd label to an existing dataset:

class ConcatLblDataset(Dataset):
    def __init__(self, ds, y2): self.ds,self.y2 = ds,y2
    def __len__(self): return len(self.ds)

    def __getitem__(self, i):
        x,y = self.ds[i]
        return (x, (y,self.y2[i]))
  • ds : contains both independent and dependent variables
  • y2 : contains the additional dependent variables
  • (x, (y,self.y2[i])) : __getitem__ returns an independent variable and the combination of the two dependent variables.

We’ll use it to add the classes to the bounding boxes labels.

trn_ds2 = ConcatLblDataset(md.trn_ds, md2.trn_y)
val_ds2 = ConcatLblDataset(md.val_ds, md2.val_y)

Here is an example dependent variable:

val_ds2[0][1]

(array([   0.,   49.,  205.,  180.], dtype=float32), 14)

We can replace the dataloaders’ datasets with these new ones.

md.trn_dl.dataset = trn_ds2
md.val_dl.dataset = val_ds2

We have to denormalize the images from the dataloader before they can be plotted.

x,y = next(iter(md.val_dl))
idx = 3
ima = md.val_ds.ds.denorm(to_np(x))[idx]
b = bb_hw(to_np(y[0][idx])); b
array([ 52., 38., 106., 184.], dtype=float32)

ax = show_img(ima)
draw_rect(ax, b)
draw_text(ax, b[:2], md2.classes[y[1][idx]])

2. Choosing Architecture [13:54]

The architecture will be the same as the one we used for the classifier and bounding box regression, but we will just combine them. In other words, if we have c classes, then the number of activations we need in the final layer is 4 plus c. 4 for bounding box coordinates and c probabilities (one per class).

We’ll use an extra linear layer this time, plus some dropout, to help us train a more flexible model. In general, we want our custom head to be capable of solving the problem on its own if the pre-trained backbone it is connected to is appropriate. In this case, we are trying to do quite a bit (a classifier plus bounding box regression), so a single linear layer does not seem like enough. If you were wondering why there is no BatchNorm1d after the first ReLU: the ResNet backbone already ends with a BatchNorm layer.

head_reg4 = nn.Sequential(
Flatten(),
nn.ReLU(),
nn.Dropout(0.5),
nn.Linear(25088,256),
nn.ReLU(),
nn.BatchNorm1d(256),
nn.Dropout(0.5),
nn.Linear(256,4+len(cats)),
)
models = ConvnetBuilder(f_model, 0, 0, 0, custom_head=head_reg4)

learn = ConvLearner(md, models)
learn.opt_fn = optim.Adam

3. Loss Function [15:46]

The loss function needs to look at these 4 + len(cats) activations and decide if they are good: whether these numbers accurately reflect the position and class of the largest object in the image. We know how to do this. For the first 4 activations, we will use L1Loss just like we did before (L1Loss is like mean squared error, but instead of the sum of squared errors it uses the sum of absolute values). For the rest of the activations, we can use cross entropy loss.

def detn_loss(input, target):
    bb_t,c_t = target
    bb_i,c_i = input[:, :4], input[:, 4:]
    bb_i = F.sigmoid(bb_i)*224
    # I looked at these quantities separately first then picked a
    # multiplier to make them approximately equal
    return F.l1_loss(bb_i, bb_t) + F.cross_entropy(c_i, c_t)*20

def detn_l1(input, target):
    bb_t,_ = target
    bb_i = input[:, :4]
    bb_i = F.sigmoid(bb_i)*224
    return F.l1_loss(V(bb_i),V(bb_t)).data

def detn_acc(input, target):
    _,c_t = target
    c_i = input[:, 4:]
    return accuracy(c_i, c_t)

learn.crit = detn_loss
learn.metrics = [detn_acc, detn_l1]
  • input : activations
  • target : ground truth
  • bb_t,c_t = target : our custom dataset returns a tuple containing bounding box coordinates and classes. This assignment destructures them.
  • bb_i,c_i = input[:, :4], input[:, 4:] : the first : is for the batch dimension.
  • bb_i = F.sigmoid(bb_i)*224 : we know our image is 224 by 224. Sigmoid forces the value to be between 0 and 1, and multiplying by 224 helps our neural net output be in the range it needs to be.

Question: As a general rule, is it better to put BatchNorm before or after ReLU [18:02]? Jeremy would suggest putting it after the ReLU, because BatchNorm is meant to move towards a zero-mean, one-standard-deviation output. So if you put ReLU right after it, you are truncating it at zero and there is no way to create negative numbers. But if you do ReLU then BatchNorm, it does have that ability and gives slightly better results. Having said that, it is not too big of a deal either way. During this part of the course, most of the time Jeremy does ReLU then BatchNorm, but he sometimes does the opposite when he wants to be consistent with a paper.

Question: What is the intuition behind using dropout after a BatchNorm? Doesn’t BatchNorm already do a good job of regularizing [19:12]? BatchNorm does an okay job of regularizing, but if you think back to part 1, we discussed a list of things we do to avoid overfitting, and adding BatchNorm is one of them, as is data augmentation. It is still perfectly possible that you will be overfitting. One nice thing about dropout is that it has a parameter that says how much to drop out. Parameters are great, specifically parameters that decide how much to regularize, because they let you build a nice big over-parameterized model and then decide how much to regularize it. Jeremy tends to always put in dropout, starting with p=0, and then as he adds regularization he can just change the dropout parameter without worrying about saved models: if he saved a model and wanted to load it back, but one version had dropout layers and the other did not, it would not load anymore. This way, it stays consistent.

Now that we have our inputs and targets, we can calculate the L1 loss and add the cross entropy [20:39]:

F.l1_loss(bb_i, bb_t) + F.cross_entropy(c_i, c_t)*20

This is our loss function. Cross entropy and L1 loss may be of wildly different scales, in which case the larger one will dominate the loss. In this case, Jeremy printed out the values and found that multiplying the cross entropy by 20 puts them at about the same scale.
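Here is a minimal sketch (mine, not from the notebook) of how you might eyeball the two scales before picking a multiplier, assuming the md and learn objects defined above: run one batch through the model and print each loss component separately.

x, y = next(iter(md.val_dl))
preds = learn.model(V(x))                          # shape: [bs, 4 + c]
bb_t, c_t = V(y[0]), V(y[1])
bb_i = F.sigmoid(preds[:, :4]) * 224               # box activations, scaled to pixels
print(F.l1_loss(bb_i, bb_t).data[0])               # typically tens (pixels)
print(F.cross_entropy(preds[:, 4:], c_t).data[0])  # typically a few units
# if the first is roughly 20x the second, scaling cross entropy by ~20 balances them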

lr=1e-2
learn.fit(lr, 1, cycle_len=3, use_clr=(32,5))
epoch trn_loss val_loss detn_acc detn_l1
0 72.036466 45.186367 0.802133 32.647586
1 51.037587 36.34964 0.828425 25.389733
2 41.4235 35.292709 0.835637 24.343577
[35.292709, 0.83563701808452606, 24.343576669692993]

It is nice to print out information as you train, so we grabbed L1 loss and added it as metrics.

learn.save('reg1_0')
learn.freeze_to(-2)
lrs = np.array([lr/100, lr/10, lr])
learn.fit(lrs/5, 1, cycle_len=5, use_clr=(32,10))
epoch trn_loss val_loss detn_acc detn_l1
0 34.448113 35.972973 0.801683 22.918499
1 28.889909 33.010857 0.830379 21.689888
2 24.237017 30.977512 0.81881 20.817996
3 21.132993 30.60677 0.83143 20.138552
4 18.622983 30.54178 0.825571 19.832196
[30.54178, 0.82557091116905212, 19.832195997238159]

learn.unfreeze()
learn.fit(lrs/10, 1, cycle_len=10, use_clr=(32,10))
epoch trn_loss val_loss detn_acc detn_l1
0 15.957164 31.111507 0.811448 19.970753
1 15.955259 32.597153 0.81235 20.111022
2 15.648723 32.231941 0.804087 19.522853
3 14.876172 30.93821 0.815805 19.226574
4 14.113872 31.03952 0.808594 19.155093
5 13.293885 29.736671 0.826022 18.761728
6 12.562566 30.000023 0.827524 18.82006
7 11.885125 30.28841 0.82512 18.904158
8 11.498326 30.070133 0.819712 18.635296
9 11.015841 30.213772 0.815805 18.551489
[30.213772, 0.81580528616905212, 18.551488876342773]

The detection accuracy is in the low 80’s, which is the same as it was before. This is not surprising because ResNet was designed to do classification, so we wouldn’t expect to be able to improve things in such a simple way. It certainly wasn’t designed to do bounding box regression. It was actually explicitly designed in such a way as to not care about geometry: it takes the last 7 by 7 grid of activations and averages them all together, throwing away all the information about where everything came from.

Interestingly, when we do accuracy (classification) and bounding box at the same time, the L1 seems a little bit better than when we just do bounding box regression [22:46]. If that is counterintuitive to you, then this would be one of the main things to think about after this lesson since it is a really important idea. The idea is this — figuring out what the main object in an image is, is kind of the hard part. Then figuring out exactly where the bounding box is and what class it is is the easy part in a way. So when you have a single network that’s both saying what is the object and where is the object, it’s going to share all the computation about finding the object. And all that shared computation is very efficient. When we back propagate the errors in the class and in the place, that’s all the information that is going to help the computation around finding the biggest object. So anytime you have multiple tasks which share some concept of what those tasks would need to do to complete their work, it is very likely they should share at least some layers of the network together. Later today, we will look at a model where most of the layers are shared except for the last one.

Here are the results [24:34]. As before, it does a good job when there is a single major object in the image.

Multi label classification [25:29]

Notebook

We want to keep building models that are slightly more complex than the last model so that if something stops working, we know exactly where it broke. Here are functions from the previous notebook:

%matplotlib inline
%reload_ext autoreload
%autoreload 2
from fastai.conv_learner import *
from fastai.dataset import *

import json, pdb
from PIL import ImageDraw, ImageFont
from matplotlib import patches, patheffects
torch.backends.cudnn.benchmark=True

Setup

PATH = Path('data/pascal')
trn_j = json.load((PATH / 'pascal_train2007.json').open())
IMAGES,ANNOTATIONS,CATEGORIES = ['images', 'annotations',
'categories']
FILE_NAME,ID,IMG_ID,CAT_ID,BBOX = 'file_name','id','image_id',
'category_id','bbox'

cats = dict((o[ID], o['name']) for o in trn_j[CATEGORIES])
trn_fns = dict((o[ID], o[FILE_NAME]) for o in trn_j[IMAGES])
trn_ids = [o[ID] for o in trn_j[IMAGES]]

JPEGS = 'VOCdevkit/VOC2007/JPEGImages'
IMG_PATH = PATH/JPEGS
def get_trn_anno():
    trn_anno = collections.defaultdict(lambda:[])
    for o in trn_j[ANNOTATIONS]:
        if not o['ignore']:
            bb = o[BBOX]
            bb = np.array([bb[1], bb[0], bb[3]+bb[1]-1, bb[2]+bb[0]-1])
            trn_anno[o[IMG_ID]].append((bb,o[CAT_ID]))
    return trn_anno

trn_anno = get_trn_anno()
def show_img(im, figsize=None, ax=None):
    if not ax: fig,ax = plt.subplots(figsize=figsize)
    ax.imshow(im)
    ax.set_xticks(np.linspace(0, 224, 8))
    ax.set_yticks(np.linspace(0, 224, 8))
    ax.grid()
    ax.set_yticklabels([])
    ax.set_xticklabels([])
    return ax

def draw_outline(o, lw):
    o.set_path_effects([patheffects.Stroke(
        linewidth=lw, foreground='black'), patheffects.Normal()])

def draw_rect(ax, b, color='white'):
    patch = ax.add_patch(patches.Rectangle(b[:2], *b[-2:],
        fill=False, edgecolor=color, lw=2))
    draw_outline(patch, 4)

def draw_text(ax, xy, txt, sz=14, color='white'):
    text = ax.text(*xy, txt,
        verticalalignment='top', color=color, fontsize=sz, weight='bold')
    draw_outline(text, 1)

def bb_hw(a): return np.array([a[1],a[0],a[3]-a[1],a[2]-a[0]])

def draw_im(im, ann):
    ax = show_img(im, figsize=(16,8))
    for b,c in ann:
        b = bb_hw(b)
        draw_rect(ax, b)
        draw_text(ax, b[:2], cats[c], sz=16)

def draw_idx(i):
    im_a = trn_anno[i]
    im = open_image(IMG_PATH/trn_fns[i])
    draw_im(im, im_a)

Multi class [26:12]

MC_CSV = PATH/'tmp/mc.csv'

trn_anno[12]
[(array([ 96, 155, 269, 350]), 7)]

mc = [set([cats[p[1]] for p in trn_anno[o]]) for o in trn_ids]
mcs = [' '.join(str(p) for p in o) for o in mc]
df = pd.DataFrame({'fn': [trn_fns[o] for o in trn_ids],
'clas': mcs}, columns=['fn','clas'])
df.to_csv(MC_CSV, index=False)

One of the students pointed out that by using Pandas, we can do things much simpler than using collections.defaultdict and shared this gist. The more you get to know Pandas, the more often you realize it is a good way to solve lots of different problems.

Question: When you are incrementally building on top of smaller models, do you reuse them as pre-trained weights? or do you toss it away then retrain from scratch [27:11]? When Jeremy is figuring stuff out as he goes like this, he would generally lean towards tossing away because reusing pre-trained weights introduces unnecessary complexities. However, if he is trying to get to a point where he can train on really big images, he will generally start on much smaller and often re-use these weights.

f_model=resnet34
sz=224
bs=64
tfms = tfms_from_model(f_model, sz, crop_type=CropType.NO)
md = ImageClassifierData.from_csv(PATH, JPEGS, MC_CSV, tfms=tfms)
learn = ConvLearner.pretrained(f_model, md)
learn.opt_fn = optim.Adam
lr = 2e-2

learn.fit(lr, 1, cycle_len=3, use_clr=(32,5))
epoch      trn_loss   val_loss   <lambda>
0 0.104836 0.085015 0.972356
1 0.088193 0.079739 0.972461
2 0.072346 0.077259 0.974114
[0.077258907, 0.9741135761141777]

lrs = np.array([lr/100, lr/10, lr])
learn.freeze_to(-2)
learn.fit(lrs/10, 1, cycle_len=5, use_clr=(32,5))
epoch      trn_loss   val_loss   <lambda>
0 0.063236 0.088847 0.970681
1 0.049675 0.079885 0.973723
2 0.03693 0.076906 0.975601
3 0.026645 0.075304 0.976187
4 0.018805 0.074934 0.975165
[0.074934497, 0.97516526281833649]

learn.save('mclas')
learn.load('mclas')

y = learn.predict()
x,_ = next(iter(md.val_dl))
x = to_np(x)

fig, axes = plt.subplots(3, 4, figsize=(12, 8))
for i,ax in enumerate(axes.flat):
    ima = md.val_ds.denorm(x)[i]
    ya = np.nonzero(y[i]>0.4)[0]
    b = '\n'.join(md.classes[o] for o in ya)
    ax = show_img(ima, ax=ax)
    draw_text(ax, (0,0), b)
plt.tight_layout()

Multi-class classification is pretty straightforward [28:28]. One minor tweak is the use of set in this line so that each object type appears once:

mc = [set([cats[p[1]] for p in trn_anno[o]]) for o in trn_ids]

SSD and YOLO [29:10]

We have an input image that goes through a conv net which outputs a vector of size 4+c where c=len(cats) . This gives us an object detector for a single largest object. Let’s now create one that finds 16 objects. The obvious way to do this would be to take the last linear layer and rather than having 4+c outputs, we could have 16x(4+c) outputs. This gives us 16 sets of class probabilities and 16 sets of bounding box coordinates. Then we would just need a loss function that will check whether those 16 sets of bounding boxes correctly represented the up to 16 objects in the image (we will go into the loss function later).

The second way to do this is, rather than using nn.Linear, what if instead we took our ResNet convolutional backbone and added an nn.Conv2d with stride 2 [31:32]? This would give us a 4x4x[# of filters] tensor; here let's make it 4x4x(4+c) so that the number of elements is exactly equal to the number of elements we wanted. Now if we created a loss function that took a 4x4x(4+c) tensor, mapped it to the 16 objects in the image, and checked whether each one was correctly represented by its 4+c activations, this would work as well. It turns out both of these approaches are actually used [33:48]. The approach where the output is one big long vector from a fully connected linear layer is used by a class of models known as YOLO (You Only Look Once), whereas the approach of convolutional activations is used by models which started with something called SSD (Single Shot Detector). Since these came out at very similar times in late 2015, things have very much moved towards SSD, to the point where this morning YOLO version 3 came out and is now doing it the SSD way. So that is what we are going to do, and we will also learn about why it makes more sense. A rough shape comparison of the two head styles is sketched below.
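This sketch is my own, not from the notebook; c=20 and the fake backbone activations are assumptions. It only compares the output shapes of the two head styles on top of the 512x7x7 ResNet output.

import torch
import torch.nn as nn

c = 20                                     # number of classes (assumed)
backbone_out = torch.randn(2, 512, 7, 7)   # fake batch of backbone activations

# YOLO-style head: flatten, then one big linear layer -> 16*(4+c) numbers per image
yolo_head = nn.Linear(512*7*7, 16*(4+c))
print(yolo_head(backbone_out.view(2, -1)).shape)   # torch.Size([2, 384])

# SSD-style head: a stride 2 convolution -> a 4x4 grid with (4+c) numbers per cell
ssd_head = nn.Conv2d(512, 4+c, kernel_size=3, stride=2, padding=1)
print(ssd_head(backbone_out).shape)                # torch.Size([2, 24, 4, 4])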

Anchor boxes [35:04]

Let’s imagine that we had another Conv2d(stride=2); then we would have a 2x2x(4+c) tensor. Basically, it is creating a grid that looks something like this:

This is what the geometry of the activations of the second extra stride 2 convolutional layer looks like. Remember, a stride 2 convolution does the same thing to the geometry of the activations as a stride 1 convolution followed by max pooling, assuming the padding is right (see the quick check below).
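A quick shape check of that claim (my own sketch, not from the notebook):

import torch
import torch.nn as nn

x = torch.randn(1, 256, 4, 4)

stride2 = nn.Conv2d(256, 256, 3, stride=2, padding=1)
stride1_then_pool = nn.Sequential(
    nn.Conv2d(256, 256, 3, stride=1, padding=1),
    nn.MaxPool2d(2),
)
print(stride2(x).shape)            # torch.Size([1, 256, 2, 2])
print(stride1_then_pool(x).shape)  # torch.Size([1, 256, 2, 2])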

Let’s talk about what we might do here [36:09]. We want each of these grid cells to be responsible for finding the largest object in that part of the image.

Receptive Field [37:20]

Why do we care about the idea that we would like each convolutional grid cell to be responsible for finding things that are in the corresponding part of the image? The reason is because of something called the receptive field of that convolutional grid cell. The basic idea is that throughout your convolutional layers, every piece of those tensors has a receptive field which means which part of the input image was responsible for calculating that cell. Like all things in life, the easiest way to see this is with Excel [38:01].

Take a single activation (in this case, in the max pooling layer) and let’s see where it came from [38:45]. In Excel you can do Formulas → Trace Precedents. Tracing all the way back to the input layer, you can see that it came from this 6 x 6 portion of the image (as well as the filters). What is more, the middle cells have lots of weights coming out of them, whereas the cells on the outside only have one weight coming out. So we call these 6 x 6 cells the receptive field of the one activation we picked.

3x3 convolution with opacity 15% — clearly the center of the box has more dependencies

Note that the receptive field is not just saying it’s this box, but also that the center of the box has more dependencies [40:27]. This is a critically important concept when it comes to understanding architectures and understanding why conv nets work the way they do.
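A small PyTorch experiment of my own that mirrors the Excel “Trace Precedents” trick, using a toy architecture (two 3x3 convolutions rather than the spreadsheet’s conv plus max pooling): backprop from a single activation and see which input pixels receive a nonzero gradient; that region is its receptive field.

import torch
import torch.nn as nn

x = torch.randn(1, 1, 8, 8, requires_grad=True)
net = nn.Sequential(
    nn.Conv2d(1, 1, 3),   # 3x3 convolution, stride 1, no padding
    nn.Conv2d(1, 1, 3),   # a second 3x3 convolution
)
out = net(x)                         # shape: 1 x 1 x 4 x 4
out[0, 0, 1, 1].backward()           # pick one activation and trace its precedents
print((x.grad[0, 0] != 0).int())     # a 5x5 block of ones: that activation's receptive field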

Architecture [41:18]

The architecture is, we will have a ResNet backbone followed by one or more 2D convolutions (one for now) which is going to give us a 4x4 grid.

class StdConv(nn.Module):
    def __init__(self, nin, nout, stride=2, drop=0.1):
        super().__init__()
        self.conv = nn.Conv2d(nin, nout, 3, stride=stride, padding=1)
        self.bn = nn.BatchNorm2d(nout)
        self.drop = nn.Dropout(drop)

    def forward(self, x):
        return self.drop(self.bn(F.relu(self.conv(x))))

def flatten_conv(x,k):
    bs,nf,gx,gy = x.size()
    x = x.permute(0,2,3,1).contiguous()
    return x.view(bs,-1,nf//k)

class OutConv(nn.Module):
    def __init__(self, k, nin, bias):
        super().__init__()
        self.k = k
        self.oconv1 = nn.Conv2d(nin, (len(id2cat)+1)*k, 3, padding=1)
        self.oconv2 = nn.Conv2d(nin, 4*k, 3, padding=1)
        self.oconv1.bias.data.zero_().add_(bias)

    def forward(self, x):
        return [flatten_conv(self.oconv1(x), self.k),
                flatten_conv(self.oconv2(x), self.k)]

class SSD_Head(nn.Module):
    def __init__(self, k, bias):
        super().__init__()
        self.drop = nn.Dropout(0.25)
        self.sconv0 = StdConv(512,256, stride=1)
        self.sconv2 = StdConv(256,256)
        self.out = OutConv(k, 256, bias)

    def forward(self, x):
        x = self.drop(F.relu(x))
        x = self.sconv0(x)
        x = self.sconv2(x)
        return self.out(x)

head_reg4 = SSD_Head(k, -3.)
models = ConvnetBuilder(f_model, 0, 0, 0, custom_head=head_reg4)
learn = ConvLearner(md, models)
learn.opt_fn = optim.Adam

SSD_Head

  1. We start with ReLU and dropout
  2. Then stride 1 convolution. The reason we start with a stride 1 convolution is because that does not change the geometry at all — it just lets us add an extra layer of calculation. It lets us create not just a linear layer but now we have a little mini neural network in our custom head. StdConv is defined above — it does convolution, ReLU, BatchNorm, and dropout. Most research code you see won’t define a class like this, instead they write the entire thing again and again. Don’t be like that. Duplicate code leads to errors and poor understanding.
  3. Stride 2 convolution [44:56]
  4. At the end, the output of step 3 is 4x4, which gets passed to OutConv. OutConv has two separate convolutional layers, each of stride 1, so it does not change the geometry of the input. One of them has length equal to the number of classes (ignore k for now; the +1 is for “background”, i.e. no object was detected), and the other’s length is 4. Rather than having a single conv layer that outputs 4+c, let’s have two conv layers and return their outputs in a list. This allows these layers to specialize just a little bit. We talked about the idea that when you have multiple tasks, they can share layers, but they do not have to share all the layers. In this case, our two tasks of classification and bounding box regression share every single layer except the very last one.
  5. At the end, we flatten out the convolution because Jeremy wrote the loss function to expect flattened out tensor, but we could totally rewrite it to not do that.

Fastai Coding Style [42:58]

The first draft was released this week. It is very heavily oriented towards the idea of expository programming, which is the idea that programming code should be something you can use to explain an idea, ideally as readily as mathematical notation, to somebody who understands your coding approach. The idea goes back a very long way, but it was best described in the 1979 Turing Award lecture by probably Jeremy’s greatest computer science hero, Ken Iverson. He had been working on it since well before 1964, but 1964 was the first release of this approach to programming, a language called APL, and 25 years later he won the Turing Award. He then passed on the baton to his son Eric Iverson. The fastai style guide is an attempt at taking some of these ideas.

Loss Function [47:44]

The loss function needs to look at each of these 16 sets of activations, each of which has four bounding box coordinates and c+1 class probabilities and decide if those activations are close or far away from the object which is the closest to this grid cell in the image. If nothing is there, then whether it is predicting background correctly. That turns out to be very hard to do.

Matching Problem [48:43]

The loss function needs to take each of the objects in the image and match them to one of these convolutional grid cells to say “this grid cell is responsible for this particular object”, so then it can go ahead and say “okay, how close are the 4 coordinates and how close are the class probabilities?”

Here is our goal [49:56]:

Our dependent variable looks like the one on the left, and our final convolutional layer is going to be 4x4x(c+1) in this case c=20. We then flatten that out into a vector. Our goal is to come up with a function which takes in a dependent variable and also some particular set of activations that ended up coming out of the model and returns a higher number if these activations are not a good reflection of the ground truth bounding boxes; or a lower number if it is a good reflection.

Testing [51:58]

x,y = next(iter(md.val_dl))
x,y = V(x),V(y)
learn.model.eval()
batch = learn.model(x)
b_clas,b_bb = batch
b_clas.size(),b_bb.size()
(torch.Size([64, 16, 21]), torch.Size([64, 16, 4]))

Make sure these shapes make sense. Let’s now look at the ground truth y [53:24]:

idx=7
b_clasi = b_clas[idx]
b_bboxi = b_bb[idx]
ima=md.val_ds.ds.denorm(to_np(x))[idx]
bbox,clas = get_y(y[0][idx], y[1][idx])
bbox,clas
(Variable containing:
0.6786 0.4866 0.9911 0.6250
0.7098 0.0848 0.9911 0.5491
0.5134 0.8304 0.6696 0.9063
[torch.cuda.FloatTensor of size 3x4 (GPU 0)], Variable containing:
8
10
17
[torch.cuda.LongTensor of size 3 (GPU 0)])

Note that bounding box coordinates have been scaled to between 0 and 1 — basically we are treating the image as being 1x1, so they are relative to the size of the image.

We already have a show_ground_truth function. This torch_gt (gt: ground truth) function simply converts tensors into numpy arrays and passes them to it.

def torch_gt(ax, ima, bbox, clas, prs=None, thresh=0.4):
    return show_ground_truth(ax, ima, to_np((bbox*224).long()),
        to_np(clas), to_np(prs) if prs is not None else None, thresh)

fig, ax = plt.subplots(figsize=(7,7))
torch_gt(ax, ima, bbox, clas)

The above is a ground truth. Here is our 4x4 grid cells from our final convolutional layer [54:44]:

fig, ax = plt.subplots(figsize=(7,7))
torch_gt(ax, ima, anchor_cnr, b_clasi.max(1)[1])

Each of these square boxes, different papers call them different things. The three terms you’ll hear are: anchor boxes, prior boxes, or default boxes. We will stick with the term anchor boxes.

What we are going to do for this loss function is go through a matching problem: we take every one of these 16 boxes and see which one of the three ground truth objects has the highest amount of overlap with a given square [55:21]. To do this, we need some way of measuring the amount of overlap, and the standard function for this is the Jaccard index (IoU).

We are going to go through and find the Jaccard overlap for each one of the three objects versus each of the 16 anchor boxes [57:11]. That is going to give us a 3x16 matrix.
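The notebook uses a vectorized jaccard() helper; just to make the quantity concrete, here is a minimal single-pair IoU sketch of my own for corner-format boxes [top, left, bottom, right]:

def box_area(b):
    # area of a box given as [top, left, bottom, right]
    return (b[2] - b[0]) * (b[3] - b[1])

def iou(a, b):
    # intersection rectangle of the two boxes
    top, left = max(a[0], b[0]), max(a[1], b[1])
    bottom, right = min(a[2], b[2]), min(a[3], b[3])
    inter = max(bottom - top, 0) * max(right - left, 0)
    union = box_area(a) + box_area(b) - inter
    return inter / union

print(iou([0., 0., 2., 2.], [1., 1., 3., 3.]))  # 1 / 7 ≈ 0.143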

Here are the coordinates of all of our anchor boxes (centers, height, width):

anchors
Variable containing:
0.1250 0.1250 0.2500 0.2500
0.1250 0.3750 0.2500 0.2500
0.1250 0.6250 0.2500 0.2500
0.1250 0.8750 0.2500 0.2500
0.3750 0.1250 0.2500 0.2500
0.3750 0.3750 0.2500 0.2500
0.3750 0.6250 0.2500 0.2500
0.3750 0.8750 0.2500 0.2500
0.6250 0.1250 0.2500 0.2500
0.6250 0.3750 0.2500 0.2500
0.6250 0.6250 0.2500 0.2500
0.6250 0.8750 0.2500 0.2500
0.8750 0.1250 0.2500 0.2500
0.8750 0.3750 0.2500 0.2500
0.8750 0.6250 0.2500 0.2500
0.8750 0.8750 0.2500 0.2500
[torch.cuda.FloatTensor of size 16x4 (GPU 0)]

Here are the amount of overlap between 3 ground truth objects and 16 anchor boxes:

overlaps = jaccard(bbox.data, anchor_cnr.data)
overlaps

Columns 0 to 7
 0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
 0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
 0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000

Columns 8 to 15
 0.0000  0.0091  0.0922  0.0000  0.0000  0.0315  0.3985  0.0000
 0.0356  0.0549  0.0103  0.0000  0.2598  0.4538  0.0653  0.0000
 0.0000  0.0000  0.0000  0.1897  0.0000  0.0000  0.0000  0.0000
[torch.cuda.FloatTensor of size 3x16 (GPU 0)]

What we can do now is take the max over dimension 1 (row-wise), which tells us, for each ground truth object, the maximum amount by which it overlaps with some grid cell, as well as the index of that cell:

overlaps.max(1)
(
0.3985
0.4538
0.1897
[torch.cuda.FloatTensor of size 3 (GPU 0)],
14
13
11
[torch.cuda.LongTensor of size 3 (GPU 0)])

We are also going to look at the max over dimension 0 (column-wise), which tells us the maximum amount of overlap for each grid cell across all of the ground truth objects [59:08]:

overlaps.max(0)
(
0.0000
0.0000
0.0000
0.0000
0.0000
0.0000
0.0000
0.0000
0.0356
0.0549
0.0922
0.1897
0.2598
0.4538
0.3985
0.0000
[torch.cuda.FloatTensor of size 16 (GPU 0)],
0
0
0
0
0
0
0
0
1
1
0
2
1
1
0
0
[torch.cuda.LongTensor of size 16 (GPU 0)])

What is particularly interesting here is that it tells us, for every grid cell, the index of the ground truth object which overlaps with it the most. Zero is a bit overloaded here: it could either mean the amount of overlap is zero, or that the cell’s largest overlap is with object index zero. It turns out not to matter, but just FYI.

There is a function called map_to_ground_truth which we will not worry about too much for now [59:57]. It is super simple code, but it is slightly awkward to think about. Basically, what it does is combine these two sets of overlaps, in the way described in the SSD paper, to assign every anchor box to a ground truth object. Each of the three row-wise maxes gets assigned as is. The rest of the anchor boxes get assigned to anything with which they have an overlap of at least 0.5 (column-wise). If neither applies, the cell is considered to contain background.
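For reference, the logic is roughly the following (a sketch reconstructed from the description above and the output below, so treat the details as approximate): take both sets of maxes, force-assign each ground truth object's best anchor to it with an artificially high overlap (1.99), and leave every other anchor assigned to whichever object it overlaps most; the 0.5 threshold is applied afterwards.

def map_to_ground_truth(overlaps, print_it=False):
    prior_overlap, prior_idx = overlaps.max(1)   # best anchor for each ground truth object
    gt_overlap, gt_idx = overlaps.max(0)         # best object for each anchor box
    gt_overlap[prior_idx] = 1.99                 # force-assign each object's best anchor
    for i, o in enumerate(prior_idx): gt_idx[o] = i
    if print_it: print(gt_overlap, gt_idx)
    return gt_overlap, gt_idx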

gt_overlap,gt_idx = map_to_ground_truth(overlaps)
gt_overlap,gt_idx
(
0.0000
0.0000
0.0000
0.0000
0.0000
0.0000
0.0000
0.0000
0.0356
0.0549
0.0922
1.9900
0.2598
1.9900
1.9900
0.0000
[torch.cuda.FloatTensor of size 16 (GPU 0)],
0
0
0
0
0
0
0
0
1
1
0
2
1
1
0
0
[torch.cuda.LongTensor of size 16 (GPU 0)])

Now you can see a list of all the assignments [1:01:05]. Anywhere that has gt_overlap < 0.5 gets assigned background. The three row-wise max anchor boxes have a high number (1.99) to force the assignment. Now we can convert these values to classes:

gt_clas = clas[gt_idx]; gt_clas
Variable containing:
8
8
8
8
8
8
8
8
10
10
8
17
10
10
8
8
[torch.cuda.LongTensor of size 16 (GPU 0)]

Then we add a threshold and finally come up with the three anchor boxes that will be predicting an object:

thresh = 0.5
pos = gt_overlap > thresh
pos_idx = torch.nonzero(pos)[:,0]
neg_idx = torch.nonzero(1-pos)[:,0]
pos_idx
11
13
14
[torch.cuda.LongTensor of size 3 (GPU 0)]

And here are what each of these anchor boxes is meant to be predicting:

gt_clas[1-pos] = len(id2cat)
[id2cat[o] if o<len(id2cat) else 'bg' for o in gt_clas.data]
['bg',
'bg',
'bg',
'bg',
'bg',
'bg',
'bg',
'bg',
'bg',
'bg',
'bg',
'sofa',
'bg',
'diningtable',
'chair',
'bg']

So that was the matching stage [1:02:29]. For L1 loss, we can:

  1. take the activations which matched (pos_idx = [11, 13, 14])
  2. subtract from those the ground truth bounding boxes
  3. take the absolute value of the difference
  4. take the mean of that.

For classification, we can just do cross entropy:

gt_bbox = bbox[gt_idx]
loc_loss = ((a_ic[pos_idx] - gt_bbox[pos_idx]).abs()).mean()
clas_loss = F.cross_entropy(b_clasi, gt_clas)
loc_loss,clas_loss
(Variable containing:
1.00000e-02 *
6.5887
[torch.cuda.FloatTensor of size 1 (GPU 0)], Variable containing:
1.0331
[torch.cuda.FloatTensor of size 1 (GPU 0)])

We will end up with 16 predicted bounding boxes, most of them will be background. If you are wondering what it predicts in terms of bounding box of background, the answer is it totally ignores it.

fig, axes = plt.subplots(3, 4, figsize=(16, 12))
for idx,ax in enumerate(axes.flat):
    ima = md.val_ds.ds.denorm(to_np(x))[idx]
    bbox,clas = get_y(y[0][idx], y[1][idx])
    ima = md.val_ds.ds.denorm(to_np(x))[idx]
    bbox,clas = get_y(bbox,clas); bbox,clas
    a_ic = actn_to_bb(b_bb[idx], anchors)
    torch_gt(ax, ima, a_ic, b_clas[idx].max(1)[1],
             b_clas[idx].max(1)[0].sigmoid(), 0.01)
plt.tight_layout()

Tweak 1. How do we interpret the activations [1:04:16]?

The way we interpret the activation is defined here:

def actn_to_bb(actn, anchors):
    actn_bbs = torch.tanh(actn)
    actn_centers = (actn_bbs[:,:2]/2 * grid_sizes) + anchors[:,:2]
    actn_hw = (actn_bbs[:,2:]/2+1) * anchors[:,2:]
    return hw2corners(actn_centers, actn_hw)

We grab the activations and stick them through tanh (remember tanh is the same shape as sigmoid except it is scaled to be between -1 and 1), which forces them to be within that range. We then grab the actual position of the anchor boxes, and we move them around according to the value of the activations divided by two (actn_bbs[:,:2]/2). In other words, each predicted bounding box can be moved by up to 50% of a grid size from its default position. Similarly for its height and width: per the code above, each can be scaled to between half and one and a half times its default size.
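A tiny worked example (my own numbers) of the extremes for a 4x4-grid anchor centered at (0.375, 0.375):

import torch

grid_size = 0.25                                   # 4x4 grid, so each cell is 0.25 wide
anchor_ctr = torch.tensor([0.375, 0.375])          # one anchor's center
anchor_hw  = torch.tensor([0.250, 0.250])          # and its height/width

for t in (-1.0, 0.0, 1.0):                         # tanh(actn) at its extremes
    ctr = (t / 2) * grid_size + anchor_ctr         # center moves by at most half a cell
    hw  = (t / 2 + 1) * anchor_hw                  # height/width scaled into [0.5x, 1.5x]
    print(t, ctr.tolist(), hw.tolist())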

Tweak 2. We actually use binary cross entropy loss instead of cross entropy [1:05:36]

class BCE_Loss(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.num_classes = num_classes

    def forward(self, pred, targ):
        t = one_hot_embedding(targ, self.num_classes+1)
        t = V(t[:,:-1].contiguous())#.cpu()
        x = pred[:,:-1]
        w = self.get_weight(x,t)
        return F.binary_cross_entropy_with_logits(x, t, w,
            size_average=False)/self.num_classes

    def get_weight(self,x,t): return None

Binary cross entropy is what we normally use for multi-label classification. As in the Planet satellite competition, each satellite image could have multiple things in it. If it has multiple things in it, you cannot use softmax, because softmax really encourages just one thing to have a high number. In our case, each anchor box can only have one object associated with it, so it is not for that reason that we are avoiding softmax. It is something else: it is possible for an anchor box to have nothing associated with it. There are two ways to handle this idea of “background”. One would be to say background is just a class, so let’s use softmax and treat background as one of the classes that softmax could predict. A lot of people have done it this way. But that is a really hard thing to ask a neural network to do [1:06:52]: it is basically asking “does this grid cell not have any of the 20 objects that I am interested in with a Jaccard overlap of more than 0.5?” That is a really hard thing to put into a single computation. On the other hand, what if we just asked for each class: “is it a motorbike?”, “is it a bus?”, “is it a person?”, etc., and if all the answers are no, consider it background. That is the way we do it here. It is not that we can have multiple true labels; it is that we can have zero.

In forward :

  1. First we take the one hot embedding of the target (at this stage, we do still have the idea of background)
  2. Then we remove the background column (the last one), which results in a vector of either all zeros or a single one (see the sketch below)
  3. Use binary cross entropy to compare that with the predictions
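For concreteness, here is a minimal stand-in of my own for one_hot_embedding (the fastai helper may differ in details), showing how the background row turns into all zeros once the last column is dropped:

import torch

def one_hot_embedding(labels, num_classes):
    # minimal stand-in for the fastai helper (which may differ in details)
    return torch.eye(num_classes)[labels]

targ = torch.tensor([8, 20])     # class 8, and 20 meaning "background" (len(id2cat) == 20 assumed)
t = one_hot_embedding(targ, 21)  # num_classes + 1 columns, including background
print(t[:, :-1])                 # drop the background column: the background row is all zeros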

This is a minor tweak, but it is the kind of minor tweak that Jeremy wants you to think about and understand because it makes a really big difference to your training and when there is some increment over a previous paper, it would be something like this [1:08:25]. It is important to understand what this is doing and more importantly why.

So now we have [1:09:39]:

  • A custom loss function
  • A way to calculate Jaccard index
  • A way to convert activations to bounding box
  • A way to map anchor boxes to ground truth

Now all that’s left is the SSD loss function.

SSD Loss Function [1:09:55]

def ssd_1_loss(b_c,b_bb,bbox,clas,print_it=False):
    bbox,clas = get_y(bbox,clas)
    a_ic = actn_to_bb(b_bb, anchors)
    overlaps = jaccard(bbox.data, anchor_cnr.data)
    gt_overlap,gt_idx = map_to_ground_truth(overlaps,print_it)
    gt_clas = clas[gt_idx]
    pos = gt_overlap > 0.4
    pos_idx = torch.nonzero(pos)[:,0]
    gt_clas[1-pos] = len(id2cat)
    gt_bbox = bbox[gt_idx]
    loc_loss = ((a_ic[pos_idx] - gt_bbox[pos_idx]).abs()).mean()
    clas_loss = loss_f(b_c, gt_clas)
    return loc_loss, clas_loss

def ssd_loss(pred,targ,print_it=False):
    lcs,lls = 0.,0.
    for b_c,b_bb,bbox,clas in zip(*pred,*targ):
        loc_loss,clas_loss = ssd_1_loss(b_c,b_bb,bbox,clas,print_it)
        lls += loc_loss
        lcs += clas_loss
    if print_it: print(f'loc: {lls.data[0]}, clas: {lcs.data[0]}')
    return lls+lcs

The ssd_loss function, which is what we set as the criterion, loops through each image in the mini-batch and calls the ssd_1_loss function (i.e. SSD loss for one image).

ssd_1_loss is where it all happens. It begins by destructuring bbox and clas. Let’s take a closer look at get_y [1:10:38]:

def get_y(bbox,clas):
    bbox = bbox.view(-1,4)/sz
    bb_keep = ((bbox[:,2]-bbox[:,0])>0).nonzero()[:,0]
    return bbox[bb_keep],clas[bb_keep]

A lot of code you find on the internet does not work with mini-batches. It only does one thing at a time, which we don’t want. In this case, all these functions (get_y, actn_to_bb, map_to_ground_truth) are working on, not exactly a mini-batch at a time, but a whole bunch of ground truth objects at a time. The data loader is fed a mini-batch at a time to do the convolutional layers. Because we can have different numbers of ground truth objects in each image, but a tensor has to be a strict rectangular shape, fastai automatically pads any target values that are shorter with zeros [1:11:08]. This was something that was added recently and is super handy, but it does mean that you then have to make sure you get rid of those zeros. So get_y gets rid of any bounding boxes that are just padding (a small illustration follows).
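As a quick illustration (my own toy tensors, not the notebook's), here is what that padding looks like and what get_y keeps:

import torch

sz = 224
bbox = torch.tensor([[  0.,  49., 205., 180.],   # a real box
                     [  0.,   0.,   0.,   0.],   # zero padding added by fastai
                     [  0.,   0.,   0.,   0.]])  # more padding
clas = torch.tensor([14, 0, 0])

bbox = bbox.view(-1, 4) / sz
bb_keep = ((bbox[:, 2] - bbox[:, 0]) > 0).nonzero()[:, 0]
print(bbox[bb_keep], clas[bb_keep])              # only the real box and its class survive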

  1. Get rid of the padding
  2. Turn the activations to bounding boxes
  3. Do the Jaccard
  4. Do map_to_ground_truth
  5. Check that there is an overlap greater than something around 0.4~0.5 (different papers use different values for this)
  6. Find the indices of things that matched
  7. Assign background class for the ones that did not match
  8. Then finally get L1 loss for the localization, binary cross entropy loss for the classification, and return them which gets added in ssd_loss

Training [1:12:47]

learn.crit = ssd_loss
lr = 3e-3
lrs = np.array([lr/100,lr/10,lr])
learn.lr_find(lrs/1000,1.)
learn.sched.plot(1)
epoch trn_loss val_loss
0 44.232681 21476.816406
learn.lr_find(lrs/1000,1.)
learn.sched.plot(1)
epoch trn_loss val_loss
0 86.852668 32587.789062
learn.fit(lr, 1, cycle_len=5, use_clr=(20,10))
epoch      trn_loss   val_loss
0 45.570843 37.099854
1 37.165911 32.165031
2 33.27844 30.990122
3 31.12054 29.804482
4 29.305789 28.943184
[28.943184]

learn.fit(lr, 1, cycle_len=5, use_clr=(20,10))
epoch      trn_loss   val_loss
0 43.726979 33.803085
1 34.771754 29.012939
2 30.591864 27.132868
3 27.896905 26.151638
4 25.907382 25.739273
[25.739273]

learn.save('0')
learn.load('0')

Result [1:13:16]

In practice, we want to remove the background and also add some threshold for the probabilities, but it is on the right track. For the potted plant image, the result is not surprising, as all of our anchor boxes were small (a 4x4 grid). To go from here to something more accurate, all we are going to do is create way more anchor boxes.

Question: For the multi-label classification, why aren’t we multiplying the categorical loss by a constant like we did before [1:15:20]? Great question. It is because later on it will turn out we do not need to.

More anchors! [1:14:47]

There are 3 ways to do this:

  1. Create anchor boxes of different sizes (zoom):
From left: 1x1, 2x2, and 4x4 grids of anchor boxes. Notice that some of the anchor boxes are bigger than the original image.

2. Create anchor boxes of different aspect ratios:

3. Use more convolutional layers as sources of anchor boxes (the boxes are randomly jittered so that we can see ones that are overlapping [1:16:28]):

Combining these approaches, you can create lots of anchor boxes (Jeremy said he wouldn’t print it, but here it is):

anc_grids = [4, 2, 1]
anc_zooms = [0.75, 1., 1.3]
anc_ratios = [(1., 1.), (1., 0.5), (0.5, 1.)]

anchor_scales = [(anz*i,anz*j) for anz in anc_zooms
for (i,j) in anc_ratios]
k = len(anchor_scales)
anc_offsets = [1/(o*2) for o in anc_grids]
anc_x = np.concatenate([np.repeat(np.linspace(ao, 1-ao, ag), ag)
for ao,ag in zip(anc_offsets,anc_grids)])
anc_y = np.concatenate([np.tile(np.linspace(ao, 1-ao, ag), ag)
for ao,ag in zip(anc_offsets,anc_grids)])
anc_ctrs = np.repeat(np.stack([anc_x,anc_y], axis=1), k, axis=0)
anc_sizes = np.concatenate([np.array([[o/ag,p/ag]
for i in range(ag*ag) for o,p in anchor_scales])
for ag in anc_grids])
grid_sizes = V(np.concatenate([np.array([ 1/ag
for i in range(ag*ag) for o,p in anchor_scales])
for ag in anc_grids]),
requires_grad=False).unsqueeze(1)
anchors = V(np.concatenate([anc_ctrs, anc_sizes], axis=1),
requires_grad=False).float()
anchor_cnr = hw2corners(anchors[:,:2], anchors[:,2:])

anchors : centers plus height and width

anchor_cnr : top left and bottom right corners
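For reference, hw2corners just converts (center, height/width) format into (top-left, bottom-right) corners; here is a minimal sketch of my own that behaves the same way:

import torch

def hw2corners(ctr, hw):
    # convert (center, height/width) to (top-left, bottom-right) corners
    return torch.cat([ctr - hw/2, ctr + hw/2], dim=1)

print(hw2corners(torch.tensor([[0.125, 0.125]]), torch.tensor([[0.25, 0.25]])))
# tensor([[0.0000, 0.0000, 0.2500, 0.2500]])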

Review of key concept [1:18:00]

  • We have a vector of ground truth (sets of 4 bounding box coordinates and a class)
  • We have a neural net that takes some input and spits out some output activations
  • Compare the activations and the ground truth, calculate a loss, find the derivative of that, and adjust weights according to the derivative times a learning rate.
  • We need a loss function that can take ground truth and activation and spit out a number that says how good these activations are. To do this, we need to take each one of m ground truth objects and decide which set of (4+c) activations is responsible for that object [1:21:58] — which one we should be comparing to decide whether the class is correct and bounding box is close or not (matching problem).
  • Since we are using the SSD approach, it is not arbitrary which ones we match up [1:23:18]. We want to match up the set of activations whose receptive field has the maximum density from where the real object is.
  • The loss function needs to be a consistent task. If in the first image the top left object corresponds with the first 4+c activations, and in the second image we threw things around and suddenly it is going with the last 4+c activations, the neural net doesn’t know what to learn.
  • Once the matching problem is resolved, the rest is just the same as single object detection.

Architectures:

  • YOLO — the last layer is fully connected (no concept of geometry)
  • SSD — the last layer is convolutional

k (zooms x ratios)[1:29:39]

For every grid cell, which can be of different sizes, we can have different aspect ratios and zooms representing different anchor boxes, which are just conceptual ideas: every one of the anchor boxes is associated with one set of 4+c activations in our model. So however many anchor boxes we have, we need that many times (4+c) activations. That does not mean that each convolutional layer needs that many activations, because the 4x4 convolutional layer already has 16 sets of activations, the 2x2 layer has 4 sets, and finally the 1x1 has one set. So we basically get 1 + 4 + 16 for free. We therefore only need to know k, where k is the number of zooms times the number of aspect ratios; the grids we get for free through our architecture (see the arithmetic check below).
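A quick arithmetic check (my own sketch) of where those numbers come from:

anc_grids  = [4, 2, 1]
anc_zooms  = [0.75, 1., 1.3]
anc_ratios = [(1., 1.), (1., 0.5), (0.5, 1.)]

k = len(anc_zooms) * len(anc_ratios)      # 9 anchor shapes per grid cell
cells = sum(g*g for g in anc_grids)       # 16 + 4 + 1 = 21 cells from the architecture
print(k, cells, cells * k)                # 9 21 189 anchor boxes in total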

Model Architecture [1:31:10]

drop=0.4

class SSD_MultiHead(nn.Module):
    def __init__(self, k, bias):
        super().__init__()
        self.drop = nn.Dropout(drop)
        self.sconv0 = StdConv(512,256, stride=1, drop=drop)
        self.sconv1 = StdConv(256,256, drop=drop)
        self.sconv2 = StdConv(256,256, drop=drop)
        self.sconv3 = StdConv(256,256, drop=drop)
        self.out1 = OutConv(k, 256, bias)
        self.out2 = OutConv(k, 256, bias)
        self.out3 = OutConv(k, 256, bias)

    def forward(self, x):
        x = self.drop(F.relu(x))
        x = self.sconv0(x)
        x = self.sconv1(x)
        o1c,o1l = self.out1(x)
        x = self.sconv2(x)
        o2c,o2l = self.out2(x)
        x = self.sconv3(x)
        o3c,o3l = self.out3(x)
        return [torch.cat([o1c,o2c,o3c], dim=1),
                torch.cat([o1l,o2l,o3l], dim=1)]

head_reg4 = SSD_MultiHead(k, -4.)
models = ConvnetBuilder(f_model, 0, 0, 0, custom_head=head_reg4)
learn = ConvLearner(md, models)
learn.opt_fn = optim.Adam

The model is nearly identical to what we had before. But we have a number of stride 2 convolutions which is going to take us through to 4x4, 2x2, and 1x1 (each stride 2 convolution halves our grid size in both directions).

  • After we do our first convolution to get to 4x4, we will grab a set of outputs from that because we want to save away the 4x4 anchors.
  • Once we get to 2x2, we grab another set of now 2x2 anchors
  • Then finally we get to 1x1
  • We then concatenate them all together, which gives us the correct number of activations (one activation for every anchor box).

Training [1:32:50]

learn.crit = ssd_loss
lr = 1e-2
lrs = np.array([lr/100,lr/10,lr])
learn.lr_find(lrs/1000,1.)
learn.sched.plot(n_skip_end=2)
learn.fit(lrs, 1, cycle_len=4, use_clr=(20,8))
epoch      trn_loss   val_loss
0 15.124349 15.015433
1 13.091956 10.39855
2 11.643629 9.4289
3 10.532467 8.822998
[8.822998]

learn.save('tmp')

learn.freeze_to(-2)
learn.fit(lrs/2, 1, cycle_len=4, use_clr=(20,8))
epoch trn_loss val_loss
0 9.821056 10.335152
1 9.419633 11.834093
2 8.78818 7.907762
3 8.219976 7.456364
[7.4563637]

x,y = next(iter(md.val_dl))
y = V(y)
batch = learn.model(V(x))
b_clas,b_bb = batch
x = to_np(x)

fig, axes = plt.subplots(3, 4, figsize=(16, 12))
for idx,ax in enumerate(axes.flat):
    ima = md.val_ds.ds.denorm(x)[idx]
    bbox,clas = get_y(y[0][idx], y[1][idx])
    a_ic = actn_to_bb(b_bb[idx], anchors)
    torch_gt(ax, ima, a_ic, b_clas[idx].max(1)[1],
             b_clas[idx].max(1)[0].sigmoid(), 0.2)
plt.tight_layout()

Here, we printed out the detections with a probability of at least 0.2. Some of them look pretty hopeful, but others not so much.

History of object detection [1:33:43]

Scalable Object Detection using Deep Neural Networks

  • When people refer to the multi-box method, they are talking about this paper.
  • This was the paper that came up with the idea that we can have a loss function that has this matching process and then you can use that to do object detection. So everything since that time has been trying to figure out how to make this better.

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

  • In parallel, Ross Girshick was going down a totally different direction. He had a two-stage process where the first stage used classical computer vision approaches to find edges and changes of gradients to guess which parts of the image might represent distinct objects. Then each of those was fed into a convolutional neural network which was basically designed to figure out whether it is the kind of object we are interested in.
  • R-CNN and Fast R-CNN are hybrids of traditional computer vision and deep learning.
  • What Ross and his team then did was take the multibox idea and replace the traditional non-deep-learning computer vision part of their two-stage process with a conv net. So now they had two conv nets: one for region proposals (all of the things that might be objects), and the second part was the same as his earlier work.

You Only Look Once: Unified, Real-Time Object Detection

SSD: Single Shot MultiBox Detector

  • At a similar time, these two papers came out. Both of them did something pretty cool, which is that they achieved similar performance to Faster R-CNN but with one stage.
  • They took the multibox idea and tried to figure out how to deal with the messy outputs. The basic ideas were to use, for example, hard negative mining, where they would go through and find all of the matches that did not look that good and throw them away, to use very tricky and complex data augmentation methods, and all kinds of hackery. But they got them to work pretty well.

Focal Loss for Dense Object Detection (RetinaNet)

  • Then something really cool happened late last year which is this thing called focal loss.
  • They actually realized why this messy approach wasn’t working. When we look at an image, there are 3 different granularities of convolutional grid (4x4, 2x2, 1x1) [1:37:28]. The 1x1 grid is quite likely to have a reasonable overlap with some object, because most photos have some kind of main subject. On the other hand, in the 4x4 grid, most of the 16 anchor boxes are not going to have much of an overlap with anything. So if somebody said to you “$20 bet, what do you reckon this little clip is?” and you are not sure, you would say “background”, because most of the time it is the background.

Question: I understand why we have a 4x4 grid of receptive fields with 1 anchor box each to coarsely localize objects in the image. But what I think I’m missing is why we need multiple receptive fields at different sizes. The first version already included 16 receptive fields, each with a single anchor box associated. With the additions, there are now many more anchor boxes to consider. Is this because you constrained how much a receptive field could move or scale from its original size? Or is there another reason? [1:38:47] It is kind of backwards. The reason Jeremy did the constraining was because he knew he was going to be adding more boxes later. But really, the reason is that the Jaccard overlap between one of those 4x4 grid cells and a picture where a single object that takes up most of the image is never going to be 0.5. The intersection is much smaller than the union because the object is too big. So for this general idea to work where we are saying you are responsible for something that you have better than 50% overlap with, we need anchor boxes which will on a regular basis have a 50% or higher overlap which means we need to have a variety of sizes, shapes, and scales. This all happens in the loss function. The vast majority of the interesting stuff in all of the object detection is the loss function.

Focal Loss [1:40:38]

The key thing is this very first picture. The blue line is the binary cross entropy loss. If the answer is not a motorbike [1:41:46], and I say “I think it’s not a motorbike and I am 60% sure”, then on the blue line the loss is still about 0.5, which is pretty bad. So if we want to get our loss down, then for all these things which are actually background, we have to be saying “I am sure that is background”, “I am sure it’s not a motorbike, or a bus, or a person”, because if I don’t say I am sure it is not any of these things, then I still get loss.

That is why the motorbike example did not work [1:42:39]. Because even when it gets to lower right corner and it wants to say “I think it’s a motorbike”, there is no payoff for it to say so. If it is wrong, it gets killed. And the vast majority of the time, it is background. Even if it is not background, it is not enough just to say “it’s not background” — you have to say which of the 20 things it is.

So the trick is to trying to find a different loss function [1:44:00] that looks more like the purple line. Focal loss is literally just a scaled cross entropy loss. Now if we say “I’m .6 sure it’s not a motorbike” then the loss function will say “good for you! no worries” [1:44:42].

The actual contribution of this paper is to add (1 − pt)^γ to the start of the equation [1:45:06] which sounds like nothing but actually people have been trying to figure out this problem for years. When you come across a paper like this which is game-changing, you shouldn’t assume you are going to have to write thousands of lines of code. Very often it is one line of code, or the change of a single constant, or adding log to a single place.
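To make that concrete, here is a quick numeric check (mine) of the motorbike example, ignoring the alpha weighting:

import math

p_correct = 0.6                       # "60% sure it's not a motorbike" on a background anchor
bce = -math.log(p_correct)            # plain cross entropy: ~0.51, still a sizable loss
gamma = 2.
focal = (1 - p_correct)**gamma * bce  # focal scaling: ~0.08, nearly "no worries"
print(bce, focal)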

A couple of terrific things about this paper [1:46:08]:

  • Equations are written in a simple manner
  • They “refactor”

Implementing Focal Loss [1:49:27]:

Remember, -log(pt) is the cross entropy loss and focal loss is just a scaled version. When we defined the binomial cross entropy loss, you may have noticed that there was a weight which by default was none:

class BCE_Loss(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.num_classes = num_classes

    def forward(self, pred, targ):
        t = one_hot_embedding(targ, self.num_classes+1)
        t = V(t[:,:-1].contiguous())#.cpu()
        x = pred[:,:-1]
        w = self.get_weight(x,t)
        return F.binary_cross_entropy_with_logits(x, t, w,
            size_average=False)/self.num_classes

    def get_weight(self,x,t): return None

When you call F.binary_cross_entropy_with_logits, you can pass in the weight. Since we just wanted to multiply a cross entropy by something, we can just define get_weight. Here is the entirety of focal loss [1:50:23]:

class FocalLoss(BCE_Loss):
    def get_weight(self,x,t):
        alpha,gamma = 0.25,2.
        p = x.sigmoid()
        pt = p*t + (1-p)*(1-t)
        w = alpha*t + (1-alpha)*(1-t)
        return w * (1-pt).pow(gamma)

If you were wondering why alpha and gamma are 0.25 and 2, here is another excellent thing about this paper, because they tried lots of different values and found that these work well:

Training [1:51:25]

learn.lr_find(lrs/1000,1.)
learn.sched.plot(n_skip_end=2)
learn.fit(lrs, 1, cycle_len=10, use_clr=(20,10))
epoch      trn_loss   val_loss
0 24.263046 28.975235
1 20.459562 16.362392
2 17.880827 14.884829
3 15.956896 13.676485
4 14.521345 13.134197
5 13.460941 12.594139
6 12.651842 12.069849
7 11.944972 11.956457
8 11.385798 11.561226
9 10.988802 11.362164
[11.362164]

learn.save('fl0')
learn.load('fl0')
learn.freeze_to(-2)
learn.fit(lrs/4, 1, cycle_len=10, use_clr=(20,10))
epoch trn_loss val_loss
0 10.871668 11.615532
1 10.908461 11.604334
2 10.549796 11.486127
3 10.130961 11.088478
4 9.70691 10.72144
5 9.319202 10.600481
6 8.916653 10.358334
7 8.579452 10.624706
8 8.274838 10.163422
9 7.994316 10.108068
[10.108068]

learn.save('drop4')
learn.load('drop4')
plot_results(0.75)

This time things are looking quite a bit better. So our last step, for now, is to basically figure out how to pull out just the interesting ones.

Non Maximum Suppression [1:52:15]

All we are going to do is go through every pair of these bounding boxes, and if they overlap by more than some amount (say 0.5, using Jaccard) and they are predicting the same class, we will assume they are the same thing and pick the one with the higher probability.

It is really boring code; Jeremy didn’t write it himself but copied somebody else’s. There is no particular reason to go through it.

def nms(boxes, scores, overlap=0.5, top_k=100):
    keep = scores.new(scores.size(0)).zero_().long()
    if boxes.numel() == 0: return keep
    x1 = boxes[:, 0]
    y1 = boxes[:, 1]
    x2 = boxes[:, 2]
    y2 = boxes[:, 3]
    area = torch.mul(x2 - x1, y2 - y1)
    v, idx = scores.sort(0)  # sort in ascending order
    idx = idx[-top_k:]       # indices of the top-k largest vals
    xx1 = boxes.new()
    yy1 = boxes.new()
    xx2 = boxes.new()
    yy2 = boxes.new()
    w = boxes.new()
    h = boxes.new()

    count = 0
    while idx.numel() > 0:
        i = idx[-1]          # index of current largest val
        keep[count] = i
        count += 1
        if idx.size(0) == 1: break
        idx = idx[:-1]       # remove kept element from view
        # load bboxes of next highest vals
        torch.index_select(x1, 0, idx, out=xx1)
        torch.index_select(y1, 0, idx, out=yy1)
        torch.index_select(x2, 0, idx, out=xx2)
        torch.index_select(y2, 0, idx, out=yy2)
        # store element-wise max with next highest score
        xx1 = torch.clamp(xx1, min=x1[i])
        yy1 = torch.clamp(yy1, min=y1[i])
        xx2 = torch.clamp(xx2, max=x2[i])
        yy2 = torch.clamp(yy2, max=y2[i])
        w.resize_as_(xx2)
        h.resize_as_(yy2)
        w = xx2 - xx1
        h = yy2 - yy1
        # check sizes of xx1 and xx2.. after each iteration
        w = torch.clamp(w, min=0.0)
        h = torch.clamp(h, min=0.0)
        inter = w*h
        # IoU = i / (area(a) + area(b) - i)
        rem_areas = torch.index_select(area, 0, idx)  # load remaining areas
        union = (rem_areas - inter) + area[i]
        IoU = inter/union    # store result in iou
        # keep only elements with an IoU <= overlap
        idx = idx[IoU.le(overlap)]
    return keep, count

def show_nmf(idx):
    ima = md.val_ds.ds.denorm(x)[idx]
    bbox,clas = get_y(y[0][idx], y[1][idx])
    a_ic = actn_to_bb(b_bb[idx], anchors)
    clas_pr, clas_ids = b_clas[idx].max(1)
    clas_pr = clas_pr.sigmoid()

    conf_scores = b_clas[idx].sigmoid().t().data

    out1,out2,cc = [],[],[]
    for cl in range(0, len(conf_scores)-1):
        c_mask = conf_scores[cl] > 0.25
        if c_mask.sum() == 0: continue
        scores = conf_scores[cl][c_mask]
        l_mask = c_mask.unsqueeze(1).expand_as(a_ic)
        boxes = a_ic[l_mask].view(-1, 4)
        ids, count = nms(boxes.data, scores, 0.4, 50)
        ids = ids[:count]
        out1.append(scores[ids])
        out2.append(boxes.data[ids])
        cc.append([cl]*count)
    cc = T(np.concatenate(cc))
    out1 = torch.cat(out1)
    out2 = torch.cat(out2)

    fig, ax = plt.subplots(figsize=(8,8))
    torch_gt(ax, ima, out2, cc, out1, 0.1)

for i in range(12): show_nmf(i)

There are some things still to fix here [1:53:43]. The trick will be to use something called feature pyramid. That is what we are going to do in lesson 14.

Talking a little more about SSD paper [1:54:03]

When this paper came out, Jeremy was excited because this and YOLO were the first kind of single-pass, good-quality object detection methods to come along. There has been a continuous repetition of history in the deep learning world: approaches that involve multiple passes of multiple different pieces, particularly where they involve some non-deep-learning pieces (like R-CNN did), over time always get turned into a single end-to-end deep learning model. So Jeremy tends to ignore them until that happens, because that is the point where people have figured out how to express the whole thing as a deep learning model, and as soon as they do that, they generally end up with something much faster and much more accurate. So SSD and YOLO were really important.

The model is 4 paragraphs. Papers are really concise which means you need to read them pretty carefully. Partly, though, you need to know which bits to read carefully. The bits where they say “here we are going to prove the error bounds on this model,” you could ignore that because you don’t care about proving error bounds. But the bit which says here is what the model is, you need to read real carefully.

Jeremy reads a section 2.1 Model [1:56:37]

If you jump straight in and read a paper like this, these 4 paragraphs would probably make no sense. But now that we’ve gone through it, you read them and hopefully think “oh, that’s just what Jeremy said, only they said it better than Jeremy and in fewer words” [2:00:37]. If you start to read a paper and go “what the heck”, the trick is to then start reading back over the citations.

Jeremy reads Matching strategy and Training objective (a.k.a. Loss function)[2:01:44]

Some paper tips [2:02:34]

Scalable Object Detection using Deep Neural Networks

  • “Training objective” means the loss function
  • Double bars and two 2’s (the squared L2 norm, ‖·‖₂²) indicate a mean-squared-error-style loss
  • log(c) and log(1−c), and x and (1−x), are all the pieces of binary cross entropy:

This week, go through the code and go through the paper and see what is going on. Remember what Jeremy did to make it easier for you: he took that loss function, copied it into a cell, and split it up so that each bit was in a separate cell. Then after every cell, he printed or plotted that value. Hopefully this is a good starting point.

Lessons: 1 · 2 · 3 · 4 · 5 · 6 · 7 · 8 · 9 · 10 · 11 · 12 · 13 · 14
