Fastai V1 object detection with SSD


Special thanks to fastai for leading me into the data science world. After five months of study, I think it is time for me to post some findings from along the path. This post tries to explain things to beginners, just like me…


If we think of computer vision topics as a game, single-label classification (dog vs. cat) is more like the novice training ground, and multi-object detection is the end-game boss.

The path would be: image classification -> multi-label classification -> highly imbalanced classification -> object detection -> multi-object detection

In the real world, you are highly likely to encounter imbalanced datasets, such as face detection with one-shot learning. One solution is a Siamese network, but if the item you are interested in occupies only part of the image, you want to take it out first. This is where object detection steps in.

Fastai 2019 part 2 didn’t cover many high-level applications. Therefore, to better understand the posted RetinaNet, replicating the 2018 SSD seemed a good way to learn both the fastai API and the key concepts of object detection.


This post covers:

  1. Input data structure with the fastai V1 object detection data block
  2. Constructing anchor boxes in fastai (-1,1) style
  3. Loss function and displaying predictions

You can find the SSD source code here:

After fastai 2019 part 1, I explored a bit more with lesson 5 MNIST:

  1. Using a customized ItemList (a tutorial can be found in the fastai docs), so you can apply transfer learning to pixel-value input
  2. Going off-road: wrapping a fastai DataBunch on top of a PyTorch Dataset and training with the fastai callback system. Using a callback to handle transformations means you don’t need the fastai data block API

What makes fastai different?

My personal opinion is:

Data augmentation, lr_find, discriminative learning rates, cycle training and, most importantly, the callback system.

show_batch(), show_results() and get_preds() are not. Relying on helper functions sometimes blocks your understanding of your own model (especially once you start to customize it). You should be able to display your data, make predictions and show the results without them.

With that in mind, let’s take a look at Pascal VOC 2007.

(For a detailed explanation of the dataset, please refer to fastai 2018 lessons 8–9. I will keep this post concentrated on the differences between fastai v0.7 and v1.0.)

1. Input data
Object Item List

What is object detection?

Jeremy explained it in a beautiful way: it is simply a regression model plus a classification model.

Therefore, this is how fastai v1.0 handles it for you.

ObjectItemList is slightly different from the other ImageItemLists. The input is simple, it is just an image, but the output is a list with two items:

yb[0] holds all the box coordinates in fastai style (top, left, bottom, right). Please note that COCO style is (x and y of the top-left corner, width, height).

We will come back later to how to scale COCO-style coordinates to fastai style, and how to display a bounding box (bbox) once your output is in fastai style.
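
To make the two conventions concrete, here is a minimal sketch in plain Python. It assumes the usual COCO layout of [x, y, width, height] in pixels. The helper name coco_to_tlbr is made up for this illustration and is not part of fastai.

```python
def coco_to_tlbr(box):
    """Convert a COCO-style [x, y, width, height] box (pixels)
    to (top, left, bottom, right), still in pixels."""
    x, y, w, h = box
    return (y, x, y + h, x + w)

# a 100x60 box whose top-left corner sits at x=50, y=20
print(coco_to_tlbr([50, 20, 100, 60]))  # (20, 50, 80, 150)
```

Note that this only reorders the numbers. Scaling them into the (-1,1) range is a separate step.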

Pic in fastai style

yb[1] holds the ground-truth labels. For the Pascal dataset the class list has length 21: 20 classes + 1 bg, where bg is the first item in the list.

yb[0].shape will be (bs, n,4)

yb[1].shape will be (bs,n)

Question: What is n?

To talk about n, we first have to talk about bb_pad_collate. This is the function that is called when fastai creates your batch.

This is something fastai hides from you. Think about your dataset: say Image A has two objects in it and Image B has four. When you pack them together in one batch, each part of yb must have one consistent shape.

If Image A has yb[0].shape = (2,4) and Image B has yb[0].shape = (4,4), the batch can’t have a consistent shape. What bb_pad_collate does is make them the same shape, so Image A is padded to yb[0].shape = (4,4) to be consistent with Image B.

Across the whole batch, n is the number of objects in the image that has the most objects, so yb will have a different shape for each batch.


In the two batches I printed for this post, for example, the busiest image in the first batch had 17 ground-truth objects and the busiest image in the second batch had 13. Any image in a batch with fewer objects than that maximum is padded with 0s to make the shapes consistent.
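
The padding idea can be sketched in a few lines of PyTorch. This is a simplified stand-in for what bb_pad_collate does to the targets, not the library code itself. The function name pad_targets is hypothetical, and putting the padding in front of the real objects is my reading of the fastai source, so treat that detail as an assumption.

```python
import torch

def pad_targets(boxes_per_image, labels_per_image, pad_idx=0):
    """Stack variable-length box/label lists into fixed-size tensors."""
    n = max(len(lbls) for lbls in labels_per_image)  # most objects in the batch
    bs = len(labels_per_image)
    bboxes = torch.zeros(bs, n, 4)
    labels = torch.full((bs, n), pad_idx, dtype=torch.long)
    for i, (bbs, lbls) in enumerate(zip(boxes_per_image, labels_per_image)):
        k = len(lbls)
        if k:  # padding stays in front, real objects go at the end
            bboxes[i, n - k:] = torch.as_tensor(bbs, dtype=torch.float)
            labels[i, n - k:] = torch.as_tensor(lbls, dtype=torch.long)
    return bboxes, labels

# Image A has 2 objects, Image B has 4 (all values are made up)
boxes_a = [[-0.5, -0.5, 0.5, 0.5], [0.0, 0.0, 0.9, 0.9]]
boxes_b = [[-0.9, -0.9, 0.0, 0.0]] * 4
yb = pad_targets([boxes_a, boxes_b], [[3, 7], [1, 1, 2, 5]])
print(yb[0].shape, yb[1].shape)  # torch.Size([2, 4, 4]) torch.Size([2, 4])
```

Here n is 4 because Image B has four objects, and Image A gets two rows of zeros.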

Therefore, simply calling cnn_learner() to create the model will fail. cnn_learner() cuts the head off the pre-defined network and attaches a customized head whose output size equals data.c, which in this case is 21. When you then call learn.lr_find(), it fails because the default cross-entropy loss doesn’t know how to handle your data: it expects a tensor, but you passed a list. Therefore you have to define your own loss function. At this step, your prediction is a list with 2 items (preds[0] = bounding boxes, preds[1] = labels).

Before we move on to the loss function, let’s talk about the fastai bounding box style.

2. Fastai bounding box (bbox)

Question: With the fastai object detection data block API, but without show_batch, how can I display the ground truth?

Let’s take a look at one item in the batch.

Before un_pad

For details of un_pad(), please refer to 2018 lessons 8–9. What it does here is undo bb_pad_collate(): we basically just remove all the 0s.
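
A minimal sketch of that idea, assuming the padded rows are all zeros and therefore have no height. This is my own reconstruction in the spirit of the lesson 9 helper, not necessarily the exact function used for this post, and the class id 12 is made up.

```python
import torch

def un_pad(bbox, clas):
    """Drop padded rows: a padding box is all zeros, so bottom - top is 0."""
    keep = ((bbox[:, 2] - bbox[:, 0]) > 0).nonzero()[:, 0]
    return bbox[keep], clas[keep]

# two rows of padding followed by one real box
bbox = torch.tensor([[0., 0., 0., 0.],
                     [0., 0., 0., 0.],
                     [0.2568, -0.4742, 0.7074, 0.1978]])
clas = torch.tensor([0, 0, 12])
t_bbox, t_clas = un_pad(bbox, clas)
print(t_bbox.shape, t_clas)  # torch.Size([1, 4]) tensor([12])
```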

t_bbox shows the ground-truth object in fastai style, where the top-left corner of the image is (-1,-1) and the bottom-right corner is (1,1). The t_bbox values mean the following:

the top line of the bounding box is 0.2568

the left line of the bounding box is -0.4742

the bottom line of the bounding box is 0.7074

the right line of the bounding box is 0.1978

How fastai turns COCO-style coordinates into its own style is not the focus of this post, but briefly: it uses something called FlowField. It first grabs the image height and width, and then scales the coordinates using points(y,x) / (height/2, width/2) - 1, so the points now lie between -1 and 1. This is essential for data augmentation, because when you rotate the image you also need to rotate the bounding box.
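
The scaling formula is easy to check by hand. Below is a small sketch in plain Python. The function names are hypothetical, and I assume points are given as (y, x) pairs in pixels.

```python
def to_unit(points, height, width):
    """Scale (y, x) pixel points into the (-1, 1) range."""
    return [(y / (height / 2) - 1, x / (width / 2) - 1) for y, x in points]

def to_pixels(points, height, width):
    """Inverse mapping: (y, x) points in (-1, 1) back to pixels."""
    return [((y + 1) * height / 2, (x + 1) * width / 2) for y, x in points]

points = [(0, 0), (112, 56), (224, 224)]   # (y, x) in a 224x224 image
unit = to_unit(points, 224, 224)
print(unit)  # [(-1.0, -1.0), (0.0, -0.5), (1.0, 1.0)]
```

The top-left pixel lands on (-1,-1), the bottom-right on (1,1), and the image center on (0,0).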

Another important takeaway is that FlowField actually turns the box into four corner points. Its shape is (4,2): the top-left, top-right, bottom-left and bottom-right corners.
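
Here is a rough sketch of why the four corners are handy, in plain Python with made-up helper names. Every corner is transformed as an ordinary point, and the box is rebuilt from the min and max of the results. I believe fastai rebuilds boxes in a similar way, but take that as an assumption.

```python
def box_to_corners(top, left, bottom, right):
    """(top, left, bottom, right) -> four (y, x) corner points, shape (4, 2)."""
    return [(top, left), (top, right), (bottom, left), (bottom, right)]

def corners_to_box(corners):
    """Smallest axis-aligned box that contains all corner points."""
    ys = [y for y, _ in corners]
    xs = [x for _, x in corners]
    return (min(ys), min(xs), max(ys), max(xs))

corners = box_to_corners(0.2568, -0.4742, 0.7074, 0.1978)
# a horizontal flip in (-1,1) space simply negates x for every point
flipped = [(y, -x) for y, x in corners]
print(corners_to_box(flipped))  # (0.2568, -0.1978, 0.7074, 0.4742)
```

For a rotation you would transform the same four points and take the min and max in the same way.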

You don’t need to deal with FlowField directly, because when you call ImageBBox.create() the step above is done by the fastai library. If your coordinates are still in pixels (as in the COCO-style annotations), pass scale = True so fastai knows to turn them into its own style. If you already have fastai-style coordinates, simply set scale = False so it will not re-scale them.


When you create an ImageBBox, you need to tell it the height and width (H, W) of the coordinate system so it can scale if needed, and then pass in the coordinates. If they are already in fastai style, set scale to False. You don’t have to supply labels and classes; if you don’t, it will just display the bounding boxes.


3a. Loss function

Now that we are clear about the input and know how to display the ground truth, the rest follows fastai 2018 lesson 9, with the anchor boxes also converted to the (-1,1) scale.


Like Jeremy said, the details don’t matter. Here is the result: 189 anchor boxes.

189 anchors’ center
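
Where the number 189 comes from can be sketched as follows. I am assuming the lesson 9 setup of 4x4, 2x2 and 1x1 grids with 9 anchor shapes per cell (3 zooms x 3 aspect ratios). Those values come from my memory of the 2018 notebook rather than from anything shown in this post, and the function name is hypothetical.

```python
def anchor_centers(grids=(4, 2, 1), k=9):
    """Return the (y, x) centers of all anchors in (-1, 1) coordinates."""
    centers = []
    for g in grids:
        step = 2.0 / g                      # cell size when the image spans -1..1
        for row in range(g):
            for col in range(g):
                cy = -1 + step * (row + 0.5)
                cx = -1 + step * (col + 0.5)
                centers.extend([(cy, cx)] * k)  # k shapes share one center
    return centers

centers = anchor_centers()
print(len(centers))   # 189
print(centers[0])     # (-0.75, -0.75)
```

That is (16 + 4 + 1) cells x 9 shapes = 189 anchors, and the single 1x1 cell is centered on (0, 0).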

The loss function is clearly explained in 2018 lesson 9, so I will only point out the part that confused me.

The reason we use binary cross-entropy loss instead of cross-entropy, as discussed there, is that there are many different kinds of background. The CNN has a very hard time figuring out that all those different backgrounds are a single thing called ‘background’. Therefore we use the one-vs-all style: are you a car? are you a person? … If none of these, you are background.

When you move on to prediction, you have to keep this style in mind: you should also ignore the background field in the final prediction tensor. Remember that during training you didn’t calculate the loss for the background field, so its value is meaningless (it is probably very large, and if you include it, you will very likely predict that everything is background).
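
A minimal sketch of the one-vs-all idea in PyTorch, assuming background is class 0 as in data.classes. The lesson 9 loss normalizes differently (and kept background in the last column), so this is an illustration of the principle rather than the loss used for this post. The function name is made up.

```python
import torch
import torch.nn.functional as F

def bce_no_background(logits, targets, n_classes=21):
    """One-vs-all loss over the real classes only.
    logits: (n_anchors, n_classes); targets: (n_anchors,), 0 = background."""
    one_hot = F.one_hot(targets, n_classes).float()
    # drop column 0 so 'background' is never something the model must predict
    return F.binary_cross_entropy_with_logits(logits[:, 1:], one_hot[:, 1:])

logits = torch.randn(5, 21)
targets = torch.tensor([0, 0, 15, 7, 0])   # made-up labels
loss = bce_no_background(logits, targets)

# changing the background scores does not change the loss at all
logits_big_bg = logits.clone()
logits_big_bg[:, 0] = 100.0
same = torch.isclose(loss, bce_no_background(logits_big_bg, targets))
print(same)  # tensor(True)
```

Because the background column never enters the loss, nothing pushes it toward a sensible value, which is why it has to be ignored at prediction time too.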

The right way to read your prediction is the following:

Set some threshold. If the probability of any of the 20 things (excluding background) on a box is high enough, you say it is not background. Otherwise, it is background.

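idx = 21  # index of the item in the batch
# best non-background score per anchor: (values, class indices)
p_cls_test = preds[1][idx][:,1:].max(dim=1)
# keep only anchors whose best class probability is above 10%
idx_clas = p_cls_test[0].sigmoid() > 0.1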

These two cells simply mean: take my predicted labels, preds[1]; take item idx in the batch (index 21 above), preds[1][idx]; and take its class scores excluding the background column, preds[1][idx][:,1:].

Then check the threshold and only keep the ones with probability greater than 10%.

p_f_clas = (p_cls_test[1] + 1) * (idx_clas).long()

When you do the label prediction above, don’t forget that you excluded background, so index 0 is not background but the first real item in your data classes list. Therefore you need to add 1 to make the index consistent with your data classes list, so you can map the class back properly.
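
The same steps on a toy tensor may make the indexing easier to follow. The scores below are made up: 3 anchors and 4 classes, with column 0 as background.

```python
import torch

scores = torch.tensor([[ 5., -5., -5., -5.],   # nothing but background here
                       [-5.,  3., -4., -4.],   # class 1 is confident
                       [ 0., -4., -4.,  2.]])  # class 3 is confident
best = scores[:, 1:].max(dim=1)            # (values, indices), background dropped
keep = best[0].sigmoid() > 0.1             # threshold on the probability
pred_class = (best[1] + 1) * keep.long()   # +1 restores the data.classes index
print(pred_class)  # tensor([0, 1, 3])
```

The first anchor has a large background score, but that column is never looked at. It ends up as 0 (background) only because none of its real classes passes the threshold.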

3b. Display

img = vision.Image(denormalize(xb[idx],*tensor(imagenet_stats)))
img_box = ImageBBox.create(224,224,yb[0][idx].cpu(),labels=yb[1][idx],classes=data.classes,scale=False)

p_final_bbox = act_to_bbox(preds[0][idx].cpu(),anchors)
t_bbox = p_final_bbox.cuda().index_select(dim=0,index=idx_clas.nonzero().squeeze())
test_clas = p_f_clas.index_select(dim=0,index=idx_clas.nonzero().squeeze())
p_img_box = ImageBBox.create(224,224,t_bbox.cpu(),test_clas,classes=data.classes,scale=False)

fig,axes = plt.subplots(1,2,figsize=(10,6))
img.show(ax=axes[0],y=img_box,title='Ground Truth')
img.show(ax=axes[1],y=p_img_box,title='Prediction')

Displaying predictions is easy once you finish the setup:

1. Read the image at idx from the batch and denormalize it.

2. Use fastai ImageBBox.create() to create two sets of boxes, one for the ground truth and one for the predicted bounding boxes.

3. Pass the predicted labels into the display.


The End:

I hope this post can bridge the gap between part 1 and the RetinaNet posted in the fastai GitHub.

It helped me understand the following concepts:

  1. Construct the data structure based on the input requirements. If needed, we can supply a PyTorch DataLoader to a DataBunch (see the above example)
  2. Create a customized model with fastai discriminative lr, and apply the fastai features you need.
  3. Create the loss function based on the model structure
  4. Make predictions.
