Object Detection 101

Güldeniz Bektaş
11 min read · Jan 11, 2023

1. Object Detection: Task Definition

Object detection is a computer vision task that allows us to identify and locate objects in an image or a video. The model predicts where each object is and what label should be applied. With this kind of identification and localization, object detection can be used to count objects in a scene and determine and track their precise locations, all while accurately labeling them.

An image that contains two cats and a person.

Object detection allows us to at once classify the types of things found while also locating instances of them within the image.

Classification — what’s in the image; a single overall label for the whole image

Semantic Segmentation — no individual objects, just a class label for every pixel

Object Detection — multiple objects detected and localized

Input — single RGB image

Output — a set of detected objects. For each object we output several things (a small example follows this list):

  • category label — which category the object belongs to, from a set of categories fixed ahead of time (e.g. the CIFAR-10 or ImageNet classes).
  • bounding box — spatial extent of the object. The location of each object (x,y,w,h).
  • x and y are the center of the box in pixels.
  • w and h are the width and height of the box.
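
For illustration, here is roughly what such an output could look like as a plain Python data structure; the labels and pixel values are made up for the two-cats-and-a-person image above, and real frameworks use their own formats.

```python
# Hypothetical detector output for one image: one entry per detected object.
# (x, y) is the box center in pixels; w and h are the box width and height.
detections = [
    {"label": "cat",    "box": (120.0,  96.0,  80.0,  64.0)},
    {"label": "cat",    "box": (340.0, 210.0,  96.0,  72.0)},
    {"label": "person", "box": (500.0, 180.0, 140.0, 320.0)},
]
```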

2. Object Detection: Challenges

  1. Multiple Outputs: Need to output variable numbers of objects per image.
  2. Multiple Types of Output: Need to predict “what” (category label) and “where” (bounding box).
  3. Large Images: Classification works at 224x224; for the object detection task we need higher resolution, often 800x600.

Detecting Single Object

We can use a simple architecture to detect a single object. We take an input image and feed it into a standard CNN architecture, like VGG or ResNet. At the end we have a vector representation of the image. From this vector, one branch does the classification: what is in the image? It outputs scores for each class and is trained with a softmax loss on the ground-truth category.

So far this is the same as the image classification pipeline. But now we add a second branch that outputs the bounding box coordinates: x, y, w, h. Its input is the same vector representation of the image, followed by a fully connected layer from 4096 to 4, since we need 4 outputs. This branch is trained with a regression loss, such as the L2 distance between the predicted box coordinates and the ground-truth box coordinates.

The problem is that we now have two losses. Gradient descent needs a single scalar loss, and we don’t know how to deal with a set of losses, so we take a weighted sum of the two losses; the weights keep one loss from overpowering the other. This is called a multitask loss, because we want to train our deep learning model to do multiple different tasks at once, but we still need to boil it down to one single scalar loss at the end.
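
Here is a minimal sketch of that two-branch head and the weighted multitask loss in PyTorch, assuming a backbone has already produced a 4096-dimensional feature vector. The class count, fake data, and loss weight are placeholders, not the lecture's exact setup.

```python
import torch
import torch.nn as nn

class SingleObjectHead(nn.Module):
    def __init__(self, feat_dim=4096, num_classes=10):
        super().__init__()
        self.cls_head = nn.Linear(feat_dim, num_classes)  # "what" branch: class scores
        self.box_head = nn.Linear(feat_dim, 4)            # "where" branch: (x, y, w, h)

    def forward(self, feats):
        return self.cls_head(feats), self.box_head(feats)

head = SingleObjectHead()
feats = torch.randn(8, 4096)             # fake backbone features for 8 images
labels = torch.randint(0, 10, (8,))      # ground-truth categories
gt_boxes = torch.rand(8, 4)              # ground-truth boxes (x, y, w, h)

scores, boxes = head(feats)
cls_loss = nn.functional.cross_entropy(scores, labels)  # softmax loss
box_loss = nn.functional.mse_loss(boxes, gt_boxes)      # L2 regression loss

# Multitask loss: a weighted sum so neither term overpowers the other.
# The weight 1.0 is arbitrary here; in practice it is a hyperparameter.
loss = cls_loss + 1.0 * box_loss
loss.backward()
```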

🎤 The network we feed our image into is called the backbone network. You can use VGG, ResNet, AlexNet, or other state-of-the-art CNN models. This network is often pre-trained for ImageNet classification, and you then fine-tune the whole network for the multitask detection problem.

This straightforward method actually works when you know you’ll detect only one object in an image.

Detecting Multiple Objects

Different images might have different numbers of objects that we need to detect.

If an image contains one cat, the model needs to predict 4 numbers plus a label. If an image contains two dogs and a cat, the model needs to predict 4x3 = 12 numbers plus 3 labels. If an image contains many objects, the model needs to predict many numbers.

There’s a relatively simple way to do this.

3. Detecting Multiple Objects: Sliding Window

We take a CNN and train it to classify sub-windows of the input image, reducing detection to a classification task. We apply the CNN to every region of the input image, and it classifies each region as one of the object categories or as background.
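
As a rough sketch of the idea, here is what sliding a single fixed-size window over an image might look like; `classify_crop` is a hypothetical stand-in for the CNN classifier, and real sliding-window detectors repeat this over many window sizes.

```python
import numpy as np

def sliding_window_detect(image, classify_crop, win_h=224, win_w=224, stride=32):
    """Score every window of one fixed size with a crop classifier.

    `classify_crop` is assumed to return (label, score) for a crop, where the
    label can also be "background". The expense of repeating this over all
    positions and all window sizes is exactly the problem discussed below.
    """
    H, W = image.shape[:2]
    detections = []
    for y in range(0, H - win_h + 1, stride):
        for x in range(0, W - win_w + 1, stride):
            crop = image[y:y + win_h, x:x + win_w]
            label, score = classify_crop(crop)
            if label != "background":
                detections.append((label, score, (x, y, win_w, win_h)))
    return detections
```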

📌 Problem: How many possible boxes are there in an image of size HxW?

Consider a box of size hxw:

Possible x positions: W - w + 1

Possible y positions: H - h + 1

Possible positions: (W - w + 1) * (H - h + 1)

In fact we need to consider not just boxes of a fixed size. We need to consider all possible boxes of all possible sizes.

🔗 If the image size is 800x600, the image has approximately 58M possible boxes!

😅 There is no computationally feasible way to run our classifier on all of them. 58M regions for just ONE image.

So there is another approach to overcome this problem: region proposals.

Region Proposals

Since we can’t evaluate every possible region of an image with the detection model, maybe we can have some external algorithm generate a set of candidate regions for us. These candidates form a relatively small set of regions per image, with a high probability of covering all the objects.

A few years ago there were a whole bunch of papers proposing different mechanisms for generating region proposals. In the end they were replaced with neural networks.

One of the most famous region proposal methods was called selective search. It is an algorithm you run on a CPU, and it gives about 2000 regions per image in a couple of seconds of processing.

Once we have region proposals, there is a very straightforward way to train an object detector with deep neural networks.

4. R-CNN: Region-Based CNN

These 2000 candidate region proposals are warped into a square and fed into a convolutional neural network that produces a 4096-dimensional feature vector as output. The CNN acts as a feature extractor, the output dense layer consists of the features extracted from the image, and these extracted features are fed into an SVM to classify the presence of the object within that candidate region proposal. In addition to predicting the presence of an object within the region proposals, the algorithm also predicts four values, which are offset values to increase the precision of the bounding box.[1]

What if a region proposal doesn’t contain the object fully? What if a box labeled person covers only half of the face? To solve this, the other output we need from the model is a bounding box regression: a predicted transform that corrects the 4 bounding box coordinates.

Region proposal: (p_x, p_y, p_h, p_w)

Transform: (t_x, t_y, t_h, t_w)

Output box: (b_x, b_y, b_h, b_w)

Translate relative to box size: b_x = p_x + p_w * t_x ; b_y = p_y + p_h * t_y

Log-space scale transform: b_w = p_w * exp(t_w) ; b_h = p_h * exp(t_h)
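
A tiny NumPy sketch of applying such a transform to one proposal, using the center/size parameterization above; the numbers are made up.

```python
import numpy as np

def apply_box_transform(proposal, transform):
    """Apply an R-CNN-style box regression transform to one region proposal.

    proposal  = (p_x, p_y, p_h, p_w): center, height, width of the proposal.
    transform = (t_x, t_y, t_h, t_w): deltas predicted by the network.
    Returns the corrected output box (b_x, b_y, b_h, b_w).
    """
    p_x, p_y, p_h, p_w = proposal
    t_x, t_y, t_h, t_w = transform
    b_x = p_x + p_w * t_x          # shift the center, scaled by proposal size
    b_y = p_y + p_h * t_y
    b_h = p_h * np.exp(t_h)        # log-space scaling of height and width
    b_w = p_w * np.exp(t_w)
    return b_x, b_y, b_h, b_w

# A zero transform leaves the proposal unchanged:
print(apply_box_transform((100, 80, 60, 40), (0, 0, 0, 0)))  # (100, 80, 60, 40)
```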

5. R-CNN: Test-Time

This gives us our first full object detection method using convolutional neural networks. The pipeline at test time looks something like this:

  1. Run the region proposal method to compute approximately 2000 region proposals.
  2. Resize each region to 224x224 and run it independently through the CNN to predict class scores and a bbox transform that corrects the coordinates of the original region proposal.
  3. Use scores to select a subset of region proposals to output (threshold on background or per-category, take top K proposals per image)
  4. Compare with ground-truth boxes

Comparing Boxes: Intersection over Union (IoU)

How can we compare our prediction to the ground truth box?

The way that we compare two sets of bounding boxes is with a metric called Intersection over Union (IoU).

IoU > 0.5 is decent

IoU > 0.7 is pretty good

IoU > 0.9 is almost perfect
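
Here is a straightforward implementation sketch; for convenience it takes boxes as (x1, y1, x2, y2) corners rather than the center/size format used earlier.

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes given as (x1, y1, x2, y2) corners."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Coordinates of the intersection rectangle (may be empty).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```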

6. Overlapping Boxes: Non-Max Suppression (NMS)

Problem — object detectors often output many overlapping detections. The model will not output one bounding box per object. So we need some kind of mechanism to get rid of these overlapping boxes.

Solution — post-process raw detections using Non-Max Suppression (NMS)

Let’s say we have a list of predictions P, where each prediction is (x, y, w, h, score). For NMS, we choose the box with the highest score, call it S. We remove S from P and compute the IoU between S and every prediction remaining in P. Given an IoU threshold, if the IoU between a prediction and S is greater than this threshold, we eliminate that prediction from P. We do this for every element in P. When that is done, the remaining predictions might correspond to other objects in the image, so we repeat the process with the next highest-scoring box in P (S is no longer an element of P!).
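
A greedy NMS sketch in Python, reusing the `iou` helper from the previous section; here each detection is simply a (score, box) pair with corner-format boxes.

```python
def nms(detections, iou_threshold=0.5):
    """Greedy Non-Max Suppression over (score, box) pairs.

    Repeatedly keep the highest-scoring remaining box and drop every remaining
    box that overlaps it with IoU above the threshold.
    """
    remaining = sorted(detections, key=lambda d: d[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)            # highest score left
        kept.append(best)
        remaining = [d for d in remaining if iou(best[1], d[1]) <= iou_threshold]
    return kept

dets = [(0.9, (0, 0, 10, 10)), (0.8, (1, 1, 11, 11)), (0.7, (50, 50, 60, 60))]
print(nms(dets))  # the second box is suppressed, the third survives
```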

⛳ But there is a problem with this. When an image contains a lot of heavily overlapping objects, NMS is not enough; it can eliminate correct detections.

7. Evaluating Object Detectors: Mean Average Precision (mAP)

  1. Run the object detector on every test image (with NMS).
  2. For each category, compute Average Precision (AP) = area under the precision-recall curve.
  3. For each detection (from highest score to lowest score):
  4. If it matches some ground-truth box with IoU greater than 0.5, mark it as positive and eliminate that ground-truth box.
  5. Otherwise mark it as negative.
  6. Plot a point on the PR curve.
  7. Average Precision (AP) = area under the PR curve (see the sketch after this list).
  8. Mean Average Precision (mAP) = average of the AP over all categories.
  9. For COCO mAP: compute mAP@thresh for each IoU threshold and take the average.
  • mAP@0.5, mAP@0.55, mAP@0.60…
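
A rough sketch of the AP step for a single category, assuming each detection has already been matched (or not) against a ground-truth box as described above. Real evaluation code (e.g. for COCO or PASCAL VOC) uses interpolated precision, so treat this as illustrative only.

```python
import numpy as np

def average_precision(scores, is_positive, num_gt):
    """AP = area under the precision-recall curve for one category.

    scores      : confidence of each detection.
    is_positive : True if that detection matched a ground-truth box (IoU > 0.5).
    num_gt      : total number of ground-truth boxes for the category.
    """
    order = np.argsort(scores)[::-1]                   # highest score first
    flags = np.asarray(is_positive, dtype=bool)[order]
    tp = np.cumsum(flags)                              # running true positives
    fp = np.cumsum(~flags)                             # running false positives
    recall = tp / num_gt
    precision = tp / (tp + fp)
    # Trapezoid-rule area under the PR curve, starting from (recall=0, precision=1).
    area, prev_r, prev_p = 0.0, 0.0, 1.0
    for r, p in zip(recall, precision):
        area += (r - prev_r) * (p + prev_p) / 2.0
        prev_r, prev_p = r, p
    return area

print(average_precision([0.9, 0.8, 0.6], [True, False, True], num_gt=2))
```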

If we go back to R-CNN, we can see a huge problem: it is very slow! We need to do approximately 2000 forward passes of the CNN for every image, so we need to come up with some way to make this process faster.

And people have made this process faster. With Fast R-CNN.

Fast R-CNN

We’re going to swap the order of cropping regions and running the convolutional network. Take the input image and process the whole image at high resolution with a single convolutional neural network. This network has no fully connected layers, just convolutional layers, so its output is a convolutional feature map giving us convolutional features for the entire high-resolution image. The convnet we run the image through is again called the backbone network.

We still run our region proposal method, like selective search, on the raw input image. But now, rather than cropping pixels out of the input image, we project those region proposals onto the convolutional feature map and do the cropping and resizing on the feature map itself, on the features coming out of the convolutional backbone network. Then we run a lightweight per-region network that outputs classification scores and bounding box regression transforms for each detected region. This is very fast, because most of the computation happens in the backbone network, while the per-region network is relatively small, lightweight, and fast to run. If you build Fast R-CNN with an AlexNet, the backbone is all of the convolutional layers of AlexNet, and the per-region network is just the two fully connected layers at the end. These are relatively cheap to compute even when we need to run them for a large set of regions.
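
Here is a very rough sketch of that forward pass in PyTorch, using torchvision’s built-in `roi_pool` for the crop-and-resize step that the next section explains. The layer sizes, stride, and proposals are made up, so this illustrates the data flow rather than the real Fast R-CNN architecture.

```python
import torch
import torch.nn as nn
import torchvision

backbone = nn.Sequential(                      # stands in for e.g. VGG/ResNet conv layers
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(16),                          # overall stride 16, like common backbones
)
per_region_head = nn.Sequential(nn.Flatten(), nn.Linear(64 * 7 * 7, 256), nn.ReLU())
cls_head = nn.Linear(256, 21)                  # 20 classes + background (placeholder)
box_head = nn.Linear(256, 4)                   # box regression transform

image = torch.randn(1, 3, 800, 608)            # one high-resolution image (608 chosen to divide by 16)
features = backbone(image)                     # run the conv backbone once per image

# Proposals from an external method (e.g. selective search), as (x1, y1, x2, y2)
# in image coordinates; here just two made-up boxes.
proposals = [torch.tensor([[10., 10., 200., 300.], [50., 40., 400., 500.]])]

# Crop + resize each proposal on the feature map (spatial_scale maps image
# coordinates to feature-map coordinates for a stride-16 backbone).
regions = torchvision.ops.roi_pool(features, proposals, output_size=(7, 7),
                                   spatial_scale=1.0 / 16)
h = per_region_head(regions)
scores, deltas = cls_head(h), box_head(h)      # lightweight per-region outputs
print(scores.shape, deltas.shape)              # torch.Size([2, 21]) torch.Size([2, 4])
```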

❓ There is a question here: what does it mean, exactly, to crop these features?

  • In order to backpropagate into the weights of the backbone network, we need to crop the features in a way that is differentiable, and that ends up being a little bit tricky. One way to crop these features in a differentiable way is an operator called RoI Pool.

Cropping Features: RoI Pool

Region of interest pooling.

We have our input image and some region proposal that has been computed on it. We run the backbone network to get convolutional features across the entire input image; each point in this feature map corresponds to points in the input image. We then project the region proposal onto the feature map. The projected proposal might not perfectly align with the grid of the feature map, so the next step is to snap it to the grid cells. Then we divide it into sub-regions: say we want to do 2x2 pooling, we divide the snapped region into roughly equal 2x2 sub-regions, as close as we can get while staying aligned to grid cells, and perform max pooling within each of those sub-regions. The region features are always the same size, even if the input regions have different sizes.
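
A hand-rolled NumPy sketch of that procedure for a single proposal, assuming a 2x2 output and a stride-16 backbone; real implementations are differentiable and handle the edge cases more carefully.

```python
import numpy as np

def roi_pool_single(feature_map, box, out_size=2, stride=16):
    """Sketch of RoI pooling for one proposal (NumPy, no gradients).

    feature_map : array of shape (C, Hf, Wf) from the backbone.
    box         : (x1, y1, x2, y2) proposal in input-image pixel coordinates.
    """
    C, Hf, Wf = feature_map.shape
    # Project the proposal onto the feature map and snap it to the grid.
    x1, y1, x2, y2 = [int(round(c / stride)) for c in box]
    x2, y2 = max(x2, x1 + 1), max(y2, y1 + 1)        # keep at least one cell
    region = feature_map[:, y1:y2, x1:x2]

    out = np.zeros((C, out_size, out_size), dtype=feature_map.dtype)
    ys = np.linspace(0, region.shape[1], out_size + 1).astype(int)
    xs = np.linspace(0, region.shape[2], out_size + 1).astype(int)
    for i in range(out_size):
        for j in range(out_size):
            sub = region[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                            xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[:, i, j] = sub.max(axis=(1, 2))      # max pool each sub-region
    return out

feat = np.random.rand(8, 38, 50)                     # fake backbone features
print(roi_pool_single(feat, (40, 60, 400, 350)).shape)  # (8, 2, 2)
```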

🤚 Runtime is dominated by computing region proposals. They are computed by the “Selective Search” algorithm on a CPU; let’s learn them with a CNN instead!

8. Faster R-CNN: Learnable Region Proposals

Insert a “Region Proposal Network” (RPN) to predict proposals from features.

Faster R-CNN adds an extra CNN for generating the region proposals, which we call the region proposal network (RPN). The RPN takes the backbone feature map as input and outputs region proposals. These proposals then go to the RoI pooling layer for further processing.
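
A sketch of what such a region proposal network head might look like in PyTorch; for each of A anchor boxes at every feature-map position it predicts an objectness score and a 4-number box transform. The channel counts and the 9 anchors per position are typical choices (as in the Faster R-CNN paper), but treat this as an illustration rather than the exact architecture.

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_channels=512, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        self.objectness = nn.Conv2d(512, num_anchors, kernel_size=1)      # object vs background
        self.box_deltas = nn.Conv2d(512, num_anchors * 4, kernel_size=1)  # per-anchor transform

    def forward(self, features):
        h = torch.relu(self.conv(features))
        return self.objectness(h), self.box_deltas(h)

rpn = RPNHead()
features = torch.randn(1, 512, 38, 50)       # backbone features for one image
scores, deltas = rpn(features)
print(scores.shape, deltas.shape)            # [1, 9, 38, 50] and [1, 36, 38, 50]
```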

Well, after this we could add a full implementation, but not in this article. There is a link where you can find code for object detection with the TensorFlow Object Detection API, where you create your own custom dataset to train an object detection model. Pretty awesome, right?

Source

These notes are taken from the University of Michigan’s Object Detection lecture. Check that out:

And watch the videos:
