(Part 1) Generating Anchor boxes for Yolo-like network for vehicle detection using KITTI dataset.

Vivek Yadav, PhD

In this post, I will present the steps for computing anchor boxes for YOLO9000 (or YOLOv2). YOLOv2 is a combined classification and bounding box prediction framework in which we directly predict the objects in each cell and the corrections on anchor boxes. More specifically, YOLOv2 divides the entire image into a 13X13 grid of cells, places 5 anchor boxes at each cell, and predicts corrections on these anchor boxes. For each anchor box, YOLOv2 makes 5 predictions: corrections on the location of the center (x and y), the height and width, and the intersection over union (IOU) between the predicted bounding box and the ground truth box. One useful feature of YOLOv2 is that all the predictions have magnitude less than 1, so one type of cost is less likely to dominate the optimization. Another is that the anchor boxes are designed specifically for the given dataset using K-means clustering. Unlike other anchor box (or prior) based methods, like Single Shot Detection, YOLOv2 does not assume the aspect ratios or shapes of the boxes. As a result, YOLOv2 in general has lower localization loss and higher IOU between the target and the network prediction. The rest of the post is organized as follows,

  1. Data preparation
  2. Exploratory data analysis
  3. Generating anchor boxes using K-means clustering
  4. Assigning anchor boxes to ground truth targets

1. Data preparation

I first downloaded the images and labels from the KITTI object-detection data set. I downloaded the detection labels and images for the car’s left camera. After downloading, I put the images and labels in separate folders titled kitti_land kitti_labels respectively. I then combined cars, trucks and vans into one group called vehicles, and cyclists and pedestrians into a group called Person. In addition, I computed the center location, width and height of each bounding box, and normalized them by the image dimensions. This allowed me to resize images as needed and still use the same bounding boxes without any additional calculations. It also makes data augmentation easier, which I will discuss in a more appropriate section later.
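The grouping and normalization described above can be sketched roughly as follows. This is a minimal illustration, not the code used for the post: the function name `parse_kitti_line`, the `GROUPS` mapping and the group label strings are hypothetical, and it assumes the standard KITTI label layout in which fields 5 to 8 of each line hold the left, top, right and bottom pixel coordinates of the 2D box.

```python
# Hypothetical mapping from KITTI class names to the two groups used in this post.
GROUPS = {
    "Car": "Vehicle", "Van": "Vehicle", "Truck": "Vehicle",
    "Pedestrian": "Person", "Cyclist": "Person",
}

def parse_kitti_line(line, img_w=1242, img_h=375):
    """Return (group, xc, yc, w, h) normalized by image size, or None for unused classes."""
    fields = line.split()
    group = GROUPS.get(fields[0])
    if group is None:
        return None  # e.g. DontCare, Misc, Tram are skipped in this sketch
    # Assumed KITTI layout: fields[4:8] are left, top, right, bottom in pixels.
    left, top, right, bottom = map(float, fields[4:8])
    xc = (left + right) / 2.0 / img_w   # normalized center x
    yc = (top + bottom) / 2.0 / img_h   # normalized center y
    w = (right - left) / img_w          # normalized width
    h = (bottom - top) / img_h          # normalized height
    return group, xc, yc, w, h

# A made-up label line in KITTI format, for illustration only.
example = "Car 0.00 0 -1.57 100.00 150.00 300.00 250.00 1.5 1.6 3.9 1.0 1.5 10.0 -1.5"
print(parse_kitti_line(example))
```

Because everything is expressed as a fraction of the image size, the same tuple remains valid after the image is resized.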

2. Exploratory data analysis (EDA)

EDA is perhaps the most important step of building any machine learning algorithm. Here, I will try to explain what YOLOv2 tries to do. The figure below presents a representative image from the KITTI dataset. Each image in the KITTI dataset is of size 1242 X 375, and there are about 7400 images with approximately 25000 annotations. Please note that the labels are normalized by the height and width of the image.

Sample image and bounding box labels

KITTI is an especially interesting data set because it is closer to real-life data. In KITTI, images are organized by varying levels of difficulty, and the images have different types of issues that one may encounter in a more realistic data set. These issues can broadly be grouped into 3 types,

Occlusion: Occlusion happens when one object is either partially or completely occluded by another object.

Example of occlusion

Overcrowded image: Most bounding box prediction methods predict a fixed number of bounding boxes. However, if the image is overcrowded, the algorithm will try to lump

Crowded bounding boxes
A nightmare for any human detection system

Incorrect annotations: The last issue is one where images are not completely annotated. For example, in the image below there are about 16 cars that are not tagged.

Mislabeled cars

3. Generating anchor boxes using K-means clustering

There are many ways to compute bounding boxes for detection tasks. One approach is to directly predict the bounding box values. However, this approach is susceptible to errors, as it tends to favor bounding boxes with large dimensions. Further, the training process is unstable because the range of values to predict can vary significantly. An alternative approach is to use template bounding boxes, called anchor boxes or priors, and then predict corrections on top of these anchor boxes to match the ground truth bounding box dimensions. In other models, like single shot detection (SSD), corrections are made on top of bounding boxes of fixed, hand-selected sizes and aspect ratios: a fixed set of anchor/prior boxes is placed at each cell. This, however, does not guarantee that the anchor boxes are good candidates for bounding boxes. In the original YOLO, no anchor boxes are used, and bounding box locations and dimensions are predicted directly.

In YOLOv2, the first step is to compute good candidate anchor boxes. This is achieved using K-means clustering. However, K-means with a plain Euclidean distance metric minimizes error for larger bounding boxes at the expense of smaller ones, because the same relative mismatch produces a much larger distance for a large box. Therefore, in YOLOv2, a distance based on intersection over union (IOU) is used, namely 1 - IOU between a box and a centroid. The IOU calculations are made assuming all the bounding boxes are located at one point, i.e. only width and height are used as features.

The figure below shows the height and width plotted against each other. The points fall along lines of fixed slope, which indicates that most bounding boxes have specific aspect ratios and sizes. This is not surprising, given that a person and a vehicle are expected to have certain fixed proportions.
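To make the difference between the two distance measures concrete, here is a small sketch. The helper `wh_iou` is a hypothetical name, and the box sizes are made up for illustration. It computes IOU from width and height only, with all boxes sharing the same center, and compares a small and a large box against an anchor twice their size.

```python
import numpy as np

def wh_iou(wh, centroids):
    """IOU between one (w, h) box and each centroid, all boxes sharing the same center."""
    wh = np.asarray(wh, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # With a common center, the intersection is just min(widths) * min(heights).
    inter = np.minimum(wh[0], centroids[:, 0]) * np.minimum(wh[1], centroids[:, 1])
    union = wh[0] * wh[1] + centroids[:, 0] * centroids[:, 1] - inter
    return inter / union

# Made-up pairs: each box is compared with an anchor twice its width and height.
small_box, small_anchor = (0.02, 0.04), [(0.04, 0.08)]
large_box, large_anchor = (0.20, 0.40), [(0.40, 0.80)]

for box, anchor in [(small_box, small_anchor), (large_box, large_anchor)]:
    euclid = np.linalg.norm(np.asarray(box) - np.asarray(anchor[0]))
    iou_dist = 1.0 - wh_iou(box, anchor)[0]
    print(f"euclidean={euclid:.4f}  1-IOU={iou_dist:.4f}")
```

In both pairs the anchor is wrong by the same factor, so 1 - IOU is the same for both, while the Euclidean distance for the large box is ten times that of the small one. A clustering driven by Euclidean distance would therefore spend most of its effort on the large boxes.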

I next used K-means clustering to compute cluster centers (centroids). As it is not clear how many centroids to use, I ran the clustering for different numbers of clusters and, for each, computed the mean over all bounding boxes of the maximum IOU between the bounding box and the individual anchors. The figure below presents mean IOU vs number of centroids for the KITTI dataset.
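A minimal sketch of this procedure is given below. It is an illustration under stated assumptions, not the exact code behind the figure: the function names `iou_matrix` and `kmeans_iou` are hypothetical, centroids are updated with the cluster mean (some implementations use the median instead), and the boxes are synthetic stand-ins for the normalized (width, height) labels.

```python
import numpy as np

def iou_matrix(boxes, centroids):
    """Pairwise IOU between (N, 2) boxes and (K, 2) centroids, all centered at one point."""
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0]) *
             np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    area_b = (boxes[:, 0] * boxes[:, 1])[:, None]
    area_c = (centroids[:, 0] * centroids[:, 1])[None, :]
    return inter / (area_b + area_c - inter)

def kmeans_iou(boxes, k, n_iter=100, seed=0):
    """K-means on (w, h) pairs with 1 - IOU as distance. Returns (centroids, mean max IOU)."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Smallest 1 - IOU distance is the same as largest IOU.
        assign = np.argmax(iou_matrix(boxes, centroids), axis=1)
        new = centroids.copy()
        for j in range(k):
            members = boxes[assign == j]
            if len(members):  # keep the old centroid if a cluster becomes empty
                new[j] = members.mean(axis=0)
        if np.allclose(new, centroids):
            break
        centroids = new
    mean_iou = iou_matrix(boxes, centroids).max(axis=1).mean()
    return centroids, mean_iou

# Synthetic stand-in for the normalized (w, h) labels; replace with the real data.
rng = np.random.default_rng(1)
boxes = np.abs(rng.normal([0.1, 0.2], [0.05, 0.1], size=(500, 2))) + 0.01
for k in range(1, 8):
    _, miou = kmeans_iou(boxes, k)
    print(k, round(float(miou), 3))
```

Plotting the printed mean IOU against k gives a curve of the kind shown in the figure, and the point where it flattens out is a reasonable guide for choosing the number of anchors.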

mean IOU vs number of centroids

After plotting number of centroids vs mean IOU, it is clear that as the number of centroids increases, the mean IOU between anchor boxes and bounding boxes plateaus. Choosing a large number of prior boxes allows for greater overlap between anchor boxes and bounding boxes. However, as the number of anchor boxes increases, the number of convolution filters in the prediction layer increases linearly. With 7 values predicted per anchor box (the 5 box predictions plus one score for each of the 2 classes), having 5 vs 10 anchor boxes results in 13X13X35 vs 13X13X70 predictions, which means a larger network and increased training time. So I stuck with 5 anchor boxes, to stay true to the original YOLOv2 implementation. At 5 anchor boxes, the mean IOU was above 65%. The anchor boxes from K-means are plotted below,
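The output sizes quoted above follow from simple arithmetic, sketched below. The function name `yolo_output_shape` is hypothetical, and the count of 2 classes is an assumption taken from the Vehicle and Person grouping described earlier in this post.

```python
def yolo_output_shape(grid=13, n_anchors=5, n_classes=2):
    """Each anchor predicts x, y, w, h, IOU (5 values) plus one score per class."""
    return grid, grid, n_anchors * (5 + n_classes)

print(yolo_output_shape(n_anchors=5))   # (13, 13, 35)
print(yolo_output_shape(n_anchors=10))  # (13, 13, 70)
```

Doubling the number of anchors doubles the depth of the prediction layer, which is the linear growth in filters mentioned above.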

Predicted anchor boxes

As can be seen above, each anchor box is specialized for a particular aspect ratio and size. A clearer picture is obtained by plotting the anchor boxes on top of the image. In YOLOv2, an image is divided into a 13X13 grid, and bounding box and class predictions are made for each anchor box located at each grid cell. The anchor box responsible for a ground truth box is the one with the highest IOU between the ground truth box and the anchor box. Note that this assignment ensures that an anchor box predicts the ground truth for an object centered in its own grid cell, and not in a grid cell far away (as YOLO may).

The figures below present anchor boxes and ground truth labels plotted on top of one another. Note that the anchor box responsible for predicting a ground truth label is chosen as the box that gives the maximum IOU when placed at the center of the ground truth box, i.e. only size is considered while assigning anchor boxes to ground truth boxes. The anchor box is then located at the center of the cell in which the center of the ground truth box falls. The figures show the ground truth label (in green) and the assigned anchor box (in magenta). As we computed the anchor boxes directly from the data set, we see a high overlap between the shapes of the ground truth boxes and the anchors.
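The assignment rule can be sketched as follows. This is an illustrative version only: `assign_anchor` is a hypothetical name, the anchors are made-up values rather than the ones computed from KITTI, and the ground truth box is assumed to be given as normalized (center x, center y, width, height).

```python
import numpy as np

def assign_anchor(box, anchors, grid=13):
    """box = (xc, yc, w, h), normalized. Returns (row, col, anchor index, anchor center)."""
    xc, yc, w, h = box
    # Grid cell containing the center of the ground truth box.
    col = min(int(xc * grid), grid - 1)
    row = min(int(yc * grid), grid - 1)
    # Size-only IOU: anchor and ground truth box are treated as sharing one center.
    anchors = np.asarray(anchors, dtype=float)
    inter = np.minimum(w, anchors[:, 0]) * np.minimum(h, anchors[:, 1])
    iou = inter / (w * h + anchors[:, 0] * anchors[:, 1] - inter)
    best = int(np.argmax(iou))
    # The assigned anchor sits at the center of that grid cell.
    center = ((col + 0.5) / grid, (row + 0.5) / grid)
    return row, col, best, center

# Made-up anchors (normalized w, h); the real ones come from the clustering step.
anchors = [(0.03, 0.10), (0.06, 0.25), (0.10, 0.20), (0.20, 0.35), (0.40, 0.60)]
print(assign_anchor((0.52, 0.61, 0.09, 0.21), anchors))
```

The network then only has to learn the small offset between the cell center and the true box center, plus the small rescaling between the anchor and the true box size.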

Anchors and ground truth boxes for easy case

The figure below presents the bounding box for a person (pedestrian). Note that the shape of the anchor box is very close to the true shape of the person, and a minor correction in the location of the anchor box will result in the two overlapping well.

Anchor box and ground truth box for a pedestrian

Below are some more examples of ground truth bounding boxes and anchor boxes.

Conclusion

In this post, I covered the concept of generating candidate anchor boxes from bounding box data, and then assigning them to the ground truth boxes. The anchor boxes or templates are computed using K-means clustering with intersection over union (IOU) as the distance measure. The anchors thus computed do not ignore smaller boxes, and give a high IOU with the ground truth boxes. In generating the target for training, each ground truth bounding box is assigned one anchor box that is responsible for predicting it. The anchor box that gives the highest IOU with the ground truth box, when placed at the center of that box, is responsible for predicting that ground truth label. The location of the anchor box is the center of the grid cell within which the center of the ground truth box falls.