Review — CornerNet: Detecting Objects as Paired Keypoints (Object Detection)

Detecting Corners, Outperforms Mask R-CNN, Faster R-CNN, CoupleNet, G-RMI, FPN, TDM, YOLOv2, DSOD, GRF-DSOD, SSD, DSSD, RetinaNet, RefineDet

Sik-Ho Tsang
Nerd For Tech


CornerNet detects the top-left corner and the bottom-right corner

In this story, CornerNet: Detecting Objects as Paired Keypoints (CornerNet), by University of Michigan, is reviewed. In this paper:

  • An object bounding box is detected as a pair of keypoints, the top-left corner and the bottom-right corner, eliminating the need for designing a set of anchor boxes commonly used in prior single-stage detectors.

This is a paper in 2018 ECCV with over 900 citations. (Sik-Ho Tsang @ Medium)


  1. CornerNet: Network Architecture
  2. Corner Detection (Heatmaps & Offsets)
  3. Corner Grouping (Embeddings)
  4. Corner Pooling
  5. Comparisons with State-Of-The-Art Detectors

1. CornerNet: Network Architecture

CornerNet: Network Architecture
  • The hourglass network (Newell ECCV’16), originally designed for human pose estimation, is used as the backbone.
  • The hourglass network is a natural choice because CornerNet also detects keypoints, similar to the goal of human pose estimation networks.
Two examples of bounding box predictions overlaid on predicted heatmaps of corners. Left: top-left corner heatmap, Right: bottom-right corner heatmap
  • Each keypoint is predicted as a heat map.
  • Here, corners are treated as keypoints.
  • In Newell ECCV’16, body parts such as the head, shoulders, palms, etc. are treated as keypoints.
  • The hourglass network is followed by two prediction modules.
  • One module is for the top-left corners, while the other one is for the bottom-right corners.
Multiple branches for predicting the heatmaps, embeddings and offsets.
  • Each module has its own corner pooling module, as shown above, to pool features from the hourglass network before predicting the heatmaps, embeddings and offsets.
  • The depth of the hourglass network is 104.
  • Unlike many other state-of-the-art detectors, only the features from the last layer of the whole network are used to make predictions.
  • The full training loss is:

    L = Ldet + α·Lpull + β·Lpush + γ·Loff

  • where Ldet is the detection loss for the heatmaps, Lpull and Lpush are the losses for the embeddings, and Loff is the loss for the offsets. These losses are described in more detail below.
  • α and β are set to 0.1 and γ is set to 1.
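As a minimal sketch (plain Python; the function name is mine), the weighted combination of the four losses is just:

```python
# Sketch of CornerNet's full training loss. Weights follow the paper:
# alpha = beta = 0.1 for the embedding losses, gamma = 1 for the offset loss.
def total_loss(l_det, l_pull, l_push, l_off, alpha=0.1, beta=0.1, gamma=1.0):
    # L = L_det + alpha * L_pull + beta * L_push + gamma * L_off
    return l_det + alpha * l_pull + beta * l_push + gamma * l_off
```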

2. Corner Detection (Heatmaps & Offsets)

  • Each set of heatmaps has C channels, where C is the number of categories, and is of size H×W.
  • Let pcij be the score at location (i, j) for class c in the predicted heatmaps, and let ycij be the “ground-truth” heatmap augmented with unnormalized Gaussians, similar to human pose estimation.
  • Using a variant of the focal loss from RetinaNet, the detection loss Ldet is:

    Ldet = −(1/N) Σc Σi Σj { (1−pcij)^α·log(pcij) if ycij = 1; (1−ycij)^β·(pcij)^α·log(1−pcij) otherwise }

  • where N is the number of objects in an image, α=2, and β=4. (These α and β are the focal-loss hyperparameters, not the weights in the full training loss above.)
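A rough NumPy sketch of this focal-loss variant (the function name and array shapes are my own, not from the paper’s code):

```python
import numpy as np

def cornernet_focal_loss(pred, gt, alpha=2, beta=4):
    """Sketch of CornerNet's focal-loss variant.
    pred: predicted heatmap scores in (0, 1), shape (C, H, W).
    gt:   ground-truth heatmap with unnormalized Gaussians (1 at corners).
    """
    pos = gt == 1              # ground-truth corner locations
    n = max(pos.sum(), 1)      # N: number of positive (corner) locations
    # Positive locations: standard focal term.
    pos_loss = -np.sum((1 - pred[pos]) ** alpha * np.log(pred[pos]))
    # Negative locations: penalty down-weighted by (1 - y)^beta near a corner.
    neg = ~pos
    neg_loss = -np.sum((1 - gt[neg]) ** beta * pred[neg] ** alpha
                       * np.log(1 - pred[neg]))
    return (pos_loss + neg_loss) / n
```

Note how locations inside a Gaussian bump around a corner (ycij close to 1) receive a reduced negative penalty, since a detection there would still produce a box with large IoU.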
  • When remapping the locations from the heatmaps to the input image, some precision is lost, which can greatly affect the IoU of small bounding boxes with their ground truths.
  • Hence, location offsets are predicted to slightly adjust the corner locations before remapping them to the input resolution:

    ok = (xk/n − ⌊xk/n⌋, yk/n − ⌊yk/n⌋)

  • where ok is the offset, xk and yk are the x and y coordinates of corner k, and n is the downsampling factor.
  • The Smooth L1 Loss, as in Fast R-CNN, is applied at the ground-truth corner locations:

    Loff = (1/N) Σk SmoothL1Loss(ok, ôk)
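A hedged NumPy sketch of the two pieces (function names are mine; n is the downsampling factor):

```python
import numpy as np

def corner_offset(x, y, n):
    """Fractional position lost when a corner (x, y) is downsampled by n."""
    return (x / n - x // n, y / n - y // n)

def offset_loss(pred_offsets, gt_offsets):
    """Smooth L1 loss (as in Fast R-CNN), averaged over N corners.
    pred_offsets, gt_offsets: arrays of shape (N, 2)."""
    d = np.abs(np.asarray(pred_offsets) - np.asarray(gt_offsets))
    per_elem = np.where(d < 1, 0.5 * d ** 2, d - 0.5)  # quadratic, then linear
    return per_elem.sum() / len(d)
```

For example, a corner at (10, 7) downsampled by a factor of 4 lands at heatmap cell (2, 1), so the network should predict the offset (0.5, 0.75) to recover the original position.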

3. Corner Grouping (Embedding)

The network is trained to predict similar embeddings for corners that belong to the same object.
  • Inspired by associative embedding (Newell NIPS’17), CornerNet predicts an embedding vector for each detected corner such that if a top-left corner and a bottom-right corner belong to the same bounding box, the distance between their embeddings should be small.
  • The actual values of the embeddings are unimportant. Only the distances between the embeddings are used to group the corners.
  • Let etk be the embedding for the top-left corner of object k and ebk for the bottom-right corner.
  • The “pull” loss trains the network to group the corners, while the “push” loss separates the corners:

    Lpull = (1/N) Σk [(etk − ek)² + (ebk − ek)²]
    Lpush = (1/(N(N−1))) Σk Σj≠k max(0, Δ − |ek − ej|)

  • where ek is the average of etk and ebk, and Δ=1.
  • Similar to the offset loss, these losses are only applied at the ground-truth corner locations.
Error analysis by replacing prediction with ground-truth
  • Using the ground-truth heatmaps alone improves the AP from 38.5% to 74.0%.
  • Replacing the predicted offsets with the ground-truth offsets further increases the AP by 13.1% to 87.1%.
  • This suggests that although there is still ample room for improvement in both detecting and grouping corners, the main bottleneck is detecting corners.

4. Corner Pooling

Often there is no local evidence to determine the location of a bounding box corner
  • As shown above, there is often no local visual evidence for the presence of corners.
  • Corner pooling is proposed to better localize the corners by encoding explicit prior knowledge.
Corner pooling
  • For the top-left corner, it can be formulated as the following recurrences:

    tij = max(ftij, t(i+1)j) if i < H, otherwise tHj = ftHj
    lij = max(flij, li(j+1)) if j < W, otherwise liW = fliW

  • where ft and fl are the feature maps that are max-pooled vertically and horizontally, respectively, and the pooled output is tij + lij.
  • An example is illustrated below:
Example of corner pooling
  • The top-left corner pooling layer can be implemented very efficiently.
  • The feature map is scanned from right to left for the horizontal max-pooling and from bottom to top for the vertical max-pooling.
  • Then the two max-pooled feature maps are added.
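This scan is just a reversed cumulative maximum, which can be sketched in NumPy (the function name is mine, not from the paper’s code):

```python
import numpy as np

def top_left_corner_pool(f_t, f_l):
    """Sketch of top-left corner pooling: at each location, the max of f_t
    over all rows below plus the max of f_l over all columns to the right."""
    # Vertical max-pooling: scan bottom-to-top (running max along height).
    t = np.maximum.accumulate(f_t[::-1, :], axis=0)[::-1, :]
    # Horizontal max-pooling: scan right-to-left (running max along width).
    l = np.maximum.accumulate(f_l[:, ::-1], axis=1)[:, ::-1]
    return t + l
```

The bottom-right corner pooling layer is the mirror image: scan top-to-bottom and left-to-right instead.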
Corner pooling on MS COCO validation.
  • With corner pooling, there is significant improvement: 2.0% on AP, 2.1% on AP50 and 2.2% on AP75.

5. Comparisons with State-Of-The-Art Detectors

CornerNet versus others on MS COCO test-dev.
  • With multi-scale evaluation, CornerNet achieves an AP of 42.1%, outperforming existing one-stage methods such as YOLOv2, DSOD, GRF-DSOD, SSD, DSSD and RefineDet.
  • CornerNet is also competitive with two-stage methods such as Cascade R-CNN, and outperforms Mask R-CNN, Faster R-CNN, RetinaNet and CoupleNet.


