Center and Scale Prediction for pedestrian detection

Published in knowledge-engineering-seminar · 5 min read · May 21, 2020

This post looks at pedestrian detection, a problem in computer vision. It covers three families of pedestrian detection algorithms and then an algorithm called Center and Scale Prediction [1], which was designed to tackle this problem.

Pedestrian detection

Pedestrian detection is a computer vision task where the objective is to decide whether there are people in the input image and, if so, how many there are and where each of them is located. Although the task is popular, several challenges make it hard to solve.

A few of these challenges are:

Various styles of clothing.
Mutual occlusion between pedestrians.
Occlusion of a pedestrian by accessories they carry, such as an umbrella.

Many fields benefit from a solution to this task, including but not limited to video surveillance (for example security alarms) and self-driving cars, which need to detect people on the road. This makes the task appealing despite its challenges.

Sliding window detection

Sliding window algorithms for pedestrian detection, or object detection in general, are based, as the name suggests, on a sliding window. The window starts at the top left corner of the image and moves to the right by a predefined step size. When it cannot move any further along the X axis, it returns to the left edge and moves down by the step size along the Y axis. This repeats until the window reaches the bottom right corner of the image. At every position the algorithm tries to find a person inside the bounding box defined by the window [2].
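The traversal described above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `sliding_windows` and its parameters are made up for this post, and a real detector would run a classifier on the image crop at each yielded position.

```python
from typing import Iterator, Tuple

Window = Tuple[int, int, int, int]  # (x, y, width, height)


def sliding_windows(image_w: int, image_h: int,
                    win_w: int, win_h: int, step: int) -> Iterator[Window]:
    """Yield window positions left to right, then top to bottom."""
    # Outer loop: move down the Y axis once a full row has been scanned.
    for y in range(0, image_h - win_h + 1, step):
        # Inner loop: move right along the X axis by the step size.
        for x in range(0, image_w - win_w + 1, step):
            yield (x, y, win_w, win_h)


# Toy example: an 8x6 image scanned with a 4x4 window and a step of 2.
windows = list(sliding_windows(image_w=8, image_h=6, win_w=4, win_h=4, step=2))
for window in windows:
    print(window)
```

The number of windows grows quickly as the step size shrinks, and the scan usually has to be repeated for several window sizes, which is one reason this family of detectors tends to be slow.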

Anchor box based detection

Another family of algorithms used for pedestrian detection relies on so-called anchor boxes. A single image can contain thousands of anchor boxes, built from combinations of scale, aspect ratio and position of the objects we want to detect. For each anchor box we calculate the IoU (Intersection over Union) with the object's box, and if it is more than 50% the object is considered detected there. This approach enables the network to detect multiple objects, objects of different scales, or overlapping objects [3].

Visualization of anchor boxes in one image [4]
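To make the IoU test concrete, here is a small sketch that computes the overlap between one anchor box and one object box. The `(x1, y1, x2, y2)` corner format and the example coordinates are assumptions chosen for illustration, and the 0.5 threshold mirrors the 50% mentioned above.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) corners


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    # Corners of the intersection rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


anchor = (0.0, 0.0, 10.0, 20.0)        # hypothetical anchor box
ground_truth = (0.0, 5.0, 10.0, 25.0)  # hypothetical pedestrian box
overlap = iou(anchor, ground_truth)
is_match = overlap > 0.5               # the 50% threshold from the text
print(overlap, is_match)
```

Here the two boxes share 150 of the 250 pixels they cover together, so the IoU is 0.6 and the anchor counts as a match.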

Box-free object detection

Box-free detectors need neither anchor boxes nor sliding windows and detect objects directly from the image.

Center and scale prediction

The Center and Scale Prediction detector, from here on referred to as CSP, is a box-free approach to pedestrian detection. Instead of using anchor boxes, it formulates detection as two separate tasks, predicting the center and predicting the scale of each pedestrian, both solved via convolution. Although CSP can also be used for face detection, this post focuses solely on pedestrian detection.

Thanks to this approach, CSP can generate the bounding boxes of pedestrians in a single pass of a fully convolutional network, without any post-processing scheme except Non-Maximum Suppression.
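For readers unfamiliar with Non-Maximum Suppression, the following is a minimal sketch of the common greedy variant. It is not taken from the CSP code: the detection format `(x1, y1, x2, y2, score)`, the helper names and the 0.5 IoU threshold are assumptions made for this illustration.

```python
from typing import List, Tuple

Detection = Tuple[float, float, float, float, float]  # (x1, y1, x2, y2, score)


def _iou(a: Detection, b: Detection) -> float:
    """Intersection over Union of the box parts of two detections."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def non_max_suppression(dets: List[Detection],
                        iou_threshold: float = 0.5) -> List[Detection]:
    """Greedy NMS: keep the best box, drop boxes that overlap it too much."""
    kept: List[Detection] = []
    # Visit detections from the highest to the lowest score.
    for det in sorted(dets, key=lambda d: d[4], reverse=True):
        if all(_iou(det, k) <= iou_threshold for k in kept):
            kept.append(det)
    return kept


detections = [
    (10.0, 10.0, 50.0, 110.0, 0.90),    # pedestrian A
    (12.0, 12.0, 52.0, 112.0, 0.75),    # near-duplicate of A
    (200.0, 20.0, 240.0, 120.0, 0.60),  # pedestrian B, far away
]
kept = non_max_suppression(detections)
print(kept)
```

The near-duplicate of pedestrian A overlaps the higher-scoring box heavily and is removed, while the distant box for pedestrian B survives.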

Architecture

The architecture of CSP comprises two components: feature extraction and the detection head. Feature extraction simply merges all of the feature maps into one, while the detection head consists of a 3x3 convolutional layer followed by prediction layers for the center location, the scale and the offset.
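To show how the three outputs of the detection head could be turned into bounding boxes, here is a rough decoding sketch. Several details are assumptions for illustration only and are not stated in this post: the maps are plain nested lists, the scale map holds the logarithm of the pedestrian height in image pixels, the width is derived from the height with a fixed `aspect_ratio`, and `stride` is the downsampling factor between the image and the prediction maps.

```python
import math
from typing import List, Tuple

Grid = List[List[float]]
Detection = Tuple[float, float, float, float, float]  # (x1, y1, x2, y2, score)


def decode_boxes(center: Grid, log_height: Grid,
                 offset_y: Grid, offset_x: Grid,
                 stride: int = 4, aspect_ratio: float = 0.41,
                 score_threshold: float = 0.5) -> List[Detection]:
    """Turn center, scale and offset maps into boxes (illustrative only)."""
    boxes: List[Detection] = []
    for row, scores in enumerate(center):
        for col, score in enumerate(scores):
            if score < score_threshold:
                continue  # this location is not a confident center
            height = math.exp(log_height[row][col])  # assumed log-height
            width = aspect_ratio * height            # assumed fixed ratio
            # The offset refines the coarse grid position before upscaling.
            cy = (row + offset_y[row][col]) * stride
            cx = (col + offset_x[row][col]) * stride
            boxes.append((cx - width / 2, cy - height / 2,
                          cx + width / 2, cy + height / 2, score))
    return boxes


# Toy 2x3 maps with a single confident center at row 0, column 1.
center = [[0.1, 0.9, 0.2], [0.0, 0.3, 0.1]]
log_height = [[math.log(100.0)] * 3 for _ in range(2)]
half = [[0.5] * 3 for _ in range(2)]
boxes = decode_boxes(center, log_height, offset_y=half, offset_x=half)
print(boxes)
```

Each confident location yields exactly one box, so no anchor boxes have to be enumerated. Overlapping boxes from neighbouring locations are then left to Non-Maximum Suppression.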

Experiments

To demonstrate the effectiveness of the CSP detector for pedestrian detection, it was evaluated on two of the largest pedestrian detection benchmarks, Caltech and CityPersons. Caltech consists of approximately 2.5 hours of autodriving video, and CityPersons is a large-scale pedestrian detection dataset with various levels of occlusion. For Caltech the model was trained on about 42k frames and tested on 4024 frames, while for CityPersons the model was trained on 2975 images and tested on the 500 validation images. The evaluation metric is the log-average miss rate over false positives per image (FPPI) in the range [10^-2, 10^0], denoted MR^-2; lower is better.
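The metric can be sketched as follows. This follows the commonly used definition of the log-average miss rate, which is assumed here because the post does not spell it out: the miss rate is sampled at nine FPPI reference points evenly spaced in log space between 10^-2 and 10^0, and the samples are averaged geometrically. The function name and the toy curve are made up for illustration.

```python
import math
from typing import List, Tuple


def log_average_miss_rate(curve: List[Tuple[float, float]],
                          num_points: int = 9) -> float:
    """curve: (fppi, miss_rate) pairs sorted by increasing fppi."""
    # Reference points evenly spaced in log space over [10^-2, 10^0].
    refs = [10 ** (-2 + 2 * i / (num_points - 1)) for i in range(num_points)]
    log_sum = 0.0
    for ref in refs:
        # Miss rate at the largest FPPI that does not exceed the reference.
        reachable = [rate for fppi, rate in curve if fppi <= ref]
        # If the curve never reaches this FPPI, count every pedestrian as missed.
        miss_rate = reachable[-1] if reachable else 1.0
        log_sum += math.log(max(miss_rate, 1e-10))  # guard against log(0)
    return math.exp(log_sum / num_points)


# Toy curve: the miss rate drops from 50% to 12.5% once FPPI reaches 0.05.
toy_curve = [(0.001, 0.5), (0.05, 0.125)]
mr2 = log_average_miss_rate(toy_curve)
print(f"MR^-2 = {mr2:.4f}")
```

Because the average is taken in log space, improvements at low false positive rates weigh as much as improvements at high ones.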

Caltech

CSP achieves an MR^-2 of 4.5% on the Reasonable setting, outperforming its competitors, including the best one, RepLoss, by 0.4%, as seen in part a) of the chart below. As demonstrated in part b), CSP is also superior at detecting pedestrians of various scales and occlusion levels. And as seen in part c), CSP performs very well under heavy occlusion, outperforming RepLoss and OR-CNN, which were specifically designed for occlusion cases.

CityPersons

The table below compares CSP with the previous state of the art on CityPersons.

On the Reasonable subset CSP achieves the best performance, with a gain of 1.0% MR^-2 over its closest competitor, ALFNet. Its speed is comparable in the same environment: 0.33 seconds per image against ALFNet’s 0.27 seconds per image. As seen in the table below, where red and green mark the best and the second best performance respectively, CSP beats all the competitors and performs quite well on occlusion cases without any dedicated strategy for handling occlusion.

Comparison with the state of the art on the CityPersons benchmark [1]

Cross-dataset evaluation

The next experiment compares CSP with ALFNet in two settings. In the first, both models were trained on the CityPersons training subset and tested on the CityPersons testing subset. In the second, the models were trained on the CityPersons training subset and tested on the Caltech test subset.

As seen in the table above, for CityPersons -> CityPersons the gap is only 1%, while for CityPersons -> Caltech it increases to 5.9%. This suggests that Center and Scale Prediction generalizes better to another dataset than the anchor-box-based competitors.

Conclusion

This post introduced three families of pedestrian detection algorithms and analysed one algorithm from the box-free family, Center and Scale Prediction. The results presented here show the algorithm’s value for pedestrian detection systems.

References

[1]: Wei Liu, Irtiza Hasan, Shengcai Liao, Center and Scale Prediction: A Box-free Approach for Pedestrian and Face Detection

[2]: Adrian Rosebrock, Sliding Windows for Object Detection with Python and OpenCV

[3]: Mathworks, Anchor Boxes for Object Detection

[4]: Anders Christiansen, Anchor Boxes — The key to quality object detection