Object detection is one of the key aspects of computer vision. There are a lot of pre-trained models able to detect a wide variety of objects. But what if you need to detect custom objects not present in a pre-trained model? This story will show you how to do it using the Detectron2 platform.
Table of contents:
- Object detection overview
- Problem definition
- Project setup
- Register the Dataset
- Model training and evaluation
Object detection overview
These days state-of-the-art object detection models are powered by deep learning and involve two tasks:
- Image classification: predicting the type or class of an object in an image
- Object localization: identifying the location of one or more objects in an image (so we can draw a bounding box around each of them)
We can distinguish two main deep learning approaches for object detection:
- region-based object detectors including R-CNN, Fast R-CNN, Faster R-CNN, R-FCN
- single-shot object detectors (SSD, YOLO, RetinaNet)
Region-based object detectors are two-stage detectors: first, a Region Proposal Network (RPN) generates regions of interest, and then the region proposals are sent down the pipeline for object classification and bounding-box regression. They are generally more accurate, at the expense of computational complexity.
Single-shot object detectors are one-stage detectors: we apply a classifier and bounding-box regressor over a dense, regularly sampled set of possible object locations. They tend to be significantly faster, simpler, and more intuitive, but may not be as accurate.
The exception is the RetinaNet model, proposed by Lin et al. in the 2017 paper Focal Loss for Dense Object Detection. They introduce a new loss function called Focal Loss, a reshaped standard cross-entropy loss that addresses the foreground-background class imbalance affecting single-stage detectors.
Our results show that when trained with the focal loss, RetinaNet is able to match the speed of previous one-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors. — arXiv:1708.02002
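That reshaping can be sketched in a few lines of plain Python. This is a per-example scalar version for intuition only; real detection frameworks implement it over tensors of class probabilities:

```python
import math

def focal_loss(p_t, gamma=2.0, alpha=0.25):
    # p_t is the model's estimated probability for the ground-truth class.
    # The (1 - p_t)**gamma factor shrinks the loss of easy, well-classified
    # examples so that training focuses on the hard ones.
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

# With gamma = 0 and alpha = 1 this is exactly cross-entropy; with the
# paper's gamma = 2, an easy example (p_t = 0.9) contributes far less
# loss than a hard one (p_t = 0.1).
easy = focal_loss(0.9)
hard = focal_loss(0.1)
```

Compared with plain cross-entropy, the scaling factor keeps the huge number of easy background locations from dominating the total loss, which is exactly the class-imbalance problem described above.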
Problem definition
Object Detection is one of the most valuable computer vision tasks. The use cases are endless, be it object tracking, pedestrian detection, video surveillance, activity recognition, face detection, recognition and identification, self-driving cars, and so on.
Let's assume we are building an Automatic Number/License Plate Recognition (ANPR) system. Such a system automatically detects and recognizes license plates in images or a video stream. It then becomes possible to look up the owner of the car and send them a traffic ticket for exceeding the speed limit.
Another case could be collecting data for our self-driving car while staying compliant with privacy regulations (like the EU's General Data Protection Regulation). These regulations imply that an individual (e.g. a car owner) can demand that a company gathering data remove all personal data it holds about them. This requirement is met by anonymizing personal information.
We can observe such anonymization at work in Google Street View.
In both cases, we need to create a license plate detection model to be used in our image/video processing pipeline. Let’s create one!
I have prepared the detectron2-licenseplates project with all the necessary code and data to follow along with this story.
If you know nothing about Detectron2 and how to use it in your computer vision pipeline, look at my previous story:
How to embed Detectron2 in your computer vision project
Use the power of the Detectron2 model zoo.
Project setup
First, clone the project repository:
$ git clone git://github.com/jagin/detectron2-licenseplates.git
$ cd detectron2-licenseplates
$ git checkout edd03e4b31ec52487a506f2ed711ce9faf0b94f6
The edd03e4b31ec52487a506f2ed711ce9faf0b94f6 commit marks the source code compatible with the content of this story.
For the project environment setup, I'm using Conda, which is also included in Anaconda, a data science and machine learning platform. If you are curious about the platform and why you should use it, read: Get your computer ready for machine learning: How, what and why you should use Anaconda, Miniconda and Conda
Let’s create the project environment:
$ conda env create -f environment.yml
$ conda activate detectron2-licenseplates
The created environment includes all the requirements we need to train and test our model on the Detectron2 platform.
To train our model, we will use images from the MediaLab LPR dataset. This dataset doesn't contain annotations, but I created them for you in the PASCAL VOC format using the CVAT tool (there are also other interesting data-labelling tools, like labelimg and labelme).
Here is the structure of our license plates dataset:

```
.
├── annotations
│   ├── 04ow1.xml
│   ├── ...
│   ├── zb35o.xml
│   └── zhr5k.xml
├── images
│   ├── 04ow1.jpg
│   ├── ...
│   ├── zb35o.jpg
│   └── zhr5k.jpg
├── train.txt
└── test.txt
```
The annotations folder contains the Pascal VOC annotation XML files, one file per image. Each file stores metadata about the image: the folder where the image is stored, its filename, its size, and each bounding box. There is only one class: licenseplate.
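For illustration, one of these annotation files looks roughly like this (the coordinates and image size here are made up; the structure follows the Pascal VOC convention):

```xml
<annotation>
    <folder>images</folder>
    <filename>04ow1.jpg</filename>
    <size>
        <width>640</width>
        <height>480</height>
        <depth>3</depth>
    </size>
    <object>
        <name>licenseplate</name>
        <bndbox>
            <xmin>214</xmin>
            <ymin>325</ymin>
            <xmax>331</xmax>
            <ymax>367</ymax>
        </bndbox>
    </object>
</annotation>
```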
Next, we have the images folder with the corresponding JPEG files, one per annotation. The train.txt and test.txt files define our dataset split for training and testing the model.
This dataset cannot be used to build a production-ready model. It is too small. After some cleaning, there are 137 images with one license plate in each. But that’s all we need to play around.
Register the Dataset
For Detectron2 to know how to obtain the dataset, we need to register it and, optionally, register metadata for it.
The process is described in detail in the Detectron2 documentation.
In general, Detectron2 uses its own format for data representation, which is similar to COCO's JSON annotations. It is a matter of implementing a function that returns the items of your custom dataset and registering it:

```python
from detectron2.data import DatasetCatalog

def get_dicts():  # illustrative name
    ...
    return dicts  # in the Detectron2 format

DatasetCatalog.register("my_dataset", get_dicts)
```
For a dataset that is already in the COCO format, Detectron2 provides the register_coco_instances function, which uses the load_coco_json loader for you and adds metadata about your dataset.
Metadata is a key-value mapping that provides information about the dataset, like the names of classes, colors of classes, root of files, etc., which is accessible through MetadataCatalog.get(dataset_name).
In our case, the dataset is in the Pascal VOC format, and there is no general-purpose loader for that format. Fortunately, Detectron2 has an implementation for registering Pascal VOC datasets (see the register_all_pascal_voc function in detectron2/detectron2/data/datasets/builtin.py), which can serve as an inspiration for us.
In our project, the register_licenseplates_voc function in the licenseplates/dataset.py file loads our data and registers it together with metadata:
```python
def register_licenseplates_voc(name, dirname, split):
    ...
```
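The loading half of such a function can be sketched with the standard library alone. Note this is an illustrative stand-in, not the project's actual implementation; in particular, load_voc_instances, the split-file location, and the exact record fields are assumptions:

```python
import os
import xml.etree.ElementTree as ET

def load_voc_instances(dirname, split):
    # Read the list of image ids for the requested split (train or test).
    with open(os.path.join(dirname, split + ".txt")) as f:
        fileids = [line.strip() for line in f if line.strip()]

    records = []
    for fileid in fileids:
        # Parse the Pascal VOC XML annotation for this image.
        tree = ET.parse(os.path.join(dirname, "annotations", fileid + ".xml"))
        size = tree.find("size")
        record = {
            "file_name": os.path.join(dirname, "images", fileid + ".jpg"),
            "image_id": fileid,
            "height": int(size.find("height").text),
            "width": int(size.find("width").text),
            "annotations": [],
        }
        for obj in tree.findall("object"):
            bb = obj.find("bndbox")
            record["annotations"].append({
                "category_id": 0,  # a single class: licenseplate
                # Detectron2 expects absolute [xmin, ymin, xmax, ymax] here
                "bbox": [float(bb.find(t).text)
                         for t in ("xmin", "ymin", "xmax", "ymax")],
            })
        records.append(record)
    return records
```

The registration half would then hand such a loader to DatasetCatalog.register and set thing_classes=["licenseplate"] on the corresponding MetadataCatalog entry, mirroring Detectron2's built-in Pascal VOC registration.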
To check that it works, there is a quick test in the if __name__ == '__main__': block of the code that displays images with their annotations using our loader and Detectron2's visualization utilities. The if __name__ == '__main__': block runs only when the file is executed as the main module, so we can safely import the module elsewhere.

Running:

$ python licenseplates/dataset.py

will display 10 random images with annotations from the train dataset. You can switch to the test dataset with a command-line option.
We are ready to train our model.
Model training and evaluation
Our approach will use transfer learning, where the weights of an existing network architecture are tuned to predict classes that the original network was not trained on.
In practice, very few people train an entire Convolutional Network from scratch (with random initialization), because it is relatively rare to have a dataset of sufficient size. Instead, it is common to pretrain a ConvNet on a very large dataset (e.g. ImageNet, which contains 1.2 million images with 1000 categories), and then use the ConvNet either as an initialization or a fixed feature extractor for the task of interest. — Andrej Karpathy, Transfer Learning
As the Detectron2 documentation states, the DefaultTrainer will:
- create a model, optimizer, scheduler, and dataloader from the given config,
- load a checkpoint or cfg.MODEL.WEIGHTS, if it exists,
- register a few common hooks.
Our Trainer class is as simple as:

```python
class Trainer(DefaultTrainer):
    @classmethod
    def build_evaluator(cls, cfg, dataset_name):
        # Evaluate with Pascal VOC metrics; the body shown here is a
        # reconstruction, see the project source for the exact code.
        return PascalVOCDetectionEvaluator(dataset_name)
```
We could use DefaultTrainer directly, but in our case we want to add some custom detection evaluation. As the metric for measuring the accuracy of the object detector, we use Average Precision (AP, AP50, AP75). The evaluation procedure of the detection task for PASCAL VOC is described here. Be sure to also read Jonathan Hui's excellent article.
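For intuition about what AP50 and AP75 mean: a predicted box counts as a true positive when its Intersection over Union (IoU) with a ground-truth box reaches the threshold (0.5 for AP50, 0.75 for AP75). A minimal IoU computation for boxes in [xmin, ymin, xmax, ymax] format:

```python
def iou(box_a, box_b):
    # Boxes are [xmin, ymin, xmax, ymax] in absolute pixel coordinates.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Two 10x10 boxes overlapping by half their width share 1/3 of their union:
print(iou([0, 0, 10, 10], [5, 0, 15, 10]))  # 0.3333...
```

AP itself is then the area under the precision-recall curve computed from detections classified this way, sorted by confidence.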
The training process goes in four steps:
- Register the license plates dataset
- Setup model configuration
- Run the training process
- Evaluate the model
We already know how to register the dataset, but let's focus a little on the model configuration, which is stored in the configs folder.
Detectron2 provides a lot of different models, which can be accessed through the detectron2.model_zoo package, but we need to modify them for our case (we have only one class to detect) and keep the configs under version control in our repository.
I included two COCO object detection baselines from the Detectron2 Model Zoo, adjusted to our needs:
- Faster R-CNN, a region-based object detector
- RetinaNet, a single-shot object detector
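For illustration, such a config typically extends a model-zoo baseline and overrides just a few values. The exact paths and keys in the project's configs folder may differ, so treat this as a sketch:

```yaml
_BASE_: "faster_rcnn_R_50_FPN_3x.yaml"   # hypothetical path to the baseline config
MODEL:
  ROI_HEADS:
    NUM_CLASSES: 1                       # a single class: licenseplate
DATASETS:
  TRAIN: ("licenseplates_train",)
  TEST: ("licenseplates_test",)
```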
The model config is set up through the setup_cfg function.
Let's train the Faster R-CNN model:
$ python train.py --config-file configs/lp_faster_rcnn_R_50_FPN_3x.yaml
It takes a few minutes to train on this toy dataset (300 iterations) on an RTX 2080 Ti, with results like:
[01/22 14:08:38 d2.utils.events]: eta: 0:00:12 iter: 239 total_loss: 0.139 loss_cls: 0.026 loss_box_reg: 0.115 loss_rpn_cls: 0.000 loss_rpn_loc: 0.004 time: 0.2075 data_time: 0.0048 lr: 0.004795 max_mem: 2357M
[01/22 14:08:42 d2.utils.events]: eta: 0:00:08 iter: 259 total_loss: 0.128 loss_cls: 0.023 loss_box_reg: 0.097 loss_rpn_cls: 0.000 loss_rpn_loc: 0.003 time: 0.2074 data_time: 0.0046 lr: 0.005195 max_mem: 2357M
[01/22 14:08:46 d2.utils.events]: eta: 0:00:04 iter: 279 total_loss: 0.125 loss_cls: 0.024 loss_box_reg: 0.096 loss_rpn_cls: 0.000 loss_rpn_loc: 0.003 time: 0.2072 data_time: 0.0045 lr: 0.005594 max_mem: 2357M
[01/22 14:08:51 fvcore.common.checkpoint]: Saving checkpoint to ./output/model_final.pth
[01/22 14:08:54 d2.engine.defaults]: Evaluation results for licenseplates_test in csv format:
[01/22 14:08:54 d2.evaluation.testing]: copypaste: Task: bbox
[01/22 14:08:54 d2.evaluation.testing]: copypaste: AP,AP50,AP75
[01/22 14:08:54 d2.evaluation.testing]: copypaste: 81.8429,100.0000,100.0000
[01/22 14:08:54 d2.utils.events]: eta: 0:00:00 iter: 299 total_loss: 0.132 loss_cls: 0.025 loss_box_reg: 0.105 loss_rpn_cls: 0.000 loss_rpn_loc: 0.003 time: 0.2083 data_time: 0.0044 lr: 0.005994 max_mem: 2357M
[01/22 14:08:54 d2.engine.hooks]: Overall training speed: 297 iterations in 0:01:02 (0.2090 s / it)
[01/22 14:08:54 d2.engine.hooks]: Total training time: 0:01:05 (0:00:03 on hooks)
To train the RetinaNet model on our dataset, you can run the same script with a different model configuration (it will overwrite the results of the previously trained model):
$ python train.py --config-file configs/lp_retinanet_R_50_FPN_3x.yaml
You can observe all the metrics in TensorBoard by running:
$ tensorboard --logdir output
The trained model is saved to the output/model_final.pth file, and we can use it for prediction on images from the test dataset:
$ python predict.py --config-file configs/lp_faster_rcnn_R_50_FPN_3x.yaml MODEL.WEIGHTS output/model_final.pth
The script will randomly display 10 samples (see the --samples option) from the test dataset.
Did you spot the false positive? You can get rid of it by increasing the confidence threshold with the corresponding command-line option.
Detectron2 is an object detection and segmentation platform released by Facebook AI Research (FAIR) as an open-source project. Beyond state-of-the-art object detection algorithms, it includes numerous models for instance segmentation, panoptic segmentation, pose estimation, DensePose, and TridentNet. Thanks to its modular design, it is easy to reuse them in your research or to create your own custom model.
I hope that this story will help you train your own model. Happy coding!
References:
- What do we learn from region based object detectors (Faster R-CNN, R-FCN, FPN)?
- What do we learn from single shot object detectors (SSD, YOLO), FPN & Focal loss?
- Design choices, lessons learned and trends for object detections?
- Detectron2 documentation
- Get your computer ready for machine learning: How, what and why you should use Anaconda, Miniconda and Conda
- mAP (mean Average Precision) for Object Detection