Digging into Detectron 2 — part 1

Basic Network Architecture and Repo Structure

Hiroto Honda
Jan 5, 2020
Figure 1. Inference result of Faster (Base) R-CNN with Feature Pyramid Network.

Hi, I'm Hiroto Honda, a computer vision researcher¹ [homepage] [linkedin]

In this article I would like to share what I have learned about Detectron 2: the repo structure, how a network is built and trained, how a data set is handled, and so on.

In 2019 I took 6th place in the Open Images competition (ICCV 2019) using maskrcnn-benchmark, on which Detectron 2 is based. Understanding the whole framework was not an easy task for me, so I hope this article helps researchers and engineers who are eager to learn the details of the system and develop their own models.

part 1 (you are here): Introduction — Faster R-CNN FPN architecture and repo structure
part 2 : Feature Pyramid Network

What’s Detectron 2?

Detectron 2² is a next-generation open-source object detection system from Facebook AI Research. With the repo you can use and train various state-of-the-art models for detection tasks such as bounding-box detection, instance and semantic segmentation, and person keypoint detection.

You can run a demo by following the instructions of the repository ([Installation] and [Getting Started]), but if you want to go further than just running example commands, you need to know how the repo works.
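Before digging into the internals, here is roughly what "just running it" looks like. The following is a minimal inference sketch using detectron2's DefaultPredictor; the config name and the image path are only examples, so substitute your own.

# Minimal inference sketch with a pre-trained Faster R-CNN FPN model.
# "input.jpg" is a placeholder path; any test image will do.
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5  # score threshold for the final detections

predictor = DefaultPredictor(cfg)
outputs = predictor(cv2.imread("input.jpg"))  # the predictor expects a BGR image
print(outputs["instances"].pred_boxes)        # detected boxes
print(outputs["instances"].scores)            # confidence scores

Everything interesting happens inside predictor.model, and that model is exactly the network we are about to take apart.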

Faster R-CNN FPN architecture

As an example I choose Base (Faster) R-CNN with Feature Pyramid Network³ (Base-RCNN-FPN), the basic bounding-box detector that can be extended to Mask R-CNN⁴. The Faster R-CNN⁵ detector with an FPN backbone is a multi-scale detector that achieves high accuracy on tiny to large objects, which makes it the de facto standard detector (see Fig. 1).

Let’s look at the structure of the Base R-CNN FPN:

Figure 2. Meta architecture of Base RCNN FPN.

The schematic above shows the meta architecture of the network. Now you can see there are three blocks in it, namely:

  1. Backbone Network: extracts feature maps from the input image at different scales. Base-RCNN-FPN’s output features are called P2 (1/4 scale), P3 (1/8), P4 (1/16), P5 (1/32) and P6 (1/64). Note that non-FPN (‘C4’) architecture’s output feature is only from the 1/16 scale.
  2. Region Proposal Network: detects object regions from the multi-scale features. 1000 box proposals (by default) with confidence scores are obtained.
  3. Box Head: crops and warps the feature maps using the proposal boxes into fixed-size features, and obtains refined box locations and classification results via fully connected layers. Finally, non-maximum suppression (NMS) filters the detections and at most 100 boxes (by default) are kept. The box head is one of the sub-classes of ROI Heads; for example, Mask R-CNN has additional ROI heads such as a mask head. (A small inspection sketch follows this list.)

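To make the three blocks concrete, here is a small sketch that builds the model from the bundled Base-RCNN-FPN config and inspects it. This is an illustration only: the model is built on CPU with randomly initialized weights, which is enough to check the classes and the feature scales.

import torch
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.modeling import build_model

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_1x.yaml"))
cfg.MODEL.DEVICE = "cpu"  # keep the sketch runnable without a GPU

model = build_model(cfg)  # GeneralizedRCNN
print(type(model.backbone).__name__)            # FPN (wrapping a ResNet)
print(type(model.proposal_generator).__name__)  # RPN
print(type(model.roi_heads).__name__)           # StandardROIHeads

# Push a dummy image through the backbone to see the multi-scale outputs,
# from p2 (1/4 of the input size) down to p6 (1/64 of the input size).
with torch.no_grad():
    features = model.backbone(torch.zeros(1, 3, 800, 800))
for name, feat in features.items():
    print(name, tuple(feat.shape[-2:]))

The 1,000 proposals and the 100 final detections mentioned above are plain config values (cfg.MODEL.RPN.POST_NMS_TOPK_TEST and cfg.TEST.DETECTIONS_PER_IMAGE), so you can change them without touching the code.
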
What’s inside each block? Fig. 3 shows the detailed architecture:

Figure 3. Detailed architecture of Base-RCNN-FPN. Blue labels represent class names.

Well, it's more complicated! Let's leave it for now and look at the repository.

Structure of the Detectron 2 repo

The following is the directory tree of Detectron 2 (under the ‘detectron2’ directory⁶). For now, please just look at the ‘modeling’ directory: the Base-RCNN-FPN architecture is built from several classes under it.

detectron2
├─checkpoint <- checkpointer and model catalog handlers
├─config <- default configs and handlers
├─data <- dataset handlers and data loaders
├─engine <- predictor and trainer engines
├─evaluation <- evaluator for each dataset
├─export <- converter of detectron2 models to caffe2 (ONNX)
├─layers <- custom layers e.g. deformable conv.
├─model_zoo <- pre-trained model links and handler
├─modeling
│ ├─meta_arch <- meta architecture e.g. R-CNN, RetinaNet
│ ├─backbone <- backbone network e.g. ResNet, FPN
│ ├─proposal_generator <- region proposal network
│ └─roi_heads <- head networks for pooled ROIs e.g. box, mask heads
├─solver <- optimizer and scheduler builders
├─structures <- structure classes e.g. Boxes, Instances, etc
└─utils <- utility modules e.g. visualizer, logger, etc
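
How do these directories turn into the network of Fig. 3? The entry point is a registry: the meta-architecture name written in the config selects a registered class, and that class builds everything else. The following sketch paraphrases what detectron2.modeling.meta_arch.build does (it is not a verbatim copy of the repo code):

# The meta-architecture name in the config picks the registered class,
# and the class constructor receives the whole config.
from detectron2.modeling import META_ARCH_REGISTRY

def build_model(cfg):
    meta_arch = cfg.MODEL.META_ARCHITECTURE      # "GeneralizedRCNN" for Base-RCNN-FPN
    return META_ARCH_REGISTRY.get(meta_arch)(cfg)

The backbone, the proposal generator and the ROI heads are looked up the same way through their own registries, which is why swapping one component is usually a config change plus a registered class. For Base-RCNN-FPN, the classes that get picked up are the following: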

Meta Architecture:
GeneralizedRCNN (meta_arch/rcnn.py)
which has:

  1. Backbone Network:
    FPN (backbone/fpn.py)
    ResNet (backbone/resnet.py)
  2. Region Proposal Network:
    RPN (proposal_generator/rpn.py)
    StandardRPNHead (proposal_generator/rpn.py)
    RPNOutputs (proposal_generator/rpn_outputs.py)
  3. ROI Heads (Box Head):
    StandardROIHeads (roi_heads/roi_heads.py)
    ROIPooler (poolers.py)
    FastRCNNConvFCHead (roi_heads/box_head.py)
    FastRCNNOutputLayers (roi_heads/fast_rcnn.py)
    FastRCNNOutputs (roi_heads/fast_rcnn.py)

Each block has a main class and sub-classes.
Now please look at the blue labels in Fig. 3: you can see which class corresponds to which part of the pipeline. I am going to show the details of each class in the following parts.
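
If you want to check the correspondence yourself, the class names are spelled out in the config. Here is a small sketch; the values in the comments are what I expect the Base-RCNN-FPN config to set, so verify them against your own config file.

from detectron2 import model_zoo
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_1x.yaml"))

print(cfg.MODEL.META_ARCHITECTURE)        # "GeneralizedRCNN"
print(cfg.MODEL.BACKBONE.NAME)            # "build_resnet_fpn_backbone" -> ResNet + FPN
print(cfg.MODEL.PROPOSAL_GENERATOR.NAME)  # "RPN"
print(cfg.MODEL.RPN.HEAD_NAME)            # "StandardRPNHead"
print(cfg.MODEL.ROI_HEADS.NAME)           # "StandardROIHeads"
print(cfg.MODEL.ROI_BOX_HEAD.NAME)        # "FastRCNNConvFCHead"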

[Update: Jul. 7, 2020]

Here I add the architecture figure without class names.

Figure 4. Detailed architecture of Base-RCNN-FPN (without class names).

To Be Continued…

That’s it for this part. Thank you for reading and please wait for the next part!

part 1 (you are here): Introduction — Basic Network Architecture and Repo Structure
part 2 (next story!) : Feature Pyramid Network
part 3: Data Loader and Ground Truth
part 4: Region Proposal Network
part 5: ROI (Box) Head

Check this out too!

I also published the slides named ‘Digging into Sample Assignment Methods for Object Detection’, where I focus on how the detectors (Faster-RCNN, RetinaNet, YOLOv1–5, etc) define training samples for the given feature map and ground truth boxes.

https://speakerdeck.com/hirotohonda/digging-into-sample-assignment-methods-for-object-detection

[1] This is a personal article and the opinions expressed here are my own and not those of my employer.
[2] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo and Ross Girshick, Detectron2. https://github.com/facebookresearch/detectron2, 2019.
[3] T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
[4] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
[5] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[6] as of Jan. 5, 2020. The file, directory, and class names are cited from the repository² ( Copyright 2019, Facebook, Inc. )
