Digging into Detectron 2 — part 3

Data Loader and Ground Truth

Hiroto Honda
4 min read · Feb 7, 2020
Figure 1. Inference result of Faster (Base) R-CNN with Feature Pyramid Network.

Hi I’m Hiroto Honda, a computer vision researcher¹. [homepage] [linkedin]

In this article series I would like to share my learnings about Detectron 2 — repo structure, building and training a network, handling a data set and so on.

Detectron 2² is a next-generation open-source object detection system from Facebook AI Research.

In part 2, I showed the details of the Feature Pyramid Network (FPN).
Before proceeding to the Region Proposal Network (RPN), we should understand the data structure of the ground truth. In this part I’m going to share how ground truth is loaded from a dataset and how the loaded data are processed before being fed to the network.

Where in the network are ground truth data used?

To train a detection model, you need to prepare images and annotations.
As for the Base-RCNN-FPN (Faster R-CNN), the ground truth data are used in the Region Proposal Network (RPN) and the Box Head (see Fig. 2).

Figure 2. Ground truth box annotations are used in the Region Proposal Network and Box Head to calculate losses.

Annotation data for object detection consist of:

Box label: location and size of the object (e.g. [x, y, w, h])
Category label: object class id (e.g. 12: “parking meter”)

Note that the RPN does not learn to classify object categories, so category labels are used only at the ROI Heads.

The ground truth data are loaded from the annotation file of the specified dataset. Let’s look at the process of data loading.

Data loader

The data loader of Detectron 2 is nested in multiple levels. It is built by the builder (build_detection_train_loader) before training starts³.

  • dataset_dicts (list) is a list of annotation records loaded from the registered dataset.
  • DatasetFromList (data.Dataset) takes dataset_dicts and wraps it as a torch dataset.
  • MapDataset (data.Dataset) calls the DatasetMapper class to map each element of DatasetFromList: it loads the image, transforms the image and annotations, and converts the annotations to an ‘Instances’ object (a minimal code sketch follows Figure 3).
Figure 3. Data loader of Detectron 2.
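
To make the nesting concrete, here is a minimal sketch of how these pieces fit together. It is simplified (the actual wiring lives in detectron2/data/build.py), and cfg is assumed to be a standard Detectron 2 config object:

from detectron2.data import DatasetCatalog, DatasetMapper
from detectron2.data.common import DatasetFromList, MapDataset

dataset_dicts = DatasetCatalog.get("mydataset")       # list of annotation dicts
dataset = DatasetFromList(dataset_dicts, copy=False)  # wrap as a torch dataset
mapper = DatasetMapper(cfg, is_train=True)            # loads & transforms each record
dataset = MapDataset(dataset, mapper)                 # applies the mapper lazily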

Loading annotation data

Let’s assume we have a dataset called ‘mydataset’ with the following image and annotations⁴.

Figure 4. Example of an image and annotations

To load data from a dataset, it must be registered in DatasetCatalog. For instance, to register mydataset:

from detectron2.data import DatasetCatalog
from mydataset import load_mydataset_json

def register_mydataset_instances(name, json_file):
    DatasetCatalog.register(name, lambda: load_mydataset_json(json_file, name))

and call the register_mydataset_instances function, specifying your JSON file path.
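
For example (the JSON path below is just a placeholder):

register_mydataset_instances("mydataset", "path/to/mydataset_annotations.json")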

The load_mydataset_json function must parse the JSON annotation file and return a list of dict records like the following:

[
  {
    'file_name': 'imagedata_1.jpg',              # image file name
    'height': 640,                               # image height
    'width': 640,                                # image width
    'image_id': 12,                              # image id
    'annotations': [                             # list of annotations
      {'iscrowd': 0,                             # crowd flag
       'bbox': [180.58, 162.66, 24.20, 18.29],   # bounding box label
       'category_id': 9,                         # category label
       'bbox_mode': <BoxMode.XYWH_ABS: 1>},      # box coordinate mode
      ...
    ]
  },
  ...
]

For the COCO dataset (Detectron 2’s default), the load_coco_json function plays this role.
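
For a custom dataset, the loader might look like the following minimal sketch. It assumes a COCO-like JSON layout with top-level "images" and "annotations" lists; the field names inside the file are assumptions for illustration:

import json
from detectron2.structures import BoxMode

def load_mydataset_json(json_file, dataset_name):
    # Assumed COCO-like layout: {"images": [...], "annotations": [...]}
    with open(json_file) as f:
        data = json.load(f)
    # Group the annotations by the image they belong to
    annos_per_image = {}
    for ann in data["annotations"]:
        annos_per_image.setdefault(ann["image_id"], []).append(ann)
    dataset_dicts = []
    for img in data["images"]:
        record = {
            "file_name": img["file_name"],
            "height": img["height"],
            "width": img["width"],
            "image_id": img["id"],
            "annotations": [
                {
                    "iscrowd": ann.get("iscrowd", 0),
                    "bbox": ann["bbox"],            # [x, y, w, h]
                    "category_id": ann["category_id"],
                    "bbox_mode": BoxMode.XYWH_ABS,  # absolute [x, y, w, h]
                }
                for ann in annos_per_image.get(img["id"], [])
            ],
        }
        dataset_dicts.append(record)
    return dataset_dicts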

Mapping data

During training, the registered annotation records are picked one by one. We need the actual image data (not just the file path) and the corresponding annotations. The dataset mapper (DatasetMapper) processes each record to add an ‘image’ tensor and an ‘Instances’ object to the dataset_dict. ‘Instances’ is the ground-truth structure class of Detectron 2.

  1. Load and transform images
    An image specified by ‘file_name’ is loaded by the read_image function. The loaded image is transformed by pre-defined transforms (such as left-right flip), and finally the image tensor whose shape is (channel, height, width) is registered.
  2. Transform annotations
    The ‘annotations’ of dataset_dict are transformed by the same transformations applied to the image. For example, if the image has been flipped, the box coordinates are changed to the flipped locations.
  3. Convert annotations to Instances
    The annotations are converted to Instances by the annotations_to_instances function called in the dataset mapper (see the sketch after this list). ‘bbox’ annotations are registered to a Boxes structure object, which can store a list of bounding boxes. ‘category_id’ annotations are simply converted to a torch tensor.
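
Putting the three steps together, here is a condensed sketch of what the mapper does for one record, written with Detectron 2’s transforms API and simplified from detectron2/data/dataset_mapper.py (only a horizontal flip is used for brevity):

import copy
import torch
from detectron2.data import detection_utils as utils
from detectron2.data import transforms as T

def map_record(dataset_dict):
    dataset_dict = copy.deepcopy(dataset_dict)
    # 1. Load the image and apply the augmentations
    image = utils.read_image(dataset_dict["file_name"], format="BGR")
    augs = T.AugmentationList([T.RandomFlip(horizontal=True)])
    aug_input = T.AugInput(image)
    transforms = augs(aug_input)  # updates aug_input.image, returns the transforms
    image = aug_input.image
    dataset_dict["image"] = torch.as_tensor(image.transpose(2, 0, 1).copy())
    # 2. Apply the same transforms to the box annotations
    annos = [
        utils.transform_instance_annotations(a, transforms, image.shape[:2])
        for a in dataset_dict.pop("annotations")
    ]
    # 3. Convert the annotations to an Instances object
    dataset_dict["instances"] = utils.annotations_to_instances(annos, image.shape[:2])
    return dataset_dict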

After mapping, the dataset_dict looks like:

{
  'file_name': 'imagedata_1.jpg',
  'height': 640,
  'width': 640,
  'image_id': 0,
  'image': tensor([[[255., 255., 255.,  ...,  29.,  34.,  36.],
                    ...
                    [169., 163., 162.,  ...,  44.,  44.,  45.]]]),
  'instances': {
    'gt_boxes': Boxes(tensor([[100.55, 180.24, 114.63, 103.01],
                              [180.58, 162.66, 204.78, 180.95]])),
    'gt_classes': tensor([9, 9]),
  }
}

Now we have images and ground-truth annotations from which Detectron 2 models can learn.
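
If you want to check the mapped data yourself, one way (assuming ‘mydataset’ has been registered as above and the default config works for it) is to build the training loader and pull one batch:

from detectron2.config import get_cfg
from detectron2.data import build_detection_train_loader

cfg = get_cfg()
cfg.DATASETS.TRAIN = ("mydataset",)
loader = build_detection_train_loader(cfg)  # the nested loader from Figure 3
batch = next(iter(loader))                  # a list of mapped dataset_dicts
print(batch[0]["image"].shape)              # e.g. torch.Size([3, H, W])
print(batch[0]["instances"])                # Instances with gt_boxes and gt_classes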

To Be Continued…

In the next part we will see how the Region Proposal Network learns object locations. Thank you for reading and please wait for the next part!

part 1: Introduction — Basic Network Architecture and Repo Structure
part 2: Feature Pyramid Network
part 3 (you are here) : Data Loader and Ground Truth Instances
part 4 (next story!): Region Proposal Network
part 5: ROI (Box) Head

[1] This is a personal article and the opinions expressed here are my own and not those of my employer.
[2] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo and Ross Girshick, Detectron2. https://github.com/facebookresearch/detectron2, 2019. The file, directory, and class names are cited from the repository ( Copyright 2019, Facebook, Inc. )
[3] In some cases, AspectRatioGroupedDataset is used additionally to group the data into landscape and portrait image groups judging from image sizes.
[4] It’s one of my vacation photos and not from a specific dataset :)
