Creating a Labelled Dataset using a Pretrained Model

COCO, RetinaNet, IceVision and Roboflow

Maria L Rodriguez
10 min read · Sep 30, 2021

The bottleneck in the advancement of artificial intelligence (AI) is no longer the absence of technology, nor the absence of human knowledge. It is also not the absence of data, for we are swimming in a sea of data. The bottleneck is the absence of labelled data that can serve as a reference from which models can learn.

Scientists are trying to address this in various ways. Manual labelling of data for supervised learning is dependable if done by domain experts, but it is tedious and potentially expensive. Self-supervised learning techniques require less manual supervision from humans, but they need substantial computing power and huge amounts of data.

Self-training offers a middle ground between fully supervised and unsupervised learning. It still requires a reasonably large set of labelled reference data to generate a teacher model.

This blog will describe how to utilize a pretrained model to generate labelled data that can later be used to feed models. We will use open-source libraries and tools including Colab, PyTorch/Fast.ai/IceVision and Roboflow.

So, if you want to advance your knowledge beyond pets and cars, open your Notebook and code along :)

Outline:

A. Rationale and Objective for the project

B. Installation and imports

C. Gathering images

D. Establishing paths and directories

E. Inference

F. Generate annotation file

G. Refine annotations

H. Parse refined Data

A. Rationale and Objective for the Project

The bigger project aims to create a model that can detect different surgical instruments. However, there are currently no datasets that can be used to train such a model.

Therefore, as a subproject, a dataset containing at least 1,000 labelled images will need to be created. Some surgical instruments fall into two categories that are included in the COCO dataset: scissors and knife. A model pretrained on COCO can therefore facilitate the detection and labelling of these two types of instruments.

B. Installation and imports

!wget https://raw.githubusercontent.com/airctic/icevision/master/install_colab.sh
!bash install_colab.sh

Let the above installation finish before proceeding.

# instead of restarting the kernel as suggested by the output, run:
exit()
import icevision
from icevision.all import *

C. Gathering images

There are various ways of generating a dataset of images; for an example, refer here.

For this exercise, we will not need a lot of data because we will not be training a model: 5–20 images should be enough to show proof of concept.

For our purposes, 15 images of surgical scissors and scalpels were uploaded to GitHub.

!git clone https://github.com/yrodriguezmd/pilot15_for_pseudolabel.git

Feel free to use your own or another dataset: clone it, or upload it to Colab (refer to Section C here for uploading). I suggest using images in JPEG format (some users of annotation tools have reported difficulties with PNG files).

D. Establishing paths and directories

!ls           # output: pilot15_for_pseudolabel  (among other files)
!ls pilot15_for_pseudolabel/ # output: individual image filenames
image_path = Path('pilot15_for_pseudolabel/')
img_files = get_image_files(image_path)
img = PIL.Image.open(img_files[0])

This series of commands specifies where to find the images, fetches the image files, and opens an image.

img.to_thumb(150,150)

Since we are planning to use a model pretrained on the COCO dataset, we will use the COCO object classes.

CLASSES = ('person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck', 'boat', 'traffic light', 'fire hydrant',
'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog','horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe','backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat','baseball glove', 'skateboard', 'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot','hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink','refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush')
class_map = ClassMap(CLASSES)
len(class_map) # output: 81
  • The trimmed COCO dataset has 80 classes, and the ClassMap will account for ‘background’ as class = 0.
print(class_map.get_by_name('knife'))         # output: 44
print(class_map.get_by_name('scissors')) # output: 77
  • The pilot15 dataset contains scissors and knives, so we will be expecting class labels 44 and 77 to be predicted.
  • Note: The initial COCO dataset contained more than 80 classes. Verify that the classes correspond to the expected labels. If the original set is used, ‘44’ will correspond to ‘bottle’ instead of ‘knife’.

E. Inference

E.1. Pretrained model

Landmark datasets and models have published checkpoints that contain the model configuration and weights. A few of these have been compiled in the IceVision framework.

from icevision.models.checkpoint import *

The RetinaNet model implemented in MMDetection uses a combination of a ResNet backbone and a Feature Pyramid Network (FPN). It addresses the class imbalance caused by backgrounds and improves detection across different scales.

RetinaNet-ResNet50-FPN trained on the COCO dataset has an Average Precision of 39.5.
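
The background imbalance is handled by RetinaNet's focal loss, which down-weights easy, well-classified (mostly background) examples so that the rarer, harder objects dominate the loss. Below is a minimal, self-contained sketch of the binary focal loss; the alpha and gamma defaults follow the RetinaNet paper, and the function is purely illustrative since the pretrained model already has this loss built in.

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # Per-element binary cross-entropy.
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balancing weight
    # (1 - p_t) ** gamma shrinks the loss of confident (easy) predictions.
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()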

model_type = models.mmdet.retinanet
backbone = model_type.backbones.resnet50_fpn_1x
model = model_type.model(backbone=backbone(pretrained=True),
num_classes=len(class_map))
backbone.__dict__
  • The pretrained model contains the model architecture as well as the weights obtained after training for object detection with COCO.

E.2. Generate predictions

We expect the above model to be reasonably able to classify and localize scissors and knives. We will take advantage of this to generate annotations for the scissors and knives in the pilot15 dataset.

model.eval()
  • For generating predictions, the PyTorch model.eval() is called. This switches batch normalization and dropout layers to their evaluation behaviour, which is appropriate for inference rather than training.
imgs_array = [PIL.Image.open(file) for file in img_files]
imgs_array = [image.convert('RGB') for image in imgs_array]
  • Some of the images are in grayscale (single channel for luminosity). Converting these to RGB (3 channels) will orient the tensors to the expected shape.
img_size = 384
valid_tfms = tfms.A.Adapter([*tfms.A.resize_and_pad(img_size), tfms.A.Normalize()])
infer_ds = Dataset.from_images(imgs_array, valid_tfms, class_map=class_map)
samples = [infer_ds[0] for _ in range(3)]
show_samples(samples, denormalize_fn=denormalize_imagenet, ncols=3)
  • The transformations for inference are limited to resizing, padding and normalization. Thus, we do not expect to see significantly transformed images (unlike the transformations developed for training).
infer_dl = model_type.infer_dl(infer_ds, batch_size=4, shuffle=False)
preds_saved = model_type.predict_from_dl(model, infer_dl, keep_images=True)
  • predict_from_dl uses the model to generate classification and bounding box (bbox) predictions for the objects in the images.
show_preds(preds_saved, font_size=30, label_color='#ffff00')
  • The predicted bboxes and object classes are shown, along with the probability score for each class.
  • The predict function within predict_from_dl has a default detection_threshold of 0.5 for Non-Maximum Suppression (NMS): some areas in an image will have more than one prediction. The prediction with the highest score is chosen as the base prediction. The IOU is then determined between this base prediction and each of the other predictions. If the IOU is > 50% (i.e. detection_threshold = 0.5), the two boxes overlap significantly and are considered to refer to the same object, so the prediction with the lower probability score is eliminated. Conversely, if the overlap is < 50%, the predictions likely refer to different objects. A minimal sketch of this logic follows below.
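
The sketch below illustrates the NMS logic described above in plain Python; the boxes are [xmin, ymin, xmax, ymax] lists and the 0.5 IOU cut-off is passed in as a parameter. It is only an illustration of the idea, not the code the library runs internally.

def box_iou(a, b):
    # Intersection over union of two [xmin, ymin, xmax, ymax] boxes.
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    # Keep the highest-scoring box, drop any remaining box that overlaps it
    # by more than iou_threshold, then repeat with the next-best survivor.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        base = order.pop(0)
        keep.append(base)
        order = [i for i in order if box_iou(boxes[base], boxes[i]) <= iou_threshold]
    return keep   # indices of the surviving predictions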

F. Generate annotation file

show_sample(preds_saved[3])
preds_saved[3].pred
  • Looking closer at a sample, the annotations generated by the model provide the record ID, image size, labels (77 = ‘scissors’), and bboxes in the [xmin, ymin, xmax, ymax] format.

F.1. Add filepath

for pred in preds_saved:
    pred.add_component(FilepathRecordComponent())

for _ in range(len(preds_saved)):
    preds_saved[_].set_filepath(img_files[_])
  • A filepath needs to be included in the annotations so that the information can be associated with the corresponding image. After adding the component and filepath, a sample BaseRecord will now look like this:

F.2. COCO format

For consistency we will adopt the COCO format, notably for the bboxes (i.e., [xmin, ymin, width, height] instead of [xmin, ymin, xmax, ymax]).
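
The arithmetic behind the bbox conversion is simple. The helper below is only a sketch of what the conversion does for each box; the sample values come from the annotation discussed in Section F.2.b.

def xyxy_to_xywh(box):
    # [xmin, ymin, xmax, ymax]  ->  [xmin, ymin, width, height] (COCO convention)
    xmin, ymin, xmax, ymax = box
    return [xmin, ymin, xmax - xmin, ymax - ymin]

xyxy_to_xywh([13, 8, 371, 343])   # output: [13, 8, 358, 335]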

conv = convert_preds_to_coco_style(preds_saved) 

The conversion will output a nested dictionary with two major keys: ‘images’ and ‘annotations’.

F.2.a. ‘images’

  • The ‘images’ dictionary contains the filenames (generated from the filepaths).
  • It also contains each image’s width and height (all 384, the size set during the transforms stage in Section E.2).
  • An ‘id’ is available for image record identification.
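
Putting these fields together, a single entry in annot['images'] looks roughly like the following. The field names follow the COCO convention and the filename is only illustrative.

{"file_name": "scissors_01.jpg",   # hypothetical filename, derived from the filepath
 "width": 384,                     # the resize set in Section E.2
 "height": 384,
 "id": 1}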

F.2.b. ‘annotations’

  • The ‘bbox’ now corresponds to [xmin, ymin, box width, box height]. Looking at the image and bbox that the sample annotation refers to, with an image size of 384 x 384 and with xmin and ymin referring to the top left corner, an xywh of [13, 8, 358, 335] is reasonable.
  • Note that in PyTorch and computer vision, x = 0 and y = 0 is at the top left of the image, unlike the conventional approach in graphing where [0, 0] is at the bottom left.
  • The ‘area’ corresponds to the product of the bbox width and height.
  • The ‘category_id’ refers to the class or label (e.g., 77 = scissors).
  • The ‘id’ refers to the identification number for the annotation.
  • The ‘image_id’ corresponds to the identification number for the image.
  • ‘iscrowd’ indicates whether the annotation covers a single object (0) or a crowd of objects annotated as one region (1).
  • The ‘score’ is the probability score for the object’s classification.
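
Similarly, a single entry in annot['annotations'] looks roughly like the following; the field names follow the COCO convention and the values are illustrative, drawn from the sample bbox above.

{"image_id": 1,                  # links the annotation to an entry in 'images'
 "bbox": [13, 8, 358, 335],      # [xmin, ymin, width, height]
 "area": 358 * 335,              # = 119930
 "category_id": 77,              # 77 = 'scissors' in the trimmed COCO class map
 "id": 1,
 "iscrowd": 0,                   # a single object, not a crowd
 "score": 0.87}                  # hypothetical probability score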

F.2.c. completing the COCO format

The keys of the annotation file’s nested dictionary need to follow the accepted format for it to be recognized by other applications, such as annotation tools.

The present annotation already has ‘images’ and ‘annotations’ keys. To complete the format, ‘info’ and ‘categories’ keys are added.

add_info = {
    "info": {
        "description": "Surgery Instruments",
        "url": "http://cocodataset.org",
        "version": "1.0",
        "year": 2021,
        "contributor": "MR",
        "date_created": "2021/09/27"
    },
    "categories": [
        {
            "supercategory": "kitchen",
            "id": 44,
            "name": "knife"
        },
        {
            "supercategory": "indoor",
            "id": 77,
            "name": "scissors"
        }
    ]
}
  • The ‘info’ key can contain information that is relevant to the author.
  • The ‘categories’ key contains the COCO object names, their assigned id numbers and supercategories. You have the option of including all 80 categories, or adjusting the list based on your objectives. Since the objective of this mini-project was to detect scissors and knives, ‘categories’ was limited to these two to avoid crowding the notebook with unnecessary information.
annot = {**add_info, **conv}
  • The merged nested dictionary annot should now contain the major keys ‘info’, ‘categories’, ‘images’ and ‘annotations’. The information in this dictionary is what will be used for communicating with the annotation tool.

F.3. JSON file

f = open('annot.json','w') 
f.write(str(annot))
f.close()
  • For consistency with the COCO processing, the annot file is written as JSON.
  • The code above will generate the .json file and save it as a temporary Colab file.

The annot.json is then downloaded locally. Using either a text editor or VS Code, replace the single quotes with double quotes, otherwise the file will not be considered a true JSON file.
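
As an alternative sketch, Python's built-in json module writes double-quoted JSON directly, so the manual find-and-replace can be skipped. This assumes every value in annot is a plain Python type; NumPy scalars, if present, would need to be cast first.

import json

with open('annot.json', 'w') as f:
    json.dump(annot, f)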

G. Refine annotations

G.1. Upload data to annotation tool

There are various annotation tools that can be used. Roboflow was chosen because of its user-friendly interface and easy uploading and exporting of data. For our purposes, there are no charges for its use.

Open a new workspace in Roboflow and upload the revised annot.json file and the pilot15 images (you now have these in your Colab folder and can easily access them).

G.2. Check and adjust annotations

Once uploaded, most of the predicted bboxes will be shown superimposed on their corresponding images. The bboxes can be easily adjusted and the classes renamed as necessary. Annotations can also be added if warranted.

Once the annotation refinement is done, click on the ‘Generate New Version’ button. There are options for splitting the dataset as well as performing further transforms. For this set, all images were assigned to the ‘train’ set and no transforms were applied.

G.3. Export refined data

Once the new version is generated, click on the ‘Export’ button. For this case, we opt to export by downloading the zip file to our local computer. The zip file is automatically unzipped locally, and the folder is then uploaded to Colab, in this case under the folder name ‘pilot15_roboflow’.
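
Alternatively, the export can be kept as a zip, uploaded to Colab as-is, and unzipped in the notebook (the filename below is illustrative):

!unzip -q pilot15_roboflow.zip -d pilot15_roboflow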

!ls pilot15_roboflow   # output: _annotations.coco.json, as well as the images exported from Roboflow

H. Parse refined Data

By parsing the refined data, we can check whether the annotation remains consistent and can be used for training in a later step.

parser = parsers.COCOBBoxParser(
    annotations_filepath=Path('/content/pilot15_roboflow/_annotations.coco.json'),
    img_dir=Path('/content/pilot15_roboflow/'))
  • We will use the images exported from Roboflow along with the annotations. This avoids ‘shifting’ of the bboxes that may be seen when the new annotations are superimposed on the original images.
data_splitter = RandomSplitter([1.0,0])  
train_records, valid_records = parser.parse(data_splitter)
  • The parser splits the data into a 0.8 / 0.2 ratio by default. This split is necessary for training. However, for this case we want the whole dataset in a single set.
show_records(train_records[3:],ncols=3, font_size=30, label_color = '#ffff00')

Looking at the parsed data, we are able to confirm that the annotations are intact and correct.

Summary:

With the aim of creating a labelled dataset that can be used for training, we utilized a model pretrained on COCO to generate an initial set of annotations. These annotations were then refined in Roboflow. The parsed, refined data show excellent annotation of the images.

Moving Forward:

  • Increasing the dataset size
  • Including other surgical object classes
  • Merging the labelled dataset with an unlabelled dataset for Self-Training

I hope you enjoyed playing! :)

Maria

LinkedIn: https://www.linkedin.com/in/rodriguez-maria/

Github: https://github.com/yrodriguezmd?tab=repositories

Twitter: https://twitter.com/Maria_Rod_Data

*Image courtesy of Aaron Burden on Unsplash
