Object Detection using a Deep Neural Network

IceVision, Retinanet, Resnet50 and bounding boxes

Maria L Rodriguez
6 min read · Aug 21, 2021
* image courtesy of Unsplash/ Jackie Zhao

Computer vision has a wide range of applications in modern society. One aspect of computer vision is Object Recognition, which can be divided into three sub-fields: Image Classification, Object Localization and Object Detection. We have discussed Single- and Multi-label Classification in previous posts. Object Localization involves assigning bounding boxes to relevant objects in an image. Object Detection combines the objectives of classification and localization.

This blog will focus on Object Detection: we will identify the objects in an image, localize each with a bounding box and assign it a class. We will combine code from the resources and docs of the Airctic/ IceVision framework.

Outline:

A. Set-up

B. Data Loading and fast exploration

C. Data Preparation

C.1. Formatting

C.2. Transformations

D. Modelling

D.1. Model and backbone

D.2. Dataloader

D.3. Metric

D.4. Learning

E. Visualize Results

F. Inference

If you want to get acquainted with State-of-the-Art in Computer Vision technology, open your Notebook and walk the code with me!

A. Set-up

I used Colab Pro with a GPU runtime and the High-RAM setting.

Familiarity with PyTorch / Fast.ai coding would help. For an intro/ refresher on Fast.ai, proceed here.

!wget https://raw.githubusercontent.com/airctic/icevision/master/install_colab.sh
!bash install_colab.sh
from icevision.all import *
import icedata

B. Data Loading and fast exploration

path = icedata.pets.load_data()
# path.ls()
Path('/root/.icevision/data/pets/annotations').ls()
  • IceVision utilizes the Oxford-IIIT Pet dataset to facilitate learning.
  • The .ls() will give you an idea of the subfolders available.
Path('/root/.icevision/data/pets/images').ls()

The dataset comprises 7,393 jpg files, together with their annotations.
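To double-check the count yourself, a quick (purely illustrative) snippet is to glob the images folder directly:

from pathlib import Path

images = list(Path('/root/.icevision/data/pets/images').glob('*.jpg'))
len(images)  # number of jpg files in the images folder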

C. Data Preparation

C.1. Formatting

class_map = icedata.pets.class_map()

The class_map utility maps each numeric ID to its corresponding class name. The map holds the dataset's 37 dog/cat breeds plus a background class.
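An optional quick peek at the mapping (printing a ClassMap lists its names and ids):

print(class_map)  # e.g. <ClassMap: {'background': 0, 'Abyssinian': 1, ...}>
len(class_map)    # total number of entries, background included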

data_splitter = RandomSplitter([0.8, 0.2])
parser = icedata.pets.parser(data_dir=path)
train_records, valid_records = parser.parse(data_splitter)
  • The parser formats the raw annotations into records for processing and storage.
show_records(train_records[:6], ncols=3, class_map=class_map, show=True)
  • The formatted records contain dog/ cat images with their corresponding bounding boxes and class annotations.

C.2. Transformations

image_size = 384
train_tfms = tfms.A.Adapter([*tfms.A.aug_tfms(size=image_size, presize=512), tfms.A.Normalize()])
valid_tfms = tfms.A.Adapter([*tfms.A.resize_and_pad(image_size), tfms.A.Normalize()])

Augmentation transformations provide slight variations of each image, which improves generalizability and effectively increases the number of images available for training. IceVision uses the Albumentations library for the transformation functions.

The validation images are resized to the size expected by the model.

Normalization puts different sets of values on the same scale so that they can be compared. Review the effects of normalization in Step B.2. here.
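For intuition, here is roughly the arithmetic behind tfms.A.Normalize(), sketched with numpy (Albumentations defaults to the ImageNet statistics shown here; the random array is just a stand-in for an image):

import numpy as np

imagenet_mean = np.array([0.485, 0.456, 0.406])
imagenet_std = np.array([0.229, 0.224, 0.225])
pixels = np.random.rand(384, 384, 3)                  # stand-in image, already scaled to [0, 1]
normalized = (pixels - imagenet_mean) / imagenet_std  # zero-centred channels on a comparable scale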

train_ds = Dataset(train_records, train_tfms)
valid_ds = Dataset(valid_records, valid_tfms)

PyTorch separates the code for handling data from the code for modelling. The Dataset holds the logic for storing and transforming individual samples; the DataLoader (Step D.2) batches those samples and feeds them to the model.
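Because the transforms are applied lazily, indexing the same record repeatedly yields different augmented versions of the image. A quick way to see this (show_samples comes with icevision.all):

samples = [train_ds[0] for _ in range(3)]
show_samples(samples, ncols=3)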

D. Modelling

D.1. Model and Backbone

model_type = models.mmdet.retinanet
backbone = model_type.backbones.resnet50_fpn_1x(pretrained=True)
model = model_type.model(backbone=backbone, num_classes=len(parser.class_map))
  • retinanet is one of the configurations developed by the mmdetection team. The retinanet approach eases the imbalance between background and foreground. It also down-weights easy examples, enabling it to focus learning on harder ones (via the focal loss, sketched after this list).
  • The R-50-FPN is one of many supporting backbones available for use with retinanet. The skip (identity) connections in the Resnet design let the network go deeper without degrading, increasing its capacity for learning.
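For intuition, here is a minimal sketch of the focal-loss idea (illustrative only, not mmdetection's actual implementation):

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # per-element binary cross-entropy, kept unreduced
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balancing weight
    # (1 - p_t)**gamma shrinks the loss of well-classified (easy) examples
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

focal_loss(torch.randn(4, 10), torch.randint(0, 2, (4, 10)).float())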

D.2. Dataloader

train_dl = model_type.train_dl(train_ds, batch_size=8, num_workers=4, shuffle=True)
valid_dl = model_type.valid_dl(valid_ds, batch_size=8, num_workers=4, shuffle=False)

The PyTorch DataLoader draws samples from the PyTorch Dataset and collates them into batches for the model.
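An optional sanity check, assuming (as in IceVision's mmdet adapter) that each batch is a (model input, records) pair:

from fastcore.basics import first

batch, records = first(train_dl)  # pull a single batch
len(records)                      # should equal batch_size, i.e. 8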

D.3. Metric

metrics = [COCOMetric(metric_type=COCOMetricType.bbox)]

The COCOMetric measures the average precision (AP) of the bounding boxes at IoU thresholds of 0.50 to 0.95 in steps of 0.05.

  • The model predicts a bounding box (bbox) for an object. The metric then assesses whether these predicted bboxes correspond well with the actual bboxes (ground truth). This is done using the Intersection-over-Union (IoU) method.
  • The IoU is the ratio of the area of overlap between the predicted and ground-truth bboxes to the combined area of the two bboxes (a minimal computation is sketched after this list).
* images courtesy of Wikipedia: https://en.wikipedia.org/wiki/Jaccard_index
  • To illustrate, take an IoU threshold of 0.5: all predictions with IoU ≥ 0.5 are considered True Positives (i.e., if the bboxes overlap by at least half their combined area, the prediction is considered correct). The precision is the number of correct predictions over the total number of positive predictions (precision = (true positives) / (true and false positives)).
  • The metric will also compute the cumulative recall, or sensitivity, where recall measures correct predictions over the total number of actual bboxes (recall = (true positives) / (true positives and false negatives)).
  • The metric will then plot the precision vs recall for every instance. The area under the precision-recall curve will be considered as the Average Precision (AP).
  • The AP for each category will then be averaged, to yield the mean Average Precision (mAP).
  • The COCOMetric uses IoU thresholds starting at 0.5 and progressing by 0.05 up to 0.95 (the COCO mAP). This technique rewards detectors that localize well.
  • In essence, this approach gives a measure of the ability of the model to predict bounding boxes that overlap actual bounding boxes for at least half of the boxes’ area.
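To make the IoU concrete, here is a minimal plain-Python sketch, with boxes given as (xmin, ymin, xmax, ymax) tuples:

def iou(box_a, box_b):
    # intersection rectangle (empty if the boxes do not overlap)
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    # union = both areas minus the doubly-counted overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

iou((0, 0, 100, 100), (50, 0, 150, 100))  # 0.33: a half-overlap fails an IoU-0.5 threshold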

D.4. Learning

learn = model_type.fastai.learner(dls=[train_dl, valid_dl], model=model, metrics=metrics)
learn.lr_find()
* lr_find plot: loss vs. learning rate

Based on the plot, we will utilize a learning rate of 0.0003 (3e-4).

learn.fine_tune(10, 0.0003, freeze_epochs=1)
import matplotlib.pyplot as plt
plt.plot(L(learn.recorder.values).itemgot())
plt.xlabel('epoch')
plt.ylabel('COCOmetric (mAP)')
plt.title('Metric and Losses distribution for Training');
* Legend: mAP (green), train_loss (blue), valid_loss (orange)

After 10 epochs of training at a learning rate of 0.0003, we reached a mean AP of 74%. The validation loss continued to go down; however, both it and the mean AP have started to plateau. We will consider this level of training adequate for the purpose of detecting and labelling dog and cat pets.

E. Visualize Results

model_type.show_results(model, valid_ds, detection_threshold=.5)

The predicted bboxes differ slightly in size from the actual bboxes; however, the results are very reasonable.

F. Inference

We will test the adequacy of the model based on an image that was not included in the dataset.

!pip install bing-image-downloader
from bing_image_downloader import downloader
query_string = 'Persian cat' #
cat_1 = downloader.download(query_string, limit=1,
                            output_dir='dataset',
                            adult_filter_off=True,
                            force_replace=False,
                            timeout=60,
                            verbose=True)
cat_1

With verbose=True, the download log will include the source url of the image fetched for the query string.

cat_1_url = ['https://pixfeeds.com/images/cats/1280-633113756-white-persian-cats.jpg'] #
dest = 'Desktop'
download_url(cat_1_url[0], dest)
from PIL import Image
image = Image.open(dest)
#image.to_thumb(128)
img = np.array(image)
infer_tfms = tfms.A.Adapter([*tfms.A.resize_and_pad(size=384), tfms.A.Normalize()])
infer = Dataset.from_images([img], infer_tfms, class_map=class_map) #
  • The image is fetched from the url and converted to a numpy array.
  • It then undergoes the same transformations as the validation set.
preds = model_type.predict(model, infer, keep_images=True)
show_preds(preds=preds)

We can see that the predicted bbox is well placed and that the object is classified as ‘Persian’, which was part of our query string: the model detected the object and assigned the correct class.

Summary:

Using the IceVision framework, we applied a retinanet/ resnet50 deep neural network to correctly localize and identify a pet in an image.

Future play:

Analyzing the performance of the different models on the same dataset.

I hope you enjoyed coding, see you around the virtual world! :)

Maria

LinkedIn: https://www.linkedin.com/in/rodriguez-maria/

Github: https://github.com/yrodriguezmd?tab=repositories

Twitter: https://twitter.com/Maria_Rod_Data

* image courtesy of Unsplash/ Georgi Benev
