Building Accurate Object Detection Models with RetinaNet: A Comprehensive Step-by-Step Guide


This article aims to provide a comprehensive guide on how to train a state-of-the-art object detection model called RetinaNet. Object detection is a fundamental computer vision task that involves identifying and localizing objects in an image or a video. RetinaNet is a popular object detection model that has shown impressive results on various benchmark datasets, thanks to its focal loss and feature pyramid architecture, which let a fast single-stage detector reach the accuracy of slower two-stage detectors.

If you’re an AI enthusiast looking to learn how to train your own RetinaNet model, you’re in the right place! This article takes the reader through the entire process, from preparing and annotating the training data to training and testing a RetinaNet model. By the end, you’ll have a solid understanding of each step and be ready to create your own object detection model. The article will cover the following topics in detail:

  • An introduction to RetinaNet and its architecture
  • Data preparation and annotation
  • Model training and evaluation

By the end of this article, the reader will have a clear understanding of how to build an end-to-end object detection pipeline with RetinaNet and will be able to apply this knowledge to solve real-world computer vision problems.

Introduction

RetinaNet is a state-of-the-art object detection model that was introduced in 2017 by Facebook AI Research. It is a single-stage detector: one unified network performs both object classification and bounding-box regression in a single pass, which makes it fast while remaining accurate. RetinaNet uses a feature pyramid network (FPN) on top of a backbone such as ResNet to extract features at different scales, together with a novel focal loss function that down-weights easy, well-classified examples (mostly background) so that training focuses on hard examples. This addresses the extreme foreground-background class imbalance faced by single-stage detectors and improves performance on small and hard-to-detect objects. RetinaNet has achieved impressive results on several benchmark datasets and is widely used for various computer vision tasks.
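
To make the focal loss idea concrete, here is a minimal NumPy sketch of the binary focal loss from the original paper, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t). The function name and defaults below are my own illustration; the actual keras-retinanet implementation differs in detail:

import numpy as np

def focal_loss(y_true, y_pred, alpha=0.25, gamma=2.0, eps=1e-7):
    # y_true: 0/1 ground-truth labels; y_pred: predicted probability of the positive class
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    p_t = np.where(y_true == 1, y_pred, 1.0 - y_pred)    # probability assigned to the true class
    alpha_t = np.where(y_true == 1, alpha, 1.0 - alpha)  # class-balancing weight
    # (1 - p_t) ** gamma shrinks the loss for easy examples (p_t close to 1)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

# An easy background example contributes almost nothing, while a hard positive keeps most of its loss
print(focal_loss(np.array([0, 1]), np.array([0.01, 0.10])))

With gamma set to 0 the expression reduces to ordinary alpha-weighted cross-entropy; the paper found gamma = 2 to work best in practice.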

Dataset and Annotation Tool

To train an object detection model, the first step is to gather a dataset of images and annotate them with labels that identify the objects of interest. This annotation process can be accomplished with the help of an annotation tool such as LabelImg, which allows you to draw bounding boxes around the objects in each image and label them with a corresponding class name.

LabelImg is an open-source graphical image annotation tool used to label object bounding boxes in images. It provides an easy-to-use interface for annotating images with object detection labels. LabelImg supports various formats such as Pascal VOC, YOLO, and Tensorflow. The tool is written in Python and Qt, and is available on multiple platforms, including Windows, Linux, and macOS. It is widely used by researchers and practitioners for creating datasets for training object detection models. Once you have annotated your dataset, you can use it to train your object detection model using a framework such as RetinaNet.

LabelImg can be easily installed with just one command! All you need to do is open up your terminal and type pip3 install labelImg. With this simple command, you'll have the annotation tool up and running in no time, ready to help you annotate your dataset and prepare it for use in training your deep learning models. Once the installation is complete, you can use the labelImg command in your terminal to start the tool and begin annotating your images. If you encounter any difficulties during the installation process, don’t worry! Detailed installation steps for each operating system are available on the official GitHub page for LabelImg. Simply head over to https://github.com/heartexlabs/labelImg and you’ll find everything you need to get up and running with this powerful annotation tool.
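
For quick reference, the install and launch steps described above boil down to these two terminal commands:

pip3 install labelImg
labelImg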

Annotating images using LabelImg is a breeze, and the tool even generates .xml files automatically as soon as you’re done annotating an image. This saves you time and effort and ensures that your annotations are consistently formatted and ready to be used in your deep-learning models. With its intuitive interface and user-friendly features, LabelImg is the perfect tool for annotating your dataset quickly and accurately.
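
For reference, a Pascal VOC .xml file produced by LabelImg looks roughly like the trimmed example below (the file name, image size, class name, and coordinates are purely illustrative). The parsing code later in this article reads exactly these object, name, and bndbox elements:

<annotation>
    <filename>image_01.jpg</filename>
    <size>
        <width>640</width>
        <height>480</height>
        <depth>3</depth>
    </size>
    <object>
        <name>dog</name>
        <bndbox>
            <xmin>48</xmin>
            <ymin>240</ymin>
            <xmax>195</xmax>
            <ymax>371</ymax>
        </bndbox>
    </object>
</annotation>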

Model Training

To keep your dataset organized, it’s recommended to create a parent folder for your RetinaNet project. Inside this folder, you should create two child folders named “JPEGImages” and “Annotations”. The “JPEGImages” folder should contain all the original image files in the .jpg format, while the “Annotations” folder should contain all the .xml files generated using LabelImg. This will help you keep track of your files and ensure a smooth training process. Here’s an example of what your file structure could look like:

RetinaNet File Structure
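
In plain text, and assuming the parent folder is named RetinaNet as in the Drive paths used later, the layout looks like this (the file names are just examples):

RetinaNet/
├── JPEGImages/
│   ├── image_01.jpg
│   ├── image_02.jpg
│   └── ...
└── Annotations/
    ├── image_01.xml
    ├── image_02.xml
    └── ...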

By organizing your files in this way, you’ll be able to easily load your dataset into the RetinaNet model during training. If you’re using Google Colab for training your RetinaNet model, the next step is to upload the parent folder that you created into your Google Drive account. Once the folder is uploaded, you can access it directly from your Colab notebook. In the following steps, I will be explaining how to train your RetinaNet model using Google Colab.

To connect your Google Drive with Colab, you can use the following code snippet:

from google.colab import drive
drive.mount('/content/drive')

This will prompt you to authenticate your account and provide an authorization code. Once you have provided the code, your Google Drive will be mounted in Colab, and you can access your files and folders by navigating to the /content/drive directory. To start working with RetinaNet in Colab, the first step is to clone the keras-retinanet repository. You can easily do this by running the following command in a code cell:

!git clone https://github.com/fizyr/keras-retinanet.git

This will clone the entire repository into your current working directory, giving you access to all the files needed for training and testing RetinaNet. Once the clone finishes, import os and check the current working directory with the os.getcwd() command, then switch into the repository using the %cd keras-retinanet/ magic command, as in the short cell shown below. Once you are in this directory, you need to perform a few installations; the installation commands are given right after that cell.
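
A minimal Colab cell for the directory check and switch (the printed path is just an example):

import os
print(os.getcwd())   # e.g. /content
%cd keras-retinanet/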

!pip install .
!python setup.py build_ext --inplace

Now that the installation is complete, it’s time to import the necessary libraries. Some of the key libraries that you’ll need to include are given below:

import numpy as np
import pandas as pd
import os, sys, random
import xml.etree.ElementTree as ET
from keras_retinanet.utils.visualization import draw_box, draw_caption, label_color
from keras_retinanet.utils.image import preprocess_image, resize_image
import shutil
from os.path import isfile, join
import matplotlib.pyplot as plt
from PIL import Image
import requests
import urllib.request
from os import listdir

To make things easier, you can store the paths of the JPEG images and corresponding XML files in variables. This way, you can easily access these files during training. Here’s an example of how you can do this:

jpgPath="/content/drive/MyDrive/RetinaNet/JPEGImages/"
annPath="/content/drive/MyDrive/RetinaNet/Annotations/"

To train the RetinaNet model using the Keras-RetinaNet repository, you need to create a dataframe with attributes that the repository supports. The following code snippet can be used to create the dataframe:

data=pd.DataFrame(columns=['fileName','xmin','ymin','xmax','ymax','class'])

To start the annotation process, you can read in all the annotation files and extract data from these files into the previously created dataframe. This can be achieved by using the os library to iterate through the annotations directory and read in each file using the ElementTree library. Once the data has been extracted, it can be stored in the dataframe created earlier. This process is critical in preparing the dataset for training your RetinaNet model, and must be done accurately. It can be done in the following way:

all_files = [f for f in listdir(annPath) if isfile(join(annPath, f))]

for file in all_files:
    # Skip anything that is not an XML annotation file
    if file.split(".")[-1] != 'xml':
        continue

    # The corresponding image has the same name with a .jpg extension
    filename = jpgPath + file.replace(".xml", ".jpg")
    tree = ET.parse(join(annPath, file))
    root = tree.getroot()

    # Each <object> element holds one labelled bounding box
    for obj in root.iter('object'):
        class_name = obj.find('name').text
        xml_box = obj.find('bndbox')
        xmin = int(xml_box.find('xmin').text)
        ymin = int(xml_box.find('ymin').text)
        xmax = int(xml_box.find('xmax').text)
        ymax = int(xml_box.find('ymax').text)

        # Add one row per object, in the same order as the dataframe columns
        data.loc[len(data)] = [filename, xmin, ymin, xmax, ymax, class_name]

The above code snippet loops through all the XML annotation files in the annotation folder, extracts the class labels, bounding box coordinates and the corresponding file name, and then appends them to the dataframe. The ‘if’ condition checks if the file extension is ‘xml’. The file name of the corresponding image is obtained by replacing the ‘.xml’ extension with ‘.jpg’. The bounding box coordinates and class labels are then extracted using ElementTree (ET) parsing. Finally, the extracted data is appended to the dataframe. This allows for an easy and efficient way of gathering all the necessary information for training the RetinaNet model, as the dataframe can easily be converted to a CSV file, which is used for training the model.

After the dataframe is created, it can be viewed using the data.head() command. In order to use this data for training the RetinaNet model, you need to convert it into a CSV file. This can be done with the following command, which writes the file without an index column or header row, as keras-retinanet expects:

data.to_csv('../TrainData.csv', header=False, index=False)
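
Each line of the resulting file follows the simple annotation format that keras-retinanet's CSV generator expects, one object per line: path,x1,y1,x2,y2,class_name. An illustrative row (the values are made up for this example) would look like:

/content/drive/MyDrive/RetinaNet/JPEGImages/image_01.jpg,48,240,195,371,dog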

To ensure that all model snapshots are saved in a specific location after each epoch, you can create a “snapshots” folder using the following steps. First, check if the folder already exists or not. If it doesn’t exist, create it using the os.mkdir() command.

if not os.path.exists('snapshots'):
    os.mkdir('snapshots')

To get a list of unique class names from the data dataframe, you can use the following code: classes=data['class'].unique(). This will give you an array of all the unique class names present in the 'class' column of the data dataframe. To create a new file and write the class names and indices, you can use the built-in Python function open() with the file mode set to 'w' (write mode). This will create a new file if it doesn't exist or overwrite the existing file if it does.

with open('../Classes.csv', 'w') as file:
    for i, class_name in enumerate(classes):
        file.write(f'{class_name},{i}\n')

In the above code, the open() function is used to create a new file called '../Classes.csv' with the file mode set to 'w'. The file is opened in write mode, which allows data to be written to it. The 'w' mode overwrites the file if it already exists.

The for loop iterates over each unique class name in the classes list and writes the class name and its corresponding index to the file separated by a comma. The \n character at the end of each line is used to indicate a new line in the file. This file will be used to map class names to class indices for the RetinaNet model during the training process.
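
For example, if the dataset contained two classes, the resulting Classes.csv would look like the two lines below (the class names are illustrative; the indices simply follow the order in which the classes appear in the classes array):

dog,0
cat,1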

It is recommended to start with a pre-trained model rather than training a model from scratch. In our case, we will use a RetinaNet model with a ResNet50 backbone that is already pre-trained on the COCO dataset. This can be done using the code snippet given below:

url = 'https://github.com/fizyr/keras-retinanet/releases/download/0.5.1/resnet50_coco_best_v2.1.0.h5'
model = '/content/keras-retinanet/snapshots/resnet50_csv_v1.h5'
urllib.request.urlretrieve(url, model)

The above code utilizes the urllib.request.urlretrieve() function to download the pre-trained model from the specified URL and save it to the desired path. The URL of the model and the path to save it are stored in the variables url and model, respectively.

The urllib.request.urlretrieve() function takes two arguments, the URL of the file to be downloaded and the local file path where the file should be saved. When called, the function downloads the file and saves it to the specified file path. In this case, the ResNet model will be downloaded from the specified URL and saved to the path specified by the model variable.

To train the keras-retinanet model in Colab, follow the steps below:

!python keras_retinanet/bin/train.py \
    --freeze-backbone \
    --random-transform \
    --weights '/content/keras-retinanet/snapshots/resnet50_csv_v1.h5' \
    --batch-size 8 \
    --steps 500 \
    --epochs 15 \
    csv 'path/to/TrainData.csv' 'path/to/classes.csv'

The above command is used to train a RetinaNet object detection model on the specified training data using a pre-trained ResNet50 backbone. The arguments can be explained as follows:

--freeze-backbone: This argument freezes the ResNet50 backbone of the RetinaNet model, which is already trained on the COCO dataset.

--random-transform: This argument applies random transformations on the images during training to augment the data and reduce overfitting.

--weights: This specifies the path to the pre-trained ResNet50 model we downloaded earlier, which will be used as the backbone of the RetinaNet model.

--batch-size: This specifies the batch size used for training.

--steps: This specifies the number of steps (batches) per epoch.

--epochs: This specifies the number of epochs to train the model.

csv 'path/to/TrainData.csv' 'path/to/classes.csv': This specifies the paths to the CSV files containing the training data and class information.
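
Putting it together with the paths used earlier in this article, and assuming the notebook's working directory was /content when the repository was cloned (so that '../TrainData.csv' and '../Classes.csv' ended up directly under /content), the call would look roughly like this:

!python keras_retinanet/bin/train.py \
    --freeze-backbone \
    --random-transform \
    --weights '/content/keras-retinanet/snapshots/resnet50_csv_v1.h5' \
    --batch-size 8 \
    --steps 500 \
    --epochs 15 \
    csv '/content/TrainData.csv' '/content/Classes.csv'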

To load the trained model, we can use the glob module to find the path of the latest snapshot saved in the snapshots folder, and then convert it into an inference model for making predictions. Here's an example:

from glob import glob
from keras_retinanet import models

# Pick the most recently saved snapshot and convert it into an inference model
model_path = max(glob('/content/keras-retinanet/snapshots/*.h5'), key=os.path.getmtime)
model = models.load_model(model_path, backbone_name='resnet50')
model = models.convert_model(model)

In the above code, we use the glob function to list all files in the snapshots folder with the extension .h5 and pick the most recently modified one, which is the last snapshot saved during training. We then load that snapshot using the load_model function from keras_retinanet.models, passing in the path of the saved model and the name of the backbone used during training (in this case, resnet50), and convert it into an inference model with convert_model so that it outputs bounding boxes, scores, and labels directly. The loaded model can then be used to make predictions on new data. Now we can start making predictions!
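
The prediction function defined next uses a label_map dictionary to turn the integer class indices returned by the model back into class names. One straightforward way to build it is from the classes array created earlier, which was written to Classes.csv in the same order:

label_map = {i: class_name for i, class_name in enumerate(classes)}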

Let's define a simple function to make predictions:

def show_predictions(filename, threshold=0.5):
    # Construct the path to the image file
    file_path = os.path.join(jpgPath, filename)
    print(f'File path: {file_path}')

    # Load the image and its annotations
    image_df = data[data['fileName'] == file_path]
    image = np.array(Image.open(file_path))[:, :, :3]  # Drop the alpha channel if there is one

    # Draw the ground-truth bounding boxes on the image
    for _, row in image_df.iterrows():
        box = [row['xmin'], row['ymin'], row['xmax'], row['ymax']]
        draw_box(image, box, color=(255, 0, 0))

    # Preprocess the image and make predictions with the model
    input_image = preprocess_image(image)
    input_image, scale = resize_image(input_image)
    boxes, scores, labels = model.predict_on_batch(np.expand_dims(input_image, axis=0))
    boxes /= scale

    # Draw the predicted bounding boxes on the image, along with their labels and scores
    score, label = None, None
    for box, score, label in zip(boxes[0], scores[0], labels[0]):
        if score < threshold:
            break
        box = box.astype(np.int32)
        color = label_color(label)
        draw_box(image, box, color=color)
        class_name = label_map[label]
        caption = f"{class_name} {score:.3f}"
        draw_caption(image, box, caption)

    # Display the final image with the annotations
    plt.figure(figsize=(20, 10))
    plt.imshow(image)
    plt.axis('off')
    plt.show()

    return score, label

The above function, show_predictions(), takes an image filename from the dataset and generates object detection predictions for that image. The function first reads the image from the specified file path and extracts the bounding boxes of the objects in the image from the dataset. It then preprocesses the image, resizes it, and passes it through the trained RetinaNet model to generate predictions for the objects in the image. Finally, the function visualizes the image with the predicted bounding boxes and their corresponding class labels and scores. This function can be used to quickly evaluate the performance of the trained model on individual images in the dataset.

To generate predictions for a given image in the dataset, you can use the show_predictions function by passing the name of the image to the function. For instance, to retrieve predictions for an image called image_name.jpg in the jpgPath directory, you can use the following code:

score, label = show_predictions('image_name.jpg', threshold=0.5)

The threshold argument specifies the confidence threshold for the predicted objects. The function first loads the image, then iterates over the rows of the data dataframe that correspond to it in order to draw bounding boxes for the ground-truth objects. It then preprocesses the image, passes it to the trained model for object detection, and draws the predicted bounding boxes for the objects with a score above the specified threshold.

Finally, the function displays the image with the predicted bounding boxes and their corresponding labels. The returned score and label variables hold the score and class index of the last detection the loop examined (or None if the model produced no detections).

Before we wrap up, it’s important to keep in mind that while object detection can be a powerful tool, it is not without limitations. Like any machine learning model, object detection algorithms have their own set of biases and limitations that can affect their accuracy and effectiveness. For example, if the model is trained on a dataset that is not diverse enough, it may struggle to recognize objects that are not well-represented in the training data. Similarly, object detection can be used in ways that are unfair or unjust, such as for surveillance or discriminatory profiling.

As with any technology, it’s important to approach object detection with a critical eye and to carefully consider how it can be used in a way that is ethical and fair. By being aware of these limitations and biases, we can work to ensure that object detection is used in a way that benefits society as a whole.

While object detection has come a long way, it’s important to remember that we’re not quite in a “Terminator” world just yet. Despite its limitations, object detection technology holds incredible potential to make our lives easier and safer. Whether you’re using it to track down lost pets or to detect potential hazards in industrial settings, there’s no denying that object detection is an exciting and constantly evolving field. So let’s keep training those models, fine-tuning those hyperparameters, and building a future where our machines can truly see the world around us!
