How to Set Up an Active Learning Framework for Your Object Detection Model

Moustafa Ayoub
Published in Ixor · Jan 22, 2020

In a supervised learning context, convolutional neural networks have achieved state-of-the-art performance in computer vision applications such as object detection. These algorithms generally require a large amount of labeled data, which is itself a common obstacle, since labeling is time-consuming and expensive.

In addition, after putting an object detection model in production, it is hard to keep track of its performance. To keep performance up, the model should be retrained and updated with the data it has seen during production. However, storing and labeling all of that data would generate too much overhead. A common practice is to randomly select samples to label and then retrain the model. This does not solve the real problem, since you would be missing out on the valuable data that could make a difference in the model's performance.

Active Learning: Cut Labeling Costs

Active learning is a special case of machine learning used when labeling a dataset is time-consuming and expensive, and when processing power and storage are limited. By using active learning we can achieve high performance while saving time on labeling, because we select only the most valuable data.

There are situations in which unlabeled data is abundant but manually labeling is expensive. In such a scenario, learning algorithms can actively query the user/teacher for labels. This type of iterative supervised learning is called active learning. Since the learner chooses the examples, the number of examples to learn a concept can often be much lower than the number required in normal supervised learning. With this approach, there is a risk that the algorithm is overwhelmed by uninformative examples. [1]

Proposed active learning framework for object detection

Active Learning in your Object Detection Project

Object detection is certainly a situation where active learning is useful, but when exactly should it be applied?

Before production:

While the object detection model is still in the training and testing phase, the data has already been collected.

1- Split the dataset: The dataset should be split into training and validation sets. It is very important that the validation set is sufficiently large and that it never changes. This ensures that you can track the model's performance metrics after each training iteration, since the training set is constantly being updated.
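Step 1 can be sketched in plain Python. The function name, the split fraction, and the fixed seed below are illustrative assumptions; the key point is that the validation set is created once with a fixed seed and then frozen:

```python
import random


def split_dataset(image_paths, val_fraction=0.2, seed=42):
    """Split image paths into a training pool and a fixed validation set.

    The fixed seed makes the split deterministic, so the validation set
    stays identical across retraining iterations and metrics remain
    comparable.
    """
    rng = random.Random(seed)
    paths = list(image_paths)
    rng.shuffle(paths)
    n_val = int(len(paths) * val_fraction)
    # Return (training pool, validation set).
    return paths[n_val:], paths[:n_val]
```

In practice you would persist the validation file list to disk once and reload it, rather than recomputing the split each run.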

2- Split the training set: The training set should be split into multiple subsets, which will then be used iteratively to retrain the model.
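Step 2 could look like the following sketch (the helper name and partitioning scheme are assumptions): it partitions the training pool into roughly equal subsets to be labeled and used one iteration at a time.

```python
def split_into_subsets(training_paths, n_subsets):
    """Partition the training pool into roughly equal, non-overlapping
    subsets for iterative labeling and retraining."""
    k, r = divmod(len(training_paths), n_subsets)
    subsets, start = [], 0
    for i in range(n_subsets):
        # The first r subsets absorb the remainder, one extra item each.
        end = start + k + (1 if i < r else 0)
        subsets.append(training_paths[start:end])
        start = end
    return subsets
```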

3- Label a subset: The first training subset should be labeled manually by drawing bounding boxes around the objects that need to be detected in the future.

4- Train and evaluate the model: The model should be trained on the training subset. After that, evaluate the model's performance on the validation set and store its performance metrics.

5- Generate the labels: You may have noticed that no test set was used. This is because the subsets of the training set take on that role. The model runs inference on the next training subset and the predicted labels are saved in XML files, which allows us to edit them later.
A function to generate and save the XML files is called after the model produces bounding boxes and categories for each image in the training subset:

```python
import os
import xml.etree.ElementTree as ET


def xml_writer(xml_directory, full_path_to_image, image_width, image_height,
               bounding_boxes, categories):
    # Derive the XML file name from the image file name.
    file_xml = full_path_to_image.split(".", 1)[0].split("/")[-1] + ".xml"
    file_xml = os.path.join(xml_directory, file_xml)

    file_name = full_path_to_image.split("/")[-1]
    folder_name = full_path_to_image.split("/")[-2]

    # Build the annotation tree (Pascal VOC layout).
    annotation = ET.Element("annotation")
    folder = ET.SubElement(annotation, "folder")
    filename = ET.SubElement(annotation, "filename")
    path = ET.SubElement(annotation, "path")
    source = ET.SubElement(annotation, "source")
    database = ET.SubElement(source, "database")
    size = ET.SubElement(annotation, "size")
    width = ET.SubElement(size, "width")
    height = ET.SubElement(size, "height")
    depth = ET.SubElement(size, "depth")
    segmented = ET.SubElement(annotation, "segmented")

    # One <object> element per detected bounding box.
    for i in range(len(bounding_boxes)):
        obj = ET.SubElement(annotation, "object")
        name = ET.SubElement(obj, "name")
        pose = ET.SubElement(obj, "pose")
        truncated = ET.SubElement(obj, "truncated")
        difficult = ET.SubElement(obj, "difficult")
        bndbox = ET.SubElement(obj, "bndbox")
        xmin = ET.SubElement(bndbox, "xmin")
        ymin = ET.SubElement(bndbox, "ymin")
        xmax = ET.SubElement(bndbox, "xmax")
        ymax = ET.SubElement(bndbox, "ymax")

        # get_category_name maps a category id to its label name.
        name.text = get_category_name(categories[i])
        pose.text = "Unspecified"
        truncated.text = "0"
        difficult.text = "0"
        xmin.text = str(int(bounding_boxes[i, 0]))
        ymin.text = str(int(bounding_boxes[i, 1]))
        xmax.text = str(int(bounding_boxes[i, 2]))
        ymax.text = str(int(bounding_boxes[i, 3]))

    folder.text = folder_name
    filename.text = file_name
    path.text = full_path_to_image
    database.text = "Unknown"
    width.text = str(image_width)
    height.text = str(image_height)
    depth.text = "3"
    segmented.text = "0"

    # Serialize the tree and write the XML file.
    mydata = ET.tostring(annotation).decode("utf-8")
    with open(file_xml, "w") as myfile:
        myfile.write(mydata)
```

6- Evaluate and correct the labels: Here comes the job of the person responsible for evaluating the labels generated by the previously trained model. The labels are to be corrected: resizing bounding boxes, removing false positives, and adding boxes for false negatives (objects the model missed).

Labels Generated by the Trained Model
Corrected Labels (In Red) After Noticing False Negatives

7- Create the new training set: This topic is still heavily researched. One approach is to combine the old training subsets with the new one to train the model; this is recommended when you have sufficient computing resources. Another approach is to randomly select a percentage of the past training subsets to reduce the overall training set size; this is recommended when computing resources are limited.
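Both approaches in step 7 can be captured by one hypothetical helper, where keep_fraction=1.0 keeps all past subsets and a smaller fraction randomly subsamples them (the function name, parameters, and defaults are assumptions):

```python
import random


def build_next_training_set(past_subsets, new_subset, keep_fraction=1.0, seed=0):
    """Combine past training subsets with the newly labeled subset.

    keep_fraction=1.0 keeps all past data (needs more compute); a smaller
    fraction randomly subsamples each old subset to bound training set size.
    """
    rng = random.Random(seed)
    kept = []
    for subset in past_subsets:
        n_keep = int(round(len(subset) * keep_fraction))
        kept.extend(rng.sample(list(subset), n_keep))
    return kept + list(new_subset)
```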

8- Repeat from step 4: Repeat the process from step 4 until all the subsets are labeled and the model reaches the required performance. Alternatively, stop once the performance no longer improves by more than a certain threshold.
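The "stop when improvement stalls" criterion in step 8 might be sketched as a simple patience check on the history of validation metrics (e.g. mAP after each iteration); the function name, min_delta, and patience values below are assumptions:

```python
def should_stop(metric_history, min_delta=0.005, patience=2):
    """Stop when the validation metric has not improved by more than
    min_delta over the last `patience` iterations."""
    if len(metric_history) <= patience:
        return False
    # Compare the most recent scores against the score `patience` steps ago.
    recent = metric_history[-(patience + 1):]
    best_before = recent[0]
    return all(m - best_before < min_delta for m in recent[1:])
```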

After production:

After putting the model in production, it is very important to keep track of its performance. In this phase the model is exposed to new data, so it constantly needs to be updated. However, one problem is the strategy for building the new training subset. As previously mentioned, a common approach is to randomly sample from the data passed through the model during production, but this is not enough. A better approach is to let the model choose the data based on its own performance on the samples. Two common strategies are:

1- Sampling based on the model's confidence in detecting any object in the image sample. If the confidence is lower than a certain threshold, the sample is passed through the model, which labels it and produces an XML file for its labels. The sample is saved in the new training subset, evaluated by the person responsible, and then used to retrain the model to ensure better performance on similar samples in the future.
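Confidence-based sampling might be sketched as follows, assuming `detections` maps each image path to the list of confidence scores the model produced for that image (all names and the threshold are illustrative):

```python
def select_low_confidence(detections, threshold=0.5):
    """Pick images whose best detection confidence falls below a
    threshold; these are the most informative candidates to label next.

    detections: dict mapping image path -> list of confidence scores.
    Images with no detections at all are also selected, since the model
    may have missed objects entirely.
    """
    selected = []
    for path, scores in detections.items():
        if not scores or max(scores) < threshold:
            selected.append(path)
    return selected
```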

2- Sampling based on the model's underperformance on a certain class. In many cases, the original training set causes underperformance on some classes. One reason could be class imbalance in the dataset. Another is the presence of two or more very similar classes, making it difficult for the model to distinguish between their samples.
The proposed sampling method ensures that whenever one of these cases appears in the future, the sample is passed through the model, which labels it and produces an XML file for its labels. The sample is saved in the new training subset, evaluated by the person responsible, and then used to retrain the model to ensure better performance on similar samples in the future.
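Class-based sampling could be sketched like this, assuming per-class average precision is tracked on the validation set (the function name, data shapes, and AP threshold are assumptions):

```python
def select_weak_class_samples(predictions, per_class_ap, ap_threshold=0.5):
    """Flag images containing predictions for underperforming classes.

    predictions: dict mapping image path -> set of predicted class names.
    per_class_ap: dict mapping class name -> average precision on the
    validation set. Classes with AP below ap_threshold are considered
    underperforming, and any image containing them is selected.
    """
    weak_classes = {c for c, ap in per_class_ap.items() if ap < ap_threshold}
    return [p for p, classes in predictions.items() if classes & weak_classes]
```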

Conclusion

A simple framework for using active learning in object detection has been proposed, significantly saving time and resources on the labeling effort needed to train an object detection model. In addition, the proposed workflow is effective in improving the model's performance after the production stage.

References

[1] https://en.wikipedia.org/wiki/Active_learning_(machine_learning)

[2] https://www.datacamp.com/community/tutorials/active-learning

[3] https://arxiv.org/abs/1908.02454

At IxorThink we are constantly trying to improve our methods to create state-of-the-art solutions. As a software company, we can provide stable and fully developed solutions. Feel free to contact us for more information.
