How to train your own Object Detector with TensorFlow’s Object Detection API

This is a follow-up post on “Building a Real-Time Object Recognition App with Tensorflow and OpenCV” where I focus on training my own classes. Specifically, I trained my own Raccoon detector on a dataset that I collected and labeled myself. The full dataset is available on my Github repo.

By the way, here is the Raccoon detector in action:

The Raccoon detector.

If you want to know the details, you should continue reading!

Motivation

After my last post, a lot of people asked me to write a guide on how they can use TensorFlow’s new Object Detection API to train an object detector with their own dataset. I finally found the time to do it. In this post, I will explain all the necessary steps to train your own detector. In particular, I created an object detector that is able to recognize raccoons with relatively good results.

WHAT THE HECK? WHY RACCOOOONS🐼????

Nothing special 😄 They are one of my favorite animals and somehow they are also my neighbors! I swear, there are so many potential use cases for the Raccoon detector. For example, now you can detect if a Raccoon is knocking on your door while you’re not at home. The system could send a push message to your mobile phone so that you know you have some visitors.

Full video: https://youtu.be/Bl-QY84hojs

Creating the dataset

So let’s get serious! The first thing I needed to do was to create my own dataset:

  • The Tensorflow Object Detection API uses the TFRecord file format, so in the end we need to convert our dataset to this format
  • There are several options to generate the TFRecord files. If your dataset has a structure similar to the PASCAL VOC dataset or the Oxford-IIIT Pet dataset, there are ready-made scripts for these cases (see create_pascal_tf_record.py and create_pet_tf_record.py). If your data doesn’t follow one of those structures, you need to write your own script to generate the TFRecords (they also provide an explanation for this). This is what I did!
  • To prepare the input file for the API you need to consider two things. Firstly, you need an RGB image encoded as jpeg or png, and secondly you need a list of bounding boxes (xmin, ymin, xmax, ymax) for the image along with the class of the object in each bounding box. In my case, this was easy as I only had one class.
  • I scraped 200 Raccoon images (mainly jpegs and a few pngs) from Google Images and Pixabay, making sure that the images show a large variation in scale, pose and lighting. Here is a subset of the Raccoon image dataset that I collected:
Subset of the Raccoon image dataset.
  • Afterwards, I hand-labeled them with LabelImg. LabelImg is a graphical image annotation tool that is written in Python and uses Qt for the graphical interface. It supports Python 2 and 3, but I built it from source with Python 2 and Qt4 as I had problems with Python 3 and Qt5. It’s super easy to use, and the annotations are saved as XML files in the PASCAL VOC format, which means that I could also have used the create_pascal_tf_record.py script. I didn’t do this, though, as I wanted to create my own script.
  • Somehow, LabelImg had problems with opening the jpegs on macOS, so I had to convert them to pngs and then later back to jpegs. Actually, I could have left them as pngs since the API supports those as well, but I figured this out too late. This is what I will do next time.
  • Finally, after labeling the images, I wrote a script that converted the XML files to a csv and then created the TFRecords (a stripped-down sketch of this conversion follows below). I used 160 images for training (train.records) and 40 images for testing (test.records). The script is also available on my repo.
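
To give you an idea of what this conversion looks like, here is a stripped-down sketch of how one labeled image can be turned into a tf.train.Example for the TFRecord file. This is not my full script, just the core idea; the helper functions come from the Object Detection API’s dataset_util module, and the box coordinates are assumed to be absolute pixel values as produced by LabelImg:

import tensorflow as tf
from object_detection.utils import dataset_util  # ships with the Object Detection API

def create_tf_example(filename, encoded_jpeg, width, height, boxes):
    # boxes: list of (xmin, ymin, xmax, ymax) tuples in absolute pixels.
    # The API expects coordinates normalized to [0, 1].
    xmins = [box[0] / width for box in boxes]
    ymins = [box[1] / height for box in boxes]
    xmaxs = [box[2] / width for box in boxes]
    ymaxs = [box[3] / height for box in boxes]
    classes_text = [b'raccoon'] * len(boxes)  # only one class in this dataset
    classes = [1] * len(boxes)                # must match the id in the label map
    return tf.train.Example(features=tf.train.Features(feature={
        'image/height': dataset_util.int64_feature(height),
        'image/width': dataset_util.int64_feature(width),
        'image/filename': dataset_util.bytes_feature(filename.encode('utf8')),
        'image/source_id': dataset_util.bytes_feature(filename.encode('utf8')),
        'image/encoded': dataset_util.bytes_feature(encoded_jpeg),
        'image/format': dataset_util.bytes_feature(b'jpeg'),
        'image/object/bbox/xmin': dataset_util.float_list_feature(xmins),
        'image/object/bbox/xmax': dataset_util.float_list_feature(xmaxs),
        'image/object/bbox/ymin': dataset_util.float_list_feature(ymins),
        'image/object/bbox/ymax': dataset_util.float_list_feature(ymaxs),
        'image/object/class/text': dataset_util.bytes_list_feature(classes_text),
        'image/object/class/label': dataset_util.int64_list_feature(classes),
    }))

# Writing the records is then just:
# writer = tf.python_io.TFRecordWriter('train.records')
# writer.write(create_tf_example(...).SerializeToString())
# writer.close()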

Notes:

  • I found another annotation tool called FIAT (Fast Image Data Annotation Tool) that seems to be good as well. In the future, I might try this out.
  • For image processing on the command line, like converting multiple images to different file formats, ImageMagick is a very good tool. In case you haven’t used it, it’s worth trying out (see the one-liner after these notes).
  • Usually, creating the dataset is the most painful part. It took me about 2 hours to sort out the images and label them, and this was just for one class.
  • Make sure that the image size is medium (see Google Images to get a sense of what medium means). If the images are too large, you might run into out-of-memory errors during training, in particular when you don’t change the default batch size settings.
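
For example, the png/jpeg round trip I mentioned above boils down to a single ImageMagick command (run inside the image folder; mogrify writes a converted copy of every matching file):

mogrify -format jpg *.png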

Training the model

After creating the required input files for the API, I could now train my model.

For training, you need the following:

  • An object detection training pipeline. They also provide sample config files on the repo. For my training, I used ssd_mobilenet_v1_pets.config as the basis. I needed to adjust num_classes to one and also set the paths (PATH_TO_BE_CONFIGURED) for the model checkpoint, the train and test data files as well as the label map (an excerpt is sketched after the note below). For other configurations like the learning rate, batch size and many more, I used their default settings.

Note: The data_augmentation_options are very interesting if your dataset doesn’t have much variability, like different scales, poses etc. A full list of options can be found here (see PREPROCESSING_FUNCTION_MAP).
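
To make these adjustments concrete, here is roughly what the relevant parts of my config looked like. The bucket paths are placeholders, the label map filename (raccoon_label_map.pbtxt) is just my naming, and everything I don’t show stays at the sample defaults:

model {
  ssd {
    num_classes: 1  # just the raccoon class
    # ... all other model settings left at their defaults
  }
}
train_config: {
  batch_size: 24
  fine_tune_checkpoint: "gs://${YOUR_CLOUD_BUCKET}/data/model.ckpt"
  # ... learning rate, data augmentation etc. left at their defaults
}
train_input_reader: {
  tf_record_input_reader {
    input_path: "gs://${YOUR_CLOUD_BUCKET}/data/train.records"
  }
  label_map_path: "gs://${YOUR_CLOUD_BUCKET}/data/raccoon_label_map.pbtxt"
}
eval_input_reader: {
  tf_record_input_reader {
    input_path: "gs://${YOUR_CLOUD_BUCKET}/data/test.records"
  }
  label_map_path: "gs://${YOUR_CLOUD_BUCKET}/data/raccoon_label_map.pbtxt"
}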

  • The dataset (TFRecord files) and its corresponding label map. Examples of how to create label maps can be found here. Here is my label map, which was very simple since I had only one class:
item {
  id: 1
  name: 'raccoon'
}

Note: It’s very important that your label map always starts from id 1. The index 0 is a placeholder index (see also this discussion for more information on this topic). A short snippet showing how the API reads such a label map follows below.
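
For completeness, this is how the API consumes such a label map on the Python side (the filename raccoon_label_map.pbtxt is just my assumed naming; the helpers come from the API’s label_map_util module):

from object_detection.utils import label_map_util

# Parse the pbtxt file and build an id -> class lookup.
label_map = label_map_util.load_labelmap('raccoon_label_map.pbtxt')
categories = label_map_util.convert_label_map_to_categories(
    label_map, max_num_classes=1, use_display_name=True)
category_index = label_map_util.create_category_index(categories)
# category_index is now {1: {'id': 1, 'name': 'raccoon'}}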

  • (Optional) Pre-trained model checkpoint. It is recommended to use one, as it’s always better to start from a pre-trained model; training from scratch can take days before we get good results. They provide several model checkpoints on their repo. In my case, I used the ssd_mobilenet_v1_coco model, as model speed was more important to me than accuracy.

Now you can start the training:

  • Training can either be done locally or on the cloud (AWS, Google Cloud etc.). If you have a GPU (with more than 2 GB of memory) at home, then you can do it locally; otherwise I would recommend going with the cloud. In my case, I went with Google Cloud this time and essentially followed all the steps described in their documentation.
  • For Google Cloud, you need to define a YAML configuration file. A sample file is also provided, and I basically just took the default values (see the sketch after the Tensorboard command below).
  • It is also recommended to start the evaluation job during training. You can then monitor the progress of the training and evaluation jobs by running Tensorboard on your local machine:
tensorboard --logdir=gs://${YOUR_CLOUD_BUCKET}
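
For reference, such a YAML file roughly looks like the sketch below. Since I just took the sample defaults, treat the machine types and worker counts here as placeholders rather than values I tuned:

trainingInput:
  runtimeVersion: "1.0"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 5
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard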

Here are the results from my training and evaluation jobs. In total, I ran the training for about one hour (22k steps) with a batch size of 24, but I already achieved good results after about 40 minutes.

This is how the total loss evolved:

Total loss decreased pretty fast due to the pre-trained model.

Since I only had one class, it was enough to just look at total mAP (mean average precision):

The mAP hit 0.8 at around 20k steps, which is quite good.

And here is an example of the evaluation of one image while training the model:

The detected box around the Raccoon got much better over time.

Exporting the model

  • After finishing the training, I exported the trained model to a single file (Tensorflow graph proto) so that I could use it for inference.
  • In my case, I had to copy the model checkpoints from the Google Cloud bucket to my local machine and then used the provided export script (export_inference_graph.py) to export the model. The model can be found on my repo, just in case you really want to use it in production ;) A minimal inference sketch follows below.
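
As a rough idea of what inference with the exported graph looks like, here is a minimal TensorFlow 1.x-style sketch. The file paths are placeholders; the tensor names are the standard ones the export script writes into the graph:

import numpy as np
import tensorflow as tf
from PIL import Image

# Placeholder paths; point them at your exported graph and a test image.
PATH_TO_GRAPH = 'frozen_inference_graph.pb'
PATH_TO_IMAGE = 'raccoon_test.jpg'

# Load the frozen Tensorflow graph proto into memory.
detection_graph = tf.Graph()
with detection_graph.as_default():
    graph_def = tf.GraphDef()
    with tf.gfile.GFile(PATH_TO_GRAPH, 'rb') as f:
        graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name='')

with tf.Session(graph=detection_graph) as sess:
    # The model expects a batch of images, hence the extra dimension.
    image = np.expand_dims(np.array(Image.open(PATH_TO_IMAGE)), axis=0)
    boxes, scores, classes, num = sess.run(
        [detection_graph.get_tensor_by_name('detection_boxes:0'),
         detection_graph.get_tensor_by_name('detection_scores:0'),
         detection_graph.get_tensor_by_name('detection_classes:0'),
         detection_graph.get_tensor_by_name('num_detections:0')],
        feed_dict={detection_graph.get_tensor_by_name('image_tensor:0'): image})
    # boxes are normalized [ymin, xmin, ymax, xmax]; scores are sorted descending.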

Bonus

  • I applied the trained model to a video that I found on YouTube.
  • If you’ve watched the video, you’ll see that not every raccoon is detected and that there are some misclassifications. This is logical, as we only trained the model on a small dataset. To create a more generalized and robust Raccoon detector that is, for example, also able to detect the most famous raccoon on earth, namely Rocket Raccoon from Guardians of the Galaxy, we just need much more data. That’s just one of the limitations of AI right now!
Most famous raccoon on earth.

Conclusion

I hope you liked this post. Give me a ❤️ if you did. Hopefully, you can now train your own object detector. In this article, I only used one class because I was too lazy to label more data. There are services like CrowdFlower, CrowdAI or Amazon’s Mechanical Turk that offer data labeling, but that would have been too much for this article.

I obtained quite decent results for such a short training time, but this is due to the fact that the detector was trained on a single class only. For more classes, the total mAP won’t be as good as the one I got, and definitely longer training would be needed to get good results. In fact, I also trained an object detector on the annotated driving dataset (Dataset 1) provided by Udacity. It took me quite a while to train a model that could decently recognize cars, trucks and pedestrians. In many other cases, even the model that I used would be too simple to capture all the variability across multiple classes, so more complicated models must be used. There is also the trade-off between model speed and model accuracy that one must consider. However, this is a different story and could actually be another independent article.

Follow me here on Medium (Dat Tran) or on Twitter (@datitran) to stay up-to-date with my work.