Find Waldo With YOLOv2

Train YOLOv2 to detect a very small custom object

natalie
The Startup
Jul 29, 2020


Waldo found!

This series is inspired by a Waldo dataset on Kaggle. That dataset has some serious drawbacks though, so I went ahead and created my own dataset to better suit what I want to achieve.

There are currently a number of algorithms used in object detection, and YOLO is one of them. You might have heard of others like SSD or Faster R-CNN. I chose to start with YOLOv2 because it’s pretty fast and easy to implement.

First off, if you need a review of how object detection works using CNNs, head over to Coursera’s Deep Learning Specialization, specifically the Convolutional Neural Networks course. Week 3 of that course has all the basics you need, including a lesson on the YOLO algorithm that’s really useful for learning how it works. The discussion forum also has great notes and clarifications.

Second, if you’re planning to train a custom object detector and you don’t have an existing labeled dataset with bounding boxes, keep in mind that you will need to collect images and create the bounding boxes yourself. Depending on what your target object is, this can be challenging.

The code repository is here: https://github.com/nataliele/waldo

1. Prepare Google Colab:

Create your new notebook by going to https://colab.research.google.com.

Your notebook will be created in the new folder Colab Notebooks in your GDrive.

Previously you had to run a bunch of commands to mount your GDrive, but now you should be able to click a button on the left-hand panel that says Mount Drive, and you should have something like this:

Now, because Google Colab doesn’t give you a terminal, you have to run your commands in the cells. These commands start with ! and some commands, like cd, start with %.
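For example (the paths here assume the default Drive mount point):

```bash
# Shell commands in a Colab cell are prefixed with !
!ls "/content/drive/My Drive/Colab Notebooks"

# cd is a special case: use the % magic so the change persists across cells
%cd "/content/drive/My Drive/Colab Notebooks"
```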

2. Get Darknet:

YOLO is a technique used in object detection; Darknet is an implementation of this technique. I’ll let you read up about it here. The original repo is here, and the popular fork for Windows (and now YOLOv4, because the original creator is not continuing the project anymore) is here. Alexey’s fork has a lot of helpful notes that you can check out.

So now that we’re in the folder Colab Notebooks, we’ll clone the darknet repo:
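Assuming we clone the original repo (Alexey’s fork works the same way):

```bash
!git clone https://github.com/pjreddie/darknet.git
```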

YOLOv3 has a couple of differences and improvements compared to YOLOv2, see more here. The biggest difference with regards to finding Waldo is that YOLOv3 can detect objects at different scales, meaning it is better at detecting small objects compared to YOLOv2. YOLOv2 is a lot faster though, so that’s the trade-off.

3. Compile Darknet:

After cloning and getting all the files, you’ll have to compile the code to create the darknet program. If you’re using a GPU you need to change the Makefile in the darknet folder before compiling.
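On a Colab GPU runtime, that means flipping the flags at the top of the Makefile, roughly like this:

```makefile
# Top of darknet/Makefile: enable GPU (and optionally cuDNN) before compiling
GPU=1
CUDNN=1
OPENCV=0
OPENMP=0
DEBUG=0
```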

The original darknet saves the weights every 100 iterations until 1,000 iterations, then saves only every 10,000 iterations. Since I don’t plan to (and probably don’t need to) train for tens of thousands of iterations and want access to the weights earlier, I need to change the following line in the examples/detector.c file, so that it saves every 200 iterations until the 2,000th iteration, then every 1,000 iterations.

If I’m reading the upstream source correctly (the exact line number may differ between Darknet versions), line 138 changes from:
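```c
// original: save weights every 100 iterations until 1,000,
// then only every 10,000
if(i%10000==0 || (i < 1000 && i%100 == 0)){
```

to

```c
// modified: save every 200 iterations until 2,000, then every 1,000
if(i%1000==0 || (i < 2000 && i%200 == 0)){
```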

Since Google Colab (at the time of writing) doesn’t allow editing files, you will have to edit these files locally in your Google Drive. You can also connect your GDrive to a third-party app to edit files in your browser (I use Text Editor).

After making these changes, compile by running make inside the darknet folder:
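```bash
# Move into the cloned repo and compile
%cd darknet
!make
```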

There will be a bunch of output, but if you’ve compiled successfully, you’ll see a new file called darknet in your darknet folder. On Google Colab, you’ll have to run this command to give yourself execute permission to run the file:
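```bash
# Files on a mounted Drive aren't executable by default
!chmod +x ./darknet
```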

4. Get pre-trained weights

When we clone the Darknet repo, we get all the config files but not the pre-trained weights of the models. There are model configs for three general datasets: VOC, COCO, and Open Images. VOC has 20 classes, COCO has 80 classes, and Open Images has 601 classes.

The main weight file below contains the pre-trained weights of the model trained on the COCO dataset. Using this weight file, you’ll be able to detect 80 classes of objects.
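```bash
!wget https://pjreddie.com/media/files/yolov2.weights
```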

For training a custom detector, the website recommends using the weight file below, which has been pre-trained on ImageNet:
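```bash
!wget https://pjreddie.com/media/files/darknet19_448.conv.23
```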

5. Test darknet detector:

You should go ahead and test whether darknet is actually running the way it should. I’d recommend using the full command instead of the shortcut so you’ll always know which config files are being used.
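Something like this, using the sample dog image that ships with the repo:

```bash
# detector test <data file> <cfg file> <weights> <image>
!./darknet detector test cfg/coco.data cfg/yolov2.cfg yolov2.weights data/dog.jpg
```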

You should see a bunch of CNN layers and class probability output like this:

6. Prepare custom data

There are a lot of variations here and you’ll need to think about the specifics of your data and use your judgment. For example, Waldo, apart from small differences between the versions of the books he’s in, is always drawn relatively identically. He’s always facing east, he’s always wearing red-and-white stripes, and his face is the same wherever he is. This is why I actually don’t need a lot of training data.

The difficulty with Waldo is finding images with high resolution. Because he occupies such a small space in the overall picture, I need pictures that have enough resolution to still be able to zoom in and identify him. Similarly, if your object is small compared to the whole image, you need to find high-resolution images.

You can create a new folder under darknet/data/ and put your custom images there. For me, all the Waldo pictures go in darknet/data/waldo.

Assuming you’ve got all your images, the next step is to create the bounding boxes. LabelImg is a great tool for simple bounding boxes: you just clone the git repo and run python labelImg.py, and a window will open up for you to create bounding boxes. I did this on my local computer and uploaded the results to GDrive later. Make sure you choose YOLO format. An example is below.
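Roughly (run this locally, since it needs a GUI; see the labelImg README for the dependency and build steps on your OS):

```bash
git clone https://github.com/tzutalin/labelImg.git
cd labelImg
python labelImg.py
```

In YOLO format, each image gets a .txt file with one line per box: the class id, then the box center and size, all normalized to the image dimensions. The numbers below are made up for illustration:

```
# <class_id> <x_center> <y_center> <width> <height>
0 0.453125 0.287109 0.023438 0.023438
```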

7. Preprocess images for YOLOv2

As mentioned above, YOLOv2 (and even YOLOv3) is not great at detecting very small objects. My Waldo images are 1536x1024 pixels and the average bounding box for Waldo is about 6x6 pixels. When images are fed into the CNN layers, they are automatically resized (the default is 416x416), which would shrink the object to basically nothing. We can increase the input size, but this would increase training time and might not work anyway if the object is very small.

My solution here is to divide the original picture into smaller tiles and then train on those tiles. When testing, the test image is also divided into tiles, a prediction is made on each tile, and the tiles are then stitched back together to recover the original image with the predictions. A discussion of the topic can be found here.

So for this project, I’ve written preprocessing scripts to resize all my images to 1536x1024, then crop them into 24 tiles of 256x256 pixels. I then created bounding boxes for these tiles. I think a better approach would have been to create the bounding boxes on the original images, then adjust them based on the resizing and cropping; this would allow experimenting with different sizing and cropping ratios.
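The actual scripts are in the repo; here’s a minimal sketch of the tiling step, assuming Pillow and illustrative names:

```python
from pathlib import Path
from PIL import Image

TILE = 256                    # tile edge in pixels
FULL_W, FULL_H = 1536, 1024   # resized full-image dimensions (6x4 = 24 tiles)

def make_tiles(image_path, out_dir):
    """Resize an image to 1536x1024, then crop it into 24 non-overlapping 256x256 tiles."""
    img = Image.open(image_path).resize((FULL_W, FULL_H))
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(image_path).stem
    for row in range(FULL_H // TILE):      # 4 rows
        for col in range(FULL_W // TILE):  # 6 columns
            box = (col * TILE, row * TILE, (col + 1) * TILE, (row + 1) * TILE)
            img.crop(box).save(out_dir / f"{stem}_{row}_{col}.jpg")
```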

The bounding box files should be in the same folder and have the same name as your images.

We’d also need a train.txt and a test.txt file to tell darknet which images should be used for training vs. testing. Running the script process.py will create these. The original script was written by Nils Tijtgat and I’ve modified it for my project.
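The core of it looks roughly like this (paths are illustrative; the split percentage is whatever you choose):

```python
import glob
import os

image_dir = 'data/waldo'   # folder with the tiled training images
percentage_test = 10       # percent of images held out for testing

with open('data/train.txt', 'w') as train_file, \
     open('data/test.txt', 'w') as test_file:
    for i, path in enumerate(sorted(glob.glob(os.path.join(image_dir, '*.jpg')))):
        # Every Nth image goes to the test set, the rest to training
        if i % round(100 / percentage_test) == 0:
            test_file.write(path + '\n')
        else:
            train_file.write(path + '\n')
```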

8. Modify config file

There are 3 config files that you need in the cfg folder:

  • a .names file, which stores the label of your objects
  • a .data file, which stores the path to your training and test data
  • a .cfg file, which stores the configurations for the CNN
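For this project the two text files might look like this (the file names are mine; darknet just needs the paths to line up):

```
# cfg/waldo.names -- one class label per line
waldo
```

```
# cfg/waldo.data -- paths are relative to where you run darknet
classes = 1
train   = data/train.txt
valid   = data/test.txt
names   = cfg/waldo.names
backup  = backup/
```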

You’ll see different .cfg files for YOLOv2, YOLOv3, YOLOv3-tiny, etc.; these are the configurations used when training v2, v3, and so on. I only have one class, so I copied and used yolov2-voc.cfg. There are helpful descriptions and explanations of the parameters on this page. If you use the yolov2-voc.cfg file from the main Darknet repo, the changes you need to make are below (a sketch of the modified file follows the list):

  • Line 6: set batch=32. This means we will be using 32 images for every training step; that is, the loss will be calculated and the weights updated after every 32 images. If you have more memory, you can increase this.
  • Line 7: set subdivisions=4. Images in the batch are divided by the subdivision number, so only 32/4 = 8 images are loaded into memory at one time. If you have more memory, you can decrease this number.
  • Line 244: set classes=1, the number of categories we want to detect.
  • Line 237: set filters=(classes + 5)*5; in our case, filters=30.

Source: https://timebutt.github.io/static/how-to-train-yolov2-to-detect-custom-objects/
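Putting that together, the modified parts of the .cfg look roughly like this (line numbers and surrounding values can differ between Darknet versions):

```
[net]
# Testing
# batch=1
# subdivisions=1
# Training
batch=32
subdivisions=4
...
[convolutional]
filters=30        # (classes + 5) * 5 = (1 + 5) * 5

[region]
classes=1
```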


9. Start training

We have everything we need to start training, using the pre-trained weights for YOLOv2 and writing the output to a log file like so:
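Something like this (the .data, .cfg, and log file names are mine; adjust to yours):

```bash
!./darknet detector train cfg/waldo.data cfg/yolov2-waldo.cfg darknet19_448.conv.23 > waldo_train.log
```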

Training output

You can see that after 985 batches, the current loss is 1.369544. We’ve trained on 985*32 = 31,520 images, and it takes about 4 seconds per batch. The IOU is around 70%, not bad.

If you just let it train, Google Colab will time out after a couple of hours. You might also be locked out of using a GPU for a day or so. For me, when it times out like this, the log file doesn’t get saved.

The weights will be saved in the backup folder specified in your .data file. So if your training is interrupted, you can start again from the backup weights like so:
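Something like this, assuming darknet’s default checkpoint naming:

```bash
# Resume training from the latest saved checkpoint in the backup folder
!./darknet detector train cfg/waldo.data cfg/yolov2-waldo.cfg backup/yolov2-waldo_1000.weights > waldo_train.log
```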

10. Make predictions

The basic command for making predictions is similar to the train command: you use the weight file that you want to make predictions with, and the test image goes last.
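For example (hypothetical file names; pick whichever checkpoint performs best):

```bash
!./darknet detector test cfg/waldo.data cfg/yolov2-waldo.cfg backup/yolov2-waldo_2000.weights data/waldo/tile_0_0.jpg
```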

When making predictions, you have to comment out lines 6 and 7 and uncomment lines 3 and 4 in the .cfg file: batch=1 and subdivisions=1.

To make a prediction on a full-size image, I’ve created a script to resize the image, create tiles, make predictions, and then stitch the tiles back together.

This is test1.jpg, in which we’re trying to find Waldo. The resolution of pictures on Medium is not the greatest, but I did have the full resolution for training.

The actual tile result is like this:

And this is the full image after stitching up all the tiles, pretty cool!

Waldo found!

11. Loss graph

There’s a log parser script that you can use to visualize your training loss.
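A minimal sketch of such a parser, assuming the log line format shown in the training output above:

```python
import re
import matplotlib.pyplot as plt

iterations, losses = [], []
# Darknet prints one summary line per batch, roughly:
# "985: 1.369544, 1.451023 avg, 0.001000 rate, 4.2 seconds, 31520 images"
pattern = re.compile(r'^\s*(\d+):\s*[\d.]+,\s*([\d.]+)\s+avg')

with open('waldo_train.log') as f:
    for line in f:
        m = pattern.match(line)
        if m:
            iterations.append(int(m.group(1)))
            losses.append(float(m.group(2)))  # running average loss

plt.plot(iterations, losses)
plt.xlabel('iteration')
plt.ylabel('average loss')
plt.title('Training loss')
plt.show()
```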

Loss up to ~1750th iteration

Tips:

  • Open all files mentioned here and check the paths used in them. Choose a folder where you’ll eventually execute darknet and make sure all file paths are relative to that folder.
  • Most of the time I get a Tesla K80 on Google Colab, so if you want a more powerful GPU, you can sign up for Google Cloud’s free one-year, $300-credit trial. Google Cloud’s AI Platform Notebooks product is also very similar, and you get access to a terminal as well, which is always nice.
  • When starting training, if you see darknet quit in the middle of loading the layers, you might need to increase subdivisions so you don’t run out of memory.
