Deep Learning : Detect and classify multiple The Simpsons characters in the same picture.
In Part 1, I trained a convolutional neural network to recognize (i.e., classify) 20 characters from The Simpsons. Given a picture of a character, the model returns which character is in the image. You can still find the open dataset on Kaggle. The accuracy was pretty high (F1: 96%), but it’s just a simple classifier: it can only recognize one character at a time, and it doesn’t return the position of that character.
Now, I would like to create a model to detect and classify each character in the picture.
The model would be much more complicated than the previous one and would be able to draw bounding boxes around each character.
At first, I considered a sliding-window approach: classify many windows across the picture and then, to detect a character, group the overlapping windows that were assigned the same character. This approach has to run a prediction on a lot of sub-pictures for every picture, so it’s very slow.
So instead, I will use a faster, state-of-the-art model that is also very interesting: Faster R-CNN.
Again, I will be using Keras with the TensorFlow backend.
You can find the code on the related GitHub repo.
The Faster R-CNN network
Object detection networks depend on region proposal algorithms to hypothesize object locations.
This model is based on a Region Proposal Network (RPN): a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position.
Faster R-CNN is an updated version of R-CNN (and Fast R-CNN). The structure is similar to Fast R-CNN, but the region proposal part is replaced by a ConvNet.
This is the structure of the feed-forward pass:
- A convolutional network produces a feature map from its last convolutional layer.
- A Region Proposal Network (RPN), itself a convnet, proposes Regions of Interest (RoIs) and projects them onto the feature map.
- Each proposed region is passed into a RoI pooling layer.
- A fully connected layer classifies each region.
This model has been implemented in Keras by Yann Henon.
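To make the RoI pooling step from the list above a bit more concrete, here is a minimal NumPy sketch of the idea: a region of arbitrary size is cut out of the feature map and max-pooled into a fixed grid, so the layers after it always receive the same input shape. This is only an illustration, not the code used in the Keras implementation; the function name `roi_max_pool`, the 7x7 default grid and the coordinate convention are my assumptions.

```python
import numpy as np


def roi_max_pool(feature_map, roi, output_size=(7, 7)):
    """Max-pool one region of a feature map into a fixed-size grid.

    feature_map: array of shape (height, width, channels)
    roi: (x1, y1, x2, y2) in feature-map coordinates, x2 and y2 exclusive
    output_size: (rows, cols) of the pooled output (assumed default: 7x7)
    """
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2, x1:x2, :]
    out_h, out_w = output_size

    # Bin edges that split the region into out_h x out_w cells.
    ys = np.linspace(0, region.shape[0], out_h + 1).astype(int)
    xs = np.linspace(0, region.shape[1], out_w + 1).astype(int)

    pooled = np.zeros((out_h, out_w, feature_map.shape[2]), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # Make sure every cell covers at least one pixel.
            cell = region[ys[i]:max(ys[i + 1], ys[i] + 1),
                          xs[j]:max(xs[j + 1], xs[j] + 1), :]
            pooled[i, j, :] = cell.max(axis=(0, 1))
    return pooled


# Toy 4x4 feature map with a single channel, pooled to a 2x2 grid.
toy_map = np.arange(16, dtype=np.float32).reshape(4, 4, 1)
toy_pooled = roi_max_pool(toy_map, (0, 0, 4, 4), output_size=(2, 2))
```

Whatever the size of the proposed region, the output has the same shape, which is what lets a fully connected layer sit on top of it.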
Completing the Dataset
The dataset used for the first part wasn’t complete enough: we also need the bounding boxes for each character in the training set.
So, starting from the already labeled data (you can see explanations of the dataset in the first part or on Kaggle), I labeled bounding boxes for each picture with matplotlib and mouse clicks.
So, in addition to the pictures, I have a text file with the bounding box coordinates and classes:
# pic, x1, y1, x2, y2, class
For each character, we have to mark the upper-left and lower-right corners of a bounding box around the character.
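As an illustration of how such an annotation file could be read back, here is a small sketch that parses lines in the `pic, x1, y1, x2, y2, class` layout shown above. The `Box` tuple, the `parse_annotations` name, and the sample file names and labels are hypothetical; I am also assuming integer pixel coordinates and comma-separated fields.

```python
import csv
from collections import namedtuple

# One annotated character: picture file, box corners, class label.
Box = namedtuple("Box", "filename x1 y1 x2 y2 label")


def parse_annotations(lines):
    """Parse 'pic, x1, y1, x2, y2, class' lines into a list of Box tuples."""
    boxes = []
    stripped = (line.strip() for line in lines)
    for row in csv.reader(stripped, skipinitialspace=True):
        # Skip blank lines and comment lines such as the header.
        if not row or row[0].startswith("#"):
            continue
        filename, x1, y1, x2, y2, label = (field.strip() for field in row)
        x1, y1, x2, y2 = int(x1), int(y1), int(x2), int(y2)
        # (x1, y1) must be the upper-left corner, (x2, y2) the lower-right one.
        if x2 <= x1 or y2 <= y1:
            raise ValueError("invalid box for %s: %r" % (filename, (x1, y1, x2, y2)))
        boxes.append(Box(filename, x1, y1, x2, y2, label))
    return boxes


# Hypothetical sample lines, only here to show the expected layout.
sample_lines = [
    "# pic, x1, y1, x2, y2, class",
    "pic_0001.jpg,57,72,200,300,homer_simpson",
    "",
    "pic_0001.jpg, 210, 80, 330, 310, bart_simpson",
]
sample_boxes = parse_annotations(sample_lines)
```

Note that the same picture can appear on several lines, one per character, which is exactly what we need for multi-character detection.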
Pictures passed into the model can have different ratios and sizes, but they must be preprocessed.
We resize the image so that its smallest side is 300 pixels long while keeping the same aspect ratio. We also normalize pictures by subtracting the dataset’s mean for each channel in order to center the data (we want each feature to have a similar range so that our gradients don’t go out of control).
x_img = x_img[:, :, (2, 1, 0)]  # BGR -> RGB
x_img = x_img.astype(np.float32)
# subtract the per-channel mean (assumes C.img_channel_mean holds one value per channel)
x_img[:, :, 0] -= C.img_channel_mean[0]
x_img[:, :, 1] -= C.img_channel_mean[1]
x_img[:, :, 2] -= C.img_channel_mean[2]
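The resizing step described above (smallest side at 300 pixels, same aspect ratio) can be sketched as a small helper that only computes the target dimensions. The name `resized_dimensions` and the rounding choice are my assumptions, not necessarily what the repo does.

```python
def resized_dimensions(width, height, min_side=300):
    """Return (new_width, new_height) so that the shorter side equals min_side.

    The aspect ratio is preserved up to rounding to whole pixels.
    """
    if width <= height:
        scale = float(min_side) / width
        return min_side, int(round(scale * height))
    scale = float(min_side) / height
    return int(round(scale * width)), min_side


# A 640x460 landscape picture: the height becomes 300, the width scales with it.
landscape_size = resized_dimensions(640, 460)
# The same picture in portrait orientation.
portrait_size = resized_dimensions(460, 640)
```

The actual resizing could then be done with an image library, for example OpenCV's `cv2.resize`, which takes the target size as `(width, height)`.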
Defining the network
The base network is ResNet. We build the RPN on top of the base layers, and the classifier is built on the same shared base layers.
# base layers
shared_layers = nn.nn_base(img_input, trainable=True)
# define the RPN, built on the base layers
rpn = nn.rpn(shared_layers, num_anchors)
# define the classifier, built on the base layers
classifier = nn.classifier(shared_layers, roi_input, C.num_rois, nb_classes=len(classes_count), trainable=True)
# defining the models + a model that holds both other models
model_rpn = Model(img_input, rpn[:2])
model_classifier = Model([img_input, roi_input], classifier)
model_all = Model([img_input, roi_input], rpn[:2] + classifier)
For training, the model iterates over batches of the training set for 90 epochs (the length of each epoch is 1000).
Training on a CPU is impractically slow, so I ran it on a GPU (a Tesla K80 on AWS EC2) at 410 seconds per epoch. In total, it took about 10 hours.
For each picture, we detect the characters and classify them. The network predicts the coordinates of a bounding box for each predicted character.
Characters are often detected and are well classified, but there is too much overlap. Indeed, the bounding boxes are often too large around the characters, so when there is more than one character in the picture, the boxes overlap. We can improve this by tuning the overlapping_threshold and the Non-Maximum Suppression function itself.
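To show what tuning that threshold actually changes, here is a minimal sketch of greedy Non-Maximum Suppression in NumPy. It is a generic version of the technique, not the function from the repo: the name `non_max_suppression`, the `overlap_thresh` parameter and its 0.5 default are assumptions, and boxes are assumed to be `(x1, y1, x2, y2)` with a positive area.

```python
import numpy as np


def non_max_suppression(boxes, scores, overlap_thresh=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes that overlap it too much.

    boxes: array-like of shape (n, 4) with rows (x1, y1, x2, y2)
    scores: array-like of shape (n,) with one confidence per box
    Returns the indices of the kept boxes, best score first.
    """
    boxes = np.asarray(boxes, dtype=np.float64)
    scores = np.asarray(scores, dtype=np.float64)
    if boxes.size == 0:
        return []

    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = np.argsort(scores)[::-1]  # indices sorted by decreasing score

    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))

        # Intersection of the best box with every remaining box.
        inter_w = np.clip(np.minimum(x2[best], x2[rest]) - np.maximum(x1[best], x1[rest]), 0, None)
        inter_h = np.clip(np.minimum(y2[best], y2[rest]) - np.maximum(y1[best], y1[rest]), 0, None)
        inter = inter_w * inter_h

        # Intersection over union; boxes above the threshold are suppressed.
        iou = inter / (areas[best] + areas[rest] - inter)
        order = rest[iou <= overlap_thresh]
    return keep


# Two heavily overlapping boxes and one separate box.
demo_boxes = [[0, 0, 100, 100], [10, 10, 110, 110], [200, 200, 300, 300]]
demo_scores = [0.9, 0.8, 0.7]
kept_default = non_max_suppression(demo_boxes, demo_scores)
kept_lenient = non_max_suppression(demo_boxes, demo_scores, overlap_thresh=0.7)
```

In this toy example the first two boxes have an IoU of about 0.68, so the second one is dropped at a threshold of 0.5 but survives at 0.7. A lower threshold removes more overlapping boxes, at the risk of also removing a second character standing close to the first.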
Of course, the accuracy is not as high as with the convnet from Part 1, which predicts one character at a time from a full picture containing a single character.
Predictions are very slow on my laptop CPU: 8 s per picture. On a GPU (Tesla K80), it takes 0.98 s per picture. It is interesting to compare this with a sliding window over a simple convnet (such as the one used in Part 1): for a 640x460 picture and a 64x128 window that moves 8 pixels at a time horizontally and 4 pixels vertically, we would have to predict about 6,000 sub-pictures! Even with a really quick network that predicts one window in 0.01 s, that is still about 60 s per picture! Moreover, a sliding window only allows one aspect ratio.
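The sliding-window estimate can be checked with a few lines of arithmetic. This sketch assumes the window is 64 pixels wide and 128 pixels tall and that it must fit entirely inside the picture; the helper name `count_windows` is mine.

```python
def count_windows(img_w, img_h, win_w, win_h, stride_x, stride_y):
    """Number of window positions that fit entirely inside the picture."""
    positions_x = (img_w - win_w) // stride_x + 1
    positions_y = (img_h - win_h) // stride_y + 1
    return positions_x * positions_y


# 640x460 picture, 64x128 window, 8 px horizontal and 4 px vertical steps.
n_windows = count_windows(640, 460, 64, 128, 8, 4)

# Total time at an (optimistic) 0.01 s per window.
total_seconds = n_windows * 0.01
```

That gives 73 x 84 = 6,132 windows, so a bit more than 61 seconds per picture, compared with under a second for Faster R-CNN on the GPU.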
I am still annotating pictures, and I will update this post with the new predictions.
Again, if you have any questions, please feel free to contact me, and if you like this post, don’t hesitate to recommend it :-).
The dataset is on Kaggle: download it and have fun!