The Simpsons characters recognition and detection using Keras (Part 1)
Deep Learning: Training a convolutional neural network to recognize The Simpsons characters.
As a big Simpsons fan, I have watched (and am still watching) a lot of The Simpsons episodes, multiple times each, over the years. I wanted to build a neural network that can recognize the characters. I don’t know yet what the applications of the neural net will be (perhaps computing each character’s presence in every episode).
This project is not especially difficult but it can be time-consuming, because I have to manually label many pictures of each character. I didn’t find any database of The Simpsons characters on the Internet, so I am building it myself (I am still labeling pictures when I have time). I think it could be useful to others. The dataset is already available on Kaggle with exploratory code (in the Kernels section).
After learning and using TensorFlow for different projects, I wanted to use Keras because of its simplicity compared to plain TensorFlow and because it makes experimentation quick while still running on a TensorFlow backend. Keras is a deep learning library written in Python by François Chollet. My approach to this problem is based on convolutional neural networks (CNNs): multi-layered feed-forward neural networks able to learn many features.
You can find the code on the GitHub repo.
Building the image dataset
The dataset currently features 18 classes/characters (the data on Kaggle contains 20 classes, but so far I have used only 18 characters for training). Please check the image below for the characters used. The pictures come in various sizes and scenes, may be cropped from larger frames containing other characters, and are mainly extracted from episodes (seasons 4 to 24).
The training set includes about 1000 images per character (I am still labeling data to reach this number). The character is not necessarily centered in each image and can sometimes appear with other characters (but it should be the most important part of the picture).
With label_data.py, you can label data from .avi movies: you can get a cropped sub-picture (left or right part) or the full picture and then label it by entering part of the character’s name (burns for Charles Montgomery Burns).
To add more data, I also use the Keras model. I capture videos, get 3 pictures for each frame I analyze (left part, right part, full), and then ask my algorithm to classify each picture. Afterward, I check each picture it has classified. It’s still manual, but it’s faster, and it’s an incremental process that gets faster and faster, particularly for “small” characters.
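The three pictures per frame can be sketched with plain NumPy slicing. This is only an illustration: the exact crop boundaries are not stated here, so the sketch assumes the left and right parts are simply the two halves of the frame, and `frame_crops` is a hypothetical helper name.

```python
import numpy as np

def frame_crops(frame):
    """Return the left half, right half and full view of an H x W x C frame.

    Assumption: "left part" and "right part" are the two halves of the frame.
    """
    half = frame.shape[1] // 2
    return {
        "left": frame[:, :half],
        "right": frame[:, half:],
        "full": frame,
    }

# Hypothetical 640x360 video frame filled with zeros
frame = np.zeros((360, 640, 3), dtype=np.uint8)
crops = frame_crops(frame)
shapes = {name: crop.shape for name, crop in crops.items()}
```

Each of the three crops would then be resized and passed to the classifier, and the predicted label checked by hand.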
The first step in preprocessing the pictures is resizing them: all pictures need to have the same size for training. I convert the data to float32 to save some memory and normalize it (dividing by 255.). Then, instead of character names, I use numbers and, thanks to Keras, I can quickly convert those categories to vectors:
import cv2
import keras

pic_size = 64
num_classes = 18  # the 18 characters currently used for training
img = cv2.resize(img, (pic_size, pic_size)).astype('float32') / 255.
...
y = keras.utils.to_categorical(y, num_classes)
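To make the two preprocessing steps concrete, here is a minimal NumPy-only sketch of what the normalization and the one-hot conversion produce. It is not the project’s code: `preprocess` is a hypothetical helper, and the toy batch is random data standing in for already-resized pictures.

```python
import numpy as np

def preprocess(images_uint8, labels, num_classes):
    """Scale pixel values to [0, 1] and one-hot encode integer labels."""
    x = np.asarray(images_uint8).astype("float32") / 255.0
    y = np.zeros((len(labels), num_classes), dtype="float32")
    y[np.arange(len(labels)), labels] = 1.0  # one 1 per row, at the class index
    return x, y

# Hypothetical toy batch: 3 random 64x64 RGB "pictures" with class ids 0, 2, 1
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(3, 64, 64, 3), dtype=np.uint8)
x, y = preprocess(images, [0, 2, 1], num_classes=4)
```

The one-hot rows are the same kind of vectors that `keras.utils.to_categorical` returns for integer class ids.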
I split my dataset into a training set and a testing set: for this, I use sklearn’s train_test_split function.
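For readers who want to see what such a split does, here is a small NumPy stand-in for a shuffled hold-out split. It is not sklearn’s implementation; `shuffled_split`, the `test_size` value and the seed are assumptions made for the illustration.

```python
import numpy as np

def shuffled_split(x, y, test_size=0.15, seed=0):
    """Shuffle the sample indices, then hold out a fraction as the test set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    n_test = int(round(len(x) * test_size))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return x[train_idx], x[test_idx], y[train_idx], y[test_idx]

# Hypothetical toy data: 20 samples with alternating labels
x = np.arange(20).reshape(20, 1)
y = np.arange(20) % 2
x_train, x_test, y_train, y_test = shuffled_split(x, y, test_size=0.2)
```

Every sample ends up in exactly one of the two sets, so the test set is never seen during training.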
Deep Learning Model(s)
Now, let’s begin the fun part: defining our model. For now, we’ll use a feed-forward network with 4 convolutional layers with ReLU activation, followed by a fully connected hidden layer (see below for a deeper model). This model is similar to the CIFAR example from the Keras documentation. I also use dropout layers to regularize and avoid overfitting. The output layer uses softmax activation to output the probability of each class. I also tried replacing ReLU with ELU (like ReLU but with mean activations closer to zero) but it didn’t work.
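The activations mentioned above can be written in a few lines of NumPy. This is only a sketch of the math, not the Keras layers themselves; `alpha=1.0` for ELU is an assumed default.

```python
import numpy as np

def relu(x):
    """ReLU: zero for negative inputs, identity for positive ones."""
    return np.maximum(x, 0.0)

def elu(x, alpha=1.0):
    """ELU: like ReLU for positive inputs, smooth negative values below zero."""
    # np.minimum keeps exp() from overflowing on large positive inputs
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

x = np.array([-2.0, -0.5, 0.0, 1.5])
relu_out = relu(x)
elu_out = elu(x)
probs = softmax(np.array([2.0, 1.0, 0.1]))
```

On this toy input, the ELU outputs keep negative values, which is why their mean sits closer to zero than the mean of the ReLU outputs.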
As is often the case, categorical cross-entropy is used as the loss. For the optimizer, I use RMSprop, a variant of stochastic gradient descent where we “divide the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight”.
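The loss and the update rule quoted above can be sketched in NumPy as follows. This is a simplified illustration, not Keras’s implementation: the hyperparameters (`lr`, `rho`, `eps`) are assumed values, and the toy problem simply minimizes w² to show the update at work.

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean over samples of -sum(y_true * log(y_pred))."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(-(y_true * np.log(y_pred)).sum(axis=-1).mean())

def rmsprop_step(w, grad, avg_sq, lr=0.001, rho=0.9, eps=1e-7):
    """One RMSprop update: scale each step by a running RMS of past gradients."""
    avg_sq = rho * avg_sq + (1.0 - rho) * grad ** 2
    w = w - lr * grad / (np.sqrt(avg_sq) + eps)
    return w, avg_sq

# Loss for one sample whose true class gets probability 0.5
loss = categorical_cross_entropy(
    np.array([[1.0, 0.0, 0.0]]), np.array([[0.5, 0.25, 0.25]])
)

# Toy optimization: minimize f(w) = sum(w ** 2), whose gradient is 2 * w
w = np.array([1.0, -3.0])
avg_sq = np.zeros_like(w)
for _ in range(1000):
    w, avg_sq = rmsprop_step(w, 2 * w, avg_sq, lr=0.01)
```

Because each step is divided by the running magnitude of the gradient, both weights move at a similar pace even though their gradients differ in size.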
Training the model
For training, the model iterates over batches of the training set (batch size: 32) for 200 epochs.
As I don’t have a huge dataset, I am using data augmentation (which is really simple to use with the Keras library). It means applying a number of random variations to the pictures so the model never sees exactly the same picture twice. This helps prevent overfitting and helps the model generalize better.
from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    featurewise_center=False,  # set input mean to 0 over the dataset
    samplewise_center=False,  # set each sample mean to 0
    featurewise_std_normalization=False,  # divide inputs by std of the dataset
    samplewise_std_normalization=False,  # divide each input by its std
    rotation_range=0,  # randomly rotate images in the range (degrees, 0 to 180)
    width_shift_range=0.1,  # randomly shift images horizontally (fraction of total width)
    height_shift_range=0.1,  # randomly shift images vertically (fraction of total height)
    horizontal_flip=True,  # randomly flip images horizontally
    vertical_flip=False)  # randomly flip images vertically
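To give an idea of what the shift and flip settings do to a single picture, here is a rough NumPy sketch. It is not how Keras implements the generator: `random_shift_flip` is a hypothetical helper, and filling the uncovered border with edge pixels is an assumption chosen to resemble a nearest-pixel fill.

```python
import numpy as np

def random_shift_flip(img, rng, shift_frac=0.1, flip_prob=0.5):
    """Randomly shift an H x W x C image and maybe flip it horizontally."""
    h, w = img.shape[:2]
    max_dy, max_dx = int(h * shift_frac), int(w * shift_frac)
    dy = int(rng.integers(-max_dy, max_dy + 1))
    dx = int(rng.integers(-max_dx, max_dx + 1))
    # Pad with edge pixels, then crop a window offset by (dy, dx):
    # a positive dx moves the picture content to the right.
    padded = np.pad(img, ((max_dy, max_dy), (max_dx, max_dx), (0, 0)), mode="edge")
    out = padded[max_dy - dy:max_dy - dy + h, max_dx - dx:max_dx - dx + w]
    if rng.random() < flip_prob:
        out = out[:, ::-1]  # mirror left/right
    return out

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)  # hypothetical picture
augmented = random_shift_flip(img, rng)
```

Calling the function again on the same picture gives a different variant, which is what lets the model see a slightly different image at every epoch.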
This takes a while on CPU (on my computer), so I run it on a GPU with AWS EC2 (Tesla K80): 8 seconds per epoch. In total, it took about 20 minutes (which is really quick for deep learning).
As we can see on the plot, after 200 epochs the model seems to have reached its asymptote, without obvious overfitting. Moreover, the accuracy seems good too.
Of course, right now it’s hard to get a reliable measure of the model’s accuracy because of the low number of pictures, but the measure will become more meaningful as the number of pictures grows. Thanks to sklearn it’s really easy to print a classification report:
As you can see, the accuracy (f1-score) is really good: above 90% for every character except Lisa. The precision for Lisa is 82%. Maybe Lisa is mixed up with other characters.
Indeed, Lisa is often mixed up with Bart, probably because many pictures of Lisa contain Bart too.
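For reference, the per-class numbers in such a report can be computed by hand. The sketch below is a NumPy stand-in, not sklearn’s code; `per_class_report` is a hypothetical helper and the labels are toy values.

```python
import numpy as np

def per_class_report(y_true, y_pred, num_classes):
    """Precision, recall, F1 and support for each class id."""
    report = {}
    for c in range(num_classes):
        tp = int(np.sum((y_pred == c) & (y_true == c)))  # correctly predicted as c
        fp = int(np.sum((y_pred == c) & (y_true != c)))  # wrongly predicted as c
        fn = int(np.sum((y_pred != c) & (y_true == c)))  # missed examples of c
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[c] = {"precision": precision, "recall": recall, "f1": f1, "support": tp + fn}
    return report

# Toy labels: one example of class 0 is mistaken for class 1
y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2])
report = per_class_report(y_true, y_pred, num_classes=3)
```

In this toy case, the confusion lowers the recall of class 0 and the precision of class 1, which is the same pattern as one character being mixed up with another.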
Adding a threshold to improve the accuracy
In order to improve the precision (which will of course decrease the recall, though I will try not to decrease it too much), I thought I could add a threshold.
Before talking about a threshold to improve accuracy, I just want to add a well-known graph about recall and precision.
I compute some statistics about good and wrong predictions: the maximum predicted probability, the probability difference between the best two candidates, and the standard deviation (std) of the predicted probabilities.
For good predictions : Max : 0.83, Difference Two First : 0.773, STD : 0.21
For wrong predictions : Max : 0.27, Difference Two First : 0.092, STD : 0.07
If the probability of the predicted character (1.) is too low, the standard deviation of the prediction (2.) is too low (a flat, uncertain distribution), or the probability difference between the two most likely characters (3.) is too low, maybe we can decide not to predict a character at all.
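The three values can be computed directly from the softmax output. This is a minimal sketch with made-up probability vectors; `confidence_stats` is a hypothetical helper name.

```python
import numpy as np

def confidence_stats(probs):
    """Max probability, gap between the top two classes, and std, per prediction."""
    probs = np.asarray(probs, dtype=float)
    top_two = np.sort(probs, axis=-1)[..., -2:]  # second best, then best
    return {
        "max": top_two[..., 1],
        "diff_top2": top_two[..., 1] - top_two[..., 0],
        "std": probs.std(axis=-1),
    }

# Two made-up predictions over 3 classes: one confident, one hesitant
stats = confidence_stats([[0.90, 0.05, 0.05],
                          [0.40, 0.30, 0.30]])
```

The hesitant prediction scores lower on all three values, which matches the pattern seen in the statistics for wrong predictions.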
So I plot those 3 values for the test set to find a line (or a hyperplane) to separate good and wrong predictions. I did it for both characters.
As you can see, it’s impossible to find a linear separation, and therefore a simple threshold, between good and wrong predictions in either graph. Of course, wrong predictions are concentrated in the lower left of each graph, but there are many good predictions in this corner too. If I choose a threshold (for example, on the probability difference and the probability of the best candidate), my recall will be lower.
Maybe the best way to improve the accuracy without affecting the recall too much is to plot those graphs for every character, or for a character with a low precision (e.g. Lisa Simpson).
Moreover, the threshold could be useful for pictures without famous characters or with no character at all. Currently, I do have a “no-character” class in my model, but I can probably add a threshold as well. I don’t think we can find the perfect formula (combining the probability of the best prediction, the probability difference and the standard deviation), so I will just focus on the probability of the best prediction.
Recall and Precision regarding the probability of the best prediction
There is a classic trade-off between recall and precision and, as is often the case, we can’t maximize both at the same time. So it depends on what exactly we want.
We can plot the F1-score, the recall and the precision as a function of the minimum probability required for the predicted class.
As we can see, it really depends on the character. For example, if we focus on Lisa Simpson, it would be worthwhile to require a minimum probability (0.2) for the predicted class, but this threshold would not be very useful for all classes combined.
So, depending on the application, we may or may not want to add a threshold of around 0.2–0.4 on the minimum probability for the predicted class.
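The effect of such a threshold can be sketched as follows. This is a simplified illustration with all classes combined, not the per-character curves discussed above: `threshold_metrics` is a hypothetical helper, the probabilities are made up, and a prediction below the threshold is treated as “no answer”.

```python
import numpy as np

def threshold_metrics(probs, y_true, threshold):
    """Overall precision and recall when low-confidence predictions are rejected."""
    probs = np.asarray(probs, dtype=float)
    y_true = np.asarray(y_true)
    y_pred = probs.argmax(axis=1)
    kept = probs.max(axis=1) >= threshold       # predictions we dare to make
    correct = kept & (y_pred == y_true)
    precision = float(correct.sum() / kept.sum()) if kept.any() else 0.0
    recall = float(correct.sum() / len(y_true))  # rejected samples count as misses
    return precision, recall

# Made-up softmax outputs for 4 pictures and 2 classes
probs = [[0.90, 0.10], [0.55, 0.45], [0.60, 0.40], [0.20, 0.80]]
y_true = [0, 1, 0, 1]
sweep = {t: threshold_metrics(probs, y_true, t) for t in (0.0, 0.58, 0.7)}
```

Here the middle threshold removes the only wrong prediction, while the highest one also rejects a correct prediction: precision stays at its maximum but recall drops, which is the trade-off described above.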
Improving the CNN model
As I said earlier, my model has four convolutional layers. To help the neural net capture more detail and more complexity, we can go deeper and add more convolutional layers.
That’s what I did. I tried 6 convolutional layers and a deeper network (dimensions of the output space going from 32, 64, 512 to 32, 64, 256, 1024). It improved the accuracy (precision and recall), as you can see below. The lowest precision is 0.89 for Nelson Muntz, for whom we only had 300 training examples.
Moreover, this model converges faster: only 40 epochs (vs 200). It took 15 minutes to train on a Tesla K80.
Visualizing predicted characters
As you can see, the neural network is pretty accurate at recognizing and classifying characters. Next, I predict characters in a video. The predictions are fast enough (less than 0.1 s per picture) to process multiple frames each second.
If you have any questions, please feel free to contact me, and if you like this post, don’t hesitate to recommend it :-).
The dataset is on Kaggle: download it and have fun!
The next steps, with a detection model in addition to the classification model, are described in Part 2.