How I trained AI to watch the Simpsons

Joseph George Lewis
Published in Geek Culture · 9 min read · Jan 24, 2023

Photo by Jack O'Rourke on Unsplash

I’ve wanted to get my hands dirty with object detection ever since I started writing blogs for the community. The trouble has always been finding a dataset that I found interesting. Sure, I could crack out a pedestrian detector like in the tutorials or make a model recognise stop signs, but that’s just not fun. Enter Homer Simpson …

The code for this article can all be found on my GitHub. I used the super helpful tutorial from PyTorch in Google Colab to bootstrap my project and then finetuned some processing and parameters for this dataset. I also found many helpful links from around Medium to help build the project, all of which can be found at the end of the article. This is my first time using PyTorch or doing any image processing, so please leave any suggestions or ask for help on your own projects in the comments.

Project Brief

Object detection is first and foremost a computer vision problem. The primary goal is to locate objects in images; the secondary goal is to classify those objects. PyTorch is a deep learning framework that takes advantage of tensors to run neural network training jobs quickly.

The aim of the work here is to get some experience with image processing and to share that with the community. I also want to learn more about PyTorch and how it works by taking advantage of Google Colab and its awesome free GPU feature. The actual aim of the code is just to train and test a model to recognise some of the characters from the longest-running American animated TV show, The Simpsons.

There are some code steps needed to get up and running with PyTorch in Colab, but they are easy enough to follow. For reference, I have pasted them below, though they mostly came from the PyTorch documentation:
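(A sketch of those cells, following the torchvision object detection tutorial; the exact tag to check out is an assumption and should match your installed torchvision version.)

%%shell
# Install the COCO tools used by the evaluation helpers.
pip install pycocotools --quiet

# Grab the torchvision detection reference helpers (engine.py, utils.py,
# transforms.py, ...) so the notebook can import them.
git clone https://github.com/pytorch/vision.git
cd vision
git checkout v0.8.2  # assumption: pin a tag matching your torchvision version
cp references/detection/utils.py ../
cp references/detection/transforms.py ../
cp references/detection/coco_eval.py ../
cp references/detection/coco_utils.py ../
cp references/detection/engine.py ../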

Data Description

The data in use comes from Kaggle with a non-commercial use license, which means it is safe to use in this project. It goes without saying that all credit for the original images goes to the original artists.

The training dataset is made up of around 6,000 images and has been carefully constructed by humans: each entry has an image of a character (stored as a jpg file) and a set of bounding box coordinates surrounding the character in the image. The goal of the neural network will be to get a machine to build these boxes. Figure One below has an example of a human-labelled box:

Figure One: Example of Grampa Simpson in a bounding box. (Credit: Original Animators).

There are some steps taken to process the images before they can be used in the project, but the important bit here is just validating that the bounding boxes are sensible. The model built using PyTorch will use the bounding box coordinates and the label for the inside of the box to predict the location of the character and who they are (in this case, Grampa Simpson).

The testing dataset then contains a set of image files for most of the characters, again with bounding box coordinates and a label for each character. The model will first be trained on the training data and then tested to see how well it can evaluate the images it sees.

Image Processing and Data Prep

The image processing stage starts with reading the images and sizing them all consistently. The PyTorch model requires our images to share a consistent height, H, and width, W. However, simply re-sizing the images isn’t quite enough, because then the bounding box coordinates will no longer match the image size. So the images and boxes have to be manipulated together.

Resizing images and bounding boxes can get really tricky but thanks to this amazing article from ML Engineer Aakanksha right here on Medium I was able to process my images and their bounding boxes too:

This really cool method from Aakanksha involves reading an image and creating a mask of the image using the original bounding box. I did have to make some alterations to the logic, including flipping the row and column values and the x and y values used in the original; however, I’m still super grateful for the code provided. An example of a mask using the same image as above is below:

Figure Two: Bounding box as a mask

Hopefully, you can see that the yellow box was formerly the bounding box of good old Grampa Abe Simpson! Now we simply resize the image to be 300x300 and do the same thing to the mask image:

Figure Three: Resized original image and resized mask (300x300)

The final resized image and bounding box look something like this:

Figure Four: Resized image with the overlayed bounding box
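To make Figures Two to Four concrete, here is a minimal sketch of the mask trick, assuming OpenCV-style arrays and (x1, y1, x2, y2) pixel boxes; it is not Aakanksha’s exact code:

import numpy as np
import cv2

def resize_image_and_bbox(img, bbox, size=(300, 300)):
    # Paint the bounding box into a blank mask the same shape as the image.
    mask = np.zeros(img.shape[:2], dtype=np.uint8)
    x1, y1, x2, y2 = [int(v) for v in bbox]
    mask[y1:y2, x1:x2] = 255

    # Resize image and mask to the same target size; nearest-neighbour
    # interpolation keeps the mask edges sharp.
    img_r = cv2.resize(img, size)
    mask_r = cv2.resize(mask, size, interpolation=cv2.INTER_NEAREST)

    # Read the resized box back off the nonzero region of the mask.
    ys, xs = np.nonzero(mask_r)
    return img_r, (xs.min(), ys.min(), xs.max(), ys.max())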

Here are some more examples just to show the success of the method, even on some complex images:

Figure Five: Extended examples of re-sized bounding boxes and images

That’s all the ‘pre-processing’ that we will apply to the images for now. Later on, some PyTorch transformations will also be added to randomly flip the images and convert them to tensors.
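As a sketch, those transformations can be built with the transforms.py helper copied from the torchvision detection references; unlike torchvision.transforms, these act on (image, target) pairs, so a random flip moves the bounding boxes too:

import transforms as T  # the copied detection reference helper

def get_transform(train):
    tfms = [T.ToTensor()]  # convert the PIL image to a tensor
    if train:
        # Flip images (and their boxes) horizontally half the time.
        tfms.append(T.RandomHorizontalFlip(0.5))
    return T.Compose(tfms)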

Finally, there is a brief stage of label encoding the targets so that each character we are trying to identify has a distinct label for the model to use. In the dataset, there are around 30 distinct characters, including Grampa, Mr Burns and Skiiiinnnnnerrrr. The encoding therefore runs from 1 all the way to 30. Usually, we would start at zero, but in object detection tasks there is by default an extra class, the background, which takes the label 0.
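A minimal sketch of that encoding, assuming a hypothetical “character” column in the processed data frame:

# Map each character name to an integer label, reserving 0 for background.
characters = sorted(train_df["character"].unique())
label_map = {name: i + 1 for i, name in enumerate(characters)}
train_df["label"] = train_df["character"].map(label_map)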

Now we are ready to start interfacing with PyTorch! To load our data, PyTorch requires it to be held in a class that inherits from PyTorch’s own Dataset class and implements both the __getitem__ and __len__ methods. Setting this up was a bit tricky; the code snippet below shows the process of building this class from a processed data frame:
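Here is a sketch of such a class; the column names for the image path, box coordinates and label are assumptions, and the real version in the notebook differs in the details:

import torch
from torch.utils.data import Dataset
from PIL import Image

class SimpsonsDataset(Dataset):
    def __init__(self, df, transforms=None):
        self.df = df.reset_index(drop=True)
        self.transforms = transforms

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img = Image.open(row["image_path"]).convert("RGB")

        # Faster R-CNN expects a target dict with float [x1, y1, x2, y2]
        # boxes and integer class labels (0 is reserved for background).
        boxes = torch.tensor(
            [[row["x1"], row["y1"], row["x2"], row["y2"]]], dtype=torch.float32)
        labels = torch.tensor([row["label"]], dtype=torch.int64)
        target = {"boxes": boxes, "labels": labels,
                  "image_id": torch.tensor([idx])}

        if self.transforms is not None:
            img, target = self.transforms(img, target)
        return img, target

    def __len__(self):
        return len(self.df)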

Model build

The object detection model comes from PyTorch and is an implementation of Faster R-CNN. There are some great articles that explain Faster R-CNN, but for the purpose of this article, it’s most important to know that it is a neural network architecture that excels at detecting objects (typically multiple objects) in an image.

Breaking down the name Faster R-CNN gives a good description of what to expect. CNN just means it is a Convolutional Neural Network, an architecture commonly used in image classification tasks. The added R means it is iterated further to become Region-Based, useful for object detection as it first proposes regions where objects could exist. Finally, Faster is a bit more complicated, but essentially means the network learns its own region proposals with a Region Proposal Network that shares layers with the rest of the model, rather than relying on an external proposal algorithm. There will be some links to articles on Faster R-CNN at the end of this blog.

This project uses a pre-trained Faster R-CNN model and finetunes it for the dataset over 3 training epochs (only 3 epochs are used due to resource and time limitations). The precise model used is Faster R-CNN with a ResNet backbone. The finetuning is the tricky part, as the train and test datasets, schedulers and training epoch code must all be set up. Luckily, PyTorch provides a lot of these helper functions in the torchvision install and the Git repo. For the full training code specific to this project, please check out my GitHub; otherwise, links to the PyTorch implementation will be below.
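To give a flavour of that setup, here is a sketch following the torchvision detection tutorial; the ResNet-50 FPN variant, the hyperparameters and the train_dataset name are assumptions rather than my exact notebook code:

import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

import utils                    # helper files copied from references/detection
from engine import train_one_epoch

num_classes = 31  # 30 characters plus the background class

# Load a COCO-pretrained model and swap in a new box predictor head.
# (Older torchvision versions use pretrained=True instead of weights.)
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Detection models take lists of variable-sized images, hence the collate_fn.
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=4, shuffle=True, collate_fn=utils.collate_fn)

params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.005, momentum=0.9, weight_decay=0.0005)
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)

for epoch in range(3):
    train_one_epoch(model, optimizer, train_loader, device, epoch, print_freq=10)
    lr_scheduler.step()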

The training process should end up looking something like this!

Figure Six: Printed messages during training

Model evaluation

The model is assessed according to how well the bounding box is fitted, as well as how well the character has been classified. Before evaluating the metrics, some of the model’s predictions are shown below!

Figure Seven: Network detections. Left to Right: The model correctly classified Lisa Simpson and Comic Book Guy but predicted Milhouse instead of Krusty

In the anecdotal examples above, the model successfully constructed bounding boxes around all three characters, so it is successful at detecting Springfield-ians! However, it did misclassify the Krusty example as Milhouse, possibly owing to the eye mask resembling Milhouse’s iconic glasses.

In terms of metrics for the entire test set, the evaluation uses the common measure of IoU, or Intersection over Union, to determine whether a result was positive or not. The Intersection over Union is calculated by finding the intersection (or common area) of the predicted and true bounding boxes and dividing by their union (or combined area). Typically, a value above 0.5 is seen as a true positive and anything under 0.5 as a false positive; a minimal sketch of the calculation is given below. Assessing the model by the same criteria gives the chart that follows.
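This bare-bones sketch assumes axis-aligned (x1, y1, x2, y2) boxes and stands in for the notebook’s actual evaluation code:

def iou(box_a, box_b):
    # Corners of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    # Overlap is zero when the boxes do not intersect.
    inter = max(0, x2 - x1) * max(0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)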

Figure Eight: False positives vs. True positives for the test data

Despite having a lot of false positives, the model has still performed well for almost 500 test images after training for just 3 epochs. In fact, lowering the IoU threshold just slightly to 0.4 results in a vast improvement in the true positive rate, showing that when the model did miss the bounding box, it did not miss drastically.

As for the classification of the characters in the bounding boxes, the model performed very strongly, achieving an accuracy of 0.91! The counts of matched characters are given below:

Figure Nine: Counts of matched and unmatched characters

The confusion matrix also shows how good the model was at discerning the difference between each member of the show:

Figure Ten: Confusion matrix of predicted and true labels for the test set (labels only for ease of understanding, the character names can be found in the .ipynb file on GitHub)

The Faster R-CNN typically performs well on images with multiple objects to detect. The limitation of this data source is that the images contain just one character. So, as an added assessment, the network was also tested on a single image containing multiple characters; the Simpson family evaluation is given below:

Figure Eleven: Simpsons family with predicted bounding boxes

Conclusion

Overall, I have really enjoyed getting hands-on with the incredibly powerful PyTorch, though it was a steep learning curve taking on image processing and PyTorch at the same time. Some of the interacting components, like datasets, data loaders and models, are difficult to understand, and the data structures for predictions and test data can be hard to unpack. That being said, I am incredibly happy with the result; the model ended up out-performing my expectations!

In future, developments of this work could be to augment the training data with more image transformations and to test other network architectures, like the increasingly popular YOLO, or other backbones. All of that is out of the scope of this article, but please explore if you want to take the work further!

I really hope you have enjoyed this article and that it has inspired you to get involved in a PyTorch project of your own. If you do decide to get involved in this or any other data science project, please share your work in the comments or link to your own blog. Thanks for reading, and as always, please check out the links below for the full code and further reading.

Links

Calculating IoU Score:

Faster R-CNN and other Object Detection NN techniques:

Kaggle Dataset:
https://www.kaggle.com/datasets/alexattia/the-simpsons-characters-dataset

PyTorch Colab Docs:

Source Code:
