Recreating The Simpsons with a DCGAN

Jason Salas
Aug 10, 2019 · 7 min read


The original image of Marge versus the version of the same image my neural network attempted to recreate

Inspired by the progress (and mammoth failures) of my previous deep convolutional generative adversarial network for image synthesis of basketball shoes, I wanted to test the learning capabilities of the neural network I built against another domain, one that was equal parts accessible and fun: The Simpsons. The data presents a neat set of computer vision challenges due to image format and structure, and it’s also easier to evaluate since we’re all so familiar with Springfield’s beloved residents.

As images are generated, we can watch the progression and developing accuracy in learning character compositions, pick apart inconsistencies, and see when the system breaks down and needs tuning.

Here’s my GitHub repo, which uses a DCGAN. I’ll be adding more GAN architectures as research continues, hopefully producing better results.

Progress of training in the early stages — details are being captured and the overall idea of a Simpsons character is being understood

And maybe, if I’m lucky, the net will get so good at understanding what makes a proper Simpson that it’ll create a completely new character, to the point that fans on the show’s subreddit might not realize it’s the result of image synthesis. It’s the Turing Test for cartoons — what I’m dubbing The Groening Test, after the show’s beloved creator.

Tweaking the models makes for better results

MODELS/ARCHITECTURE

I ran several experiments to try to get images of decent quality. In implementing the DCGAN algorithm from the original paper, there are two popular approaches to the training loop: a single loop with a very large number of iterations (in the tens of thousands), each making a single pass over one batch, or a loop nested within a loop so that every batch is processed in each of a smaller number of epochs. I found better results in terms of image quality with the latter, at the expense of much longer training time. I trained each model on Nvidia Tesla K80 cloud GPUs via Google Colab in batches of 128 images.
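For reference, here’s a minimal sketch of that nested-loop scheme, assuming Keras-style `generator`, `discriminator`, and combined `gan` models plus a preprocessed image array; the model definitions and hyperparameters in my repo differ in the details.

```python
import numpy as np

# Minimal sketch of the nested-loop DCGAN training scheme (epochs over batches).
# Assumes pre-built, compiled Keras models: `generator`, `discriminator`, and
# `gan` (generator stacked on a frozen discriminator), plus a NumPy array
# `dataset` of images scaled to [-1, 1]. Names here are illustrative only.

def train(generator, discriminator, gan, dataset,
          epochs=300, batch_size=128, latent_dim=100):
    batches_per_epoch = dataset.shape[0] // batch_size
    for epoch in range(epochs):                      # outer loop: epochs
        for _ in range(batches_per_epoch):           # inner loop: batches
            # 1) Train the discriminator on half real, half generated images
            idx = np.random.randint(0, dataset.shape[0], batch_size // 2)
            real = dataset[idx]
            noise = np.random.normal(0, 1, (batch_size // 2, latent_dim))
            fake = generator.predict(noise)
            d_loss_real = discriminator.train_on_batch(real, np.ones((batch_size // 2, 1)))
            d_loss_fake = discriminator.train_on_batch(fake, np.zeros((batch_size // 2, 1)))

            # 2) Train the generator through the combined model, labels flipped to "real"
            noise = np.random.normal(0, 1, (batch_size, latent_dim))
            g_loss = gan.train_on_batch(noise, np.ones((batch_size, 1)))

        print(f"epoch {epoch}: d_real={d_loss_real}, d_fake={d_loss_fake}, g={g_loss}")
```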

A sample of images from the original dataset

The three DCGAN models I trained with the dataset of 9,800 images are:

1. Minimal layers, single loop, small images — a simple model using a series of single iterations over the data. The layers in the generator and discriminator models aren’t very complex. This architecture trains faster and produces results, but still has digital noise and pixelation. It’s a good starting point for prototyping. My training session ran the 9,800 images at 32×32 pixels over 85,000 iterations and took about two hours, updating around 2 million weights during backpropagation. Manually spot-checking the images determined that the best results came after the 40,000th iteration. After that point, the images stopped trying to recreate the characters and began creating hybrids, with elements of two or more characters combined in a single image. Images at that point also began to be broken and fragmented, indicating overfitting due to saturation.

Simple model

2. Minimal layers with more filters, small images — a model using the same layers and filters as the previous method, but training on a nested loop of batches inside a main loop of epochs, as mentioned above. This greatly smooths out pixelation and still trains relatively quickly. A run of 300 epochs at 32×32 resolution took four hours, updating over 24,000,000 weights. The generator model produced its best image after the 70th epoch.

More complex model trained for a long time — things are obviously out of whack

3. More layers, more filters, larger images — a deeper, more complex model captured more information and ironed out the digital noise, trained against a version of the dataset where the images were now 64×64 pixels for much improved clarity. The nested loop method again paid dividends. The combined GAN model updates more than 41,000,000 weights while learning. The resulting images still aren’t at the 300 dpi caliber of the original source, but they’re much improved. Optimal quality was achieved after the 100th epoch. (A rough sketch of this kind of architecture follows below.)

A deeper, more complex model produced smoother images with greater detail
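For a sense of what that deeper model looks like, here’s a Keras-style sketch of a 64×64 DCGAN generator and discriminator. The filter counts and layer choices below are standard DCGAN conventions rather than the exact values from my repo.

```python
from tensorflow.keras import layers, models

LATENT_DIM = 100  # size of the random noise vector

def build_generator():
    # Upsamples a noise vector to a 64x64 RGB image via transposed convolutions.
    return models.Sequential([
        layers.Dense(4 * 4 * 256, input_dim=LATENT_DIM),
        layers.LeakyReLU(0.2),
        layers.Reshape((4, 4, 256)),
        layers.Conv2DTranspose(128, 4, strides=2, padding="same"),  # 8x8
        layers.LeakyReLU(0.2),
        layers.Conv2DTranspose(128, 4, strides=2, padding="same"),  # 16x16
        layers.LeakyReLU(0.2),
        layers.Conv2DTranspose(64, 4, strides=2, padding="same"),   # 32x32
        layers.LeakyReLU(0.2),
        layers.Conv2DTranspose(3, 4, strides=2, padding="same",
                               activation="tanh"),                  # 64x64x3
    ])

def build_discriminator():
    # Downsamples a 64x64 image to a single real/fake probability.
    return models.Sequential([
        layers.Conv2D(64, 4, strides=2, padding="same",
                      input_shape=(64, 64, 3)),                     # 32x32
        layers.LeakyReLU(0.2),
        layers.Conv2D(128, 4, strides=2, padding="same"),           # 16x16
        layers.LeakyReLU(0.2),
        layers.Conv2D(256, 4, strides=2, padding="same"),           # 8x8
        layers.LeakyReLU(0.2),
        layers.Flatten(),
        layers.Dropout(0.3),
        layers.Dense(1, activation="sigmoid"),
    ])
```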

IMPACTS OF CLASS IMBALANCE

Although this unlabeled dataset makes the project an exercise in unsupervised learning, even without any exhaustive exploratory data analysis the images indicate a disproportionate volume of pictures of the nuclear family of Homer, Marge, Bart and Lisa (but oddly not so much for Maggie). This certainly skews the learning towards features common to those dominant characters, and the majority of the more accurate images tend to be of the main family.

Regrettably, everything’s NOT coming up Milhouse as far as the distribution of images
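If you want to check that skew yourself before training, a quick tally over the dataset is enough. The sketch below assumes a labeled copy of the data where each character has its own subfolder, which isn’t how the unlabeled faces set I trained on is organized.

```python
from collections import Counter
from pathlib import Path

# Rough tally of images per character, assuming a layout like
# dataset/homer_simpson/*.jpg. Folder names and paths are placeholders.
def count_characters(root="simpsons_dataset"):
    counts = Counter()
    for img in Path(root).glob("*/*"):
        if img.suffix.lower() in {".jpg", ".jpeg", ".png"}:
            counts[img.parent.name] += 1
    return counts

if __name__ == "__main__":
    for character, n in count_characters().most_common():
        print(f"{character:30s} {n}")
```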

There’s evidence to suggest that datasets like ImageNet, with a large number of classes, may adversely impact a GAN’s ability to learn, whereas MNIST, CIFAR-10 and others tend to be more normalized, with objects having consistent orientation and in-frame positioning.

The classes of the dataset also intrigued me, being based on the cast. A cursory survey from 2011 estimated that there are around 457 characters throughout the series, not counting celebrity cameos and other limited appearances, and it doesn’t look like the dataset contains any animals like Santa’s Little Helper or anthropomorphized members like Itchy & Scratchy or the Happy Little Elves. Basically, it’s the people.

The learned filters during training cross-correlate certain facial aspects of different characters to hilarious hybrid effect, like merging Lisa with Homer.

RESULTS

What really surprised me is that one of the last components to be learned is a character’s pupils. And by “learned”, I mean that only very late in the training run would the model draw pupils within a character’s eyeballs. This is particularly interesting considering it’s one of the few aspects that’s consistent across every cast member.

There’s no concept of different eye colors, just a few scant pure black pixels sitting within a large white eyeball region, set against a larger region of yellow skin. It wasn’t the head shapes, hair styles, or mouths that lagged behind.

My theory is that because the number of pixels needed to make each pupil is so small, usually just a handful, the feature is harder for the filters to learn.

I’m not an artist or animator, so this wasn’t immediately intuitive to me, although it made perfect sense to my illustrator friends: so much of a character’s expression is carried by the eyes. So when Marge looks forward but her pupils are off to the side, you get a completely different emotion depending on whether her lips are closed (seductive happiness) or her teeth are clenched (anger).

It’s also interesting to observe that in my initial runs, a nearly identical architecture trained on a dataset of JPG photos of real-life objects learned faster than it did on the Simpsons’ drawn cartoon characters. I wouldn’t have guessed this, given the lighting, shadows, reflection, occlusion, skewing, and other elements common to photography. I need to run more tests to confirm whether there’s any significant difference between learning pictures of real-world objects and learning artificial images from illustration.

Lastly, there’s evidence to suggest that prefab datasets like MNIST, CIFAR-10 and CelebA tend to produce impractical results because the images are all consistently oriented and positioned, whereas real-world data, like Simpsons Faces, has cast members all over the frame and at different levels of magnification. This makes for a slight challenge. However, the consistent skin tones of the cartoon characters make this an easier task for the DCGAN algorithm to chew on than the variance among actual real-life actors.

WHERE TO GO FROM HERE

I’m enjoying the learning I’m getting from this little experiment and the various iterations I’m putting it through. My next step is to apply the same layer architecture to the CelebA dataset and see how it compares. I’m also thinking of applying other GAN extensions to the idea, specifically a conditional GAN (cGAN), to give the system a constraint so I can query specific characters instead of relying on the current random output, over which I have no control.

So hopefully, if I want to see a 5×5 grid of images of just Marge, I can generate exactly that instead of 25 random images. We’ll see!
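To give a sense of what that conditioning looks like, here’s a minimal cGAN-style sketch where the generator takes a character label alongside the noise vector. The label count, embedding size, and layer shapes are placeholders, not anything from my current repo.

```python
import numpy as np
from tensorflow.keras import layers, models

LATENT_DIM = 100
NUM_CHARACTERS = 20   # placeholder: however many characters end up labeled

def build_conditional_generator():
    # Noise input plus an integer character label, embedded and concatenated
    # so the generator learns a character-specific mapping.
    noise = layers.Input(shape=(LATENT_DIM,))
    label = layers.Input(shape=(1,), dtype="int32")
    label_vec = layers.Flatten()(layers.Embedding(NUM_CHARACTERS, 50)(label))
    x = layers.Concatenate()([noise, label_vec])
    x = layers.Dense(8 * 8 * 128)(x)
    x = layers.LeakyReLU(0.2)(x)
    x = layers.Reshape((8, 8, 128))(x)
    x = layers.Conv2DTranspose(128, 4, strides=2, padding="same")(x)  # 16x16
    x = layers.LeakyReLU(0.2)(x)
    x = layers.Conv2DTranspose(64, 4, strides=2, padding="same")(x)   # 32x32
    x = layers.LeakyReLU(0.2)(x)
    img = layers.Conv2DTranspose(3, 4, strides=2, padding="same",
                                 activation="tanh")(x)                # 64x64x3
    return models.Model([noise, label], img)

# Requesting a 5x5 grid of a single character (hypothetical label 7 for Marge):
generator = build_conditional_generator()
noise = np.random.normal(0, 1, (25, LATENT_DIM))
labels = np.full((25, 1), 7)
marge_grid = generator.predict([noise, labels])  # untrained here; shows the shape of the query
```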

COMPARISON TO OTHER PROJECTS

While I didn’t enter the Kaggle competition and I’m using a DCGAN at the moment, the results my neural net produces are on par with similar attempts in this space, and perhaps a tad smoother in terms of image quality. So we’re all making good progress! One thing that’s common across most submissions making headway is wildly unstable training.

There’s still a whole lot more learning that needs to happen, so I’m going to keep at it. It’s a lot of fun and there’s so much more I could do.

More soon!

