peanuts.today | Cleaning and Pre-processing comic strip data set (Part 2)

Craig Burdulis · Published in Analytics Vidhya · 5 min read · Nov 4, 2019

This is the second post in my series about exploring/working with the full dataset of Peanuts comic strips. For the introduction post, click here.

All code used to operate on the comic strip images can be found here.

NOTE: I have not published the raw dataset because this is a personal research project and I want to be respectful of any copyright/fair use rules and regulations that might apply to these images. I have made some images from the data set available both 1) here in the blog post to illustrate some of the challenges I faced and 2) on peanuts.today, where they are only accessible in a controlled manner to prevent image scraping.

With that out of the way, let’s get started!

The Raw Data

Working with the images posed some challenges given my goals of 1) extracting any available metadata in the image itself and 2) formatting the images in an appropriate way for displaying on the web.

To give some context on the variability of the images, here are 3 different images from the dataset that illustrate the differences that had to be dealt with.

Image 1 is an example of a Sunday strip. Rather than the typical four-panel horizontal strip, Sunday strips are "full page", with multiple rows making up a single comic. My processing scripts will need to recognize this and keep the entire image together as a single comic.

Image 1

Image 2 is an example of what I call Weekday strips, meaning the typical single four-panel row. My processing script will need to recognize that there are 3 separate strips/comics in this image and extract each one appropriately.

Image 2

Image 3 also contains Weekday strips, but it has a thick, black banner at the bottom of the image. My processing script will need to ignore/remove that banner before trying to process the image.

Image 3

All images in the dataset also include pieces of metadata in the bottom left/right corners. This includes the page number (not that important) as well as the month and year of publication (very important).

Challenges

Based on a cursory look at the data above, some challenges become clear:

  • Differentiating between a Sunday and Weekday comic
  • Developing a methodology for recognizing individual strips in an image
  • Extracting metadata (e.g. page number, month/year) from the image

Solutions

Thankfully, I was able to find relatively straightforward solutions to all the problems outlined above.

Comic Identification & Extraction

The primary distinction between Sunday and Weekday comics turned out to be the amount of whitespace between the panel rows. As you can see by comparing Image 1 and Image 2 (above), there is much less whitespace in the Sunday comics compared to the Weekday ones.

Therefore, I developed a function (`split_and_save_strips`) that takes an image and some tuning parameters, calculates the number of strips found within the image, and creates a new image for each strip found.

Function for extracting strips from image
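As a rough sketch of the whitespace heuristic (not the actual `split_and_save_strips` implementation; the function name, thresholds, and grayscale NumPy representation here are my assumptions for illustration), a helper like this could locate strip boundaries by scanning for runs of blank rows:

```python
import numpy as np

def find_strip_rows(gray, white_thresh=250, min_gap=20):
    """Locate horizontal strips in a grayscale page image.

    A row counts as 'blank' when nearly every pixel exceeds
    white_thresh. A strip ends once min_gap consecutive blank rows
    are seen, so the tightly packed rows of a Sunday page stay
    together while generously spaced Weekday strips split apart.
    Returns a list of (top, bottom) row spans, one per strip.
    """
    blank = (gray > white_thresh).mean(axis=1) > 0.99
    strips, start, gap = [], None, 0
    for y, is_blank in enumerate(blank):
        if not is_blank:
            if start is None:
                start = y          # first content row of a new strip
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:     # enough whitespace: close the strip
                strips.append((start, y - gap + 1))
                start = None
    if start is not None:          # strip runs to the bottom of the page
        strips.append((start, len(blank)))
    return strips
```

Tuning `min_gap` is what distinguishes the two layouts: a value larger than the inter-row gap of a Sunday page but smaller than the gap between Weekday strips yields one span for the former and several for the latter.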

Here’s a flowchart diagram to illustrate how the function works:

And to visualize it a bit better, this is what an original image would look like if it was “marked up” by the function:

Metadata Extraction

Extracting the metadata from the source image turned out to be fairly simple. Since the text is printed consistently, I figured some kind of OCR tool would be able to parse it successfully, and after some validation I could load the results into a database table/CSV.

Rather than pay for any of the cloud-based OCR services (e.g. AWS Textract/Rekognition or GCP’s Vision API), I found that the open-source Tesseract project suited my needs just fine. Some cleanup of the images was necessary, namely cropping so that the only part of the image passed to Tesseract was the region containing the metadata text. Including anything extraneous (e.g. part of the comic strips or the watermark) confused Tesseract and prevented it from returning valid text. The function that handles the metadata parsing/validation can be found here.
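To illustrate the validation half of that step, here is a minimal sketch assuming the OCR output looks something like "Page 12  March 1965" (the exact corner-text format and the `parse_corner_text` name are my assumptions, not the project's actual code, which is linked above):

```python
import re

MONTHS = ("JANUARY|FEBRUARY|MARCH|APRIL|MAY|JUNE|JULY|"
          "AUGUST|SEPTEMBER|OCTOBER|NOVEMBER|DECEMBER")

def parse_corner_text(raw):
    """Validate raw OCR text from a page's bottom corners.

    Returns a dict with month, year, and (optional) page number,
    or None when the month/year can't be found -- the signal that
    this page needs manual review.
    """
    text = raw.upper()
    m = re.search(rf"({MONTHS})\s+(\d{{4}})", text)
    if not m:
        return None
    page = re.search(r"PAGE\s+(\d+)", text)
    return {
        "month": m.group(1).capitalize(),
        "year": int(m.group(2)),
        "page": int(page.group(1)) if page else None,
    }

# The raw string itself would come from something like:
# raw = pytesseract.image_to_string(corner_crop)
```

Returning `None` instead of raising keeps the batch run moving: pages whose corners Tesseract garbled simply land in a "needs review" pile.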

Geeky Asides

In addition to the use case-specific solutions related to image modification/extraction discussed above, I also had to figure out how to efficiently run these scripts against thousands of files. While my inner AWS architect wanted to bundle these scripts into Lambda functions and build an image processing pipeline using S3 Event Notification triggers and maybe a Step Functions state machine, I found it easier to run the scripts locally on my laptop: 1) this is a one-time cleanup process, and 2) the inconsistency of the images produced an error rate of roughly 1–2% that required manual intervention, which would have been harder to handle if the files and scripts were not local.

In the end, I was able to develop an efficient local pipeline using Python’s `threading`, `queue`, and `concurrent.futures` modules, which you can review in the `crop_images.py` script, among others. Using this, I was able to take advantage of multiple threads and reduce the time for cropping/splitting images significantly. No script took longer than an hour or two to finish.
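A minimal sketch of that kind of local thread pool, using `concurrent.futures` (the `process_page` stub stands in for the real crop/split work in `crop_images.py`; names here are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

def process_page(path):
    # Placeholder for the real per-file crop/split work.
    return path.name

def run_pipeline(paths, workers=8):
    """Fan image files out to a thread pool.

    The work is dominated by file I/O (reading and writing images),
    so threads overlap nicely despite the GIL. Failures are collected
    rather than raised, matching the ~1-2% of pages that need manual
    intervention.
    """
    results, errors = [], []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(process_page, p): p for p in paths}
        for fut in as_completed(futures):
            try:
                results.append(fut.result())
            except Exception as exc:
                errors.append((futures[fut], exc))
    return results, errors
```

Collecting errors instead of aborting means one bad scan doesn't kill an hour-long batch; the error list becomes the manual-review queue.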

Summary & Next Steps

After running the various scripts to crop, split, and extract metadata from the images, I now have a clean data set with a 1-to-1 correspondence between image files and comic strips.

The next step in using this data is to upload it to some repository that I can use to access and search for data. That will be discussed in the next post. Thanks for reading!
