Creating a meme stash with state-of-the-art machine learning
Business problem: I want to use my own memes as reaction images, which are superior to the normie memes of gif embedding services.
Solution: tagging my images based on their content and building a database of meme metadata.
This story will be told in several parts, each focusing on one step of building the service and the design decisions behind it:
- Deduplicate the images
- Detect text on the images
- Detect objects / classify image
- Find faces and their emotions
Each of these steps tackles a difficult question, such as how similar two images can be before they count as the same, or how to detect white(ish) characters on a bright background, and SO MUCH MORE.
In this post, I will give you the gist of the idea.
Reading / Deduplication
Images can be found in many formats in the wild, but jpeg and png dominate the internet. Still, the extension is misleading about whether the image contains an alpha channel (RGBA vs. RGB). This is all well and good until you try to create a tensor from the image and crash like Icarus with an incomprehensible error message. So the zeroth step converts all incoming images to RGB.
The images are read with the Pillow package, but image preparation is done with the OpenCV package, because it offers one-line solutions for adaptive thresholding and advanced binarization.
After reading the image, deduplication can commence. To keep only unique images I used perceptual hashing from the ImageHash library. A perceptual hash describes an image by its visual characteristics rather than its exact bytes, so resizing, simple transformations, and even some cropping can be bypassed, because pHash reduces the image to a simpler form. For a more in-depth introduction please consult this great article.
Because pHash is quite good at identifying unique images, I ordered the images into a dictionary with the hashes as keys.
current metadata = { 'image_pHash': { 'path': 'path/to/file' } }
Reading the characters on the image — OCR
After reading the 3-channel images, I found that they need to be preprocessed for the algorithms to be even somewhat accurate (unsurprisingly, most OCR, that is Optical Character Recognition, is done on books, which have black characters on white backgrounds). The average meme is the opposite: white characters on a vivid background that can be bright and full of edges that confuse the OCR.
There are also images that can be read easily, so I added a variant that runs without any preprocessing at all.
Preprocessing steps for OCR
To get the most out of my OCR of choice, tesseract, I chose four ways to make the image more readable. Every pipeline starts by creating a grayscale image and applying adaptive thresholding with a kernel of 80: the image is scanned for local maxima, which yields a smoother grayscale.
- Simple Colour inversion
- Binarization with Otsu’s method
- Adaptive Gaussian thresholding + Otsu
- Stroke-width transformation
The rest of the preprocessing will be covered in part 2.