Low resolution image classification challenge

Yussi Eikelman · Published in Analytics Vidhya · Oct 3, 2019

The main idea behind this article is to examine the possibility of low resolution image classification.

A lot of technical detail is skipped here to keep the focus on what we see as the more interesting and valuable flow of things. All of the technical details can be found via the link to the code in the “Results & code” section at the end.

If you’re interested in a little storytelling, continue from here; otherwise, jump to the “Solution” section.

Introduction

Imagine a huge set of low resolution images of celebrities derived from video sequences, which you can barely differentiate from a cherry pie.

And you might think there’s a way to make it easier, by using a simple (or not so simple) convolutional network to do the job. After a minute you might think: “I know that convolutional neural networks can recognize humans very accurately, but we’re still talking about cases where one can clearly identify what’s in the image in the first place.”

So you start thinking about adding more data (features) to the picture. In this hackathon such data is provided extensively: an accurate physical pose estimation of the person in each image.

Now, what would you try out? I thought about a combination of an RNN and a convolution-based architecture: the CNN recognizes the picture and passes its label to the RNN, which in turn is trained to predict the next poses (moves) given that label (because we’re dealing with video sequences; the inspiration for this RNN method is taken from here).

For example: if we have a video sequence of Chuck Norris in a movie and his pose estimation in every frame, we could make our RNN model learn what his next moves would be. The CNN gives the label “Chuck” for the first frame to the RNN, which then produces its estimate of his next move in the next frame. If these predicted moves correlate well enough with the next (visible) Chuck Norris uppercut, then we’re good (or not). Or something like that.
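As a rough sketch of what such a label-conditioned pose RNN could look like (we never actually built it), here is a toy Keras model. The keypoint format, sequence length, and layer sizes are all assumptions made for illustration, not part of the hackathon code.

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_KEYPOINTS = 17            # assumed pose format: 17 (x, y) joints per frame
POSE_DIM = NUM_KEYPOINTS * 2
NUM_CLASSES = 100             # one label per celebrity
SEQ_LEN = 23                  # assumed window of past frames used to predict the next one

# Inputs: a window of past poses plus the identity label proposed by the CNN.
pose_seq = keras.Input(shape=(SEQ_LEN, POSE_DIM), name="pose_sequence")
label = keras.Input(shape=(NUM_CLASSES,), name="cnn_label")  # one-hot or softmax output

# Condition the recurrent state on the label by appending it to every time step.
label_tiled = layers.RepeatVector(SEQ_LEN)(label)
x = layers.Concatenate()([pose_seq, label_tiled])
x = layers.LSTM(128)(x)
next_pose = layers.Dense(POSE_DIM, name="predicted_next_pose")(x)

pose_rnn = keras.Model([pose_seq, label], next_pose)
pose_rnn.compile(optimizer="adam", loss="mse")  # regress the joints of the next frame
```

The idea would then be to compare the predicted next pose with the observed one and treat their agreement as extra evidence for (or against) the CNN’s label.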

There was too much risk and not much time at our disposal, because this was all part of a hackathon challenge. We weren’t sure we could rely on combining the pose estimation with a CNN, or how much one person’s movements would differ from another’s, so we decided to look elsewhere for additional features (although we may try this approach in the future).

Frankly, I know a thing or two about GANs; the only questions were their computational load and their reliability for this case, so that’s where we went.

Solution

The beauty of GANs is in the idea: build two models, a Generator and a Discriminator, let them compete, and keep the best results produced by the Generator. This idea has proven itself a good counterfeiter of various creations, especially visual ones.

We can still identify that these are two Mona Lisas.
It doesn’t matter who created the picture in order to understand who’s in it.
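To make the adversarial idea concrete, here is a minimal toy sketch in Keras. The two architectures below are placeholders chosen only for illustration; they are not the SRGAN we describe next.

```python
from tensorflow import keras
from tensorflow.keras import layers

LATENT_DIM = 100

# Toy generator: random noise -> fake 64x64 RGB image.
generator = keras.Sequential([
    layers.Dense(8 * 8 * 64, activation="relu", input_shape=(LATENT_DIM,)),
    layers.Reshape((8, 8, 64)),
    layers.Conv2DTranspose(64, 4, strides=2, padding="same", activation="relu"),
    layers.Conv2DTranspose(32, 4, strides=2, padding="same", activation="relu"),
    layers.Conv2DTranspose(3, 4, strides=2, padding="same", activation="sigmoid"),
])

# Toy discriminator: image -> probability that it is real.
discriminator = keras.Sequential([
    layers.Conv2D(32, 4, strides=2, padding="same", activation="relu",
                  input_shape=(64, 64, 3)),
    layers.Conv2D(64, 4, strides=2, padding="same", activation="relu"),
    layers.Flatten(),
    layers.Dense(1, activation="sigmoid"),
])
discriminator.compile(optimizer="adam", loss="binary_crossentropy")

# Combined model: freeze the discriminator and train the generator to fool it.
discriminator.trainable = False
z = keras.Input(shape=(LATENT_DIM,))
gan = keras.Model(z, discriminator(generator(z)))
gan.compile(optimizer="adam", loss="binary_crossentropy")
```

Training then alternates between updating the discriminator on real-versus-generated batches and updating the generator through the frozen discriminator.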

After brief research we found a reliable source for a super-resolution GAN (SRGAN) architecture that increases the resolution of an image by adding predicted features to it. You can read about this architecture in this arXiv paper.

After the image takes its new shape, classification arrives in the form of a CNN.

The beauty of deep convolutional neural networks is in their ability to recognize features that in turn can lead to better prediction results.

Convolutional neural network visualization.
The original image.
The image enhanced by SRGAN.
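A visualization like the one above can be reproduced with a few lines of Keras, assuming you already have a trained classifier at hand. The function and the layer name below are purely illustrative, not part of our actual code.

```python
import numpy as np
from tensorflow import keras

def feature_maps(cnn: keras.Model, image: np.ndarray, layer_name: str) -> np.ndarray:
    """Return one convolutional layer's activations for a single image."""
    # Build a probe model that stops at the requested layer;
    # pick `layer_name` from cnn.summary() of your own trained model.
    probe = keras.Model(inputs=cnn.input, outputs=cnn.get_layer(layer_name).output)
    return probe.predict(image[np.newaxis, ...])  # shape: (1, H, W, channels)
```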

Therefore the intuition:

If there is a specific set of features that helps identify the person in the image, it doesn’t matter whether the image is real or fake.

Can GANs generate this set of features? We decided to find out.

The hypothesis is:

SRGAN + CNN = better low resolution (now high) image classification.

Data & Preprocessing

The overall data set is ~500,000 images of shape (64, 64, 3), divided unequally among the videos and sequences of 100 celebrities.

Because the number of frames per sequence varies between persons, as does the number of videos per person, we took the minimum over persons of the maximum number of frames per person.

From the overall celebrity population we found:

min(max(num_frames per sequence)) => 24.

min(max(num_sequences per video)) => 2.

min(max(num_videos per person)) => 2.

24 frames × 2 sequences × 2 videos = 96 samples per person, 9,600 overall for the train set.

We did this to be sure that, no matter what, we would always be able to generate a random, equally distributed data set and help prevent overfitting (a rough sampling sketch follows the test-set numbers below).

The test set is composed of:

12 frames per sequence, 2 sequences per video, 2 videos:

12 frames × 2 sequences × 2 videos = 48 samples per person, 4,800 overall.
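As a rough illustration of the sampling rule above, here is a sketch in plain Python. The nested-dictionary layout and the names in it are assumptions made for this example; the hackathon data did not necessarily come in this form.

```python
import random

# `frames_by_person` is an assumed nested mapping, used only for illustration:
#   {person_id: {video_id: {sequence_id: [frame_path, ...]}}}
def sample_split(frames_by_person, n_videos=2, n_seqs=2, n_frames=24, seed=0):
    """Draw the same number of frames for every person: n_videos * n_seqs * n_frames."""
    rng = random.Random(seed)
    split = {}
    for person, videos in frames_by_person.items():
        chosen = []
        for video_id in rng.sample(sorted(videos), n_videos):
            sequences = videos[video_id]
            for seq_id in rng.sample(sorted(sequences), n_seqs):
                chosen += rng.sample(sequences[seq_id], n_frames)
        split[person] = chosen  # e.g. 2 * 2 * 24 = 96 frames per person for the train split
    return split

# train_split = sample_split(data, n_frames=24)  # 96 frames per person
# test_split  = sample_split(data, n_frames=12)  # 48 frames per person (drawn from held-out frames)
```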

Models

The input to the SRGAN is a (64, 64, 3) image, and the output is a (256, 256, 3) image, as seen in the “Solution” section.

The newly produced and sorted image sets are fed to the deep CNN for training, validation, and testing.
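Chained together at inference time, the two stages could look roughly like the snippet below. The file names and load_model calls are placeholders; the actual trained models and training code are in the repository linked in the next section.

```python
import numpy as np
from tensorflow import keras

# Placeholder file names, used only to illustrate the two-stage pipeline.
srgan_generator = keras.models.load_model("srgan_generator.h5")  # (64, 64, 3) -> (256, 256, 3)
classifier = keras.models.load_model("celebrity_cnn.h5")         # (256, 256, 3) -> 100 classes

def classify_low_res(batch_64: np.ndarray) -> np.ndarray:
    """Upscale a batch of (64, 64, 3) frames with the SRGAN, then classify them."""
    batch_256 = srgan_generator.predict(batch_64)  # GAN-enhanced 256x256 images
    probs = classifier.predict(batch_256)          # per-celebrity probabilities
    return probs.argmax(axis=1)                    # predicted identity per frame
```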

Results & code

All of the code and details are available here.

We achieved over 80 percent accuracy and a loss below 0.15 on the test data, so we managed to reach our goal of low resolution image classification. This came at the cost of the time spent enhancing the images for our CNN model, but in the end the model needed less time to train and still produced these results.

This is joint work with my colleague Inbal Weiss on dealing with low resolution images for classification, done as part of a hackathon challenge.

We are still working on this project and would be more than happy to receive any valuable insights into our work.
