Neural Image Assessment (NIMA) for Image quality scores and Earth Mover’s Distance
This time something from my internship so far😃
During my 6-month Data Science internship at @1mg (one of India’s top online pharmacies), I have mostly been engaged with the digitization of the prescriptions received every day. At some point, we realized we needed a system that can detect whether a received Rx (prescription) is of good quality or not. By good quality we meant it shouldn’t be:
Blurred, incomplete images
Prescriptions with very bad handwriting
Junk images (random images like cats or dogs)
As real-world data can be very messy (unlike Kaggle competitions), creating such a system can help us out in two ways:
We can create different solutions for different quality of Rx received
Only the best quality data can be used for training purposes
This was the time when NIMA, i.e. Neural Image Assessment, came to the rescue.
It comes from Google and aims at giving scores to images based on their aesthetic and technical qualities.
Before moving on, let's first understand what is Transfer Learning.
Training a model from scratch can take a lot of time (sometimes days!!). Transfer learning can help us save that time.
But how?
If we already have a model that was trained on similar data, we can reuse that old model with its old weights in two ways:
- Keep the same structure but train on the new data. As we start from the old weights, we may just need to fine-tune them, which is far less computationally expensive: training that could have taken days might finish in hours!!!
- Add some layers on top of the base model and freeze the old weights so they are not trained. Only the weights of the new layer(s) will then be trained. Later, even the entire model can be retrained, which again amounts to fine-tuning.
As the data in both cases is similar, the base model’s weights should not change drastically; they only need fine-tuning. That is the whole idea behind transfer learning.
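To make the freezing idea concrete, here is a toy NumPy sketch (not NIMA’s actual code, and the sizes and layers are made up for illustration): a frozen "pretrained" base layer plays the role of a network like MobileNet, and only the new head layer gets a gradient update.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Pretrained" base: a frozen linear layer standing in for a real base model.
W_base = rng.normal(size=(4, 8))        # pretend these weights came from prior training
# New head: the only layer we train on our own data (10 score classes).
W_head = rng.normal(size=(8, 10)) * 0.01

# One toy training step with an MSE loss: update ONLY the head.
x = rng.normal(size=(2, 4))             # a fake mini-batch of features
target = rng.normal(size=(2, 10))       # fake regression targets
h = np.maximum(x @ W_base, 0)           # frozen feature extractor (ReLU)
logits = h @ W_head                     # trainable head

grad_head = h.T @ (logits - target) / len(x)  # gradient of MSE w.r.t. W_head only
W_base_before = W_base.copy()
W_head -= 0.1 * grad_head                     # fine-tune the head

assert np.array_equal(W_base, W_base_before)  # the base stayed frozen
```

In a real framework this is just a flag (e.g. marking base layers as non-trainable), but the effect is the same: gradients flow only into the new layers.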
Moving back to NIMA
NIMA uses a CNN pretrained on ImageNet (such as MobileNet or a ResNet) as its base model, and using transfer learning, it adds a fully connected layer for 10-class classification and trains on the rating data. In the end, a softmax activation gives the probability of each score (0–9).
And that's it!!!
For the final score, each score is multiplied by its predicted probability and the results are summed, i.e.
Prob_0*0 + Prob_1*1 + Prob_2*2 + … + Prob_9*9.
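That weighted sum is just the expected value of the score distribution. A minimal sketch, with a hypothetical softmax output made up for illustration:

```python
import numpy as np

# Hypothetical softmax output: probability of each score bucket 0..9
probs = np.array([0.01, 0.01, 0.02, 0.05, 0.10, 0.20, 0.25, 0.20, 0.10, 0.06])
scores = np.arange(10)

# Prob_0*0 + Prob_1*1 + ... + Prob_9*9 — the expected score
mean_score = float(np.sum(probs * scores))
print(round(mean_score, 2))  # → 5.84
```

Because the whole distribution is available, a spread (standard deviation) can be reported alongside the mean score as a measure of rating uncertainty.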
The loss function that has to be optimized here is called
Earth Mover’s Distance.
Heard it for the first time?? YESSSSSSSS
Let's consider an example. Take India and Switzerland🏂
If someone asks you what the distance between India and Switzerland is, the question might seem quite absurd, as India and Switzerland are two regions and not points. But it isn’t absurd at all.
Then can we calculate the distance between the two regions?
Using Earth Mover’s Distance!!
It is the minimum cost (distance, in our case) of moving all the mass of one distribution (India) onto another distribution (Switzerland).
In NIMA, the ground-truth score distribution is one distribution (like India) and the predicted score distribution is the other (Switzerland), and we need to calculate the minimum cost of transforming one into the other.
You can relate to this video or check here for more on Earth Mover’s Distance.
Why Earth Mover’s Distance?
The classes (0–9) are ordinal in nature. That is, if an image is labeled 9 in the training set and the model predicts 0 in one case and 8 in another, we know 8 is much closer to 9 than 0 is. The loss for predicting 8 should therefore be smaller than the loss for predicting 0. We need a loss that respects the order of the classes, and that’s why EMD is used.
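This ordinality argument is easy to check numerically. Below is a small sketch of the CDF-based EMD that the NIMA paper uses as its loss (with r = 2); the one-hot distributions are made up for illustration. Predicting 8 when the truth is 9 is penalised far less than predicting 0:

```python
import numpy as np

def emd(p, q, r=2):
    """CDF-based Earth Mover's Distance between two score distributions."""
    cdf_diff = np.cumsum(p) - np.cumsum(q)          # difference of cumulative sums
    return float(np.mean(np.abs(cdf_diff) ** r)) ** (1.0 / r)

# Ground truth: every rater gave the image a 9.
truth = np.zeros(10); truth[9] = 1.0
# Two one-hot predictions: score 8 vs score 0.
pred_8 = np.zeros(10); pred_8[8] = 1.0
pred_0 = np.zeros(10); pred_0[0] = 1.0

# Ordinality in action: the nearer prediction gets the smaller loss.
assert emd(pred_8, truth) < emd(pred_0, truth)
print(round(emd(pred_8, truth), 3), round(emd(pred_0, truth), 3))
```

A plain cross-entropy loss would have penalised both wrong predictions equally, which is exactly what EMD avoids.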
Complete mathematics can be explored above.
The major issue with NIMA is that it requires labeled data. Google used the Aesthetic Visual Analysis (AVA) dataset, where each image is labeled by averaging the scores of around 200 human raters!!!
Hence, if you want to train it for some specific kind of image (prescriptions in our case), you will first need labels (a score per image).
For code implementation, this repo can be very useful:
After testing NIMA, I found that:
- Aesthetic quality weighs more than the technical quality of the image in the predictions.
- The more color and variety (more objects, texture, etc.), the higher the score tends to be (the aesthetics of the image increase).
This is strictly my observation after testing on prescriptions. Things may change for other sorts of data.
As mentioned in Google’s blog, NIMA’s predicted scores were very close to the scores the human raters gave the images in the AVA dataset!!!
Summing it all up,
I feel NIMA can be a go-to model when it comes to image quality scores. The last fully connected layer can be changed from 0–9 to 0–99 or 0–5 or any range we want. These scores can be used for ranking, so we can pick the best of the lot, and much more can be done with them (clustering, filtering, etc.). But as everything comes with some problems, the same goes for NIMA. It may fail when we want to consider only technical or only aesthetic quality, not both. Also, on very specific data (like prescriptions), it may fail: a blurred image with a lot of colors, or an icon/symbol on a prescription, can lead to a high score it shouldn’t get. Though I still believe it is worth giving a shot!!