Image aesthetics quantification using OpenAI CLIP

Leveraging the Power of OpenAI CLIP for Quantifying Image Aesthetics in Python

Suresh R
12 min read · Oct 20, 2023
Source: DALL·E 3

Recently, I embarked on a project that involved image classification and ranking for a website I was developing. However, I encountered a challenge: the images ranked at the top by typical rule-based systems were not of the highest quality and had only mediocre aesthetics. Believing there was room for improvement, I turned to the internet for solutions. That’s when I stumbled upon this research paper, which discussed how OpenAI’s zero-shot classification model, CLIP, can effectively understand various aspects of image aesthetics such as colors, lighting, composition, framing, cropping, and more. Intrigued by the potential of this approach, I decided to implement the paper’s prompting methodology in hopes of achieving superior accuracy with relatively low effort.

The full code is available in this Google Colab notebook:

What is CLIP?

CLIP, an acronym for Contrastive Language-Image Pre-Training, is a neural network developed by OpenAI in January 2021. It’s a multimodal model that merges the concepts of Natural Language Processing and Computer Vision, enabling interaction with images through words by leveraging the knowledge of the English language and the semantic understanding of images.

Functionality of CLIP, Source: CLIP: Connecting text and images (openai.com)

One of CLIP’s key features is its ability to perform “Zero-shot” classification on images. In essence, it can classify images it has never encountered during its training phase by extrapolating the concepts of the images it learned during training. For instance, if it was trained on a dataset comprising three types of cats: Siamese, Persian, and Maine Coon, and we introduce an unknown breed like an orange tabby cat, CLIP can classify it as a cat even though it has never seen this breed in its dataset. This is a testament to CLIP’s remarkable capabilities.
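
To make this concrete, here is a minimal zero-shot classification sketch using the Hugging Face Transformers API (the image URL and candidate labels are placeholders for illustration; the required packages are installed in the Methodology section below):

from PIL import Image
import requests
from transformers import AutoProcessor, CLIPModel

processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical image URL; swap in any image you like
image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=1)  # similarity of the image to each label
print(dict(zip(labels, probs[0].tolist())))

Even though none of these labels come from a fixed training taxonomy, CLIP assigns the highest probability to the label whose text best matches the image.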

In a way, CLIP brings the flexibility of transformer models, akin to GPT-3.5, into the realm of Computer Vision.

Problem Statement

The goal of this project is to utilize OpenAI’s CLIP model to develop a method for objectively quantifying an image based on its aesthetics. We will be implementing this using Python in a Jupyter Notebook environment.

Methodology

!pip install numpy==1.24.4
!pip install pandas==2.0.3
!pip install Pillow==10.1.0
!pip install Requests==2.31.0
!pip install streamlit==1.27.2
!pip install torch
!pip install transformers

First, we will install all the necessary dependencies using pip:

  • Numpy: This library will be used for efficient numerical operations on arrays.
  • Pandas: We’ll use this for manipulating data in Excel-like tables.
  • Pillow: This will be our go-to library for image processing in Python.
  • Requests: This library will allow us to send HTTP requests using Python and handle the responses. We’ll use it to download images via URLs.
  • Streamlit: This will be used to create easy web apps for our machine learning and data science projects.
  • Torch and Transformers: These libraries will be used to interact with the CLIP model.

In the paper, there are 3 types of prompting methodologies:

  • Fixed prompting
  • Context-aware prompting
  • Ensembling

In fixed prompting, we utilise exactly two prompts: one for aesthetic images and one for unaesthetic images. Both prompts are formed using the string template “a [label] picture”, where [label] is either a positive or a negative word from our list of adjectives. Given these two prompts, we find the one more similar to the image using CLIP.

The authors describe fixed prompting as follows: Essentially, you would have two phrases — “an outstanding picture” as a positive prompt and “an atrocious picture” as a negative prompt. Then, you would compare the image you wish to score with each of these prompts and record the cosine similarity score for each. To compute the total score for an image:

Scoring methodology: aesthetic score = (+1) × cosine similarity(image, positive prompt) + (−1) × cosine similarity(image, negative prompt)

If an image is more similar to the positive prompt than to the negative prompt, its net score will be positive. The higher the score, the more aesthetically pleasing the image.
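
As a rough sketch, fixed prompting boils down to two cosine similarities and a subtraction. Here, embed_text is a hypothetical helper that returns a CLIP text embedding as a NumPy vector, and image_vec is the image’s CLIP embedding:

import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def fixed_prompt_score(image_vec, embed_text):
    positive = embed_text("an outstanding picture")
    negative = embed_text("an atrocious picture")
    # A positive result means the image sits closer to the positive prompt than the negative one
    return cosine(image_vec, positive) - cosine(image_vec, negative)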

Context-aware prompting

Fixed prompts do not account for the content of the image. Instead of the generic prompt “a beautiful picture”, we hypothesize that it is better to include the content of the image, e.g., “a beautiful picture of a dog”. This specification of the text prompt moves the encoded prompt vectors closer toward the image vector, thus reducing noise in similarity measurements.

By providing context to an image, we can achieve more accurate scores. In the original paper, the authors used the 1,000 class names of the ImageNet dataset to describe the image content. However, for our specific domain (in my case, hotels), we need to create our own list of class names relevant to that domain. This can be easily accomplished using ChatGPT. Simply use the following prompt, adjust the domain as needed, and copy and paste the output into an Excel file.

Can you make me a list of {number_of_classes_needed} class names, similar to the ImageNet class names, but which relate to the field of {domain}?

I have generated 1000 class names, all relating to hotels as that is my use case, and stored them in an Excel file called Hotel_Classes.xlsx with the column name ‘Col_Names’.
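
If you prefer to skip the manual copy-and-paste, a small sketch like the one below does the same thing with pandas. Here class_names is assumed to hold the class names copied from ChatGPT’s output (the entries shown are made up):

import pandas as pd

# Hypothetical class names; in practice this list holds the ~1000 entries from ChatGPT
class_names = ["hotel lobby", "rooftop pool", "king-size bed", "breakfast buffet"]

# Writing .xlsx files with pandas requires the openpyxl package
pd.DataFrame({"Col_Names": class_names}).to_excel("Hotel_Classes.xlsx", index=False)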

About Ensembling

Our ensembling approach is structurally similar to the context-aware prompts. However, we condense the 2,000 prompts down to two vectors by averaging all prompt vectors of each aesthetic label. The best labels are the same as for the context-aware prompts, but the results show that this method improves the performance while being computationally less expensive, since only two instead of 2,000 comparisons have to be done for each image.

Essentially, instead of performing 2000 cosine similarity computations, we average all the positive prompts into a single vector and all the negative prompts into another single vector. This allows us to only perform two cosine similarity computations, which significantly reduces the time it takes to generate a score.

Below is a table showcasing the results of all these methods on the Aesthetic Visual Analysis (AVA) dataset. This dataset comprises more than 250,000 images with annotations relating to the aesthetics of each image.

Results showcasing all the methods in the paper.

As you can see, ensembling gives better accuracy while reducing the number of comparisons that need to be done.

Code

from PIL import Image
import torch
from transformers import AutoProcessor, CLIPModel
import torch.nn as nn
import requests
from io import BytesIO
import os
import pickle
import numpy as np
import pandas as pd
device = torch.device('cuda' if torch.cuda.is_available() else "cpu")
processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)

This code downloads the necessary model weights and config files through the Transformers library. CLIP can also be used through Sentence-Transformers, so if you are looking for another way of using CLIP, that is an option too.
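
For reference, a rough sketch of the Sentence-Transformers route looks like this (assuming the sentence-transformers package is installed; "some_image.jpg" is a placeholder). The rest of this article sticks with the Transformers API:

from PIL import Image
from sentence_transformers import SentenceTransformer, util

# CLIP checkpoint packaged for Sentence-Transformers
clip = SentenceTransformer("clip-ViT-B-32")

img_emb = clip.encode(Image.open("some_image.jpg"))                      # image embedding
txt_emb = clip.encode(["an outstanding picture", "a horrible picture"])  # text embeddings
print(util.cos_sim(img_emb, txt_emb))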

def load_image_PIL(url_or_path):
    if url_or_path.startswith("http://") or url_or_path.startswith("https://"):
        return Image.open(requests.get(url_or_path, stream=True).raw)
    else:
        return Image.open(url_or_path)

def cosine_similarity(vec1, vec2):
    # Compute the dot product of vec1 and vec2
    dot_product = np.dot(vec1, vec2)

    # Compute the L2 norm of vec1 and vec2
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)

    # Compute the cosine similarity
    similarity = dot_product / (norm_vec1 * norm_vec2)

    return similarity

This code gives us some helper functions that will be used later on. load_image_PIL() converts either a local file path or an internet URL into a PIL Image object, and cosine_similarity() computes the cosine similarity between any two NumPy vectors.
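
A quick usage sketch (the URL is a placeholder):

# Hypothetical usage of the helpers above
img = load_image_PIL("https://example.com/hotel_lobby.jpg")  # works for local paths too, e.g. "lobby.jpg"
print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # ≈ 0.707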

Cosine similarity formula. Source: https://aitechtrend.com/how-cosine-similarity-can-improve-your-machine-learning-models/

temp = pd.read_excel(r"Hotel_Classes.xlsx")
classes = temp['Col_Names'].tolist()
classes = [s.lstrip() for s in classes]
positive_classes = []
negative_classes = []
for i in range(len(classes)):
    positive_classes.append(f"an outstanding picture, of a #{classes[i]}")
    negative_classes.append(f"a horrible picture, of a #{classes[i]}")

This code imports all the class names we generated through ChatGPT (stored in an Excel file) into Python through pandas. We then convert the data frame into a list of all the classes. After that, we create two lists called positive_classes and negative_classes, iterate over the classes list, and for every class name append “an outstanding/horrible picture, of a #{class_name}”. The reason we choose the adjectives “outstanding” for positive and “horrible” for negative is that this is what the researchers found worked best.

Our prompting results show that CLIP extracts features that can be used for the IAA task. These features correlate with text prompts describing the corresponding aesthetic value of the image. Overall, the labels “outstanding” and “horrible” are well-suited for this task. In an application, it might be possible to get acceptable results if only a pretrained CLIP model is available.

Also, the reason we use #class_name is that CLIP was trained by OpenAI on a large amount of internet data, a substantial part of which came from places like Twitter and Tumblr, where users post images usually accompanied by hashtags relating to those images, so CLIP learnt this behavior as well.

positive_inputs = processor(text=positive_classes, return_tensors="pt", padding=True).to(device)
with torch.no_grad():
    positive_text_features = model.get_text_features(**positive_inputs)

negative_inputs = processor(text=negative_classes, return_tensors="pt", padding=True).to(device)
with torch.no_grad():
    negative_text_features = model.get_text_features(**negative_inputs)

In this code, the positive prompts and the negative prompts are converted into their text features by passing them through the CLIP model.

positive_prompt_vectors = positive_text_features.cpu().numpy()
# Compute the average vector
average_positive_vector = np.mean(positive_prompt_vectors, axis=0)

negative_prompt_vectors = negative_text_features.cpu().numpy()
# Compute the average vector
average_negative_vector = np.mean(negative_prompt_vectors, axis=0)

Now we ensemble the 2000 vectors into 2 vectors for easier computation according to the research paper.

with open('positive_prompt.pkl', 'wb') as f:
    pickle.dump(average_positive_vector, f)
with open('negative_prompt.pkl', 'wb') as f:
    pickle.dump(average_negative_vector, f)

Now let's save these vectors into pickle files so that we can fetch them later when needed. That concludes the preprocessing stage. Simple, isn't it?

Code for runtime

with open('positive_prompt.pkl', 'rb') as f:
    average_positive_vector = pickle.load(f)
with open('negative_prompt.pkl', 'rb') as f:
    average_negative_vector = pickle.load(f)

def predict(img_url):
    image1 = load_image_PIL(img_url)
    with torch.no_grad():
        inputs1 = processor(images=image1, return_tensors="pt").to(device)
        image_features1 = model.get_image_features(**inputs1)
    image_vector = image_features1.cpu().numpy()
    positive_similarity = cosine_similarity(average_positive_vector, np.transpose(image_vector))
    negative_similarity = cosine_similarity(average_negative_vector, np.transpose(image_vector))
    aesthetic_score = (+1 * positive_similarity) + (-1 * negative_similarity)
    return aesthetic_score * 1000  # Multiplied by 1000 just to make the scores easier to compare

We first fetch our vectors from the pickle files. In the predict() function, we convert the image URL passed in into a PIL object using the function we defined earlier, then use CLIP to convert the image into image features. We convert these image features into NumPy arrays (as they are currently tensors) and apply the cosine_similarity() function we defined earlier to get a positive and a negative similarity. The final score is calculated by multiplying the positive similarity by +1 and the negative similarity by -1 and summing them. We then return this score to the user; this is the aesthetic score generated by the program.
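
For example (the URL is a placeholder), scoring an image is a one-liner once the pickle files exist:

# Hypothetical usage; substitute any image URL or local file path
score = predict("https://example.com/hotel_bedroom.jpg")
print("Aesthetic score:", float(score))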

import streamlit as st

st.header('Image Aesthetics Scorer')

uploaded_file = st.file_uploader("Choose an image...", type=['png', 'jpg', 'jpeg'])
picture_width = st.sidebar.slider('Picture Width', min_value=100, max_value=500)
if uploaded_file is not None:
    image = Image.open(uploaded_file)
    st.subheader('Input', divider='rainbow')
    st.image(image, caption='Uploaded Image', width=picture_width)

    # Call your function with the uploaded image
    results = predict(image)

    st.subheader('Results', divider='rainbow')
    # Display the results
    st.image(image, caption=results, width=picture_width)

Here is a little bit of Streamlit code which helps us build a demo of this program very easily. If you want to use it, make sure to change the beginning of the predict() function from:

def predict(img_url):
    image1 = load_image_PIL(img_url)

to

def predict(img):
    image1 = img
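
Putting both pieces together, the Streamlit-friendly version of predict() ends up looking roughly like this; only the first two lines differ from the earlier version:

def predict(img):
    image1 = img  # already a PIL Image from st.file_uploader
    with torch.no_grad():
        inputs1 = processor(images=image1, return_tensors="pt").to(device)
        image_features1 = model.get_image_features(**inputs1)
    image_vector = image_features1.cpu().numpy()
    positive_similarity = cosine_similarity(average_positive_vector, np.transpose(image_vector))
    negative_similarity = cosine_similarity(average_negative_vector, np.transpose(image_vector))
    aesthetic_score = (+1 * positive_similarity) + (-1 * negative_similarity)
    return aesthetic_score * 1000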

When you want to run this program, save the code as a .py file, say app.py. Open up your command prompt (or whichever terminal your Python environment uses) and navigate to the directory where app.py is stored. Run the preprocessing code once so that positive_prompt.pkl and negative_prompt.pkl are generated in the same directory as app.py.

Then you can finally run the following command:

streamlit run app.py

This will open the app in the browser of your choice. This is how it should look when it's fully functional:

There is a slider on the left which will affect the size of the output image. When you want to use the app, you just browse and choose a picture and the app will return a score.

Now let's finally do some aesthetics comparisons. Since my use case relates to hotels, I will showcase 3 instances where the program is useful.

2 Images of Hotel Bedrooms

As you can see, two pictures of bedrooms are displayed. The one on the left is from a 5-star hotel, while the one on the right is from a budget accommodation. Upon visual inspection, we can easily determine which one appears more aesthetically pleasing. But how would we quantify this? Let’s explore how the program quantifies aesthetics.

Bedroom aesthetic score comparison

As you can see, the model favors the 5-star bedroom, giving it a rating of approximately 35.5, compared to the budget accommodation, which received a score of only 19.89. This aligns with our own subjective assessment of the aesthetics.

Let's try out another set of scenarios:

2 images of Hotel Restrooms

Here are two images of restrooms from hotels at different price points. The image on the left is from a 5-star hotel, while the one on the right is from a 4-star hotel. From a subjective standpoint, I find the restroom in the 5-star hotel more appealing. The framing and composition appear superior, and the lighting seems to be better compared to the restroom in the 4-star hotel.

Now let’s see how the model rates the images.

Restroom aesthetic score comparison

It appears that the model aligns with my intuition regarding these images: the 5-star restroom got a score of 27.6, while the 4-star restroom got only 10.9.

For one final comparison, look at the images below.

2 images of hotel gyms

Here are two images of hotel gyms from establishments at different price points. The image on the left is from a 4-star hotel, while the one on the right is from a 5-star hotel.

Subjectively speaking, I find the gym in the 5-star hotel more appealing due to its superior framing and the natural lighting that enhances its ambiance. Therefore, I would personally prefer the 5-star gym over the 4-star one.

Let’s see if the model shares this preference.

Gym aesthetic score comparison

As observed, the model significantly favors the 5-star gym over the 4-star gym, with the former receiving a score of 25.46 and the latter a score of -2.28. A negative score indicates that the model does not find the image aesthetically pleasing.

Closing Remarks

We have successfully demonstrated how to leverage CLIP’s ability to extract image aesthetics and create a scoring system for comparing various images.

Although this method did not yield the highest accuracy reported in the research paper referenced earlier (0.756 through ensemble prompting compared to the maximum of 0.816 through fine-tuning CLIP), the ratio of effort to reward more than compensates for it. Even the authors themselves suggest that ensemble prompting would be suitable for an application if only a pretrained CLIP model is available.

Follow For More!

I try to implement a lot of theoretical concepts in the ML space, with an emphasis on practical and intuitive applications.

Thanks for reading this article! If you have any questions, I will be happy to answer them. Feel free to message me on my LinkedIn or my email for other queries.


Suresh R

Passionate about all things data science, machine learning and coffee ;) LinkedIn: www.linkedin.com/in/suresh-raghu