Multi-label classification for generating image descriptions

IvonaStefania
10 min read · Jul 28, 2023

High-level overview

Computer vision is a field of Artificial Intelligence that mimics the way humans identify and process images. One of its techniques is multi-label classification of images.

Multi-label classification can save time by automating manual processes. It has many use cases, from automating labelling in e-commerce or healthcare to assisting visually impaired people.

The scope of this project is to generate the description of an image, based on a multi-label classification model trained on a dataset of images and their descriptions.

Description of input data

In order to obtain the input dataset, I created a platform, https://description.pics, where contributors can upload images and set descriptions for them.

The images are saved into Amazon S3 and the descriptions are saved into a DynamoDB table. When the dataset is big enough, the images and the descriptions table are downloaded and used to train the model.

The collected images can be downloaded from https://drive.google.com/drive/folders/1LcARea0fQz4y9lul623hvzDLuXBrDcN0 (images folder).

Platform architecture

The architecture of the application consists of:

  • ReactJS Frontend deployed on AWS S3
  • Flask Backend deployed on AWS Lambda
  • DynamoDB on AWS
  • AWS S3 buckets (for storing images)
  • Jupyter Notebook (for training the model)

Strategy for solving the problem

The purpose of the model is to predict a description for an image, by returning the most relevant labels.

The collected descriptions of images would be split into tokens and each unique token (unique in the dataset) would be considered a label.

The model would then be created using the Keras and TensorFlow packages and trained with the collected images and their labels.

Data preprocessing

The descriptions of images are split into words, which are lemmatised and filtered to be only nouns and adjectives.

import re
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

def tokenize(text):
    lemmatizer = WordNetLemmatizer()

    # Lowercase and keep only letters and digits
    text = text.lower()
    text = re.sub("[^a-zA-Z0-9]", " ", text)

    # Split into words and lemmatise each of them
    words_list = word_tokenize(text)
    words_list = [lemmatizer.lemmatize(word) for word in words_list]
    tags = nltk.pos_tag(words_list)

    # Keep only nouns (NN*) and adjectives (JJ*)
    words_list = [tag[0] for tag in tags if tag[1].startswith('NN') or tag[1].startswith('JJ')]

    return words_list

After that, for each token in the dataset, a column is created. For each image, the token column will contain a 1 if the token was present in the image’s description and a 0 otherwise.
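
As a rough sketch (not the exact code from the project), these token columns could be built with pandas along the following lines, with all_tokens being the resulting vocabulary that is also needed later when predicting:

# Tokenize every description once
token_lists = df['Description'].apply(tokenize)

# The vocabulary: every unique token found in the dataset
all_tokens = sorted({token for tokens in token_lists for token in tokens})

# One 0/1 column per token: 1 if the token appears in the image's description
for token in all_tokens:
    df[token] = token_lists.apply(lambda tokens: int(token in tokens))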

The collected images are loaded and transformed into pixel arrays (with values from 0 to 255). The pixel values are divided by 255 in order to normalise them.

from tensorflow.keras.preprocessing import image

images_list = []
for i in range(0, df.shape[0]):
    # Load the image, convert it to a pixel array and normalise the values to [0, 1]
    img = image.load_img('./images/images/' + df['Id'][i])
    img = image.img_to_array(img)
    img = img / 255
    images_list.append(img)

The images uploaded by contributors may have different sizes. We have to make sure that the images used to train the model have the same size, so we have to resize them first.

size = getMinSize(images_list)
images_list_resized = [image.smart_resize(img, size) for img in images_list]
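
The getMinSize helper is not shown above; a plausible minimal sketch, assuming it just returns the smallest height and width found among the loaded images (so every picture is downscaled rather than upscaled), would be:

def getMinSize(images_list):
    # smart_resize expects the target size as (height, width)
    min_height = min(img.shape[0] for img in images_list)
    min_width = min(img.shape[1] for img in images_list)
    return (min_height, min_width)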

Data exploration

The dataset collected from description.pics contains only 150 images which have been given comprehensive descriptions. First of all, this is a small number for training any model.

Second of all, long descriptions mean a high number of tokens. After tokenizing all the descriptions in the dataset, 292 labels are found. Currently, there are more labels than actual images, so for now we don't expect good results from our model.

To understand the distribution better, I made a visualization of how many labels there are for each number of occurrences.

The maximum number of occurrences for a label is 10 and the occurrence counts are very imbalanced: more than 70% of the labels have only one occurrence, as can be seen in the bar chart below.
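
The counts behind that chart can be computed directly from the 0/1 token columns; a short sketch (assuming the all_tokens vocabulary built earlier) using pandas and matplotlib:

import matplotlib.pyplot as plt

# Total number of occurrences of each label across the dataset
label_counts = df[all_tokens].sum()

# How many labels have 1 occurrence, 2 occurrences, and so on
occurrences_distribution = label_counts.value_counts().sort_index()

occurrences_distribution.plot(kind='bar')
plt.xlabel('Number of occurrences')
plt.ylabel('Number of labels')
plt.show()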

Now let's talk about the matrices that we'll be using for training the model.

X is the matrix of features and consists of the pixel arrays of all images. Each image in X has the shape (IMAGE_HEIGHT, IMAGE_WIDTH, 3), where IMAGE_HEIGHT and IMAGE_WIDTH are the common height and width of the images in pixels. The third dimension is 3 because coloured (RGB) pictures are composed of separate red, green and blue channels.

Let’s take a look at X.

y is the output matrix and consists of the array of label occurrences (0s and 1s) for each image.

X = np.array(images_list_resized)
y = np.array(df.drop(['Id', 'Description', 'Name', 'UploadDate'], axis=1))

The two variables need to be split into train and test subsets.

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2)
X_train = np.asarray(X_train).astype(np.float32)
X_test = np.asarray(X_test).astype(np.float32)
y_train = np.asarray(y_train).astype(np.float32)
y_test = np.asarray(y_test).astype(np.float32)

Modeling

The model was created using Keras and uses 2D convolution layers.

model = Sequential()
model.add(Conv2D(filters=16, kernel_size=(5, 5), activation="relu", input_shape=images_list_resized[0].shape))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Conv2D(filters=32, kernel_size=(5, 5), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Conv2D(filters=64, kernel_size=(5, 5), activation="relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Conv2D(filters=64, kernel_size=(5, 5), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(y_train.shape[1], activation='sigmoid'))

model.summary()

Now let's talk about the layers in this model.

Sequential: groups a linear stack of layers

Conv2D: the first convolution layer takes an input feature array of size (367, 400, 3) and computes 16 filters over it.

  • filters: specifies the number of filters to be used in the convolution. Typically, filters encode specific aspects of the input data.
  • kernel_size: specifies the height and width of the 2D convolution window. In our case, we apply windows of (5, 5).
  • activation: a function that is applied after each convolution operation. In our case, we apply a relu (Rectified Linear Unit) function, F(x) = max(0, x).

MaxPooling2D: extracts the max value of each window (in our case 2x2 windows). Max pooling reduces the computational cost by reducing the number of parameters.
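
To make the pooling operation concrete, here is a tiny standalone example (not part of the project code) that max-pools a 4x4 array with 2x2 windows:

import numpy as np
import tensorflow as tf

x = np.array([[1, 3, 2, 1],
              [4, 2, 1, 5],
              [3, 1, 6, 2],
              [2, 2, 1, 4]], dtype=np.float32)

# MaxPooling2D expects a batch with a channel dimension: (batch, height, width, channels)
x = x.reshape(1, 4, 4, 1)

pooled = tf.keras.layers.MaxPooling2D(pool_size=(2, 2))(x)
print(pooled.numpy().reshape(2, 2))
# [[4. 5.]
#  [3. 6.]]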

Flatten: After applying all the convolution layers, we end up with an output of shape (19, 21, 64), which has to be fed to the Dense layer; however, the Dense layer processes 1D arrays. The Flatten layer transforms the 3D array into a 1D array.

Dense: Finally, we apply the sigmoid activation function and return 292 outputs, one probability per label.

When creating the model and adding the layers, complications can occur if the developer does not pay attention to the sizes of the matrices. input_shape must be equal to the size of the images in pixels, plus the third dimension corresponding to the RGB channels. All images MUST have the same size, otherwise an error will be thrown. The final Dense layer must output a 1D array with a size equal to the number of labels.

After adding the layers, the model is compiled using the adam optimizer and the binary cross-entropy loss.

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
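
Binary cross-entropy fits the multi-label setup because each of the 292 outputs is treated as an independent yes/no prediction. A small standalone illustration of how the loss is computed for a single image (not project code):

import numpy as np

y_true = np.array([1, 0, 0, 1], dtype=np.float32)          # which labels apply to the image
y_pred = np.array([0.9, 0.2, 0.1, 0.6], dtype=np.float32)  # sigmoid outputs of the model

# Each label contributes its own binary cross-entropy term; the loss is their mean
bce = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
print(bce)  # ~0.236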

The last step is training the model on the training data, for 10 epochs with a batch_size of 64.

history = model.fit(X_train, y_train, epochs=10, validation_data=(X_test, y_test), batch_size=64)

Predicting a description for an image

After the model is trained, it is saved into a file and uploaded on the server, so that the users of https://description.pics can use it for predictions.

The uploaded image is transformed into a pixel array, resized and passed to the predict function, so that the top labels can be returned.

# The model expects a batch, so add a batch dimension to the single resized image
pred = model.predict(np.expand_dims(resized_image, axis=0))
labels, values = getTopLabels(pred, all_tokens)
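
The getTopLabels helper is not listed either; a plausible sketch, assuming it simply sorts the predicted probabilities and returns the best few labels together with their scores (the top_n parameter is hypothetical):

import numpy as np

def getTopLabels(pred, all_tokens, top_n=5):
    # pred has shape (1, number_of_labels); take the probabilities of the single image
    probabilities = pred[0]
    # Indices of the top_n highest probabilities, highest first
    top_indices = np.argsort(probabilities)[::-1][:top_n]
    labels = [all_tokens[i] for i in top_indices]
    values = [float(probabilities[i]) for i in top_indices]
    return labels, values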

Metrics with justification

Accuracy is one of the most popular metrics for measuring how correct the predictions of a model are. It is very intuitive and easy to understand and is a common choice for these types of problems.

Unfortunately, because this platform is at its beginnings, the dataset is very small (approximately 150 images so far) and highly imbalanced (each image has a different number of labels and the number of labels is very large compared to the number of images), so the accuracy is very close to 0, as can be seen by running the evaluate function, which returns the loss and the accuracy:

model.evaluate(X_test, y_test, verbose=0)

[0.16804741322994232, 0.03703703731298447]

We can also see this in the plots below.

import matplotlib.pyplot as plt

acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

loss = history.history['loss']
val_loss = history.history['val_loss']

epochs_range = range(10)

plt.figure(figsize=(8, 8))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()

The model will be fitted again once the dataset becomes larger and more consistent.

Hyperparameter Tuning

In order to find the best parameters for the model, a grid search needs to be performed. I focused on the number of epochs and the batch_size, as in the following snippet (a sketch of the create_model helper follows the snippet):

from scikeras.wrappers import KerasClassifier
from sklearn.model_selection import GridSearchCV

# create model
model = KerasClassifier(model=create_model, verbose=2)
# define the grid search parameters
batch_size = [10, 20, 40]
epochs = [10, 50]
param_grid = dict(batch_size=batch_size, epochs=epochs)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)
grid_result = grid.fit(X_train, y_train)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Obviously, good results couldn't be expected given that the dataset is too small, but it looks like the best result (about 1% accuracy) was found for 50 epochs and a batch_size of 10.

Best: 0.010417 using {'batch_size': 10, 'epochs': 50}
0.000000 (0.000000) with: {'batch_size': 10, 'epochs': 10}
0.010417 (0.014731) with: {'batch_size': 10, 'epochs': 50}
0.000000 (0.000000) with: {'batch_size': 20, 'epochs': 10}
0.000000 (0.000000) with: {'batch_size': 20, 'epochs': 50}
0.000000 (0.000000) with: {'batch_size': 40, 'epochs': 10}
0.000000 (0.000000) with: {'batch_size': 40, 'epochs': 50}

Training the model on a different dataset — comparison

Until the community of description.pics grows and the dataset becomes big enough, I wanted to use a more consistent dataset for training and testing the model: https://www.kaggle.com/datasets/raman77768/movie-classifier. This dataset contains 7867 pictures and 25 genres.

The following visualization shows the number of occurrences for each genre (label).

Clearly this dataset is also pretty unbalanced, because of the 'Drama' and 'Comedy' labels, which have a much higher number of occurrences than the other genres.

The dataset contains posters of movies, together with the corresponding list of genres. By loading the train.csv file into a data frame, we can see that the labels are already transformed into columns containing 1s and 0s.
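
For completeness, here is a rough sketch of how that data frame could be turned into the same X and y matrices as before. The column names (Id, Genre), the Images folder and the <Id>.jpg file naming are assumptions about this dataset's layout rather than verified details, and getMinSize is the helper sketched earlier:

import numpy as np
import pandas as pd
from tensorflow.keras.preprocessing import image

df_movies = pd.read_csv('./train.csv')

# Load each poster, normalise the pixels and resize, exactly as before
movie_images = []
for i in range(df_movies.shape[0]):
    img = image.load_img('./Images/' + str(df_movies['Id'][i]) + '.jpg')
    img = image.img_to_array(img) / 255
    movie_images.append(img)

size = getMinSize(movie_images)
movie_images = [image.smart_resize(img, size) for img in movie_images]

X = np.array(movie_images)
# Everything except the identifier columns is a genre label
y = np.array(df_movies.drop(['Id', 'Genre'], axis=1))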

After training the same model on this dataset, the accuracy seems better, but it is still pretty low, due to the relatively small size and the imbalance of the dataset.

model.evaluate(X_test, y_test, verbose=0)

[0.37161681056022644, 0.26671260595321655]

Let's perform the same hyperparameter tuning on this dataset. As for the previous dataset, we will tune the epochs and the batch_size.

# create model
model = KerasClassifier(model=create_model, verbose=2)
# define the grid search parameters
batch_size = [10, 20, 40]
epochs = [10, 50]
param_grid = dict(batch_size=batch_size, epochs=epochs)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)
grid_result = grid.fit(X_train, y_train)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.074790 using {'batch_size': 40, 'epochs': 10}
0.061694 (0.008971) with: {'batch_size': 10, 'epochs': 10}
0.061520 (0.005142) with: {'batch_size': 10, 'epochs': 50}
0.066862 (0.007241) with: {'batch_size': 20, 'epochs': 10}
0.065139 (0.004026) with: {'batch_size': 20, 'epochs': 50}
0.074790 (0.004530) with: {'batch_size': 40, 'epochs': 10}
0.061865 (0.003444) with: {'batch_size': 40, 'epochs': 50}

It turns out that training the model on a more relevant dataset with fewer labels increases the accuracy.

Conclusion and improvements

In conclusion, we saw that building a good convolutional model using Keras and TensorFlow depends very much on the training data. Having the same pattern of layers and using the same activation functions may lead to different results on different datasets.

Here I compared the accuracy obtained by training the same model on the small dataset of 150 images collected from description.pics and on the movie posters dataset. This suggests that having the labels more evenly distributed throughout the dataset improves the accuracy of the model.

I believe that this approach could have important applications in automating some processes, like object detection in self-driving cars or product labelling in e-commerce.

The real difficulty of this project is obtaining a balanced and consistent dataset. In the future, after increasing the number of contributors on https://description.pics, the model will be trained again and the accuracy should increase.
