Data Preparation Hack to Deal with Duplicate and Unorganized Images

Sanjay G
Published in Subex AI Labs
4 min read · Jan 11, 2022

Data preparation is considered one of the most important and most time-consuming steps in building an ML/DL model. This article shows an efficient way to solve one common image data preparation challenge.

Problem:

Let’s say you’re building a deep learning model that needs to be trained on face images for some application, and you specifically want a single face image per person. Unfortunately, the data you’ve been given has these problems:

  1. It contains many duplicate face images of the same person.
  2. The duplicates are not grouped together; they are scattered randomly through the data.
  3. The image file names follow no naming convention that would help you identify unique individuals.

There are a few ways to end up in this situation: the data may have been collected generically rather than for a specific use case, or the images may have been sampled from a video in which the same individuals appear many times, at random points.

What do you do in such cases? Manually inspecting and removing duplicate images is naive and tedious, and it becomes nearly impossible when the number of images is huge.

Let’s see how we can use deep learning to solve this problem!

Solution Approach:

  1. Use a pre-trained image classification model such as VGG-16 to extract image feature vectors for all the images.
  2. Apply a clustering algorithm to these extracted image feature vectors. This groups the feature vectors of near-identical or similar images into the same cluster.
  3. Each cluster now ideally represents a unique individual, so picking one image per cluster gets the job done.

This works because images that are similar, or even identical, produce feature vectors that lie close together in the feature space. You can also think of these feature vectors as image embeddings.
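To make the idea concrete, here is a small sketch with made-up vectors. It does not use VGG-16 at all: the random vectors, their size (512) and the noise level are assumptions chosen only to illustrate that a near-duplicate stays much closer to its original than to an unrelated vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for feature vectors: two "people", each a random
# point in a 512-dimensional space, plus a slightly perturbed copy of
# person A that plays the role of a near-duplicate image.
person_a = rng.normal(size=512)
person_b = rng.normal(size=512)
person_a_duplicate = person_a + rng.normal(scale=0.05, size=512)

def euclidean(u, v):
    """Euclidean distance between two vectors."""
    return float(np.linalg.norm(u - v))

dist_duplicate = euclidean(person_a, person_a_duplicate)
dist_other = euclidean(person_a, person_b)

# The near-duplicate is far closer to the original than the other person is.
print(dist_duplicate, dist_other)
```

A clustering algorithm exploits exactly this gap between small within-person distances and large between-person distances.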

Let’s look at the above steps in detail with code. The code is kept generic so that you can conveniently use it as per your needs.

1. Feature extraction

Pre-trained image classification models such as VGG-16 can be easily accessed through Keras. Here’s the code to get the model ready:

import numpy as np
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.preprocessing import image
from PIL import ImageFile

# Let PIL load truncated image files instead of raising an error
ImageFile.LOAD_TRUNCATED_IMAGES = True

model = VGG16(weights='imagenet', include_top=False)

In the last line, we pass the argument ‘include_top’ with the value ‘False’. This excludes the fully connected layers at the top of the network, including the final classification layer. We don’t need them, since we’re only interested in the features extracted by the earlier convolutional layers; the top layers exist only to classify the image based on the features they receive.

Assume that the variable ‘all_image_paths’ is a list containing the full paths of all the images for which you want to extract features. We iterate through this list to load and preprocess each image, extract its features, and store them in another list called ‘feature_list’.
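If you don't have such a list yet, one way to build it is to walk a directory tree. This is only a sketch: the helper name collect_image_paths, the set of extensions and the example directory are assumptions, not part of the original code.

```python
from pathlib import Path

# Extensions treated as images here (an assumption; adjust to match your data).
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp"}

def collect_image_paths(root_dir):
    """Return a sorted list of image file paths found under root_dir."""
    root = Path(root_dir)
    return sorted(
        str(p) for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )

# Example usage with a hypothetical directory:
# all_image_paths = collect_image_paths("path/to/images")
```

Sorting keeps the order of the paths stable between runs, which matters later when cluster labels are matched back to images by position. With the list in place, the extraction loop can run over it: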

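feature_list = []
for image_path in all_image_paths:
    # Load the image and resize it to the target size
    img = image.load_img(image_path, target_size=(128, 128))
    # Convert to an array and add a batch dimension
    img_data = image.img_to_array(img)
    img_data = np.expand_dims(img_data, axis=0)
    # Apply the VGG-16 input preprocessing
    img_data = preprocess_input(img_data)
    # Extract the features and flatten them into a 1-D vector
    features = np.array(model.predict(img_data))
    feature_list.append(features.flatten())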

Note the ‘target_size’ argument passed to ‘image.load_img’. Here it is (128, 128), but you can change it to suit your requirements; every image is resized to this size when it is loaded, before any further processing. The ‘model.predict()’ call returns the image features, which we then flatten and append to ‘feature_list’.
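The choice of target size also determines how long each flattened feature vector is. The sketch below works this out from the standard VGG-16 architecture (five 2x2 max-pooling layers and 512 channels in the last convolutional block); the helper name vgg16_feature_length is made up for illustration and is not a Keras function.

```python
def vgg16_feature_length(height, width):
    """Length of the flattened VGG-16 feature vector when include_top=False."""
    # Each of the five 2x2 max-pooling layers halves the spatial size
    # (rounding down); the last convolutional block outputs 512 channels.
    for _ in range(5):
        height //= 2
        width //= 2
    return height * width * 512

print(vgg16_feature_length(128, 128))  # 8192
print(vgg16_feature_length(224, 224))  # 25088
```

So with a target size of (128, 128), each image ends up as a vector of 8192 values, and larger target sizes grow the vector quickly.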

2. Clustering

Let’s use K-means clustering on the feature vectors of all the images. The value of ‘K’, the number of clusters, should be the number of unique individuals, if you already know it. If you don’t, use the elbow method to estimate a suitable value of K and take that as the number of unique individuals. After that, fit the K-means model again with this value of K.
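As a rough sketch of the elbow method, you can fit K-means for a range of K values and look at how the inertia (the within-cluster sum of squared distances) falls. The data below is synthetic and stands in for ‘feature_list’; the helper name inertia_curve and the range of K values are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def inertia_curve(features, k_values, random_state=0):
    """Fit K-means for each K and return the list of inertia values."""
    inertias = []
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(features)
        inertias.append(km.inertia_)
    return inertias

# Synthetic stand-in for feature_list: 3 well-separated groups of 20 vectors.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
demo_features = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

k_values = list(range(1, 7))
inertias = inertia_curve(demo_features, k_values)
for k, inertia in zip(k_values, inertias):
    print(k, round(inertia, 1))
```

On this toy data the inertia drops sharply up to K = 3 and only slightly afterwards; that bend is the "elbow". Real image features are rarely this clean, so the elbow may be much less obvious.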

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=100, random_state=0).fit(np.array(feature_list))
labels = kmeans.labels_

In the above code, the variable ‘labels’ is an array that contains the cluster ID for each image feature vector. This array preserves the order of the input data, so the labels correspond, in the same order, to the image paths in the ‘all_image_paths’ list defined earlier.

With this information, you can either pick a single image for each cluster ID or store the images of each cluster in its own directory. Either way, the goal of separating the images of unique individuals is achieved!
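Here is a minimal sketch of the first option, using only the standard library. The paths and labels below are made up and stand in for ‘all_image_paths’ and ‘labels’; the helper names are hypothetical, and taking the first image of each cluster is just one simple selection rule.

```python
from collections import defaultdict

def group_paths_by_cluster(image_paths, labels):
    """Map each cluster ID to the list of image paths assigned to it."""
    clusters = defaultdict(list)
    for path, label in zip(image_paths, labels):
        clusters[int(label)].append(path)
    return dict(clusters)

def pick_one_per_cluster(clusters):
    """Keep the first image path listed for every cluster."""
    return {label: paths[0] for label, paths in clusters.items()}

# Hypothetical paths and labels, standing in for all_image_paths and labels.
demo_paths = ["img_07.jpg", "img_12.jpg", "img_31.jpg", "img_44.jpg"]
demo_labels = [1, 0, 1, 0]

clusters = group_paths_by_cluster(demo_paths, demo_labels)
representatives = pick_one_per_cluster(clusters)
print(representatives)  # {1: 'img_07.jpg', 0: 'img_12.jpg'}
```

The grouped dictionary can equally be used for the second option, for example by copying each cluster's files into a directory named after its cluster ID.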

What are some other variations you can try?

  1. You could try extracting features from networks designed to perform well on faces.
  2. If the resulting feature vectors are very high-dimensional, or the number of images is huge, use PCA to reduce the dimensions and cluster these low-dimensional vectors instead.
  3. If the elbow method doesn’t give you a clear value of K, use another clustering method such as DBSCAN, which doesn’t require any prior knowledge of the number of clusters.
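Variations 2 and 3 can be combined, as in the rough sketch below. It runs on synthetic data standing in for ‘feature_list’, and the number of components, ‘eps’ and ‘min_samples’ are illustrative values tuned to this toy data only; on real features they would need to be chosen by experiment.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA

# Synthetic stand-in for feature_list: 3 groups of 15 vectors in 50 dimensions.
rng = np.random.default_rng(0)
centers = rng.normal(scale=10.0, size=(3, 50))
demo_features = np.vstack([c + rng.normal(scale=0.1, size=(15, 50)) for c in centers])

# Project onto the first 10 principal components before clustering.
reduced = PCA(n_components=10, random_state=0).fit_transform(demo_features)

# DBSCAN finds the number of clusters itself; eps and min_samples are
# assumptions that suit this toy data.
dbscan_labels = DBSCAN(eps=3.0, min_samples=3).fit_predict(reduced)

# The label -1 marks points that DBSCAN treats as noise.
n_clusters = len(set(dbscan_labels) - {-1})
print(n_clusters)
```

Note that DBSCAN can leave some images unassigned (label -1), so you need to decide whether to treat those as individuals in their own right or inspect them separately.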

As you might have realized by now, this technique applies not only to face images but to any image class in general.

I hope you found this image data preparation method useful. Do share the article if you liked it. Cheers!
