Introduction to Exploratory Data Analysis for Image & Text-Based Data

Mansi Choudhary
Published in Analytics Vidhya
11 min read, Apr 9, 2021

In this post we are going to do a simple but thorough EDA for Shopee - Price Match Guarantee, one of the competitions currently running on Kaggle. Exploratory data analysis is the first step towards solving any data science or machine learning problem. It is the simplest way of getting familiar with the data at your disposal. The better you know the data, the closer you get to solving the problem.

So, let's get our hands dirty and enter the world of exploratory data analysis. This problem is a mix of text data and image data, so it is going to be fun exploring it.

Step1: Understanding the Problem Statement:

Before you begin to solve the problem, you need to understand the problem statement well, so let us begin with that. You can get the complete problem statement and data description here:

We have a set of images and their product category. Based on the information available, we need to find the products that are similar to a given product.

For Example:

For a given product A we need to find the 50 other images that are most similar to A. To do this we have some information about each product. We will look into this information later in this post.

Step2: Analyzing the Dataframes:

Train Data:

First of all, we will read the dataframes available, analyze their shape and size and the kind of data available.

reading data into pandas dataframe
printing shape of dataframe
printing first few rows
information about each column in dataframe
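The code for these four steps is not preserved in the text, so here is a minimal sketch of what they could look like. It assumes the columns are posting_id, image, image_phash, title and label_group, and it reads a tiny made-up CSV from memory so that it runs anywhere; with the real data you would point read_csv at the competition's train file (the exact path is an assumption).

```python
import io
import pandas as pd

# Tiny stand-in for the train file; every value here is made up.
SAMPLE_CSV = """posting_id,image,image_phash,title,label_group
train_1,a.jpg,94974f937d4c2433,Paper Bag Victoria Secret,1
train_2,b.jpg,af3f9460c2838f0f,Double Tape 3M VHB 12 mm,2
train_3,c.jpg,94974f937d4c2433,PAPER BAG VICTORIA SECRET,1
"""

# With the real data this would be something like pd.read_csv("train.csv").
train = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(train.shape)   # (number of rows, number of columns)
print(train.head())  # first few rows
train.info()         # dtype and non-null count for each column
```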

As important as it is to find insights in the data, it is equally important to write down observations at each step.

Observations:

  1. We have 34,250 train images. In our dataframe we have the image title as a text feature.
  2. We also have image_phash and label_group as other features. The posting_ids of the similar products are what we need to predict for each image.

Test Data:

Reading Test dataframe
analyzing shape of test data
analyzing first few rows of test data
Analyzing Information About Each Column

We have now seen our test and train dataframes. Next we need to look for insights in the dataset, to see if there is any hidden pattern or relationship between the columns. As we have just 5 columns, analyzing each of them won’t be difficult, so that is where we will begin.

Step3: Exploring Data Columns:

Label_group:

We will start with label_group. This column is not given to us for the test dataset. Also, it is given that the images in the same label_group are the most similar. Hence, in a way this is like our target label. So, let us explore this column further:

Grouping by Label_group to see count for each of them
checking duplicated label groups
number of unique label groups
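These three checks can be sketched with pandas as follows. The small dataframe is made up for illustration; only the column names come from the data description.

```python
import pandas as pd

# Toy stand-in for the train dataframe; the values are made up.
train = pd.DataFrame({
    "posting_id": ["p1", "p2", "p3", "p4", "p5"],
    "label_group": [10, 10, 20, 30, 10],
})

# Number of products in each label_group, largest group first.
group_sizes = (
    train.groupby("label_group")["posting_id"]
    .count()
    .sort_values(ascending=False)
)

# Rows whose label_group has already appeared in an earlier row.
n_duplicates = int(train["label_group"].duplicated().sum())

# Number of distinct label groups.
n_unique = train["label_group"].nunique()

print(group_sizes)
print(n_duplicates, n_unique)
```

Note that the duplicate count and the unique count always add up to the number of rows, since every row is either the first occurrence of its label_group or a repeat.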

Plotting Number of Images in Each Label Group:

Barplot number of images in each label for first 50 label_groups

Observations:

  1. From the above information, we can see that the largest label_group contains 51 products.
  2. There are 23236 duplicated label_group values in our train dataset, as many products share the same label_group.
  3. In all there are 11014 unique labels. As the number of labels is very large, this cannot be treated as a simple classification problem.

Title:

We are also given the title of each image. We will explore this column and find out how similar these titles are for similar products/images. It is quite possible that images with the same or similar titles are also similar and should fall in the same category. Let us check how likely this is and see if it really holds:

To analyze the title, we begin with its length, that is, the number of words in each title.

Adding new column title_length
Checking for the minimum title_length
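A minimal sketch of this step, assuming that title_length means the number of whitespace-separated words (the sample titles are made up):

```python
import pandas as pd

# Toy titles; the real dataframe has one title per product image.
train = pd.DataFrame({
    "title": ["Paper Bag Victoria Secret", "Double Tape", "Kaftan"],
})

# Split each title on whitespace and count the resulting words.
train["title_length"] = train["title"].str.split().str.len()

shortest = int(train["title_length"].min())
longest = int(train["title_length"].max())
print(shortest, longest)
```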

We see that the maximum number of words in any title is 61 (not a very large number) and the minimum is 1.

For a clearer idea, let us draw a word cloud of the titles for a randomly picked label group and see if we find something. As the titles are not very long, we can get a good view from the word cloud.

Wordcloud for images from group_label: 1163569239

We can see that four words occur much more often than the other words present in the titles. We also see another variation of the word SCARLETT, namely Scarlet. The same goes for whitening. Hence, when analyzing titles it is important to convert all the words to the same case.

Analyzing title_length for a particular label_group

We see that there is no clear relation between the title_length and the label_group.

The series of operations on title_length that we did for one label_group can be repeated in the same way for a few more label_groups.
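The effect of case can be checked without a word cloud by simply counting words. The sketch below lowercases every title before counting, so that the upper-case and lower-case spellings of a word fall into the same bucket; the titles are made-up examples, not rows from the dataset.

```python
from collections import Counter

# Hypothetical titles from one label group.
titles = [
    "SCARLETT Whitening Body Lotion",
    "Scarlett whitening shower scrub",
    "scarlett WHITENING body lotion",
]

def word_counts(titles):
    """Count words across titles after converting them to lower case."""
    counts = Counter()
    for title in titles:
        counts.update(title.lower().split())
    return counts

counts = word_counts(titles)
print(counts.most_common(3))
```

Lowercasing only merges differences in case. Spelling variants (one t versus two) would still be counted separately and need fuzzy matching.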

Title Analysis for a Different Label_Group:

Wordcloud for images from group_label: 2357508171

Observations:

  1. Products belonging to different label_groups have quite different words. Hence, it can be concluded that the title plays a major role in this problem.
  2. Also, for products belonging to the same label_group, a few words appear more frequently than the rest. This once again supports the above statement.
  3. There is no relation between the number of words in the titles of images that belong to the same label_group.
  4. The hash_value is widely scattered for images in the same label_group.

Image_Phash Analysis:

Image_phash gives us the hash_value corresponding to each image. Let us first check, for a few examples, whether the images belonging to the same label_group have the same or different hash values.

Image_phash for label_group 2357508171
Image_phash for label_group 1163569239

We have seen that images belonging to the same label_group might have different hash_values. Let us check whether images with the same hash_value can have different label_groups.

Checking distribution of images with respect to hash_value
Checking if images from different label_groups have same hash_value or not
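Both checks can be sketched with a groupby on the hash column. The hashes below are shortened, made-up values; the column names are the ones used in the train dataframe.

```python
import pandas as pd

# Toy data: hash "bb22" is shared by two different label groups.
train = pd.DataFrame({
    "image_phash": ["aa11", "aa11", "bb22", "bb22", "cc33"],
    "label_group": [1, 1, 2, 3, 4],
})

# How many images share each hash_value.
images_per_hash = train["image_phash"].value_counts()

# How many distinct label groups each hash_value appears in.
groups_per_hash = train.groupby("image_phash")["label_group"].nunique()

# Hashes that occur in more than one label group.
conflicting = groups_per_hash[groups_per_hash > 1]
print(conflicting)
```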

Observations:

  1. There are cases in which images with the same hash_value belong to different groups, but the majority belong to the same label_group.
  2. We are given a perceptual hash, hence for calculating the similarity between two images we will consider the Hamming distance.
  3. Hash_value is also an important feature for a given image. We need to use it carefully with the other available features to find the most similar images.
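For reference, the Hamming distance between two perceptual hashes is the number of bit positions in which they differ. A minimal sketch, assuming the hashes are hexadecimal strings of equal length:

```python
def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Number of differing bits between two equal-length hex strings."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 in every bit position where the two hashes differ.
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")

print(hamming_distance("ff00", "ff00"))  # identical hashes
print(hamming_distance("ff00", "ff01"))  # one bit differs
```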

Helper Function for Image Hash Analysis:

Building Image Matrix Based on Hash_Value for first 1000 images

We are looking for similar product images. So, to begin with, for every product we can take the 50 most similar products based on the value calculated here.

This can be done using the following code:
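The original code is not preserved in the text, so the following is only a sketch of the idea. It compares every hash with every other hash and keeps, for each product, the indices of its k closest products in a dictionary. It assumes hexadecimal hashes and uses the Hamming distance as the value being compared, which may differ from the hash difference used in the original notebook.

```python
def bit_difference(hash_a: str, hash_b: str) -> int:
    """Number of differing bits between two hex strings."""
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")

def top_k_similar(hashes, k=50):
    """Map each index to the indices of its k closest hashes (itself excluded)."""
    result = {}
    for i, current in enumerate(hashes):
        distances = [
            (bit_difference(current, other), j)
            for j, other in enumerate(hashes)
            if j != i
        ]
        distances.sort()  # smallest distance first, ties broken by index
        result[i] = [j for _, j in distances[:k]]
    return result

# Shortened, made-up hashes; real ones would come from train["image_phash"].
hashes = ["ff00", "ff01", "0000", "ff03"]
neighbours = top_k_similar(hashes, k=2)
print(neighbours)
```

The pairwise loop is quadratic in the number of images, which is acceptable for the first 1000 rows but becomes slow on the full dataset.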

Now, for a clear visualization, we plot from this dictionary, which holds the top 50 products for each product. The information here might not be very accurate, because instead of considering the whole dataset of 34,250 rows we are considering just the first 1000 rows.

Observations:

Based on these plots, the results are not very promising. This might also be because we are only taking a sample of the images. Let us try with the first 10 images, matching their hash_value against all the images in the dataset.

Now, let us see how this works if we perform this analysis on a randomly picked image and compare its hash_value with all the images available in the dataset.

For Index 2937:

Wow! This time the result is far better. Hence, we have to utilize the complete data wisely. Comparison with a subset won’t help.

As further analysis, you can also check whether the images that are found similar based on hash_value have the same label or not.

Observations:

  1. We get good results, but only when there are images with the same hash_value. Hence, one can rely on this technique only to some extent. This also clearly shows that hash_value is an important parameter to consider here.
  2. We see that none of the top 50 similar images based on hash_difference has the same label as the query image. We will repeat these steps with the Hamming distance and see if we find something.

Image Analysis:

Plotting Images with same Label_Group:

Image sample with same label_group

Plotting Images with same Hash_Value:

Observations:

  1. The hash_value is an important feature of any given image and it helps in grouping similar images together.
  2. For perceptual hashing, the Hamming distance is the main measure of similarity between two given images.
  3. The smaller the Hamming distance, the more similar the images are.

Analyzing Image Similarity Based on Title:

As we saw earlier, the title is an important feature. Let us see how good the results are when we compare titles. Before calculating similarity based on the title, we will perform some simple text transformations.

As we saw earlier, all the titles should be in the same case, so we convert them accordingly. For calculating title-based similarity we use the fuzzywuzzy library.

Helper Function:

function for calculating fuzz ratio
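The helper itself is not preserved in the text. As a rough sketch of the idea, the code below scores two titles on a 0 to 100 scale after lowercasing them. To stay dependency-free it swaps fuzzywuzzy for difflib.SequenceMatcher from the standard library, so the scores are in the same spirit as a fuzz ratio but are not guaranteed to match it; the function names are hypothetical.

```python
from difflib import SequenceMatcher

def title_ratio(title_a: str, title_b: str) -> int:
    """Similarity of two titles on a 0-100 scale, ignoring case."""
    a = title_a.lower().strip()
    b = title_b.lower().strip()
    return round(100 * SequenceMatcher(None, a, b).ratio())

def most_similar_titles(titles, index, k=5):
    """Return (score, position) pairs for the k titles closest to titles[index]."""
    query = titles[index]
    scored = [
        (title_ratio(query, other), j)
        for j, other in enumerate(titles)
        if j != index
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first
    return scored[:k]

titles = ["red shoe", "RED SHOE", "blue hat"]
print(most_similar_titles(titles, 0, k=1))
```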

For Index 0:

Top Similar Images to Image at Index0 based on fuzz ratio

For Index 10:

Top Similar Images to Image at Index10 based on fuzz ratio

For Index 30:

Top Similar Images to Image at Index30 based on fuzz ratio

One Important Observation:

The similar images for the product at index_30 based on title didn’t look appropriate. Hence, we run a check with label_group. And, what we observe is a big error:

We see that all the images that were found similar to the image at index 30 based on title belong to a different label_group. To find out what is happening, we look at all the images that have the same label_group as the image at index 30.

And, this gives us our error. We see that the term Moina Kaftan exists in both titles, but because the complete strings were not the same we get a lower fuzz ratio. To solve this we need to identify the separator and look for the maximum ratio across the separated parts. To achieve this, we write an improved version of the function that calculates the fuzz ratio, as shown below:
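The improved function is not preserved in the text either, so this is a sketch of the separator idea under stated assumptions: the separator characters (|, / and ,) are a guess, the function names are hypothetical, and difflib.SequenceMatcher again stands in for fuzzywuzzy. Each title is split into parts, every part of one title is compared with every part of the other, and the best score is kept.

```python
import re
from difflib import SequenceMatcher

SEPARATORS = r"[|/,]"  # assumed separator characters, not taken from the data

def ratio(a: str, b: str) -> int:
    """Similarity of two strings on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def segments(title: str):
    """Lowercase a title and split it into non-empty parts at the separators."""
    parts = [part.strip() for part in re.split(SEPARATORS, title.lower())]
    return [part for part in parts if part]

def max_segment_ratio(title_a: str, title_b: str) -> int:
    """Best score over the whole titles and over every pair of their parts."""
    whole = ratio(title_a.lower().strip(), title_b.lower().strip())
    best_part = max(
        (ratio(x, y) for x in segments(title_a) for y in segments(title_b)),
        default=0,
    )
    return max(whole, best_part)

long_title = "Moina Kaftan | long sleeve dress"  # made-up title
short_title = "MOINA KAFTAN"
print(ratio(long_title.lower(), short_title.lower()))
print(max_segment_ratio(long_title, short_title))
```

Because the whole-title score is always included in the maximum, this version can never score a pair lower than the plain ratio does.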

Now we calculate once again for index 30 with this new function:

We get the similar product now. Hence, adding the separator logic fixed our problem.

Observation:

  1. Title is an important feature. We should not ignore words in the title. The higher the fuzz ratio, the more similar the images turn out to be.
  2. Besides the title, we found that image_hash_value is also an important feature.
  3. We haven’t yet considered the images themselves to calculate similarity, so we can do that as well.
  4. We need to come up with a method that combines these parameters and brings us the top similar images in an optimal way.

Last Words..

Well, this was pretty simple, but we got a few amazing insights. This shows the power and magic of exploratory data analysis. I hope you liked this article.

For complete code refer: https://www.kaggle.com/mansichoudhary3296/shopee-1

References:

  1. https://www.appliedaicourse.com/
  2. https://www.kaggle.com/c/shopee-product-matching/code

Mansi Choudhary
Analytics Vidhya

Certified data scientist and blogger who looks forward to learning about the new developments happening daily in the field of machine learning and data science.