Product Placement in Retail Stores

Samarth Khanna
9 min read · Aug 19, 2021


Part 1

How many times has it happened that you enter a store, fill your cart with all kinds of stuff, only to realize that you already have it at home, don’t want to spend THAT much on just candy, or simply don’t feel like it anymore? What you also don’t feel like is setting out on a quest to return an item to its rightful place in a store whose aisles are, quite often, longer than your average distance covered per day (oops?). So, you drop it where you are.

Now, I don’t mean to be a guilt monger, but store owners spend hours of their time trying to put things back into place. Considering that an average grocery store carries ~40,000 items, making this task easier for store owners can save a lot of time and effort. This might be why large companies are investing in technology to optimize this process, by virtue of which I had the good fortune of working on this case study.

If you’re looking for ways to use photographs from your store to detect which products are misplaced, this (two-part) blog is definitely for you. Otherwise, if you’re simply interested in how to stack multiple Machine Learning and Deep Learning models on top of each other to solve a problem using image data, read on! (Although it might take a while :P)

Note: If you’ve already read this part or want to jump straight to modeling (I would recommend going through this one first), you can find the next part here, and it is also linked at the bottom once you’re done reading this one.

The Problem Statement

“We would like you to implement a solution for product verification in retail stores. Specifically, given a set of product spaces each containing some product images, write a solution to alarm a store operator whether a product is kept in its correct product space or not. A product that doesn’t belong to the correct product space is called a Plug.”

As you would have guessed, the task at hand is classification. Given an image of a product kept in a product space, we have to classify whether it is a plug or not. However, the classification itself can be done on two levels, as we will discuss in a bit. Some other considerations/use cases we have to keep in mind are:

1) How fast does the user want their results?

2) How often will they check for them?

3) Is it more important to minimize effort or to leave no product out of place?

4) What kind of device will they use for this purpose? What computing constraints will that device have?

Let us try to understand the data provided before deciding on the exact tools and metrics we will use for our system.

The Data

Description

We have 3 files:

1) embeddings.csv — contains a 1280-dimensional feature embedding for each image of an item in a retail store. These were extracted from one of the final layers of a state-of-the-art image classifier trained on the ImageNet dataset. I loaded this file through numpy.genfromtxt to get an array of shape (13008, 1280).

2) product_spaces.csv — contains the identifier for the product space corresponding to each item. The shape is (13008, 1) after expanding by one dimension. This is a categorical feature that identifies a particular product space for each entry.

3) plug_labels.csv — indicates whether the item is in the right product space: 0 if the given product space is correct, 1 otherwise. All misplaced objects are called plugs. The shape is (13008, 1) after expanding by one dimension.

I have combined these three tables into a single data frame with column names ranging from ‘1’ to ‘1282’.
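The combination step can be sketched as follows. Random arrays stand in for the three files so the snippet is self-contained (in the actual workflow each array comes from numpy.genfromtxt on its CSV), and the column names ‘1’ to ‘1282’ follow the description above:

```python
import numpy as np
import pandas as pd

# Random stand-ins for the three files, with the shapes described above.
# In the real workflow each array comes from numpy.genfromtxt on its CSV.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(13008, 1280))       # embeddings.csv
spaces = rng.integers(0, 298, size=(13008, 1))    # product_spaces.csv (expanded by one dim)
labels = rng.integers(0, 2, size=(13008, 1))      # plug_labels.csv (expanded by one dim)

# Stack horizontally and name the columns '1' through '1282'.
combined = np.hstack([embeddings, spaces, labels])
df = pd.DataFrame(combined, columns=[str(i) for i in range(1, 1283)])
print(df.shape)  # (13008, 1282)
```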

Let us dive deeper to understand it better.

Note: I cannot make this data available to the reader as it is the property of the corporation that gave me this task.

Data Preprocessing

Here are the steps I followed to get the data in the format I wanted for my analysis. You will find the explanations for some of the steps in the next section:

1) Splitting the above-created data set into the original three parts to perform individual operations on each.

2) Creating a dictionary that contains the number of occurrences for each product space.

3) Creating a dictionary that maps the 298 product spaces to 298 index values.

4) Using the index values as a single feature for product spaces in a new set called ‘cat_data’.

5) Normalizing all columns in cat_data.

6) Normalizing the embedding values for each column.

7) Creating one-hot-encoded vectors for each product space. The resulting shape would be (n_samples, 298). The encoder returns a sparse matrix, which has to be converted to dense before joining it back with the rest of the data.

8) Horizontally stacking the normalized embeddings, one-hot-encoded product spaces, and plug labels to create a data frame called ‘ohe_data’.

9) Giving names to all the columns ranging from ‘1’ to ‘1579’ (1280 embedding features + 298 product space features + 1 plug label feature).
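Steps 5–8 above can be sketched like this, again with stand-in arrays; passing fixed categories to OneHotEncoder guarantees the 298-column output regardless of which spaces appear in the sample:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

rng = np.random.default_rng(0)
emb = rng.normal(size=(13008, 1280))            # stand-in embeddings
spaces = rng.integers(0, 298, size=(13008, 1))  # stand-in product-space indices
labels = rng.integers(0, 2, size=(13008, 1))    # stand-in plug labels

# Normalize each embedding column, then one-hot encode the product spaces.
emb_norm = MinMaxScaler().fit_transform(emb)
encoder = OneHotEncoder(categories=[np.arange(298)])
ohe = encoder.fit_transform(spaces).toarray()   # sparse -> dense, (13008, 298)

# Horizontally stack embeddings, one-hot spaces, and labels: 'ohe_data'.
ohe_data = np.hstack([emb_norm, ohe, labels])
print(ohe_data.shape)  # (13008, 1579)
```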

Exploratory Data Analysis

How many plugs are there in the whole set?

The data is grossly imbalanced. Specifically, there are 471 plugs and 12537 non-plugs. We will have to rectify this during training.

How many unique product spaces do we have? What are the values like?

The values of product spaces are of the order ~ 10¹⁰. There are 298 unique product spaces. We will encode this as a categorical variable using one-hot encoding.

What do the other features look like?

The feature embeddings have both positive and negative values and lie between (-10, 10).

Are the plugs and non-plugs separable using this featurization?

For this, I created a 2-dimensional t-SNE plot, coloring points by their plug labels. Here is the plot for ‘ohe_data’:

Some amount of separation can be observed between clusters of plugs and those of non-plugs (the majority). It can also be observed that the number of clearly separable clusters is of the order of the number of unique product spaces. Hence, it won’t be unwise to try to separate plugs from non-plugs using classifiers that use neighbor information.
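A plot along these lines can be reproduced with scikit-learn’s TSNE. The snippet below uses a small random stand-in for the features (the real set is 13008 × 1579, which takes noticeably longer to embed):

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
features = rng.normal(size=(300, 64))   # stand-in for the item features
plug = rng.integers(0, 2, size=300)     # stand-in plug labels

# Project to two dimensions and color each point by its plug label.
proj = TSNE(n_components=2, random_state=0, perplexity=30).fit_transform(features)
plt.scatter(proj[:, 0], proj[:, 1], c=plug, s=8, cmap="coolwarm")
plt.title("t-SNE of item features, colored by plug label")
plt.savefig("tsne_plugs.png")
```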

Now let us create the same plot for ‘cat_data’:

There is no significant difference between clusters of the same and different classes in this plot as compared to the previous one. Hence, we should try both feature sets and determine what works better in practice.

Can all the product spaces be distinguished on the basis of the embeddings?

We will check if all the product spaces can be separated using the image feature embeddings. Here is the t-SNE plot:

Quite clearly, we are able to separate the items based on product spaces significantly well, even in just two dimensions. This motivates training a model to learn the relationship between image features and product spaces, predicting the product space for any item, and checking whether the prediction matches the given space.

Data sets used throughout

The two ways of using this data

1) Using product_space as a feature and plug_label as the target. This way we are checking, for any given item (embeddings) and product space, if the placement/pairing is correct. The plug label (target) is 0 if yes, otherwise it is 1. We will try this in the following two ways:

a) Using the indices of product spaces as the input along with embeddings. The total input shape would be (13008, 1281) in this case.

b) Using the one-hot-encoded representations of the product spaces along with the embeddings. The total input shape will be (13008, 1578) in this case.

2) Using product_space as the target. We will use only non-plugs while training this model. We will train a deep neural network to learn the relationship between items and product spaces. Once it does, we will use the test data (already set aside) to predict the product space for each item, then check whether the prediction matches the given space in the case of non-plugs and differs in the case of plugs. The shape of the input will be (13008, 1280) and that of the output will be (13008, 298).

In both cases the label vector has shape (13008,); for the second approach it is expanded into one-hot targets of shape (13008, 298). We will use these approaches for our machine learning models as well as with a neural network architecture.
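The core of the second approach, flagging an item as a plug whenever the predicted product space disagrees with the space it actually sits in, can be sketched on synthetic data. A logistic-regression classifier stands in for the deep network used in the post, and all names and numbers below are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_spaces = 5  # stand-in for the 298 real product spaces

# Synthetic embeddings clustered by the item's true product space.
true_space = rng.integers(0, n_spaces, size=400)
emb = rng.normal(size=(400, 16)) + true_space[:, None] * 3.0

# A few items are deliberately misplaced: shelf space != true space.
shelf_space = true_space.copy()
plug_idx = rng.choice(400, size=20, replace=False)
shelf_space[plug_idx] = (shelf_space[plug_idx] + 1) % n_spaces
is_plug = (shelf_space != true_space).astype(int)

# Train on correctly placed items only (non-plugs), as in approach 2.
mask = is_plug == 0
clf = LogisticRegression(max_iter=1000).fit(emb[mask], shelf_space[mask])

# An item is flagged as a plug when prediction and shelf space disagree.
pred_plug = (clf.predict(emb) != shelf_space).astype(int)
```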

Here are the data sets that will be used at some point or the other:

1) X

Complete data to be used during training, derived from ohe_data. The training plugs are upsampled with replacement from 377 to 4000. Final shape -> (14029, 1579).

2) x_unb, y_unb

Data separated out from ohe_data before upsampling, to be used for testing. Final shape -> (2602, 1579), (2602,). Contains 2508 non-plugs and 94 plugs. This split is made to mimic the proportions given in the original set, which should be reflective of the real-world scenario.

3) X_cat

Complete data to be used during training, derived from cat_data. The training plugs are upsampled with replacement from 377 to 4000. Final shape -> (14029, 1282).

4) x_unb_cat, y_unb_cat

Data separated out from cat_data before upsampling, to be used for testing. Final shape -> (2602, 1281), (2602,). Contains 2508 non-plugs and 94 plugs. This split is made to mimic the proportions given in the original set, which should be reflective of the real-world scenario.

5) x_train, y_train

Derived from train_test_split of X. All corresponding models will be trained on this. Shape -> (11223, 1578), (11223,).

6) x_train_cat, y_train_cat

Derived from train_test_split of X_cat. All corresponding models will be trained on this. Shape -> (11223, 1281), (11223,).

7) x_cv, y_cv

Derived from train_test_split of X. Cross-validation will be conducted using this data, for all corresponding models. Shape -> (2806, 1578), (2806,).

8) x_cv_cat, y_cv_cat

Derived from train_test_split of X_cat. Cross-validation will be conducted using this data, for all corresponding models. Shape -> (2806, 1281), (2806,).

9) x_sep_train, y_sep_train

Contains all the points from x_train that get predicted as plugs by the Gaussian Naïve Bayes model (GNB). The shape would vary depending on the predictions; here it was (5137, 1578), (5137,). y_sep_train holds the corresponding labels from y_train.

10) x_sep_cv, y_sep_cv

Contains all the points from x_cv that get predicted as plugs by the GNB. The shape would vary depending on the predictions; here it was (1298, 1578), (1298,). y_sep_cv holds the corresponding labels from y_cv.

11) x_sep_unb, y_sep_unb

Contains all the points from x_unb that get predicted as plugs by the GNB model. The shape would vary depending on the predictions; here it was (711, 1578), (711,). y_sep_unb holds the corresponding labels from y_unb.

12) x_nb, y_nb

Copy of x_train + feature containing predictions made by GNB on x_train. y_nb is simply a copy of y_train.

13) x_nb_cv, y_nb_cv

Copy of x_cv + feature containing predictions made by GNB on x_cv. y_nb_cv is simply a copy of y_cv.

14) x_nb_unb, y_nb_unb

Copy of x_unb + feature containing predictions made by GNB on x_unb. y_nb_unb is simply a copy of y_unb.

These splits are a key part of our strategy. We will use these sets at different stages of modeling to ensure better and more balanced predictions.
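The upsample-then-split pipeline behind X, x_unb, x_train, and x_cv can be sketched as follows. This uses stand-in data (the real set has 13008 rows), and the plug proportion and upsampling target only loosely follow the numbers above:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(0)
n, d = 2000, 20
features = rng.normal(size=(n, d))
labels = (rng.random(n) < 0.04).astype(int)  # ~4% plugs, as in the post

# Hold out an untouched, imbalanced test set first (x_unb / y_unb).
x_rest, x_unb, y_rest, y_unb = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=0)

# Upsample the minority class (plugs) with replacement before training.
plugs_up = resample(x_rest[y_rest == 1], replace=True,
                    n_samples=400, random_state=0)
non_plugs = x_rest[y_rest == 0]
X = np.vstack([non_plugs, plugs_up])
y = np.concatenate([np.zeros(len(non_plugs)), np.ones(400)])

# Then split the balanced set into train and cross-validation parts.
x_train, x_cv, y_train, y_cv = train_test_split(
    X, y, test_size=0.2, random_state=0)
```

Holding out the imbalanced test set before upsampling matters: otherwise duplicated plugs can leak into the test data and inflate the scores.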

Please read on here if you have found this post relevant and are interested in an in-depth analysis of different techniques used to solve this problem.

Here are the resources that really helped me create this work.

References

GitHub Link: Please click here.

LinkedIn profile: Please click here.
