Product Placement in Retail Stores

Samarth Khanna
22 min read · Aug 19, 2021


Part 2

So good to see you here! This next part will surely not disappoint 😊

In the previous post, we understood the raw data provided and made a few changes to suit our task. Now, we will explore ways to accomplish that task and select the strategies that work best.

Metrics and Losses

Here are the various losses/metrics used, along with the reasons they were selected:

1) Log-loss/Binary Cross-Entropy

This is regarded as a reliable metric for binary classification tasks that gives an indication of how confident a model is while making predictions.

2) Accuracy score

It will give us an idea of the overall performance. However, it is susceptible to misleading results due to class imbalance (as can be observed in the baseline model).

3) Area Under Receiver Operating Characteristic Curve (roc_auc_score)

A metric that helps us optimize towards a balance between True Positive Rate and False Positive Rate. It accounts for class imbalance and clearly signals when the minority class is not being handled well.

4) F1 score

Another metric that is very useful for binary classification, especially for imbalanced classes, as it strikes a trade-off between Precision and Recall.

5) Confusion matrix

This gives the clearest representation of the scenario after prediction, with the exact number of correctly/incorrectly classified points for each class. This will help us select the strategies to use moving forward.

6) Precision (matrix)

It is used to display the values of Precision for all the classes involved.

7) Recall (matrix)

It is used to display the values of Recall for all the classes involved. All these matrices will help us decide if we are missing too many plugs, classifying too many non-plugs as plugs, or making some other kind of error.

8) Categorical Cross-Entropy

The most popular metric for multi-class classification. We will need this for our task of predicting product spaces from image feature embeddings.
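As a point of reference, here is a minimal sketch of how these metrics can be computed with scikit-learn (the names y_true, y_pred and y_prob are placeholders for the ground-truth labels, hard predictions, and predicted plug probabilities):

```python
from sklearn.metrics import (
    log_loss, accuracy_score, roc_auc_score,
    f1_score, confusion_matrix, precision_score, recall_score,
)

def report_binary_metrics(y_true, y_pred, y_prob):
    """Print all the binary-classification metrics discussed above."""
    print("Log-loss :", log_loss(y_true, y_prob))
    print("Accuracy :", accuracy_score(y_true, y_pred))
    print("ROC-AUC  :", roc_auc_score(y_true, y_prob))
    print("F1 score :", f1_score(y_true, y_pred))
    print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
    # Per-class precision/recall, i.e. the "matrices" mentioned above
    print("Precision:", precision_score(y_true, y_pred, average=None))
    print("Recall   :", recall_score(y_true, y_pred, average=None))
```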

Onto the models!

Machine Learning models

Baseline Model

If we classify all products as non-plugs (the majority class), we get the following values for the various metrics:

Accuracy: 0.964

ROC-AUC: 0.5

F1 score: 0.0

We should aim for an overall accuracy (and other metrics) better than this. However, since our main focus is to alert the retailer about all the plugs, some False Positives may be allowed in order to increase the number of plugs detected and reduce the number of False Negatives (a positive, in this context, being a plug).
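A minimal sketch of this baseline, assuming y_test holds the held-out labels (0 = non-plug, 1 = plug):

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score

y_baseline = np.zeros_like(y_test)  # predict "non-plug" for every item
print("Accuracy:", accuracy_score(y_test, y_baseline))             # ~0.964 here
print("ROC-AUC :", roc_auc_score(y_test, y_baseline))              # 0.5 for a constant score
print("F1 score:", f1_score(y_test, y_baseline, zero_division=0))  # 0.0 (no plug ever predicted)
```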

General Approach

1) (Fitting the model -> calibrating the classifier -> calculating metrics) for every hyperparameter (hp) value in the chosen range,

2) Selecting the best value of hp and fitting the model with that value,

3) Plotting the confusion matrices from the function above,

4) Interpreting results.

Calibrated Classifier

This classifier is used on top of the basic classification model in order to generate probability values for each prediction instead of hard predictions. This will be required to calculate the two-class log-loss and roc_auc_score.

The two ways to perform this calibration are ‘sigmoid’ (Platt scaling) and ‘isotonic’ regression. We have mostly used the former.
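A minimal sketch of this wrapping, with Logistic Regression as an example base model (the exact cv value is an assumption):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression

base = LogisticRegression(C=10, max_iter=1000)
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=3)  # or method="isotonic"
calibrated.fit(x_train, y_train)

# Probability of the positive (plug) class, used for log-loss and roc_auc_score
plug_probs = calibrated.predict_proba(x_unb)[:, 1]
```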

Logistic Regression

This is the first model one would test for binary classification to check whether the data is linearly separable or not.

It is fast and can handle data with high dimensionality well. We tried various fits of this model, both on ‘cat_data’ and ‘ohe_data’. It was observed that the performance was much better while using ohe_data.

The hyperparameter we will vary is C, which is the inverse of regularization strength.
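A sketch of the general approach applied to Logistic Regression: sweep C, calibrate, and keep the value that scores best on a validation split (the grid and the x_cv/y_cv names are assumptions):

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

c_values = [10 ** p for p in range(-4, 5)]  # 1e-4 ... 1e4
scores = []
for c in c_values:
    clf = CalibratedClassifierCV(
        LogisticRegression(C=c, max_iter=1000), method="sigmoid", cv=3
    )
    clf.fit(x_train, y_train)
    scores.append(roc_auc_score(y_cv, clf.predict_proba(x_cv)[:, 1]))

best_c = c_values[int(np.argmax(scores))]
print("Best C:", best_c)
```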

Here are the results for cat_data:

The best fit was observed for C = 10000.

Not bad.

The good news is that our approach may indeed work out. We are able to correctly detect ~60% of plugs and only 4% of non-plugs are incorrectly classified. However, we are still getting a lot of False Positives and False Negatives.

Let us now try with ‘ohe_data’.

The best fit was observed with C = 10.

Much better!

The number of falsely classified points has decreased on both sides. This in itself would not be too bad a solution.

Let us now check other metrics for this model as well. We will refit the model for this purpose. Here are the results:

Accuracy: 0.9596

ROC_AUC: 0.9535

F1 score: 0.5749

The Recall is reasonably good but the Precision still has scope for improvement. We shall have to explore further models for this. We don’t want the store employees making unnecessary rounds, do we?

Naïve Bayes

Naive Bayes has been extensively used for text classification with a large number of features. We are trying two types of NB models to see if this problem can be successfully solved with the same principles:

1. Multinomial Naive Bayes — even though this method is more suited for discrete distributions, it does work adequately well in practice with tf-idf features as well.

2. Gaussian Naive Bayes — this is meant for continuous distributions, which is why we can expect positive results.

Multinomial Naïve Bayes

For all the fits, we vary the value of ‘alpha’, the Laplace smoothing parameter. Here are the results of the various fits.
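For reference, a minimal sketch of this alpha sweep (the grid of values is an assumption):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import MultinomialNB

for alpha in [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100]:
    clf = CalibratedClassifierCV(MultinomialNB(alpha=alpha), method="sigmoid", cv=3)
    clf.fit(x_train, y_train)
    auc = roc_auc_score(y_cv, clf.predict_proba(x_cv)[:, 1])
    print(f"alpha={alpha:g}  roc_auc={auc:.4f}")
```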

Looks like the model is performing better with small values of the smoothing parameter.

Seems like this did not work out as well as we had hoped. Both False Positives and False Negatives have increased as compared to the values Logistic Regression was producing. Let’s try out ‘ohe_data’ and see if there is any improvement.

Considering the plugs, we can say there is a significant improvement after using ‘ohe_data’ instead of ‘cat_data’. We have a decent plug detection rate but the False Positives are too many, hence the precision is very low.

Here are the values of various metrics for this model (after re-fitting):

Accuracy: 0.8912

ROC_AUC: 0.9191

F1 score: 0.331

The number of False Negatives has decreased even more in this fit but the number of False Positives has increased. We may explore the possibility that there is a trade-off between the two values in this data.

Gaussian Naive Bayes

We are training this model with var_smoothing = 0.01. The performance is given below:

Accuracy: 0.7598

ROC_AUC: 0.8645

F1 score: 0.2236

Very interesting results:

1. The Recall has improved even more, to a very impressive level (~0.96).

2. The Precision has decreased even more, to a very troublesome level (~0.127).

3. We can think of a strategy in which we use the fact that most of the plugs are detected by GNB, and somehow filter out the non-plugs to return a result with high values of both Precision and Recall.

4. Both the Naive Bayes classifiers are extremely fast, which takes away the reservation against adding too many steps to the pipeline.

We shall have to explore further.

Support Vector Machines (SVC)

The SVC class is based on the libsvm library, which implements the kernel trick.

The given features will be mapped to a much higher dimensional space for training.

As this is a medium-sized dataset with many features, this classifier is an appropriate choice.

It also handles sparse features well. As we have seen better performance for ‘ohe_data’ in both previous models, we will use only that data from now on.
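A minimal sketch of the SVC fit used here, assuming the default RBF kernel and wrapping it in the same calibration step as before:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import SVC

svc = CalibratedClassifierCV(SVC(C=1000, kernel="rbf"), method="sigmoid", cv=3)
svc.fit(x_train, y_train)                    # ohe_data splits

plug_probs = svc.predict_proba(x_unb)[:, 1]  # probabilities for roc_auc / log-loss
plug_preds = svc.predict(x_unb)              # hard predictions for the confusion matrix
```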

The number of False Positives has reduced significantly, to as low as 7. However, we have missed more plugs as compared to the previous two models. Let us fit the model again with C = 1000. The performance is given below:

Accuracy: 0.9812

ROC_AUC: 0.9566

F1 score: 0.7293

Very promising results.

For the first time, we are seeing something close to what is desired.

On the other hand, the time taken for training and prediction is significantly more than for the previously used models. Still, a prediction time of 1 minute for ~2600 items is not too slow either; it depends on what the retailer wants.

Now, for experimental purposes, let us try and fit the model with C = 1. Even though we might compromise on the overall accuracy, let us see if the results are favorable in some way. The performance is given below.

Accuracy: 0.9427

ROC_AUC: 0.9566

F1 score: 0.5209

Fascinating.

Looks like our understanding of the trade-off between Precision and Recall might be correct.

Again, if the retailer does not mind checking some extra items that aren’t actually plugs in order to miss fewer of them, this solution would be more suitable for them.

Let us explore further.

K Nearest Neighbors

This classifier finds the closest ‘K’ neighbors of the query point in the vector space, using one of the ‘ball_tree’, ‘kd_tree’, or brute-force search methods.

KNN has been used at other places for learning the similarity between items using image feature embeddings. Our hope is that it will work here as well.

We are varying the values of K from 1 to 15 in the following fits.
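A minimal sketch of this K sweep (the metric used for comparison here is an assumption):

```python
from sklearn.metrics import f1_score
from sklearn.neighbors import KNeighborsClassifier

for k in range(1, 16):
    # algorithm="auto" lets sklearn choose between ball_tree, kd_tree and brute force
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(x_train, y_train)
    print(f"K={k:2d}  f1={f1_score(y_cv, knn.predict(x_cv)):.4f}")

# The later experiment corresponds to KNeighborsClassifier(n_neighbors=5, weights="distance")
```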

Lovely! Overall, these seem to be the best results so far.

Out of 2602 points, only 24/94 plugs were missed and only 34/2508 non-plugs were classified as plugs.

Just to experiment, let us see if the number of plugs missed reduces if we change K to 5 and use distance-weighted voting. The performance is shown below:

Accuracy: 0.9792

ROC_AUC: 0.9406

F1 score: 0.7188

Overall, the performance seems to be marginally better. However, the number of plugs being missed has increased by 1. We can conclude that a small value of K will give us good results.

Random Forest

This is one of the most advanced machine learning models and handles all kinds of features well. Random sampling of rows and features makes it less susceptible to overfitting.

Once trained, these models can yield predictions very fast, as prediction simply involves checking a series of if-else conditions. Let’s take a look at how the model performs on our data.

(In the hyperparameter plot, ‘alpha’ denotes n_estimators.)
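A minimal sketch of the sweep behind that plot (the grid of n_estimators values is an assumption):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

for n in [10, 50, 100, 200, 500]:
    rf = RandomForestClassifier(n_estimators=n, n_jobs=-1, random_state=42)
    rf.fit(x_train, y_train)
    auc = roc_auc_score(y_cv, rf.predict_proba(x_cv)[:, 1])
    print(f"n_estimators={n:4d}  roc_auc={auc:.4f}")
```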

These results are disappointing. Although there are absolutely no False Positives, a majority of plugs are getting missed. This kind of model would not help our case.

Gradient Boosting

Under this category, we will use the following 2 classifiers:

1. XGBoost (XGBClassifier)

2. LightGBM (LGBMClassifier)

Both of these classifiers work on the same principle of sequentially improving models where a new model learns from the previous model’s mistakes.

They are often used with data that has a number of features of the order we have here, and they provide very strong results.

Prediction is also fast with these models.

XGBoost

First, we will try and find the most suitable parameters using RandomizedSearchCV, as there are a lot of hyperparameters that can be tweaked.

The best parameters are:

{‘subsample’: 1, ‘n_estimators’: 100, ‘max_depth’: 10, ‘learning_rate’: 0.05, ‘colsample_bytree’: 0.3}
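A sketch of the kind of search used here (the parameter grid below is an assumption that merely includes the best values reported above):

```python
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

param_dist = {
    "n_estimators": [100, 300, 500],
    "max_depth": [3, 5, 10],
    "learning_rate": [0.01, 0.05, 0.1],
    "subsample": [0.5, 0.8, 1],
    "colsample_bytree": [0.3, 0.5, 1],
}
search = RandomizedSearchCV(
    XGBClassifier(),
    param_distributions=param_dist,
    n_iter=20, scoring="roc_auc", cv=3, n_jobs=-1, random_state=42,
)
search.fit(x_train, y_train)
print(search.best_params_)
```

The same kind of search, with LGBMClassifier in place of XGBClassifier, would apply to the LightGBM fit further below.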

The performance of the model with the specifications given above is as follows:

Accuracy: 0.9735

ROC_AUC: 0.9501

F1 score: 0.4964

This result is better than Random Forest but still not as good as the models we have seen before. Tree-based classifiers seem to share a tendency to be biased towards non-plugs, at least with the way we are training them.

LightGBM

Again, we will first find the best parameters of this model for this task.

The best parameters are:

{‘subsample’: 1, ‘n_estimators’: 500, ‘max_depth’: 10, ‘learning_rate’: 0.05, ‘colsample_bytree’: 0.5}

The performance of the model trained using the parameters given above is as follows:

Accuracy: 0.9758

ROC_AUC: 0.9521

F1 score: 0.5532

This result is better than XGBoost but is still not the best we have seen. We will have to think of different approaches.

Stacked Models

We will use the following combinations of models in the hope of getting better results:

1) Gaussian Naive Bayes + Random Forest

2) Gaussian Naive Bayes + K Nearest Neighbors

3) Gaussian Naive Bayes + Support Vector Machines

4) Gaussian Naive Bayes + Deep Neural Network (Model 2)

5) Gaussian Naive Bayes + Deep Neural Network (Model 3)

Gaussian Naive Bayes + Random Forest

As discussed before, we have observed that GNB is biased towards plugs and RF is biased towards non-plugs.

In this stacked model, we will use GNB to predict plugs and will add the predictions as a feature to the original data.

This data will then be passed into a Random Forest classifier.

Our expectation is that the added feature would prevent the Random Forest Classifier from generating too many False Negatives.
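A minimal sketch of this stacking (dense arrays are assumed, since GaussianNB itself requires dense input; the RF settings are placeholders):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB(var_smoothing=0.01)
gnb.fit(x_train, y_train)

# Append the GNB prediction as an extra feature column
x_train_aug = np.column_stack([x_train, gnb.predict(x_train)])
x_unb_aug = np.column_stack([x_unb, gnb.predict(x_unb)])

rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
rf.fit(x_train_aug, y_train)
rf_preds = rf.predict(x_unb_aug)  # compared against y_unb below
```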

These are the confusion matrices generated after prediction by a GNB classifier trained on x_train with var_smoothing = 0.01.

It is clear that all the plugs are being correctly classified. We have a lot of false positives, which RF will hopefully take care of.

The next step is to generate predictions for x_unb and add those as a feature to the same. This new data is passed into Random Forest, and the predictions of this model are compared with y_unb. The details of the performance are shown below:

Accuracy: 0.9735

ROC_AUC: 0.965

F1 score: 0.4651

Although there is a slight improvement in the expected direction compared to passing the original data into RF, it is not significant enough for us to consider this a solution. It is possible, though, that the effect we were aiming for did occur for certain points.

Gaussian Naive Bayes + K Nearest Neighbors

We will pass the data we used for the RF classifier in the previous case, into a KNeighborsClassifier with K = 1. This data contains the GNB predictions as a feature. The pipeline is the same with only a replacement of Random Forest by KNN. The performance is shown below.

Accuracy: 0.9789

ROC_AUC: 0.8713

F1 score: 0.7208

The number of False Negatives did not decrease and the number of False Positives increased. It seems that some points which would have been correctly classified as non-plugs without the additional feature are now being classified as plugs. This change is not useful for us.

Gaussian Naive Bayes + Support Vector Machines

The purpose of this stacking is to specifically solve the problem of extra False Positives generated by GNB.

We will pass only the points (from the original data) that were predicted as plugs by GNB, into SVC.

SVC is a good option when the number of training instances is of the order of the number of features.

We hope it will be able to learn a distinction between the False Positives and True positives generated by GNB. If it is able to separate these two categories, we will have a strong result.
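A minimal sketch of how x_sep_train and x_sep_unb can be constructed and used (the SVC settings are placeholders, since the exact value of C used here isn’t stated):

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

gnb = GaussianNB(var_smoothing=0.01).fit(x_train, y_train)

# Keep only the points that GNB predicts as plugs
train_mask = gnb.predict(x_train) == 1
x_sep_train, y_sep_train = x_train[train_mask], y_train[train_mask]

unb_mask = gnb.predict(x_unb) == 1
x_sep_unb, y_sep_unb = x_unb[unb_mask], y_unb[unb_mask]

svc = SVC(C=1000).fit(x_sep_train, y_sep_train)  # C value is a placeholder
sep_preds = svc.predict(x_sep_unb)               # evaluated against y_sep_unb
```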

We are training the SVC model on x_sep_train and evaluating the metrics on x_sep_unb:

Accuracy: 0.9395

ROC_AUC: 0.9366

F1 score: 0.7152

Although this combination has eliminated most False positives, it is resulting in 40/94 False negatives. Overall, this model will not be of much help to us.

By this point, we’ve explored a variety of machine learning models and their combinations. There still seems to be some room for improvement.

Enter, Neural Networks

Deep Learning models

Model 1

This model uses the original data with product spaces as one-hot-encoded features in order to predict if an item-space pair is a plug or not.

The loss used is ‘binary cross-entropy’. The metric used is ‘accuracy’.

The optimizer being used is SGD (learning rate = 0.01).

The model architecture/parameter count is shown below:
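Since the summary screenshot is not reproduced here, the following is only a sketch of a feed-forward network of this kind; the layer sizes and training settings are assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

n_features = x_train.shape[1]

model = keras.Sequential([
    layers.Input(shape=(n_features,)),
    layers.Dense(512, activation="relu"),   # assumed sizes
    layers.Dense(256, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # plug / non-plug
])
model.compile(
    optimizer=keras.optimizers.SGD(learning_rate=0.01),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
model.fit(x_train, y_train, validation_split=0.2, epochs=50, batch_size=64)
```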

Here is the progress of the model:

Clearly, the performance of the model on the train and validation sets is very good. The performance of this model on x_unb is as follows:

Accuracy: 0.9792

ROC_AUC: 0.9489

F1 score: 0.7188

This is a very promising result.

All metrics are amongst the best encountered yet.

Let us explore a little more to see if we can improve these metrics further.

Gaussian Naive Bayes + Model 2

This model will use the data which has the feature containing the predictions made by GNB.

The loss is again ‘binary cross-entropy’. The metric is again ‘accuracy’.

The optimizer used is SGD (learning rate = 0.01).

The model architecture is given below:

The first hidden layer has twice as many neurons as the input, a choice that works well in practice. The next layer’s size is the closest power of 2 below that value, and all subsequent layers continue in decreasing powers of 2. There is one output neuron with sigmoid activation.
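A sketch of that sizing rule in code (the stopping size of 16 and the name x_nb_train for the data with the GNB feature are assumptions):

```python
import math
from tensorflow import keras
from tensorflow.keras import layers

n_features = x_nb_train.shape[1]         # data that includes the GNB-prediction feature

sizes = [2 * n_features]                 # first hidden layer: twice the input size
p = 2 ** int(math.log2(2 * n_features))  # closest power of 2 at or below that value
while p >= 16:                           # stopping point is an assumption
    sizes.append(p)
    p //= 2

model = keras.Sequential([layers.Input(shape=(n_features,))])
for s in sizes:
    model.add(layers.Dense(s, activation="relu"))
model.add(layers.Dense(1, activation="sigmoid"))

model.compile(
    optimizer=keras.optimizers.SGD(learning_rate=0.01),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
```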

Given below are the graphs for training and validation accuracy and loss. The validation accuracy reaches a value > 0.99, which is very good.

The performance of this model on x_nb_unb is shown below:

Accuracy: 0.9808

ROC_AUC: 0.9558

F1 score: 0.7449

Best result so far!

Our overall accuracy, roc_auc_score, and f1 score are the highest.

This pipeline would also be fast, given the short training and prediction times of both models involved.

There is one observation at this point, though. Most of the models that are performing well are misclassifying around 20–30 plugs. One could explore whether these are the same points. If that is true, further study can be done into why all models get confused by these points. Let us try to confirm this hypothesis.

On comparing the false negatives generated by this model and the ones that were generated by KNN, I observed that there is a lot of overlap between these two sets, which suggests that there is some common property of these items/spaces that is confusing the models. A deeper study can be conducted on this topic.
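A minimal sketch of this comparison, assuming both models’ predictions are aligned with y_unb (the prediction variable names are placeholders):

```python
import numpy as np

def false_negative_indices(y_true, y_pred):
    """Indices of plugs (label 1) that the model predicted as non-plugs (0)."""
    return set(np.where((np.asarray(y_true) == 1) & (np.asarray(y_pred) == 0))[0])

fn_model2 = false_negative_indices(y_unb, model2_preds)
fn_knn = false_negative_indices(y_unb, knn_preds)

overlap = fn_model2 & fn_knn
print(f"{len(overlap)} of {len(fn_model2)} plugs missed by Model 2 are also missed by KNN")
```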

Gaussian Naive Bayes + Model 3

This model is going to try and separate the non-plugs from the plugs among all the points predicted as plugs by GNB. It will only take relevant data points as input, the same as those taken for GNB + SVC.

The loss used is ‘binary cross-entropy’. The metric used is ‘accuracy’.

The optimizer is SGD (learning rate = 0.01).

The model architecture is given below. This architecture was chosen after experimentation with variations.

The progress of the model looked like this,

The performance of this model on x_sep_unb is shown below:

Accuracy: 0.9283

ROC_AUC: 0.9093

F1 score: 0.7018

The performance of this model is decent.

By and large, it is able to separate the points as required. However, the overall result is not significantly better than model 2. The misclassified plugs are 34/94 and misclassified non-plugs are 21/2508.

This stacking can also be considered as a solution.

Up to now, we have been employing the first strategy for using the data (binary classification). We will explore the alternative approach (multi-class classification) in this section. Please refer to Part 1 for more details.

Model 4

This model will take in the image feature embeddings as input, one-hot-encoded product spaces as output, and try to learn the item -> product_space mapping.

We will then predict the product space for each query point in the test set and use the plug labels to determine whether the prediction is permissible.

For this, the training (and CV) data will consist of the first 11,000 non-plugs. This data is upsampled to twice its size, to ensure that no class has only one training instance and to improve training.

The test data will consist of all (471) plugs and the remaining non-plugs.

Given below is the model architecture:

The first hidden layer contains a number of neurons equal to the power of 2 just below twice the number of inputs. The remaining layers have neurons in decreasing powers of two, and the last layer has 298 outputs (the number of product spaces) with SoftMax activation.

The loss used is ‘categorical cross-entropy’. The metric used is ‘accuracy’.

The optimizer used is Adam (learning rate = 0.001).
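A sketch of this multi-class model and the subsequent plug check; the concrete layer sizes and the names x_img_train, y_space_train_ohe, x_img_test and actual_space_ids are assumptions:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_features = x_img_train.shape[1]  # image feature embeddings
n_spaces = 298                     # number of product spaces

model = keras.Sequential([
    layers.Input(shape=(n_features,)),
    layers.Dense(2048, activation="relu"),  # assumed concrete sizes
    layers.Dense(1024, activation="relu"),
    layers.Dense(512, activation="relu"),
    layers.Dense(n_spaces, activation="softmax"),
])
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
model.fit(x_img_train, y_space_train_ohe, validation_split=0.2, epochs=30, batch_size=64)

# An item is flagged as a plug when the predicted space differs from the
# space it was actually found in.
pred_spaces = np.argmax(model.predict(x_img_test), axis=1)
flagged_as_plug = (pred_spaces != actual_space_ids).astype(int)
```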

The progress of this model can be found below:

The accuracy of this model seems to be very high. We can assume that it has learned the item -> product space mapping adequately. Here are the results on the test data:

Accuracy: 0.8466

ROC_AUC: 0.7474

F1 score: 0.6316

Not so thrilled.

The ratio of misclassified non-plugs is much higher than that in the previous approach and the Recall is only ~0.56.

We can try and increase the number of training instances.

Model 5

Now, we will pass almost all non-plugs into the model for training and detect whether the predicted product spaces differ from those recorded for the plugs.

100 non-plugs are still kept in the test set as a sanity check that the model is actually predicting the correct product space.

The model architecture is identical to the one used in Model 4.

The training and cross-validation curves for CCE loss and accuracy can be seen below:

Again, the accuracy of the model is high enough. Given below, is the performance of the model on the test set:

Accuracy: 0.6322

ROC_AUC: 0.7574

F1 score: 0.717

The classification accuracy for non-plugs has understandably improved. However, the accuracy for plugs has decreased slightly.

This means that, for these plugs, the model predicts the very product space recorded in the data. A hypothesis arises that the model is learning the same confusion a person who misplaced the item would have faced. This can be tested through a deeper study of the plugs.

Conclusions

1) We have tried to solve the problem by using the data in two ways. As per our observations, using the product spaces as one-hot-encoded features is better.

2) Even after learning a proper mapping between items and product spaces (evident from the accuracy), the type-II error is prevalent in the predictions. This could be due to the fact that misclassified items are very similar to the items found in the product spaces that the neural network is predicting for them. This might explain why they were wrongly placed by humans in the first place.

3) There seems to be a trade-off between the Precision and Recall for this problem, which is concluded after seeing similar results for multiple models.

4) Some plugs seem to be especially susceptible to type-II error.

5) Overall, multiple models are giving satisfactory results; individual models can be selected depending on the requirements of the store manager. These models are:

a) KNeighborsClassifier with K = 1. If the retailer does not have a requirement for very low latency, this model can give good predictions overall.

b) Support Vector Classifier with C = 1000. If the retailer does not mind missing <= 30% of the plugs, this solution can be used.

c) Support Vector Classifier with C = 1. If the retailer does not mind checking many False positives in order to have a Recall of ~ 86%, this solution can be used.

d) Gaussian Naïve Bayes (var_smoothing = 0.01) + Model 2. This is a good solution overall and can be used without reservations apart from the fact that the number of False Negatives is not the best we have seen.

e) Gaussian Naïve Bayes (var_smoothing = 0.01) + Model 3. The same goes for this solution. However, there are a few extra False positives in this solution compared to the last.

Further Exploration

  • As observed above, certain plugs seem to be especially susceptible to misclassification. Given the exact images for the items and information about product spaces, one can study why the type-II error arises under both approaches to using the data.
  • More data for plugs can be generated by intentionally misplacing items and using the images of the same.
  • A Few-shot learning (FSL) based approach can be explored, where we build a prototype for each product space based on the embeddings provided. We can use the similarity/distance of any new image embedding to identify the product space it actually belongs to, and in turn decide whether it is a plug or not (a minimal sketch of this idea follows this list).
  • Different degrees of upsampling can be tried out for different models.
  • More diverse compound architectures can be explored with better computational resources.
  • Different architectures for image classification can be used for generating feature embeddings or directly mapping to product spaces. Various techniques such as augmentation and pre-processing can be used to selectively reduce errors.
  • A multi-input convolutional and dense architecture can be used to combine image features and product space features in the same model to predict plugs directly.
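As promised in the FSL bullet above, here is a minimal sketch of the prototype idea (plain nearest-centroid matching on the embeddings; all names are placeholders):

```python
import numpy as np

def build_prototypes(embeddings, space_ids):
    """Mean embedding per product space."""
    return {
        space: embeddings[space_ids == space].mean(axis=0)
        for space in np.unique(space_ids)
    }

def predict_space(embedding, prototypes):
    """Return the product space whose prototype is closest (Euclidean distance)."""
    spaces = list(prototypes)
    dists = [np.linalg.norm(embedding - prototypes[s]) for s in spaces]
    return spaces[int(np.argmin(dists))]

# An item would be flagged as a plug when predict_space() disagrees with the
# product space it was actually found in.
```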

Thank you for reading up to this point. A lot of effort went into this case study and I will be glad if my findings are useful for you in some way. Kudos for your patience!

Here are the resources that really helped me create this work.

References

GitHub Link: Please click here.

LinkedIn profile: Please click here.
