Can We Predict a Price Adjustment in an Online Supermarket by Using Machine Learning and Econometrics?

Part 2: Machine Learning approach

Kyosuke Morita
Analytics Vidhya
9 min read · Oct 2, 2019


Photo by Joshua Rawson-Harris on Unsplash

This series walks through my Master's dissertation, which was not only my first end-to-end machine learning project but also a bridge between my econometrics and machine learning studies. My dissertation tried to predict firms' price change behaviour. This story, following Part 1 (the econometrics approach), shows how I tackled this problem with machine learning. If you are keen to have a look, the full text is available on ResearchGate.

Content

  1. ML approach — Time-series Classification Problem
  2. Imbalanced data
  3. Results
  4. Conclusion and Next steps

ML approach — Time-series Classification problem

Before going into a deep discussion, let me briefly look back at what I was trying to achieve. In my dissertation, I tried to build a model that predicts whether the prices of items will be adjusted today (although now I think it would have been better to predict price adjustments for tomorrow). I obtained a dataset of daily prices from online supermarkets in America and South American countries. Part 1 of this series used econometric methods to understand this problem, and it showed that the price of an item and the time since its last price adjustment are among the most important features.

To set this problem up as a machine learning question, I need to create a target variable. As we want to predict the price adjustment for today, I created one by comparing today's and yesterday's prices (please have a look at Part 1). Next, we divide the dataset into train and test sets. As this is a time-series classification problem, those sets have to be sequential; otherwise you will have data leakage from the future and won't be able to build a valid model. The following shows roughly how I split the dataset.
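Here is a minimal sketch of both steps, assuming a pandas DataFrame df with columns item_id, date and price (these names and the cutoff date are illustrative, not the dissertation's actual schema):

import pandas as pd

df = df.sort_values(["item_id", "date"])

# Target: 1 if today's price differs from yesterday's price for the same item
df["price_adjusted"] = (
    df.groupby("item_id")["price"].diff().fillna(0).ne(0).astype(int)
)

# Sequential split: train strictly before the cutoff, test from the cutoff on,
# so no information leaks from the future into training
cutoff = pd.Timestamp("2016-01-01")  # illustrative cutoff date
train, test = df[df["date"] < cutoff], df[df["date"] >= cutoff]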

Cross-validation for a time-series model also has to be handled with caution, for the same reason. I implemented 10-fold cross-validation and used it to tune some hyperparameters.
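One way to keep the folds sequential is scikit-learn's TimeSeriesSplit; a sketch, assuming x_tr and y_tr are NumPy arrays (the dissertation does not name the exact splitter, so take this as one possible implementation):

from sklearn.model_selection import TimeSeriesSplit

# Each validation fold comes strictly after its training fold in time
tscv = TimeSeriesSplit(n_splits=10)
for train_idx, val_idx in tscv.split(x_tr):
    x_fold_tr, x_fold_val = x_tr[train_idx], x_tr[val_idx]
    y_fold_tr, y_fold_val = y_tr[train_idx], y_tr[val_idx]
    # fit on the fold's training part, score on its validation part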

Imbalanced Data

As intuition suggests, prices in an online supermarket are not adjusted very often (certainly not every day). That intuition is correct: in fact, prices were adjusted only around 2% of the time. This means that our binary target variable has a severe class imbalance. When I first built a model and ran the test, yay, I got 98% accuracy! Then I checked the confusion matrix: my model was simply predicting the negative class for everything. To make the model actually learn, we need to handle the imbalanced data, and there are several ways to deal with it (a toy illustration of the accuracy trap follows the list below).

  1. Use a tree-based algorithm
  2. Resampling
  3. Use appropriate metrics
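Before going through these, here is the accuracy trap from above in code form (the arrays are made up purely for illustration):

import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# 98 negatives, 2 positives: roughly the class balance in the price data
y_true = np.array([0] * 98 + [1] * 2)
y_pred = np.zeros(100, dtype=int)  # a "model" that always predicts no change

print(accuracy_score(y_true, y_pred))    # 0.98, which looks great...
print(confusion_matrix(y_true, y_pred))  # ...but not a single positive is caught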

Tree-based algorithms can, by their nature, perform quite well even on an imbalanced dataset, but results can be even better when they are combined with resampling methods (it is not always better, just a general tendency, and linear models can sometimes beat tree-based algorithms, but again that is only a general observation). There are several resampling families: oversampling, undersampling, synthetic sample generation, and combinations of them.

In my dissertation, I implemented random oversampling, random undersampling, the synthetic minority oversampling technique (SMOTE), the edited nearest neighbour rule (ENN), SMOTE + ENN, SMOTE + Tomek links and adaptive synthetic sampling (ADASYN); the figures below show the results.

We can notice some patterns in those figures; for example, SMOTE's k-nearest-neighbour interpolation created many positive-class samples, while ENN did not remove many negative-class samples.

Here is a brief introduction to the above sampling methods, for those who are curious about what they are.

Random oversampling:

This algorithm randomly chooses minority-class samples and duplicates them with replacement. Since it creates exact copies of existing samples, this method tends to overfit.

Random undersampling:

This algorithm randomly selects majority-class samples and removes them to balance the class distribution. The main disadvantage is that it can remove potentially useful information. To overcome this drawback, Wallace et al. (2011) suggested combining undersampling with bagging, which makes it much less likely that useful information is thrown away (this method is available in imblearn.ensemble). However, undersampling can significantly increase false positives in practice because of sampling bias, and Pozzolo et al. (2015) suggested using Bayes Minimum Risk theory to find the correct classification threshold.
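The bagged-undersampling idea is implemented in imbalanced-learn as BalancedBaggingClassifier; a minimal sketch (the hyperparameters are illustrative):

from imblearn.ensemble import BalancedBaggingClassifier

# Each bagged estimator is trained on a bootstrap sample whose majority
# class has been randomly undersampled, so no single draw discards all
# of the potentially useful majority-class information
clf = BalancedBaggingClassifier(n_estimators=10, random_state=123)
clf.fit(x_tr, y_tr)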

Synthetic minority oversampling technique (SMOTE):

SMOTE was proposed by Chawla et al. (2002) to overcome the drawbacks of random oversampling, and it performs quite well in the literature (Chawla, 2003). SMOTE creates synthetic minority samples by interpolating between a randomly chosen minority sample and one of its k nearest neighbours. Overfitting is therefore less likely, although because it interpolates towards nearest neighbours, synthetic samples can end up inside the majority class's region.

Edited nearest neighbour rule (ENN):

ENN is another method that applies the k-nearest-neighbour algorithm (Wilson, 1972). A majority-class observation is removed whenever its class differs from that of at least two of its three nearest neighbours.

SMOTE + ENN:

This method is a combination of SMOTE and ENN: after SMOTE is applied, the ENN algorithm removes majority-class samples. Its benefit over other hybrid methods, namely SMOTE + Tomek links, is that ENN removes more observations than Tomek links, so it achieves deeper data cleaning.

All of those resampling methods are available in the Python library imbalanced-learn (imblearn). The following is a sample snippet.

from imblearn.over_sampling import RandomOverSampler

# Resample the training data only; fit_resample returns the balanced copies
ros = RandomOverSampler(random_state=123)
x_resampled, y_resampled = ros.fit_resample(x_tr, y_tr)

Note: be careful to apply resampling methods only to your train dataset.
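The other samplers mentioned above all share the same fit_resample interface, so swapping methods for comparison is a one-line change (a sketch using default settings rather than the dissertation's exact configurations):

from imblearn.over_sampling import SMOTE, ADASYN
from imblearn.under_sampling import RandomUnderSampler, EditedNearestNeighbours
from imblearn.combine import SMOTEENN, SMOTETomek

# Every sampler exposes fit_resample, so they are easy to compare in a loop
samplers = [
    SMOTE(random_state=123),
    RandomUnderSampler(random_state=123),
    EditedNearestNeighbours(),
    SMOTEENN(random_state=123),
    SMOTETomek(random_state=123),
    ADASYN(random_state=123),
]
for sampler in samplers:
    x_res, y_res = sampler.fit_resample(x_tr, y_tr)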

As mentioned above, the accuracy metric (number of correct predictions / total number of samples) cannot measure the model's true predictive power here. Therefore, I used the following metrics in the dissertation.

  1. Precision
  2. Recall
  3. Average Precision (AP) Score
  4. Cohen’s Kappa
  5. (F1-score)

Precision shows how accurate your positive-class predictions are, and recall shows how much of the actual positive class the model catches. Both can be read off a confusion matrix and calculated as below.

A confusion matrix tabulates predictions against true labels as true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN), and the two metrics are:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

For example, with TP = 20, FP = 10 and FN = 30, precision is 20/30 ≈ 0.67 while recall is only 20/50 = 0.40.

Using those two metrics, you can calculate the AP score:

AP = Σ_n (R_n − R_{n−1}) P_n

where R_n and P_n are the recall and precision at the nth threshold. The rest of the metrics mentioned above are also handy: Cohen's Kappa takes the class distribution into account, so it measures performance on imbalanced data correctly. I didn't use the F1-score in my dissertation, but it is a standard and very useful metric.

Those metrics can be calculated by using scikit-learn.

from sklearn.metrics import precision_score, recall_score, average_precision_score, cohen_kappa_score, f1_score
precision_score(y_true, y_predicted)
recall_score(y_true, y_predicted)
cohen_kappa_score(y_true, y_predicted)
# average_precision_score expects scores or probabilities, not hard labels
average_precision_score(y_true, y_scores)

Results

Here I will show some of the results I got from those analyses. First, let's look at how the resampling methods affected model performance. I built models using simple algorithms, namely logistic regression and a decision tree, with almost the same features as in the econometric analysis (item price, days since the last price adjustment, day of the week, month, season, item category and item price bin). I used the resampling methods to create "balanced" training sets, trained on those, and then evaluated the models on the untouched test set.
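The comparison loop looks roughly like this (a sketch reusing the variable names from the snippets above; models and metrics are illustrative, not the dissertation's exact setup):

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import average_precision_score, cohen_kappa_score

models = {"logit": LogisticRegression(max_iter=1000),
          "tree": DecisionTreeClassifier(random_state=123)}
for name, clf in models.items():
    # train on the raw and the resampled set, score both on the same test set
    for label, (x, y) in {"raw": (x_tr, y_tr),
                          "resampled": (x_resampled, y_resampled)}.items():
        clf.fit(x, y)
        kappa = cohen_kappa_score(y_te, clf.predict(x_te))
        ap = average_precision_score(y_te, clf.predict_proba(x_te)[:, 1])
        print(f"{name} ({label}): AP={ap:.3f}, kappa={kappa:.3f}")

Below is the resulting comparison of model performance with and without resampling.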

The left-hand side shows the results of the logistic regression, and the right-hand side the results of the decision tree model. At a glance, you may notice that the decision tree model performs quite well with or without resampling, yet overall, training on the resampled datasets improved the scores compared with training on the imbalanced dataset. You may also notice that the scores are quite low: even the best AP score and Cohen's Kappa are only around 0.03. This shows the difficulty of the problem, and that there is a lot of room for improvement.

As a case study, I aggregated the dataset to weekly data, so the target variable became whether an item's price will be adjusted within the week. In this dataset, the class imbalance was mitigated to some degree (the positive class made up 30% of the data). I applied SMOTE for resampling and used AdaBoost; both precision and recall were around 65%, the AP score was 0.55 and Cohen's Kappa was 0.42, which are quite decent scores.
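That setup can be written compactly with an imblearn Pipeline, which applies SMOTE only during fit so the test data is never resampled (a sketch; hyperparameters are illustrative, not the dissertation's):

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import AdaBoostClassifier

# SMOTE runs only on the training data inside the pipeline's fit
model = Pipeline([
    ("smote", SMOTE(random_state=123)),
    ("clf", AdaBoostClassifier(random_state=123)),
])
model.fit(x_tr, y_tr)
y_pred = model.predict(x_te)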

I also checked feature importances when building the decision tree models. I created a separate model for each country, and their feature importances were similar to each other. As in the econometric approach, item price, duration since the last adjustment and item category were among the most important features.
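For reference, reading the importances off a fitted tree is a one-liner (model and feature_names are assumed from the earlier training step):

import pandas as pd

# Sorted feature importances of a fitted decision tree
importances = pd.Series(model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))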

Conclusion and Next Steps

This series of stories has walked through an end-to-end machine learning research project. The dissertation used both econometrics and machine learning to investigate firms' price adjustment behaviour in online supermarkets in America and South American countries. The econometric analysis explains why prices are adjusted and which predictors are most informative for predicting price changes, whereas the machine learning analysis focuses on the prediction itself and improves its predictive power by building on the econometric results.

For future research, adding several more variables looks promising. Existing literature documents facts that this thesis does not cover, such as the linkage between price changes and wage changes (Klenow and Malin, 2010). Also, as item prices are closely related to some economic indexes, we could incorporate those features to capture more state-dependent behaviour. In terms of machine learning methods, since this dissertation did not implement any complicated models, nor much feature engineering, there is quite a lot of room for improvement. Personally, I am quite interested in combining probability calibration (Bayes Minimum Risk theory) with random undersampling, as Pozzolo et al. (2015) suggested. This might be a good topic for the next blog post.

Wrap up

In this post,

  • introduced how to deal with a time-series dataset
  • introduced how to deal with an imbalanced dataset
  • showed my results comparing model performance with and without resampling

If you found this story helpful, interesting or whatever, please click the 👍 button :) Also, if you have any questions, feedback or literally anything, feel free to leave a comment below. I would really appreciate it. You can also find me on LinkedIn.

Reference

Wallace, B. C., Small, K., Brodley, C. E., and Trikalinos, T. A. (2011). Class imbalance, redux. In IEEE 11th International Conference on Data Mining.

Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). Smote: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16:321–357.

Chawla, N. V. (2003). C4.5 and imbalanced data sets: investigating the effect of sampling method, probabilistic estimate, and decision tree structure. In Proceedings of the ICML, volume 3, page 66.

Klenow, P. J. and Malin, B. A. (2010). Microeconomic evidence on price-setting. In Handbook of monetary economics, volume 3, pages 231–284. Elsevier.

Dal Pozzolo, A., Caelen, O., Johnson, R. A., and Bontempi, G. (2015). Calibrating probability with undersampling for unbalanced classification. In IEEE Symposium Series on Computational Intelligence.

Wilson, D. L. (1972). Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man, and Cybernetics, (3):408–421.
