Suggesting the price of items for online platforms using Machine Learning

A Machine Learning (Regression) case study based on the Mercari Dataset on Kaggle

Divyansh Jain
Analytics Vidhya
21 min read · Jul 18, 2020


Table of Contents

  1. Business Task
  2. Use of Machine Learning/Deep Learning in this task
  3. Evaluation Metric
  4. Exploratory Data Analysis
  5. Feature Engineering
  6. Existing Solutions
  7. My Experimentations
  8. Final Model
  9. Summary, results, and conclusions
  10. Future Work
  11. References
Photo by Aaron Burden on Unsplash

1. Business Task

The objective of this case study is to suggest an appropriate selling price to a seller who wishes to sell his/her product (usually pre-owned) on Mercari, an online platform that connects sellers to buyers.

This case study is based on the famous Kaggle Competition held in 2018: Mercari Price Suggestion Challenge (https://www.kaggle.com/c/mercari-price-suggestion-challenge)

Source: https://www.kaggle.com/c/mercari-price-suggestion-challenge/

Mercari is an online selling/buying platform (similar to OLX in India) where users can sell and buy used products: the seller gets a fair amount for a product that is no longer of much use to him/her, and the buyer gets the product at a lower cost compared to the market.

Source: https://www.mercari.com/about/

As part of this competition, we have to suggest an appropriate price to the seller for the product he/she wishes to sell on the Mercari platform. Here, by “appropriate”, we mean that the price should not be so high that no one buys that product and it should not be so low that the seller is not able to earn a significant profit.

The seller enters the details of the product he/she wishes to sell, like the product’s name, a short description, category, brand, shipping status, and the condition of the product.

When the seller enters these details, the Mercari platform returns an appropriate selling price of the given product to the user.

So, the task here is to return the price based on the entered details.

2. Use of Machine Learning/Deep Learning in this task

As a Machine Learning solution to this problem, we need to build a model that takes the product’s details as input and as output, predicts the selling price.

This can be modeled as a Regression problem in Machine Learning, where the task is to predict a real-valued output (price) for a product based on its similarities to already sold products.

Machine Learning is particularly helpful in this scenario since a new user might not know the prices at which products are usually sold on Mercari and thus may not be able to set a price accordingly. The Machine Learning model uses the knowledge learned from products previously sold on the Mercari platform and suggests an appropriate price to the user.

Dataset Source and Description:

The dataset used for training the model is the Kaggle competition’s publicly available dataset (https://www.kaggle.com/c/mercari-price-suggestion-challenge/data).

The dataset consists of nearly 1.5 million rows with each row consisting of the following fields:

  1. train_id: A unique ID identifying the given listing
  2. name: This is the main title/name of the product as listed by the seller (text format)
  3. item_condition_id: A number in the range of 1–5 denoting how good (or bad) the item’s condition is.
  4. category_name: Text field describing the categories (and subcategories) in which the given product can fall into. It is in the form of a hierarchical structure, category1/subcategory1/subcategory2… like Men/Tops/T-Shirts.
  5. brand_name: Text field describing the brand of the product (like Louis Vuitton)
  6. shipping: Binary variable (1/0) where 0 denotes that the shipping fee is paid by the buyer (e.g., $10 + shipping extra) and 1 denotes that it is paid by the seller (e.g., $11 with free shipping)
  7. item_description: The complete text description of the item. This includes all the details that the seller wants to list so that the buyer is persuaded to buy the item. This is the most important part of the whole data, as buyers often decide whether to buy a product by reading its description.
  8. price (to be predicted): The target value to be predicted as part of the business problem (in US Dollars)

An example data point looks like:

Example Datapoint

3. Evaluation Metric

The evaluation metric used here is Root Mean Squared Logarithmic Error (RMSLE). If x_i is the predicted price and y_i is the actual price of the i-th item, then the RMSLE over n items is given by:

RMSLE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \log(x_i + 1) - \log(y_i + 1) \right)^2}

Source: https://medium.com/analytics-vidhya/root-mean-square-log-error-rmse-vs-rmlse-935c6cc1802a

You must be wondering: what is the need for such a complex formulation? Why can’t we simply use Root Mean Squared Error (RMSE), which is usually the default evaluation metric for regression tasks?

Actually, RMSLE is a very good metric in this case due to the following properties:

  • Robustness to outliers: There are some data points with a very high selling price (jewelry items, antiques, etc.) in our dataset. The log term in the metric helps to bring all the prices to a more uniform scale and thus prevents the model from being biased towards outliers.
  • Calculates relative error instead of absolute error: A prediction of 8 against 10 and a prediction of 800 against 1000 incur the same penalty under RMSLE, unlike RMSE. This follows from the logarithmic property: log(8) - log(10) = log(8/10) = log(800/1000) = log(800) - log(1000)
  • Incurs a larger penalty for the underestimation of the actual variable than the overestimation

The last point is particularly helpful for us because if the algorithm predicts a price of $9 for a product that could easily be sold for $10, it is a loss for the seller, and he/she may not visit the platform again. On the other hand, if the algorithm predicts a price of $11 for the same product, then if a customer buys it, both the seller and Mercari earn more, while if it is not sold, the seller can always reduce the price later.

In the above case, the RMSLE for the underestimation case will be 0.105, while, that for the overestimation case is 0.095.
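A quick numeric check of these values (a minimal sketch using the plain-log, single-prediction form of the metric):

```python
import numpy as np

# Single-prediction RMSLE, plain-log form (ignoring the +1 smoothing)
def single_rmsle(predicted, actual):
    return abs(np.log(predicted) - np.log(actual))

print(single_rmsle(9, 10))   # underestimation: ~0.105
print(single_rmsle(11, 10))  # overestimation:  ~0.095
```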

The detailed explanations for these properties are there in this great blog: https://medium.com/analytics-vidhya/root-mean-square-log-error-rmse-vs-rmlse-935c6cc1802a

4. Exploratory Data Analysis (EDA)

(Quote by John Tukey; source: https://libquotes.com/john-tukey/quote/lbu2t3k)

The quote above is by John Tukey, a great American mathematician and the creator of the box plot.

It summarizes the importance of EDA in the context of Data Science. As someone rightly said, “The better you know your data, the better you’ll be able to learn from it.”

4.1 Distribution of the target variable “price”

Probability Density Function (PDF) of the target variable, “price”

We notice that the product prices follow a heavily skewed distribution.

The majority of products have a low price, but there are a few products with extremely high prices. The average price is $30, while the maximum listed price is close to $2000.

This heavily skewed distribution indicates the presence of outliers. We have to confirm this before we start building Machine Learning models.

We plot the Cumulative Distribution Function (CDF) of “price” to check this.

CDF for Price

Observation: 90% of the items have a selling price of less than $200

On further analysis, it was found that 99.9% of items have a price of less than $450. Also, it is observed that the maximum listed price is close to $2000. This clearly indicates the presence of outliers in the data and hence we need to take some measures to solve this issue.

One simple workaround is to use log(1 + price) instead of price as the target variable (1 is added to avoid taking the log of 0 when the price is 0). This transformation helps us in two major ways:

  1. Reduces the effect of outliers in the data
Distribution of log(1+price)

Observation: The distribution has now become much more symmetric compared to the earlier one.

2. When we train Machine Learning regression models like Linear Regression, optimizing Mean Squared Error (MSE) on the log-transformed target variable indirectly optimizes Mean Squared Logarithmic Error (MSLE) on the original target, which (up to the square root) is exactly the evaluation metric of our task. This helps a lot in selecting the optimal hyperparameters for any model we train.

Hence, we will use log(1+price) instead of price as our target variable from now on.
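A minimal sketch of this transformation (the DataFrame here is a toy example; in practice it is applied to the full dataset):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": [10.0, 35.0, 200.0]})  # toy data

# Forward transform: models are trained on log(1 + price)
df["log_price"] = np.log1p(df["price"])

# Inverse transform: convert a model's log-space prediction back to dollars
pred_log = 3.0
pred_price = np.expm1(pred_log)  # ~19.09
```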

Now, we perform EDA on various features of the data to see if we find something interesting, and trust me, we will :)

4.2 EDA on Item Condition and Shipping

Item condition is a number from 1 to 5 indicating the condition of the item to be sold.

Histogram for item_condition

Observation: 98% of the items have an item condition between 1 and 3.

We’re not explicitly told whether 1 indicates a bad or a good condition, but since the Mercari platform mostly deals in secondhand products, 1 should indicate the worst condition and, therefore, 5 the best. To verify this, we see how the price (or, more precisely, the log price) varies with item_condition:

Price Variation across item_condition

Observation: There is not much variation in price across item_condition values, so this might not be an important feature.

Still, items with item_condition = 5 have a higher price than the others, which supports the assumption that item_condition = 5 implies the best condition.

Shipping

The shipping status is a binary variable, with 0 indicating that the shipping cost is paid by the buyer and 1 indicating that it is paid by the seller. Checking the distribution of the shipping variable, we find that the shipping cost is paid by the buyer for 55% of the items.

Let us see how this affects the price:

Price variation across shipping status

Observation: Listed price is higher when the shipping cost is paid by the buyer.

This is contrary to intuition: the listed price should be lower when the shipping cost is paid by the buyer (as he/she additionally pays the shipping cost).

4.3 EDA on Brand Name

Brand name plays a decisive role in the pricing of an item in real life. We confirm this through EDA on brand_name.

43% of the items are missing the brand_name value.

Top 10 most frequently occurring brands

Observation: 7 out of the 10 brands in the above plot are fashion brands.

This indicates that mostly clothing products are sold on the Mercari platform. We check how the price varies across the top 10 most frequently occurring brands:

Variation of price with the brand name

Observation: There is a lot of variability in price across brand names, since the box plots overlap very little.

This is in line with what we expected; hence, this is a potentially important feature. Among the top 10 brands, Apple shows a large variance in price, while Forever 21 (a fashion brand) shows quite a low variance.

4.4 EDA on Category

The category field consists of a hierarchical format in the form of main_category/sub_category_1/sub_category_2/…

Sample values for the category field

We perform some preprocessing on this field so that we can separate the different taxonomy levels and analyze them better.

Preprocessing Categories
Datapoints after pre-processing the category field
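A minimal sketch of this splitting step (the fallback value "missing" is an assumption):

```python
import pandas as pd

def split_category(category_name):
    """Split 'Men/Tops/T-shirts' into its three taxonomy levels."""
    try:
        return category_name.split("/", 2)
    except AttributeError:  # NaN / missing category
        return ["missing", "missing", "missing"]

df = pd.DataFrame({"category_name": ["Men/Tops/T-shirts", None]})
df[["general_cat", "sub_cat_1", "sub_cat_2"]] = pd.DataFrame(
    df["category_name"].apply(split_category).tolist(), index=df.index
)
```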

There are 11 unique values in the general_cat field

Histogram for most frequently occurring general categories

Observation: Nearly 50% of all the items present are Women’s products

The price variation among these is:

Price variation with category

Observations:
1. General Category does not impact pricing drastically
2. Still, Men’s products have a slightly higher selling price
3. Handmade products have a slightly lower selling price
4. There’s a large variance in Electronics items’ selling price

There are more than 100 unique sub_category_1 values and more than 800 unique sub_category_2 values, so we skip these in the visualizations.

4.5 EDA on Item Name and Description

These two feel like the most important fields in the data because they are free-text fields and hence contain the most valuable information.

We perform some text preprocessing (removal of special characters and stop words) on these and then visualize the preprocessed text with the help of WordCloud.

WordCloud on Item Name

WordCloud gives us the most frequently occurring words in the given text
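A minimal sketch of generating such a word cloud with the wordcloud package (names stands in for the series of preprocessed item names):

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

names = ["brand new nike shoes", "victoria secret bundle"]  # stand-in data

# One big string of all item names, then render the cloud
wc = WordCloud(width=800, height=400, background_color="white")
wc.generate(" ".join(names))

plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```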

Most frequently occurring words in the item name

Observations:
1. “Victoria” and “Secret”, the words of the third most frequently occurring brand, appear a lot in the item name, which indicates that sellers tend to use brand names in item names.

2. “Michael”, “Kor”, “Apple”, and “iPhone” also occur frequently. Since these brands are expensive, we can say that sellers tend to use pricey brands’ names in the item name to attract customers, which is understandable.

3. Sellers also use terms like “free shipping” and “brand new” to attract customers’ attention.

WordCloud on Item Description

Most frequently occurring words in Item Description

Observation:
Words like brand, new, free, shipping, great, condition are used to attract customers to buy the product

One thought that comes to my mind: does the length of the description also matter? If an item is costlier, do I need to write a longer description?

Price Variation with description_length

Observations:
1. Items with short descriptions tend to have lower prices.
2. Items with higher prices tend to have longer descriptions, but the converse is not always true.

5. Feature Engineering

This is the most important aspect of Machine Learning, as it is rightly said: “Your model is only as good as your features.”

The main goal of Feature Engineering is to generate new features from the existing data such that they are more useful for the given task. The performance of classical Machine Learning models depends significantly on how useful the engineered features are. The improvement is less significant for Deep Learning models, because neural networks can learn useful features automatically. The major prerequisite for training a good neural network is having enough data; if this is satisfied, even a very simple neural network can outperform a Machine Learning model trained on engineered features. We’ll see this in the later part of the blog.

Preprocessing

Basic preprocessing, like replacing NaN values with an empty string and keeping only the rows with prices between $3 and $2000 (rules of the Mercari platform), was performed on the data. One interesting strategy that worked brilliantly was to merge different text fields into combined text fields. This helped to control the number of text features generated by TF-IDF vectorization (we’ll see this later) and also improved the quality of these features, as all related textual information was present together (this idea is borrowed from the winners’ solution to this problem: 1st place solution | Kaggle).

Combining Text Fields
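A minimal sketch of this merging step, assuming df is the training DataFrame; the exact grouping of fields is illustrative, following the winners’ idea of keeping related text together:

```python
# Replace NaNs so that string concatenation works
for col in ["name", "brand_name", "category_name", "item_description"]:
    df[col] = df[col].fillna("")

# Combined field 1: short, title-like text
df["name"] = df["name"] + " " + df["brand_name"]

# Combined field 2: long text with description, category, and brand together
df["text"] = (
    df["item_description"] + " " + df["category_name"] + " " + df["brand_name"]
)
```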

Some basic text preprocessing was applied to the data, including decontraction; removal of special characters, emojis, and stopwords; conversion to lower case; and lemmatization.

Text Preprocessing
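A minimal sketch of such a pipeline, using NLTK’s stopword list and WordNet lemmatizer (the decontraction rules shown are just a sample):

```python
import re

from nltk.corpus import stopwords          # needs nltk.download("stopwords")
from nltk.stem import WordNetLemmatizer    # needs nltk.download("wordnet")

STOPWORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()
CONTRACTIONS = {r"won't": "will not", r"can't": "can not", r"n't": " not"}

def preprocess(text):
    text = text.lower()
    for pattern, replacement in CONTRACTIONS.items():  # decontraction
        text = re.sub(pattern, replacement, text)
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # drop special chars and emojis
    words = [lemmatizer.lemmatize(w) for w in text.split()
             if w not in STOPWORDS]           # stopword removal + lemmatization
    return " ".join(words)
```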

Engineering new features:

1. Sentiment Scores

How the seller writes the description of the item should determine how impressed the buyer is. A description that sounds “positive” (optimistic) and “detailed” (subjective) should attract the buyer’s attention. So, we calculate sentiment scores for the text description using the TextBlob package, since it offers both polarity and subjectivity features in its result.

Sentiment Scores using TextBlob
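A minimal sketch using TextBlob, whose sentiment property returns both polarity and subjectivity:

```python
from textblob import TextBlob

def sentiment_features(text):
    sentiment = TextBlob(text).sentiment
    # polarity is in [-1, 1], subjectivity is in [0, 1]
    return sentiment.polarity, sentiment.subjectivity

print(sentiment_features("Brand new, great condition, free shipping!"))
```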

2. Historical Price Statistics

This set of features is inspired by the Kaggle Kernel: https://www.kaggle.com/gspmoreira/cnn-glove-single-model-private-lb-0-41117-35th

The idea behind these features is to compute historical price statistics, like the mean and median, for a group of items having a common category, brand, shipping status, etc. This captures the general price trend that previously sold items follow, and the average price of these items can be used to predict the selling price of a new item falling into the same group.

For example, if the average selling price of POLO-brand men’s T-shirts (in average condition, with the shipping fee paid by the seller) was $10, then a newly added item with the same attributes is likely to have a price close to $10, and hence this is potentially a useful feature.

Historical Price Statistics Features
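A minimal sketch of these features, assuming train and cv DataFrames; the statistics are computed on the training split only and merged into both splits, to avoid leaking validation prices:

```python
group_cols = ["category_name", "brand_name", "shipping", "item_condition_id"]

# Historical price statistics per (category, brand, shipping, condition) group
price_stats = (
    train.groupby(group_cols)["price"]
    .agg(price_mean="mean", price_median="median", price_std="std")
    .reset_index()
)

train = train.merge(price_stats, on=group_cols, how="left")
cv = cv.merge(price_stats, on=group_cols, how="left")  # unseen groups -> NaN
```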

3. Length of Text Description

In the EDA, a positive correlation was observed between the length of the description text and the selling price of the item and hence this may be a potentially useful feature.

We should always check the usefulness of engineered features. One way is to measure the correlation between a feature and the target variable (price): the higher the correlation, the more useful the feature is likely to be. Another alternative is to use Scikit-learn’s built-in feature selection methods like SelectKBest.

Note: Both of the above are statistical methods and may not always be in line with the machine learning task we’re solving. For a more accurate selection, we should use techniques like Recursive Feature Elimination or, even better, SelectFromModel (where feature importances are calculated by training actual machine learning models, like Gradient Boosted Decision Trees, which are inherently capable of producing feature importances). But these come at a high cost in time and computational resources and hence were not used in this case study.

Correlation Matrix between features and target variable (price):

Plotting Correlation Matrix
Correlation matrix

Using SelectKBest to find the best features:

Feature importance by SelectKBest
SelectKBest Results
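A minimal sketch of scoring the engineered numeric features with SelectKBest (f_regression is a reasonable score function for regression; the column names are assumptions based on the features above):

```python
from sklearn.feature_selection import SelectKBest, f_regression

feature_cols = ["polarity", "subjectivity", "description_length",
                "price_mean", "price_median", "price_std"]

selector = SelectKBest(score_func=f_regression, k=3)
selector.fit(train[feature_cols].fillna(0), train["log_price"])

# Higher score -> potentially more useful feature
for name, score in sorted(zip(feature_cols, selector.scores_),
                          key=lambda pair: -pair[1]):
    print(f"{name}: {score:.1f}")
```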

From both the correlation matrix and the SelectKBest results, it was observed that the historical price statistics features can be quite useful in predicting the price of a new item, and hence we will include these features while training the models.

6. Existing Solutions

Some of the Kaggle kernels with good featurization/modeling techniques are:

  1. Sparse MLP: This is the winners’ solution to this Kaggle problem. They used interesting feature engineering techniques and ensembles of sparse MLPs trained on sub-datasets to achieve the 1st position with an RMSLE of 0.387.
  2. Ridge Model: This solution uses a simple Ridge model trained on TF-IDF features for text and One Hot Encoding for categorical variables. It is an extremely simple and elegant, yet good, solution to this problem. The RMSLE obtained is 0.47.
  3. CNN Glove Single Model: This solution uses pre-trained word embeddings for text and One Hot Encoding for categorical features to train a single CNN model. It is unique because, in contrast with everyone else using ensembles, a single deep learning model is used here. The feature engineering techniques used are also very innovative (e.g., the historical price statistics features). The RMSLE obtained is 0.41 (~35th position).

7. My Experimentations

Data Preparation

A 75-25 random split is performed on the original training data to generate the new training and cross-validation sets.
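A minimal sketch of this split (naming the two parts train and cv, as assumed in the other snippets):

```python
from sklearn.model_selection import train_test_split

# 75-25 random split of the original training data
train, cv = train_test_split(df, test_size=0.25, random_state=42)
```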

Feature Encoding

After performing EDA, Feature Engineering, and going through the existing solutions, one thing was very clear: text is the most important part of the solution, and the results depend mainly on how well we convert text to features.

While the encoding of the categorical variables item_condition and shipping remained the same (One Hot Encoding) throughout the modeling process, category, brand_name, item_name, and item_description are encoded by joining these fields into two combined text fields, so that all related information is captured together during vectorization.

Combining text fields

This is also useful for limiting the number of features, which would be very large if category and brand_name were One Hot Encoded. On these combined text fields, both sparse and dense vectorizations are tried.

For sparse vectorization, I chose TF-IDF, as it works better than a simple Bag of Words representation in most cases.

TF-IDF vectorization (Sparse)
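A minimal sketch of the TF-IDF step, fitted on the training split only (the parameter values are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Separate vectorizers for the two combined text fields
name_vectorizer = TfidfVectorizer(max_features=100_000, ngram_range=(1, 2))
text_vectorizer = TfidfVectorizer(max_features=200_000, ngram_range=(1, 3))

X_tr_name = name_vectorizer.fit_transform(train["name"])
X_cv_name = name_vectorizer.transform(cv["name"])

X_tr_text = text_vectorizer.fit_transform(train["text"])
X_cv_text = text_vectorizer.transform(cv["text"])
```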

For dense vectorization, the two combined text features are vectorized using average Word2Vec: the 300-dimensional embeddings of the individual words (from the Word2Vec model pre-trained on Google News data) are averaged for each text.

Tokenizing the text and loading the pre-trained Word2Vec model

(The above snippet is a sample; in the code, the same procedure is applied to both “name” and “text”, i.e., the two combined text fields, for both the train and CV datasets.)

Generating average word vectors
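A minimal sketch of the averaging step, assuming the Google News binary is available locally:

```python
import numpy as np
from gensim.models import KeyedVectors

# 300-dimensional vectors pre-trained on Google News
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

def avg_word2vec(text, dim=300):
    """Average the vectors of in-vocabulary words; zeros for the rest."""
    vec, n_words = np.zeros(dim), 0
    for word in text.split():
        if word in w2v:
            vec += w2v[word]
            n_words += 1
    return vec / n_words if n_words else vec

X_tr_text_dense = np.vstack([avg_word2vec(t) for t in train["text"]])
```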

The training and cross-validation data matrices, on which all further models are trained, are generated by stacking all the above features (item_condition, shipping, and the sparse/dense encodings of the text features) together.

Creating Data Matrices
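A minimal sketch of the stacking step, assuming the TF-IDF blocks from above and one-hot matrices X_tr_cond and X_tr_ship for item_condition and shipping:

```python
from scipy.sparse import csr_matrix, hstack

X_tr = hstack((
    X_tr_name,              # TF-IDF of the combined "name" field
    X_tr_text,              # TF-IDF of the combined "text" field
    csr_matrix(X_tr_cond),  # one-hot item_condition_id
    csr_matrix(X_tr_ship),  # one-hot shipping
)).tocsr()
```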

Also, as discussed in the EDA part, we’ll use the log-transformed price as the target variable while training all the models.

Log Transform of the price variable

For calculating the RMSLE between actual and predicted values, the following function is used:

Root Mean Squared Log Error Calculation
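A minimal sketch of such a function; since all models are trained on log1p-transformed targets, the RMSE in the transformed space equals the RMSLE on the original prices:

```python
import numpy as np

def rmsle(y_true_log, y_pred_log):
    """RMSE on log1p-transformed targets == RMSLE on original prices."""
    return np.sqrt(np.mean((y_true_log - y_pred_log) ** 2))
```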

Modeling

The following models are trained using different featurization for the text:

1. Ridge Regression

The simplest model we can think of for a regression task is Linear Regression + regularization (to control overfitting). Here, L2 regularization (Ridge Regression) worked better than L1 regularization (Lasso Regression): the Lasso model did not converge even after a long time (possibly due to the very high number of features generated by sparse vectorization).

Ridge Regression model

After hyperparameter tuning on alpha, the best value found is alpha = 1, and it is used to train the final ridge model:

Ridge Regression best model’s training and prediction
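A minimal sketch of the final ridge model with the tuned alpha (the solver choice is an assumption):

```python
from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1.0, solver="sag")  # "sag" handles large sparse inputs well
ridge.fit(X_tr, y_tr_log)

print("CV RMSLE:", rmsle(y_cv_log, ridge.predict(X_cv)))
```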

The RMSLE obtained is 0.455, which is a decent score considering how simple the model is.

TF-IDF (sparse) vectorization is chosen for ridge regression instead of the dense Word2Vec representation, as simple linear models work well with high-dimensional features: it becomes easier to fit a hyperplane in a higher-dimensional space.

The Dense embeddings suffer from a major problem which we will address later.

2. Gradient Boosted Decision Trees:

One of the most powerful and widely used Machine Learning models for regression tasks is the GBDT. The popular XGBoost library is used to train the GBDTs.

For GBDT, we could choose either sparse or dense representations for the text. My intuition was that tree-based models like GBDTs would not perform well with the very large number of features generated by sparse representations, so we try both sparse and dense featurizations for the text here.

Sparse featurization (TF-IDF)

Scikit-learn’s RandomizedSearchCV was used to find the best set of hyperparameters.

Training XGBoost Model

Here, the randomized search is performed over a limited set of hyperparameters, as each of these models took about 3 hours to train.
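A minimal sketch of that search with RandomizedSearchCV (the parameter ranges are illustrative):

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBRegressor

param_dist = {
    "n_estimators": randint(100, 500),
    "max_depth": randint(3, 10),
    "learning_rate": uniform(0.01, 0.3),
    "subsample": uniform(0.6, 0.4),
}

search = RandomizedSearchCV(
    XGBRegressor(tree_method="hist"),
    param_distributions=param_dist,
    n_iter=5,                          # kept small: each fit is expensive
    scoring="neg_mean_squared_error",  # MSE on log targets ~ MSLE on prices
    cv=3,
)
search.fit(X_tr, y_tr_log)
print(search.best_params_)
```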

The final RMSLE obtained is 0.49, which is worse than that obtained with Ridge Regression. Also, while the ridge model took less than a minute to train, this took 3 hours.

We expected that tree-based models would not perform very well on a large number of features, so we also try dense vectorization for the XGBoost model.

To our surprise, the XGBoost model trained on dense average Word2Vec features gave an RMSLE of 0.55; this is not what was expected.

(Note: RMSLE is Root Mean Squared Log “Error”, so, the lower, the better)

Reason for the ineffectiveness of pre-trained dense embeddings

On further analysis, we found the root cause of why Dense vectorizations gave bad results irrespective of the machine learning model used.

As a general trend, dense vectorizations like Word2Vec tend to give better results than simple sparse vectorizations like TF-IDF. This is because the dense embeddings are themselves learned by a neural network (CBOW/Skip-gram) and hence are more expressive than sparse representations. Then what is the reason for this strange behavior?

After analysis, we found that most of the words in our corpus vocabulary were brand names, slang, and uncommon words specific to item descriptions, and hence these words were not present in the vocabulary of the pre-trained Word2Vec model, since it is trained on the Google News dataset.

While calculating the average Word2Vec, it was observed that only 30% of the words in the description vocabulary were present in the Word2Vec model’s vocabulary. For all the other words, a vector of all zeros was used while calculating the average word vector. This was the reason behind the ineffectiveness of the Word2Vec embeddings.

3. Some other alternatives

I also tried dimensionality reduction techniques like Truncated SVD, so that the most useful information could be extracted (in fewer dimensions) from the huge number of sparse dimensions produced by TF-IDF. This would preserve the information and also enable us to train more powerful tree-based ensembles like GBDTs (since these work better with lower-dimensional representations). This was not feasible, however: the dataset consists of 1.5 million rows, the time complexity of performing Truncated SVD is O(n³), and the memory requirements were way too high (more than 16 GB of RAM was needed to get effective representations of the sparse data).

Another choice was to train a CBOW/Skip-gram neural network on our text corpus from scratch, to generate word embeddings specific to the corpus we have, and then train a machine learning model on these embeddings. This would overcome the problem of uncommon words in the descriptions. I tried this, but the time and computational cost of generating these embeddings from scratch were way too high, so this idea also had to be dropped.

Hence, the only choice left was to use sparse embeddings like TF-IDF, and since these are generally very high dimensional, linear models like Ridge Regression produced better results than tree-based ensembles like GBDTs, as it is easier to fit a hyperplane in higher dimensions.

Using only the existing features for training all the models

Also, it was observed that the Ridge Regression model performed much better with the original features only (that is, without the price statistics features). A likely reason is that these features are highly correlated among themselves (price mean and price median have a correlation of 0.99), which violates the fundamental assumption of independent features that linear models make.

Tree-based models have no such restriction, yet they also performed better with only the original features. A plausible reason is that the average price was calculated over items with the same brand, category, item_condition, and shipping status, which may have caused the model to learn patterns specific to the training set and overfit (GBDTs are already very prone to overfitting), decreasing test performance. This reasoning might hold for the linear model as well, since the training error for both models was very low while the test error was very high, indicating strong overfitting.

Hence, we used only the original features, i.e., the combined text fields (name and text), item_condition, and shipping status, for training all the models.

4. Multilayer Perceptron on TF-IDF vectorization of text data

We can also train a neural network on the given data, since we have ample data. The simplest neural network is the Multilayer Perceptron (MLP). The winners of this competition also used this approach, so I thought I would give it a try as well.

The featurization used was the same as that used for the machine learning models (sparse featurization using TF-IDF for combined text fields and one-hot encoding for item_condition and shipping fields).

Since Keras 2.0 accepts sparse inputs directly, training the neural networks became very convenient.

MLP Model-1

One very simple architecture that I could think of is:

Building MLP Model-1

This is a very simple neural network with only 2 hidden layers, having 256 and 128 hidden units respectively. ReLU activation was used for the hidden-layer neurons and linear activation for the output neuron.

It was trained using the Adam optimizer for only 3 epochs, with an initial batch size of 256, doubling the batch size after every epoch (this idea was inspired by the winners’ solution to this Kaggle competition).

Training MLP Model-1
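A minimal sketch of this architecture and training schedule in Keras, assuming X_tr is the sparse training matrix and y_tr_log the log1p-transformed prices:

```python
from tensorflow import keras
from tensorflow.keras import layers

inp = keras.Input(shape=(X_tr.shape[1],), sparse=True)  # sparse input
x = layers.Dense(256, activation="relu")(inp)
x = layers.Dense(128, activation="relu")(x)
out = layers.Dense(1)(x)  # linear activation for regression

model = keras.Model(inp, out)
model.compile(optimizer="adam", loss="mse")  # MSE on log targets ~ RMSLE

# 3 epochs, doubling the batch size after every epoch
batch_size = 256
for _ in range(3):
    model.fit(X_tr, y_tr_log, batch_size=batch_size, epochs=1)
    batch_size *= 2
```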

We use the trained model to evaluate performance on the training and the cross-validation data:

Predictions from MLP Model-1

We obtain a test RMSLE of 0.41 with this very simple MLP, a big improvement over the 0.455 obtained earlier with the ridge regression model.

This is the power of deep learning, which can be leveraged when we have ample data. A simple MLP with only 2 hidden layers was able to perform implicit feature extraction and thus give a very good test RMSLE. One thing was of concern: the model was overfitting the training data. But since the test performance was very good, we ignore this problem for now. Since this simple model performed so well, we tried a more complex model to see if we could get even better performance.

MLP Model-2

Building MLP Model-2

This model is more complex than the previous one, consisting of six hidden layers with 1024, 512, 256, 128, 64, and 32 neurons respectively. All other aspects, such as activation functions, optimizer, batch size, and number of epochs, are the same as in the previous MLP model.

Training MLP Model-2
Predictions using MLP Model-2

We obtain a test RMSLE of 0.40 with the above model. This is also very good as far as performance on unseen data is concerned, but this model too suffers from overfitting, which is very common in neural networks. We build a weighted ensemble of the two models in an effort to reduce the overall RMSLE and get better performance.

8. Final Model

MLP Model-1 and Model-2
A weighted ensemble of the two MLP models

In the above snippet, we compute a weighted average of the outputs of the two MLP models, trying out different values for the weight. We select the weight that gives the lowest test RMSLE; in this case, we obtain w_best = 0.4. The final predictions are calculated as follows:
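A minimal sketch of the weight search, assuming pred1 and pred2 are the two models’ predictions on the cross-validation data (in log space):

```python
import numpy as np

weights = np.arange(0.0, 1.01, 0.05)
scores = [rmsle(y_cv_log, w * pred1 + (1 - w) * pred2) for w in weights]

w_best = weights[int(np.argmin(scores))]           # ~0.4 in this case study
final_pred = w_best * pred1 + (1 - w_best) * pred2
```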

We get the final test RMSLE as 0.398.

9. Summary, results, and conclusions

The RMSLE scores for all the above-mentioned models are as follows:

Results:

Model                                   Text featurization      RMSLE
Ridge Regression                        TF-IDF (sparse)         0.455
XGBoost                                 TF-IDF (sparse)         0.49
XGBoost                                 Average Word2Vec        0.55
MLP Model-1                             TF-IDF (sparse)         0.41
MLP Model-2                             TF-IDF (sparse)         0.40
Weighted ensemble of MLP-1 and MLP-2    TF-IDF (sparse)         0.398

The best performance (lowest RMSLE) is obtained by the weighted ensemble of the two sparse MLP models as mentioned earlier.

When this model is tested on the unseen test data on Kaggle (3.5 million rows), the RMSLE obtained is 0.405, which would place it in the top 1% of the leaderboard for this competition.

Kaggle Result

10. Future Work

We can work on the following areas for further improvement:

  1. Train a Skipgram/CBOW neural network to get dense embedding for each word specific to the given text corpus and then use these embeddings to train a tree-based ensemble model like GBDT.
  2. Try other neural networks like LSTM/1-D CNN models that are specifically designed for handling sequence data (like text data in this case).
  3. Perform extensive hyperparameter tuning for XGBoost models using GridSearch/RandomSearch.

Link to my profile:

The complete code can be found at this GitHub link. You can connect with me on LinkedIn. I can also be reached at divyanshjain.19@gmail.com.

Thank you for reading through this blog. I hope you have a great day :)
