Mercari Price Suggestion Challenge - A Machine Learning Regression Case Study

Oindrilla Ghosh · Published in Analytics Vidhya · 19 min read · Jan 24, 2020

This is my first Medium story. I hope you have as much fun reading it as I had writing it.

The modern age is the age of machine intelligence. Machine learning has taken the world by storm, and almost every field has felt its influence in some way. Today, I am going to take you through a real-world data science problem picked from a live Kaggle competition and demonstrate my way of solving it. This case study is worked through from scratch, so you will get to see every phase of how a case study is solved in the real world. Before I talk about my approach, I will briefly walk you through the problem statement of the case study in question.

Problem Statement

It can be hard to know how much something’s really worth. Small details can mean big differences in pricing. For example, one of these sweaters cost $335 and the other cost $9.99. Can you guess which one’s which?

An example of product features in Mercari

Product pricing gets even harder at scale, considering just how many products are sold online. Clothing has strong seasonal pricing trends and is heavily influenced by brand names, while electronics have fluctuating prices based on product specs.

Mercari, Japan’s biggest community-powered shopping app, knows this problem deeply. They’d like to offer pricing suggestions to sellers, but this is tough because their sellers are enabled to put just about anything, or any bundle of things, on Mercari’s marketplace.

Challenge to solve:

Given details about a product, such as its category name, brand name and item condition, can you build an algorithm that automatically suggests the right product price? Quite challenging, right?

But if solved well, it can remove the need for manual price suggestions and make the shopping app more efficient. That is where machine learning comes into play.

Mapping the real world problem to a Machine Learning Problem

Type of Machine Learning Problem:

For a given item, we need to suggest its price based on features such as category, name, brand name, item description, etc.
This is a regression problem, since the quantity to predict, the price of the item, is real-valued.

Error metric: RMSLE (Root Mean Square Logarithmic Error)
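
To make the metric concrete, here is a minimal sketch of how RMSLE can be computed, assuming NumPy; the function name rmsle is my own, not part of the competition code. Because the error is computed on the log scale, relative errors matter rather than absolute dollar differences.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error between actual and predicted prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(y) = log(1 + y), so zero prices are handled safely
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Predicting $12 for a $10 item and $120 for a $100 item contribute
# roughly the same relative error, unlike plain RMSE.
print(rmsle([10, 100], [12, 120]))
```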

Real world/Business Objectives and constraints

Objectives:

  1. Predict the price of an item given its condition, description and other related features.
  2. Minimize the difference between predicted and actual price (RMSLE)
  3. Try to provide some interpretability

Data

Data Overview:

Get the data from: https://www.kaggle.com/c/mercari-price-suggestion-challenge/data

Data files :

  • train.tsv (a tab-separated file)

Each row of the train.tsv file has the following attributes/features that list details about a particular product.

  • train_id or test_id - the id of the product
  • name - the name of the product
  • item_condition_id - the condition of the product provided by the seller
  • category_name - category of the product
  • brand_name - brand name of the product
  • price - the price that the product was sold for. (This is the target variable that you will predict) The unit is USD.
  • shipping - 1 if shipping fee is paid by seller and 0 by buyer
  • item_description - the full description of the item. Note that we have cleaned the data to remove text that look like prices (e.g. $20) to avoid leakage. These removed prices are represented as [rm]

Input features: train_id, name, item_condition_id, category_name, brand_name, shipping, item_description

Target Variable: price

We will build various supervised machine learning regression models and see which succeeds in solving the given mapping between the input variables and the price feature in the best way. Let’s begin in a step-by-step manner.

Step 1: Exploratory Data Analysis

The very first step in solving any data science case study is to look at and analyze the data you have. It gives valuable insights into the patterns and the information the data has to convey, and statistical tools play a big role in visualizing it properly. Although EDA is sometimes treated as a minor part of solving a problem, successful data scientists and ML engineers spend a large share of their time analyzing the data they have. Proper EDA reveals interesting characteristics of the data, which in turn influences our data preprocessing and model selection criteria as well.

At first, we begin by importing essential libraries that we would be needing to solve our problem.

Loading the data:

To load the data, we only need the train.tsv file. We will load it into a pandas dataframe.

Loading train.tsv into a dataframe
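
The screenshot above corresponds to code along the following lines; the dataframe name train and the local file path are my assumptions:

```python
import pandas as pd

# train.tsv is tab-separated, hence sep='\t'
train = pd.read_csv('train.tsv', sep='\t')

print(train.shape)  # roughly (1482535, 8)
train.head()        # first 5 rows
```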

The dataset contains 1,482,535 products along with their prices. The first 5 rows appear as shown above.

Let's look at some basic information about the data.

Brief overview of a row in the data

Checking for null values in the data set:

One has to check for missing values in the data set before using any machine learning model.

Shows which column in the data has null values

We can see that the columns ‘category_name’, ‘brand_name’ and ‘item_description’ have null values. There are different techniques for dealing with missing values, such as removing the rows with missing values, dropping features with a high percentage of missing values, or imputing the missing entries with some other value. Here, we have chosen to fill the missing values in the data with placeholder values.

Filling missing values in our data:

The function fill_missing_values fills the null entries in the columns ‘category_name’, ‘brand_name’ and ‘item_description’ with placeholder values.
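
A minimal sketch of what fill_missing_values might look like; apart from 'Not known' for the brand (which is mentioned later in this post), the exact placeholder strings are my assumptions:

```python
def fill_missing_values(df):
    # Replace nulls with placeholder values so the text/categorical
    # processing later on does not break on NaN entries.
    df['category_name'] = df['category_name'].fillna('Other/Other/Other')
    df['brand_name'] = df['brand_name'].fillna('Not known')
    df['item_description'] = df['item_description'].fillna('No description given')
    return df

train = fill_missing_values(train)
```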

As you can see in row 0, the ‘item_description’ value that was initially null has been replaced with a non-null placeholder.

Let us now do univariate analysis of every feature in the data set and look for hidden relationships (if any).

Price:

This is the distribution of the ‘price’ variable over all products.

Price Description

So, from the describe table we can conclude that:

  1. 25% of the products are priced below $10, 50% below $17 and 75% below $29.
  2. Also, the maximum price of any product is $2009.

Let us look at the histogram plot of the ‘price’ variable to get a better idea of its distribution.

Histogram of price distribution

It can be concluded that the distribution of the ‘price’ variable is heavily right-skewed. Because the ‘price’ variable follows such a skewed distribution, the competition uses Root Mean Squared Logarithmic Error (RMSLE) as its evaluation metric, which penalizes relative rather than absolute errors so that errors on low-priced products are not drowned out by errors on expensive ones. Accordingly, I applied a log transformation to the price target variable before model training; in other words, we scale the ‘price’ feature down to the log scale.

Changing the ‘price’ scale to log scale
Histogram with ‘log(price+1)’ scale
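
The transformation itself is a one-liner; a sketch, assuming the dataframe is called train:

```python
import numpy as np
import matplotlib.pyplot as plt

# log1p(price) = log(price + 1); the +1 keeps zero-priced items well defined
train['log_price'] = np.log1p(train['price'])

train['log_price'].hist(bins=50)
plt.show()
```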

On the log scale, the distribution no longer looks skewed.

Shipping:

The ‘shipping’ value of a product is either ‘1’ (shipping fee paid by the seller) or ‘0’ (paid by the buyer). Let us see how the shipping feature is distributed over all data points.

Here, we can see that for lower-priced items the shipping fee tends to be paid by the buyer, presumably for profit reasons, while as the price increases the shipping fee is more often paid by the seller. This matches the trend we usually observe when buying online, where free shipping applies only above a certain price threshold.

Item category:

Let's look at the categories that the items mostly fall into. For this, we will look at the 10 most frequent item categories in the product list.

From the statistics, it can be said that women's apparel has the maximum number of items, well ahead of every other category.

Category names use a ‘/’ delimiter separating the main category, sub-category 1 and sub-category 2 of a product. Therefore, to get a better idea of each product, we do some feature engineering here and split the category name into three columns: ‘general_cat’, ‘subcat_1’ and ‘subcat_2’.

category_name split into general_cat, subcat_1, subcat_2
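
A sketch of the split; the helper name split_category and the 'No Label' fallback are my own choices:

```python
def split_category(category_name):
    # e.g. 'Women/Athletic Apparel/Pants, Tights, Leggings' -> three parts
    try:
        general_cat, subcat_1, subcat_2 = category_name.split('/', 2)
        return general_cat, subcat_1, subcat_2
    except (AttributeError, ValueError):
        return 'No Label', 'No Label', 'No Label'

train['general_cat'], train['subcat_1'], train['subcat_2'] = \
    zip(*train['category_name'].apply(split_category))
```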

After splitting the category_name column, the unique items in each of the newly formed columns are listed below:

General Category:

Let's find out which general categories rank highest in terms of frequency of occurrence.

From the histogram above, it can be said that Women products occur with the maximum frequency, followed by Beauty products. The third largest general category is Kids products.

Subcategory 1:

In ‘Subcategory 1’, there are about 114 unique categories. Since it is difficult to visualize all the categories, we will have a look at the top 10 items with highest frequency in Subcategory-1.

Top 10 items in Subcategory 1

Since women's products occur with the maximum frequency in the general category, it is consistent that the most frequent category in Subcategory 1 is Athletic Apparel. The second most popular item in Subcategory 1 is Makeup, which also aligns with women's products being the most frequent.

Subcategory 2:

In ‘Subcategory 2’, there are about 865 unique categories. For better visualization and understanding, we will plot the top 10 items in Subcategory 2.

Top 10 items in Sub category 2

We know that Women products have the maximum frequency in the general category and that Athletic Apparel occurs most often in Subcategory 1, so it is consistent that ‘Pants, Tights, Leggings’ occurs with the maximum frequency in Subcategory 2.

Brand name:

There are many unique brands in the data set, but we will look at the 10 most popular brands by frequency of sale.

Top 10 brands by frequency of sale

For rows where no brand information was given, we filled the brand name column with the value ‘Not known’. As the histogram shows, the brand name has not been listed for most items. After that, the largest numbers of items have ‘Pink’ and ‘Nike’ as brand names.

Item Description:

To understand useful patterns in the ‘item_description’ feature, we will visualize the data. One good way to visualize text data is to plot word clouds. This helps us see the words that occur most frequently in the description column.

Here, we have plotted a word cloud with the 300 most frequently occurring words. Words shown in a larger font occur more frequently than the others.
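
A sketch using the wordcloud package; joining a random sample of descriptions into one string (faster than using all 1.4M rows) and the plot settings are my choices:

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# A random sample keeps the memory footprint manageable
text = ' '.join(train['item_description']
                .sample(100000, random_state=42).astype(str))

wc = WordCloud(max_words=300, width=1200, height=600,
               background_color='white').generate(text)

plt.figure(figsize=(12, 6))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()
```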

Step 2: Basic Feature Engineering and Preprocessing

We have tried to explore meaningful patterns in the data by studying the column ‘item_description’. For instance, the length of the description may hold some pattern that influences the price or the shipping. We have also experimented with an additional feature, the sentiment score of the item description, to capture what sentiment the description conveys. In the real world, an item with a positively worded description will usually command a higher price than an item with a negatively worded one. So we have engineered and explored these features too.

First, we need to preprocess the column ‘item_description’ as it contains text. In real-world ML problems, whenever you encounter textual data, preprocessing it is essential for extracting useful information before any model is applied. The usual preprocessing steps are:

  1. Converting all words to lowercase.
  2. Removal of stop words
  3. Removing punctuation and special characters.
  4. Removing unwanted multiple spaces
  5. Handling Alpha-numeric values and so on.

These are some commonly used preprocessing techniques, but the exact steps change with the nature of the underlying data and the purpose we are trying to serve. So, here is a code snippet of the preprocessing done on the column ‘item_description’.

Preprocessing of ‘item_description’
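
The full function is in the notebook; below is a minimal sketch following the steps listed above, where using NLTK's English stop word list is my assumption:

```python
import re
from nltk.corpus import stopwords  # may require nltk.download('stopwords')

STOPWORDS = set(stopwords.words('english'))

def preprocess_text(text):
    text = text.lower()                        # 1. lowercase
    text = re.sub(r'[^a-z0-9\s]', ' ', text)   # 3. drop punctuation/special chars
    text = re.sub(r'\s+', ' ', text).strip()   # 4. collapse multiple spaces
    # 2. remove stop words
    return ' '.join(w for w in text.split() if w not in STOPWORDS)

train['item_description'] = train['item_description'].apply(preprocess_text)
# Character length of the cleaned description, used as a feature later on
train['description_len'] = train['item_description'].str.len()
```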

Item description Length:

Description length vs price of an item

The plot shows description length against the price of an item. It suggests that as the description length increases, the price charged tends to decrease; most of the items with shorter descriptions have higher prices.

Sentiment score of Item Description:

We will compute a sentiment score for each of the item descriptions present in our data set.

Computing sentiment score of text in item description

The SentimentIntensityAnalyzer from NLTK's VADER lexicon is used to compute sentiment scores. For each item description it returns four values: positive, negative, neutral and compound.
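
A sketch of the computation; keeping only the compound score as a single column is a simplification on my part (the original keeps all four values):

```python
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# may require nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()

# polarity_scores returns {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}
train['sentiment_score'] = train['item_description'].apply(
    lambda text: sia.polarity_scores(text)['compound'])
```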

Plotting Correlation matrix:

A correlation matrix is essentially a normalized covariance matrix and is a very useful tool for multivariate exploration. We already have many features in our data set, and we have created two more, ‘item description length’ and ‘sentiment score of item description’. To see whether one feature is strongly related to another, we plot the correlation matrix; the closer a correlation value is to +1 or -1, the stronger the association, while values near 0 indicate little association.

Correlation matrix of features in data set
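
A sketch of how such a matrix can be plotted with pandas and seaborn; the exact set of numeric columns shown is illustrative:

```python
import seaborn as sns
import matplotlib.pyplot as plt

numeric_cols = ['price', 'shipping', 'item_condition_id',
                'description_len', 'sentiment_score']
corr = train[numeric_cols].corr()

plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0)
plt.show()
```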

From the above table, it can be said that our newly created feature ‘description_len’ shares a fair correlation with the target variable ‘price’, and the sentiment scores also share some correlation with ‘price’. Hence, we will include them as additional features in our feature list.

Hence, the features that we will select for modelling are ‘item_name’, ‘brand_name’, ‘shipping’, ‘general_category’, ‘subcategory_1’, ‘subcategory_2’, ‘item_description_length’, ‘sentiment_score_item_description’ and ‘item_description’.

Splitting into train and test data set:

Before applying any ML model, we split our data set into train and test parts. We fit on the train part and use the test part for prediction, and we evaluate which model performs best on the test data using our evaluation metric, RMSLE.

Splitting of data set
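
A sketch of the split; the 80/20 ratio and the random seed are my choices, not necessarily those used in the notebook:

```python
from sklearn.model_selection import train_test_split

X = train.drop(columns=['price', 'log_price'])
y = train['log_price']  # we model the log-transformed price

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```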

Handling Categorical Features:

One-Hot Encoding:

Before applying any machine learning model, our data must be fed to it in a proper format. Columns can hold numerical or categorical values, and in this problem most of the columns are categorical, so they have to be converted into a suitable format to extract the relevant information. Though there are many ways to handle categorical data, one of the most common is One-Hot Encoding. For those of you who find this term alien, we will describe it briefly before moving to the subsequent stages.

Example of One-Hot Encoding

From the figure, it will be easy to understand what One-Hot Encoding basically is. In the column ‘WorkClass’, there are three different categories of values ‘Private’, ‘State-gov’, ‘Federal-gov’. To convert this categorical data into one-hot encoded vector, we check the number of unique categories present in our train data. Since there are 3 unique categories of work class, any value in the work class field will be converted to a 3-dimensional vector. The work class value with respect to that particular row will bear the value ‘1’ in one field of the vector whereas all other fields of the vector will be marked ‘0’.

One-hot encoding is fit on the train data only, to avoid data leakage. We do not fit it on the test data because the test data is supposed to be unseen; if a new category appears at test time, we simply ignore that value when converting to one-hot encoded form.

So, we have ‘Name’, ‘Brand Name’, ‘General Category’, ‘Sub category 1’, ‘Sub category 2’ as columns with categorical values. We will convert these categorical values into one-hot encoded form. The code snippet of how to do it is illustrated below.

Name:
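
One way to one-hot encode the product name is scikit-learn's CountVectorizer in binary mode, fit on the train split only; whether the original notebook used this exact vectorizer is an assumption on my part:

```python
from sklearn.feature_extraction.text import CountVectorizer

name_vectorizer = CountVectorizer(binary=True)
# Fit on train only; tokens unseen in train are ignored at test time
X_train_name = name_vectorizer.fit_transform(X_train['name'].astype(str))
X_test_name = name_vectorizer.transform(X_test['name'].astype(str))

print(X_train_name.shape, X_test_name.shape)
```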

Similarly, Vectorizations for Brand Name, General Category, Sub category 1, Sub category 2 have been done.

Handling Text-Features:

Just like categorical features, we also have to treat our text features in the data. The column that bears text feature values in our data is ‘Item Description’. There are different ways of converting text features to numerical format like BOW, TF-IDF, Word2Vec. In this problem, we have computed TF-IDF vectors for all the words in our item description values.

Item Description:

This is the code snippet of how you can perform TF-IDF vectorisation on text features.
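
A sketch with scikit-learn's TfidfVectorizer; max_features and the n-gram range are illustrative choices:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=50000, ngram_range=(1, 2))
X_train_desc = tfidf.fit_transform(X_train['item_description'].astype(str))
X_test_desc = tfidf.transform(X_test['item_description'].astype(str))
```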

Handling Numerical Features:

Numerical features can be tricky to handle because different features span different ranges of values. In this section, we handle the numerical features before feeding them into our machine learning models.

Since numerical features can be on different scales, we normalized them to give the features a uniform scale, so that the models are not trained with large scale differences between features. Scikit-learn has a built-in Normalizer that we use for this rescaling. Below are the numerical features and the code snippets for processing them.

Length of Item description:
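
A sketch of normalizing the description length with scikit-learn's Normalizer; treating the whole column as a single row vector (so its values are rescaled relative to each other) is my interpretation of this step, and a MinMaxScaler would be a common alternative:

```python
from sklearn.preprocessing import Normalizer

normalizer = Normalizer()  # stateless; rescales each row vector to unit L2 norm

# Treat the column as one long row, normalize it, then reshape back
# into a column so it can be stacked with the other feature blocks.
X_train_len = normalizer.fit_transform(
    X_train['description_len'].values.reshape(1, -1)).reshape(-1, 1)
X_test_len = normalizer.transform(
    X_test['description_len'].values.reshape(1, -1)).reshape(-1, 1)
```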

In a similar manner, the sentiment score of the item description has been normalized. You can refer to the GitHub repo for the details.

Shipping and Item Condition id:

The columns ‘shipping’ and ‘item_condition_id’ are also categorical variables, and we adopted an easier way to one-hot encode them. Pandas provides the get_dummies method, one of the fastest ways to one-hot encode categorical variables: it is literally one line of code, it recognizes the categorical columns, and the output it returns is a dataframe of one-hot encoded values. Here is how to do it:
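
A sketch; casting the two columns to the category dtype before calling get_dummies (so the numeric codes are treated as categories) and the reindex that aligns train and test columns are my additions:

```python
import pandas as pd

X_train_ship_cond = pd.get_dummies(
    X_train[['shipping', 'item_condition_id']].astype('category'))
X_test_ship_cond = pd.get_dummies(
    X_test[['shipping', 'item_condition_id']].astype('category'))

# Make sure the test matrix has exactly the same columns as the train matrix
X_test_ship_cond = X_test_ship_cond.reindex(
    columns=X_train_ship_cond.columns, fill_value=0)
```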

Merging all features in a matrix:

After handling all types of features, we are just a step away from machine learning modelling. We merge all the converted features into one matrix in Compressed Sparse Row (CSR) format by stacking the arrays in sequence horizontally (column-wise). One thing to note is to check that the shapes of the train and test feature matrices match. Here is how you can combine all the features into a matrix using hstack from SciPy's sparse module; hstack concatenates arrays in sequence horizontally.
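
A sketch of the stacking, assuming the brand and category blocks (X_train_brand, X_train_cat and their test counterparts) were produced the same way as the name block earlier:

```python
from scipy.sparse import hstack, csr_matrix

X_train_final = hstack((X_train_name, X_train_brand, X_train_cat,
                        X_train_desc, csr_matrix(X_train_len),
                        csr_matrix(X_train_ship_cond.values))).tocsr()
X_test_final = hstack((X_test_name, X_test_brand, X_test_cat,
                       X_test_desc, csr_matrix(X_test_len),
                       csr_matrix(X_test_ship_cond.values))).tocsr()

# Row counts must match the corresponding y splits
print(X_train_final.shape, X_test_final.shape)
```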

Step 3: Machine Learning Modelling

Having finished the statistical analysis and cleaning of the data, we also did some feature engineering and added two new features, ‘length of item description’ and ‘sentiment score of item description’, derived from the existing features. We then handled the categorical, text and numerical features and merged them into train and test data matrices. We are now ready to apply machine learning algorithms to the prepared matrices.

Machine Learning Modelling and Best Parameter Selection:

There are many machine learning algorithms to try when you work on a specific problem, and you never know beforehand which model will work best for you. Domain knowledge and practical experience do play a role, but the best approach for anyone who wants to master these skills is to experiment. So I experimented with a few regression models and checked which performs best. It is also very important to hyperparameter-tune your models; the set of hyperparameters you choose for an ML model is crucial, so keep that in mind.

I have experimented with four machine learning regression models:

  1. Ridge Regression
  2. Stochastic Gradient Descent Regression
  3. Random Forest Regression
  4. Light Gradient Boosting Regression

I hyperparameter-tuned each of these models. The best hyperparameters found on the train data set were then used to predict the prices on the test data set and to compare the RMSLE of each model. I tried a wide range of hyperparameters because the optimal set varies from problem to problem; experimentation is the key to success.

The hyperparameter tuning technique we apply is Random Search with K-fold cross validation for some models and Grid Search with K-fold cross validation for the others. Random Search was chosen for some models because of its faster training time compared to Grid Search, since the latter checks every configuration of parameters.

For model building, we will use the Scikit-learn library in Python. It is extremely easy to build models with Scikit-learn because of its simplicity. The modelling of the proposed algorithms is illustrated in the next few sections.

Linear Models:

Ridge Regression:
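
A sketch of how the Ridge model can be tuned and evaluated; the alpha grid and the 3-fold CV are illustrative, not the tuned values from the notebook. Because y is already log1p(price), plain RMSE on these targets corresponds to RMSLE on the original prices:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

params = {'alpha': [0.01, 0.1, 1, 5, 10]}
grid = GridSearchCV(Ridge(), params, cv=3,
                    scoring='neg_mean_squared_error', n_jobs=-1)
grid.fit(X_train_final, y_train)

preds = grid.best_estimator_.predict(X_test_final)
print(grid.best_params_,
      np.sqrt(np.mean((preds - y_test) ** 2)))  # RMSLE on the price scale
```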

The RMSLE of Ridge Regression is: 0.48

Stochastic Gradient Descent Regression:

The RMSLE of Stochastic Gradient Descent Regression is: 0.49

Non-Linear models:

Random Forest Regression:

The RMSLE of Random Forest Regression is: 0.49

Light Gradient Boosting Machine Regression:

On big data sets, tree-based algorithms usually yield the best results. In this case, we used LightGBM instead of XGBoost for the following reasons (a sketch of the model setup follows the list):

  1. faster training speed and efficiency
  2. low memory usage
  3. higher accuracy
  4. good performance with large data sets
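
A sketch of the LightGBM setup; the hyperparameter values shown are illustrative placeholders, not the tuned ones that produced the score below:

```python
import numpy as np
import lightgbm as lgb

lgbm = lgb.LGBMRegressor(n_estimators=1500, learning_rate=0.1,
                         num_leaves=100, random_state=42)
lgbm.fit(X_train_final, y_train)

preds = lgbm.predict(X_test_final)
print(np.sqrt(np.mean((preds - y_test) ** 2)))  # RMSLE, since y is log1p(price)
```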

The RMSLE of LightGBM Regression is: 0.46

Ensemble:

The final step was to figure out which model gave the best results. For the final submission file, we ensemble all four regression algorithms used for prediction. Since RMSLE is the indicator metric for our problem, each algorithm is given a weight based on its RMSLE value; LightGBM gives the best results, so it received the maximum weight when calculating the final predicted price of an item.
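
A sketch of one such weighting scheme; making the weights proportional to 1/RMSLE (normalized to sum to one) is my illustration, and the arrays ridge_preds, sgd_preds, rf_preds and lgbm_preds are assumed to be the test-set predictions of the fitted models above:

```python
import numpy as np

preds = {'ridge': ridge_preds, 'sgd': sgd_preds,
         'rf': rf_preds, 'lgbm': lgbm_preds}
rmsle_scores = {'ridge': 0.48, 'sgd': 0.49, 'rf': 0.49, 'lgbm': 0.46}

# Lower RMSLE -> larger weight; LightGBM ends up with the biggest share
weights = {m: 1.0 / s for m, s in rmsle_scores.items()}
total = sum(weights.values())

ensemble_log_price = sum((weights[m] / total) * preds[m] for m in preds)
ensemble_price = np.expm1(ensemble_log_price)  # back to the dollar scale
```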

Results:

Out of all the models built, LightGBM, a non-linear model, gives the lowest loss of 0.46. However, the Random Forest Regressor, also a non-linear model, did not perform better than the linear models. One of the simplest linear models, Ridge Regression, gave the second best performance after LightGBM at 0.48. This illustrates why we should also try simple ML models when looking for better accuracy.

Summarizing the results:

Model                                      RMSLE (test)
Ridge Regression                           0.48
Stochastic Gradient Descent Regression     0.49
Random Forest Regression                   0.49
LightGBM Regression                        0.46

Final Submission Score:

So, here is my leaderboard score on Kaggle after submission.

My leader board score

Summary:

The above table is a summary on the train.tsv data set, but the evaluation in the competition is done on its own submission data set. So I made submissions using the best hyperparameters from each of these models. The key point to note is that almost all the individual ML models had comparatively higher RMSLE values, whereas the ensemble model gave the lowest RMSLE, and it is therefore my final model for this case study.

I had read before that ensemble-based approaches are very popular in Kaggle competitions, and with this case study I got first-hand experience of the same.

Improvements to Existing Approaches:

I tried feature engineering and included two new features, ‘length of item description’ and ‘sentiment score of item description’, in my modelling. Of all the online solutions I have gone through, most did not add a sentiment score of the item description as an engineered feature, which I think is an improvement in my way of solving the problem. I have also extensively hyperparameter-tuned all my models with a large set of feature combinations, which is definitely an improvement over the online solutions.

Future Work:

  • As future work, I wish to engineer more features and improve the accuracy of my models.
  • I also wish to experiment with more hyperparameters and try different machine learning algorithms to further improve my solutions.

Conclusion:

This was my first self-driven case study on Machine Learning and also my first Kaggle competition submission. Though it was a late submission, I got a pretty decent Kaggle score, which I think is great for a beginner. I learnt tonnes of techniques while working to improve the accuracy of the machine learning models. It always feels great to read about methodologies in books and blogs, but unless you do it on your own and learn things from scratch, you won't get a good idea of how to solve such problems in practice. So, my suggestion to any aspiring data scientist is to start participating in Kaggle competitions and never let the thirst for learning new things die within you.

This concludes my work. Thank you for reading!

While some code snippets are included within the blog, for the full code you can check out this Jupyter Notebook on Github. I hope you learnt something new through this read!

Follow me for more such articles and implementations on different real-world problems on data science!

References:

  1. https://www.renom.jp/notebooks/tutorial/preprocessing/onehot/notebook.html
  2. https://www.slideshare.net/SharadJain19/mercari-price-suggestion-challenge
  3. https://github.com/randylaosat/Price-Recommendation-Mercari-Challenge/blob/master/Springboard%20Project%202%20Mercari%20Report.docx
  4. https://towardsdatascience.com/a-data-science-case-study-with-python-mercari-price-prediction-4e852d95654
  5. https://www.kaggle.com/c/mercari-price-suggestion-challenge

You can also find and connect with me on LinkedIn and GitHub.
