Mercari Price Suggestion Challenge

Chaitanyanarava · Published in Analytics Vidhya · Mar 28, 2020 · 24 min read

This was my first data science experience in the Kaggle environment. The challenge itself was fun, and the overall experience was magical and intuitive. I hope you will have learned something by the end of this blog.

Price suggestion and recommendation for resale products on e-commerce websites has become quite common now that machine learning and deep learning are widely used. Mercari is one such example.

Mercari is an online shopping marketplace, powered by one of the biggest communities in Japan, where users can sell pretty much anything.

The community wants to offer price suggestions to sellers, but this is a tough task because sellers can put just about anything, or any bundle of things, on the Mercari marketplace. The price of a product depends on its brand and on how much it has been used, and it is quite common to see very different prices for the same kind of product from different brands.

For example, one of these sweaters costs $335 and the other costs $9.99. Can you guess which one is which? Probably not!

But what if we could build a system that automates all of this, removing the human effort from price suggestions and attracting more buyers to the market? That is where machine learning and deep learning come in handy.

Business Problem:

Understanding the given problem statement is the most important and challenging part of any data science journey. So, without wasting time, let's explore the business problem of Mercari.

The Mercari community wants us to build an algorithm that suggests the right product price for its shopping app from the product name, the user-entered text description, the category name, the brand name, the item condition, and the shipping information. Here, pricing is about finding an intermediate point between supply and demand. Isn't that fun?

Now I will walk you through the fun behind the challenge and the possible solutions I tried in order to deal with this problem.

Source Of Data:

Since this is a Kaggle problem, the data can be downloaded from Kaggle itself; the link is given below. There are five files, where train.tsv is the training data, test_stg2.tsv is the test data for the final submission, and sample_submission_stg2.csv is the sample submission file.

https://www.kaggle.com/c/mercari-price-suggestion-challenge/data

Existing Approaches:

Below is a link to a video describing the approach followed by Paweł and Konstantin, the first-prize winners of this competition.

https://www.youtube.com/watch?v=QFR0IHbzA30&t=2998s

I) They explain optimization details nicely, such as using declarative instead of imperative statements.

II) For text preprocessing they used stemming and bag of words, and one-hot encoding for the categorical features.

III) As their solution to this challenge they used LSTM and CNN models with ReLU activations and the Adam optimizer, together with word embeddings for the textual data.

You can also check the link below for a similar solution by sijunhe.

https://sijunhe.github.io/blog/2018/03/02/kaggle-mercari-price-suggestion-challenge/

First Cut Approach:

Domain knowledge and an understanding of the patterns in the data really help in solving ML problems. So, as an initial step, I explore the data through various plots in Exploratory Data Analysis (EDA) to understand it better, and I also look at the distribution of the target variable.

For feature extraction I use Term Frequency-Inverse Document Frequency (TF-IDF) and Bag of Words (BOW). I also introduce new features such as sentiment scores, a stopword count, and so on.

As base models I try linear, lasso, and ridge regression, LGBMs, and simple MLPs. As a final toss I also experiment with ensembles of these models.

Table of contents:

Chapter-1 : Data Scraping / Data Extraction / Importing the Data

Chapter-2 : Exploratory Data Analysis

Chapter-3 : A Solution through ML Regression Models

  • Step_3-1: Feature Engineering.
  • Step_3–2: Splitting The Data.
  • Step_3–3: Feature Extraction.
  • Step_3–4: Handling Numerical Features.
  • Step_3–5: Regression Models.

Chapter-4 : A Solution through DeepLearning - Concept of MultiLayerPerceptron (Neural Networks)

  • Step_4–1: Handling missing values
  • Step_4–2: Feature Extraction
  • Step_4–3: Building Model Architecture

Chapter-5 : Conclusions.

Chapter-1: Data Extraction:

For this case study I am using Google Colaboratory to get access to a GPU and high RAM.

I am using CurlWget for a faster download of the data. To know more about it, follow this link: https://www.thegeekstuff.com/2012/07/wget-curl/

Since the given data is in tab-separated values format (.tsv), we first convert it into comma-separated values (.csv) files.

converting .tsv file to .csv file format
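A minimal sketch of that conversion with pandas (the file names are assumed to match the Kaggle downloads):

```python
import pandas as pd

# Read the tab-separated files and write them back out as CSV
train = pd.read_csv("train.tsv", sep="\t")
test = pd.read_csv("test_stg2.tsv", sep="\t")

train.to_csv("train.csv", index=False)
test.to_csv("test_stg2.csv", index=False)
```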

We can see that our training data consists of more than 14 lakh (1.4 million) data points. These are the top five rows, where we can already spot some missing (NaN) values.

Checking for Missing values in the data:

Number of NaN values in each feature.
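That table comes from a simple per-column NaN count, for example:

```python
# Count missing values in every column of the training data
print(train.isnull().sum())
```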

As I expected, this dataset contains missing values, usually known as NaN values. Before applying any model to such data, we need to fill them in, for example by simply replacing them with empty strings.

Handling Missing Values in the Data:

In each value of the category_name feature we can see three different category levels. So we will split them into three separate features and, at the same time, fill the NaN values.

splitting category feature into three new features
filling NaN values in item description and in brand_name features
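A minimal sketch of both steps (the placeholder string for a missing category is an assumption):

```python
# Split category_name ("Women/Tops/Blouse") into three levels, padding missing parts
def split_category(cat):
    parts = cat.split("/") if isinstance(cat, str) else []
    parts += ["missing"] * (3 - len(parts))
    return parts[:3]

train[["sub_category1", "sub_category2", "sub_category3"]] = pd.DataFrame(
    train["category_name"].apply(split_category).tolist(), index=train.index)

# Replace missing descriptions and brand names with empty strings
train["item_description"] = train["item_description"].fillna("")
train["brand_name"] = train["brand_name"].fillna("")
```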

Chapter-2: Exploratory Data Analysis:

In statistics, EDA is an approach to analyzing data sets that summarizes their main characteristics. It is often used to discover patterns, spot anomalies, test hypotheses, and check the primary assumptions we made about the data.

Univariate Analysis:

Sub_category1:

We can see that the top three main categories of products are Women, Beauty, and Kids, and nearly 6 lakh products have Women as their main category.

Doing the same for sub_category2 and sub_category3, we get the results below.

Sub_category2:

Taking the top 10 sub_category2 we have the following analysis:

From the bar plot above we can see that nearly 12 lakh products are Athletic Apparel. Makeup and Tops & Blouses are the next two most frequent sub-categories.

Sub_category3:

Taking the top 10 subcategories in sub_category3 we have the following analysis:

Pants, Tights, Leggings; Face; and Other are the three most frequent values at sub-category level 3.

From the above analysis it is clear that the dataset mostly contains products related to women, such as cosmetics, dresses, and related accessories. Now let's explore the other features in the data.

BrandName:

Nike and PINK are the two most common product brands, in almost equal proportion, with Victoria's Secret next in line. As we already know, most of the products have no brand listed, so the unknown brand would obviously top the count among all brands, but I skipped it in this visualization.

Name:

Here are the top 10 product names that repeat most often in the data.

Bundle, Reserved, and Converse are the three most frequently repeated product names. There are nearly 2,000 Bundle products in the data.

Item_condition_id:

Item_condition_id 1 is the most frequent condition in the data: nearly 60 lakh data points have products with condition id 1, while item_condition_id 5 is the least frequent among all products in the train data.

Shipping:

  • For most products (55.3%), the shipping fee is paid by the buyer.
  • For the remaining 44.7% of products, the shipping fee is paid by the seller.

Price(target_value):

A complete description on the price which is our target value
Price vs log(price+1)

In the left plot we can see that the price feature is heavily right-skewed. To make errors on low prices count as much as errors on high prices, the competition uses the Root Mean Squared Logarithmic Error (RMSLE). Accordingly, we take log(price + 1) of the target variable (price) in this dataset.
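For reference, this is the standard RMSLE definition used by the competition, where p_i is the predicted price and a_i the actual price:

$$\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\log(p_i + 1) - \log(a_i + 1)\bigr)^2}$$

After the log(price + 1) transform, minimizing the ordinary RMSE on the transformed target is the same as minimizing RMSLE on the price.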

Bivariate Analysis:

Shipping vs Price:

In the PDF plot above, shipping = 0 has a higher peak than shipping = 1, and the two density curves almost overlap.

Branded and Unbranded Products vs Price:

As I expected, the price distribution of branded products peaks at a higher value than that of unbranded products, which means branded products tend to have relatively higher prices than products with no brand.

Item_condition_id vs price:

The 50th percentile (median) price of products with item_condition_id 5 is higher than for products with the other condition ids. Apart from item_condition_id 5, all the box plots cover roughly the same range.

Chapter-3 : A solution through Machine learning Regression models:

As stated at the beginning of the blog, we will use both ML and DL models for this solution. Let's try out different machine learning models on this data.

Step_3–1 : Feature Engineering on textual data:

Now that we are done analyzing the data, the interesting part begins: feature engineering.

For the textual data, i.e. item_description, I tried various feature engineering hacks and found four that capture interesting patterns in the data. I added these features, which are discussed in detail below:

  • Number of stopwords in item_description
  • item_description length
  • is_branded
  • sentiment score analysis through vader lexicon

I) Number of stopwords:

Let’s count the number of stopwords in the item description. This will be our new feature.

Counting number of stopwords in item_description
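A minimal sketch of this feature, assuming the NLTK stopword list has been downloaded (nltk.download('stopwords')):

```python
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words("english"))

# New feature: how many stopwords the raw description contains
train["count_stopwords"] = train["item_description"].apply(
    lambda text: sum(1 for w in str(text).lower().split() if w in STOPWORDS))
```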

Before moving on to the next feature engineering hack, we need to preprocess the data. Note that I did not use the preprocessed text in the function above, because the preprocessing step removes all stopwords; hence we count the stopwords before preprocessing the data.

Text Preprocessing:

Preprocessing the textual data is the most important step to do before any further feature engineering.

The basic text preprocessing involves the following steps:

i) Replacing shorthands with their full forms. Here, words like won't are replaced with will not, and so on.

ii) Replacing string literals like \r, \\ and \n with empty strings.

iii) Removing characters other than alphanumeric characters.

Text is a combination of punctuation marks, special characters, white space, and so on, which are not useful for model training, so we remove all such characters.

iv) Removal of stopwords:

  1. Text data usually contains stopwords, which are not useful as features since they mostly exist to make sentences grammatically complete in English.
  2. Hence it is necessary to remove stopwords, which are not useful for the regression model.
  3. One way to do this is by using NLTK (the Natural Language Toolkit).
Basic text preprocessing
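A minimal sketch of steps i)-iv); the shorthand dictionary and the output column name are illustrative assumptions:

```python
import re
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words("english"))
SHORTHANDS = {"won't": "will not", "can't": "can not", "n't": " not",
              "'re": " are", "'ll": " will", "'ve": " have"}

def preprocess(text):
    text = str(text).lower()
    for short, full in SHORTHANDS.items():        # i) expand shorthands
        text = text.replace(short, full)
    text = re.sub(r"[\r\n\\]", " ", text)         # ii) drop literals like \r, \n, \\
    text = re.sub(r"[^a-z0-9 ]", " ", text)       # iii) keep only alphanumerics
    words = [w for w in text.split() if w not in STOPWORDS]   # iv) remove stopwords
    return " ".join(words)

train["preprocessed_text"] = train["item_description"].apply(preprocess)
```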

Now that text preprocessing is done, let's visualize the preprocessed text with a word cloud to see which words occur most frequently in item_description. The more often a word is repeated, the larger it appears.

Visualization of the preprocessed text using a word cloud
  • From the word cloud above, brand, new, free, shipping, description, and yet are the most common words in the item description.
  • Sellers use words like new, free, shipping, and description to advertise their products to buyers.
  • Now let's look at the top 25 most frequent words in the preprocessed text.
  • new and size are the two most frequent words in the item description.
  • Nearly 45 lakh products use the word new in their item description.

As the next feature engineering tasks, let's use the item description length, is_branded, and sentiment analysis as our new features.

II) Item_description length:

A complete description of description_length feature.
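A sketch of how that feature can be computed (word count is assumed here; a character count would also work):

```python
# Length of each preprocessed item description, measured in words
train["description_length"] = train["preprocessed_text"].apply(
    lambda t: len(str(t).split()))
```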

III) Is_branded:

  • We have seen that most of the products have no brand, and that fact itself can be used as a feature.
  • We also know that the price of a product varies with its brand, depending on the company producing it; a well-known brand will usually command a higher price than the same product from an unknown brand.
  • Therefore, knowing whether a product has a known or unknown brand can help us estimate its price (see the sketch below).
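A one-line sketch of the flag (brand_name was filled with empty strings earlier, so an empty string means the brand is unknown):

```python
# 1 if the product has a known brand, 0 otherwise
train["is_branded"] = (train["brand_name"] != "").astype(int)
```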

IV) Sentiment Score Analysis:

  • Sentiment score analysis is often used as a feature engineering hack when dealing with textual data.
  • It tries to identify and extract opinions within a given text.
  • Sentiment analysis is a tricky task, but it becomes easy using NLTK in Python.
  • It returns four values: positive, negative, neutral, and compound.

To know more about this follow the given link:

https://www.geeksforgeeks.org/facebook-sentiment-analysis-using-python/

How does sentiment score analysis help us in our task?

  • A product with a more positive description may be priced higher; similarly, a product with a negative description may be priced lower.
  • That means there is some correlation between the description and the price (target value) in our data, which is a good sign for our task.

As mentioned, there are four values: positive, negative, neutral, and compound. These become four new features in the data.
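A minimal sketch with NLTK's VADER analyzer (run nltk.download('vader_lexicon') once):

```python
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
scores = train["item_description"].apply(lambda text: sia.polarity_scores(str(text)))

# Four new sentiment features
train["positive"] = scores.apply(lambda s: s["pos"])
train["negative"] = scores.apply(lambda s: s["neg"])
train["neutral"] = scores.apply(lambda s: s["neu"])
train["compound"] = scores.apply(lambda s: s["compound"])
```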

Step_3–2 : Splitting the Data:

Splitting up the data is mainly useful for the hyperparameter tuning part of machine learning. Tuning plays a key role in model training, and to make our model perform fairly well on test data it is important to tune the model hyperparameters.

For that we need data that is usually taken from the training data in a small portion, around 1-2% depending on the size of the training set, and is referred to as cross-validation data or simply validation data.

In the given dataset I found 831 products with a price of zero. Since no product on the market will have a price ≤ 0, these are probably outliers or human errors, so I removed those products from the data.
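A minimal sketch of this step; the 2% split and the random seed are assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Drop products whose price is zero (or negative)
train = train[train["price"] > 0]

# Hold out a small validation (cv) set for hyperparameter tuning
X_tr, X_cv = train_test_split(train, test_size=0.02, random_state=42)

# Targets on the log(price + 1) scale
y_tr = np.log1p(X_tr["price"].values)
y_cv = np.log1p(X_cv["price"].values)
```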

Step_3–3 : Feature Extraction:

Now that feature engineering is done, it is time for feature extraction. Our data is a combination of categorical, numerical, and textual features, but we can only feed numerical data to an ML model. Let's see how to achieve this using feature extraction techniques.

Vectorization is used for the text-like data: Bag of Words (BOW) for the categorical features and Term Frequency-Inverse Document Frequency (TF-IDF) for the textual features.

Vectorization of Categorical Features(One Hot Encoding):

We have sub_category1, sub_category2, sub_category3, brand_name, name as our categorical features.

One-hot encoding can be achieved through sklearn's CountVectorizer implementation. The code snippet to do this is shown below.

One hot encoding for categorical feature- sub_category1

Note that the fitting of CountVectorizer should be done only on the train data.
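A minimal sketch of this step for sub_category1 (binary counts give the one-hot behaviour):

```python
from sklearn.feature_extraction.text import CountVectorizer

cat1_vectorizer = CountVectorizer(binary=True)
cat1_vectorizer.fit(X_tr["sub_category1"].values)          # fit on train data only

tr_cat1 = cat1_vectorizer.transform(X_tr["sub_category1"].values)
cv_cat1 = cat1_vectorizer.transform(X_cv["sub_category1"].values)
```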

Do the same for other categorical features like sub_category2, sub_category3, brand_name and name features.

Vectorization of Textual Features:

There are different techniques for converting textual features into numerical form, such as TF-IDF, word2vec, weighted-average word2vec, GloVe vectors, and so on.

As we used a simple Bag of Words for the categorical features, we will use TF-IDF for our textual feature (item_description).

Below is the code snippet to achieve this task.

Tfidf vectorization of item_description

I have taken only the top 5,000 features here, where each term must occur in at least 10 documents (min_df = 10). I have also used bi-grams for item_description.
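A sketch with exactly those settings (5,000 features, min_df = 10, uni- and bi-grams), applied to the preprocessed description text:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=5000, min_df=10, ngram_range=(1, 2))
tfidf.fit(X_tr["preprocessed_text"].values)                 # fit on train data only

tr_desc = tfidf.transform(X_tr["preprocessed_text"].values)
cv_desc = tfidf.transform(X_cv["preprocessed_text"].values)
```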

Step_3–4 : Handling Numerical Features:

After adding the new features, we have six numerical features in total: positive, negative, neutral, compound, description_length, and count_stopwords.

Feature_scaling:

Since the numerical features are on different scales (some values can be very high, some very low, and some can even be outliers), rescaling them is essential for the optimization to behave well. This can be achieved in several ways, e.g. standardization, normalization, MinMaxScaler, and so on.

I used standardization, meaning each feature is transformed so that its values follow a standard normal distribution with mean = 0 and standard deviation = 1.
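A sketch for one of the numerical features (the scaler is fitted on train data only):

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
tr_len = scaler.fit_transform(X_tr["description_length"].values.reshape(-1, 1))
cv_len = scaler.transform(X_cv["description_length"].values.reshape(-1, 1))
```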

We do the same for the other numerical features as well. Have a look at the GitHub repository for the complete code.

Binary Features and item_condition_id:

We are now left with the item_condition_id, is_branded, and shipping features, which are also a type of categorical feature. We convert them into a vector representation using pandas.get_dummies() and finally turn the result into a sparse matrix.

This is how things get easier for us: there is no need to write the encoding logic ourselves, it takes literally a couple of lines.
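A sketch of that encoding (aligning the dummy columns between train and cv is left implicit here):

```python
import pandas as pd
from scipy.sparse import csr_matrix

cols = ["item_condition_id", "shipping", "is_branded"]
tr_bin = csr_matrix(pd.get_dummies(X_tr[cols].astype("category")).values)
cv_bin = csr_matrix(pd.get_dummies(X_cv[cols].astype("category")).values)
```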

Concatenation of all the features:

We now have numerical representations of all the categorical and textual features. We merge them into one final sparse matrix, which will be fed to our machine learning regression models.
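A sketch of the final stacking; the names below refer to the matrices built in the earlier sketches, with tr_num / cv_num standing for the scaled numerical features:

```python
from scipy.sparse import hstack

X_train_final = hstack((tr_cat1, tr_desc, tr_num, tr_bin)).tocsr()
X_cv_final = hstack((cv_cat1, cv_desc, cv_num, cv_bin)).tocsr()
```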

Step_3–5 : Regression Models:

Now we are all set to apply regression models to our problem. But which model should we apply to our data?

Confused, right?

Machine learning is one of the fastest-growing fields in the world, and a bunch of new algorithms is launched every day. Some of them may work on your data and some may not; there is no single ML algorithm that beats all existing models on every problem (if there were, all the other models would end up in the dustbin). Based on prior knowledge, domain expertise, the problem statement, and even the solutions of the first-prize winners, one chooses the algorithms to tackle the problem.

As an experiment I used four linear regression models and one boosting model:

  • Linear Regression
  • Ridge Regression
  • Lasso Regression
  • SGD Regressor
  • Boosting Models like LGBM

At last, we will also try ensembling the best of these models.

Need of Hyperparameter tuning:

Hyperparameters play an important role in model predictions: with proper hyperparameter tuning we can protect our model from underfitting and overfitting.

So we need to pick those values (hyperparameters) in such a way that both train and test errors are low and close to each other.

For hyperparameter tuning we use GridSearchCV with 3-fold cross-validation.
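A sketch of that setup, shown here tuning the alpha of the ridge model discussed further below; the grid values are illustrative:

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

param_grid = {"alpha": [1e-3, 1e-2, 0.1, 1, 10, 100]}
search = GridSearchCV(Ridge(), param_grid, cv=3,
                      scoring="neg_mean_squared_error", n_jobs=-1)

# y_tr is already log(price + 1), so squared error here corresponds to squared log error
search.fit(X_train_final, y_tr)
print(search.best_params_)
```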

I) Linear Regression:

Linear regression is a basic, simple and commonly used type of predictive analysis. The overall idea of linear regression is to examine two things:

  • Does a set of predictor variables do a good job in predicting an outcome (dependent) variable?
  • Which variables in particular are significant predictors of the outcome variable, and in what way do they–indicated by the magnitude and sign of the beta estimates–impact the outcome variable?
  • These regression estimates are used to explain the relationship between one dependent variable and one or more independent variables.

RMSLE of Linear Regression: 0.4612 on train and 0.4693 on cv data

There is no real hyperparameter tuning for plain linear regression, since the model simply finds the plane that best fits the data.

Well, it's a good score for a simple model, but let's try some other models that can improve the metric through hyperparameter tuning.

II) Lasso Regression:

Least Absolute Shrinkage and Selection Operator, or simply LASSO, is a regression analysis method that performs both variable selection and regularization.

  • Lasso regression is similar to linear regression, but in addition it performs shrinkage.
  • Unlike linear regression, LASSO has a hyperparameter to tune: alpha.

I got alpha = 1e-09 as the best value, so I retrained the model with this alpha.

RMSLE of Lasso Regression : 0.4642 on train and 0.4699 on cv data.

  • After all this we get an RMSLE of 0.4699, which is roughly equal to the RMSLE of linear regression.
  • But the gap between the train and CV errors is smaller than for the LR model, which means this model overfits less than LR.

III) Ridge Regression:

Ridge Regression is a technique for analyzing multiple regression data that suffer from multicollinearity.

  • It reduces model complexity by shrinking the coefficients.
  • It is also a linear model.
  • This regression model also has hyperparameters: {alpha, solver}. For hyperparameter tuning we use two solvers here, namely cholesky and lsqr.

I got alpha = 10 and solver = 'cholesky' as the best values, so let's train the model with these values.

RMSLE of Ridge Regression: 0.4637 on train and 0.4667 on cv data.

This result is slightly better than the two models above.

IV) SGD Regressor:

Stochastic Gradient Descent (SGD) is a simple yet very efficient approach to fitting linear models under convex loss functions, such as the squared loss used here.

  • The gradient of the loss is estimated one sample at a time, and the model is updated along the way with a decreasing step size, i.e. the learning rate.
  • It is an iterative method for optimizing an objective function with suitable smoothness properties; hence it is also called an optimization algorithm.

I have got alpha = 1e-09 and learning_rate = ‘adaptive’ as best values. So let’s train the model with these values.

RMSLE of SGD Regressor: 0.4625 on train and 0.4688 on cv data.

V) LGBM Regressor:

So far we have tried four simple linear models. Now let's try a boosting model called LightGBM.

Light GBM is a fast, distributed, high-performance gradient boosting framework based on decision tree algorithm.

  • Faster training speed and higher efficiency than the other models we trained on this dataset.
  • We can see that it occupies less RAM.
  • It supports parallel learning and scales well to larger datasets.

We can see that there are many hyperparameters to tune in LGBM. After tuning, I got learning_rate = 0.1, max_depth = 15, n_estimators = 200, num_leaves = 75, and boosting_type = 'gbdt' as my best values.
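A sketch using those best values (the lightgbm package is assumed to be installed):

```python
from lightgbm import LGBMRegressor

lgbm = LGBMRegressor(boosting_type="gbdt", learning_rate=0.1, max_depth=15,
                     n_estimators=200, num_leaves=75)
lgbm.fit(X_train_final, y_tr)
pred_lgbm = lgbm.predict(X_cv_final)
```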

It performs better than any other model we have trained so far.

RMSLE of LGBM Regressor: 0.4427 on train and 0.4590 on cv data.

Merging Results of All the Models( Ensembling ):

As promised at the start, let's use ensembling as a final toss model.

I took the four best models and weighted each one according to its performance. Here is the code snippet for it.

Ensembling of four best models.

Since LGBM works best, I gave it a weight of 0.6; based on the lower overfitting of lasso I gave it 0.2; and I gave 0.1 each to ridge and linear regression.
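A sketch of that weighted blend; pred_lasso, pred_ridge, and pred_linear stand for the CV predictions of the corresponding models:

```python
# Weighted average of the four best models (weights sum to 1)
final_pred = (0.6 * pred_lgbm + 0.2 * pred_lasso +
              0.1 * pred_ridge + 0.1 * pred_linear)
```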

So the final RMSLE of our machine learning regression solution is 0.4487 on the CV data.

Chapter-4 : A Solution Through Deep learning concept of MultiLayer Perceptron(Neural Networks)

Why Neural Networks?

Neural networks are a specific set of algorithms that have revolutionized machine learning. They are inspired by biological neural networks and the current so-called deep neural networks have proven to work quite well. Neural Networks are themselves general function approximations, which is why they can be applied to almost any machine learning problem about learning a complex mapping from the input to the output space.

A multilayer perceptron (MLP) is a feedforward artificial neural network that generates a set of outputs from a set of inputs. An MLP consists of several layers of nodes connected as a directed graph between the input and output layers, and it uses backpropagation to train the network.

  • It's a deep learning method.
  • Unlike classical machine learning models, an MLP doesn't need as many explicit feature engineering tasks, because the model itself adapts to the patterns in the data through backpropagation. Interesting, right?
  • Let's use an MLP model and check whether it improves the RMSLE metric or not.

Now we load the data again and split it into train (for training) and val (for validation) sets, as we did above.

The reason we redo everything from the beginning is that an MLP may not work well with the hand-crafted features we introduced for the ML models, and I experienced exactly that in this case study.

Step_4–1 : Handling Missing Values:

We replace the missing values with empty strings. Following the idea of the first-prize winners, I concatenated name, category_name, brand_name, and item_description into one feature, and name and brand_name into another. I also normalized the item_condition_id feature.

Instead of the raw price I took log(price + 1), and finally I scaled these values using MinMaxScaler.
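A minimal sketch of these steps (the exact column handling and scaling details are assumptions):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Missing values become empty strings
for col in ["name", "category_name", "brand_name", "item_description"]:
    train[col] = train[col].fillna("")

# Two concatenated text features, following the winners' idea
train["total_txt"] = (train["name"] + " " + train["category_name"] + " " +
                      train["brand_name"] + " " + train["item_description"])
train["total_name"] = train["name"] + " " + train["brand_name"]

# Normalize the condition id and scale log(price + 1) into [0, 1]
train["item_condition_id"] = train["item_condition_id"] / train["item_condition_id"].max()
price_scaler = MinMaxScaler()
y = price_scaler.fit_transform(np.log1p(train["price"].values).reshape(-1, 1))
```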

Step_4–2 : Feature Extraction:

Here we will use TFIDF vectorization as a feature extraction tool for both ‘total_name’ and ‘total_txt’ features.

As stated before, the MLP model adapts to the patterns in the data through its weights; all it needs is a lot of data passed through the network. I chose 1 lakh (100,000) features for each vectorization and also included bi-grams of the textual data.

If you are using Colaboratory, you can't take much more than 22 lakh features in total, because Colab crashes (I personally experienced this many times while working on this case study), the notebook gets disconnected, and everything has to be run again from the beginning.

item_condition_id and shipping features:

As we used pd.get_dummies() in ML we will use the same here also.

Error metric:

For model evaluation we will be using RMSLE (Root mean squared logarithmic error) metric.
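A small helper for this metric, assuming both arguments are on the original price scale (scaled predictions would be inverse-transformed first):

```python
import numpy as np

def rmsle(y_true, y_pred):
    # Root mean squared error of log(price + 1)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))
```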

Step_4–3 : Building Model Architecture:

We can use different types of neural network architectures to train our model. Let's briefly discuss four famous architectures.

I) Perceptrons:

Considered the first generation of neural networks, perceptrons are simply computational models of a single neuron. They appeared to have a very powerful learning algorithm and lots of grand claims were made for what they could learn to do. However, the perceptron learning procedure is still widely used today for tasks with enormous feature vectors that contain many millions of features.

II) Convolutional Neural Networks(CNN):

Machine learning research has focused extensively on object detection problems over time. But there are various things that make it hard to recognize objects, such as segmentation, lighting, deformation, etc.

Convolutional neural networks can be used for all kinds of object recognition work, from handwritten digits to 3D objects. The real value of CNNs came out in the ILSVRC-2012 competition on ImageNet, a dataset with approximately 1.2 million high-resolution training images. The winner of the competition, Alex Krizhevsky (NIPS 2012), developed a very deep convolutional neural net of the type pioneered by Yann LeCun. This made CNNs very popular.

III) Recurrent Neural Networks(RNN):

RNNs are very powerful, because they combine 2 properties:

  1. a distributed hidden state that allows them to store a lot of information about the past efficiently, and
  2. non-linear dynamics that allow them to update their hidden state in complicated ways. With enough neurons and time, RNNs can compute anything that can be computed by your computer.

IV) Long Short Term Memory(LSTM):

Hochreiter & Schmidhuber (1997) solved the problem of getting an RNN to remember things for a long time (like hundreds of time steps) by building what is known as a long short-term memory network. They designed a memory cell using logistic and linear units with multiplicative interactions. Information gets into the cell whenever its “write” gate is on. The information stays in the cell as long as its “keep” gate is on. Information can be read from the cell by turning on its “read” gate.

Our dataset is based mainly on textual features, so I built a simple MLP to train the neural network. A sketch of it is shown below.
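A minimal Keras sketch of such a sparse-input MLP; the layer sizes, optimizer settings, and batch sizes are assumptions, and X_train / X_cv (with y_train / y_cv) stand for the sparse feature matrices and scaled targets built above:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_mlp(num_features):
    # sparse=True lets Keras consume the large scipy sparse TFIDF matrix directly
    inp = layers.Input(shape=(num_features,), sparse=True, dtype="float32")
    x = layers.Dense(256, activation="relu")(inp)
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dense(64, activation="relu")(x)
    out = layers.Dense(1)(x)                      # predicts the scaled log(price + 1)
    model = Model(inp, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
    return model

model = build_mlp(X_train.shape[1])

# One epoch per fit, with the batch size doubling each time (as described below)
for batch_size in (512, 1024, 2048):
    model.fit(X_train, y_train, epochs=1, batch_size=batch_size,
              validation_data=(X_cv, y_cv))
```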

  • I tried running the model for a few more epochs, but the RMSLE on both train and CV kept getting worse.
  • Hence I trained four such models, fitting each one three times with different batch sizes.
Model predictions.
  • For each of these models I then made individual predictions on the data.
  • Finally I ensembled all four models' results to reduce the RMSLE metric.

RMSLE of Ensembled MLP: 0.4068 on train and 0.4179 on cv data

Result of all the Models:

After training various machine learning regression models and the MLP neural network, my best RMSLE score is 0.4179.

summarizing the results of all the models

Final Submission Score:

So far we have used all these models on the training data (train.tsv.7z), but for the final submission there is separate data (test_stg2.tsv). I submitted the price predictions of the ensembled MLP model, which is the final solution to our problem.

Here is my final leaderboard score on Kaggle after submitting the submission file through one of the Kaggle kernels of this competition.

Final submission score on leaderboard of this competition

Chapter-5 : Conclusions:

The final solution to our problem statement:

  • The main constraint of the given problem statement is to reduce the RMSLE metric. After training different ML models on the data, the lowest RMSLE we found was 0.44 on the CV data.
  • Applying the MLP further reduced the RMSLE to 0.41, so the MLP is the final solver for this problem.

About Model Training:

  • I tried training the model for another 5-10 epochs, but I observed that the RMSLE got worse, so I limited training to one epoch and achieved an RMSLE of 0.44.
  • After that I fitted the same MLP model three times with different batch sizes (in multiples of 2), and the RMSLE dropped to roughly 0.42.
  • Then I ensembled all four similarly trained MLP models: for each data point I predicted the price with each model and finally took the mean of those predictions.
  • As a result I achieved an RMSLE of 0.41 on the CV data.
  • ML and DL are about experimenting with different approaches, so I tried various ways to achieve this task.

My Improvements to Existing Approaches:

Instead of taking individual features, I combined features into two new ones, and from these I extracted 2 lakh new features through vectorization. I also took bi-grams of these features, which worked fairly well for this data. In addition, I found outliers in the data where product prices were equal to zero, so I removed all data points with prices less than or equal to zero while training the model. All of this improved my error metric over the online solutions to this competition.

Future work:

  • As future work I would like to experiment with new ML regression models and try out new engineered features.
  • I also want to try convolutional neural networks for this problem, and to experiment with different hyperparameters for faster convergence of the model.

References:

This was my first Kaggle experience, and it taught me real-world scenarios of applying ML and DL. My aspiration to become a data scientist brought me here, and this article covers my final work for this competition. Thank you for reading my article.

You can check the .ipynb notebook with the full code for this case study in my GitHub repository.

Follow me for more such articles and implementations on different real-world case studies in data science! You can also connect with me through LinkedIn and Github

I hope you have learned something from this. There is no end to learning in this area, so happy learning! Signing off, bye :)
