Mercari Price Suggestion Challenge — An End-to-End Machine Learning Case Study

Chintan Dave
Published in The Startup · Dec 12, 2020 · 24 min read

“People don’t buy what you do, they buy why you do it.” ― Simon Sinek

Have you ever tried selling your old smartphone, but could not figure out what a fair price would be? What if you undervalue it and take a loss? Or what if you set the price so high that nobody is willing to buy it?

Product price suggestion is a problem that many retailers are trying to solve, especially with the advent of online selling platforms where a plethora of products can be listed very easily. Mercari, Japan’s biggest community-powered shopping platform, is one such online marketplace.

Mercari wanted to offer pricing suggestions to its sellers. This is tough because their sellers can sell just about anything, or any bundle of things on Mercari’s marketplace. To understand why this problem is tough to solve, consider the following two products that are put on sale.

One of these sweaters costs $335 and the other costs $9.99. Can you guess which is which? Small details can mean big differences in pricing.

This is exactly the problem that Mercari challenged the Kaggle community to solve, with prize money of $100,000.

Contents Summary

  1. Business Problem
  2. Source Of Data And Its Overview
  3. Machine Learning Problem Formulation
  4. Exploratory Data Analysis (EDA)
  5. Feature Engineering
  6. Traditional Machine Learning Models
  7. Deep Learning Models
  8. Deployment
  9. Experiments that did not work well
  10. Future Work And Scope For Improvements
  11. References

1. Business Problem

Mercari wants us to build an algorithm that automatically suggests the right product price from the user-provided text description of a product, including details like the product category, brand name, item condition, etc.

Business objectives and constraints

  • The goal is to solve the problem of suggesting the appropriate price of products to online sellers.
  • No latency constraints: we would like to suggest a highly accurate price to the seller, even if it takes a reasonable amount of time.

Now I will walk you through the approaches I used to solve this problem step by step, and how I got a score in the top 2.77% (66th position) of the Kaggle leaderboard, in the Silver zone. The competition ended long ago, so technically I could not make my way onto the leaderboard, but it was still a great opportunity to solve this problem and see where I would have stood during the competition. Make sure to read it till the end!

2. Source Of Data And Its Overview


The data was uploaded by Mercari on Kaggle.

The training file (.tsv) consists of a tab-delimited list of product listings, with the following fields:

  • train_id — the id of the listing
  • name — the title of the listing
  • item_condition_id — the condition of the item as provided by the seller, from 1 to 5; 1 being ‘new’ and 5 being ‘poor’
  • category_name — category of the listing
  • brand_name
  • price — the price that the item was sold for. This is the target variable that we will predict. The unit is USD.
  • shipping — 1 if the shipping fee is paid by the seller, 0 if paid by the buyer
  • item_description — the full description of the item.

The size of train.tsv = 322 MB.
Total number of product listings = 1,482,535

3. Machine Learning Problem Formulation

Type of Machine Learning problem

  • The prices (target) are real numbers, so this falls under a regression problem.

Performance Metric

The metric that I will be using is Root Mean Squared Logarithmic Error (RMSLE). The lower the score, the better the performance.

RMSLE = sqrt( (1/n) * Σ (log(pᵢ + 1) - log(aᵢ + 1))² ), where n is the number of listings, pᵢ is the predicted price and aᵢ is the actual price of listing i.

The choice of this metric is due to the following reasons:

  • It is robust to outliers.
  • It is scale invariant.
  • It penalizes under-estimation more than over-estimation. This makes sense from a business perspective, because ideally the company would like products to sell at a reasonably higher price; this increases both the seller’s satisfaction and the commission the company earns on each sale. Another reason is that the price distribution is highly skewed to the right (as we will see in the EDA), meaning there are far more low-priced products than very high-priced ones.
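In code, the metric is tiny. A minimal NumPy sketch (here y_true and y_pred are arrays of actual and predicted prices):

    import numpy as np

    def rmsle(y_true, y_pred):
        # log1p(x) = log(x + 1), which also handles zero prices safely
        return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

A convenient consequence: if a model is trained on log1p(price) as the target, plain RMSE on that target is the same as RMSLE on the original prices, which is one more reason the log transform used later is handy.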

4. Exploratory Data Analysis (EDA)

EDA is the most important step before any kind of statistical or mathematical modeling. It helps us understand the data and get better insights. Only after understanding the data can we apply meaningful feature engineering to it.

Checking Null values in the dataset

Null values counts
Null values in percentage

As you can see, there are a lot of null values in the brand name column. The null counts in the category and item description columns are insignificant by comparison.

Analyzing Price (Target column)

Plotting the PDF of the price column gives us the following distribution,

Probability Density Function [PDF] of prices

We can see that the price distribution is highly skewed to the right, and that most of the prices lie between $0 and $250; very few items are priced above $250. The distribution looks similar to a Pareto distribution.
We will get a clearer picture of the price distribution when we check the prices at several percentiles.

0 to 100 percentile prices
90 to 100 percentile prices
99 to 100 percentile prices

From the prices at the above percentiles, we can see that 99% of the prices are less than $170. The higher prices can be treated as outliers. Now, there are two things that can be done.

  1. We discard the outliers, and our model never learns to predict higher prices.
  2. We do not discard the outliers and keep them in the training dataset, so our model learns to predict higher-priced products too. BUT… this would hurt the overall model performance, because after all we are keeping outliers in our training data.

My suggestion would be option 2. We keep the higher-priced products, but since most products are low priced, we use a metric that makes the model focus on the lower-priced products more than the higher-priced ones, without totally discarding the latter.

Here our metric, RMSLE, comes to the rescue. This metric is typically used when the data is highly skewed (especially to the right, as in our case), because of its log operation and its mathematical properties such as scale invariance and robustness to outliers.

Removing out-of-range prices
When I did some research on Mercari’s website, I found that items can be priced between $5 and $2,000. In our dataset there are certain products priced below $5 (even $0) and others above $2,000. I removed those listings from the dataset.
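A minimal pandas sketch of this filtering step, assuming the dataframe is called train and using the $5 and $2,000 bounds mentioned above:

    # keep only listings priced within the allowed range
    train = train[(train['price'] >= 5) & (train['price'] <= 2000)].reset_index(drop=True)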

Applying box-cox transform to the prices

As the original price distribution looks like a Pareto distribution, I thought of applying a Box-Cox transform to it. Applying a Box-Cox transformation to a Pareto-like distribution should bring it close to a normal distribution. The following is the distribution of prices after the Box-Cox transformation.

Effect of box-cox transformation on prices

Well, this distribution does not look like a perfect normal distribution. The thing is, a Box-Cox transform will “try” to make the data normal, but it does not guarantee it.

Applying log transform to the prices

A simpler way of transforming a Pareto-like distribution into a normal-like distribution is to apply a log transformation.

Effect of log transformation on prices
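Both transformations take only a couple of lines. A sketch, again assuming the dataframe train (Box-Cox requires strictly positive prices, which holds after the filtering above):

    import numpy as np
    from scipy import stats

    # Box-Cox: returns the transformed values and the fitted lambda
    boxcox_price, fitted_lambda = stats.boxcox(train['price'])

    # log transform: log1p(x) = log(x + 1)
    train['log_price'] = np.log1p(train['price'])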

Further EDA will focus on the log-transformed prices wherever necessary, not on the Box-Cox transformation. The reason will be discussed later, but to give you a spoiler: the Box-Cox transform did not work well with the machine learning models (discussed in later sections).

Analyzing Shipping Column

Shipping column; count

The split between seller-paid shipping [1] and buyer-paid shipping [0] is fairly even, though there are clearly fewer items where the seller pays the shipping than items where the buyer pays for it.

PDF and box plot of shipping data against log price

From the PDF of prices split by shipping, we can see that if the price of the product is very low (less than about 6 dollars), the shipping is mostly paid by the seller.

antilog(~1.8) ≈ 6 dollars

Analyzing Item Condition

Item condition; count

There is a significant imbalance across the item condition categories, especially categories 4 and 5: there are very few items in ‘Fair’ [4] and ‘Poor’ [5] condition.

PDF and box plot of item condition against log price

We can say that the prices are fairly evenly spread across the item condition categories.

The effect of item condition on price (seen from the box plot) shows that the median prices are highest for ‘Poor’ condition products, followed by ‘New’ condition products. This might be because products in poor condition often belong to categories where prices are naturally higher, e.g. electronics like laptops or sports items like hoverboards, so their base prices are themselves very high.

Analyzing Item Name text

Word cloud of item name text data

We can see that most sellers mention the brand name and the product name in the item name. They also highlight free shipping information in this section to attract buyers. As seen earlier, there are a lot of missing values in the brand name column, so we can try a feature engineering hack to guess the missing brand name from the item name.

Analyzing Item Description

Word cloud of item description text data

We can see that most sellers use words that describe the condition of the product, like ‘brand new’, ‘good condition’, ‘worn out’, etc. We could use this to run sentiment analysis and add the sentiment scores as features in the dataset. We can also see that sellers highlight shipping details in the description, like ‘free shipping’, and that they mention brand names there as well.

Analyzing Brand Name

Top 50 brand names, by their count

There are 4,810 unique brand names. Among the top 50 most mentioned brands, most are apparel brands and some are electronics.

Analyzing Categories

The categories are given as a single slash-separated string with three parts, like this:

slash separated category_name

So I decided to split this category column into three separate columns, like this:

Three separate columns for categories

No. of unique values in main category: 11
No. of unique values in Sub_category1: 114
No. of unique values in Sub_category2: 871

General category

[If I use the term ‘Main category’, it is synonymous with ‘General category’.]

Count of items in general category
Percentage count of general categories

After dividing the category into three parts, the Women category leads the general categories at 44.81%, followed by the Beauty category at 14.02%. The Sports category has the smallest count.

Violin plot of General categories against log price

From the above graph we can see that there is little difference in the distribution of the log prices across the general categories. Relying solely on the general category to make a price prediction will not be a good idea; we should use it in combination with the other category features.

Sub-category 1

Count of items in sub-category 1

Sub-category 2

Count of items in sub-category 2

As suspected earlier from the brand names, most products belong to the apparel categories, especially for women. Only ~6,000 category values are missing (‘No Label’ data). Since the number of missing categories is so small, I don’t think special feature engineering to fill them is worth it; I will keep them as ‘No Label’ and see how the model performs.

Analyzing expensive brands

The expensiveness of a brand is determined by the median price of that brand, so the prices you see in the following graph are the median prices of the corresponding brands. A more pythonic way of explaining this is as follows,
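Roughly, in pandas (assuming the dataframe train from before):

    # median price per brand, sorted to get the 15 most expensive brands
    top15_expensive = (train.groupby('brand_name')['price']
                            .median()
                            .sort_values(ascending=False)
                            .head(15))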

Top 15 expensive brands

Top 15 brands Word cloud

Top 15 brands’ Item description word cloud
Top 15 brands’ Item name word cloud

Top 10 expensive brands for each category

Note that the prices below are the median prices.

For each general category, I took the median price of each brand and found the top 10 brands per category.
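A sketch of the same idea per category, assuming the split general-category column is named general_category:

    # median brand price within each general category, then the top 10 brands per category
    brand_median = (train.groupby(['general_category', 'brand_name'])['price']
                         .median()
                         .reset_index())
    top10_per_category = (brand_median.sort_values('price', ascending=False)
                                      .groupby('general_category')
                                      .head(10))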

Top 10 expensive brands for each category

Analyzing interaction effects of more than one feature on the price

Brand & Condition -> Price

x-axis: brands; y-axis: item condition; color temperature: prices

This interaction between the top 15 expensive brands (by median price) and item condition shows that these expensive brands don’t have any items in ‘Poor’ condition; most of their items are sold in conditions 2 and 3.

The brand ‘demdaco’ is the costliest brand, followed by ‘auto meter’ and ‘proenza schouler’.

Brand & Category -> Price

x-axis: brands; y-axis: general category; color temperature: prices

Category & Condition -> Price

x-axis: general category; y-axis: item condition; color temperature: prices

The interaction between category and item condition shows that Men, Women and Home products sell at the highest prices in condition 1 (‘New’), while Electronics sell at the highest prices in conditions 4 and 5 (‘Fair’ and ‘Poor’).

Category -> price

Observing the interaction of the general category with price, it is seen that the Men category has the highest-priced products.

From the sub-category and price interaction, it is seen that Camera & Photography is the costliest sub-category, followed by Computers & Tablets. The cheapest ones are Paper Goods and Quilts.

5. Feature Engineering

“Coming up with features is difficult, time-consuming, requires expert knowledge. ‘Applied machine learning’ is basically feature engineering.” — Prof. Andrew Ng

We will do the following feature engineering steps on the dataset.

  • Filling all missing data with some default values
  • Fill missing brand names
  • Pre-process text data
  • Split categories into three parts
  • Adding Item Description and Item Name word counts
  • Add cheap or expensive brands feature (binary data)
  • Vectorizing the text and categorical data

Filling all missing data with some default values

This can simply be done in pandas by using the fillna() function.

filling missing values with a default value
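A minimal sketch of what I mean; the exact placeholder strings are illustrative:

    # fill missing values with simple default placeholders
    train['brand_name'] = train['brand_name'].fillna('missing')
    train['category_name'] = train['category_name'].fillna('No Label/No Label/No Label')
    train['item_description'] = train['item_description'].fillna('missing')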

Fill missing brand names

During the EDA, we saw that sellers mention the brand name of the item in both the Name and the Description sections, so I could extract the missing brand name from either of them. However, I chose to use only the Item Name, because I found that in some descriptions sellers compare their product with other brands, so there was a chance I would extract the wrong brand name.

It is better to have no data rather than having wrong data.

We will extract the brand name as follows,

  • Get the list of existing brand names
  • Create a dictionary of brand_name -> category. This will be used to check whether the “guessed” brand name belongs to the same category as the listing we are filling. This is for extra caution.
  • A runnable Python sketch of the fill logic (existing_brands is the list from the first step and brand_categories is the dictionary from the second):

    def fill_brand(item_name, brand, category):
        # keep the brand if it is already present
        if brand != 'missing':
            return brand
        # otherwise, look for a known brand inside the item name and
        # accept it only if its category matches the listing's category
        for candidate in existing_brands:
            if candidate in item_name and brand_categories.get(candidate) == category:
                return candidate
        return 'missing'

Using this I could fill 135,274 missing brand names, which is around 27% of the originally missing brands.

Pre-process text data

The following steps were done to pre-process the text; a minimal code sketch follows the list.

  • De-contract the text. For example, don’t -> do not; I’m -> I am; I’ll -> I will
  • Remove punctuations and extra spaces
  • Remove stop words
  • Apply Stemming on the text
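A minimal sketch of these steps using NLTK; the contraction list shown is only a tiny illustrative subset, and the stop word list requires nltk.download('stopwords') once:

    import re
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    STOPWORDS = set(stopwords.words('english'))
    STEMMER = PorterStemmer()
    CONTRACTIONS = {"don't": "do not", "i'm": "i am", "i'll": "i will"}  # illustrative subset

    def preprocess(text):
        text = text.lower()
        for short, full in CONTRACTIONS.items():          # de-contract
            text = text.replace(short, full)
        text = re.sub(r"[^a-z0-9\s]", " ", text)          # remove punctuation
        text = re.sub(r"\s+", " ", text).strip()          # remove extra spaces
        words = [w for w in text.split() if w not in STOPWORDS]   # remove stop words
        return " ".join(STEMMER.stem(w) for w in words)           # stemming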

Split categories into three parts

It’s pretty easy to split the category into three parts using a single line of code,

Adding Item Name and Item Description word counts

The length of the text can be a deciding factor for the price of the product.

It is always a good idea to scale numeric data when using it for machine learning tasks. Hence, I also standardized the word counts (zero mean, unit variance) so that they are on a comparable scale.
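A sketch of both steps, using sklearn’s StandardScaler for the scaling:

    from sklearn.preprocessing import StandardScaler

    # word counts of the two text fields
    train['name_word_count'] = train['name'].str.split().str.len()
    train['desc_word_count'] = train['item_description'].str.split().str.len()

    # put the counts on a comparable scale (zero mean, unit variance)
    scaler = StandardScaler()
    train[['name_word_count', 'desc_word_count']] = scaler.fit_transform(
        train[['name_word_count', 'desc_word_count']])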

Add cheap or expensive brands feature

This is a totally new feature, which tells us whether a brand is expensive or not. If you think about it logically, expensive brands cost more than cheaper ones, duh! This feature helped me a lot in predicting the prices. Before showing you the algorithm, let us understand how one aspect of a product plays a big role in deciding its expensiveness.

Understand that the expensiveness of a product is decided not just by its price but also by its category. For example, a $200 T-shirt in the Men category is very [very] expensive, while a laptop at the same price in the Electronics category is considered a steal! So, you see, the category of a product shifts its price bracket.

Now, talking about our dataset: each category has many items at various prices, so there has to be a price threshold to determine whether a product is expensive. What price should be used for each category? I chose the 95th percentile price of each category. The 95th percentile was chosen after analyzing how the prices change across percentiles; it seemed to be a decent threshold.

First I created a dictionary with general categories as keys and their respective 95th percentile prices as threshold values.

Creating dictionary which has threshold price for each category
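In pandas this boils down to one grouped quantile, assuming the general_category column from the split step:

    # threshold price for each general category = its 95th percentile price
    category_thresholds = (train.groupby('general_category')['price']
                                .quantile(0.95)
                                .to_dict())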

Next, I had to create a list of expensive brand names. Keep in mind that this list is derived from the training dataset only. The algorithm to create such a list is as follows,

Creating a list of expensive brands

Now we can use this list to create a new feature in the dataset: we check the listing’s brand name, and if it exists in the expensive_brands list we put a value of 1, otherwise 0. This list of expensive_brands is static in nature.
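Putting the two steps together, a rough sketch of one way to build the list and the binary feature (the exact rule may differ slightly from what I actually used; here a brand counts as expensive if its median price within a category exceeds that category’s threshold):

    # brands whose median price in some category exceeds that category's threshold
    brand_medians = (train.groupby(['general_category', 'brand_name'])['price']
                          .median()
                          .reset_index())
    is_over = brand_medians.apply(
        lambda row: row['price'] > category_thresholds[row['general_category']], axis=1)
    expensive_brands = set(brand_medians.loc[is_over, 'brand_name'])

    # binary feature: 1 if the listing's brand is in the (static) expensive list, else 0
    train['is_expensive'] = train['brand_name'].isin(expensive_brands).astype(int)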

Vectorizing the text and categorical data

A machine does not understand ‘text’ as it is; we need to convert it to some numeric form so that we can apply mathematical models on top of it. This is called ‘vectorization’. There are numerous ways to vectorize text data. Here, I used a Tf-idf vectorizer with bi-grams and max_features=100,000. Another popular choice is CountVectorizer. These are also called ‘Bag of Words’ vectorizations. How Tf-idf and CountVectorizer work is beyond the scope of this blog, but I have attached links in the references section if you want to learn more.
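A sketch of how the final feature matrix can be assembled; the exact columns and settings I used may differ slightly, but the shape of the code is the same:

    import numpy as np
    from scipy.sparse import hstack, csr_matrix
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.preprocessing import OneHotEncoder

    # TF-IDF with uni- and bi-grams, capped at 100,000 features
    tfidf = TfidfVectorizer(ngram_range=(1, 2), max_features=100000)
    X_text = tfidf.fit_transform(train['name'] + ' ' + train['item_description'])

    # one-hot encode the categorical columns
    ohe = OneHotEncoder(handle_unknown='ignore')
    X_cat = ohe.fit_transform(train[['brand_name', 'general_category',
                                     'sub_category1', 'sub_category2']])

    # stack everything, together with the numeric/binary features, into one sparse matrix
    X = hstack([X_text, X_cat,
                csr_matrix(train[['name_word_count', 'desc_word_count', 'shipping',
                                  'item_condition_id', 'is_expensive']].values)]).tocsr()
    y = np.log1p(train['price'])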

6. Traditional Machine Learning Models

The train-cv split was 80:20.

Before discussing the models and their scores, a quick reminder that it is very (x10) important to do hyper-parameter tuning on any machine learning model. All the cv scores below were obtained only after thorough hyper-parameter tuning of each model.
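To make that concrete, a hedged sketch of tuning the simplest model with GridSearchCV; X is the sparse feature matrix assembled at the end of the feature engineering section and y is log1p(price), so minimizing RMSE on y is the same as minimizing RMSLE on the price:

    from sklearn.linear_model import Ridge
    from sklearn.model_selection import GridSearchCV

    param_grid = {'alpha': [0.5, 1.0, 2.0, 4.0, 8.0]}   # illustrative grid
    search = GridSearchCV(Ridge(), param_grid,
                          scoring='neg_root_mean_squared_error', cv=3)
    search.fit(X, y)
    print(search.best_params_, -search.best_score_)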

Lasso Regressor: This is a Linear Regression model with L1 regularization. It gave a score of 0.7016 on the cv set, which does not look very good. The reason may be that L1 regularization is a very strong form of regularization by nature: it brings sparsity to the model, meaning it can reduce the importance/weight of a feature all the way down to zero and thus discard some features entirely.

Ridge Regression: This is also a Linear Regression model, but with L2 regularization. L2 regularization reduces the weights of less important features, but unlike L1 it never reduces them to zero. This gave a cv score of 0.4385, which is fairly good compared to Lasso.

SGD Regressor: This is a linear regression model trained with the ‘stochastic gradient descent’ approach, unlike Ridge and Lasso above, which use ‘solvers’. It gave better results than the two regressions above, with a cv score of 0.4359.

LGBM: LightGBM is a light, fast implementation of gradient boosting; it is a tree-based ensemble model. The cv score was 0.4599, which is no better than the three models discussed above. The reason lies in the nature of the model: tree-based models tend not to work well on data with a very large number of features, and here we have 200,000+ features, so it was bound not to give better results.

FM_FTRL: This model uses Factorization Machines (FM) trained with the Follow The Regularized Leader (FTRL) approach. To implement it, a library named Wordbatch was used. Here I used 1,000 latent dimensions for the FM, meaning each feature gets a 1,000-dimensional latent vector and pairwise feature interactions are modeled through these vectors, with the weights learned via FTRL. If you want to learn how FM_FTRL works, reference links are attached at the end. This gave better results than all the models above, with a cv score of 0.4265.

Stacking Regressor: This is another type of ensemble model, with several base learners and one meta learner. The base estimators make predictions individually, and a meta estimator then takes those predictions as features and makes the final prediction; hence the name Stacking.
I used ridge, sgdr and lgbm as the base estimators and an xgb regressor as the meta estimator. This gave a better result than the other three regression models (except FM_FTRL), with a cv score of 0.4283. I couldn’t include FM_FTRL in the stack, because its implementation lacks certain elements/functions required for cloning the model object during stacking in sklearn.
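A minimal sklearn sketch of such a stack; the hyper-parameters are omitted here, while the real models were of course tuned:

    from sklearn.ensemble import StackingRegressor
    from sklearn.linear_model import Ridge, SGDRegressor
    from lightgbm import LGBMRegressor
    from xgboost import XGBRegressor

    stack = StackingRegressor(
        estimators=[('ridge', Ridge()),
                    ('sgdr', SGDRegressor()),
                    ('lgbm', LGBMRegressor())],
        final_estimator=XGBRegressor())
    stack.fit(X, y)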

Voting Regressor: I have used two types of Voting mechanism.

  • Sklearn’s voting regressor: Here I used ridge, sgdr, lgbm and the stacking regressor as the base estimators (There is no meta estimator in voting models). This model performs even better than the standalone stacking regressor, giving a cv score of 0.4270.
  • Manual voting: Here I had to weigh the predictions of more than 1 models manually (best model gets more weight). After trying several combinations, I chose FM_FTRL, stacking regressor and lgbm regressor as the three models and each of them was weighed according to its individual performance. The following weights were used for the three models,
    fm_ftrl*0.60 + sstacking*0.18 + lgbm*0.22
    Surprisingly LGBM model proved to be very useful here even with the least weight. None of the other three models could take its place in giving better final results, giving a cv score of 0.4210.
CV scores for each model

All these good looking scores, BUT…

Looking at these cv scores, I submitted the manual voting model, the best model so far (or so it seemed), to Kaggle. However, on Kaggle’s stage 2 test dataset of 3.4 million listings it gave a final submission score of 0.62143, which was disappointing. After all, what matters is not the cv score but the final submission score on Kaggle; that reflects the true potential of your model.

7. Deep Learning Models

Finally, I turned to Deep Learning (DL) techniques. The following are the models I experimented with:

  • GRU model using Glove word embeddings
  • 1D CNN model using Glove word embeddings
  • GRU using Fast Text word embeddings
  • 1D CNN model using Fast Text word embeddings
  • Combination of CNN and GRU in a single model
  • Simple Multi-Layered Perceptron (MLP) model
  • Ensemble of best models

Intro to Word Embeddings

We saw earlier that we can vectorize text using Bag of Words (BOW). If you look at how it works, you will find that BOW vectorization does not capture the context of words; we simply compute and assign a number to each word individually. The best we can do is n-grams (like the bi-grams I used, which consider 2 words at a time), but vectorizing higher-order n-grams is too costly to train and store because of the large number of resulting features. And vectorizing with a large n would in turn hurt the model’s performance, because the resulting feature space becomes extremely sparse; this is the so-called ‘curse of dimensionality’. (The theory is again beyond the scope of this blog; links are attached at the end.)

So researchers found a much more efficient way to represent a word that also captures its context: word embeddings. There are several types of word embeddings, like Word2Vec, Glove, Fast Text, etc. Each is trained separately on humongous datasets (like all of Wikipedia’s text) using deep learning models, resulting in a fixed-dimensional representation of each word (100D, 200D, 300D and so on). We can easily download the word embeddings file, which is essentially a dictionary with words as keys and their n-dimensional vectors as values.
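A typical way to use such a file in Keras is to build an embedding matrix for our own vocabulary. A sketch, assuming a Keras Tokenizer has already been fitted on our text (tokenizer) and a 200D Glove file has been downloaded (the file name is illustrative):

    import numpy as np

    EMBED_DIM = 200
    glove = {}
    with open('glove.6B.200d.txt', encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            glove[parts[0]] = np.asarray(parts[1:], dtype='float32')

    # one row per word in our vocabulary; unknown words stay as zero vectors
    embedding_matrix = np.zeros((len(tokenizer.word_index) + 1, EMBED_DIM))
    for word, idx in tokenizer.word_index.items():
        if word in glove:
            embedding_matrix[idx] = glove[word]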

GRU model using Glove word embeddings

Here, for the text representation I used Glove embeddings of 200 dimensions, passed through a GRU layer. For the categorical features, I used an Embedding layer of 32 dimensions. This was a simple model without much complexity in it; I just wanted to see how it works out. The full architecture is portrayed as follows,

DL architecture using GRU (Glove word embeddings)

[Link to image, in case the above image is not clear. Open the link, then right click on the imgur image to open the image in new tab, then click to zoom]
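For readers who cannot open the image, here is a rough Keras sketch of this kind of architecture. It reuses the embedding_matrix idea from above; the sequence lengths, layer sizes and the brand vocabulary size are illustrative, not the exact values I used:

    from tensorflow.keras import layers, Model

    name_in = layers.Input(shape=(10,), name='name')        # padded word-id sequences
    desc_in = layers.Input(shape=(75,), name='item_desc')
    brand_in = layers.Input(shape=(1,), name='brand')
    num_in = layers.Input(shape=(5,), name='numeric')        # condition, shipping, counts, is_expensive

    # text branches: Glove-initialised embeddings followed by GRUs
    text_emb = layers.Embedding(embedding_matrix.shape[0], 200,
                                weights=[embedding_matrix])
    name_feat = layers.GRU(8)(text_emb(name_in))
    desc_feat = layers.GRU(16)(text_emb(desc_in))

    # categorical branch: a small learned 32D embedding
    brand_feat = layers.Flatten()(layers.Embedding(5000, 32)(brand_in))

    x = layers.concatenate([name_feat, desc_feat, brand_feat, num_in])
    x = layers.Dense(128, activation='relu')(x)
    out = layers.Dense(1)(x)                                 # predicts log price

    model = Model([name_in, desc_in, brand_in, num_in], out)
    model.compile(optimizer='adam', loss='mse')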

This architecture gave me a cv score of 0.4885. On the Kaggle test dataset (stage 2; 3.4 million listings), it gave a submission score of 0.48415. This looks good compared to the traditional machine learning models, so I could now work on improving this score further.

1D CNN model using Glove word embeddings

As I was confident that the DL models would work better, at least compared to the traditional ML approaches, I decided to use a 1D CNN layer in my architecture. 1D CNN layers are typically used for text data; they convolve in one dimension only (and text is one-dimensional). For the text representation I used Glove embeddings of 200 dimensions, passed through a 1D CNN layer followed by a MaxPooling layer. Here, I made the model a little more complex by using skip connections and concatenations. The item_condition, shipping and is_expensive features are low-cardinality features; these, together with the numeric data (the item description and name word counts), are fed directly into the model, and a skip connection carries them deep into the network. The full architecture is as follows,

DL architecture using 1D CNN (Glove word embeddings)

[Link to image]

The architecture might look scary at first, but trust me it is very easy to grasp. Just follow the arrows!
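If the image is hard to read, this compact Keras sketch shows the key idea: a Conv1D + pooling block over the embedded text, with the low-cardinality and numeric inputs concatenated in once early and once again deeper in the network (the skip connection). All shapes and sizes are illustrative:

    from tensorflow.keras import layers, Model

    desc_in = layers.Input(shape=(75,), name='item_desc')
    low_card_in = layers.Input(shape=(3,), name='low_cardinality')   # condition, shipping, is_expensive
    num_in = layers.Input(shape=(2,), name='word_counts')

    emb = layers.Embedding(embedding_matrix.shape[0], 200,
                           weights=[embedding_matrix])(desc_in)
    conv = layers.Conv1D(64, kernel_size=3, activation='relu')(emb)
    pooled = layers.GlobalMaxPooling1D()(conv)

    x = layers.concatenate([pooled, low_card_in, num_in])
    x = layers.Dense(128, activation='relu')(x)
    # skip connection: re-inject the raw low-cardinality and numeric inputs deeper in the network
    x = layers.concatenate([x, low_card_in, num_in])
    x = layers.Dense(64, activation='relu')(x)
    out = layers.Dense(1)(x)

    model = Model([desc_in, low_card_in, num_in], out)
    model.compile(optimizer='adam', loss='mse')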

This architecture gave me a submission score of 0.44844 on the Kaggle test dataset, even better than the simple GRU model I tried before.

GRU using Fast Text word embeddings

As there are several other types of word embeddings, as discussed earlier, I decided to use Fast Text word embeddings of 300D this time. I used the same architecture as described above, but with a GRU layer instead of the 1D CNN layer. Other than that, the whole architecture is exactly the same.

DL architecture using GRU (Fast Text word embeddings)

[Link to image]

This architecture gave me a submission score of 0.45643 on the Kaggle test dataset. This was a decent score, but not as good as the 1D CNN using Glove.

1D CNN model using Fast Text word embeddings

Here, I used Fast Text word embeddings of 300D with the 1D CNN model. The model architecture is exactly the same as the 1D CNN with Glove, hence I am not showing it again.

This architecture gave me a submission score of 0.43937 on the Kaggle test dataset. This was the best score I had gotten so far!

Combination of CNN and GRU in a single model

As the Fast Text word embeddings were performing better, I decided: why not merge the 1D CNN and the GRU in one architecture and see if it improves the score?

The model architecture is as follows,

CNN and GRU in sequential manner (Fast Text)

[Link to image]

This architecture gave me a submission score of 0.46721 on the Kaggle test dataset. This is a decent score, but not very good compared to the other models we have seen so far. You can see above that I used the CNN and GRU blocks in sequential order. As I did not get a good score with this architecture, I tried concatenating the CNN and GRU blocks instead, to see how that works out. The model architecture is as follows,

CNN and GRU in concatenation (Fast Text)

[Link to image]

This architecture gave me a submission score of 0.45951 on the Kaggle test dataset. This is a decent score, and a bit better than the sequential architecture shown just before.

Simple Multi-Layered Perceptron (MLP) model

“Life is really simple, but we insist on making it complicated” — Confucius

After trying these complicated architectures, I thought: why not try a very simple model using only a few dense layers? A simple multilayer perceptron network. There are no word embeddings here, just the simple BOW vectorizers that we used earlier.

Simple MLP architecture
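A hedged sketch of what such an MLP looks like, feeding the sparse BOW matrix X built in the feature engineering section directly into a few dense layers (layer widths are illustrative):

    from tensorflow.keras import layers, Model

    inp = layers.Input(shape=(X.shape[1],), sparse=True)   # the sparse TF-IDF / one-hot matrix
    x = layers.Dense(256, activation='relu')(inp)
    x = layers.Dense(128, activation='relu')(x)
    x = layers.Dense(64, activation='relu')(x)
    out = layers.Dense(1)(x)                               # predicts log price

    model = Model(inp, out)
    model.compile(optimizer='adam', loss='mse')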

After training this model, I got a submission score of 0.43909 on the Kaggle test dataset. Wow, this is amazing! Such a simple architecture gave much better results than any of the complex architectures discussed above.

Well, Confucius was truly a far-sighted man!

Ensemble of best models

Obviously, I wasn’t going to settle down for this score. Hence, I decided to use an Ensemble of the best DL models so far. I used the Simple MLP, Fast Text CNN and Glove CNN models. I did manual voting on each predictions such that the best one gets a higher vote. The voting was as follows,

The results were the best of all! This gave me a submission score of 0.41689 on the Kaggle test dataset.

Best score submission on Kaggle

This score falls within the top 2.77% of the private leaderboard, and my position would have been 66th.

My [could have been] position on Kaggle private leaderboard

8. Deployment

I created a web app using Flask and deployed the model.
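The app itself is quite small. A minimal sketch of the kind of endpoint it exposes; preprocess_and_vectorize here stands in for the same preprocessing/vectorization pipeline used during training (not shown), and model is the trained ensemble:

    import numpy as np
    from flask import Flask, request, jsonify

    app = Flask(__name__)

    @app.route('/predict', methods=['POST'])
    def predict():
        listing = request.get_json()                    # name, description, brand, category, ...
        features = preprocess_and_vectorize(listing)    # illustrative helper, not shown here
        log_price = float(model.predict(features).ravel()[0])
        return jsonify({'suggested_price': float(np.expm1(log_price))})

    if __name__ == '__main__':
        app.run()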

Here is a short video demo of my deployed application.

Deployment Video Demo

9. Experiments that did not work well

Box-cox transformed prices

When I transformed the prices with Box-Cox, I got a fitted λ value of -0.2435. Training the model on the Box-Cox values as targets was pretty straightforward. However, when I tried to convert the predicted values back into real prices with an inverse Box-Cox transform, I got nan values for some prices.

The reason is that, with a negative λ, the maximum possible transformed value is -1/λ. The inverse transform of any value greater than -1/λ would be a complex number, and hence comes out as nan.
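A small demonstration with SciPy, using the fitted λ from above:

    from scipy.special import inv_boxcox

    lam = -0.2435
    print(-1 / lam)              # ~4.11, the largest value the transform can produce
    print(inv_boxcox(4.0, lam))  # below the limit: a valid (very large) price
    print(inv_boxcox(4.2, lam))  # above the limit: nan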

BERT

Instead of word embeddings, I tried sentence embeddings using BERT. However, training with BERT (and even DistilBERT) was far too time-consuming: it took 4 hours to complete a single epoch (on a GPU, obviously)! There were out-of-memory issues too, because of the large dataset.

10. Future Work And Scope For Improvements

  • Instead of the binary cheap vs. expensive feature, we can have multiple levels of “expensiveness” based on percentiles.
  • We can try other types of word embeddings, like Word2Vec.
  • We can try many other feature engineering methods, for example merging the description and name columns and then using word embeddings on the result.
  • Try web scraping to get more data.

11. References

Link to my GitHub Repo

LinkedIn profile
