Mercari Price Suggestion: a Kaggle competition problem

Jan 23 · 26 min read

Here I will share how I solved this problem.

So let’s start —

Here is the roadmap; I will walk through each section one by one.

  1. Introduction
  2. Business Problem
  3. Mapping to ML/DL problem
  4. Understanding the data
  5. First cut solution
  6. EDA
  7. Feature Engineering
  8. Modelling
  9. Results
  10. Conclusions and Future Work
  11. Profile
  12. References

Let’s dive in and understand each section.

1. Introduction —

Mercari is a shopping app from Japan. They want to suggest prices to sellers, but this is not easy, so they want a system that suggests prices for products automatically. To that end, they hosted this problem on Kaggle as a prize competition.

Basically, I have to predict the price of a product given some information about it, such as its description, name, and brand name.

2. Business problem —

One of these sweaters costs $335 and the other costs $9.99. Can you guess which one is which?

sweater_1 — ABSOLUTE DEFENSE Avengers Hoodies for Men Women Casual Stylish Sweatshirt Regular fit Winter Jacket Boy Girl Hoodie.

sweater_2 — ADRO Heartbeat Love Mom & Dad Design Printed Hoodie/Sweatshirt for Men & Women.

This is the problem we have to solve: suggesting a price for a product given some information about it.

Product pricing is becoming more and more important for big e-commerce companies. When sellers list products on an e-commerce website, they often set prices that do not match the product, and as a result the product does not sell. (Suppose someone lists his ten-year-old Hero bike at a price of 40,000; obviously nobody wants to buy it, because a new Hero bike itself costs almost 50,000.) When a product does not sell, it sits on the website for a long time, so the company that owns the website cannot make a profit from it. If there are many such products, the company has to bear the loss; a problem like this can even sink the company. To solve it, we will build an ML model that predicts the price based on a product's features, such as its description, using data on which we will train our model.

3. Mapping to ML/DL problem —

Here the label is a real value, so this is a regression problem: given a product's features, we have to predict its price.

a. The difference between the predicted price and the actual price should not be very high; otherwise there is no point in deploying an ML model.

b. There are no strict latency constraints; a prediction can take 2 to 5 seconds.

c. Interpretability is important.

Since this is a regression problem, we can use RMSE, RMSLE, MAE, or R-squared as performance metrics.

4. Understanding the Data —

I have downloaded the data from kaggle competition itself.

You can download the data from Here.

Here is the info about data —

There are eight columns in the dataset.

1. Train_id — this is the id of each product. ex. 1,2,3 etc

2. Name — name of the product. Ex. MLB Cincinnati Reds T Shirt Size XL. It is text data.

3. Item_condition_id — this is the rating of a product by its condition. Ex. 1,2,4 etc.

4. Category_name — category of the products. Ex. Men/Tops/T-shirts. It is categorical data.

5. Brand_name — brand name of the products. Ex. adidas. It is a categorical column.

6. Price — price of the products, this is the target feature.

7. Shipping — whether the shipping fee is paid by seller or buyer. It will be 0 or 1.

8. Item_description — this is the description of the products. Ex. Striped Men Hooded Neck Red, Black T-Shirt. It is text data.

5. First cut solution —

  1. First I will prepare the data, meaning I will clean and preprocess it.
  2. I will do exploratory data analysis and remove outliers based on visualizations.
  3. Preprocess the text data and convert it into vectors using TF-IDF, BoW, or a word2vec model trained on our own text corpus. For TF-IDF vectorization I can use sklearn's TfidfVectorizer.
  4. I will engineer some features based on this paper, e.g. counting the words in the product description as one feature, or summing the term frequencies of all words in a description and dividing by the sentence length; we will generate more features like this based on analysis.
  5. Then I will build a random (dumb) model and compute its metric score, so that we can check how much better our subsequent models are.
  6. After that I will train some regression models such as ridge regression, and then I will also build an MLP model.
  7. To train the MLP model I will first train a word2vec CBOW model on the text descriptions. From each description I will build training pairs: the focus word is the label and the context words are the features (the context window size can be chosen based on analysis). To featurize a context word, I create a binary vector of vocabulary size (the words in the description column), with a 1 at that word's index and 0 everywhere else. If I choose 5 context words, I do this for each of them and concatenate the vectors; so with vocab_size = 50 and 5 context words, the whole input vector has length 250. I apply the same one-hot encoding to the focus word. The network has a hidden layer of size n (n being the size of each word representation vector) and an output dense layer of vocab_size nodes with a softmax activation function.
  8. After training this model, I will extract the weight matrix between the hidden layer and the output layer, of shape (vocab_size, n), which contains a representation vector of size n for each word.
  9. Using this word2vec model I will convert each row of the description column into an n-dimensional vector, and then train a simple MLP model on this new data.
  10. Based on the results I will iterate further.
  11. Basically I will have to do a lot of hyperparameter tuning.
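The context/focus pair construction in step 7 can be sketched in a few lines; the tiny corpus and the helper names below are my own, purely for illustration:

```python
import numpy as np

# Toy stand-in for the preprocessed item descriptions.
corpus = [["red", "cotton", "shirt"], ["blue", "denim", "jacket"]]
vocab = sorted({w for sent in corpus for w in sent})
word2idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)  # vocabulary size

def one_hot(word):
    # Binary vector of vocab size with a 1 at the word's index.
    v = np.zeros(V)
    v[word2idx[word]] = 1.0
    return v

def cbow_pairs(sentence, window=1):
    """Build (concatenated context one-hots, focus one-hot) pairs."""
    pairs = []
    for i, focus in enumerate(sentence):
        context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
        if context:
            x = np.concatenate([one_hot(w) for w in context])
            pairs.append((x, one_hot(focus)))
    return pairs

pairs = cbow_pairs(corpus[0])
```

With the post's numbers (vocab_size = 50, 5 context words) the concatenated input would be of length 250; the CBOW network then maps this input through a hidden layer of size n to a softmax over the vocabulary.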

6. EDA —

  • Text data —

First I preprocess the text data. The text columns in our data are item_description and name.

For preprocessing the text data, I define two functions: decontracted and preprocessing_text_data.

Decontracted function — this function converts short forms in the data to their full forms, e.g. won't to will not, and many more. You can see the code below —
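The original gist is not rendered here, so this is a minimal sketch of what such a decontracted function typically looks like; the exact substitution list in the original code may differ:

```python
import re

def decontracted(phrase):
    # Expand common English contractions; this list is illustrative.
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can't", "can not", phrase)
    phrase = re.sub(r"n't", " not", phrase)
    phrase = re.sub(r"'re", " are", phrase)
    phrase = re.sub(r"'s", " is", phrase)
    phrase = re.sub(r"'d", " would", phrase)
    phrase = re.sub(r"'ll", " will", phrase)
    phrase = re.sub(r"'ve", " have", phrase)
    phrase = re.sub(r"'m", " am", phrase)
    return phrase
```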

Preprocessing_text_data — this function removes everything except a-z and A-Z, removes stop words, and strips special characters. Basically, the character filtering can be achieved by this one line —

But I have written more code than that; you can comment out some lines if you want. Here I also lemmatize words so that meaningful root words are generated. The function returns the preprocessed texts. You can see its code here —
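Since the gist is not rendered here, this is a minimal sketch of such a function; the stop-word set is an illustrative subset and lemmatization is omitted:

```python
import re

STOPWORDS = {"a", "an", "the", "is", "are", "in", "of", "and", "to"}  # illustrative subset

def preprocessing_text_data(text):
    # Keep only letters (the "one line" mentioned above), lower-case,
    # then drop stop words; lemmatization is left out of this sketch.
    text = re.sub("[^a-zA-Z]", " ", text).lower()
    return " ".join(w for w in text.split() if w not in STOPWORDS)
```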

After writing these two functions, I pass each text column to Preprocessing_text_data one by one; the function returns a list of the same length with preprocessed values, and finally I save all columns back to the dataframe. You can see the code for this below —

  • Categorical data —

Now I will move on to the categorical data. I have two categorical columns in my data: category_name and brand_name.

I am lower-casing the brand_name column so that all brand names are in lower case. I am doing this because models would otherwise count "Adidas" and "adidas" as different brand names.

Let’s move on to category_name. On the category_name column I do the same thing, converting all categories to lower case. During preprocessing I did some analysis and found that the category column has at most 5 categories and at least 3, so I decided to split the category column into five columns per row (since a row can have at most 5 categories). Let's see this visually —

I think it is understandable now. I am doing this so that I can analyze the data with respect to category and sub-category, and it is now also easier to featurize. You can see the code here —
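The split can be done with pandas; the two toy rows and the column names below are my own illustration of the idea:

```python
import pandas as pd

df = pd.DataFrame({"category_name": ["men/tops/t-shirts",
                                     "women/jewelry/necklaces/chains"]})

# Split category_name on "/" into up to five sub-category columns,
# filling the missing levels with "other".
subcats = (df["category_name"].str.split("/", expand=True)
           .reindex(columns=range(5))
           .fillna("other"))
subcats.columns = [f"subcat{i + 1}" for i in range(5)]
df = pd.concat([df, subcats], axis=1)
```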

After all this, we move on to the numerical data; there is nothing to do with the numerical columns, so we leave them as they are.

Now I will move on to my next section.

Note — for the code of the analysis and visualizations, check out the GitHub link that I will paste at the bottom of this blog.

Doing EDA on the data and understanding each feature —

First I will see the distribution of price. See plot here —

In this plot I can see that most product prices are under 50; there are very few products with very high prices.

  1. Name column — First I want to understand whether certain words belong to certain price ranges: can I spot words that decide whether the price is low or high? Let's plot the word cloud of product names whose prices are greater than 50.

Here I can see that words like bundle, michael kors, kate spade etc. appear big, so these words are very frequent in products priced above 50.

Let’s see the name words of products priced below 50 —

So this is interesting. The words that were so big in the greater-than-50 word cloud are not here; here the more frequent words are victoria secret, free shipping, etc.

From these two images I can see that there are some words which are impacting price. So this name feature seems to be important in deciding price.

Now I am plotting the price with respect to the number of words in the product’s name. See the plot —

I see that there are few low-priced products with 7, 8, or 9 words in their name; most product names have fewer words, at both low and high prices. This feature does not seem very important: the correlation between price and name length comes out to 0.038, which is not good.

2. Item_condition_id —

First see the countplot of item_condition_id —

There are very few data points with item_condition_id 4 or 5.

I am plotting a scatter plot between item_condition_id and price.

In the above scatter plot, when the condition id is 5 the price is always below 750, while for condition ids 1, 2, and 3 prices range from low to high; it is also clear that there are many more data points with item_condition_id 1, 2, and 3. This feature does not look that important either, because its correlation with price is not high.

When I see the distribution of price for each condition_id then it confirms that this is not going to be that important. See the box plot —

3. Subcat1 —

I have already explained how I generated this feature. Let's see the distribution of price with respect to each subcat1 with the help of a box plot —

I can see in the box plot that for some categories, especially kids and handmade, the price distribution is a little different from the men sub-category, but in between there are some sub-categories whose prices overlap completely, which is not a good sign; still, there are some differences at the extremes. It could matter for price prediction.

Now let’s take one step ahead and plot each subcat1 price distribution with each condition then see if there are any subcat1 which have different price distribution for different conditions. See the plot below —

None of the above plots shows that the price distribution of subcat1 changes much when the condition id changes; there are some differences, but not big ones. In other words, across all conditions the subcategory price ranges are almost the same.

4. Subcat2 —

For the analysis of subcat2 I plotted the distribution of price with respect to the top 20 subcat2 values, and this gives better visuals. Here is the box plot —

In this box plot you see lots of differences in price distributions across subcat2 values. For example, compare the women's handbags and cell phones & accessories price distributions; there is a big difference, so this is an interesting feature.

Let’s see the top 5 subcat2 price distribution with each condition —

Looking at these distribution plots, you will notice that for some subcat2 values each condition id has a somewhat different price distribution. For example, the athletic apparel subcat2 with condition id 1 has a distinct price distribution, shoes also differ between conditions, and jewelry with condition 1 has a different distribution as well.

After seeing this I can say that this is going to be an important feature.

5. Subcat3 —

Here I have plotted the same thing, the price distribution of the top 5 subcat3 values. At the top there is a subcat3 value "other", which means there is no 3rd sub-category for the product.

Let see the box plot —

In this box plot the pants/tights/leggings price distribution is very different from "other" (where no 3rd subcat is given). There is also some difference between the pants/tights/leggings and shoes price distributions. So this is also a good feature.

Again plot price dist. of each top 5 subcat3 with respect to its condition —

There is not much difference in the price distributions, but in the first plot, pants/tights/leggings with condition 1 and condition 4 look different. Apart from that, all the distributions are almost the same.

So let’s move forward.

6. Subcat4 —

This column/feature has six categorical values. See their price distribution —

See the tablet distribution and the ballet distribution; they are totally different, meaning if subcat4 is tablet the price is usually high, and if subcat4 is ballet the price is low. Compare ballet with serving; they are also totally different.

You can do the same analysis for subcat5; I have done it but not included it here.

Now it's time to look at some more features —

7. Brand name —

Talking about the brand_name feature, the first thing I noticed is that it has lots of missing (NaN) values: 42% of the rows in the data do not have a brand name. How to handle it? I will talk about this later; for now, see the analysis with the NaN values included.

Before the analysis I converted the NaN rows to the string "nan", just to treat it as a category.

Now let’s see the top 10 most frequent brand name’s price distribution —

Now look at the distributions. Apple and Michael Kors prices range very wide and go quite high, while pink, nintendo, and forever 21 have low price ranges. Each brand name has its own price distribution, which makes this a very good feature: if the brand is Apple there is a high chance that the price is high, but if the brand is forever 21 there is very little chance of a high price. So this is a good feature for predicting a price range.

Now see price distribution of each top 10 brand name with respect to its condition —

Looking at the above distributions, lululemon with condition 5 has a different distribution from lululemon with condition 1. Similarly for Apple and Michael Kors: compare Michael Kors condition 5 with Michael Kors condition 1 and see how different they are (Michael Kors with condition 1 has higher prices than with condition 5). You can see the same in Apple's price distributions.

Now see, if there is any effect on prices of shipping and brand name. I have plotted the price range of each top 10 brand name with respect to shipping. See box plot first —

Looking at all the box plots, one thing is common: each blue box sits a little above each orange box, which means that when the shipping fee is paid by the buyer, the price is higher. We can also see that the american eagle brand with each shipping value (0 and 1) has a different price range from Michael Kors and lululemon, and there are more such relations.

8. Shipping —

There are two values, 0 means shipping fee is paid by buyer and 1 means shipping fee is paid by seller.

If the shipping fee is paid by the seller, the average price is lower; when it is paid by the buyer, the price is a bit higher. You can see this in the price distribution plot for each shipping value, 0 and 1.

The correlation between price and shipping is almost -0.24, which is quite good.

9. Item description —

This is a pure text feature. I plotted a scatter plot between price and the number of words in item_description and got this plot.

From the plot I can say two things: first, there is less data with long descriptions; second, products with long descriptions have low prices, while products with short descriptions have all kinds of prices, low and high. That's it.

That’s all in the EDA part. You can do more.

Let’s see heatmap of correlation between features and price —

There are some features here that are highly correlated with each other. I will remove one feature from each highly correlated pair, because multicollinearity is not good. Shipping is strongly correlated with price.

Finally I have selected 15 features. Here you can see —

7. Feature Engineering —

From the previous analysis we know that the unavailability of the brand name does affect the price. So what if I make a new binary feature: 1 where brand_name is given and 0 where it is not? (Note that I am not removing the original brand_name feature; I am adding a new one, let's call it brand_name_exist_encoder.) After adding this feature, the correlation between it and price comes out to almost 0.21, which is quite good.
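A sketch of how this feature can be derived (toy rows; the column name follows the post):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"brand_name": ["nike", np.nan, "apple", np.nan]})  # toy rows

# 1 where a brand name is present, 0 where it is missing.
df["brand_name_exist_encoder"] = df["brand_name"].notna().astype(int)
```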

So now we have one more feature brand_name_exist_encoder.

But I have not used this feature, because while training my model my system nearly shut down; it was taking all my RAM. The data is large, almost 1.4 million rows, and I couldn't handle it with this feature included. I did train some models successfully without it, and they give decent results (which I will show you). Using the full data together with this one feature (brand_name_exist_encoder), which has a very good correlation with price, could give even better results, so if your machine has more RAM you can try it.

Before doing featurization I will split my data in train_set and test_set.

Now In this section I will do featurization with three methods.

1. One Hot Encoding

2. TF-IDF Vectorizer

3. Word2Vec

I could also have used TF-IDF-weighted word2vec, but I limited myself to these three.

Basically I have made features with each method.

On the categorical data I apply only one-hot encoding, but to the text I apply all three methods, producing three feature sets —

(train_ohe, test_ohe),(train_tfidf,test_tfidf) and (train_w2v,test_w2v).

NOTE — Here I fit all encoders on the train set only and transform both train and test, to prevent data leakage.

  1. One Hot Encoding —

What does one-hot encoding mean? Suppose I have a column brand_name with three unique values occurring across ten rows. I create three new columns, one named after each unique brand_name value. For the first column, wherever a row's actual brand_name equals that column's name I put a 1, otherwise a 0; I repeat the same process for the remaining two columns. You can understand it better from this image —

In the above image, column A denotes where brand_name A is present and where it is not.

First I am making features with one hot encoding, which means I will apply onehotencoder on both text and categorical data.

I will stack all features, including the numerical ones, and save the result as train_ohe and test_ohe.
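A minimal sketch with sklearn's OneHotEncoder; the tiny brand arrays are illustrative, and handle_unknown="ignore" keeps unseen test categories from breaking the transform:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train_brands = np.array([["nike"], ["apple"], ["nike"]])
test_brands = np.array([["apple"], ["gucci"]])  # "gucci" never seen in train

# Fit on train only and transform both, so nothing leaks from test.
ohe = OneHotEncoder(handle_unknown="ignore")
train_ohe = ohe.fit_transform(train_brands)
test_ohe = ohe.transform(test_brands)
```

Note that an unseen brand becomes an all-zero row rather than an error.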

2. TF-IDF Vectorizer —

I have already one-hot encoded the categorical data, so there is no need to do it again; here I only apply TF-IDF vectorization to the text data (name and item_description).

TF-IDF means Term Frequency × Inverse Document Frequency.

Term frequency of a word = (number of times the word appears in the sentence) / (total number of words in the sentence).

IDF of a word = log((number of sentences in the corpus) / (number of sentences in which the word appears)).

TF gives more importance to words that occur frequently within a sentence, and IDF gives importance to words that occur in very few sentences of the corpus. The log in the IDF formula standardizes the values; without it, the scale of the values would be too large, which is not good for our models.

Let's see an example of the TF-IDF vectorizer —

corpus — "how are you", "you are so funny", "he is funny"

Now compute TF("you" in "how are you") = 1/3 and IDF("you") = log(3/2).

so our tf-idf value for you in sentence “how are you” is (1/3)*log(3/2).

Now I build a vector whose size equals the vocabulary size of the corpus. In my example the vocab size is 7, so the vector looks like (how, are, you, so, funny, he, is). The vector representation of the sentence "how are you" will then be (tf-idf(how), tf-idf(are), tf-idf(you) = (1/3)*log(3/2), tf-idf(so), ..., tf-idf(is)).
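The worked example above can be checked directly with the two formulas (this follows the post's plain definitions, not sklearn's smoothed variant):

```python
import math

corpus = ["how are you", "you are so funny", "he is funny"]

def tf(word, sentence):
    # Fraction of the sentence's words that are this word.
    words = sentence.split()
    return words.count(word) / len(words)

def idf(word, corpus):
    # log(number of sentences / number of sentences containing the word).
    n_containing = sum(word in sent.split() for sent in corpus)
    return math.log(len(corpus) / n_containing)

tfidf_you = tf("you", "how are you") * idf("you", corpus)  # (1/3) * log(3/2)
```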

Again the same thing: I stack all the features, including the numerical ones, and save them as train_tfidf and test_tfidf, so that I don't have to run this again.

3. Word2Vec —

I apply word2vec only to the text data: basically I convert each word of a sentence into a vector and sum the vectors, so the sum represents one text.

seeing below image you will understand —

Now the question is how to generate the d-dimensional vector. Here I chose to convert each word into a 100-dimensional vector using pre-trained GloVe vectors; once you download the pre-trained GloVe model, you can give it a word and it returns a 100-dimensional vector. You can see how to do this in the code. You can download the GloVe vectors from Kaggle, or just search Google for "glove vector 100d kaggle download" and you will find it. Let's see the code —
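A sketch of the sentence-vector step; the two-dimensional toy lookup below stands in for the real 100-d GloVe table loaded from the downloaded file:

```python
import numpy as np

# Toy stand-in for the GloVe word -> vector lookup.
glove = {"red": np.array([1.0, 0.0]), "shirt": np.array([0.0, 1.0])}
DIM = 2  # 100 with the real glove.6B.100d vectors

def sentence_vector(sentence):
    # Sum the vectors of all in-vocabulary words (the post sums them);
    # out-of-vocabulary words are simply skipped.
    vecs = [glove[w] for w in sentence.split() if w in glove]
    return np.sum(vecs, axis=0) if vecs else np.zeros(DIM)

v = sentence_vector("red cotton shirt")  # "cotton" is out of vocabulary
```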

I will save it as train_w2v and test_w2v.

Now all our feature sets are ready, so from here we move on to the first-cut models.

8. Modelling —

In the Kaggle competition they suggest using RMSLE as the metric. Why? Because RMSLE is less affected by outliers. Suppose I have an outlier in the data: if we calculate MSE, the error shoots up, so with MSE (or any metric without a log) we cannot judge from the error whether the model is good or bad. For example, say x_test has three points, all inliers; a good model gives a very low error on them. Now add a fourth x_test point that is an outlier: evaluating the model now gives a very high error, and you would conclude the model is bad. If instead you take the log of the values before computing the error, the log compresses the large values and the resulting error reflects the model more fairly.

You can see in the above example how I might wrongly conclude that my model is bad because of that one outlier point; that is what the log prevents.

Here I do min-max scaling on y_train: I fit the MinMax scaler on y_train only and transform y_train; I do not touch y_test. See the code —

Now, how do I evaluate the model? The model predicts prices in the scaled format while y_test is in the original scale. So I save the min and max of y_train; with them I can convert the predicted output back to the original price scale. Let's see the code for how to do it; suppose for now I have a trained model.
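The scaling and the inverse conversion can be sketched like this (toy prices; MinMaxScaler stores y_train's min and max for us):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

y_train = np.array([[10.0], [35.0], [120.0]])  # toy prices

# Fit only on y_train and transform it; y_test is never touched.
scaler = MinMaxScaler()
y_train_scaled = scaler.fit_transform(y_train)

# At evaluation time, bring a scaled prediction back to the original
# price scale before comparing it with y_test.
y_pred_scaled = np.array([[0.5]])
y_pred = scaler.inverse_transform(y_pred_scaled)  # 10 + 0.5 * (120 - 10)
```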

First I build a random (dumb) model, so that we can see whether our first-cut models are better or worse than it.

In the random model I take the average of y_train and use it as the predicted price for the whole x_test, then calculate the MSE. See the code here —

When you run this, you will get 2046.111599547367.
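The dumb baseline described above fits in a few lines (toy arrays; the real run on the full data produces the figure above):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_train = np.array([20.0, 30.0, 100.0])  # toy prices
y_test = np.array([25.0, 80.0])

# Predict the training-set mean price for every test point.
baseline_pred = np.full_like(y_test, y_train.mean())
baseline_mse = mean_squared_error(y_test, baseline_pred)
```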

Now we have a baseline: any model that does not beat this score is worse than a dumb model.

  1. SGDRegressor model —
    I fit my three train sets (OHE, TF-IDF, Word2Vec) one by one on this model with hyperparameter tuning. I got an MSE of 1185 on the OHE data, 1263 on the TF-IDF data, and 1366 on the W2V data; the OHE data gives the best result.
  2. LGBMRegressor model —
    Here I got an MSE of 1041 on the OHE data, 1002 on the TF-IDF data, and 1065 on the W2V data; this time the TF-IDF data gives the best result.
  3. Ridge Regression model —
    Here I got 1108 on the OHE data, 1157 on the TF-IDF data, and 1303 on the W2V data; the best result is on the OHE data.
  4. Lasso Regression model —
    Here I got an MSE of 1592 on the OHE data, 1672 on the TF-IDF data, and 1644 on the W2V data; the best result is on the OHE data. This is the worst model so far.
  5. XGBRegressor model —
    Here I got an MSE of 1210 on the OHE data, 1236 on the TF-IDF data, and 1278 on the W2V data; the best result is on the OHE data.

I tried these five models as my first-cut models; to improve on these results you can do more hyperparameter tuning, and you should probably get somewhat better results.

Now let's see all the models' performance in one place, and which one gives the best result so far —

Above we can see the best model and its parameters: the best model is LGBMRegressor, which gives an error of 1002, trained on the TF-IDF data with hyperparameters {'boosting_type': 'gbdt', 'learning_rate': 0.1, 'max_depth': 8, 'n_estimators': 200, 'num_leaves': 60}.

Using some stacking techniques —

Ensemble technique —

Moving forward, I implemented ensemble techniques; specifically, I used stacking, which I coded myself, starting with hyperparameter tuning. I tuned the meta-model, the sample size on which the base models are trained, and the choice of base models, and I did this on each feature set (OHE, TF-IDF, and W2V data).

Basically, I wrote a function and called it three times to get three tables with hyperparameters and errors (for the OHE data, the TF-IDF data, and the W2V data). I will show the code and results below —

when I train above stacking regressor on OHE data then I got the results —

on tf-idf data —

on word2vec data —

The best result is 1009, obtained on the OHE data with base models sgd_lgbm_ridge_lasso, an each_sample_size of 250000, and ridge regression as the meta-model.

Stacking using StackingCVRegressor with hyperparameter tuning —

Now I will use StackingCVRegressor, with which I can do hyperparameter tuning more conveniently. Read about StackingCVRegressor here.

see the implementation below —
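The post uses mlxtend's StackingCVRegressor; as a dependency-light sketch of the same idea, sklearn's StackingRegressor with the post's sgd/ridge/lasso base models and a ridge meta-model (LightGBM omitted here) looks like this:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Lasso, Ridge, SGDRegressor

X, y = make_regression(n_samples=200, n_features=10, random_state=42)  # toy data

# Base models feed out-of-fold predictions to the ridge meta-model.
stack = StackingRegressor(
    estimators=[("sgd", SGDRegressor(random_state=0)),
                ("ridge", Ridge()),
                ("lasso", Lasso())],
    final_estimator=Ridge(),
    cv=3,
)
stack.fit(X, y)
preds = stack.predict(X)
```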

when I fit above implementation on OHE data, i got this result —

on tf-idf data —

on word2vec data —

Among the above results, the best is 982, on the OHE data with base models sgd_lgbm_ridge_lasso and ridge as the meta-model.

This is the best result so far.

Now I will fit my word2vec data on a neural network.

Implementing deep learning models —

For a deep learning model, I will first use my word2vec data to train it; I already have the W2V data. Here is the code for the neural network structure and its training —

For training the neural network I use a data loader, because it keeps the memory footprint low at any given time.

Training this neural network on the W2V data gives an MSE of 1011, which is not the best result; remember, the stacking implementation achieved almost 982.

Neural Network with Embedding Layers —

Now I won't use pre-trained word2vec to train the neural network; instead, I will learn my own vector for each word (in the categorical and text data) through an embedding layer.

To train this neural network, I first have to convert each categorical and text column into a sequence of integers, which I did with Keras preprocessing. Suppose I have to convert item_description into sequences: first I fit a tokenizer on the train_item_desc column (tokenizer.fit_on_texts(x_train)), then use texts_to_sequences to convert both train_item_desc and test_item_desc to integer sequences, and finally pad both train and test with the pad_sequences method so that every row of train_item_desc and test_item_desc has the same length. See the image below for more understanding —
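The tokenize-and-pad step can be sketched as follows; the two toy descriptions are my own, and the tokenizer is fit on train only:

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

train_desc = ["striped red t shirt", "leather hand bag"]  # toy stand-ins
test_desc = ["red leather jacket"]                        # "jacket" unseen in train

# Fit the tokenizer on train only, convert both splits to integer
# sequences, and pad them to a common length.
tok = Tokenizer()
tok.fit_on_texts(train_desc)
train_seq = pad_sequences(tok.texts_to_sequences(train_desc), maxlen=4)
test_seq = pad_sequences(tok.texts_to_sequences(test_desc), maxlen=4)
```

Note that without an oov_token, words unseen at fit time (like "jacket") are silently dropped from the test sequences.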

Now that I have encoded all the text and categorical columns, I also stack all the numerical columns (you can see this in the lower part of the code).

Let’s see the code for model structure and fitting the data on model —

Training this neural network gives an MSE of 893, the lowest error so far.

Initially this network overfits the data, so to counter that I used different initializers, dropout, and early stopping.

9. Results —

Below table shows models best performance on the basis of data.

The best model is the neural network with an embedding layer, which gives an error of 893.9278.

I submitted my model's predicted prices to Kaggle and got an RMSLE of 0.65249.

10. Conclusions and Future Work —

  1. We can improve the model by adding one more feature that indicates whether the brand name is present: 1 if present, 0 if not. Doing this may reduce the error.
  2. I have not used TF-IDF-weighted word2vec for featurization; we can try it and see how all the models perform.
  3. To fill in the missing brand_name values, we can use model-based imputation: take all the rows where brand_name is present as train data and the rows where it is missing as test data, fit a model on the train data, and predict for the test data. This will be tricky, because the target variable brand_name is categorical with very many classes; one approach I can think of is to remove from x_train the rows whose brand_names occur very rarely, e.g. fewer than 5 or 10 times, which reduces the number of classes in the target variable.

11. Profile —

My linkedin Profile —

Github link for the code —

12. References —

