Mercari Price Suggestion Challenge Using ML and DL

This case study is based on a Kaggle competition (2018):
https://www.kaggle.com/c/mercari-price-suggestion-challenge

1 . Business Problem

Mercari is an online selling application based in Japan. Sellers upload the products they want to sell, and while listing an item they would like to see a suggested price at which the product should be sold.

Given details about a product, such as its category, brand name, and item condition, the challenge is to find an algorithm that predicts a suitable price.
Automating this suggestion reduces the manual effort spent maintaining prices and makes the shopping app more efficient.

The mission is to predict the price given the features/attributes of a particular product.

Source : https://www.mercari.com/about/

2 . Use of Machine learning / Deep learning to solve the business problem

The data set consists of seven features.

  1. Name: The name of the product that is being listed by the seller.
  2. Item condition id: A number that represents the condition of the product.
  3. Category: The product category, given as a three-level hierarchy separated by ‘/’; I am going to split it into Main_Category/Subcategory1/…
  4. Brand: The brand of the product.
  5. Shipping: A boolean flag in which 1 means the shipping fee is paid by the seller and 0 means it is paid by the buyer.
  6. Item Description: A detailed free-text description of the product, its condition, and everything a buyer should know. This is a very important feature; most Machine Learning and Deep Learning pipelines treat such text as a key signal, and it is what brings NLP techniques into this problem.
  7. Price: The target value to be predicted, ranging from 0 to 2009.

The sample dataset consists of around 1.5 M rows.

Error Metric : Root Mean Square Logarithmic Error(RMSLE)

RMSLE is a good metric for this regression task. For a product it is more acceptable to suggest a slightly higher price than a much lower one, and because RMSLE works on the logarithm of the values it penalises under-prediction more heavily than over-prediction of the same absolute size, while also keeping errors on cheap and expensive items on a comparable scale. Go through the reference below for a better understanding.

Source : https://medium.com
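For concreteness, a minimal sketch of the metric (the helper name rmsle is mine, not from the competition code):

import numpy as np

def rmsle(y_true, y_pred):
    # Root Mean Squared Logarithmic Error:
    # sqrt(mean((log(1 + y_pred) - log(1 + y_true))^2))
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))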

3 . Source : The dataset and the full problem description are available on the Kaggle competition page linked above.

4 . Existing Solutions

  1. Ridge Model : Uses a simple Ridge Regression over Tfidf Text features to generate predictions. Model gives an RMSLE of 0.47.
  2. LGBM Model : Uses LightGBM Regressor to give output score of 0.44.
  3. Sparse MLP : This uses a sparse MLP to generate output over Tfidf Vectorizations of text and One Hot Encoding of categorical features. Model gives an RMSLE of 0.38 (1st place).

5 . Exploratory Data Analysis

Price :

data.price.describe()
(Output: summary statistics of the price column)

From the summary statistics it is clear that the maximum price is 2009 and the mean is about 26.7; the 25th percentile is far below the mean and the 75th percentile is far below the maximum, which already hints at a long right tail.

Now I am going to plot a histogram of the price values.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns  # imports assumed for the plotting snippets throughout this post

data['price'].plot.hist(bins = 50, figsize = (10,5), edgecolor = 'white', range = [0,500])
plt.xlabel('price')
plt.ylabel('frequency')
plt.title('Price Histogram')
plt.show()

Here we can see that the price distribution is heavily right-skewed (the same conclusion as from the summary statistics), and this is exactly why RMSLE was chosen as the error metric. I am therefore going to apply a log transformation to the price target so that the model trains on a roughly symmetric distribution: scale the ‘price’ feature to a log scale and plot the histogram again.

data['logprice'] = np.log(data['price']+1)
data['logprice'].plot.hist(bins = 50, figsize = (10,5), edgecolor = 'white')
plt.xlabel('logprice')
plt.ylabel('frequency')
plt.title('logprice Histogram')
plt.show()

This representation no longer looks skewed.

Shipping Status :

This feature consists of boolean values in which 1 means the shipping fee is paid by the seller and 0 means it is paid by the buyer. Since this is a categorical feature, I am going to draw a box plot and a violin plot of log price against shipping.

sns.violinplot(x=data.shipping, y=data.logprice)
sns.boxplot(x=data.shipping, y=data.logprice)

The plots show that higher-priced items tend to have the shipping charge paid by the buyer, which is the opposite of the usual trend. Since this does not follow an obvious logic on its own, my assumption is that the shipping arrangement depends on the particular brand or category of the item.

Categories :

data['category_name'].value_counts()

The categories form a hierarchy separated by ‘/’, in the order Main Category/Subcategory1/Subcategory2. The Women category has by far the highest number of products.

Next, I am going to break this hierarchy apart and engineer new features from the category column.

# split the category hierarchy into its levels
data[['Main_category','subcategory1','subcategory2','subcategory3','subcategory4']] = data['category_name'].str.split('/', expand=True)

It is better to visualize the top-level categories first, so I am going to draw a box plot of log price by main category.

sns.set(rc={'figure.figsize':(20,9)})
sns.boxplot(x="Main_cateogry", y="logprice", data = data)
plt.title('Boxplot')
plt.show()

The Men category has the highest median price, while the rest of the categories have broadly similar median prices. To see how often each category occurs, I am going to plot bar plots.

df_cat1_counts = pd.DataFrame(data.groupby('Main_category',as_index = False).agg({'shipping' : 'count'}))
df_cat1_counts.columns = ['cat1','count']
df_cat1_counts = df_cat1_counts.sort_values(by=['count'],ascending = False)
sns.set(rc={'figure.figsize':(8,6)}, style = 'whitegrid')
sns.barplot(x = "count", y="cat1", data=df_cat1_counts,palette="Blues_d")

The Women category has the highest frequency, followed by Beauty and then Kids. Now let's visualize the categories through word clouds.

from wordcloud import WordCloud

# build the text corpus from the category names (assumed definition of text_cat)
text_cat = " ".join(data['category_name'].astype(str))
wordcloud = WordCloud(collocations=False).generate(text_cat)
plt.figure(figsize = (12,6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.title('WordCloud for Category Name')
plt.show()

From the word cloud it is clear that Women occurs far more often than any other category, and there is a high occurrence of terms such as beauty, blouse, and makeup.

Brands :

There are about 4,800 unique brands in the dataset, and a large share of the brand cells are missing. I am going to plot a bar chart of the top brands and a word cloud to see their frequencies.

data.brand_name.value_counts()[:10].plot(kind = 'bar',figsize = (20,10), title="Top Brand Names",fontsize=20)
wordcloud = WordCloud(width = 1200, height = 1000).generate(" ".join(data.brand_name.astype(str)))
plt.figure(figsize = (20, 9))
plt.imshow(wordcloud)
plt.axis("off")
plt.title('Word Cloud for Brand Names')
plt.show()

PINK, Victoria's Secret and similar brands occur most often, and the prominent ‘nan’ in the word cloud reflects how many brand cells are null.

Item Description :

Now let's understand the item descriptions by plotting a word cloud.

(Word cloud of the item description column)

This plot shows the most frequent words in the item description column; ‘brand’ and ‘new’ are the most common.

6 . Feature Engineering And Preprocessing

First of all, we are going to replace missing values with sensible defaults:

train['name'] = train['name'].replace([np.nan], ' ')
train['item_description'] = train['item_description'].replace([np.nan,'No description yet'], '')
train["brand_name"]=train["brand_name"].fillna("missing").astype("category")
train["Main_category"]=train["Main_category"].fillna("missing").astype("category")

Now lets look at preprocessing the text data as it is essential to extract any useful information before we apply any model to it. Here are the steps,

  1. Converting all words to lowercase
  2. Removal of stop words
  3. Removing punctuation and special characters
  4. Removing unwanted multiple spaces
  5. Handling Alpha-numeric values and so on.
  6. Decontraction
# reference: appliedaicourse.com
import re
from tqdm import tqdm

def decontracted(phrase):
    # expand common English contractions
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

stopwords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",
             "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself',
             'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',
             'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those',
             'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does',
             'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of',
             'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',
             'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',
             'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
             'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very',
             's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're',
             've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',
             "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',
             "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't",
             'won', "won't", 'wouldn', "wouldn't"]

def text_preprocess(data):
    preprocessed = []
    for sentence in tqdm(data):
        sent = decontracted(sentence)
        sent = sent.replace('\\r', ' ')
        sent = sent.replace('\\"', ' ')
        sent = sent.replace('\\n', ' ')
        sent = re.sub('[^A-Za-z0-9]+', ' ', sent)
        # lowercase first so that capitalised stop words are also removed
        sent = ' '.join(e for e in sent.lower().split() if e not in stopwords)
        preprocessed.append(sent.strip())
    return preprocessed

Pass each item description through this function to get the processed text. Now let's look at a feature engineering trick using the item description.

Item Description Length :

A simple ‘column_name.str.len()’ gives the character length of each string in a column. Create a column named description_length that holds the length of every description, and then let's look at it visually.
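For example (a one-line sketch, using the column names that appear in the plot below):

train['description_length'] = train['item_description'].str.len()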

sns.relplot(x="description_length", y="price", data=train);

From the plot it appears that the price tends to decrease as the description length increases, so this should be a useful feature for predicting the price.

Mean Price For Each Brand :

Now let's compute the mean price for each brand, store it in a new column (each row gets the mean price of its brand), and visualize it.
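One way to build these target-statistic columns (a rough sketch; the column names match the plots below, and the statistics should be computed from the training rows only to avoid leaking prices):

train['Mean_Brand_Price'] = train.groupby('brand_name')['price'].transform('mean')
# the same pattern gives the other engineered columns used below
train['Mean_Sub_Price'] = train.groupby('subcategory2')['price'].transform('mean')
train['Median_Brand_Price'] = train.groupby('brand_name')['price'].transform('median')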

sns.relplot(x="Mean_Brand_Price", y="price", data=train);

It seems that there is a relationship between the mean brand price and the price, so it might be useful to add this as a new feature.

Mean Price For Sub Category 2 :

Let's find the mean price for subcategory 2 in the same way and look at it visually.

sns.relplot(x="Mean_Sub_Price", y="price", data=train);

The plot suggests a clear relationship between the mean price of subcategory 2 and the price, so I include this as a new feature as well.

Median Price For Brand :

At last, let's find the median price for each brand, store it in a column (each row gets the median price of its brand), and visualize it.

sns.relplot(x="Median_Brand_Price", y="price", data=train);

Since there is a relationship here too, let's use this as a new feature.

7 . Modelling

Splitting Data :

Split the data into train and test sets before applying any ML or DL models. The model is trained on the train split and the test split is used to measure how well it performs. As mentioned earlier, the evaluation metric is Root Mean Squared Logarithmic Error (RMSLE). Here I use 20% of the data as the test set.

train, test, y_train, y_test = train_test_split(data, y, test_size=0.2, random_state=42)

Encoding Categorical Features :

Machine Learning models only accept inputs in a suitable numeric format. There are many ways to handle categorical data; since most of our features are categorical in nature, I have used three different encodings.

LabelBinarizer on Brand Name :
It converts each label of a categorical feature into a binary (one-vs-rest) indicator vector.

bin = LabelBinarizer(sparse_output=True)
train_transform = bin.fit_transform(train["brand_name"])

One Hot Encoding On Item Condition and Shipping :
It encodes categorical integer features as a one-hot numeric array, which makes model training easier and faster. The encoder is fitted with respect to the training data to avoid data leakage; the test data is unseen to us, so if a category appears only at test time it is simply ignored (encoded as all zeros) thanks to handle_unknown='ignore'.

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(data[['item_condition_id','shipping']])
train_ohe = enc.transform(train[['item_condition_id','shipping']])
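The fitted encoder is then applied to the held-out rows in the same way (a sketch; because of handle_unknown='ignore', any unseen category becomes an all-zero row):

test_ohe = enc.transform(test[['item_condition_id','shipping']])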

One Hot Encoding on Category Columns :

Scikit-learn's CountVectorizer converts a collection of text documents into a vector of token counts. When it is applied to a categorical column with a fixed vocabulary and binary counts, the result is effectively a one-hot encoded vector whose size equals the number of distinct categories.

unique_categories = train["Main_category"].unique()
count_category = CountVectorizer(vocabulary = unique_categories,lowercase = False,binary = True)
train_main = count_category.fit_transform(train["Main_category"])

Encoding Text Features :

Count Vectorizer Item Description and Name:
Scikit-learn’s CountVectorizer is used to convert a collection of text documents to a vector of term/token counts. It also enables the ​pre-processing of text data prior to generating the vector representation.

def encoder_text(train, test, test1, params):
    vectorizer = CountVectorizer(ngram_range = params[0],
                                 min_df = params[1],        # drop words that appear in too few documents
                                 max_df = params[2],        # drop words that appear in too many documents
                                 max_features = params[3])  # size of the vocabulary
    train_transform = vectorizer.fit_transform(train)
    test_transform = vectorizer.transform(test)
    test1_transform = vectorizer.transform(test1)
    feature_names = vectorizer.get_feature_names()
    return train_transform, test_transform, test1_transform, feature_names
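A hypothetical call on the item description column (the parameter values and the kaggle_test variable are illustrative assumptions, not the exact ones used in the project):

desc_params = [(1, 2), 10, 0.95, 100000]  # ngram_range, min_df, max_df, max_features
train_desc, test_desc, kaggle_desc, desc_vocab = encoder_text(
    train['item_description'], test['item_description'],
    kaggle_test['item_description'], desc_params)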

Encoding Numerical Features :

Now let's look at encoding numerical features. Before feeding them into a model we should standardise them, so I have used sklearn's StandardScaler on all the newly created features (description length, mean brand price, mean subcategory2 price, median brand price).

mean_cat = StandardScaler(copy=True)
# mean_cat_train is assumed to hold one of the engineered numeric columns as a 2-D array
train_mean_cat = mean_cat.fit_transform(mean_cat_train)

Machine Learning Modelling :

There are many ML algorithms to try, and you never know in advance which one will produce the better metric, so always tune the hyperparameters. I have tuned every model and used the best hyperparameters to get the best RMSLE score for each.

I have used GridSearch as a hyper parameter tuning strategy and got some better results. If you want to know more about GridSearchCV here is a wonderful blog by krishni.
https://medium.com/datadriveninvestor/an-introduction-to-grid-search-ff57adcc0998

I have used ScikitLearn’s models because of its simplicity and accessibility. It’s easy to use even for beginners — and a great choice for simpler data analysis tasks.

I have modeled three ML regression models.

  1. SGD Regressor
  2. CatBoost Regressor
  3. Decision Tree Regressor

SGDRegressor with GridSearchCV :

parameters = {"alpha":[0.0001,0.001,0.01,0.1,0,1,10,100,1000],"l1_ratio" : [0.2,0.3,0.4,0.5,0.6,0.7]}
model = SGDRegressor (
loss='squared_loss',
learning_rate='invscaling',
max_iter=200,
penalty='l2',
fit_intercept=False)
reg = GridSearchCV(model,param_grid =parameters,n_jobs=-1)reg.fit(X_train, y_train)

The grid search gives a best alpha of 0 and l1_ratio of 0.2. After training the model with these hyperparameters, I got an RMSLE score of 0.47.

CatBoost Regressor :

from catboost import CatBoostRegressor
from sklearn.metrics import mean_squared_error

rmsle_scores = []
n_estimators_grid = [200, 250, 300, 350]
for i in n_estimators_grid:
    model = CatBoostRegressor(n_estimators = i)
    model.fit(X_train, y_train)
    scores = model.predict(X_test)
    # y is the log price, so RMSE here is the RMSLE of the original prices
    RMSLE = np.sqrt(mean_squared_error(y_test, scores))
    rmsle_scores.append(RMSLE)

CatBoost works really well with categorical features, but it takes plenty of time to train. Increasing the number of estimators tends to give better results. After tuning, CatBoost gives an RMSLE score of 0.49.

DecisionTree Regressor with GridSearchCV :

dtm = DecisionTreeRegressor()
param_grid = {"min_samples_split": [10, 20, 40],"max_depth": [2, 6, 8],"min_samples_leaf": [20, 40, 100]}
grid_cv_dtm = GridSearchCV(dtm, param_grid, cv=5)
grid_cv_dtm.fit(X_train,y_train)

Tuning DecisionTreeRegressor with GridSearchCV gives max_depth = 8, min_samples_leaf = 20 and min_samples_split = 10; with these hyperparameters I got an RMSLE score of 0.57.

Deep Learning Modelling :

Since most of the features are text, deep learning models can be expected to do well here: one of their great advantages is that they learn useful representations of the input on their own. I have used different approaches to encode the item description, name and category columns.

Tokenizing Name and Item Description :

I have used Keras's Tokenizer to tokenize these two features. The Tokenizer vectorizes a text corpus by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or into a vector where the coefficient of each token can be binary, count-based or tf-idf based.

from keras.preprocessing.text import Tokenizer

raw_text = np.hstack([train.item_description.str.lower(), train.name.str.lower()])
vec = Tokenizer()
vec.fit_on_texts(raw_text)
train["seq_item_description"] = vec.texts_to_sequences(train.item_description.str.lower())

Label Encoding Category Columns :

Label Encoding converts the labels of a categorical column into numbers so that they are in a machine-readable form.

enc = LabelEncoder()
enc.fit(np.hstack([train.Main_category,test.Main_category]))
train['Main_category'] = enc.transform(train['Main_category'])

Long Short Term Memory :

I have used Keras to build the neural network. It starts with embedding layers in which the tokenized text features (description and name) are mapped to dense vectors. LSTMs work well with text, so the embedding outputs of those two features are fed into LSTM layers, and everything is then combined into a single network.
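The snippets below reference embedding tensors such as emb_item_desc and emb_name that are created earlier in the full code. A minimal sketch of how they could be defined (the vocabulary size, sequence lengths and embedding dimensions are assumptions of mine):

import keras as ks

MAX_NAME_LEN = 10     # assumed padded length of the name sequences
VOCAB_SIZE = 100000   # assumed tokenizer vocabulary size

name_in = ks.layers.Input(shape=(MAX_NAME_LEN,), name="name")
desc_in = ks.layers.Input(shape=(MAX_DESC_LEN,), name="item_desc")  # MAX_DESC_LEN as in the padding sketch above
emb_name = ks.layers.Embedding(VOCAB_SIZE, 20)(name_in)
emb_item_desc = ks.layers.Embedding(VOCAB_SIZE, 60)(desc_in)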

rnn_layer1 = ks.layers.LSTM(16, return_sequences = False) (emb_item_desc) #LSTM
rnn_layer2 = ks.layers.LSTM(8, return_sequences = False) (emb_name)
main = ks.layers.concatenate([ks.layers.Flatten() (emb_main_category)
, ks.layers.Flatten() (emb_brand_name)
, ks.layers.Flatten() (emb_item_condition_shipping)
, rnn_layer1
, rnn_layer2
, numerical_dense_out
, ks.layers.Flatten() (emb_sub_category1)
, ks.layers.Flatten() (emb_sub_category2)
],axis = 1)
main = ks.layers.Dropout(0.1)(ks.layers.Dense(128, activation="relu", kernel_initializer="he_normal", kernel_regularizer=ks.regularizers.l2(0.001))(main))
main = ks.layers.Dropout(0.1)(ks.layers.Dense(64, activation="relu", kernel_initializer="he_normal", kernel_regularizer=ks.regularizers.l2(0.001))(main))
output = ks.layers.Dense(1, activation="linear") (main)
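The model is then built from all the input tensors and trained on the log price; a sketch of that step (model_inputs, the optimizer, batch size and epoch count are assumptions):

model = ks.Model(inputs=model_inputs, outputs=output)  # model_inputs: the full list of Input tensors
model.compile(loss="mse", optimizer="adam")            # MSE on the log price is in line with RMSLE
model.fit(train_inputs, y_train, batch_size=512, epochs=2, validation_split=0.05)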

This network produces an RMSLE score of 0.48.

LSTM + Convolutional Neural Network (CNN)

After getting the embedding vectors, I feed them into CNN layers as well as LSTM layers and combine the outputs into a single network.

id_cov1= ks.layers.Conv1D(128, 5, activation='relu')(emb_item_desc)
id_pool1 = ks.layers.MaxPooling1D(5)(id_cov1)
id_cov2 = ks.layers.Conv1D(128, 5, activation='relu')(id_pool1)
id_pool2 = ks.layers.MaxPooling1D(5)(id_cov2)
rnn_layer1 = ks.layers.LSTM(16, return_sequences = False) (id_pool2)
id_flat = ks.layers.Flatten()(rnn_layer1)
id_dense1 = ks.layers.Dense(128, activation='relu')(id_flat)
n_cov1 = ks.layers.Conv1D(128, 5, activation='relu')(emb_name)
n_pool1 = ks.layers.MaxPooling1D(5)(n_cov1)
rnn_layer2 = ks.layers.LSTM(8, return_sequences = False) (n_pool1)
n_flat = ks.layers.Flatten()(rnn_layer2)
n_dense2 = ks.layers.Dense(128, activation='relu')(n_flat)

This network produces an RMSLE score of 0.49, which does not improve on the LSTM-only model.

FastText Model :

This model applies Keras's GlobalAveragePooling1D to the embedding vectors.
GlobalAveragePooling1D takes a tensor of size (sequence length) x (input channels) and computes the average of the values along the sequence dimension for each channel, producing a single vector per input; this greatly reduces the complexity compared to a recurrent layer.

emb_name = ks.layers.GlobalAveragePooling1D(name = 'output_name_max')(emb_name)
emb_item_desc = ks.layers.GlobalAveragePooling1D(name = 'output_item_max')(emb_item_desc)
main = ks.layers.concatenate([ks.layers.Flatten() (emb_main_category)
, ks.layers.Flatten() (emb_brand_name)
, ks.layers.Flatten() (emb_item_condition_shipping)
, emb_item_desc
, emb_name
, numerical_dense_out
, ks.layers.Flatten() (emb_sub_category1)
, ks.layers.Flatten() (emb_sub_category2)],axis = 1)
main = ks.layers.BatchNormalization()(main)
main = ks.layers.Dense(1024)(main)
main = ks.layers.Activation('relu')(main)
output = ks.layers.Dense(1,activation = 'linear')(main)

This model produces an RMSLE score of 0.43, which is the best result so far.

8 . Results

Out of all the models I have trained, the FastText-style model gives the lowest RMSLE (0.43).

9 . Final Submission Score

Kaggle Final Submission Score —

10 . Summary

The best model I got is the FastText-style model. The point to note is that the more complex models (LSTM and LSTM + CNN) did not give better scores, while simple models such as SGD or CatBoost were comparable with them. So it is better to start with simpler Machine Learning models even when the task looks complex.

11 . My Improvements to Existing Approaches

1) On encoding categorical features I tried two things: different encoding techniques (CountVectorizer, One-Hot Encoding, LabelBinarizer) across features, and the same technique (One-Hot Encoding) on different features (brand name, categories, shipping). I found that using a single technique everywhere can lower the score, especially on text-like columns, while mixing techniques improved my model's performance somewhat.

2) The structure of the FastText-style model is my own, but the use of GlobalAveragePooling1D is already shown in one of the Kaggle kernels.

12 . Future Works

As future work I would like to learn word embeddings with a shallow neural network, try a CNN after GlobalAveragePooling1D, and engineer new features.

13 . Conclusion

I am a beginner at Kaggle, and it was very interesting to submit my predictions; I got pretty good results even though it was a late submission. I tried out a lot of algorithms, feature engineering tricks and encoding techniques and chose the best of them, and I went through many blogs, Kaggle discussions and research papers along the way. It was a great learning experience, and I will certainly take part in upcoming Kaggle competitions, as you can too!

Thank You So Much For Reading!

You can visit my GitHub repository to view the full code, and please follow me if you like this blog.

14 . References

  1. https://www.appliedaicourse.com/course/11/Applied-Machine-learning-course
  2. https://www.kaggle.com/c/mercari-price-suggestion-challenge/discussion/43939
  3. https://www.kaggle.com/c/mercari-price-suggestion-challenge/discussion/50256
  4. https://www.kaggle.com/thykhuely/mercari-interactive-eda-topic-modelling
  5. https://www.kaggle.com/c/mercari-price-suggestion-challenge/discussion/50431
