Mercari Price Suggestion Challenge: A Deep Learning Regression Case Study

23 min read · Feb 8, 2023

Preface:

Mercari Price Suggestion is a feature offered by the Mercari marketplace app that provides users with suggestions on what price to list their items for sale. The algorithm takes into account various factors such as the item’s condition, brand, and product category to give a suggested price range. The idea is to help users maximize the selling price of their items and ensure that they sell quickly. The price suggestion feature is just a guide and users are free to list their items at any price they choose.

Index:

Problem Statement
Determining the models via dataset analysis
Evaluation Metric
Exploratory Data Analysis
Data Preprocessing & Feature engineering
Examining the existing solutions
Baseline Models
Final Model
Results & Conclusion
Future Work
References

1. Problem Statement

The Mercari Price Suggestion Challenge, a well-known Kaggle competition that took place in 2018, served as the inspiration for this case study.

It can be hard to know how much something’s really worth. Small details can mean big differences in pricing. For example, one of these sweaters costs $335 and the other costs $9.99. Can you guess which one’s which?

When you consider how many products are sold online, product pricing becomes even more difficult at scale. Clothing has strong seasonal pricing trends and is heavily influenced by brand names, whereas electronics prices vary based on product specifications.

Mercari, Japan’s biggest community-powered shopping app, knows this problem deeply. They’d like to offer pricing suggestions to sellers, but this is tough because their sellers are enabled to put just about anything, or any bundle of things, on Mercari’s marketplace.

In this competition, Mercari is challenging you to create an algorithm that automatically suggests the best product prices. You will be given user-input text descriptions of their products, which will include information such as product category name, brand name, and item condition.

2. Determining the models via dataset analysis

The publicly accessible dataset from the Kaggle competition was used to train the model. The dataset consists of two files, train.tsv and test.tsv, with roughly 1.4 million and 3.5 million data points, respectively. A TSV file is one in which the values are separated by tab characters.

The columns or raw features in the train.tsv and test.tsv files are as follows:

train_id — The identifier of each listing.
name — The listing’s heading. To prevent leaks, Kaggle has cleansed the data by removing text that resembles prices (such as $20). These prices have been removed and are shown as [rm].
item_condition_id — The state of the goods as supplied by the vendor.
category_name — The category of the listing
brand_name — Textbox containing information about the product’s brand.
shipping — If the seller pays the shipping fee, the value is 1, otherwise it is 0.
item_description — The complete description of the item. To prevent leaks, Kaggle has cleansed the data by removing text that resembles prices (such as $20). These prices have been removed and are shown as [rm]. It was found to be one of the most important features.
price — The purchase price of the item. This is the target variable that needs to be predicted by my models. The unit is USD. This column doesn’t exist in test.tsv since that is what you will predict.

One key aspect to consider is the type of problem we’re trying to solve. In this case, the target variable, or the value we want to predict, is the price of the item. Since the price is a continuous variable, we have a regression-based machine learning problem on our hands. Regression models can help us understand the relationship between the independent variables (features) and the dependent variable (price), and then use that information to make predictions.

Several types of regression models can be used to predict the price of an item, including Linear Regression, Decision Trees, Random Forest, and more. The best model to use will depend on the specific details of the problem and the available data. But regardless of the model chosen, the goal remains the same: to accurately predict the price of an item using regression machine learning techniques.

3. Evaluation Metric

“The best model is the one that strikes the right balance between fit and simplicity.” - Hastie, Tibshirani, and Friedman, authors of the popular textbook “The Elements of Statistical Learning”. Every model is ultimately judged by the loss function against which it is evaluated.

The evaluation metric for this competition is Root Mean Squared Logarithmic Error(RMSLE). The RMSLE is calculated as,
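In its standard form, with n the number of listings, p_i the predicted price and a_i the actual price of listing i (symbol names chosen here for illustration), and log the natural logarithm, the metric reads:

```latex
\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\log(p_i + 1) - \log(a_i + 1)\bigr)^{2}}
```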

RMSLE is a good fit for this scenario for several reasons:

It penalizes underprediction more than overprediction, which helps sellers avoid financial losses.
RMSLE is easy to interpret because the error is expressed in relative rather than absolute terms.
RMSLE is a consistent metric that allows for meaningful comparison of model performance, regardless of the scale of the target variable (price).
RMSLE is differentiable, and when the model predicts log(1 + price) directly it reduces to an ordinary squared error, which is convex and well suited to gradient-based optimization.
The logarithmic transformation reduces the impact of extreme values on the error metric, making it more robust to outliers.
RMSLE is a commonly used metric for regression problems, making it a familiar choice for practitioners and researchers.

RMSLE — Custom Loss function
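A custom RMSLE function can be sketched in a few lines of NumPy. This is a minimal illustration rather than the exact code used in the project; the function name `rmsle` and the toy prices are assumptions made for the example.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error between actual and predicted prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) = log(1 + x), so a price of 0 is still well defined
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    return float(np.sqrt(np.mean(log_diff ** 2)))

# Toy check: missing a $20 item by $10 costs more when we under-predict
under = rmsle([20.0], [10.0])
over = rmsle([20.0], [30.0])
```

The toy check at the end illustrates the asymmetry mentioned above: the same absolute miss is penalized more heavily when the prediction is too low.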

4. Exploratory Data Analysis (EDA)

“Data scientists spend 80% of their time cleaning and manipulating data, and only 20% building models. This is why EDA is so important, as it helps identify issues with the data early on, saves time and effort down the line, and ultimately leads to better results.” — Megan Risdal, Data Scientist and Kaggle Expert.

The first thing to check is the presence of null values. The table below shows that a large number of brand names and a smaller number of category names are null. Only 4 item descriptions are null, but 82,489 more carry a placeholder instead of a real description.

Train ID:  0
Product Name:  0
Item Condition:  0
Category Name:  6327
Brand Name:  632682
Shipping Status:  0
Item Description:  4
Price:  0
Item Description where no description present:  82489

1. Variation of the “Price” feature

Prices on the marketplace vary widely, and the spread can be quantified from the price statistics shown further below. Three quarters of the listings are priced at $29 or less, while the maximum reaches $2,009, so most of the price range is covered by the top quartile (75th to 100th percentile). This heavy right skew can be seen in the PDF plot below.

Probability Density Function(PDF) of Heavily Skewed “Price” Distribution

These are the price statistics which also suggest the presence of outliers in the data.

                          count    1.482535e+06
                          mean     2.673752e+01
                          std      3.858607e+01
                          min      0.000000e+00
                          25%      1.000000e+01
                          50%      1.700000e+01
                          75%      2.900000e+01
                          max      2.009000e+03

The price is heavily skewed, so we transform it and recheck the distribution. A logarithmic function works well here. Because some items have a price of 0 and log(0) is undefined, a constant of 1 is added to the price first, giving log_price = log(1 + price).

The distribution plot above confirms that the log-transformed price is close to Gaussian: the curve is not perfectly smooth, but it is no longer skewed. Next, let's use a QQ plot to check normality.

From the QQ plot above, we can infer that the feature is approximately normally distributed. To double-check that inference, we also apply a Box-Cox transformation and draw a second QQ plot.

A note on the p-value: in a normality test the null hypothesis is that the data is normal, so a p-value below 0.05 rejects exact normality rather than confirming it. With about 1.4 million rows even tiny deviations become statistically significant, so we rely on the QQ plots, which show that the feature is approximately normal.

log(1 + price) will be used as the target variable instead of price. Minimizing the Mean Squared Error (MSE) on the log-transformed target is the same as minimizing the Mean Squared Logarithmic Error (MSLE) on the original price, so the RMSE a model reports on log_price is directly its RMSLE, the competition metric. This also simplifies hyperparameter selection, because any regressor can be tuned with a standard squared-error objective.

2. Influence of Item Condition on the “Price” feature

From Train Variable Documentation reference →

                            New      -> (1)
                            Like New -> (2)
                            Good     -> (3)
                            Fair     -> (4)
                            Poor     -> (5)

From the histogram above, we can say that a large number of products are in either good or new condition.

The box plot above shows price against item condition, and there is little variation between conditions. Items with item condition ID 5 do show a somewhat higher price than the others, which probably indicates the presence of antique items in the Mercari marketplace.

3. Influence of shipping on the “Price” feature

From dataset documentation:

Shipping is a binary feature where shipping = 1 if the shipping fee is paid by the seller & 0 if by the buyer

Cumulative Distribution Function(CDF) of log_price & shipping

For most items, the shipping fee was paid by the buyer. For these items the CDF is shifted to the right and the box plot sits slightly higher than for the others. When the seller pays for shipping, the fee is folded into the selling price, which might raise the item price. Since the shipping fee itself is not available, no further analysis is possible.

4. Influence of Brand name on the “Price” feature

We saw that the brand name has a large number of null values: about 57% of the items have a brand_name and the rest do not.

We can see that PINK is the most frequent brand, with about 70k occurrences. There are also many brands that occur only once. If required, brands with fewer than 10 listings can be grouped together for further analysis.

Task: To find the median price of the most frequent brands and compare it against the median price of the entire dataset

We can see that roughly 7 of the top brands have a median price greater than the overall median price. Michael Kors, Lululemon and LuLaRoe all have a median price higher than the overall median.

It is observed that Apple has a large variance in selling price, while Forever 21 items are sold in a very narrow price range.

Task: Top Costliest & Cheapest brand

It can be seen that the costliest brand reaches a price of about 2,000 dollars, while the cheapest is priced at 1 dollar. There are also some items with a price of zero.

Task: To find the influence of the number and length of words in brand name

It can be seen that most brand names are a single word, which accounts for about 40% of the listings, while a further 25% have two-word brand names. The median (50th percentile) brand name is 10 characters long.

5. Influence of Category Name on the “Price” feature

The category field consists of a hierarchical format in the form of main_category/sub_category_1/…

Task: To find out the most and least occurring category and sub-category

It is observed that there are 1,200+ categories in total, with Women being the dominant general category, and there are some listings that have no category label. The general (top-level) category takes 11 distinct values.

It is observed that many entries in sub-category 2 occur only once, which means these products are rarely listed. The product categories are hierarchical and have three levels separated by “/”.

Task: To find the impact of price(target variable) with general category

The prices across the general categories are relatively stable, with minimal variation in the median. The median price of men’s products is slightly higher than in the other categories, while handmade products have a slightly lower median price. However, the spread of prices is noticeably wider in the Electronics and Home Goods categories.

6. Influence of Item Name & Description on the “Price” feature

These two fields are free text and therefore carry the richest information, so they are likely the most significant fields in the data.

Task: To preprocess the data by removing stop words and special characters, and then build a WordCloud. A WordCloud shows the most frequently occurring words in the given text.

WordCloud on Item Name

In the analyzed data, Victoria’s Secret is the most frequently occurring brand, followed by Michael Kors, Apple, and iPhone which also appear frequently. The words “free shipping” and “brand new” are also commonly used in product descriptions. The pattern of brand names shows that the most frequently occurring brand names are typically one-word, followed by two-word brand names. This highlights the importance of branding and the influence it has on consumer behaviour and purchasing decisions.

WordCloud on Item Description

It is seen that words like brand, new, free, shipping, great, and condition are used frequently in the item description.

Is there a correlation between the length of a product description and its price point? Does providing more information about a product in the form of a longer description contribute to its perceived value and justify a higher price, or is a concise description sufficient for lower-priced items?

To answer this, we plot log_price against the item description length. In some cases, a longer description provides more information and detail about the product, which could make it more appealing to potential buyers and justify a higher price. However, there are also instances where a shorter, more concise description is preferred, especially for less expensive items.

5. Data Preprocessing & Feature engineering

“A model is only as good as the data it is trained on.” - Andrew Ng, renowned computer scientist and AI expert. That is why data preprocessing and feature engineering are so important: together they are among the key factors that decide how the model will perform.

Feature Engineering and Data Preprocessing are essential steps in the data science pipeline. These steps help to extract meaningful insights and build robust machine-learning & deep-learning models from raw data. Feature Engineering involves creating new features from existing ones to improve the model’s accuracy. This includes transforming and scaling variables, one-hot encoding categorical variables, and engineering new variables through mathematical operations or aggregations. Data Preprocessing, on the other hand, involves cleaning and transforming the raw data to make it suitable for modelling. This includes handling missing values, removing duplicates, and normalizing variables.

In conclusion, Feature Engineering and Data Preprocessing play a critical role in the success of a machine learning model. They require careful planning, understanding of the data, and experimentation to arrive at the best features and data preprocessing techniques.

Data Preprocessing —

1. Creating a log transform function to transform the price

As mentioned in the Exploratory Data Analysis (EDA), we use a function to create the ‘log_price’ target feature. This is the target variable used to train the models. Using the log of the price instead of the raw price gives a far less skewed target distribution and a more robust model.
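The transform and its inverse can be sketched as follows. The helper names `to_log_price` and `from_log_price` are illustrative assumptions; the inverse is needed to turn model predictions back into dollar prices before submission.

```python
import numpy as np

def to_log_price(price):
    """Map raw prices to the training target log(1 + price)."""
    return np.log1p(np.asarray(price, dtype=float))

def from_log_price(log_price):
    """Invert the transform to get prices back in USD."""
    return np.expm1(np.asarray(log_price, dtype=float))

# Toy prices taken from the summary statistics (min, quartiles, max)
prices = np.array([0.0, 10.0, 17.0, 29.0, 2009.0])
log_prices = to_log_price(prices)
restored = from_log_price(log_prices)
```

Using `log1p` and `expm1` rather than `log(1 + x)` and `exp(x) - 1` keeps the round trip numerically accurate for small prices.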

2. Dealing with Null values

Empty and null values must be handled before the text is vectorized, because unhandled nulls break the text encoders and hurt model performance. Once they are dealt with, the text fields can be turned into sparse embeddings, which improves the accuracy of the model and ensures that it is well suited to the data.
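One common way to deal with the nulls is to replace them with a placeholder token, sketched below. The function name `fill_missing`, the placeholder value "missing" and the literal string "No description yet" are assumptions for illustration and may differ from what the project actually used.

```python
import pandas as pd

def fill_missing(df, placeholder="missing"):
    """Replace nulls in the text-like columns with a placeholder token."""
    df = df.copy()
    for col in ["category_name", "brand_name", "item_description"]:
        df[col] = df[col].fillna(placeholder)
    # Assumed literal used when the seller left the description blank
    df["item_description"] = df["item_description"].replace(
        "No description yet", placeholder
    )
    return df

sample = pd.DataFrame({
    "category_name": ["Women/Tops/Blouse", None],
    "brand_name": [None, "Nike"],
    "item_description": ["No description yet", None],
})
filled = fill_missing(sample)
```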

3. Limiting the price to a specific range as defined on the Mercari website

Removing outliers from the training dataset is essential for maintaining accuracy. Prices outside the range that Mercari allows cannot be valid listings, and leaving them in distorts the fit and hurts the accuracy of the results. Outlier removal helps to ensure that the model is well-fitted to the data and can generalize effectively to new, unseen data.
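A minimal sketch of such a filter is shown below. The bounds of $3 and $2,000 are assumptions used for illustration (they are the values commonly used in public kernels for this competition) and should be checked against Mercari's current listing rules; `filter_price_range` is a hypothetical helper name.

```python
import pandas as pd

def filter_price_range(df, min_price=3.0, max_price=2000.0):
    """Keep only listings whose price lies in [min_price, max_price]."""
    mask = df["price"].between(min_price, max_price)  # inclusive on both ends
    return df.loc[mask].reset_index(drop=True)

sample = pd.DataFrame({"price": [0.0, 2.0, 3.0, 17.0, 2000.0, 2009.0]})
kept = filter_price_range(sample)
```

The filter is applied to the training rows only; validation and test rows are left untouched so that the evaluation reflects the real data.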

4. Concatenating features to create “name” & “text” vectors.

This technique is inspired by the winners’ solution and is used to improve the score.

TF-IDF vectorization is a powerful tool for generating text features, but it can often result in a large number of features that can be overwhelming to work with. This is where their clever technique comes into play! By using it, I am able to not only control the number of features generated but also enhance their quality.

How, you ask? By bringing all the textual information together, each feature becomes more informative and representative of the overall text. The result is a set of features that is not only manageable but also of higher quality.
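The concatenation can be sketched as below. The exact column combination is an assumption modelled on the publicly shared winning kernel (name plus brand, and description plus name plus category); `build_text_fields` is a hypothetical helper name.

```python
import pandas as pd

def build_text_fields(df):
    """Merge the raw columns into two richer text fields, 'name' and 'text'."""
    df = df.copy()
    # name + brand: the brand often drives the price, so keep it next to the title
    df["name"] = df["name"].fillna("") + " " + df["brand_name"].fillna("")
    # description + merged name + category, so one vectorizer sees everything
    df["text"] = (
        df["item_description"].fillna("")
        + " " + df["name"]
        + " " + df["category_name"].fillna("")
    )
    return df

sample = pd.DataFrame({
    "name": ["Leggings"],
    "brand_name": ["LuLaRoe"],
    "category_name": ["Women/Athletic Apparel/Pants"],
    "item_description": ["Brand new, never worn"],
})
merged = build_text_fields(sample)
```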

5. Cleaning the text data to make it more applicable for the text analysis

Cleaning text data is a crucial step in text analysis and is necessary for making the data more usable and relevant for analysis. The process involves removing unwanted elements such as special characters, punctuation, and stop words, as well as converting the text to a standardized format, such as lowercase. The ultimate goal of text cleaning is to prepare the data for feature extraction. By cleaning the text data, you can improve the quality and accuracy of the results and make it easier to draw meaningful insights from the data.
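The cleaning steps described above can be sketched with the standard library alone. The stop-word list here is a deliberately tiny, illustrative assumption; a real pipeline would use a fuller list such as the one shipped with NLTK.

```python
import re

# A tiny illustrative stop-word list (assumption, not the project's list)
STOP_WORDS = {"a", "an", "and", "the", "is", "in", "it", "of", "to", "for", "with"}

def clean_text(text):
    """Lowercase, drop special characters and stop words, collapse whitespace."""
    text = text.lower()
    # Keep letters, digits and whitespace; everything else becomes a space
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)

cleaned = clean_text("Brand NEW!! iPhone 7 (32GB) with the original box & charger [rm]")
# cleaned == "brand new iphone 7 32gb original box charger rm"
```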

6. Performing lemmatization

Lemmatization is useful in text analysis because it helps to reduce the dimensionality of the data by grouping similar words together. This can improve the accuracy of text classification models, as well as make it easier to identify patterns and relationships within the text data.

7. Creating the length of merged text as a feature

df_train['length_combined_text'] = df_train['text'].str.len()
df_cval['length_combined_text']    = df_cval['text'].str.len()

EDA revealed that the length of the text data is an important feature to consider. To make the most of this information, it’s important to add the length feature to the data frame. Adding the length feature to the data frame allows us to capture the distribution of text lengths and use this information in analysis. This can make it easier to identify patterns and relationships within the text data.

Feature engineering —

1. Generating derived features (Price Statistics)

Derived features are new features generated from existing features in a dataset. In this case, we generate summary statistics from the prices observed in the training data (referred to below as historical price statistics). By calculating the mean, median, and standard deviation of the prices, we gain a better understanding of their distribution and can identify outliers. These statistics can be used as additional model inputs, and they also provide insight into the market and customer behavior.
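One way to build such features is to aggregate the log price per group on the training rows and join the result onto both splits. The grouping key (`brand_name`), the helper name `add_price_stats` and the generated column names are assumptions for this sketch.

```python
import pandas as pd

def add_price_stats(df_train, df_cval, key="brand_name"):
    """Attach per-group mean/median/std of log_price, computed on training rows only."""
    stats = (
        df_train.groupby(key)["log_price"]
        .agg(["mean", "median", "std"])
        .add_prefix(f"{key}_price_")
        .reset_index()
    )
    # Left joins keep every row; groups unseen in training get NaN statistics
    return (
        df_train.merge(stats, on=key, how="left"),
        df_cval.merge(stats, on=key, how="left"),
    )

train = pd.DataFrame({"brand_name": ["Nike", "Nike", "Apple"],
                      "log_price": [3.0, 4.0, 6.0]})
cval = pd.DataFrame({"brand_name": ["Nike", "Zara"]})
train_stats, cval_stats = add_price_stats(train, cval)
```

Computing the statistics on the training split only avoids leaking validation prices into the features, although features derived from the target can still encourage overfitting.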

2. Applying different types of text encoders on different kinds of text data

It was found that sparse TF-IDF embeddings work best for the models. TF-IDF encoding represents the importance of each word in each document. Because most words do not occur in most listings, the resulting matrix is extremely sparse, and storing it in a sparse format keeps memory use manageable despite the high dimensionality. Applying the sparse TF-IDF encoding lets us extract meaningful features from the text and use them in the analysis. Through experimentation, it was found that n_grams=1 and n_grams=2 worked best for the name and text features respectively.

X_train_name, X_cval_name,_ = text_encoder(df_train['name'], df_cval['name'], "TFIDF", 1)
X_train_text, X_cval_text,_ = text_encoder(df_train['text'], df_cval['text'], "TFIDF", 2)
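The `text_encoder` helper called above is not shown in the article. A minimal sketch of what it could look like with scikit-learn follows; the interpretation of the last argument as the maximum n-gram size and the third return value as the fitted vectorizer are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

def text_encoder(train_text, cval_text, kind="TFIDF", max_ngram=1):
    """Fit a sparse text vectorizer on the training text and apply it to both splits."""
    vectorizer_cls = TfidfVectorizer if kind == "TFIDF" else CountVectorizer
    vectorizer = vectorizer_cls(ngram_range=(1, max_ngram))
    X_train = vectorizer.fit_transform(train_text)  # learn the vocabulary on train only
    X_cval = vectorizer.transform(cval_text)        # reuse it for validation
    return X_train, X_cval, vectorizer

train_docs = ["pink hoodie size small", "iphone 6 case pink"]
cval_docs = ["pink iphone case"]
X_train, X_cval, vec = text_encoder(train_docs, cval_docs, "TFIDF", 2)
```

Both outputs are SciPy sparse matrices with one column per unigram or bigram seen in the training text, so validation rows share exactly the same feature space.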

One-hot encoding is a common technique used in machine learning to represent categorical data as numerical data. The technique creates new columns for each category in the data and assigns a binary value of 0 or 1 to indicate the presence or absence of a category in a given observation.

In this case, we apply one-hot encoding to the shipping and item condition data, which are both categorical variables. Converting these variables into numerical columns lets us use them in the analysis and incorporate them into machine learning models. Both variables have only a handful of categories (two for shipping, five for item condition), so one-hot encoding adds just a few columns; for high-cardinality variables the number of columns grows quickly, which is where a sparse representation becomes important.

3. Feature Selection via SelectKBest & Correlation Matrix

The process of feature selection is crucial for reducing the dimensionality of the data and enhancing the effectiveness of machine learning models. SelectKBest, a popular method for feature selection, chooses the best features based on a statistical test such as chi-squared or ANOVA. To reduce the complexity of the data further, highly correlated features can be found using a correlation matrix and then eliminated.

We use SelectKBest for feature selection and also take the correlation matrix into account to identify highly correlated features. Used together, they let us select the most relevant and non-redundant features from the dataset, which can improve the accuracy of the models and make the analysis more interpretable. SelectKBest was chosen over SelectFromModel because of space and time constraints: it is faster and more memory efficient, making it a good choice for large datasets or for situations where computational resources are limited.
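A small sketch of SelectKBest on sparse features is shown below. It uses `f_regression` as the scoring function, which is an assumption on our part (it is scikit-learn's univariate F-test for a continuous target); the toy data and the choice of k=10 are also illustrative.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
# Toy data: 200 rows, 50 mostly-zero features, only feature 3 drives the target
dense = rng.random((200, 50)) * (rng.random((200, 50)) < 0.2)
X = sparse.csr_matrix(dense)
y = 2.0 * dense[:, 3] + 0.01 * rng.standard_normal(200)

# Score every column against the target and keep the 10 best
selector = SelectKBest(score_func=f_regression, k=10)
X_selected = selector.fit_transform(X, y)
kept_columns = selector.get_support(indices=True)
```

Because each feature is scored independently, the cost grows linearly with the number of columns and no model has to be trained, which is why this approach is cheap on large sparse matrices.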

Heatmap of all the generated features against the target “price” feature

After a thorough evaluation of the results from both methods, it was observed that the historical price statistics features can play a significant role in predicting the price.

Hence, the final sparse hstack employed as the training and cross-validation sets in the models is as follows.

Sparse hstack is the process of horizontally stacking several sparse matrices to form a single, combined matrix. In this case, we stack matrices derived from four different features: name, text, shipping, and item condition. This combination allows us to capture the complete information from all four features while preserving the sparsity that is inherent in text data.

So why is this important? By combining the information from these features into a single matrix, we create a more robust representation of the data that can be used to train machine learning models. This can lead to improved model accuracy, better results, and more effective use of the data.
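The stacking step can be sketched with `scipy.sparse.hstack`. The four small matrices below are toy stand-ins for the real encoded blocks, and their variable names are assumptions.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Toy stand-ins for the encoded blocks (3 listings each)
X_name = csr_matrix(np.array([[0.7, 0.0], [0.0, 0.5], [0.3, 0.3]]))   # TF-IDF of "name"
X_text = csr_matrix(np.array([[0.0, 0.2, 0.0],
                              [0.9, 0.0, 0.0],
                              [0.0, 0.0, 0.4]]))                      # TF-IDF of "text"
X_shipping = csr_matrix(np.array([[1, 0], [0, 1], [1, 0]]))           # one-hot shipping
X_condition = csr_matrix(np.array([[1, 0, 0],
                                   [0, 1, 0],
                                   [0, 0, 1]]))                       # one-hot condition

# Stack column-wise; the rows (listings) must line up across all blocks
X_train = hstack([X_name, X_text, X_shipping, X_condition]).tocsr()
```

The result has one row per listing and 2 + 3 + 2 + 3 = 10 columns, and only the non-zero entries are stored.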

6. Examining the Top Existing Solutions

Ridge Model: This solution employs a simple Ridge model trained on TF-IDF features for text and One Hot Encoding for categorical variables. It is a very simple, elegant, and effective solution to this problem. The obtained RMSLE is 0.44.
Sparse MLP: This is the solution to the Kaggle problem proposed by the winners. They achieved first place with an RMSLE of 0.387 using interesting feature engineering techniques and ensembles of Sparse MLPs trained on sub-datasets.
CNN Glove Single Model: To train a single CNN model, this solution employs pre-trained word embeddings for text and One Hot Encoding for categorical features. This solution is distinct because, in contrast to everyone else, only one deep learning model is used in this solution. Furthermore, the feature engineering techniques employed here are highly innovative. The obtained RMSLE score is 0.41 (35th Rank).

7. Baseline Models

We divide the training data into two sets: a training set and a validation set. In this project, we use an 80–20 random split, which means that 80% of the original training data is used for training, and the remaining 20% is used for validation. The models below are developed using features derived from the text data:

Ridge Regression

Finding the link between a dependent variable and one or more independent variables is a common regression task. Linear Regression with Regularization is a straightforward but reliable model that can be applied in this situation. Regularization is a method for avoiding overfitting, a problem that frequently arises in regression models. When a model fits the training data too closely and becomes overly complicated, it is said to overfit, which causes it to perform poorly on unseen data.

L1 Regularization adds a penalty term proportional to the sum of the absolute values of the coefficients, whereas L2 Regularization adds a penalty term proportional to the sum of their squares. These penalty terms help to reduce the magnitude of the coefficients, making the model simpler and preventing overfitting.

In these experiments, it was observed that L2 Regularization (Ridge Regression) worked better than L1 Regularization (Lasso Regression) for the regression task. This may be due to the high number of features generated from sparse vectorization. Sparse vectorization is a technique used to represent text data in numerical form, which can result in a high number of features. This can make the Lasso model slow to converge, leading to poor performance.

In conclusion, Linear Regression with Regularization is a simple yet effective model for regression tasks. L2 Regularization (Ridge Regression) is a better choice than L1 Regularization (Lasso Regression) when dealing with high-dimensional data, such as that generated from sparse vectorization.

Hyperparameter tuning of max_iter and alpha was performed using RandomizedSearchCV and Bayesian search.

The resulting RMSLE loss is 0.444, which is a great score for such a simple model.
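A minimal sketch of the Ridge baseline on sparse features is given below. The data is synthetic and the values of `alpha` and `max_iter` are placeholders rather than the tuned values from the experiments, so the score it produces says nothing about the real dataset.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Toy sparse features and a log-price-like target that depends linearly on them
X = sparse.csr_matrix(rng.random((500, 30)) * (rng.random((500, 30)) < 0.1))
true_coef = rng.normal(size=30)
y = X @ true_coef + 3.0 + 0.05 * rng.standard_normal(500)

# 80-20 random split, mirroring the split described above
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = Ridge(alpha=1.0, max_iter=200)  # placeholder hyperparameters
model.fit(X_tr, y_tr)

# With a log(1 + price) target, RMSE on the validation set is the RMSLE
val_pred = model.predict(X_val)
val_rmse = float(np.sqrt(np.mean((val_pred - y_val) ** 2)))
```

In practice the same estimator would be wrapped in a search over `alpha` and `max_iter`, scored on the validation RMSE.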

2. XGBoost Regressor

XGBoost can be used to build a predictive model that takes the product attributes as input and outputs a predicted price. XGBoost has several advantages over traditional gradient-boosting algorithms, including faster training times, more accurate predictions, and the ability to handle missing values and extreme values in the data. Additionally, XGBoost provides many tunable parameters that allow users to fine-tune the model to their specific needs, making it a popular choice for machine learning practitioners.

Hyperparameter tuning of max_iter and alpha was performed using RandomizedSearchCV.

The result achieved using XGBoost was not as favourable as that of the Ridge Regression model. The XGBoost model produced an RMSLE loss of 0.45, which is worse than the loss obtained from Ridge Regression. Additionally, the training time for XGBoost was two hours, which was significantly longer than the two minutes it took to train the Ridge Regression model.

3. LightGBM Regressor

LightGBM is designed to be faster than XGBoost, especially when dealing with large datasets or high-dimensional data. LightGBM uses a histogram-based method to bin the data, which reduces memory usage and speeds up the training process. Additionally, LightGBM supports parallel learning, which further improves its training efficiency. As training time was a concern with XGBoost, let’s implement LightGBM to see how well it fits the data.

Hyperparameter tuning was again done using RandomizedSearchCV.

The result achieved using LightGBM was slightly more favourable than that of the Ridge Regression model. The LightGBM model produced an RMSLE loss of 0.439, which is slightly better than the loss obtained from XGBoost and Ridge Regression. Additionally, the training time for LightGBM was 1 hour, which is significantly shorter than the time taken by XGBoost. The initial expectation was that tree-based models would struggle with a large number of features, which likely limited the performance of the LightGBM and XGBoost models.

4. Sparse Multi-layer Perceptron

The Multilayer Perceptron (MLP) is a popular artificial neural network architecture that can be used for regression and classification tasks. The Sparse MLP is a variation of the MLP that has been modified to handle sparse text data. Training such networks became very convenient because Keras 2.0 accepts sparse inputs natively.

MLP-1 (2 hidden layers) and MLP-2 (6 hidden layers)

MLP-1 has a very small and simple architecture but is very powerful —

Input Layer -> Dense (256) -> BN Layer() -> Dense (128) -> Dropout(0.2) -> Dense (1) -> Output Layer

Sparse MLP-1 Model

MLP-2 has a more complex architecture than the previous model —

Input Layer -> Dense (1024) -> BN Layer() -> Dense (512) -> Dense (256) -> Dense (128) -> Dense (64) -> Dense (32) -> Dropout(0.2) -> Dense (1) -> Output Layer

Sparse MLP-2 Model

The hidden layers of the models use the Rectified Linear Unit (ReLU) activation function. The output layer uses a linear activation function, f(x) = x. Experimentation showed that the LeakyReLU activation function did not produce desirable results, so it was not used in the models.

The model was trained using the Adam optimization algorithm for a total of 3 epochs. The initial batch size was set to 256, and after each epoch, the batch size was doubled. This approach was inspired by the winning solution of a Kaggle competition and was implemented to further enhance the training process.
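The batch-size doubling schedule described above can be written down explicitly. The helper name `batch_size_schedule` is an assumption, and the commented-out call only indicates where a Keras-style `fit` would go; it is not the project's training loop.

```python
def batch_size_schedule(initial_batch_size=256, epochs=3):
    """Return the batch size used in each epoch when it doubles after every epoch."""
    return [initial_batch_size * 2 ** epoch for epoch in range(epochs)]

schedule = batch_size_schedule()
for epoch, batch_size in enumerate(schedule, start=1):
    # In a Keras-style loop this would be one call per epoch, for example:
    # model.fit(X_train, y_train, batch_size=batch_size, epochs=1)
    print(f"epoch {epoch}: batch_size={batch_size}")
```

Growing the batch size over the epochs plays a role similar to decaying the learning rate: early epochs take many noisy steps, later epochs take fewer and smoother ones.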

The experiments showed that adding a Batch Normalization layer and a Dropout layer significantly improved the RMSLE of the Sparse MLP models. The RMSLE for MLP-1, without Batch Normalization and Dropout, was 0.43, while the RMSLE for MLP-2 was 0.422. These results demonstrate the effectiveness of incorporating these techniques in deep learning models, as they can greatly enhance model performance.

After training the two Sparse MLP models, MLP-1 and MLP-2, we evaluated their performance on the cross-validation data using the Root Mean Squared Logarithmic Error (RMSLE). The RMSLE for MLP-1 is 0.42 and the RMSLE for MLP-2 is 0.412. These results indicate that MLP-2 performed slightly better than MLP-1, with a lower RMSLE value.

This is evidence of deep learning’s potential and prowess when used with lots of data. With only two hidden layers, a straightforward Sparse MLP model that extracts features implicitly produced an incredibly low RMSLE on the test set. The model was nevertheless found to be overfitted to the training set. Despite this worry, the outstanding test results persuaded us to accept the overfitting. We were encouraged by the positive outcomes from this straightforward model and decided to test a more intricate model to see if we might enhance performance even further.

8. Final Model

An ensemble generator refers to the combination of multiple models to create a new, more powerful model. MLP-1 and MLP-2 are two separate Multi-Layer Perceptron (MLP) models. An ensemble generator combining MLP-1 and MLP-2 would create a new model by aggregating the predictions made by the two individual models, to improve the overall performance and accuracy of the new model. We get the optimum weight to be 0.405 and the final RMSLE loss to be 0.4047.
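A simple way to find such a blending weight is a grid search over w in [0, 1] on the validation predictions, sketched below. The predictions here are synthetic and the function name `best_blend_weight` is an assumption, so the weight it finds is not the 0.405 reported above.

```python
import numpy as np

def best_blend_weight(pred_1, pred_2, y_true, steps=201):
    """Grid-search w in [0, 1] minimising the RMSE of w * pred_1 + (1 - w) * pred_2."""
    best_w, best_rmse = 0.0, float("inf")
    for w in np.linspace(0.0, 1.0, steps):
        blended = w * pred_1 + (1.0 - w) * pred_2
        rmse = float(np.sqrt(np.mean((blended - y_true) ** 2)))
        if rmse < best_rmse:
            best_w, best_rmse = float(w), rmse
    return best_w, best_rmse

rng = np.random.default_rng(0)
y_val = rng.normal(3.0, 0.7, size=1000)                 # toy log-price targets
pred_mlp1 = y_val + rng.normal(0.0, 0.42, size=1000)    # toy predictions of model 1
pred_mlp2 = y_val + rng.normal(0.0, 0.41, size=1000)    # toy predictions of model 2
weight, blended_rmse = best_blend_weight(pred_mlp1, pred_mlp2, y_val)
```

Since the grid includes w = 0 and w = 1, the blended error can never be worse than that of the better single model on the data used for the search.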

When the Ensemble MLP model was evaluated on the unseen test data with 3.5 million rows on Kaggle, it achieved an RMSLE of 0.41226.

9. Results & Conclusion

Final Results of all models in a PrettyTable format

The best model is an ensemble of two Multi-Layer Perceptron (MLP) models: MLP Model 1 and MLP Model 2.
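It achieved a training RMSLE (Root Mean Squared Logarithmic Error) of 0.1995 and a cross-validation RMSLE of 0.4047. The gap between the two shows that the model overfits the training set, but its cross-validation score is still the best of all the models.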
The other models couldn’t capture the patterns in the data due to a higher bias than the variance. However, the MLP model was successful in detecting these patterns.
The use of engineered features, such as historical price statistics, may negatively impact the model’s performance due to overfitting.
The model utilized TFIDF sparse vectorization instead of dense Word2Vec representation due to the absence of certain words in the vocabulary of the Google News dataset used to train the Word2Vec model.
LightGBM and the ensemble MLP performed well due to the large dataset and the neural network, respectively.
The accuracy of the model was improved due to the use of sparse hstack feature representation, which had a reduced number of features.
The Kaggle score was significantly improved by incorporating batch normalization and a dropout rate of 0.2. The score improved from 0.418 to 0.412.

The ensemble model result placed the model in the top 1.7% of the leaderboard for this competition, demonstrating its strong performance and competitiveness among other models in the competition.

10. Future Work

Implementing LightGBM & MLP as an ensemble model to find out if there is any improvement.
Implementing an extensive hyperparameter search using LeakyReLU in the MLP models.
Implementing neural networks like LSTM or 1-D CNN models that are specifically designed for handling sequence data.
Implementing Transfer learning techniques to minimize the RMSLE and creating deep neural networks.

11. References

Link to my profile:

The complete code can be found on this GitHub link. You can connect with me on LinkedIn and you can reach me via Gmail.