A walkthrough of my Machine Learning approach in a data science competition
Sharing my solution to help you kickstart your hackathon journey
I recently participated in MachineHack’s Buyer’s Time Prediction Challenge and would like to share my approach with you. So, let's get started with a quick outline:
- Problem Statement
- Data understanding
- Solution: a) Outlier removal, b) Target variable transformation, c) Feature Engineering, and d) Modeling
- Learning from peers
The competition focused on developing a machine learning model to predict buyers’ time spent on an eCommerce platform.
Evaluation metric: Root mean squared logarithmic error
Target variable: “time_spent”
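To make the metric concrete, here is a minimal sketch of root mean squared logarithmic error. It assumes the common log(1 + x) form of the metric (the same form scikit-learn uses); the exact formula used by the competition is not stated here, and the function name `rmsle` is my own.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative values.

    Assumption: the log(1 + x) variant of the metric is used.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Compare the values on the log scale, then take the root of the mean square
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))
```

Because the error is computed on the log scale, it penalizes relative differences rather than absolute ones, which matters for a skewed target like time_spent.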
Let us quickly understand the data:
We have 9 features in total as shown below:
Now that we know what data we will be working with, let us look at the solution:
1. Outlier removal:
I started by removing the 55 entries where time_spent was above the 99th percentile. This was done on the premise that, in the absence of such records, the algorithm would be able to learn the statistical dependencies for the large majority of the data.
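The percentile filter described above could be sketched roughly as follows. The helper name `drop_top_percentile` and the synthetic data are illustrative assumptions; only the column name time_spent and the 99th percentile cutoff come from the write-up.

```python
import numpy as np
import pandas as pd


def drop_top_percentile(df, column="time_spent", q=0.99):
    """Keep only rows whose `column` value is at or below the q-th quantile."""
    cutoff = df[column].quantile(q)
    return df[df[column] <= cutoff].reset_index(drop=True)


# Synthetic, skewed stand-in for the training data (not the competition data)
rng = np.random.default_rng(0)
train = pd.DataFrame({"time_spent": rng.exponential(scale=600.0, size=1000)})

trimmed = drop_top_percentile(train)
print(len(train), len(trimmed))
```

The filter is applied to the training data only; test records cannot be dropped, since every one of them needs a prediction.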
2. Target variable transformation:
Original Target Variable distribution looks like below:
Post log transformation:
I took the log of the target variable and the resulting distribution looks like below:
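A minimal sketch of the transformation and its inverse is shown here. It uses `np.log1p` and `np.expm1`, which is an assumption on my part: the text above only says "log", and log(1 + x) is a common choice because it handles zeros and lines up with the log-based evaluation metric. The sample values are made up.

```python
import numpy as np
import pandas as pd

# Made-up, skewed values standing in for time_spent
time_spent = pd.Series([12.0, 85.5, 430.0, 2600.0])

# Forward transform: compresses the long right tail
y_log = np.log1p(time_spent)

# Inverse transform: maps log-scale values back to the original scale
restored = np.expm1(y_log)
```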
3. Feature Engineering:
As we can see in the “data understanding” section, there are 3 object-type features besides session_id (which is dropped from the model input). These are client_agent, device_details, and date.
Here is how I worked on them:
1) client_agent: I created features from it using TF-IDF, but they did not improve the score, so I did not include them in my submitted solution.
2) device_details: I did one-hot encoding on device_details and ended up creating features like this:
3) date: I converted it into a datetime object and created an extensive set of features from it such as the week of the year, day of the month, the month of the year, day of the week, etc.
I also created features like “week_end” and “month_end” with the following intuition:
- The buyer might be spending a lot of time on the websites figuring out what he will end up buying once he receives his salary in ‘month_end’.
- The buyer might spend more time on non-working days
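The device_details and date steps above can be sketched along these lines. The sample rows, the function name `add_features`, and the exact definitions of the two flags are assumptions: here “week_end” means Saturday or Sunday and “month_end” means the 25th of the month or later, which may differ from the definitions used in the actual solution.

```python
import pandas as pd


def add_features(df):
    """One-hot encode device_details and derive calendar features from date."""
    out = pd.get_dummies(df, columns=["device_details"], prefix="device")
    dates = pd.to_datetime(out["date"])

    out["week_of_year"] = dates.dt.isocalendar().week.astype(int)
    out["day_of_month"] = dates.dt.day
    out["month"] = dates.dt.month
    out["day_of_week"] = dates.dt.dayofweek  # Monday=0 ... Sunday=6

    # Assumed definitions of the two intuition-driven flags
    out["week_end"] = (dates.dt.dayofweek >= 5).astype(int)
    out["month_end"] = (dates.dt.day >= 25).astype(int)

    # The raw date column is no longer needed as a model input
    return out.drop(columns=["date"])


# Hypothetical rows; the device strings are illustrative only
sample = pd.DataFrame({
    "device_details": ["Desktop - Chrome", "iPhone - iOS", "Desktop - Chrome"],
    "date": ["2020-01-25", "2020-01-06", "2020-02-29"],
})
features = add_features(sample)
print(features.columns.tolist())
```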
4. Modeling:
I used Lasso Regression. I also tried other algorithms such as Decision Tree, Random Forest, LightGBM, CatBoost, and XGBoost.
Note that because the model is trained on the log-transformed target, the predictions must be inverse-transformed back to the original scale before making a submission and calculating the evaluation metric.
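Here is a rough sketch of how training on the log target and inverting the predictions fit together. Everything specific in it is an assumption for illustration: the synthetic data, the `alpha` value, the validation split, and the use of `log1p`/`expm1` as the transform pair.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered feature matrix and a skewed target
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
y = np.expm1(4.0 + 0.8 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500))

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit on the log-transformed target (alpha is an arbitrary illustrative value)
model = Lasso(alpha=0.01)
model.fit(X_train, np.log1p(y_train))

# Inverse-transform predictions back to the original scale, clipping at zero
pred = np.clip(np.expm1(model.predict(X_val)), 0, None)

# Evaluate on the original scale with a log(1 + x) based RMSLE
score = np.sqrt(np.mean((np.log1p(pred) - np.log1p(y_val)) ** 2))
print(round(float(score), 4))
```

Forgetting the `expm1` step is an easy mistake: the submission would then contain log-scale values and score very poorly.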
Lastly, learning is never complete if we do not look at the winning solutions and learn from them. So, I went through the link where MachineHack posted winners’ solutions, and here are my key takeaways:
- Creating one combined feature of related variables: ‘purchased’, ‘added_in_cart’, and ‘checked_out’
- Creating more advanced features from ‘client_agent’ such as classifying client agents into handheld devices and desktops, and extracting the browser version
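As a rough illustration of these two takeaways (not the winners' actual code), the sketch below sums the three related columns into one feature and pulls crude device and browser signals out of a user-agent string. The assumption that the three columns are 0/1 flags, the new column names, the regular expressions, and the sample rows are all hypothetical.

```python
import re

import pandas as pd

# Crude patterns; real user-agent parsing needs many more cases
HANDHELD_PATTERN = re.compile(r"Mobile|Android|iPhone|iPad", re.IGNORECASE)
BROWSER_PATTERN = re.compile(r"(Chrome|Firefox)/(\d+)")


def is_handheld(agent):
    """Return 1 if the user-agent string looks like a handheld device, else 0."""
    return int(bool(HANDHELD_PATTERN.search(agent)))


def browser_major_version(agent):
    """Return the major Chrome/Firefox version, or -1 if none is found."""
    match = BROWSER_PATTERN.search(agent)
    return int(match.group(2)) if match else -1


# Hypothetical rows; the three action columns are assumed to be 0/1 flags
sample = pd.DataFrame({
    "purchased": [1, 0],
    "added_in_cart": [1, 1],
    "checked_out": [1, 0],
    "client_agent": [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36",
        "Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like Mac OS X) "
        "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.4 "
        "Mobile/15E148 Safari/604.1",
    ],
})

# One combined feature: how many of the three actions happened in the session
sample["actions_taken"] = sample[["purchased", "added_in_cart", "checked_out"]].sum(axis=1)
sample["is_handheld"] = sample["client_agent"].apply(is_handheld)
sample["browser_version"] = sample["client_agent"].apply(browser_major_version)
```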
Hope you enjoyed learning from my experience of participating in this competition.
Thanks for reading!!!