Published in WiCDS

A walkthrough of my Machine Learning approach in a data science competition

Sharing my solution to help you kickstart your hackathon journey

I recently participated in MachineHack’s Buyer’s Time Prediction Challenge and would like to share my approach with you. So, let's get started with a quick outline:

  • Problem Statement
  • Data understanding
  • Solution: a) Outlier removal, b) Target variable transformation, c) Feature Engineering, and d) Modeling
  • Learning from peers

Problem Statement:

The competition focused on developing a machine learning model to predict the time buyers spend on an eCommerce platform.

Evaluation metric: Root mean squared logarithmic error

Target variable: “time_spent”
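For readers new to the metric, here is a minimal sketch of root mean squared logarithmic error. It follows the commonly used definition based on log(1 + x); the competition's exact scoring code is not part of this article, so treat the function below (and its name, rmsle) as an illustrative assumption.

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative values.

    Uses log(1 + x), the usual convention, so zeros are handled safely.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_log_errors = [
        (math.log1p(p) - math.log1p(t)) ** 2
        for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))
```

Because the error is computed on logs, the metric penalises relative differences rather than absolute ones, so a few very long sessions do not dominate the score.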

Let us quickly understand the data:

We have 9 features in total as shown below:

Source: Author

Solution:

Now that we know what data we are working with, let us look at the solution:

1. Outlier removal:

I started by removing the 55 entries where time_spent is above the 99th percentile. The premise was that, without these extreme records, the algorithm would be able to learn the statistical dependencies that hold for the bulk of the data.
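A minimal sketch of this kind of percentile filter with pandas is shown here. The helper name drop_top_percentile and the toy data are hypothetical; only the time_spent column name and the 99th percentile cut-off come from the article.

```python
import pandas as pd


def drop_top_percentile(df, column="time_spent", q=0.99):
    """Return a copy of df without rows whose `column` exceeds the q-th quantile."""
    threshold = df[column].quantile(q)
    return df[df[column] <= threshold].copy()


# Toy data standing in for the training set: 99 ordinary rows and one extreme row.
toy = pd.DataFrame({"time_spent": list(range(1, 100)) + [10_000]})
trimmed = drop_top_percentile(toy)
```

A filter like this should only be applied to the training data, since the test set has no target column to filter on.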

2. Target variable transformation:

The original distribution of the target variable looks like this:

Post log transformation:

I took the log of the target variable, and the resulting distribution looks like this:
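As a rough illustration, the transformation and its inverse can be written with NumPy as follows. The article does not say which log variant was used; np.log1p is an assumption here, chosen because it handles zeros and lines up with the log(1 + x) form usually used in RMSLE. The values are made up.

```python
import numpy as np

# Made-up session durations, including a zero and a very long session.
time_spent = np.array([0.0, 5.0, 60.0, 3600.0])

# Forward transform applied to the target before training.
log_target = np.log1p(time_spent)

# Inverse transform applied to model outputs to get back to the original scale.
recovered = np.expm1(log_target)
```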

3. Feature Engineering:

As we can see in the “data understanding” section, there are three object-type features besides session_id (which is dropped from the model input): client_agent, device_details, and date.

Here is how I worked on them:

1) client_agent: I created features from it using TF-IDF, but they did not improve the score, so I did not include them in my submitted solution.

2) device_details: I did one-hot encoding on device_details and ended up creating features like this:
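A small sketch of one-hot encoding with pandas is given here. The device strings and the "device" prefix are made up for illustration; only the device_details column name comes from the dataset.

```python
import pandas as pd

# Hypothetical device strings; the real values in the dataset may differ.
toy = pd.DataFrame(
    {"device_details": ["iPhone - iOS", "Desktop - Chrome", "iPhone - iOS"]}
)

# One indicator column per distinct device value.
encoded = pd.get_dummies(toy, columns=["device_details"], prefix="device")
```

When encoding train and test separately, the resulting columns may differ, so it helps to align the test frame to the training columns before predicting.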

3) date: I converted it into a datetime object and created an extensive set of features from it such as the week of the year, day of the month, the month of the year, day of the week, etc.

I also created features like “week_end” and “month_end” with the following intuition:

  • Buyers might spend a lot of time on the website at ‘month_end’, figuring out what they will buy once they receive their salary.
  • Buyers might spend more time on non-working days.
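The date features described above can be sketched with pandas as follows. The helper name add_date_features is hypothetical, and the exact definitions are assumptions: here “week_end” means Saturday or Sunday and “month_end” means the last calendar day of the month, which may differ from what was used in the submitted solution.

```python
import pandas as pd


def add_date_features(df, column="date"):
    """Return a copy of df with calendar features derived from `column`."""
    out = df.copy()
    dates = pd.to_datetime(out[column])
    out["week_of_year"] = dates.dt.isocalendar().week.astype(int)
    out["day_of_month"] = dates.dt.day
    out["month"] = dates.dt.month
    out["day_of_week"] = dates.dt.dayofweek  # Monday=0 ... Sunday=6
    out["week_end"] = (dates.dt.dayofweek >= 5).astype(int)  # Saturday or Sunday
    out["month_end"] = dates.dt.is_month_end.astype(int)  # last day of the month
    return out


# Made-up dates: a Friday at month end, a Saturday, and a Monday.
toy = pd.DataFrame({"date": ["2020-01-31", "2020-02-01", "2020-02-03"]})
features = add_date_features(toy)
```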

4. Modeling: I used Lasso Regression. I also tried other algorithms, including Decision Tree, Random Forest, LightGBM, CatBoost, and XGBoost.

Remember to apply the inverse transformation to the target variable when making predictions and calculating the evaluation metric.
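To make the train-on-log, predict-on-original-scale flow concrete, here is a minimal sketch with scikit-learn's Lasso on synthetic data. The data, the alpha value, and the log1p/expm1 pair are all assumptions for illustration, not the settings of the submitted solution.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic features and a strictly positive, skewed target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.exp(1.0 + 0.5 * X[:, 0] + rng.normal(scale=0.1, size=200))

# Fit on the log-transformed target (alpha is an arbitrary illustrative value).
model = Lasso(alpha=0.01)
model.fit(X, np.log1p(y))

# Predict in log space, then invert the transform to get back to the original scale.
pred = np.expm1(model.predict(X))
pred = np.clip(pred, 0, None)  # guard against small negative values

# Evaluate on the original scale with RMSLE (in-sample here, for brevity).
score = np.sqrt(np.mean((np.log1p(pred) - np.log1p(y)) ** 2))
```

In practice the score would be computed on a held-out validation split rather than on the training rows.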

Learning from peers:

Lastly, learning is never complete until we look at the winning solutions and learn from them. So, I went through the link where MachineHack posted the winners’ solutions, and here are my key takeaways:

  • Creating one combined feature of related variables: ‘purchased’, ‘added_in_cart’, and ‘checked_out’
  • Creating more advanced features from ‘client_agent’ such as classifying client agents into handheld devices and desktops, and extracting the browser version
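The two takeaways above could look something like the sketch below. Everything in it is an assumption: the winners' actual code is not reproduced here, the three flags are assumed to be 0/1 columns, the new column names (funnel, is_handheld, browser_version) are hypothetical, and the regular expressions are rough heuristics rather than a full user-agent parser.

```python
import re

import pandas as pd

# Rough heuristics for user-agent strings; a dedicated parser would be more reliable.
HANDHELD_PATTERN = re.compile(r"Mobile|Android|iPhone|iPad", re.IGNORECASE)
BROWSER_VERSION_PATTERN = re.compile(r"(Chrome|Firefox|Safari)/(\d+)")


def is_handheld(agent):
    """Return 1 if the user agent looks like a handheld device, else 0."""
    return int(bool(HANDHELD_PATTERN.search(agent)))


def browser_major_version(agent):
    """Return the first matching browser's major version, or -1 if none is found."""
    match = BROWSER_VERSION_PATTERN.search(agent)
    return int(match.group(2)) if match else -1


# Made-up rows; the flags are assumed to be 0/1 indicators.
toy = pd.DataFrame(
    {
        "purchased": [1, 0],
        "added_in_cart": [1, 1],
        "checked_out": [1, 0],
        "client_agent": [
            "Mozilla/5.0 (Linux; Android 10) AppleWebKit/537.36 "
            "Chrome/90.0.4430.91 Mobile Safari/537.36",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) "
            "Gecko/20100101 Firefox/88.0",
        ],
    }
)

# One combined feature: the three flags concatenated into a single category.
toy["funnel"] = (
    toy["purchased"].astype(str)
    + toy["added_in_cart"].astype(str)
    + toy["checked_out"].astype(str)
)
toy["is_handheld"] = toy["client_agent"].apply(is_handheld)
toy["browser_version"] = toy["client_agent"].apply(browser_major_version)
```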

Hope you enjoyed learning from my experience of participating in this competition.

Thanks for reading!!!


A collaborative community for Women in Data Science and Programming to learn and grow

Vidhi Chugh

Data Scientist
