Black Friday Data Science Hackathon at analyticsvidhya.com
Analytics Vidhya is a community of data science enthusiasts that runs regular data hackathons. The latest data hack was held over the weekend of 20–22 Nov 2015. Participants ranged from beginners like me to experts like Rohan Rao.
Problem Statement
The challenge was to predict purchase prices of various products purchased by customers based on historical purchase patterns. The data contained features like age, gender, marital status, categories of products purchased, city demographics etc.
Approach:
The data set was quite big, with 550,068 rows in the train set and 233,599 in the test set. I first analyzed the label column for outliers; there were none, and the data was well distributed across the label values. So I went ahead and analyzed the categorical variables, which revealed that their distribution between the train and test sets was also very balanced: the percentage distribution of the categorical values was close in both sets.
Link to detailed analysis: https://github.com/vi3k6i5/black_friday_data_hack/blob/master/analysis.ipynb
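The train/test distribution check above can be sketched with pandas. This is a minimal illustration on toy data, not the notebook's actual code; the column name `Gender` is one of the competition's real columns, but the values here are made up.

```python
import pandas as pd

def category_share(series):
    """Percentage share of each category in a column."""
    return series.value_counts(normalize=True) * 100

# Toy stand-ins for the real train/test files.
train = pd.DataFrame({"Gender": ["M", "M", "F", "M"]})
test = pd.DataFrame({"Gender": ["M", "F", "M", "M"]})

# Side-by-side percentage distribution of one categorical column.
comparison = pd.concat(
    {"train_pct": category_share(train["Gender"]),
     "test_pct": category_share(test["Gender"])},
    axis=1,
)
print(comparison)
```

Running this per categorical column and eyeballing the two percentage columns is enough to confirm the train/test balance described above.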
My initial thought was to go with a random forest regressor. For that I converted all the categorical variables using one-hot encoding, which gave me close to 91 columns. Then I ran the model using scikit-learn's random forest regressor, which sadly took a lot of time. I moved on, figuring that optimizing it would be tough. By this time my RMSE on the public leaderboard was close to 2.65K.
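The one-hot-plus-random-forest step looks roughly like this. A hedged sketch on toy data: the column names mirror the competition's, but the values and hyperparameters are illustrative only.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy frame standing in for the Black Friday train set.
train = pd.DataFrame({
    "Gender": ["M", "F", "M", "F"],
    "Age": ["0-17", "26-35", "26-35", "46-50"],
    "Purchase": [8370, 15200, 1422, 1057],
})

# One-hot encode the categoricals; on the real data this expanded
# the feature set to roughly 91 columns.
X = pd.get_dummies(train[["Gender", "Age"]])
y = train["Purchase"]

model = RandomForestRegressor(n_estimators=10, random_state=0)
model.fit(X, y)
preds = model.predict(X)
```

With ~91 binary columns and 550K rows, each tree in the forest gets expensive, which is consistent with the slow training described above.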
I figured label encoding would make the data set simpler, so I went ahead with that. After label encoding the data I ran an XGBoost model. The code I used for this was an optimized benchmark script shared by another participant, Aayush.
Running that model gave me a very good CV score and an RMSE close to 2.5K on the public leaderboard.
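Label encoding collapses each categorical column into a single integer column instead of ~91 one-hot columns. A minimal sketch with scikit-learn's `LabelEncoder` (toy values; the encoded frame would then feed into an `xgboost.XGBRegressor`, not run here):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the train set.
train = pd.DataFrame({
    "Gender": ["M", "F", "M"],
    "City_Category": ["A", "C", "B"],
})

# Encode each categorical column in place: one integer column per
# feature, keeping the frame narrow for tree models like XGBoost.
for col in ["Gender", "City_Category"]:
    train[col] = LabelEncoder().fit_transform(train[col])
print(train)
```

Tree-based models split on thresholds, so they tolerate the arbitrary ordering that label encoding introduces far better than linear models would.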
At this point it was pretty clear that the most important distinguishing factor was going to be feature engineering. Whoever did the best feature engineering had the highest chance of winning.
So I sat down to analyze the data a little more. I tried average expense grouped by gender, age, and other categories, but none of these gave much of a boost to my local CV. Finally I went with the average purchase price per Product_ID, and that boosted my CV a little.
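The per-product average feature above is a groupby-and-merge. A minimal sketch on toy data (real column names, made-up values):

```python
import pandas as pd

# Toy stand-in for the train set.
train = pd.DataFrame({
    "Product_ID": ["P1", "P1", "P2", "P2", "P2"],
    "Purchase": [100, 300, 50, 70, 60],
})

# Mean purchase price per product, merged back as a new feature.
product_avg = (
    train.groupby("Product_ID")["Purchase"]
    .mean()
    .rename("Product_Avg_Purchase")
    .reset_index()
)
train = train.merge(product_avg, on="Product_ID")
```

On the real data this feature should be computed on train and then mapped onto test, with some fallback (e.g. the global mean) for products unseen in train.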
The next piece I worked on was user categorization: I divided the users into purchase-power categories.
I used this distribution to categorize users:

I took this distribution and classified the users into 10 categories.
Then I re-ran the model with the two extra features I had come up with.
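One way to get 10 purchase-power categories from a spend distribution is a decile split with `pd.qcut`. This is an assumption on my part; the post only says users were split into 10 categories, so treat the quantile binning as an illustrative sketch:

```python
import pandas as pd

# Toy stand-in: per-row purchases for 10 users.
train = pd.DataFrame({
    "User_ID": [1, 1, 2, 3, 3, 4, 5, 6, 7, 8, 9, 10],
    "Purchase": [100, 200, 50, 900, 100, 400, 250, 600, 700, 800, 1200, 20],
})

# Total spend per user, then 10 quantile bins labelled 0..9 as a
# "purchase power" category (decile split is assumed, not confirmed).
user_total = train.groupby("User_ID")["Purchase"].sum()
power = pd.qcut(user_total, q=10, labels=False, duplicates="drop")
train["User_Power"] = train["User_ID"].map(power)
```

The category then joins the training frame as an ordinal feature alongside the product-level average.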
My final public leaderboard score was close to 2.4K, and the final feature importance distribution came out like this:

And the final score on the private LB:

I came 6th overall, with a final RMSE on the private dataset of ~2471. Not bad. I also got to try R again this time, which was good learning. I hope to participate in more such events in the future :)
Cheers