How to finish in the top 10 percentile of the Bike Sharing Demand competition on Kaggle (Part 1)

Vivek Srinivasan
May 3, 2017 · 6 min read

Most aspiring data science professionals want to participate in Kaggle competitions at some point. Like many of them, I started my humble Kaggle journey with the famous Titanic: Machine Learning from Disaster competition. Kaggle has a handful of datasets, ranging from easy to tough, that users can explore to gain practical expertise in data science.

Bike Sharing Demand is one such competition, especially helpful for beginners in the data science world. It is a fairly simple dataset, suitable both for applying concrete statistical techniques like regression and for more advanced ensemble models such as Random Forest and Gradient Boosting.

The bike sharing demand analysis is split into two parts. This first part helps you get started with the dataset and discover some interesting patterns between the dependent and explanatory variables. Model building is covered in the second part, where we start with basic techniques such as regression, work our way through regularization, and end up building more complex ensemble models. By the end of the analysis we will reach the top 10 percentile on the public leaderboard.

All the exploratory analysis and model building in this blog is performed using Python, which has excellent data science libraries such as pandas, NumPy, seaborn, and scikit-learn that we will use throughout. Refer to the following GitHub link for the complete IPython notebook.

About The Competition

Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able to rent a bike from one location and return it to a different place on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world.

The data generated by these systems makes them attractive for researchers because the duration of travel, departure location, arrival location, and time elapsed are explicitly recorded. Bike sharing systems therefore function as a sensor network, which can be used for studying mobility in a city. In this competition, participants are asked to combine historical usage patterns with weather data in order to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C.

More information about the dataset and the competition can be found at the following link: Bike Sharing Demand

Data Summary

As a first step, let's do three simple things with the data.
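A minimal sketch of those first steps: loading the data, peeking at the top rows, and checking the shape and column types. The hard-coded rows below are a stand-in for `pd.read_csv("train.csv")`, mirroring the first records of the competition's training file (schema assumed from the competition page).

```python
import pandas as pd

# Stand-in for pd.read_csv("train.csv") — a few rows matching the
# competition's schema (assumed, for illustration only)
df = pd.DataFrame({
    "datetime": pd.to_datetime(["2011-01-01 00:00", "2011-01-01 01:00", "2011-01-01 02:00"]),
    "season": [1, 1, 1],
    "holiday": [0, 0, 0],
    "workingday": [0, 0, 0],
    "weather": [1, 1, 2],
    "temp": [9.84, 9.02, 9.02],
    "atemp": [14.395, 13.635, 13.635],
    "humidity": [81, 80, 80],
    "windspeed": [0.0, 0.0, 0.0],
    "casual": [3, 8, 5],
    "registered": [13, 32, 27],
    "count": [16, 40, 32],
})

print(df.shape)    # rows x columns
print(df.head())   # first few rows
print(df.dtypes)   # data type of each column
```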


Feature Engineering

As we see from the above results, the columns season, holiday, workingday, and weather should be of categorical data type, but they are currently stored as int. Let us transform the dataset in the following ways so that we can get started with our EDA.
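One way this transformation can look, sketched on a couple of hypothetical rows: derive calendar features from the timestamp, then coerce the coded columns to pandas's categorical type.

```python
import pandas as pd

# Hypothetical rows matching the competition schema (only the columns needed here)
df = pd.DataFrame({
    "datetime": pd.to_datetime(["2011-01-01 00:00", "2011-01-08 01:00"]),
    "season": [1, 1],
    "holiday": [0, 0],
    "workingday": [0, 0],
    "weather": [1, 2],
})

# Split the timestamp into calendar features
df["hour"] = df["datetime"].dt.hour
df["weekday"] = df["datetime"].dt.day_name()
df["month"] = df["datetime"].dt.month_name()

# Re-type the coded columns as categoricals
for col in ["season", "holiday", "workingday", "weather", "hour", "weekday", "month"]:
    df[col] = df[col].astype("category")

print(df.dtypes)
```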

Let's start with a very simple visualization: a count of variables by data type.
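The underlying numbers for that chart are one line of pandas; a small sketch on a hypothetical post-transformation frame:

```python
import pandas as pd

# Hypothetical mix of column types after the categorical conversion described above
df = pd.DataFrame({
    "season": pd.Categorical([1, 1, 2]),
    "weather": pd.Categorical([1, 2, 1]),
    "temp": [9.84, 9.02, 9.02],
    "count": [16, 40, 32],
})

# How many columns of each data type the frame holds
type_counts = df.dtypes.value_counts()
print(type_counts)
# type_counts.plot(kind="bar") renders this as a bar chart
```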

Missing Value Analysis

Once we get the hang of the data and its attributes, the next step is usually to find out whether there are any missing values. Luckily, we do not have any missing values in this data. One way I generally prefer to visualize missing values is through the missingno library in Python.

It is quite a handy library for quickly visualizing missing values across attributes. As mentioned earlier, we got lucky this time, as there are no missing values in the data. We do, however, have a lot of 0's in the "windspeed" column, which we will deal with later when building the machine learning models.
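The check itself is a few lines of pandas; the rows below are hypothetical, chosen to show the zero-windspeed situation without any actual NaNs:

```python
import pandas as pd

# Hypothetical slice of the data — no NaNs, but several zero wind readings
df = pd.DataFrame({
    "windspeed": [0.0, 0.0, 6.0032, 16.9979],
    "humidity": [81, 80, 80, 75],
    "count": [16, 40, 32, 13],
})

missing_per_column = df.isnull().sum()
print(missing_per_column)                          # all zeros here
print((df["windspeed"] == 0).sum(), "zero windspeed rows")

# The missingno view mentioned in the post is one line:
#   import missingno as msno
#   msno.matrix(df)
```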

Outlier Analysis

At first look, the "count" variable contains a lot of outlier data points, which skew the distribution to the right (there are many data points beyond the outer quartile limit). Besides that, the following inferences can also be made from the simple box plots given below.
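One common way to trim such points, sketched on a hypothetical right-skewed sample (the 3-standard-deviation rule used here is a standard simple cutoff, not necessarily the post's exact choice):

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed rental counts with a few extreme hours tacked on
rng = np.random.default_rng(0)
counts = pd.Series(np.r_[rng.poisson(180, 500), [2000, 2500, 3000]], name="count")

# A box plot, e.g. counts.plot(kind="box"), shows these points beyond the whiskers.
# One simple rule: drop points more than 3 standard deviations from the mean.
keep = np.abs(counts - counts.mean()) <= 3 * counts.std()
without_outliers = counts[keep]
print(len(counts) - len(without_outliers), "points removed")
```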

Correlation Analysis

A common way to understand how the dependent variable is influenced by the numerical features is to compute a correlation matrix between them. Let's construct a correlation plot between "count" and ["temp", "atemp", "humidity", "windspeed"].
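In pandas this is a single `.corr()` call; the rows below are hypothetical, chosen so warm, dry hours see more rentals:

```python
import pandas as pd

# Hypothetical rows (not the real data) to illustrate the correlation matrix
df = pd.DataFrame({
    "temp": [9.0, 12.0, 15.0, 20.0, 25.0, 30.0],
    "atemp": [13.0, 16.0, 19.0, 24.0, 29.0, 34.0],
    "humidity": [80, 75, 70, 60, 55, 50],
    "windspeed": [0.0, 7.0, 8.0, 11.0, 13.0, 6.0],
    "count": [40, 60, 90, 150, 210, 260],
})

# Pairwise Pearson correlations between "count" and the numerical features
corr = df[["count", "temp", "atemp", "humidity", "windspeed"]].corr()
print(corr.round(2))
# seaborn renders this as a heatmap: sns.heatmap(corr, annot=True)
```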

The regression plot in Python's seaborn library is a useful way to depict the relationship between two features. Here we consider "count" vs "temp", "humidity", and "windspeed". Although these three numerical features have some correlation with the dependent variable "count", they are not going to help us a lot in prediction, as is clearly visible from the regression plots shown below. So, as a next step, let us see how the categorical variables can help us in model building.
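The line a regression plot draws is just a least-squares fit; a sketch of the numbers behind it, on hypothetical values:

```python
import numpy as np

# Hypothetical feature values (not the real data)
temp = np.array([9.0, 12.0, 15.0, 20.0, 25.0, 30.0])
humidity = np.array([80.0, 75.0, 70.0, 60.0, 55.0, 50.0])
count = np.array([40.0, 60.0, 90.0, 150.0, 210.0, 260.0])

# seaborn's regplot draws a scatter plus exactly this least-squares line,
# e.g. sns.regplot(x=temp, y=count)
slope_temp, intercept_temp = np.polyfit(temp, count, 1)
slope_humidity, intercept_humidity = np.polyfit(humidity, count, 1)
print(f"count ~ {slope_temp:.1f}*temp + {intercept_temp:.1f}")
print(f"count ~ {slope_humidity:.1f}*humidity + {intercept_humidity:.1f}")
```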

Visualizing Distribution Of Independent Variable

As is visible from the figures below, the "count" variable is skewed to the right. A roughly normal distribution is desirable, as many machine learning techniques assume the dependent variable to be normally distributed. One possible solution is to take a log transformation of the "count" variable after removing the outlier data points. After the transformation the data looks a lot better, but it still does not ideally follow a normal distribution.
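A sketch of that transformation on a hypothetical right-skewed sample, checking skewness before and after:

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed counts standing in for the real "count" column
rng = np.random.default_rng(1)
counts = pd.Series(rng.lognormal(mean=4.5, sigma=0.9, size=2000)).round()

# log(1 + x) handles hours with zero rentals safely
log_counts = np.log1p(counts)
print("skew before:", round(counts.skew(), 2), "after:", round(log_counts.skew(), 2))
# A histogram of log_counts, e.g. log_counts.hist(), shows the more symmetric shape
```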

Visualizing Count Vs (Month, Season, Hour, Weekday, Usertype)
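These plots boil down to grouped averages of "count" over each categorical feature; a sketch on hypothetical hourly records (the real data covers two years of hours):

```python
import pandas as pd

# Hypothetical hourly records (not the real data)
df = pd.DataFrame({
    "hour":   [8, 8, 17, 17, 3, 3],
    "season": ["spring", "spring", "fall", "fall", "spring", "fall"],
    "count":  [320, 280, 410, 390, 12, 18],
})

# Average demand per hour and per season — the quantities a seaborn
# pointplot or barplot would draw
hourly = df.groupby("hour")["count"].mean()
seasonal = df.groupby("season")["count"].mean()
print(hourly)
print(seasonal)
```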

From the above charts we can infer:

So we have now visualized the data to a great extent. Let us go ahead and build some models and see how we can reach the top 10 percentile on the leaderboard.

How to finish in the top 10 percentile of the Bike Sharing Demand competition on Kaggle (Part 2)