How to Finish in the Top 10 Percentile of the Bike Sharing Demand Competition on Kaggle (Part 1)
Every aspiring data science professional wants to participate in Kaggle competitions. Like most of them, I started my humble Kaggle journey with the famous Titanic: Machine Learning from Disaster competition. Kaggle has a handful of datasets ranging from easy to tough, which users can explore to gain practical expertise in data science.
Bike Sharing Demand is one such competition, especially helpful for beginners in the data science world. It is a fairly simple dataset, suitable both for applying concrete statistical techniques such as regression and for more advanced ensemble models such as Random Forest and Gradient Boosting.
The bike sharing demand analysis is split into two parts. The first part of the blog helps you get started with the dataset and discover some interesting patterns between the dependent and explanatory variables. Model building is covered in the second part, where we start with basic techniques such as regression, work our way through regularization, and end up building complex ensemble models. At the end of the analysis we will reach the top 10 percentile on the public leaderboard.
All the exploratory analysis and model building in this blog is performed using Python. Python has some excellent libraries for data science, such as pandas, NumPy, seaborn and scikit-learn, which we will use throughout this blog. Refer to the following GitHub link for the complete IPython notebook.
About The Competition
Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able to rent a bike from one location and return it to a different place on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world.
The data generated by these systems makes them attractive for researchers, because the duration of travel, departure location, arrival location, and time elapsed are explicitly recorded. Bike sharing systems therefore function as a sensor network, which can be used for studying mobility in a city. In this competition, participants are asked to combine historical usage patterns with weather data in order to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C.
More information about the dataset and competition can be found at the following link: Bike Sharing Demand
As a first step, let's do three simple things with the data.
- Find the dataset size.
- Get a glimpse of the data by printing a few rows of it.
- See what types of variables make up our data.
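The three steps above can be sketched as follows. A tiny inline DataFrame stands in for the competition's train.csv here, so the snippet is self-contained; in practice you would replace it with `pd.read_csv("train.csv")`.

```python
import pandas as pd

# Stand-in for the Kaggle train.csv (same columns, two sample rows);
# in practice: data = pd.read_csv("train.csv")
data = pd.DataFrame({
    "datetime": ["2011-01-01 00:00:00", "2011-01-01 01:00:00"],
    "season": [1, 1], "holiday": [0, 0], "workingday": [0, 0],
    "weather": [1, 1], "temp": [9.84, 9.02], "atemp": [14.395, 13.635],
    "humidity": [81, 80], "windspeed": [0.0, 0.0],
    "casual": [3, 8], "registered": [13, 32], "count": [16, 40],
})

print(data.shape)   # dataset size as (rows, columns)
print(data.head())  # glimpse of the first few rows
print(data.dtypes)  # data type of each column
```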
As we can see from the above results, the columns season, holiday, workingday and weather should be of categorical data type, but their current data type is int. Let us transform the dataset in the following ways so that we can get started with our EDA.
- Create new columns “date”, “hour”, “weekday” and “month” from the “datetime” column.
- Coerce the data type of “season”, “holiday”, “workingday” and “weather” to categorical.
- Drop the “datetime” column, as we have already extracted useful features from it.
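The three transformation steps can be sketched as below, again on a small inline frame standing in for train.csv:

```python
import pandas as pd

# Two sample rows with the train.csv layout; replace with
# pd.read_csv("train.csv") when working on the real data.
data = pd.DataFrame({
    "datetime": ["2011-01-01 00:00:00", "2011-06-15 17:00:00"],
    "season": [1, 2], "holiday": [0, 0], "workingday": [0, 1],
    "weather": [1, 1], "count": [16, 250],
})

# 1) derive date, hour, weekday and month from the datetime column
dt = pd.to_datetime(data["datetime"])
data["date"] = dt.dt.date
data["hour"] = dt.dt.hour
data["weekday"] = dt.dt.day_name()
data["month"] = dt.dt.month_name()

# 2) coerce the four integer-coded columns to categorical
for col in ["season", "holiday", "workingday", "weather"]:
    data[col] = data[col].astype("category")

# 3) drop the original datetime column
data = data.drop("datetime", axis=1)
```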
Let's start with a very simple visualization: a count of variables by data type.
Missing Value Analysis
Once we get the hang of the data and attributes, the next step is generally to find out whether we have any missing values in our data. Luckily, we do not have any missing values. One way I generally prefer to visualize missing values is through the missingno library in Python.
It is quite a handy library for quickly visualizing missing values in attributes. As I mentioned earlier, we got lucky this time, as there were no missing values in the data. But we do have a lot of 0 values in the “windspeed” column, which we will deal with later when we build machine learning models.
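A minimal sketch of both checks, on a small stand-in frame: counting missing values per column with pandas, and counting the suspicious zeros in “windspeed”. The missingno call is shown commented, since it only adds a visual version of the same information.

```python
import pandas as pd

# Tiny stand-in frame; on the real data, load train.csv instead.
data = pd.DataFrame({
    "temp": [9.84, 9.02, 8.20],
    "windspeed": [0.0, 0.0, 16.9979],
    "count": [16, 40, 32],
})

print(data.isnull().sum())  # per-column missing-value counts (all 0 here)

# "missing" zeros in windspeed are not flagged by isnull(), so count them:
zero_wind = (data["windspeed"] == 0).sum()
print(f"rows with windspeed == 0: {zero_wind}")

# Visual version (requires the missingno package):
# import missingno as msno
# msno.matrix(data)
```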
At first look, the “count” variable contains a lot of outlier data points, which skew the distribution towards the right (as there are more data points beyond the outer quartile limit). Besides this, the following inferences can also be made from the simple box plots given below.
- Spring season has a relatively lower count. The dip in the median value in the box plot gives evidence for it.
- The box plot against hour of the day is quite interesting. The median values are relatively higher at 7AM-8AM and 5PM-6PM. This can be attributed to regular school and office users at those times.
- Most of the outlier points are contributed by “Working Day” rather than “Non Working Day”. This is quite visible from figure 4.
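The box plots discussed above can be produced along the following lines with seaborn. The frame here is a small hand-made stand-in (the column names match the competition data, the values do not):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Small stand-in frame shaped like the transformed competition data.
data = pd.DataFrame({
    "season": [1, 1, 2, 2, 3, 3, 4, 4],
    "hour": [7, 8, 17, 18, 2, 3, 12, 13],
    "workingday": [1, 1, 1, 1, 0, 0, 0, 0],
    "count": [30, 45, 300, 280, 120, 90, 150, 140],
})

# 2x2 grid: overall spread, by season, by hour of day, by working day
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12, 8))
sns.boxplot(data=data, y="count", ax=axes[0][0])
sns.boxplot(data=data, x="season", y="count", ax=axes[0][1])
sns.boxplot(data=data, x="hour", y="count", ax=axes[1][0])
sns.boxplot(data=data, x="workingday", y="count", ax=axes[1][1])
```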
One common way to understand how a dependent variable is influenced by numerical features is to compute a correlation matrix between them. Let's construct a correlation plot between “count” and [“temp”, “atemp”, “humidity”, “windspeed”].
- “temp” and “humidity” have a positive and a negative correlation with “count”, respectively. Although the correlations are not very prominent, the “count” variable still has some dependency on “temp” and “humidity”.
- “windspeed” is not going to be a really useful numerical feature, and that is visible from its correlation value with “count”.
- The “atemp” variable is not taken into account, since “atemp” and “temp” have a strong correlation with each other. During model building, one of the two has to be dropped, since together they would introduce multicollinearity into the data.
- The “casual” and “registered” attributes are also not taken into account, since they are leakage variables in nature and need to be dropped during model building.
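The correlation matrix itself is one line of pandas. The sketch below uses random stand-in data in which “atemp” is deliberately generated as a near-copy of “temp”, so it reproduces the multicollinearity pattern described above:

```python
import numpy as np
import pandas as pd

# Random stand-in data shaped like the competition's numerical columns.
rng = np.random.default_rng(0)
n = 100
temp = rng.uniform(0, 40, n)
data = pd.DataFrame({
    "temp": temp,
    "atemp": temp + rng.normal(0, 1, n),  # nearly collinear with temp
    "humidity": rng.uniform(20, 100, n),
    "windspeed": rng.uniform(0, 40, n),
})
# synthetic target: rises with temp, falls with humidity, plus noise
data["count"] = 5 * data["temp"] - 2 * data["humidity"] + rng.normal(0, 20, n)

corr = data.corr()
print(corr["count"].sort_values(ascending=False))

# Heatmap version with seaborn:
# import seaborn as sns
# sns.heatmap(corr, annot=True, square=True)
```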
A regression plot, available in the seaborn library in Python, is one useful way to depict the relationship between two features. Here we consider “count” vs “temp”, “humidity” and “windspeed”. Although these three numerical features have some correlation with the dependent variable “count”, they are not going to help us a lot in prediction, as is clearly visible from the regression plots shown below. So, as a next step, let us see how the categorical variables can help us in model building.
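The three regression plots can be drawn with `seaborn.regplot`, one per numerical feature, again on synthetic stand-in data:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Random stand-in for the numerical columns and the target.
rng = np.random.default_rng(1)
data = pd.DataFrame({
    "temp": rng.uniform(0, 40, 50),
    "humidity": rng.uniform(20, 100, 50),
    "windspeed": rng.uniform(0, 40, 50),
})
data["count"] = 4 * data["temp"] + rng.normal(0, 30, 50)

# one scatter + fitted regression line per numerical feature
fig, (ax1, ax2, ax3) = plt.subplots(ncols=3, figsize=(15, 4))
sns.regplot(x="temp", y="count", data=data, ax=ax1)
sns.regplot(x="humidity", y="count", data=data, ax=ax2)
sns.regplot(x="windspeed", y="count", data=data, ax=ax3)
```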
Visualizing Distribution Of Independent Variable
As is visible from the figures below, the “count” variable is skewed towards the right. It is desirable to have a normal distribution, as most machine learning techniques require the dependent variable to be normal. One possible solution is to take a log transformation of the “count” variable after removing outlier data points. After the transformation the data looks a lot better, but still does not ideally follow a normal distribution.
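A sketch of that transformation follows, on synthetic right-skewed data. The three-standard-deviations cutoff used for outlier removal here is one common choice, not necessarily the only reasonable one:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed stand-in for the "count" column.
rng = np.random.default_rng(2)
data = pd.DataFrame({"count": rng.lognormal(mean=4, sigma=1, size=500)})

# remove outliers: keep points within 3 standard deviations of the mean
mask = np.abs(data["count"] - data["count"].mean()) <= 3 * data["count"].std()
trimmed = data[mask]

# log-transform the right-skewed variable
trimmed = trimmed.assign(log_count=np.log(trimmed["count"]))

# skewness should shrink sharply after the transformation
print("raw skew:", data["count"].skew())
print("log skew:", trimmed["log_count"].skew())
```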
Visualizing Count Vs (Month,Season,Hour,Weekday,Usertype)
From the above charts we can infer:
- It is obvious that people tend to rent bikes during the summer season, since it is really conducive to riding a bike in that season. Therefore June, July and August have relatively higher demand for bicycles.
- On weekdays, more people tend to rent bicycles around 7AM-8AM and 5PM-6PM. As we mentioned earlier, this can be attributed to regular school and office commuters.
- The above pattern is not observed on Saturdays and Sundays, when more people tend to rent bicycles between 10AM and 4PM.
- The peak user counts around 7AM-8AM and 5PM-6PM are contributed almost entirely by registered users.
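The hour-of-day charts broken down by user type can be drawn with `seaborn.pointplot`. The stand-in frame below keeps “casual” and “registered” in long form so that they can be passed as a `hue`:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hand-made long-form stand-in: average counts per hour per user type.
data = pd.DataFrame({
    "hour": [7, 8, 12, 17, 18] * 2,
    "usertype": ["registered"] * 5 + ["casual"] * 5,
    "count": [280, 310, 120, 350, 330, 20, 25, 90, 40, 35],
})

# one line per user type, hour of day on the x-axis
fig, ax = plt.subplots(figsize=(10, 4))
sns.pointplot(x="hour", y="count", hue="usertype", data=data, ax=ax)
ax.set_title("Count by hour of day and user type")
```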
So we have now visualized the data to a great extent. Let us go ahead, build some models, and see how we can reach the top 10 percentile on the leaderboard.