# How to Finish in the Top 10 Percentile of Kaggle's Bike Sharing Demand Competition (Part 1)

Every aspiring data science professional wants to participate in *Kaggle* competitions. Like most of them, I started my humble *Kaggle* journey through the famous *Titanic: Machine Learning from Disaster* competition.

*Kaggle* has a handful of datasets ranging from easy to tough, which users can explore to gain practical expertise in data science.

*Bike Sharing Demand* is one such competition, especially helpful for beginners in the data science world. It is a fairly simple dataset, suitable for applying some concrete statistical techniques like *Regression*, and also for some advanced ensemble models such as *Random Forest* and *Gradient Boosting* algorithms.

The bike sharing demand analysis is split into two parts. The first part of the blog helps you get started with the dataset and discover some interesting patterns between the dependent and explanatory variables. Model building is covered in the second part of the blog, where we start with basic techniques such as *Regression*, work our way through *Regularization*, and end up building complex ensemble models. At the end of the analysis we will reach the top 10 percentile on the public leaderboard.

As far as this blog is concerned, all the exploratory analysis and model building is performed using Python. Python has some excellent libraries for data science, like pandas, numpy, seaborn and sklearn, which we will use throughout this blog. Refer to the following *Github* link for the complete IPython notebook.

#### About The Competition

Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able to rent a bike from one location and return it to a different place on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world.

The data generated by these systems makes them attractive for researchers because the duration of travel, departure location, arrival location, and time elapsed is explicitly recorded. Bike sharing systems therefore function as a sensor network, which can be used for studying mobility in a city. In this competition, participants are asked to combine historical usage patterns with weather data in order to forecast bike rental demand in the Capital Bike share program in Washington, D.C.

More information about the dataset and competition can be found at the following link: *Bike Sharing Demand*

#### Data Summary

As a first step, let's do three simple things with the data.

- Find the dataset size.
- Get a glimpse of the data by printing a few rows of it.
- See what types of variables make up our data.
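The three steps above can be sketched in pandas as follows. This is a minimal sketch, not the notebook's exact code; a few hand-made rows that follow the competition's column schema stand in for the real `train.csv` so the snippet runs anywhere.

```python
import pandas as pd

# Stand-in rows mimicking the Kaggle training data's schema;
# on Kaggle you would instead call pd.read_csv("train.csv").
df = pd.DataFrame({
    "datetime": ["2011-01-01 00:00:00", "2011-01-01 01:00:00"],
    "season": [1, 1], "holiday": [0, 0], "workingday": [0, 0],
    "weather": [1, 1], "temp": [9.84, 9.02], "atemp": [14.395, 13.635],
    "humidity": [81, 80], "windspeed": [0.0, 0.0],
    "casual": [3, 8], "registered": [13, 32], "count": [16, 40],
})

print(df.shape)    # dataset size: (rows, columns)
print(df.head())   # a glimpse of the first few rows
print(df.dtypes)   # the type of each variable
```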

#### Feature Engineering

As we see from the above results, the columns season, holiday, workingday and weather should be of *categorical data type*. But the current data type is *int* for those columns. Let us transform the dataset in the following ways so that we can get started with our EDA.

- Create new columns “date”, “hour”, “weekday” and “month” from the “datetime” column.
- Coerce the data type of “season”, “holiday”, “workingday” and “weather” to *categorical*.
- Drop the “datetime” column, as we have already extracted useful features from it.
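A minimal pandas sketch of the three transformations above, again on a tiny stand-in frame in place of the real training data:

```python
import pandas as pd

# Stand-in data with the competition's column names.
df = pd.DataFrame({"datetime": ["2011-01-01 04:00:00", "2011-01-08 17:00:00"],
                   "season": [1, 1], "holiday": [0, 0],
                   "workingday": [0, 1], "weather": [1, 2]})

# 1. Derive date, hour, weekday and month from the datetime column.
dt = pd.to_datetime(df["datetime"])
df["date"] = dt.dt.date
df["hour"] = dt.dt.hour
df["weekday"] = dt.dt.day_name()
df["month"] = dt.dt.month_name()

# 2. Coerce the four columns to the categorical data type.
for col in ["season", "holiday", "workingday", "weather"]:
    df[col] = df[col].astype("category")

# 3. Drop datetime now that its useful features are extracted.
df = df.drop("datetime", axis=1)
```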

Let's start with a very simple visualization: the count of variables per data type.
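That count can be obtained with one line of pandas; a bar chart of the result (for example with seaborn) gives the figure. Sketched here on stand-in data, so the counts differ from the real dataset's:

```python
import pandas as pd

# Stand-in frame after the feature engineering step: a mix of
# categorical and numerical columns, as in the engineered data.
df = pd.DataFrame({
    "season": pd.Categorical([1, 2]), "holiday": pd.Categorical([0, 0]),
    "temp": [9.84, 9.02], "humidity": [81, 80], "count": [16, 40],
})

# How many columns fall under each data type.
type_counts = df.dtypes.astype(str).value_counts()
print(type_counts)
```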

#### Missing Value Analysis

Once we get the hang of the data and attributes, the next step is generally to find out whether we have any missing values in our data. Luckily, we do not have any missing values here. One way I generally prefer to visualize missing values is the *missingno* library in Python.

It is quite a handy library for quickly visualizing missing values across attributes. As I mentioned earlier, we got lucky this time, as there were no missing values in the data. But we have a lot of 0's in the “windspeed” column, which we will deal with later when we build machine learning models.
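missingno's `msno.matrix(df)` draws the matrix figure referred to above; the same two facts (no missing values, many zeros in windspeed) can be checked with pandas alone, sketched here on stand-in data:

```python
import pandas as pd

# Stand-in data: no NaNs, but several suspicious 0.0 wind speeds,
# mirroring what the real dataset shows.
df = pd.DataFrame({"windspeed": [0.0, 12.998, 0.0, 0.0],
                   "humidity": [81, 80, 75, 86]})

missing = df.isnull().sum()               # missing values per column
zero_wind = (df["windspeed"] == 0).sum()  # count of 0's in windspeed

print(missing)
print(zero_wind)
```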

#### Outlier Analysis

At first look, the “count” variable contains a lot of outlier data points, which skew the distribution towards the right (as there are more data points beyond the *outer quartile limit*). Besides, the following inferences can also be made from the simple box plots given below.

- The spring season has a relatively lower count. The dip in the median value in the box plot gives evidence for it.
- The box plot against *Hour Of The Day* is quite interesting. The median values are relatively higher at *7AM-8AM* and *5PM-6PM*, which can be attributed to regular school and office users at those times.
- Most of the *outlier* points are contributed by “Working Day” rather than “Non Working Day”, as is quite visible from figure 4.
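The box plot's outer quartile limit corresponds to the usual 1.5 x IQR rule; a sketch of flagging “count” outliers that way, on synthetic stand-in counts:

```python
import numpy as np
import pandas as pd

# Synthetic counts: a bulk of typical values plus a few extreme rentals.
rng = np.random.default_rng(0)
counts = pd.Series(np.r_[rng.poisson(150, 500), [900, 950, 1000]])

# Points beyond Q3 + 1.5*IQR sit past the box plot's upper whisker.
q1, q3 = counts.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = counts[counts > q3 + 1.5 * iqr]
print(len(outliers))
```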

#### Correlation Analysis

One common way to understand how a *dependent variable* is influenced by *numerical features* is to compute a *correlation matrix* between them. Let's construct a *correlation plot* between “count” and [“temp”, ”atemp”, ”humidity”, ”windspeed”].
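Computing that matrix is a one-liner with pandas; seaborn's `sns.heatmap(corr, annot=True)` then draws the plot. Sketched here on synthetic stand-in data built to mimic the relationships described below (these names and coefficients are illustrative, not the real data's):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: count rises with temp, falls with humidity,
# ignores windspeed; atemp is nearly identical to temp.
rng = np.random.default_rng(42)
temp = rng.uniform(0, 40, 200)
df = pd.DataFrame({
    "temp": temp,
    "atemp": temp + rng.normal(0, 1, 200),
    "humidity": rng.uniform(20, 100, 200),
    "windspeed": rng.uniform(0, 30, 200),
})
df["count"] = 10 * df["temp"] - 2 * df["humidity"] + rng.normal(0, 40, 200)

corr = df[["temp", "atemp", "humidity", "windspeed", "count"]].corr()
print(corr["count"])
```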

- The “temp” and “humidity” features have a positive and a negative correlation with “count”, respectively. Although the correlations are not very prominent, the count variable still has some dependency on “temp” and “humidity”.
- “windspeed” is not going to be a really useful numerical feature, and that is visible from its correlation value with “count”.
- The “atemp” variable is not taken into account, since “atemp” and “temp” have a strong correlation with each other. During model building, one of the two has to be dropped, since they will exhibit *multicollinearity* in the data.
- The “casual” and “registered” attributes are also not taken into account, since they are *leakage variables* and need to be dropped during model building.

The regression plot in Python's seaborn library is one useful way to depict the relationship between two features. Here we consider “count” vs “temp”, “humidity” and “windspeed”. Although these three numerical features have some correlation with the dependent variable “count”, they are not going to help us a lot in prediction, as is clearly visible from the regression plots shown below. So, as a next step, let us see how the categorical variables help us in model building.
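Under the hood, such a regression plot (`sns.regplot(x="temp", y="count", data=df)` in seaborn) scatters the points and fits an ordinary least-squares line; the same line can be recovered with `np.polyfit`. A sketch on synthetic stand-in data with a deliberately weak-ish relationship:

```python
import numpy as np

# Synthetic stand-in: count grows with temp but with large scatter,
# like the noisy trends in the post's regression plots.
rng = np.random.default_rng(7)
temp = rng.uniform(0, 40, 300)
count = 6 * temp + rng.normal(0, 60, 300)

# Degree-1 least-squares fit: the line seaborn's regplot would draw.
slope, intercept = np.polyfit(temp, count, 1)
print(slope)
```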

#### Visualizing Distribution Of Dependent Variable

As is visible from the figures below, the “count” variable is skewed towards the right. It is desirable to have a normal distribution, as most machine learning techniques require the dependent variable to be *normal*. One possible solution is to take a *log transformation* of the “count” variable after removing the outlier data points. After the transformation the data looks a lot better, but it still does not ideally follow a normal distribution.
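A sketch of that transformation on synthetic right-skewed counts; outliers are trimmed here with the common three-standard-deviation rule (one reasonable choice, not necessarily the notebook's exact one), then the log is taken:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed counts standing in for the "count" column.
rng = np.random.default_rng(1)
counts = pd.Series(rng.lognormal(4, 0.8, 1000).round() + 1)

# Trim points beyond three standard deviations, then log-transform.
trimmed = counts[np.abs(counts - counts.mean()) <= 3 * counts.std()]
log_counts = np.log(trimmed)

# The skewness shrinks sharply after the transform.
print(counts.skew(), log_counts.skew())
```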

#### Visualizing Count Vs (Month,Season,Hour,Weekday,Usertype)

From the above charts we can infer:

- It is obvious that people tend to rent bike during summer

season since it is really conducive to ride bike in that

season. Therefore June, July and August has relatively higher

demand for bicycle. - On weekdays more people tend to rent bicycle around
*7AM-8AM*and*5PM-6PM*. As we mentioned earlier this can be attributed to regular school and office commuters. - The above pattern is not observed on Saturdays and Sundays where more people tend to rent bicycle between
*10AM and 4PM*. - The peak user count around
*7AM-8AM*and*5PM-6PM*is purely contributed by registered user.
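The hourly pattern behind these charts comes down to a groupby of mean count per hour, split by working day (the post draws it with seaborn point plots). A sketch on synthetic stand-in data with commute peaks baked in:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: 30 days of hourly counts, with commute-hour
# peaks (8AM, 5PM) only on working days.
rng = np.random.default_rng(3)
hours = np.tile(np.arange(24), 30)
workingday = rng.integers(0, 2, hours.size)
base = 50 + 150 * workingday * np.isin(hours, [8, 17])
count = base + rng.poisson(20, hours.size)

df = pd.DataFrame({"hour": hours, "workingday": workingday, "count": count})

# Mean count per (workingday, hour): the series each point plot traces.
hourly = df.groupby(["workingday", "hour"])["count"].mean()
print(hourly.loc[1].idxmax())  # peak hour on working days
```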

So we have now visualized the data to a great extent. Let us go ahead, build some models, and see how we can reach the top 10 percentile on the leaderboard.

*How to Finish in the Top 10 Percentile of Kaggle's Bike Sharing Demand Competition (Part 2)*