Forecasting Demand for Bike Sharing System with Python — Part 1

Data Preparation and Feature Visualization

Cheer Hung

Published in

Cheer and Utkarsh’s trial on Machine Learning

8 min read · Jan 10, 2020

Prediction Machine Learning

“Prediction” refers to the output generated by an algorithm that has been trained on historical data and then applied to unseen data. In this project, it refers to the number of bikes rented out in the next quarter, estimated from customers’ previous rental behavior.

Objective

Bike sharing programs are becoming more and more popular around the world because of environmental concerns, pricing and convenience. For a city or government trying to understand and manage mobility flows, in this case in Washington, bike sharing system data can serve as a proxy for how people commute around the city.

Therefore, we decide to use the Bike Sharing dataset provided by UCI. The main goal of this project is to predict the hourly number of bikes rented during the last quarter of the year.

For this purpose, two datasets with 17 variables were provided, covering the period from January 1, 2011 to December 31, 2012. One dataset contains hourly information and the other contains daily information about the bike sharing business.

Let’s Do it!

We start by importing the two datasets with the function pd.read_csv, setting the column instant as the index.

# Importing the data
import pandas as pd
# Replace ~ with the path to each CSV file
day = pd.read_csv(~, index_col="instant", parse_dates=True)
hour = pd.read_csv(~, index_col="instant", parse_dates=True)
# Checking the head of the hourly and daily dataset
day.head()
hour.head()

Then we use the function .head() to check the head of the dataset. We are going to focus on the hourly dataset as the target is to make hourly predictions. However, we will also look at the daily dataset to explore any possible differences.
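As a minimal sketch of what index_col does, the snippet below reads a few illustrative rows from an in-memory string instead of the real files. The rows and the reduced column set are assumptions made for the example; the real files have 17 columns.

```python
import io
import pandas as pd

# Illustrative rows only; the real files have more columns and far more rows.
sample_csv = io.StringIO(
    "instant,dteday,hr,cnt\n"
    "1,2011-01-01,0,16\n"
    "2,2011-01-01,1,40\n"
    "3,2011-01-01,2,32\n"
)

# index_col="instant" turns the record counter into the row index
sample = pd.read_csv(sample_csv, index_col="instant")

print(sample.head())
print(sample.index.name)
```

With instant as the index, the remaining columns are the actual features, which keeps the record counter out of any later modelling step.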

Data Preparation

Different analyses will be performed to understand the data we are dealing with. These analyses include some of the following:

Checking missing values and coherence
Checking variable types
Analyzing distribution of the variables

Check Missing Values

# Checking missing values for the dataset
day.isnull().any()
hour.isnull().any()

We use .isnull().any() to check whether any values are missing. As we can see, there are no missing values in either CSV file. Then we check the number of entries with .shape.
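To make the two checks concrete, here is a small sketch on a toy frame with one deliberately missing value. The frame and its values are hypothetical and only stand in for the real datasets.

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing value (hypothetical data)
toy = pd.DataFrame({"temp": [0.24, np.nan, 0.22], "cnt": [16, 40, 32]})

has_missing = toy.isnull().any()   # one boolean per column
n_missing = toy.isnull().sum()     # number of missing cells per column
n_rows, n_cols = toy.shape         # (rows, columns)

print(has_missing)
print(n_missing)
print(n_rows, n_cols)
```

.any() only says whether a column has gaps; .sum() is a handy follow-up when you also want to know how many.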

Here is an easy calculation: 731 days in the day dataset times 24 hours a day. We would expect the hour dataset to have 17544 entries (1 per hour). However, it has only 17379 entries. What about the remaining 165 hours? We need to check the consistency of the dataset.
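The same arithmetic in code, using the row counts quoted above:

```python
n_days = 731          # rows in the daily dataset
hours_per_day = 24
actual_rows = 17379   # rows in the hourly dataset

expected_rows = n_days * hours_per_day
missing_hours = expected_rows - actual_rows

print(expected_rows)  # 17544
print(missing_hours)  # 165
```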

To visualise the inconsistency, we plot the number of hourly records per day. The x axis of this graph is the date and the y axis is the number of hours for which there is an entry. A value of 24 means the dataset has a record for every hour of that day (1 entry per hour), and a valley marks an anomaly that left the business inoperable for that many hours. A regular pattern could be interpreted as routine maintenance of the bikes, but since no pattern is visible, the gaps point to events in Washington on those days that interrupted the service.
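One way to get the numbers behind such a graph is to count the records per date. The sketch below does this on a hypothetical three-day frame in which the middle day keeps only one record; the dates and the dropped hours are assumptions chosen for illustration, not an extract of the real data.

```python
import pandas as pd

# Hypothetical hourly records for three days
days = ["2012-10-28", "2012-10-29", "2012-10-30"]
rows = [(d, h) for d in days for h in range(24)]
toy = pd.DataFrame(rows, columns=["dteday", "hr"])

# Drop hours 1..23 of the middle day to mimic a service outage
toy = toy[~((toy["dteday"] == "2012-10-29") & (toy["hr"] > 0))]

# Number of hourly records available for each date
hours_per_day = toy.groupby("dteday").size()

# Days with fewer than 24 records, and the total number of missing hours
incomplete = hours_per_day[hours_per_day < 24]
missing_total = (24 - hours_per_day).sum()

print(hours_per_day)
print(incomplete)
print(missing_total)
```

Plotting hours_per_day against the date would give a flat line at 24 with a valley on each incomplete day.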

We see a significant dip in October 2012. Tracing it back in the data, we found a day on which the business did not operate for 23 hours. Upon further research, we found that Washington DC was hit by a hurricane on October 29, 2012. We can therefore infer that these missing hours have an underlying cause and hence should not be imputed.

Convert Numerical Variables into Categories

# check the datatype
hour.info()

By looking at the above results, we can see that some columns are read as integers even though they are better interpreted as categories. Here, we use the pandas function pd.to_datetime to convert dteday to the right datatype. Then we use .astype("category") to convert the numerical variables to categorical ones.

Then we call .info() again to verify that all the variables have been changed. According to the result, all of them were converted correctly.
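A minimal sketch of this conversion on a toy frame is shown below. The list of columns treated as categorical is an assumption for the example, since the text does not spell out which ones were converted.

```python
import pandas as pd

# Hypothetical rows using column names from the dataset
toy = pd.DataFrame({
    "dteday": ["2011-01-01", "2011-01-01", "2011-01-02"],
    "season": [1, 1, 1],
    "weathersit": [1, 2, 1],
    "cnt": [16, 40, 32],
})

# Parse the date strings into a proper datetime column
toy["dteday"] = pd.to_datetime(toy["dteday"])

# Assumed list of coded columns; adjust it to the variables you treat as categories
categorical_cols = ["season", "weathersit"]
toy = toy.astype({col: "category" for col in categorical_cols})

print(toy.dtypes)
```

Passing a dictionary to .astype() converts several columns in one call and leaves the others, such as cnt, untouched.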

Data Visualisation

Data visualisation is an important step for gaining insights into customer behaviour. Here, we use the seaborn library to generate the graphs.

# Total number of passengers by year and season
import matplotlib.pyplot as plt
import seaborn as sns
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(35, 10))
sns.boxplot(x="yr", y="cnt", data=hour, ax=ax1, palette="Blues_d")
ax1.set_title("Boxplot for year variable", fontsize=30)
ax1.set_xlabel("Year", fontsize=20)
ax1.set_ylabel("Cnt", fontsize=20)
sns.boxplot(x="season", y="cnt", data=hour, ax=ax2, palette="Blues_d")
ax2.set_title("Boxplot for season variable", fontsize=30)
ax2.set_xlabel("Season", fontsize=20)
ax2.set_ylabel("Cnt", fontsize=20)

Yr: Bike rentals increased by around 64% from 2011 to 2012.
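A year-over-year figure like this can be computed from the yearly totals. The sketch below uses made-up counts, so the resulting percentage only illustrates the formula; in the dataset, yr is coded 0 for 2011 and 1 for 2012.

```python
import pandas as pd

# Hypothetical counts; yr is coded 0 for 2011 and 1 for 2012
toy = pd.DataFrame({"yr": [0, 0, 1, 1], "cnt": [100, 150, 200, 210]})

# Total rentals per year
totals = toy.groupby("yr")["cnt"].sum()

# Relative change from the first year to the second, in percent
pct_increase = (totals.loc[1] - totals.loc[0]) / totals.loc[0] * 100

print(totals)
print(pct_increase)
```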

Season: People rent more bikes during seasons 2 (summer) and 3 (fall) and fewer in winter and spring. This may be due to weather conditions, as the weather is at its best in summer and fall in Washington.

# Total number of passengers by month and holiday vs non holiday day
fig, (ax3, ax4) = plt.subplots(ncols=2, figsize=(35, 10))
sns.boxplot(x="mnth", y="cnt", data=hour, ax=ax3, palette="Blues_d")
ax3.set_title("Boxplot for month variable", fontsize=30)
ax3.set_xlabel("Month", fontsize=20)
ax3.set_ylabel("Cnt", fontsize=20)
sns.boxplot(x="holiday", y="cnt", data=hour, ax=ax4, palette="Blues_d")
ax4.set_title("Boxplot for holiday variable", fontsize=30)
ax4.set_xlabel("Holiday", fontsize=20)
ax4.set_ylabel("Cnt", fontsize=20)

Month: People rent more bikes from May to October and fewer in December, January and February. As with the season variable, the trend is in sync with weather conditions.

Holiday: People rent more bikes on non-holidays than on holidays. This could be because people who bike to work or school do not use them during holidays.

# Total number of passengers by weekday and by workingday vs non workingday
fig, (ax5, ax6) = plt.subplots(ncols=2, figsize=(35, 10))
sns.boxplot(x="weekday", y="cnt", data=hour, ax=ax5, palette="Blues_d")
ax5.set_title("Boxplot for weekday variable", fontsize=30)
ax5.set_xlabel("Weekday", fontsize=20)
ax5.set_ylabel("Cnt", fontsize=20)
sns.boxplot(x="workingday", y="cnt", data=hour, ax=ax6, palette="Blues_d")
ax6.set_title("Boxplot for workingday variable", fontsize=30)
ax6.set_xlabel("Workingday", fontsize=20)
ax6.set_ylabel("Cnt", fontsize=20)

Weekday: People seem to rent fewer bikes during weekends. Again, this could be due to the riders who commute to work or school. Monday also has a lower count than the rest of the weekdays.

Workingday: There seems to be no big difference between the total number of people renting bikes on the weekend and during the week. The median is higher for working days than for non-working days, which suggests people mostly use bikes to commute within the city.
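The medians that the boxplot shows can also be read off numerically with a groupby. This is a sketch on hypothetical counts, so the values only illustrate the comparison.

```python
import pandas as pd

# Hypothetical hourly counts for non-working (0) and working (1) days
toy = pd.DataFrame({
    "workingday": [0, 0, 0, 1, 1, 1],
    "cnt":        [50, 120, 400, 90, 180, 350],
})

# Median and total rentals per group
summary = toy.groupby("workingday")["cnt"].agg(["median", "sum"])

print(summary)
```

Comparing the median alongside the sum helps separate "typical hour" effects from differences in total volume.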

# Total number of passengers by weather situation and hour
fig, (ax7, ax8) = plt.subplots(ncols=2, figsize=(40, 10))
sns.boxplot(x="weathersit", y="cnt", data=hour, ax=ax7, palette="Blues_d")
ax7.set_title("Boxplot for weathersit variable", fontsize=30)
ax7.set_xlabel("Weathersit", fontsize=30)
ax7.set_ylabel("Cnt", fontsize=30)
sns.boxplot(x="hr", y="cnt", data=hour, ax=ax8, palette="Blues_d")
ax8.set_title("Boxplot for hour variable", fontsize=30)
ax8.set_xlabel("hour", fontsize=30)
ax8.set_ylabel("Cnt", fontsize=30)

Weather: The weather clearly affects the count, with the fewest bikes rented in extreme weather (weathersit=4). People tend to rent bikes on clear days (weathersit=1).

Hr: The median values are relatively higher from 7AM to 8AM and from 5PM to 6PM. This can be attributed to regular school and office users at those times. The next graph gives a better insight.

According to the graph, there are rush hours during the morning and evening on working days, so we should take these periods into account when doing the feature engineering. Let’s dig into the details a little more by dividing the renters into 2 groups: casual and registered.
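The rush-hour pattern can also be checked numerically by averaging the count per hour, split by working day. The sketch below uses hypothetical records for two hours only; it is one possible way to tabulate the pattern, not necessarily how the graph was produced.

```python
import pandas as pd

# Hypothetical records: two hours on working days (1) and non-working days (0)
toy = pd.DataFrame({
    "hr":         [8, 8, 14, 14, 8, 8, 14, 14],
    "workingday": [1, 1, 1, 1, 0, 0, 0, 0],
    "cnt":        [400, 420, 150, 170, 60, 80, 300, 320],
})

# Mean rentals per hour, one column per workingday value
hourly_profile = toy.pivot_table(
    index="hr", columns="workingday", values="cnt", aggfunc="mean"
)

# Hour with the highest mean count for each workingday value
peak_hour = hourly_profile.idxmax()

print(hourly_profile)
print(peak_hour)
```

On the real data, the working-day column would be expected to peak around the commuting hours, which is the kind of signal a rush-hour feature could capture.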

# Hourly distribution of casual vs registered
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(40, 10))
sns.boxplot(x="hr", y="casual", data=hour, ax=ax1, palette="Blues_r")
ax1.set_title("Distribution of hourly casual customers", fontsize=40)
ax1.set_xlabel("Hour", fontsize=30)
ax1.set_ylabel("Casual", fontsize=30)
sns.boxplot(x="hr", y="registered", data=hour, ax=ax2, palette="Oranges_r")
ax2.set_title("Distribution of hourly registered customers", fontsize=40)
ax2.set_xlabel("Hour", fontsize=30)
ax2.set_ylabel("Registered", fontsize=30)

The patterns of casual and registered customers are different. For casual customers, there is only one peak period, between 14–17. For registered customers, however, there are observable spikes in the morning and in the afternoon. As mentioned before, this is probably related to registered customers using the bikes to commute in the city.

We also look into the distribution of casual vs registered customers by weekday using the code below.

# Weekday distribution of casual vs registered
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(40, 10))
sns.boxplot(x="weekday", y="casual", data=hour, ax=ax1, palette="Blues_r")
ax1.set_title("Distribution of weekday casual customers", fontsize=40)
ax1.set_xlabel("Weekday", fontsize=30)
ax1.set_ylabel("Casual", fontsize=30)
sns.boxplot(x="weekday", y="registered", data=hour, ax=ax2, palette="Oranges_r")
ax2.set_title("Distribution of weekday registered customers", fontsize=40)
ax2.set_xlabel("Weekday", fontsize=30)
ax2.set_ylabel("Registered", fontsize=30)

Casual customers tend to rent more bikes on the weekend, whereas registered customers are more regular in their rentals. A small decrease can be observed on the weekends for registered customers.

# Workingday vs non workingday distribution of casual vs registered
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(40, 10))
sns.boxplot(x="workingday", y="casual", data=hour, ax=ax1, palette="Blues_r")
ax1.set_title("Casual in working day vs non working day", fontsize=40)
ax1.set_xlabel("Workingday", fontsize=30)
ax1.set_ylabel("Casual", fontsize=30)
sns.boxplot(x="workingday", y="registered", data=hour, ax=ax2, palette="Oranges_r")
ax2.set_title("Registered in working day vs non working day", fontsize=40)
ax2.set_xlabel("Workingday", fontsize=30)
ax2.set_ylabel("Registered", fontsize=30)

Note: the y-axis of the blue diagram needs to be changed.

The proportion of casual customers is much higher on non-working days than on working days. This is in accordance with the analysis that has already been done.
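The proportion itself is easy to compute: sum the casual and registered counts per group and divide. The numbers below are hypothetical and only illustrate the calculation.

```python
import pandas as pd

# Hypothetical hourly counts on working (1) and non-working (0) days
toy = pd.DataFrame({
    "workingday": [1, 1, 0, 0],
    "casual":     [10, 20, 60, 40],
    "registered": [190, 180, 140, 160],
})

# Total casual and registered rentals per group
sums = toy.groupby("workingday")[["casual", "registered"]].sum()

# Share of casual rentals among all rentals in each group
casual_share = sums["casual"] / (sums["casual"] + sums["registered"])

print(casual_share)
```

The same pattern works for any other grouping column, for example weathersit.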

# Weathersit distribution of casual vs registered
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(40, 10))
sns.boxplot(x="weathersit", y="casual", data=hour, ax=ax1, palette="Blues_r")
ax1.set_title("Casual customers per weathersit", fontsize=40)
ax1.set_xlabel("Weathersit", fontsize=30)
ax1.set_ylabel("Casual", fontsize=30)
sns.boxplot(x="weathersit", y="registered", data=hour, ax=ax2, palette="Oranges_r")
ax2.set_title("Registered customers per weathersit", fontsize=40)
ax2.set_xlabel("Weathersit", fontsize=30)
ax2.set_ylabel("Registered", fontsize=30)

There is a heavy decrease in the proportion of casual customers when the weather conditions are not favourable, which suggests that casual customers mostly use bikes for leisure.

In this chapter, we developed a basic understanding of our dataset and completed about 60% of the EDA. We also covered key steps of data preparation and visualisation. In the next chapter, we will finish the visualisation and talk about dealing with skewness and dummy-encoding.