Forecasting Demand for Bike Sharing System with Python — Part 2

Feature Analysis and Feature Engineering

Published in

Cheer and Utkarsh’s trial on Machine Learning

6 min read · Jan 17, 2020

In the first chapter, we walked through the pipeline of understanding the data, cleaning it, and visualizing the important variables and their relationships. In this chapter we finish the numerical analysis and move on to feature engineering.

Numerical Analysis

Four numerical features need to be looked into: temp, atemp, windspeed and hum.

Definition of temp:
Normalized temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-8, t_max=+39 (only in hourly scale)

Definition of atemp:
Normalized feeling temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-16, t_max=+50 (only in hourly scale)
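To make the normalization concrete, here is a minimal sketch of the min-max formula above and its inverse. The helper names normalize and denormalize are our own for illustration; they are not part of the dataset or of any library.

```python
def normalize(t, t_min, t_max):
    """Min-max scale a raw value into the range [0, 1]."""
    return (t - t_min) / (t_max - t_min)


def denormalize(x, t_min, t_max):
    """Invert the min-max scaling to recover the raw value."""
    return x * (t_max - t_min) + t_min


# Bounds quoted in the definitions of temp and atemp
TEMP_MIN, TEMP_MAX = -8, 39
ATEMP_MIN, ATEMP_MAX = -16, 50

# A normalized value of 0.5 maps back to different Celsius readings,
# because the two variables use different bounds
temp_celsius = denormalize(0.5, TEMP_MIN, TEMP_MAX)
atemp_celsius = denormalize(0.5, ATEMP_MIN, ATEMP_MAX)
print(temp_celsius, atemp_celsius)
```

Reading a normalized value back in Celsius this way can make the plots easier to interpret.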

Let’s dig into temp and atemp first, using a boxplot and a scatterplot for each. In the boxplots below, there seem to be no outliers for either “temp” or “atemp”. Both variables are normalized, and temp appears evenly distributed around a median value of 0.5. From the scatterplots, it is possible to observe that the higher the temperature and the feeling temperature, the higher the number of people renting bikes.

import matplotlib.pyplot as plt
import seaborn as sns

# `hour` is the hourly DataFrame prepared in Part 1
# temp: boxplot (left) and scatterplot against the rental count (right)
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(40, 10))
sns.boxplot(x=hour.temp, ax=ax1)
ax1.set_xlabel("temp", fontsize=20)
ax2.scatter(hour.temp, hour.cnt)
ax2.set_xlabel("temp", fontsize=20)
ax2.set_ylabel("count", fontsize=20)

# atemp: the same pair of plots
fig, (ax3, ax4) = plt.subplots(ncols=2, figsize=(40, 10))
sns.boxplot(x=hour.atemp, ax=ax3)
ax3.set_xlabel("atemp", fontsize=20)
ax4.scatter(hour.atemp, hour.cnt)
ax4.set_xlabel("atemp", fontsize=20)
ax4.set_ylabel("count", fontsize=20)

Definition of windspeed:
Normalized wind speed. The values are divided by 67 (max)

For windspeed, we use the same code to check whether outliers exist and whether the variable is normally distributed. There are some outliers in the windspeed variable. At the same time, even though it is a continuous variable, windspeed takes only a limited set of distinct values, so it seems to behave like a categorical variable with multiple categories.

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(40, 10))
sns.boxplot(x=hour.windspeed, ax=ax1)
ax2.scatter(hour.windspeed, hour.cnt)
ax2.set_xlabel("windspeed", fontsize=20)
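The points a boxplot draws beyond its whiskers follow, by default, the 1.5 × IQR rule, so the same check can be done numerically. Below is a minimal sketch under that assumption; the helper iqr_outliers and the toy series are hypothetical stand-ins, not values from the bike sharing data.

```python
import pandas as pd


def iqr_outliers(series, k=1.5):
    """Return a boolean mask marking values outside the boxplot whiskers."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Toy stand-in for hour["windspeed"]; these are not values from the dataset
toy_windspeed = pd.Series([0.0, 0.1, 0.1, 0.2, 0.2, 0.2, 0.3, 0.3, 0.9])

mask = iqr_outliers(toy_windspeed)
print(toy_windspeed[mask])                       # values flagged as outliers
print(toy_windspeed.nunique(), "distinct values")  # few distinct values hints at a discrete variable
```

On the real data, the same call would be iqr_outliers(hour["windspeed"]), and nunique() gives a quick read on whether the variable behaves like a categorical one.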

Definition of hum:
Normalized humidity. The values are divided by 100 (max)

Using the same code to examine humidity gives the result below. Normally we would eliminate the outliers, but in this case time is one of the dimensions of the data. Removing the outliers would mean removing time frames from the dataset, which we did not consider appropriate because it only creates more of the inconsistency we discussed in the previous chapter.

After all the visualization, we suspected that some variables might largely describe others, so we go through the correlation heat-map to check whether any variable needs to be dropped.

Analyzing Correlation

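# Checking correlations
# numeric_only=True (pandas >= 1.5) skips any non-numeric columns
correlation_matrix = hour.corr(numeric_only=True)
# Styler.format(precision=2) replaces set_precision(2), deprecated since pandas 1.3
correlation_matrix.style.background_gradient(cmap="coolwarm").format(precision=2)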

According to the correlation heat-map above, temp and atemp are highly correlated. We also make a scatter plot to visualize the relationship between them. We decide to drop atemp to avoid multicollinearity.

# Plotting temp against atemp (highly correlated)
plt.scatter(hour.temp, hour.atemp)
plt.xlabel("temp", fontsize=20)
plt.ylabel("atemp", fontsize=20)
plt.yticks(fontsize=15)
plt.xticks(fontsize=15)
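Reading pairs off a heat-map by eye works for a handful of columns; the same check can also be done programmatically. The sketch below is one possible way to list strongly correlated pairs. The helper highly_correlated_pairs, the 0.9 threshold and the synthetic data are all our own assumptions for illustration.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(df, threshold=0.9):
    """List column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is reported once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold]


# Synthetic stand-in data, not the bike sharing dataset
rng = np.random.default_rng(0)
temp = rng.uniform(0, 1, 200)
toy = pd.DataFrame({
    "temp": temp,
    "atemp": temp + rng.normal(0, 0.02, 200),  # nearly a copy of temp
    "hum": rng.uniform(0, 1, 200),             # independent of temp
})

result = highly_correlated_pairs(toy)
print(result)
```

For each pair reported, one of the two columns is a candidate for dropping, which is the same reasoning that leads to removing atemp here.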

Dealing With Skewness

For normally distributed data, the skewness should be about 0. A skewness value > 0 means that there is more weight in the right tail of the distribution. The function skew (in scipy.stats) computes the skewness value, and skewtest can be used to determine if that value is close enough to 0, statistically speaking.

Let’s analyze the skewness of each numerical variable, both visually and with the skew function.

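# Plot the distribution of each numerical variable
fig, (ax1, ax2, ax3, ax4) = plt.subplots(ncols=4, figsize=(60, 10))
# histplot(..., kde=True) is used because distplot is deprecated since seaborn 0.11
sns.histplot(x=hour["temp"], kde=True, ax=ax1)
sns.histplot(x=hour["hum"], kde=True, ax=ax2)
sns.histplot(x=hour["windspeed"], kde=True, ax=ax3)
sns.histplot(x=hour["cnt"], kde=True, ax=ax4)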

According to the statistics and diagrams above, ‘temp’ and ‘hum’ do not show skewness, so no transformation will be made. ‘windspeed’ and ‘cnt’, however, appear to be skewed, and the skew test confirms it. We will try both the log and the square root transformations. Let’s take cnt as an example.

Ta-Da! After taking the square root, the skewness of cnt drops from 1.27741 to 0.2864, a significant improvement. Now we move on to the last step before the machine learning.
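The comparison can be sketched as follows. This is only an illustration on synthetic right-skewed data standing in for hour["cnt"], so the numbers it prints will not match the figures quoted for the real dataset; the variable names are ours.

```python
import numpy as np
from scipy.stats import skew, skewtest

# Synthetic right-skewed counts standing in for hour["cnt"] (not real data)
rng = np.random.default_rng(42)
toy_cnt = rng.exponential(scale=150, size=5000).round() + 1  # +1 keeps log defined

candidates = {
    "raw": toy_cnt,
    "log": np.log(toy_cnt),
    "sqrt": np.sqrt(toy_cnt),
}

# Compare the skewness of each candidate and test whether it differs from 0
for name, values in candidates.items():
    statistic, p_value = skewtest(values)
    print(f"{name:>4}: skew={skew(values):.3f}, skewtest p={p_value:.3g}")
```

Whichever transformation brings the skewness closest to 0 is the one worth keeping; on the real cnt column that turned out to be the square root.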

Feature Engineering

Remember that in the last chapter we gained a lot of insight while visualizing the data. Here we put those insights together by creating new features. We use the function np.where(condition, x, y), which returns an array with elements from x where condition is True and elements from y elsewhere. Take hour[‘IsOfficeHour’] as an example. The condition is a weekday from 9 AM up to, but not including, 5 PM. Rows that meet the condition get 1, and all others get 0.

# Rented during office hours
# Note: 'weekday' is coded 0 to 6, so == 1 selects a single day of the week
hour['IsOfficeHour'] = np.where((hour['hr'] >= 9) & (hour['hr'] < 17) & (hour['weekday'] == 1), 1, 0)
hour['IsOfficeHour'] = hour['IsOfficeHour'].astype('category')

# Rented during daytime
hour['IsDaytime'] = np.where((hour['hr'] >= 6) & (hour['hr'] < 22), 1, 0)
hour['IsDaytime'] = hour['IsDaytime'].astype('category')

# Rented during morning rush hour
hour['IsRushHourMorning'] = np.where((hour['hr'] >= 6) & (hour['hr'] < 10) & (hour['weekday'] == 1), 1, 0)
hour['IsRushHourMorning'] = hour['IsRushHourMorning'].astype('category')

# Rented during evening rush hour
hour['IsRushHourEvening'] = np.where((hour['hr'] >= 15) & (hour['hr'] < 19) & (hour['weekday'] == 1), 1, 0)
hour['IsRushHourEvening'] = hour['IsRushHourEvening'].astype('category')

# Rented during the busiest season
hour['IsHighSeason'] = np.where((hour['season'] == 3), 1, 0)
hour['IsHighSeason'] = hour['IsHighSeason'].astype('category')

# Binning temp and hum into 5 bins
# include_lowest=True keeps values equal to 0 in the first bin instead of turning them into NaN
bins = [0, 0.19, 0.49, 0.69, 0.89, 1]
hour['temp_binned'] = pd.cut(hour['temp'], bins, include_lowest=True).astype('category')
hour['hum_binned'] = pd.cut(hour['hum'], bins, include_lowest=True).astype('category')
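One detail of pd.cut is worth checking when binning normalized values: by default the intervals are open on the left, so a value exactly equal to the lowest edge falls outside every bin. A small sketch with illustrative values (not taken from the dataset) shows the difference that include_lowest=True makes.

```python
import pandas as pd

bins = [0, 0.19, 0.49, 0.69, 0.89, 1]
toy_values = pd.Series([0.0, 0.19, 0.2, 0.5, 1.0])  # illustrative values only

# Default: intervals look like (0, 0.19], so 0.0 is not covered
default_bins = pd.cut(toy_values, bins)

# include_lowest=True widens the first interval to take in the lowest edge
closed_bins = pd.cut(toy_values, bins, include_lowest=True)

print(default_bins.isna().sum(), "missing with the default")
print(closed_bins.isna().sum(), "missing with include_lowest=True")
```

A quick isna().sum() on the binned columns is a cheap way to confirm that no rows were silently dropped into NaN.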

Get Dummy!

Since we have categorical values in our data set, we need to ‘tell’ our algorithm that classes have equal weight for our analysis. For instance: our weekdays are represented by numbers from 0 to 6. But we can’t really say that a 6 is better than a 5 here. (You can find more information in our previous project.)

A way to change this perspective is the one-hot encoding technique, a process by which we convert categorical variables into binary columns by using the function pd.get_dummies. When we apply one-hot encoding, it’s important to leave one category out to avoid multicollinearity. That’s why drop_first is set to True.
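As a minimal sketch of what this looks like, assume a toy frame with a weekday code and a count column; the values are made up for illustration and are not from the dataset.

```python
import pandas as pd

# Toy frame with a weekday code, standing in for the real data
toy = pd.DataFrame({"weekday": [0, 1, 2, 6], "cnt": [16, 40, 32, 13]})
toy["weekday"] = toy["weekday"].astype("category")

# drop_first=True removes the first category (weekday 0),
# which becomes the baseline implied by all remaining dummies being 0
encoded = pd.get_dummies(toy, columns=["weekday"], drop_first=True)
print(encoded.columns.tolist())
```

The dropped category carries no extra information: a row where every remaining weekday dummy is 0 can only belong to it.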

OKAY~ Now we are good to go for the next stage. In the next chapter we are going to get hands-on with machine learning.

Stay Tuned!