In-Depth Analysis

Predicting Bike Sharing Demand using RandomForest Algorithm

A Beginners Guide to Kaggle Competition

Rutvik Deshpande
Analytics Vidhya

--

Photo by Jorge Alcala on Unsplash

Introduction

As urbanisation have reached unprecedented levels, road congestion has become a modern day issue. Heavy traffic is associated with air pollution, safety risks, economic competitiveness, sustainable growth and social cohesion.

In this situation Bike sharing systems are becoming a major solution to avoid the above mentioned issues. The benefits of bike sharing schemes include transport flexibility, reductions to vehicle pollutants, health benefits, reduced congestion and fuel consumption, and financial savings for individuals.

This is a report on Data Science for beginners and enthusiasts starting Kaggle competitions. The following is detailed description of Data Science competition which includes Data Manipulation, Data Visualization and using Machine Learning Algorithms for predicting the Bike Sharing Demand in the city of Washington D.C., also it gives step by step guide for Prediction of sharing demand.

## Loading the required packageslibrary(tidyverse)
library(dplyr)
library(scales)
library(ggplot2)
library(glmnet)
library(caret)
library(lubridate)

I used popular Machine Learning Meta Algorithm RandomForest, which an ensemble learning method for classification and regression. It is also one of the most used algorithms, because it produces great without hyperparameter tuning.

Photo by Sebastian Unrau on Unsplash
## Importing the data 
biketrain <- read.csv("../input/bike-sharing-demand/train.csv")
biketest <- read.csv("../input/bike-sharing-demand/test.csv")
## Making equal no. of columns in boths datasets
biketest$registered=0
biketest$casual=0
biketest$count=0
## Joining both train and test datasetsbikesharing <- rbind(biketrain, biketest)## Inspecting the datasummary(bikesharing)## season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
## weather - 1: Clear, Few clouds, Partly cloudy, Partly cloudy ; 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist ;3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds ;4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
## Converting a few a variables into factors
bikesharing$season <- as.factor(bikesharing$season)
bikesharing$weather <- as.factor(bikesharing$weather)
bikesharing$holiday <- as.factor(bikesharing$holiday)

Feature Engineering

Lets work on the first variable datetime , because we can break it down into many additional meaningful variables like Hour of the Day , Day of the Month , Day of the Week , Hour of the Day , Month , Year , which can useful for further Data Analysis and we can feed them for Predicting sharing demands .

1. Hour

## Extracting hour from date timebikesharing$hour = substr(bikesharing$datetime,12,13)
bikesharing$hour <- as.factor(bikesharing$hour)
## plotting the graphggplot(data=bikesharing, aes(x = hour, y=count, fill=hour))+geom_bar(stat = "identity")+ggtitle(label = "Hourly Comparison with Seasons")+theme_minimal()+theme(plot.title = element_text(hjust = 0.5, lineheight = 0.8, face = "bold"),legend.position = "none")+xlab("Weekdays")+ylab("Count")

People in the city used the bike maximum in between 5–8 pm , also there was a surge in demand between 8–9 am during office hours, to maybe avoid traffic.

## Season and Hour vs Count Graphggplot(bikesharing, aes(hour, count)) + 
geom_bar(aes(fill = season), position = 'dodge', stat = 'identity') +
ggtitle(label="Count with Hours of Day and Seasons") + scale_y_continuous(labels = comma) + theme(plot.title = element_text(hjust = 0.5, lineheight = 0.8, face = "bold")) + theme_minimal()

Fall and Winter were main seasons where demand was highest across most of the hours of day . According to cycling experts , Fall Season is considered the best for cycling as , you dont get all sweaty after the ride , colour of the leaves keep on changing and it feels pleasant to ride a bike.

“Spring passes and one remembers one’s innocence ,

Summer passes and one remembers one’s exuberance ,

Fall passes and one remembers one’s reverence ,

Winter passes and one remembers one’s perseverance”

2. Day

## Daylightbikesharing$light <- as.factor(bikesharing$hour == c(7:19))
## Holidaybikesharing$holiday <- as.factor(bikesharing$holiday)## Weeked or notbikesharing$weekend=0
bikesharing$weekend[bikesharing$day=="Sunday" | bikesharing$day=="Saturday" ]=1
## Free time in a weekbikesharing$officehour <- as.factor(bikesharing$hour == c(9:17))
bikesharing$freehourworkday <- as.factor(bikesharing$officehour == FALSE && bikesharing$holiday == 0)
## Day of the Weekbikesharing$date = substr(bikesharing$datetime,1,10)
days<-weekdays(as.Date(bikesharing$date))
bikesharing$days=days
bikesharing$days <- factor(bikesharing$days, levels=c("Monday","Tuesday", "Wednesday", "Thursday", "Friday", "Saturday","Sunday"))## Weekdays Graphggplot(data=bikesharing, aes(x = days, y=casual, fill=days))+geom_bar(stat = "identity")+ggtitle(label = "Weekday Comparison of Casual Users")+theme_minimal()+theme(plot.title = element_text(hjust = 0.5, lineheight = 0.8, face = "bold"),legend.position = "none")+xlab("Weekdays")+ylab("Casual Count")+ scale_y_continuous(labels = comma)

Casual users were more on weekends that is , they did not require a registration as they rode the bikes in their free time only . These maybe onetime users .

## Registered Users and Count across Weekdaysggplot(data=bikesharing, aes(x = days, y=registered, fill=days))+geom_bar(stat = "identity")+ggtitle(label = "Weekday Comparison of Registered Users")+theme_minimal()+theme(plot.title = element_text(hjust = 0.5, lineheight = 0.8, face = "bold"),legend.position = "none")+xlab("Weekdays")+ylab("Registered Count") + scale_y_continuous(labels = comma)

Registered no. of users peaked on weekdays , using the bikes to avoid traffic and reaching office on time , so they must have registered as they require bikes almost everyday .

ggplot(data=bikesharing, aes(x = days, y=count, fill=days))+geom_bar(stat = "identity")+ggtitle(label = "Weekday Comparison")+theme_minimal()+theme(plot.title = element_text(hjust = 0.5, lineheight = 0.8, face = "bold"),legend.position = "none")+xlab("")+ylab("Total Count") + scale_y_continuous(labels = comma)
## Weekday and Weather Comparisonggplot(data=bikesharing, aes(x=days, y=count, fill=weather)) +
geom_bar(stat="identity", position=position_dodge()) + ggtitle(label ="Day of the Week and Weather with Count Comparison") +theme_minimal()+theme(plot.title = element_text(hjust = 0.5, lineheight = 0.8, face = "bold"))+xlab("")+ylab("")

3. Month

## Converting date to a date format in Rbikesharing$date <- as.POSIXct(bikesharing$date)## Extracting Month Namebikesharing$MONTHS <- format(bikesharing$date,"%B")bikesharing$MONTHS <- factor(bikesharing$MONTHS, levels=c("January","February", "March", "April", "May", "June","July", "August", "September", "October", "November", "December"))## Monthly Graphggplot(data=bikesharing, aes(x = MONTHS, y=count, fill=MONTHS))+geom_bar(stat = "identity")+ggtitle(label = "Monthly Comparison of Users")+theme_minimal()+theme(plot.title = element_text(hjust = 0.5, lineheight = 0.8, face = "bold"), legend.position = "none")+xlab("")+ylab("Count")

Second half of the year may have been good to ride the bike as we can see , peak in demand can be seen in the fall season .

## Month and Weekday vs Countggplot(data=bikesharing, aes(x = MONTHS, y=count, fill=days))+geom_bar(stat = "identity")+ggtitle(label = "Users Count compared in Months and Weekdays")+theme_minimal()+theme(plot.title = element_text(hjust = 0.5, lineheight = 0.8, face = "bold"))+xlab("")+ylab("")
## Day of the Month Extractionbikesharing$dom = substr(bikesharing$datetime,9,10)
bikesharing$dom <- as.numeric(bikesharing$dom)
## Season and Day of Month vs Count of Usersggplot(data=bikesharing, aes(x = dom, y=count, fill=season))+geom_bar(stat = "identity")+ggtitle(label = "Day of the Month with Season Comparison")+theme_minimal()+theme(plot.title = element_text(hjust = 0.5, lineheight = 0.8, face = "bold"))+xlab("Days upto 19th")+ylab("Count") + xlim(01,19)
## Visualizing the demand only till 19 th of every Month as we have to predict the number of users from 20th to end of each month## Plotting a Heat Tile Graphggplot(bikesharing, aes(dom, hour, fill = count)) + geom_tile(color = "white") +ggtitle(label = "Day of the Month and Hour of the Day Comparison")+theme_minimal()+theme(plot.title = element_text(hjust = 0.5, lineheight = 0.8, face = "bold")) + xlab("Day of the Month(Till 19th)") + ylab("Hour")

We can Treat a few more variables to extract some more additional information

What else we can do here ? Useful information can be extracted from humidity, temperature, windspeed, weekdays, lets check it out.

4. Weather Conditions

Best conditions to ride are when you have sunshine that creates great visibility, less windspeed, less humidity to avoid sweating during high temperatures.

## Ideal Condition to ride a bikebikesharing$Ideal <- as.factor(bikesharing$weather == 1 && bikesharing$windspeed < 20 && bikesharing$humidity < 25 && bikesharing$atemp < 35)## Calculation Pollution(Ozone Content) using Temp, Windspeed and Humiditybikesharing$ozone <- 5.3*(bikesharing$windspeed)+0.4*(bikesharing$temp)+0.1*(bikesharing$humidity)## Plotting Weather and Temp vs Count Graphggplot(bikesharing, aes(temp, count)) + geom_jitter(aes(colour = weather)) + ggtitle(label ="Effect of Weather and Temperature on Count")+theme_minimal()+theme(plot.title = element_text(hjust = 0.5, lineheight = 0.8, face = "bold")) + xlab("Temperature") + ylab("Count of Users")
## Windspeed & Hour of the day vs Users' Countggplot(bikesharing, aes(windspeed, hour)) + geom_jitter(color= "blue") + theme_minimal() + ggtitle(label = "Windspeed and Hour of the Day with Users' Count")+theme_minimal()+theme(plot.title = element_text(hjust = 0.5, lineheight = 0.8, face = "bold")) + xlab("Windspeed") + ylab("Hour")

Create new log variables

  • As we know that dependent variables have natural outliers so we will predict log of dependent variables.
  • Predict bike demand of registered and casual users separately, logcas=log(casual+1) and logreg=log(registered+1),add 1 to deal with zero values .
bikesharing$logcas <- log(bikesharing$casual + 1)
bikesharing$logreg <- log(bikesharing$registered + 1)

Model Deployment and Predcition

## Converting numeric values to factor for randomForestbikesharing = bikesharing %>% mutate_if(is.factor, as.numeric)str(bikesharing)## Train and Test data separationbiketrain <- bikesharing[1:10886, ]
biketest <- bikesharing[10887:17379, ]
## Deploy the modelset.seed(8234)
library(randomForest)
## Casualbkmodel <- randomForest(logcas ~ season + holiday + workingday + weather + temp + atemp + humidity + windspeed + Ideal + hour + MONTHS + days + officehour + freehourworkday + weekend + ozone + light + dom, data = biketrain)# Predictpredict1 <- predict(bkmodel, biketest)
summary(bkmodel)
## Registered set.seed(9234)
bkmodel1 <- randomForest(logreg ~ season + holiday + workingday + weather + temp + atemp + humidity + windspeed + Ideal + hour + MONTHS + days + officehour + freehourworkday + weekend + ozone + light + dom, data = biketrain)
# Predictpredict2 <- predict(bkmodel1, biketest)
biketest$logreg = predict2
biketest$logcas = predict1
biketest$casual = exp(biketest$logcas)
biketest$registered = exp(biketest$logreg)
## Saving submission file as per requirement of competitionbiketest$count=biketest$casual+biketest$registered
bike<-data.frame(datetime=biketest$datetime,count=biketest$count)
write.csv(bike,file="bikeshare.csv",row.names=FALSE)

This is Exploratory Data Analysis is for beginners and can be made more complex. These results can take you in the top 20 percentile of Kaggle competition, which is quite good for a beginner playground competition. Few more elements like ntree, importance, npreb, and proximity can be added to the random forest model to increase its efficiency.

Orignally published as a Kaggle notebook.

--

--

Rutvik Deshpande
Analytics Vidhya

Cities, Data & Machine Learning. Accelerating the transition to better cities