Build a predictive tree-based model with a front-end dashboard report in Excel

Gaurang Mehra
5 min read · Feb 20, 2023


Objective: We will use the bike sharing dataset provided by Kaggle to predict bike rentals, given hourly data on rentals, weather (temperature, humidity, windspeed, etc.), and whether the day was a working day or a holiday. We will use Excel as a front end to show data visualizations and the results of different tree-based models.

The dataset has several weather-related features:

  • Windspeed: The windspeed for that hourly observation
  • atemp: The "feels like" temperature for the hourly observation
  • humidity: The humidity for that observation

The target variable is count (the count of bike rentals). We will build tree-based models to predict bike rentals given weather conditions, time of year, and whether the day is a weekday, weekend, or holiday.
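For reference, a minimal loading step might look like the sketch below; the train.csv path is an assumption (adjust it to wherever you saved the Kaggle file), and MonthofYear, used in the plots and models that follow, is derived from the datetime column.

import pandas as pd

# Load the Kaggle bike sharing data (the path is an assumption)
bike = pd.read_csv('Bike_Data/train.csv', parse_dates=['datetime'])

# Derive the month feature used in the plots and models below
bike['MonthofYear'] = bike['datetime'].dt.month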

The first thing we want to do is explore some of the features and understand their relationship with count (bike rentals). This is done in a Jupyter notebook. We export these visualizations to an Excel workbook called Report_Dashboard, into a tab called Exploratory Analysis.

It's logical to assume that bike rentals and temperature are related, and that bike rentals vary with the month of the year. We explore these relationships first.

Fig 1.1 Exploratory Data Analysis

We can infer three things from the figure above.

  • Bike rentals tend to be higher in the summer months, as evidenced by the much higher median values in the box plot.
  • atemp (feels like temperature) and bike rentals seem to be related. On colder days bike rentals tend to be much lower, as seen in the scatter plot.
  • The distribution of bike rentals is right-skewed, with some days having a very large number of rentals.

The code for this step is below.

import xlwings as xw
import matplotlib.pyplot as plt
import seaborn as sns

# Use xlwings to manipulate the Excel workbook
book = xw.Book('Bike_Data/Report_Dashboard.xlsx')
sheet = book.sheets('Exploratory Analysis')

# Set the plot styling before drawing
sns.set_context('talk')

# Create a 1x3 panel of plots
fig, (ax0, ax1, ax2) = plt.subplots(nrows=1, ncols=3)
fig.set_figheight(6)
fig.set_figwidth(21)

# Box plot of rentals by month, scatter of atemp vs rentals, histogram of rentals
sns.boxplot(x='MonthofYear', y='count', data=bike, ax=ax0)
sns.regplot(x='atemp', y='count', scatter=True, scatter_kws={'alpha': 0.05},
            x_jitter=True, y_jitter=True, data=bike, ax=ax1)
sns.histplot(x='count', data=bike, ax=ax2)

ax0.set_title('Bike Rentals by month')
ax0.set_ylabel('Bike Rentals')
ax1.set_title('Temp vs Bike rentals')
ax1.set_ylabel('Bike rentals')
ax2.set_xlabel('Bike Rentals')  # was set_label, which does not label the axis
ax2.set_title('Distribution of Bike Rentals')
plt.tight_layout()
plt.show()

# Export the figure to the Exploratory Analysis tab
sheet.pictures.add(fig, name='Bike Rentals', update=True)

Next, we explore the correlations between rentals and the features, as well as among the features themselves.

Fig 1.2 Correlation matrix

From the matrix above, we find that the following variables have the highest correlations with count (bike rentals).

  • temp, atemp: These variables both have a correlation coefficient of 0.39 with count, but they are also highly correlated with each other. In this case we keep only atemp.
  • humidity: This has a negative correlation with rentals, which makes sense; few people want to go cycling on a very humid day.
  • weather: A categorical variable for extreme weather events. It appears to affect rentals while not being highly correlated (less than 0.5) with the other features.

The code for this step is below.

# Create a correlation matrix (numeric columns only, so the datetime column is skipped)
corr_mat = bike.corr(numeric_only=True).round(2)
corr_mat

# Plot the correlation matrix as a heatmap and export it to Excel
fig, ax = plt.subplots()
fig.set_figheight(7)
fig.set_figwidth(14)
sns.heatmap(corr_mat, annot=True, linewidths=0.3, cmap='viridis', ax=ax)
ax.set_title('Correlation Matrix')
sheet.pictures.add(fig, name='Corr_mat', update=True,
                   left=sheet.range('B37').left, top=sheet.range('B37').top)
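
As a convenience, you can read the strongest relationships with count straight off the matrix by sorting that column (a small sketch using the corr_mat computed above):

# Rank the features by absolute correlation with the target
corr_mat['count'].drop('count').sort_values(key=abs, ascending=False)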

Now that we understand the correlations better, let's use atemp, MonthofYear, humidity, weather, and holiday to build a decision tree model.

  • Step 1 is to break the data into a feature array X and a target array y with the rental counts. We then split the data into train and test sets, holding out 20% of the data as the test set and training on the remainder.
from sklearn.model_selection import train_test_split

X = bike[['atemp', 'MonthofYear', 'humidity', 'weather', 'holiday']]
y = bike['count']

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
  • Step 2 is to fit the model on the training data (X_train, y_train). A decision tree is a good initial choice because we have some non-linear features like MonthofYear, and decision trees are easy to visualize. It's necessary to set the tree's max_depth parameter prior to fitting. We can run the algorithm repeatedly with different max_depth values to see which one gives us the best fit on the test data.
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

# Try a range of tree depths and score each one on the test set
max_depths = [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
for max_depth in max_depths:
    dt = DecisionTreeRegressor(max_depth=max_depth, random_state=20)
    dt.fit(X_train, y_train)
    y_pred = dt.predict(X_test)
    score = r2_score(y_test, y_pred)
    print('The r2_score of tree with max depth {} is'.format(max_depth), score)
Fig 1.3 r2 results with different tree depths

Above we see that a tree with a depth of 8 gives us the best fit, with an R² of 0.29, which is not great. The code below refits that tree and visualizes it. We then add it to a second tab in the Excel workbook labelled Decision tree.

from sklearn import tree

sheet2 = book.sheets('Decision tree')

# Refit the best tree from the search (max_depth=8) before visualizing it
dt = DecisionTreeRegressor(max_depth=8, random_state=20).fit(X_train, y_train)

fig, ax = plt.subplots()
fig.set_figheight(35)
fig.set_figwidth(35)
_ = tree.plot_tree(dt, feature_names=list(X.columns), filled=True, ax=ax)

sheet2.pictures.add(fig, name='Tree', update=True,
                    left=sheet2.range('A6').left, top=sheet2.range('A6').top)
Fig 1.4 Decision Tree Visualization snippet depth 2

Let's see if we can get a better fit with a Random Forest model.

Random Forests: Random forests are an ensemble algorithm consisting of a number of trees. Each tree is trained on a random sample (a bootstrap sample) of the training data. The prediction of a random forest is the average of the predictions of all the trees in the forest. A useful property of random forests is that they rank the importance of the features, and we can visualize these rankings as a bar plot. For the random forest iteration we will use all the features and then let the feature importances tell us which ones matter most.
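To make the "average of the trees" idea concrete, here is a self-contained sketch (on toy data, not the bike dataset) showing that a fitted RandomForestRegressor's prediction equals the mean of its individual trees' predictions:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data purely to illustrate the averaging behaviour
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 3)
y_demo = X_demo[:, 0] * 10 + rng.rand(200)

rf_demo = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# Each tree in rf_demo.estimators_ was fit on its own bootstrap sample;
# the forest's prediction is the mean of the per-tree predictions
per_tree_mean = np.mean([t.predict(X_demo) for t in rf_demo.estimators_], axis=0)
print(np.allclose(per_tree_mean, rf_demo.predict(X_demo)))  # True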

  • Step 1 is to create the feature and target arrays. Then we do a train/test split.
# Create the feature array: all columns except the target and the timestamp
X = bike.drop(columns=['count', 'datetime'])
y = bike['count']

# Train/test split, again holding out 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
  • Step 2 is to run a random forest model with 100 trees for multiple values of the max_depth parameter. We then see which value of max_depth gives us the best fit on the test data.
from sklearn.ensemble import RandomForestRegressor

# Fit a random forest with 100 trees at each candidate depth
for max_depth in max_depths:
    rf = RandomForestRegressor(n_estimators=100, max_depth=max_depth, random_state=30)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    score = r2_score(y_test, y_pred)
    print('R2_score with {} max depth is'.format(max_depth), score)
Fig 1.5 r2 results with different max_depths using Random Forest (n=100)

We see that a random forest (n_estimators=100) with a max_depth of 11 gives the best fit.
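As an aside, this kind of depth search can also be written with scikit-learn's GridSearchCV, which cross-validates on the training data rather than reusing the single test split. A sketch, not the approach used above:

from sklearn.model_selection import GridSearchCV

# Cross-validated search over the same candidate depths
grid = GridSearchCV(RandomForestRegressor(n_estimators=100, random_state=30),
                    param_grid={'max_depth': max_depths}, scoring='r2', cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)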

Now let's see how the random forest ranks the features. The feature_importances_ attribute is an extremely useful tool: we can use it to rank the most predictive features, which is especially handy when a dataset contains a large number of features. We refit the best model from the search before inspecting it.

# Refit the best forest (max_depth=11) and plot its feature importances
rf = RandomForestRegressor(n_estimators=100, max_depth=11, random_state=30).fit(X_train, y_train)

# The tab name for the feature importance chart is an assumption
sheet3 = book.sheets('Feature Importances')

mat = pd.DataFrame(rf.feature_importances_, index=X.columns).sort_values(by=[0])
fig, ax = plt.subplots()
fig.set_figheight(7)
fig.set_figwidth(12)
ax.barh(y=mat.index, width=mat[0])
plt.xticks(rotation=45)
ax.set_title('Feature Importances')
sheet3.pictures.add(fig, name='Feature', update=True,
                    left=sheet3.range('A8').left, top=sheet3.range('A8').top)
Fig 1.6 Random Forest most predictive features
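
One last housekeeping step: xlwings edits the open workbook in place, so save it once all the pictures have been added.

# Persist the dashboard workbook with all the exported figures
book.save()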
