Forecasting Demand for Bike Sharing System with Python — Part 3
Machine Learning
This article is part of a series. Check out the full series:
Chapter 1 Data Preparation and Feature Visualization
Chapter 2 Feature Analysis and Feature Engineering
After covering data cleaning, the visualization of important variables and their relationships, and feature engineering in the first two chapters, this chapter focuses on three parts: model training, tuning, and finally evaluation, in order to compare the different models and identify which one to put into production.
Please find the complete code on GitHub.
We used five different predictors: Lasso, Ridge, Decision Tree, Random Forest, and XGBoost. Since XGBoost performed the best, we will use it as the example to walk through the coding process.
import numpy as np
from xgboost import XGBRegressor

xgb = XGBRegressor(max_depth=3, learning_rate=0.01, n_estimators=15,
                   objective="reg:squarederror", subsample=0.8,
                   colsample_bytree=1, seed=1234, gamma=1)
xgb.fit(hour_d_train_x, hour_d_train_y)
result = xgb.predict(hour_d_test_x)
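For reference, the other four predictors follow the same fit/predict pattern. Here is a minimal sketch; the hyper-parameter values shown are placeholders, and the exact settings we used are in the notebook on GitHub:

# A minimal sketch of the other four predictors, assuming the same
# train/test split; hyper-parameter values here are placeholders.
from sklearn.linear_model import Lasso, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

models = {
    "Lasso": Lasso(alpha=0.01),
    "Ridge": Ridge(alpha=1.0),
    "Decision Tree": DecisionTreeRegressor(max_depth=6, random_state=1234),
    "Random Forest": RandomForestRegressor(n_estimators=200, random_state=1234),
}

for name, model in models.items():
    model.fit(hour_d_train_x, hour_d_train_y)
    print(name, "R-squared for Test: %.2f" % model.score(hour_d_test_x, hour_d_test_y))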
Then we print out the result using the code below:
print("R-squared for Train: %.2f" % xgb.score(hour_d_train_x, hour_d_train_y))
print("R-squared for Test: %.2f" % xgb.score(hour_d_test_x, hour_d_test_y))RMSE = np.sqrt(np.mean((hour_d_test_y ** 2 - result ** 2) ** 2))
MSE = RMSE ** 2print("MSE ={}".format(MSE))
print("RMSE = {}".format(RMSE))
We want to improve the result by tuning the hyper-parameters. Here, we used GridSearchCV to find the best combination of parameters.
from sklearn.model_selection import GridSearchCV

gsc = GridSearchCV(estimator=XGBRegressor(),
                   param_grid={"max_depth": (6, 7),
                               "learning_rate": (0.06, 0.08),
                               "n_estimators": (400, 600),
                               "subsample": (0.7, 0.8),
                               "colsample_bytree": (0.4, 0.5),
                               "gamma": (1.4, 1.5)},
                   cv=5,
                   scoring="r2",
                   verbose=10,  # print progress, since the search can take a while
                   n_jobs=4)
grid_result = gsc.fit(hour_d_train_x, hour_d_train_y)
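Once the search finishes, the winning combination and its cross-validated score can be read off the fitted object:

# Inspect the best combination found by the grid search.
print("Best R-squared: %.2f" % grid_result.best_score_)
print("Best parameters:", grid_result.best_params_)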
After obtaining the best parameters from the grid search, we train the model with the tuned hyper-parameters and finally make the predictions.
xgb = XGBRegressor(max_depth=6, learning_rate=0.06, n_estimators=600,
                   objective="reg:squarederror", subsample=0.8,
                   colsample_bytree=0.5, seed=1234, gamma=1.5)
xgb.fit(hour_d_train_x, hour_d_train_y)
result_xgb = xgb.predict(hour_d_test_x)
Let's check the result after tuning the hyper-parameters!
print("R-squared for Train: %.2f" % xgb.score(hour_d_train_x, hour_d_train_y))
print("R-squared for Test: %.2f" % xgb.score(hour_d_test_x, hour_d_test_y))RMSE = np.sqrt(np.mean((hour_d_test_y ** 2 - result_xgb ** 2) ** 2))
MSE = RMSE ** 2print("MSE ={}".format(MSE))
print("RMSE = {}".format(RMSE))
TA-DA! The test R-squared improved significantly, and there is no sign of overfitting!
This is a scatter plot of the actual and predicted values. The predictions align more closely with the actual values as model complexity increases, and we can see that the tuned XGBoost predicts the values very accurately.
This is the distribution of the errors (Actual - Predicted). We can also see that as we fitted more complex models with tuned hyper-parameters, the errors concentrated around 0.
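For readers reproducing the figures, here is a minimal matplotlib sketch of both plots (the styling in the article differs):

import matplotlib.pyplot as plt

# Scatter of actual vs. predicted values: points near the diagonal
# indicate accurate predictions.
plt.figure(figsize=(6, 6))
plt.scatter(hour_d_test_y, result_xgb, alpha=0.3)
plt.xlabel("Actual rentals")
plt.ylabel("Predicted rentals")
plt.title("Tuned XGBoost: actual vs. predicted")
plt.show()

# Distribution of errors (actual - predicted): a tight peak around 0
# indicates small, unbiased errors.
plt.figure(figsize=(6, 4))
plt.hist(hour_d_test_y - result_xgb, bins=50)
plt.xlabel("Error (actual - predicted)")
plt.ylabel("Frequency")
plt.show()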
Conclusion
Here are the results after running all the models:
As can be seen, the best R-squared we achieved is very high and the RMSE comes close to the top results that can be found online, which we attribute to the steps undertaken in the feature engineering section. However, there are improvement opportunities that can be tackled in future research.
Recommendations
The recommendations for future research fall broadly into three categories:
1. Deploying other techniques to predict bike demand
Treat this problem as a time series problem and fit an ARIMA model, or a linear regression whose error terms are predicted with a time series model.
Since we noticed that the rental behaviour of registered and casual renters differs greatly, it might make sense to train two different models, one for registered users and one for casual users, and finally add the two predictions to get an idea of what the day would look like in terms of total rentals (a minimal sketch follows).
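A minimal sketch of this two-model idea, assuming the targets have been split into separate casual and registered columns (the variable names train_y_casual and train_y_registered are hypothetical; in the original dataset, cnt = casual + registered):

# Train one model per renter type and sum the predictions.
from xgboost import XGBRegressor

model_casual = XGBRegressor(objective="reg:squarederror", seed=1234)
model_registered = XGBRegressor(objective="reg:squarederror", seed=1234)

model_casual.fit(train_x, train_y_casual)
model_registered.fit(train_x, train_y_registered)

# Total demand is the sum of the two separate predictions.
total_pred = model_casual.predict(test_x) + model_registered.predict(test_x)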
2. Enriching the data and clarifying anomalies
It could further be investigated which other variables prove to be helpful in predicting the demand for bikes, either enriching the dataset with these variables or starting to gather them in the future. Moreover, we recommend that future researchers contact the owners of the data to clarify the origin of the anomalies (e.g. downtime of the system, measuring errors, etc.) in order to make more informed decisions on how to handle those outliers.
3. Employing more feature engineering and validation techniques
Future researchers could also opt to employ more advanced feature engineering techniques such as PCA, LDA or QDA. In terms of validation, we advise future researchers to employ rolling cross-validation that respects the time ordering of the data; a minimal sketch follows.
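A sketch of such a rolling scheme with scikit-learn's TimeSeriesSplit, assuming the training data is a DataFrame with rows in chronological order:

# Rolling cross-validation: each split trains on the past and
# validates on the immediate future, never the other way round.
from sklearn.model_selection import TimeSeriesSplit
from xgboost import XGBRegressor

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(hour_d_train_x):
    model = XGBRegressor(objective="reg:squarederror", seed=1234)
    model.fit(hour_d_train_x.iloc[train_idx], hour_d_train_y.iloc[train_idx])
    print("Fold R-squared: %.2f" % model.score(
        hour_d_train_x.iloc[val_idx], hour_d_train_y.iloc[val_idx]))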
We would love to hear any suggestions from you, and please let us know if we can help you with customer insights and predictive analytics.
Connect with us on LinkedIn: Cheer & Utkarsh
** This project was done by a team of four: besides Utkarsh and me, the team also included Antonio and Christoph. If you are interested in the code itself, please check it out on our GitHub.