Citi Bike Trips Analysis And Prediction

CHAPTER 2: MODEL BUILDING

Niraj Kumbhar

If you haven't read Chapter 1: Data Analysis of the Citi Bike trip data, I suggest you check that out first. There, we explored the data, drew some insights from it, and finalized our dataset.

Objective

We have to predict the daily trip count for Quarter 1 of 2021 (January to March 2021). Even though this is a time-series problem, we will use traditional machine learning models to predict the outcomes.

We will train several models and evaluate each one to select the best among them, which will then be retrained on the whole dataset.

Dataset

So far, our dataset looks like the one below: 366 rows of data (1 row = 1 day) with date, trip count, day name, holiday, and temperature-related features.

df.head() (Image by Author)

Feature Engineering

Since we have decided to use traditional machine learning rather than time-series modeling, we have to get rid of the Date feature. Instead, we can extract the information it carries into new features.

We can extract the following features from Date:

  1. Day of year
  2. Day of week
  3. Day of month
  4. Year
  5. Month number
  6. Month name

We also have to convert the categorical features into numerical ones, which we can do with the pd.get_dummies() method, as sketched below.
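Roughly, the extraction and encoding could look like the snippet below. This is only a sketch: it assumes the dataframe from Chapter 1 is called df, that its date column is Date, and that Weekday and month are the categorical columns, matching the names used later in this post.

# Sketch: derive date-based features and one-hot encode the categoricals.
# Column names (Weekday, month, etc.) follow the ones used later in this post.
df['Date'] = pd.to_datetime(df['Date'])

df['DOY'] = df['Date'].dt.dayofyear        # day of year
df['DOW'] = df['Date'].dt.dayofweek        # day of week (0 = Monday)
df['DOM'] = df['Date'].dt.day              # day of month
df['year'] = df['Date'].dt.year            # year
df['month_num'] = df['Date'].dt.month      # month number
df['month'] = df['Date'].dt.month_name()   # month name

# One-hot encode the categorical columns for the model-selection step
df_modeling = pd.get_dummies(df, columns=['Weekday', 'month'])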

Model Selection

To evaluate the models and find the best among them, I have created a helper function:

# THANKS TO https://github.com/prateeknigam9/Hinglish_Classifier/blob/main/CaseStudy_Hinglish.ipynb
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error


def metric_calc(model, xtrain, ytrain, xtest, ytest):
    """
    To plot the metrics and graph of regression
    ==============================================
    inputs:
        model: selected model
        xtrain: input for train
        ytrain: ground truth for xtrain
        xtest: input for test
        ytest: ground truth for xtest
    outputs:
        PRINT  --> metrics R2_Score, MAE, MSE, RMSE for both train and validation
        RETURN --> metrics for validation
    """
    # Metrics on the training set
    y_pred_train = model.predict(xtrain)
    r2_train = r2_score(ytrain, y_pred_train)
    mae_train = mean_absolute_error(ytrain, y_pred_train)
    mse_train = mean_squared_error(ytrain, y_pred_train)
    rmse_train = np.sqrt(mse_train)

    metrics_train = {'R2_Score': r2_train, 'MAE': mae_train,
                     'MSE': mse_train, 'RMSE': rmse_train}

    # Metrics on the validation set
    y_pred = model.predict(xtest)
    r2 = r2_score(ytest, y_pred)
    mae = mean_absolute_error(ytest, y_pred)
    mse = mean_squared_error(ytest, y_pred)
    rmse = np.sqrt(mse)

    metrics = {'R2_Score': r2, 'MAE': mae, 'MSE': mse, 'RMSE': rmse}

    print("=" * 20, "Model Metrics - Train", "=" * 20)
    print(pd.DataFrame([metrics_train]))
    print("\n")
    print("=" * 20, "Model Metrics - Test", "=" * 20)
    print(pd.DataFrame([metrics]))
    print("\n")

    # Plot actual vs predicted values for the validation period
    plt.rcParams.update({'figure.figsize': (15, 3)})
    fig, ax = plt.subplots()
    sns.lineplot(data=ytest, x=ytest.index, y='Trips', ax=ax)
    sns.lineplot(x=ytest.index, y=pd.DataFrame(y_pred)[0], color='r', ax=ax)
    plt.grid(linestyle='-', linewidth=0.3)
    legend_elements = [Line2D([0], [0], color='b', lw=4, label='Actual'),
                       Line2D([0], [0], color='r', lw=4, label='Predicted')]
    plt.legend(handles=legend_elements)
    plt.show()

    return metrics

This metric calculation function is inspired by the work of a GitHub user (Prateek Nigam). It calculates the R2 score, MAE, MSE, and RMSE for a model, and also plots the actual vs. predicted values in a line chart.
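The train/validation split and the exact model calls aren't shown above, so here is a rough sketch of how the comparison could be wired up. It assumes a chronological hold-out (December 2020 as the validation set), a target column named Trips, and default model parameters; the notebook may differ.

# Sketch: compare the three models with metric_calc.
# The chronological split (December 2020 held out) is an assumption, not necessarily the notebook's split.
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

X = df_modeling.drop(columns=['Trips', 'Date'])
y = df_modeling[['Trips']]          # kept as a one-column frame for the plot inside metric_calc

is_train = df_modeling['Date'] < '2020-12-01'
xtrain, xtest = X[is_train], X[~is_train]
ytrain, ytest = y[is_train], y[~is_train]

for name, model in [('Linear Regression', LinearRegression()),
                    ('Random Forest', RandomForestRegressor(random_state=42)),
                    ('XGBoost', XGBRegressor(random_state=42))]:
    print(name)
    model.fit(xtrain, ytrain['Trips'])
    metric_calc(model, xtrain, ytrain, xtest, ytest)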

Here are the results:

  1. Linear Regression
  2. Random Forest Regressor
  3. XGBoost Regressor

Random Forest performs better than the other models. Its performance could be improved further with hyperparameter tuning, but for now I will keep the default parameters.

Final Model

To create the input dataset for Q1 2021, we can get the holiday information from the holidays library, and we will approximate the 2021 temperatures with the corresponding monthly averages from 2020.

import holidays

# Build the Q1 2021 input frame with the same features used for training
df_test = pd.DataFrame(columns=['Date', 'Weekday', 'Holiday', 'TMAX', 'TMIN',
                                'DOY', 'DOW', 'DOM', 'month', 'year', 'month_num'])

df_test['Date'] = pd.date_range(start='2021-01-01', end='2021-03-31')
df_test['Weekday'] = df_test.Date.dt.day_name()
df_test['Holiday'] = [1 if d in holidays.US() else 0 for d in df_test['Date']]
df_test['DOY'] = df_test.Date.dt.dayofyear
df_test['DOW'] = df_test.Date.dt.dayofweek
df_test['DOM'] = df_test.Date.dt.day
df_test['month'] = df_test.Date.dt.month_name()
df_test['year'] = df_test.Date.dt.year
df_test['month_num'] = df_test.Date.dt.month

# Average temperature by month from the previous year (2020)
jan_min = int(df_modeling[df_modeling['month_num'] == 1]['TMIN'].mean())
feb_min = int(df_modeling[df_modeling['month_num'] == 2]['TMIN'].mean())
mar_min = int(df_modeling[df_modeling['month_num'] == 3]['TMIN'].mean())
jan_max = int(df_modeling[df_modeling['month_num'] == 1]['TMAX'].mean())
feb_max = int(df_modeling[df_modeling['month_num'] == 2]['TMAX'].mean())
mar_max = int(df_modeling[df_modeling['month_num'] == 3]['TMAX'].mean())

df_test['TMIN'] = df_test['month_num'].apply(lambda x: jan_min if x == 1 else (feb_min if x == 2 else mar_min))
df_test['TMAX'] = df_test['month_num'].apply(lambda x: jan_max if x == 1 else (feb_max if x == 2 else mar_max))

test = df_test.drop(columns=['Date'])

I have created a pipeline to train the Random Forest model on the full 2020 data.

Pipeline overview
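The pipeline itself is only shown as an image above, so the snippet below is just an illustration of what such a pipeline might look like with scikit-learn: one-hot encode the categorical columns, pass the numeric ones through, and fit a Random Forest with default parameters. The names and steps are assumptions, not the notebook's exact code.

# Illustrative sketch of a training pipeline -- the actual notebook pipeline may differ.
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical = ['Weekday', 'month']
numeric = ['Holiday', 'TMAX', 'TMIN', 'DOY', 'DOW', 'DOM', 'year', 'month_num']

preprocess = ColumnTransformer([
    ('onehot', OneHotEncoder(handle_unknown='ignore'), categorical),
    ('num', 'passthrough', numeric),
])

pipe = Pipeline([
    ('prep', preprocess),
    ('rf', RandomForestRegressor(random_state=42)),
])

# Train on the full 2020 data (df: the frame before get_dummies) and predict Q1 2021
cols = categorical + numeric
pipe.fit(df[cols], df['Trips'])
q1_pred = pipe.predict(test[cols])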

After training the model, we can predict the trips for Q1 of 2021; the result is shown below. These results can certainly be improved.

Prediction for Q1 2021

Conclusion

  • Model performance can be improved by hyperparameter tuning or by using a better-suited model; an LSTM could be a great choice.
  • We can see a drop in the predicted trip count, similar to the drop seen in the training data.

The source code is available on GitHub. Happy learning!
