Combining Time Series Analysis with Artificial Intelligence: The Future of Forecasting

Building an ensemble of popular forecasting models in Python, including XGBoost, LSTM, Random Forest, SARIMAX, VAR, VECM and more!

Indraneel Dutta Baruah
Analytics Vidhya
21 min read · Sep 28, 2020


Welcome to the final part of my 3-blog series on building a predictive excellence engine. Accurate forecasts have the potential to accelerate a firm’s growth as they impact strategic business decisions such as sales targets, stock value, financial planning, raw material procurement, manufacturing and even supply chain decisions. Though guesstimates and traditional time series analysis are the popular approaches to forecasting, the emergence of artificial intelligence both necessitates and facilitates a rethinking of old practices to improve accuracy.

This blog series describes an innovative forecasting approach that overcomes the challenges associated with forecasting by seamlessly merging time series analysis with artificial intelligence (AI). Forecasting essentially falls under time series analysis, but artificial intelligence can be a force multiplier for forecast accuracy as it has the capacity to study and discover patterns in the data automatically. There are two key ingredients in our forecasting recipe:

  1. Intelligently engineering and identifying key drivers of the target variable
  2. Using these drivers in an ensemble modelling approach consisting of successful time series and AI models

The first part of the blog series gave a brief introduction to time series analysis and the tools needed to make sense of such datasets (Link here), while the second part focused on feature engineering and selection (Link here). In this blog, we will deep-dive into the final predictive engine in our driver-based forecasting approach. It is made up of 10 models (5 time series and 5 AI models), and in the end we take the average of all the model predictions as the final forecast of our predictive engine. Given the large number of models, we will give a brief introduction to each along with details on how to implement them in Python. In a separate blog we will discuss best practices for optimizing each of these models.

The dataset being used here is the same as in the previous blogs in this series. It is a daily dataset on Hong Kong flat prices along with 12 macroeconomic variables. The target variable to predict is ‘Private Domestic (Price Index)’, i.e. Hong Kong flat prices. As highlighted in the previous blog, we chose a set of 3 drivers of Hong Kong flat prices:

  1. Total stock
  2. M3 (HK$ million)
  3. Lag 1 of M3 (HK$ million)

List of Models:

  1. SARIMA (Univariate Time Series)
  2. Holt-Winters Exponential Smoothing (HWES) or Triple Smoothing (Univariate Time Series)
  3. SARIMAX (Multivariate Time Series)
  4. VAR (Multivariate Time Series)
  5. VECM (Multivariate Time Series)
  6. LSTM (Univariate AI)
  7. Random Forest (Multivariate AI)
  8. XGBoost (Multivariate AI)
  9. Linear Regression (Multivariate AI)
  10. SVR (Multivariate AI)
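
Before diving into the individual models, note that the code snippets in this post are not fully self-contained: they assume a common set of imports has already been run. The consolidated import block below is a sketch of what is needed (module paths correspond to the pmdarima, statsmodels, scikit-learn, xgboost and Keras releases available in 2020; in recent TensorFlow builds the Keras imports live under tensorflow.keras):

# Consolidated imports assumed by the snippets below (a sketch; adjust paths to your library versions)
import itertools
import random

import numpy as np
import pandas as pd
import pmdarima as pm
import xgboost as xgb

from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from statsmodels.tsa.api import VAR
from statsmodels.tsa.vector_ar.vecm import VECM, select_coint_rank

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.preprocessing.sequence import TimeseriesGenerator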

1. SARIMA

Seasonal Autoregressive Integrated Moving Average, SARIMA or Seasonal ARIMA, is a forecasting method for univariate time series data that works well with data having a seasonal component. It supports both autoregressive and moving average elements, while the integrated element refers to differencing, which allows the method to handle non-stationary time series data. There are a total of 7 parameters that require configuration for this model:

  • p: Trend autoregression order.
  • d: Trend difference order.
  • q: Trend moving average order.
  • P: Seasonal autoregressive order.
  • D: Seasonal difference order.
  • Q: Seasonal moving average order.
  • m: The number of time steps for a single seasonal period.

To learn more about SARIMA and these parameters, please refer to this lab by UC Berkeley.
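
For readers new to the notation, the short sketch below shows how these seven parameters map onto the statsmodels SARIMAX call; the order values here are purely illustrative, not the tuned ones we select next:

# Illustrative only: how (p, d, q)(P, D, Q, m) enter the statsmodels API
# order=(p, d, q) captures the trend terms, seasonal_order=(P, D, Q, m) the seasonal terms
example_model = SARIMAX(target_final,
                        order=(1, 1, 1),              # p=1, d=1, q=1 (hypothetical values)
                        seasonal_order=(1, 0, 1, 7))  # P=1, D=0, Q=1, m=7 (weekly cycle in daily data)
example_fit = example_model.fit(disp=False)
print(example_fit.aic)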

# Fitting a stepwise model to find the best parameters for SARIMA
stepwise_fit = pm.auto_arima(target_final, start_p=1, start_q=1, max_p=3, max_d=2, max_q=3, m=7,
                             start_P=0, start_Q=0, max_P=3, max_D=3, max_Q=3, seasonal=True, trace=True,
                             error_action='ignore',    # don't want to know if an order does not work
                             suppress_warnings=True,   # don't want convergence warnings
                             stepwise=True)            # set to stepwise
stepwise_fit.summary()
Fig 1: Stepwise model to find optimal parameter values for SARIMA

We find the optimal values for each parameter by doing a grid search that fits a range of parameter values the developer specifies. We select the set of parameter values with the lowest AIC score.
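
Under the hood, this is simply a search over candidate orders scored by AIC. A stripped-down manual version of the same idea (with a couple of hypothetical candidate orders) would look like this:

# A stripped-down version of the AIC-based search (candidate orders are hypothetical)
candidate_orders = [((1, 1, 1), (0, 1, 1, 7)), ((2, 1, 2), (1, 0, 1, 7))]
aic_scores = {}
for order, seasonal_order in candidate_orders:
    fit = SARIMAX(target_final, order=order, seasonal_order=seasonal_order).fit(disp=False)
    aic_scores[(order, seasonal_order)] = fit.aic
best_order, best_seasonal_order = min(aic_scores, key=aic_scores.get)
print(best_order, best_seasonal_order)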

# SARIMA
# Using the parameters automatically selected by the grid search
SARIMA_Forecast = pd.DataFrame(stepwise_fit.predict(n_periods=365))
SARIMA_Forecast.columns = ['SARIMA_Forecast']
SARIMA_Forecast.index = HWES_Forecast.index
SARIMA_Forecast.head(5)

# Alternatively, manually fit the model with chosen parameter values
# sarima_model = SARIMAX(series, order=(5, 2, 2), seasonal_order=(0, 0, 1, 7))
# sarima_model_fit = sarima_model.fit(disp=False)
# make prediction
# SARIMA_Forecast = pd.DataFrame(sarima_model_fit.forecast(steps=365))
# print(SARIMA_Forecast)
Fig 2: Forecasting with the SARIMA model

In case we are not satisfied with the set of parameters providing the lowest AIC, we can manually fit the model with our preferred set of parameter values (check the commented-out section of code in Figure 2).

2. Holt-Winters Exponential Smoothing (HWES) or Triple Smoothing

The Holt-Winters technique is based on the following three forecasting techniques:

  1. Weighted Averages
  2. Simple exponential smoothing
  3. Holt’s linear method

Holt-Winters Exponential Smoothing is an upgraded version of these models, as it can be used on time series data that exhibits both trend and seasonal behaviour. There are two variations of this method:

  1. Additive method: It is preferred when the seasonal variations are roughly constant through the series
  2. Multiplicative method: It is preferred when the seasonal variations are changing proportional to the level of the series.

To learn more about the model, please refer to this book by Rob J Hyndman. We have used the additive method in our case as shown below:

# Holt-Winters Exponential Smoothing (HWES) or Triple Smoothing
# fit model
random.seed(10)
model = ExponentialSmoothing(series, trend='add', seasonal='add', seasonal_periods=7)
model_fit = model.fit()
# make prediction
HWES_Forecast = pd.DataFrame(model_fit.forecast(steps=365))
HWES_Forecast.columns = ['HWES_Forecast']
HWES_Forecast.head(5)
Fig 3: Forecasting using HWES model
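
If the seasonal swings in your series grow with its level, the multiplicative variant is a one-argument change to the same statsmodels call (a sketch; multiplicative seasonality requires a strictly positive series):

# Multiplicative variant of HWES (sketch): only the 'seasonal' argument changes
# Multiplicative seasonality requires the series to be strictly positive
model_mul = ExponentialSmoothing(series, trend='add', seasonal='mul', seasonal_periods=7)
model_mul_fit = model_mul.fit()
HWES_Mul_Forecast = pd.DataFrame(model_mul_fit.forecast(steps=365))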

3. SARIMAX

SARIMAX is similar to SARIMA and stands for seasonal autoregressive integrated moving average with exogenous factors. The key difference between SARIMA and SARIMAX is the inclusion of exogenous factors. As highlighted in the previous blog, we chose a set of 3 drivers of our target variable, and these will be the exogenous factors for our SARIMAX model.

Before we fit the SARIMAX model, we need to make sure that the exogenous factors and the target variable are stationary, and we will also standardize them so that they are on the same scale. To check for stationarity, we run an ADF test:

# Function to check stationarity
def adfuller_test(series, signif=0.05, name='', verbose=False):
    """Perform ADFuller to test for Stationarity of given series and print report"""
    r = adfuller(series, autolag='AIC')
    output = {'test_statistic': round(r[0], 4), 'pvalue': round(r[1], 4), 'n_lags': round(r[2], 4), 'n_obs': r[3]}
    p_value = output['pvalue']
    def adjust(val, length=6): return str(val).ljust(length)
    # Print summary
    print(f' Augmented Dickey-Fuller Test on "{name}"', "\n ", '-'*47)
    print(f' Null Hypothesis: Data has unit root. Non-Stationary.')
    print(f' Significance Level = {signif}')
    print(f' Test Statistic = {output["test_statistic"]}')
    print(f' No. Lags Chosen = {output["n_lags"]}')
    for key, val in r[4].items():
        print(f' Critical value {adjust(key)} = {round(val, 3)}')
    if p_value <= signif:
        print(f" => P-Value = {p_value}. Rejecting Null Hypothesis.")
        print(f" => Series is Stationary.")
    else:
        print(f" => P-Value = {p_value}. Weak evidence to reject the Null Hypothesis.")
        print(f" => Series is Non-Stationary.")

# Data prep
AI_Model_data = pd.concat([target_final, features_final], axis=1, sort=False)

# Check stationarity: ADF test on each column
for name, column in AI_Model_data.iteritems():
    adfuller_test(column, name=column.name)
    print('\n')
Fig 4: ADF Test to check stationarity

As seen in Figure 4, all the features as well as the target variable are non-stationary, so we apply a first difference to make them stationary. We also standardize the data, as mentioned previously.

# 1st difference
AI_Model_data_differenced = AI_Model_data.diff().dropna()

# Check stationarity: ADF test on each column
# for name, column in AI_Model_data_differenced.iteritems():
#     adfuller_test(column, name=column.name)
#     print('\n')

# Standardise dataset
x = AI_Model_data_differenced.values  # returns a numpy array
standard_scaler = StandardScaler()
x_scaled = standard_scaler.fit_transform(x)
scaled_AI_Model_data = pd.DataFrame(x_scaled)
scaled_AI_Model_data.columns = AI_Model_data_differenced.columns

# Check stationarity on differenced and scaled data: ADF test on each column
for name, column in scaled_AI_Model_data.iteritems():
    adfuller_test(column, name=column.name)
    print('\n')
Fig 5: ADF test on differenced and standardized dataset

After differencing and standardising the dataset, we can see that all the fields are stationary, and we will use them to fit the SARIMAX model. The parameters for the SARIMAX model are the same as for SARIMA, and hence we use the same grid-search approach to find the optimal values on the differenced and standardised dataset.

# Fitting a stepwise model to find the best parameters for SARIMAX
stepwise_fit_2 = pm.auto_arima(scaled_AI_Model_Target, start_p=1, start_q=1, max_p=3, d=0, max_q=3, m=7,
                               start_P=0, start_Q=0, max_P=3, max_D=3, max_Q=3, seasonal=True, trace=True,
                               error_action='ignore',    # don't want to know if an order does not work
                               suppress_warnings=True,   # don't want convergence warnings
                               stepwise=True)            # set to stepwise
stepwise_fit_2.summary()
Fig 6: Find Optimal Parameter Values for SARIMAX

Once we have obtained the optimal values of the parameters, we can fit the model. The critical parameter needed for this model is ‘exog’ which is the dataset of exogenous regressors. In our case, the regressors are the three drivers we selected in our second blog of this series. While using the fitted model to forecast, we will need to provide the forecasts of the drivers as an input to the model. We generate these driver forecasts using the SARIMA model. Please note that the driver forecasts are based on the standardised and stationary data. These driver forecasts will be leveraged in all the multivariate AI models as well.

# Driver forecasts using SARIMA
forecast_index = pd.date_range('2019/11/27', '2020/11/25')
driver_forecasts = pd.DataFrame(index=forecast_index, columns=features_final.columns)
scaled_AI_Model_Features_1 = scaled_AI_Model_Features.copy()
date_index = pd.date_range('2003-01-06', '2019/11/26')
scaled_AI_Model_Features_1.index = date_index

for i in range(len(features_final.columns)):
    df_series = scaled_AI_Model_Features_1.asfreq('D')
    series_features = pd.Series(df_series.iloc[:, i], index=df_series.index)
    stepwise_fit_features = pm.auto_arima(series_features, start_p=1, start_q=1, max_p=2, max_d=2, max_q=2, m=7,
                                          start_P=0, start_Q=0, max_P=2, max_D=2, max_Q=2, seasonal=True, trace=True,
                                          error_action='ignore',    # don't want to know if an order does not work
                                          suppress_warnings=True,   # don't want convergence warnings
                                          stepwise=True)            # set to stepwise
    SARIMA_Forecast_features = stepwise_fit_features.predict(n_periods=365)
    # Add to driver forecast dataset
    driver_forecasts.iloc[:, i] = SARIMA_Forecast_features

driver_forecasts.head(5)
Fig 7: Driver forecasts using SARIMA

The final step is to fit the model using the optimal parameter values and the stationary, standardised historical values of the target variable and drivers. We then use the fitted model and the driver forecasts to predict the future. As the model forecasts are based on the differenced and standardised dataset, we need to de-standardise them and undo the first differencing to get the final forecasts.

# SARIMAX
# Model inputs: standardised, differenced target and drivers
data1 = scaled_AI_Model_Target
data2 = scaled_AI_Model_Features
# fit model
model = SARIMAX(data1, exog=data2, order=stepwise_fit_2.order, seasonal_order=stepwise_fit_2.seasonal_order)
model_fit = model.fit(disp=False)
# make prediction
exog2 = driver_forecasts
SARIMAX_Forecast = pd.DataFrame(model_fit.predict(len(data1),len(data1)+364, exog=exog2))
SARIMAX_Forecast.columns = ['Raw_Forecast']
SARIMAX_Forecast['SARIMAX_Forecast_De_Standardise_R_Sum'] = SARIMAX_Forecast['Raw_Forecast'].cumsum()
SARIMAX_Forecast['SARIMAX_Forecast'] = SARIMAX_Forecast['SARIMAX_Forecast_De_Standardise_R_Sum'] + Latest_target[0]
SARIMAX_Forecast.index = HWES_Forecast.index
SARIMAX_Forecast.head(5)
Fig 8: Fitting SARIMAX model to generate forecasts
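
The de-standardise-then-undo-differencing step described above reappears for every model trained on the differenced, scaled data. A generic helper capturing that logic might look like the sketch below; mean_target and std_target denote the mean and standard deviation of the differenced target used during scaling, matching the variables used in the AI-model snippets later on:

# Sketch of the generic inverse transform described in the text (names are illustrative)
def invert_scaling_and_differencing(scaled_diff_forecast, mean_target, std_target, last_actual):
    """Map a forecast of standardised first differences back to the original level."""
    raw_diff = (np.asarray(scaled_diff_forecast) * std_target) + mean_target  # de-standardise
    return last_actual + np.cumsum(raw_diff)                                  # undo the first difference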

4. VAR

Vector Autoregression (VAR) is a forecasting algorithm that can be used when two or more time series influence each other. It is called an autoregressive model because each variable is modelled as a function of the past values of itself and of the other variables; that is, the predictors are nothing but lags of the series. To understand VAR better, kindly go through this lesson from Penn State University.
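
For example, a two-variable VAR of order 1 models the series jointly as y₁,ₜ = c₁ + a₁₁y₁,ₜ₋₁ + a₁₂y₂,ₜ₋₁ + e₁,ₜ and y₂,ₜ = c₂ + a₂₁y₁,ₜ₋₁ + a₂₂y₂,ₜ₋₁ + e₂,ₜ, so each series depends on its own lag as well as the lag of the other series.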

As the VAR model automatically generates lags of the variables, we should not add a lag as a separate input to the model. Hence, we drop the variable ‘M3 (HK$ million)_Lag1' before training, as we already have ‘M3 (HK$ million)’ as a variable. Similar to the SARIMAX model, we need to make the dataset stationary before modelling, so we difference the dataset.

# Data prep for VAR/VECM - first difference and check stationarity
varma_vecm_data = pd.concat([target_final, features_final], axis=1, sort=False)
varma_vecm_data = varma_vecm_data.drop(columns=['M3 (HK$ million)_Lag1'])

# 1st difference
varma_vecm_data_differenced = varma_vecm_data.diff().dropna()

# ADF test on each column
for name, column in varma_vecm_data_differenced.iteritems():
    adfuller_test(column, name=column.name)
    print('\n')
Fig 9: Difference the dataset to make it stationary

The next part is to select the right order of the VAR model. We use the ‘select_order’ function to iteratively fit increasing orders of the VAR model and pick the order that gives the lowest AIC. We then fit the VAR model on the stationary dataset with the optimal number of lags, use the fitted model to forecast the change in the target variable, and finally undo the differencing to get the final forecasts.

# VAR model
model = VAR(varma_vecm_data_differenced)
lag_selection_VAR = model.select_order(maxlags=12)

# Get the lag order
lag_order = list(lag_selection_VAR.selected_orders.values())[0]
model_fitted = model.fit(lag_order)

# Input data for forecasting
forecast_input = varma_vecm_data_differenced.values[-lag_order:]
forecast_input

# Forecast
VAR_FC = model_fitted.forecast(y=forecast_input, steps=365)
VAR_FC_2 = pd.DataFrame(VAR_FC, index=driver_forecasts.index, columns=varma_vecm_data_differenced.columns + '_1d')
VAR_FC_2

# VAR forecast
def invert_transformation(df_train, df_forecast, second_diff=False):
    """Revert back the differencing to get the forecast to original scale."""
    df_fc = df_forecast.copy()
    columns = df_train.columns
    for col in columns:
        # Roll back 2nd diff
        if second_diff:
            df_fc[str(col)+'_1d'] = (df_train[col].iloc[-1] - df_train[col].iloc[-2]) + df_fc[str(col)+'_2d'].cumsum()
        # Roll back 1st diff
        df_fc[str(col)+'_forecast'] = df_train[col].iloc[-1] + df_fc[str(col)+'_1d'].cumsum()
    return df_fc

df_results = invert_transformation(varma_vecm_data, VAR_FC_2, second_diff=False)
VAR_Forecast = pd.DataFrame(df_results['Private Domestic (Price Index)_forecast'])
VAR_Forecast.columns = ['VAR_Forecast']
VAR_Forecast.index = HWES_Forecast.index
VAR_Forecast.head(5)
Fig 10: Forecast using VAR model

5. VECM Model

The VAR model is guilty of ignoring possibly important (“long run”) relationships between the variables (“cointegration”). The usual approach is to use Johansen’s method for testing whether or not cointegration exists. If the answer is yes, then a vector error correction model (VECM) should be used to take into account the cointegrating relationships. One can go through these lecture slides by Michael Hauser to better understand the VECM model. We use the same dataset we used for the VAR model and perform the Johansen Cointegration test to find the optimal number of cointegrating relationships.

# VECM - Johansen cointegration test
rank_test = select_coint_rank(varma_vecm_data_differenced, -1, lag_order, method="trace", signif=0.05)
print(rank_test)
Fig 11.1: Johansen Cointegration Test

We can see in Figure 11.1 that there are 3 statistically significant cointegrating relationships. We set the parameter ‘k_ar_diff’ to the same lag order as the VAR model and set the parameter ‘coint_rank’ to the optimal value based on the cointegration test. Similar to the VAR model, we undo the differencing to get the final forecasts.

# VECM
model = VECM(varma_vecm_data_differenced, deterministic="ci",
             k_ar_diff=lag_order,
             coint_rank=rank_test.rank)
vecm_res = model.fit()

VECM_FC = vecm_res.predict(steps=365)
VECM_FC_2 = pd.DataFrame(VECM_FC, index=driver_forecasts.index, columns=varma_vecm_data_differenced.columns + '_1d')
df_results_2 = invert_transformation(varma_vecm_data, VECM_FC_2, second_diff=False)
VECM_Forecast = pd.DataFrame(df_results_2['Private Domestic (Price Index)_forecast'])
VECM_Forecast.columns = ['VECM_Forecast']
VECM_Forecast.index = HWES_Forecast.index
VECM_Forecast.head(5)
Fig 11.2: Forecast using VECM Model

AI Models

For all the AI models (except linear regression) in our forecasting engine, we will be using the following approach:

  1. Divide the differenced and scaled data (used for the SARIMAX model) into train and test sets. We use this data because stationarity and uniform scaling usually improve model performance. The latest year is separated out as test data. When dealing with time series data, we should not use random sampling for the train-test split as the observations are chronologically ordered (a sketch of this split follows below).
  2. Perform a grid search by training the model on train data using different values for the key parameters.
  3. Use the trained model to predict during the test time period.
  4. Calculate the forecast accuracy of the model based on MAPE score
  5. Select the parameter values which provided best accuracy
  6. Train model using best fit model’s parameter values on entire historical dataset
  7. Predict using the fitted model and driver forecasts
  8. De-standardise and undo differencing to get final forecast

This approach ensures that we are optimising model performance while ensuring that the time series properties of the dataset are not violated. Steps 1 to 5 are used for hyperparameter tuning, while the remaining steps focus on generating the forecast from the final model. In a separate blog, we will deep-dive into other methods for optimising model performance, like backtesting strategies, as well as details on how each parameter affects the models. In this blog we give a brief introduction to each model and its parameters and focus on the implementation in Python.
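
The snippets below refer to train/test variables (features_train, target_test, mean_target, std_target, etc.) whose construction is not shown in this post; they come from the accompanying notebook. A minimal chronological split on the differenced, scaled data, holding out the latest year as the test set, might look like the sketch below (column and variable names follow those used elsewhere in this post, but the exact construction in the notebook may differ):

# Sketch of a chronological train/test split (not random) on the differenced, scaled data
scaled_AI_Model_Target = scaled_AI_Model_data['Private Domestic (Price Index)']
scaled_AI_Model_Features = scaled_AI_Model_data.drop(columns=['Private Domestic (Price Index)'])

test_days = 365  # hold out the latest year as test data
target_train, target_test = scaled_AI_Model_Target.iloc[:-test_days], scaled_AI_Model_Target.iloc[-test_days:]
features_train, features_test = scaled_AI_Model_Features.iloc[:-test_days], scaled_AI_Model_Features.iloc[-test_days:]

# Statistics of the (un-scaled) differenced target, needed later to de-standardise forecasts
mean_target = AI_Model_data_differenced['Private Domestic (Price Index)'].mean()
std_target = AI_Model_data_differenced['Private Domestic (Price Index)'].std()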

6. LSTM

LSTM (Long Short-Term Memory) is a model that extends the memory of recurrent neural networks. Recurrent neural networks usually have ‘short-term memory’, as only recent information is used for the present task. LSTM introduces long-term memory into recurrent neural networks and mitigates the vanishing gradient problem by using a series of ‘gates’. Please refer to this blog by Christopher Olah for further details on the model.

Before creating the LSTM model, we need to create a TimeseriesGenerator object. Additionally, we will try to optimise a set of three parameters: the length of the output sequences, the number of time series samples in each batch and the number of epochs. The following piece of code performs the hyperparameter tuning as per the approach mentioned previously:

# LSTM - Data prep
dataset = pd.DataFrame(target_train).values
dataset = dataset.astype('float32')

# Batch size
batch_size = [1, 2, 4]
# Epochs
epoch_size = [5, 7, 10]
# Length of output sequences
length = [7, 30, 120]
n_features = 1

LSTM_Test_Accuracy_Data = pd.DataFrame(columns=['batch_size', 'epoch_size', 'length', 'Test Accuracy'])

for x in list(itertools.product(batch_size, epoch_size, length)):
    generator = TimeseriesGenerator(dataset, dataset, batch_size=x[0], length=x[2])
    lstm_model = Sequential()
    lstm_model.add(LSTM(200, activation='relu', input_shape=(x[2], n_features)))
    lstm_model.add(Dense(1))
    lstm_model.compile(optimizer='adam', loss='mse')
    lstm_model.fit_generator(generator, epochs=x[1])
    lstm_predictions_scaled = list()
    batch = dataset[-x[2]:]
    current_batch = batch.reshape((1, x[2], n_features))
    # Test data: recursively predict one step at a time over the test period
    for i in range(len(target_test)):
        lstm_pred = lstm_model.predict(current_batch)[0]
        lstm_predictions_scaled.append(lstm_pred)
        current_batch = np.append(current_batch[:, 1:, :], [[lstm_pred]], axis=1)
    predictions_test = pd.DataFrame(lstm_predictions_scaled)
    predictions_test.index = target_test.index
    # Calculate the absolute errors
    errors_test = abs(predictions_test.iloc[:, 0] - target_test)
    # Calculate mean absolute percentage error (MAPE)
    mape_test = 100 * (errors_test / target_test)
    # Calculate and display accuracy
    accuracy_test = 100 - np.mean(mape_test)
    LSTM_Test_Accuracy_Data_One = pd.DataFrame(index=range(1), columns=['batch_size', 'epoch_size', 'length', 'Test Accuracy'])
    LSTM_Test_Accuracy_Data_One.loc[:, 'batch_size'] = x[0]
    LSTM_Test_Accuracy_Data_One.loc[:, 'epoch_size'] = x[1]
    LSTM_Test_Accuracy_Data_One.loc[:, 'length'] = x[2]
    LSTM_Test_Accuracy_Data_One.loc[:, 'Test Accuracy'] = accuracy_test
    LSTM_Test_Accuracy_Data = LSTM_Test_Accuracy_Data.append(LSTM_Test_Accuracy_Data_One)

LSTM_Test_Accuracy_Data
Figure 12: Hyper parameter tuning for LSTM

Based on the optimal value of the parameters, we fit the LSTM model and generate the final forecasts as shown in Figure 13 below.

# LSTM - Data prep
dataset = pd.DataFrame(scaled_AI_Model_Target).values
dataset = dataset.astype('float32')
Latest_target = target_final.tail(1).values.tolist()

# Best fit model
Best_Fit_LSTM = LSTM_Test_Accuracy_Data.loc[LSTM_Test_Accuracy_Data['Test Accuracy'] == max(LSTM_Test_Accuracy_Data['Test Accuracy'])]
Best_Fit_LSTM = Best_Fit_LSTM.values.flatten().tolist()

# Fit model
generator = TimeseriesGenerator(dataset, dataset, batch_size=Best_Fit_LSTM[0], length=Best_Fit_LSTM[2])
lstm_model = Sequential()
lstm_model.add(LSTM(200, activation='relu', input_shape=(Best_Fit_LSTM[2], n_features)))
lstm_model.add(Dense(1))
lstm_model.compile(optimizer='adam', loss='mse')
lstm_model.fit_generator(generator, epochs=Best_Fit_LSTM[1])
lstm_predictions_scaled = list()
batch = dataset[-Best_Fit_LSTM[2]:]
current_batch = batch.reshape((1, Best_Fit_LSTM[2], n_features))

# Recursively forecast 365 days ahead
for i in range(365):
    lstm_pred = lstm_model.predict(current_batch)[0]
    lstm_predictions_scaled.append(lstm_pred)
    current_batch = np.append(current_batch[:, 1:, :], [[lstm_pred]], axis=1)

LSTM_Forecast = pd.DataFrame(lstm_predictions_scaled)

# De-standardise
LSTM_Forecast_Adjusted_1 = (LSTM_Forecast * std_target) + mean_target
LSTM_Forecast_Adjusted_1
LSTM_Forecast_Adjusted_2 = pd.DataFrame({'LSTM_Forecast_De_Standardise': LSTM_Forecast_Adjusted_1.iloc[:, 0]})

# Roll back first difference
LSTM_Forecast_Adjusted_2['LSTM_Forecast_De_Standardise_R_Sum'] = LSTM_Forecast_Adjusted_2['LSTM_Forecast_De_Standardise'].cumsum()
LSTM_Forecast_Adjusted_2['LSTM_Forecast'] = LSTM_Forecast_Adjusted_2['LSTM_Forecast_De_Standardise_R_Sum'] + Latest_target[0]
LSTM_Forecast_Adjusted_2.index = HWES_Forecast.index
LSTM_Forecast_Adjusted_2.head(5)
Figure 13: Forecast Using LSTM

7. Random Forest

Random forest is an ensemble of decision tree algorithms: a number of decision trees are built, each from a different bootstrap sample. It can be used for both classification and regression; in our case the final prediction is the average prediction across the decision trees. We will be tuning the following parameters in this exercise: the maximum depth of a decision tree, the minimum number of samples required at a leaf node, the minimum number of samples required to split an internal node and the number of trees in the forest (see Figure 14). To get a better understanding of random forest, you can check out the original paper by Leo Breiman.

# Random Forest
# Number of trees in random forest
n_estimators = [500, 1000, 2000]
# Maximum number of levels in tree
max_depth = [10, 50, 100]
# Minimum number of samples required to split a node
min_samples_split = [50, 100, 200]
# Minimum number of samples required at each leaf node
min_samples_leaf = [2, 4, 10]

RF_Test_Accuracy_Data = pd.DataFrame(columns=['n_estimators', 'max_depth', 'min_samples_split', 'min_samples_leaf', 'Train Accuracy', 'Test Accuracy'])

for x in list(itertools.product(n_estimators, max_depth, min_samples_split, min_samples_leaf)):
    rf = RandomForestRegressor(n_estimators=x[0], max_depth=x[1], min_samples_split=x[2], min_samples_leaf=x[3], random_state=10, n_jobs=-1, max_features="auto")
    # Train the model on training data
    rf.fit(features_train, target_train)
    # Train data: use the forest's predict method on the train data
    predictions_train = rf.predict(features_train)
    # Calculate the absolute errors
    errors_train = abs(predictions_train - target_train)
    # Calculate mean absolute percentage error (MAPE)
    mape_train = 100 * (errors_train / target_train)
    # Calculate and display accuracy
    accuracy_train = 100 - np.mean(mape_train)
    # Test data: use the forest's predict method on the test data
    predictions_test = rf.predict(features_test)
    # Calculate the absolute errors
    errors_test = abs(predictions_test - target_test)
    # Calculate mean absolute percentage error (MAPE)
    mape_test = 100 * (errors_test / target_test)
    # Calculate and display accuracy
    accuracy_test = 100 - np.mean(mape_test)
    RF_Test_Accuracy_Data_One = pd.DataFrame(index=range(1), columns=['n_estimators', 'max_depth', 'min_samples_split', 'min_samples_leaf', 'Train Accuracy', 'Test Accuracy'])
    RF_Test_Accuracy_Data_One.loc[:, 'n_estimators'] = x[0]
    RF_Test_Accuracy_Data_One.loc[:, 'max_depth'] = x[1]
    RF_Test_Accuracy_Data_One.loc[:, 'min_samples_split'] = x[2]
    RF_Test_Accuracy_Data_One.loc[:, 'min_samples_leaf'] = x[3]
    RF_Test_Accuracy_Data_One.loc[:, 'Train Accuracy'] = accuracy_train
    RF_Test_Accuracy_Data_One.loc[:, 'Test Accuracy'] = accuracy_test
    RF_Test_Accuracy_Data = RF_Test_Accuracy_Data.append(RF_Test_Accuracy_Data_One)

RF_Test_Accuracy_Data
Figure 14: Hyper parameter tuning for Random Forest

The following piece of code fits the final random forest model based on the optimal parameter values (see Figure 15).

# Random Forest - Best fit model
Best_Fit_Random_Forest = RF_Test_Accuracy_Data.loc[RF_Test_Accuracy_Data['Test Accuracy'] == max(RF_Test_Accuracy_Data['Test Accuracy'])]
Best_Fit_Random_Forest = Best_Fit_Random_Forest.values.flatten().tolist()

# Fit model and train on training data
rf = RandomForestRegressor(n_estimators=Best_Fit_Random_Forest[0], max_depth=Best_Fit_Random_Forest[1], min_samples_split=Best_Fit_Random_Forest[2], min_samples_leaf=Best_Fit_Random_Forest[3], random_state=10, n_jobs=-1, max_features="auto")
rf.fit(features_train, target_train)

# Predict over the forecast horizon using the driver forecasts
Random_Forest_Forecast = rf.predict(driver_forecasts)
Random_Forest_Forecast

# De-standardise
Random_Forest_Forecast_Adjusted_1 = (Random_Forest_Forecast * std_target) + mean_target
Random_Forest_Forecast_Adjusted_2 = pd.DataFrame({'Random_Forest_Forecast_De_Standardise': Random_Forest_Forecast_Adjusted_1[:]})

# Roll back first difference
Random_Forest_Forecast_Adjusted_2['Random_Forest_Forecast_De_Standardise_R_Sum'] = Random_Forest_Forecast_Adjusted_2['Random_Forest_Forecast_De_Standardise'].cumsum()
Random_Forest_Forecast_Adjusted_2['Random_Forest_Forecast'] = Random_Forest_Forecast_Adjusted_2['Random_Forest_Forecast_De_Standardise_R_Sum'] + Latest_target[0]
Random_Forest_Forecast_Adjusted_2.index = HWES_Forecast.index
Random_Forest_Forecast_Adjusted_2.head(5)
Figure 15: Forecast using random forest

8. XGBoost

XGBoost (Extreme Gradient Boosting) provides a high-performance implementation of gradient boosted decision trees. Rather than training all of the models in isolation from one another like random forest, XGBoost trains models in succession: each new model is trained to correct the errors made by the previous ones, and models are added sequentially until no further improvements can be made. We tune a set of 5 parameters for this model: the number of gradient boosted trees, the maximum tree depth for base learners, the boosting learning rate, the minimum loss reduction required to make a further partition on a leaf node and the minimum sum of weights of all observations required in a child node. The original paper on XGBoost can be found here.

# XGBoost
# Number of trees
n_estimators = [500, 1000, 2000]
# Maximum number of levels in tree
max_depth = [10, 50, 100]
# Minimum sum of weights of all observations required in a child
min_child_weight = [1, 2]
# Gamma specifies the minimum loss reduction required to make a split
gamma = [1, 5]
# Boosting learning rate
learning_rate = [.1, .05, .01]

xgb_Test_Accuracy_Data = pd.DataFrame(columns=['n_estimators', 'max_depth', 'min_child_weight', 'gamma', 'learning_rate', 'Train Accuracy', 'Test Accuracy'])

for x in list(itertools.product(n_estimators, max_depth, min_child_weight, gamma, learning_rate)):
    xgb_reg = xgb.XGBRegressor(n_estimators=x[0], max_depth=x[1], min_child_weight=x[2], gamma=x[3], learning_rate=x[4])
    # Train the model on training data
    xgb_reg.fit(features_train, target_train)
    # Train data: predict on the train data
    predictions_train = xgb_reg.predict(features_train)
    # Calculate the absolute errors
    errors_train = abs(predictions_train - target_train)
    # Calculate mean absolute percentage error (MAPE)
    mape_train = 100 * (errors_train / target_train)
    # Calculate and display accuracy
    accuracy_train = 100 - np.mean(mape_train)
    # Test data: predict on the test data
    predictions_test = xgb_reg.predict(features_test)
    # Calculate the absolute errors
    errors_test = abs(predictions_test - target_test)
    # Calculate mean absolute percentage error (MAPE)
    mape_test = 100 * (errors_test / target_test)
    # Calculate and display accuracy
    accuracy_test = 100 - np.mean(mape_test)
    xgb_Test_Accuracy_Data_One = pd.DataFrame(index=range(1), columns=['n_estimators', 'max_depth', 'min_child_weight', 'gamma', 'learning_rate', 'Train Accuracy', 'Test Accuracy'])
    xgb_Test_Accuracy_Data_One.loc[:, 'n_estimators'] = x[0]
    xgb_Test_Accuracy_Data_One.loc[:, 'max_depth'] = x[1]
    xgb_Test_Accuracy_Data_One.loc[:, 'min_child_weight'] = x[2]
    xgb_Test_Accuracy_Data_One.loc[:, 'gamma'] = x[3]
    xgb_Test_Accuracy_Data_One.loc[:, 'learning_rate'] = x[4]
    xgb_Test_Accuracy_Data_One.loc[:, 'Train Accuracy'] = accuracy_train
    xgb_Test_Accuracy_Data_One.loc[:, 'Test Accuracy'] = accuracy_test
    xgb_Test_Accuracy_Data = xgb_Test_Accuracy_Data.append(xgb_Test_Accuracy_Data_One)

xgb_Test_Accuracy_Data
Figure 16: Hyper parameter tuning for XG Boost

As shown in Figure 17 below, we fit the final XGBoost model and generate the forecasts using the best-fit parameter values from hyperparameter tuning.

# XGBoost - Best fit model
Best_Fit_XG_Boost = xgb_Test_Accuracy_Data.loc[xgb_Test_Accuracy_Data['Test Accuracy'] == max(xgb_Test_Accuracy_Data['Test Accuracy'])]
Best_Fit_XG_Boost = Best_Fit_XG_Boost.values.flatten().tolist()

# Fit model and train on training data
xgb_reg = xgb.XGBRegressor(n_estimators=Best_Fit_XG_Boost[0], max_depth=Best_Fit_XG_Boost[1], min_child_weight=Best_Fit_XG_Boost[2], gamma=Best_Fit_XG_Boost[3], learning_rate=Best_Fit_XG_Boost[4])
xgb_reg.fit(features_train, target_train)

# Predict over the forecast horizon using the driver forecasts
XGB_Forecast = xgb_reg.predict(driver_forecasts)

# De-standardise
XGB_Forecast_Adjusted_1 = (XGB_Forecast * std_target) + mean_target
XGB_Forecast_Adjusted_2 = pd.DataFrame({'XGB_Forecast_De_Standardise': XGB_Forecast_Adjusted_1[:]})

# Roll back first difference
XGB_Forecast_Adjusted_2['XGB_Forecast_De_Standardise_R_Sum'] = XGB_Forecast_Adjusted_2['XGB_Forecast_De_Standardise'].cumsum()
XGB_Forecast_Adjusted_2['XGB_Forecast'] = XGB_Forecast_Adjusted_2['XGB_Forecast_De_Standardise_R_Sum'] + Latest_target[0]
XGB_Forecast_Adjusted_2.index = HWES_Forecast.index
XGB_Forecast_Adjusted_2.head(5)
Figure 17: Forecast using XGBoost

9. Linear Regression

The objective of a linear regression model is to find a relationship between one or more features and a target variable. It can be represented by the following equation:

Y = θ₀ + θ₁x₁ + θ₂x₂ + … + θₙxₙ

where:

  • Y is the model prediction
  • θ₀ is the constant term
  • θ₁,…,θₙ are the model coefficients
  • x₁, x₂,…,xₙ are the feature values

We don’t need to perform parameter tuning for this model and hence directly proceed to fitting and predicting with the model.

# Linear Regression
reg = LinearRegression().fit(features_train, target_train)

# Predict over the forecast horizon using the driver forecasts
Regression_Forecast = reg.predict(driver_forecasts)

# De-standardise
Regression_Forecast_Adjusted_1 = (Regression_Forecast * std_target) + mean_target
Regression_Forecast_Adjusted_2 = pd.DataFrame({'Regression_Forecast_De_Standardise': Regression_Forecast_Adjusted_1[:]})

# Roll back first difference
Regression_Forecast_Adjusted_2['Regression_Forecast_De_Standardise_R_Sum'] = Regression_Forecast_Adjusted_2['Regression_Forecast_De_Standardise'].cumsum()
Regression_Forecast_Adjusted_2['Regression_Forecast'] = Regression_Forecast_Adjusted_2['Regression_Forecast_De_Standardise_R_Sum'] + Latest_target[0]
Regression_Forecast_Adjusted_2.index = HWES_Forecast.index
Regression_Forecast_Adjusted_2.head(5)
Figure 18: Forecast using Linear Regression

10. SVR

Support Vector Regression (SVR) is a type of support vector machine that supports both linear and non-linear regression. The prediction equation is similar to linear regression and defines a hyperplane. The data points on either side of the hyperplane that are closest to it are known as support vectors and are used to plot the boundary lines. SVR tries to fit the best line within a threshold value, i.e. the SVR model tries to satisfy the condition:

-a < y − (wx + b) < a, where a is the threshold value

We tune two parameters for this model: the regularization parameter (C) and the kernel coefficient (gamma). To learn more about SVR, please refer to this book by Mariette Awad and Rahul Khanna.

# SVR
# Regularization parameter
C = [0.1, 1, 10, 100, 1000]
# Kernel coefficient
gamma = [1, 5]

svr_Test_Accuracy_Data = pd.DataFrame(columns=['C', 'gamma', 'Train Accuracy', 'Test Accuracy'])

for x in list(itertools.product(C, gamma)):
    svr_reg = SVR(kernel='rbf', C=x[0], gamma=x[1])
    # Train the model on training data
    svr_reg.fit(features_train, target_train)
    # Train data: predict on the train data
    predictions_train = svr_reg.predict(features_train)
    # Calculate the absolute errors
    errors_train = abs(predictions_train - target_train)
    # Calculate mean absolute percentage error (MAPE)
    mape_train = 100 * (errors_train / target_train)
    # Calculate and display accuracy
    accuracy_train = 100 - np.mean(mape_train)
    # Test data: predict on the test data
    predictions_test = svr_reg.predict(features_test)
    # Calculate the absolute errors
    errors_test = abs(predictions_test - target_test)
    # Calculate mean absolute percentage error (MAPE)
    mape_test = 100 * (errors_test / target_test)
    # Calculate and display accuracy
    accuracy_test = 100 - np.mean(mape_test)
    svr_Test_Accuracy_Data_One = pd.DataFrame(index=range(1), columns=['C', 'gamma', 'Train Accuracy', 'Test Accuracy'])
    svr_Test_Accuracy_Data_One.loc[:, 'C'] = x[0]
    svr_Test_Accuracy_Data_One.loc[:, 'gamma'] = x[1]
    svr_Test_Accuracy_Data_One.loc[:, 'Train Accuracy'] = accuracy_train
    svr_Test_Accuracy_Data_One.loc[:, 'Test Accuracy'] = accuracy_test
    svr_Test_Accuracy_Data = svr_Test_Accuracy_Data.append(svr_Test_Accuracy_Data_One)

svr_Test_Accuracy_Data
Figure 19: Hyper parameter tuning for SVR

Similar to LSTM, random forest and XGBoost, we fit the final model using the parameter values of the best-performing model on the test dataset.

# SVR - Best fit model
Best_Fit_SVR = svr_Test_Accuracy_Data.loc[svr_Test_Accuracy_Data['Test Accuracy'] == max(svr_Test_Accuracy_Data['Test Accuracy'])]
Best_Fit_SVR = Best_Fit_SVR.values.flatten().tolist()

# Fit model and train on training data
svr_reg = SVR(kernel='rbf', C=Best_Fit_SVR[0], gamma=Best_Fit_SVR[1])
svr_reg.fit(features_train, target_train)

# Predict over the forecast horizon using the driver forecasts
svr_Forecast = svr_reg.predict(driver_forecasts)

# De-standardise
svr_Forecast_Adjusted_1 = (svr_Forecast * std_target) + mean_target
svr_Forecast_Adjusted_2 = pd.DataFrame({'svr_Forecast_De_Standardise': svr_Forecast_Adjusted_1[:]})

# Roll back first difference
svr_Forecast_Adjusted_2['svr_Forecast_De_Standardise_R_Sum'] = svr_Forecast_Adjusted_2['svr_Forecast_De_Standardise'].cumsum()
svr_Forecast_Adjusted_2['SVM_Forecast'] = svr_Forecast_Adjusted_2['svr_Forecast_De_Standardise_R_Sum'] + Latest_target[0]
svr_Forecast_Adjusted_2.index = HWES_Forecast.index
svr_Forecast_Adjusted_2.head(5)
Fig 20: Forecast using SVR

Final Prediction

In most projects, we are not certain about the applicability of a single model, and this uncertainty interferes with the quality of the prediction. Based on my project experience, the best way to tackle this issue is to form a consensus across many models. The idea is that some models will be just right for the prediction point while others will underestimate or overestimate; by averaging, we even out the overestimation and underestimation. This is exactly what we do in our forecasting engine.

Final_Forecast = pd.concat([HWES_Forecast, SARIMA_Forecast, SARIMAX_Forecast['SARIMAX_Forecast'], VAR_Forecast, VECM_Forecast,
                            LSTM_Forecast_Adjusted_2['LSTM_Forecast'], Random_Forest_Forecast_Adjusted_2['Random_Forest_Forecast'],
                            XGB_Forecast_Adjusted_2['XGB_Forecast'], svr_Forecast_Adjusted_2['SVM_Forecast'],
                            Regression_Forecast_Adjusted_2['Regression_Forecast']], axis=1)
Final_Forecast['Final_Forecast'] = Final_Forecast.mean(axis=1)
Final_Forecast.tail(5)
Final_Forecast.plot()

Instead of a simple average, we can use a weighted average if we have clarity on the applicability of the models, or take an average of only the top n models based on test-data performance (a sketch of both options follows below). The final prediction approach should be chosen based on the business problem at hand, but all the tools needed to build a robust forecasting engine can be found in this blog series.
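
As a sketch, both alternatives can be computed directly from the Final_Forecast table; the test_accuracy dictionary below is hypothetical and would hold each model's test-period accuracy from the tuning steps:

# Sketch: weighted average and top-n average (test_accuracy is a hypothetical dict of model accuracies)
model_cols = [c for c in Final_Forecast.columns if c != 'Final_Forecast']

# (a) accuracy-weighted average
weights = pd.Series({m: test_accuracy[m] for m in model_cols})
weights = weights / weights.sum()
Final_Forecast['Weighted_Forecast'] = (Final_Forecast[model_cols] * weights).sum(axis=1)

# (b) simple average of the top 3 models by test accuracy
top_models = weights.nlargest(3).index
Final_Forecast['Top3_Forecast'] = Final_Forecast[top_models].mean(axis=1)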

Conclusion

The demand for a product or service keeps changing from time to time. No business can improve its financial performance without estimating consumer demand and future sales of products/services to a fair degree of accuracy. This blog series describes a forecasting methodology that uses the desirable qualities of both time series analysis and AI models. In practice, minute details of the proposed methodology like the list of final models to use will need to be appropriately adjusted and tested based on the business need.

We can categorize the key benefits of the forecasting methodology into two buckets:

  1. An innovative approach for feature engineering and selection: The predictive engine is not only able to extract relevant features from the existing dataset (like lags and periodic differences) but also allows the user to easily add third-party datasets like holiday flags, company stock prices and the S&P 500 index. It is a very powerful combination, as it allows the user to create a comprehensive list of drivers from various angles. We then follow up with an AI-driven feature selection module that uses a set of 5 different algorithms. Its innovative approach to combining the results of multiple algorithms makes our feature selection more robust and helps us identify the strongest drivers of our target variable.

  2. Combining best-in-class AI and time series models: Forecasting essentially comes under the realm of time series analysis. Our forecasting engine not only uses a host of the most powerful time series models like SARIMA, VAR and VECM, but also has an effective way of applying popular AI models like LSTM, XGBoost and Random Forest to time series datasets. From ensuring stationarity and standardising the data to selecting the test dataset, the AI models protect the time series properties of the dataset at every step. Coupled with extensive hyperparameter tuning, this gives us a robust and powerful forecasting engine.

I hope this blog series will help my fellow data scientists in their quest for predictive excellence! The code for the entire analysis can be found here. I will be following up with an interesting series on segmentation and clustering. If you liked the series kindly do put in a few claps. And as always, any thoughts, comments or feedback will be greatly appreciated.

Do you have any questions or suggestions about this blog? Please feel free to drop in a note.

Thank you for reading!

If you, like me, are passionate about AI, Data Science, or Economics, please feel free to add/follow me on LinkedIn, Github and Medium.
