Microsoft Stock Data Predictions: ARIMA Model Implementation (Part 2)

In this article, we continue from Part 1 and train and evaluate the ARIMA models.

Daksh Bhatnagar
Accredian
6 min readJan 19, 2023


We are going to build models for both univariate and multivariate time series data, continuing from the previous article. If you haven’t read Part 1 yet, please start there.

UNIVARIATE TIME SERIES

A univariate time series is a sequence of observations of a single variable recorded at successive points in time. It may represent the daily temperature at a specific location over the course of a year, the daily sales of a particular product, or the daily stock prices of a company. Let’s go ahead and train the model now.

#importing statsmodels (also used in Part 1)
import statsmodels.api as sm

#defining the model on the High column; order = (p, d, q)
model = sm.tsa.arima.ARIMA(df.High, order=(1, 1, 1))
#fitting the model
model_fit = model.fit()
#printing the summary of the model
model_fit.summary()

The summary of the model will look like this:

We will go ahead and make the predictions

# making the predictions and storing them 
predictions = model_fit.predict().values
#Assigning the predictions to a new column in the data frame
df['Predictions'] = predictions

Now that we have the predictions, let’s plot them against the actuals.

df[-300:].plot(linestyle='--')
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)
plt.title('MSFT High Predictions, ARIMA Model', fontsize=20)
plt.show()

As is evident, the model is doing a good job: the predictions closely follow the actual line. We can also quantify the model’s error using the root mean squared error (RMSE).

#imports needed for the metric
import numpy as np
from sklearn.metrics import mean_squared_error

#Calculating the RMSE between the actuals and the predictions
rmse = np.sqrt(mean_squared_error(df.High, df.Predictions))
print(f'The Root Mean Squared Error between the Actuals and the Predictions is {round(rmse, 4)}')
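RMSE is expressed in the same units as the price, which makes it hard to compare across stocks trading at different levels. A scale-free companion such as MAPE (not used in the article; added here purely as an illustration) can be computed like this:

```python
import numpy as np

def mape(actual, predicted):
    """Mean absolute percentage error: a scale-free companion to RMSE."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean(np.abs((actual - predicted) / actual)) * 100

# Errors of 10%, 5%, and 0% average out to 5%
print(f"MAPE: {mape([100, 200, 400], [110, 190, 400]):.2f}%")  # -> 5.00%
```

Note that MAPE is undefined when an actual value is zero, which is not a concern for stock prices.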

We can make a forecast for the future using the forecast method.

#forecasting 10 steps into the future
forecast = model_fit.forecast(steps=10)
#creating a date range for the forecast horizon
dates = pd.date_range(start='2023-01-12', periods=10)
#creating the forecast dataframe (.values drops the model's own index so the dates align)
forecast_df = pd.DataFrame({'Date': dates, 'Predictions': forecast.values})
forecast_df.index = forecast_df['Date']
forecast_df.drop('Date', axis=1, inplace=True)
#Plotting the forecast dataframe
forecast_df.plot(marker='o', color='orange')
plt.title('Forecast for the next 10 Days')
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)
plt.show()

MULTIVARIATE TIME SERIES

Multivariate time series is a type of time series data that involves multiple variables, each of which has an associated set of observations over a period of time. This type of data is often used in studies of complex systems, where multiple variables interact to produce a single outcome.

It can also be used to measure trends over time or to analyze the effects of multiple variables on one another. Examples of multivariate time series include stock market data, economic indicators, climate data, and customer behaviour data.

For this version of the Time series, we consider multiple features to predict the targeted values. For our purposes, we will go ahead and use the first 3 columns to predict the 4th column of the data frame which looks something like this.

We have to check the stationarity of the other columns as well, plot the ACF and PACF, and compute the first- and second-order differences of the series, which lets us find the p, d, and q values. We will use 1 for p and d, and 2 for q.

Splitting time series data is a bit different from the typical random split used for other machine learning datasets. Since the observations form a sequence and we don’t want the model to learn across empty gaps (which random splitting would create), we split the data based on the year.

We pick the data prior to 2017 for training and keep the remaining data for testing.

#Splitting the data based on the year value
train = stock_df[stock_df.index.year<2017]
test = stock_df[stock_df.index.year>=2017]

The split looks something like this
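As a quick sanity check that a year-based split is purely chronological, with no overlap and no gap, here is a sketch on an illustrative frame (standing in for the actual stock_df):

```python
import numpy as np
import pandas as pd

# Illustrative business-day frame standing in for stock_df
idx = pd.date_range('2015-01-01', '2018-12-31', freq='B')
stock_df = pd.DataFrame({'Close': np.arange(len(idx), dtype=float)}, index=idx)

train = stock_df[stock_df.index.year < 2017]
test = stock_df[stock_df.index.year >= 2017]

# Every training date precedes every test date, and no rows are lost
assert train.index.max() < test.index.min()
assert len(train) + len(test) == len(stock_df)
print(len(train), len(test))
```

This is the property random splitting would break: shuffled rows from 2018 would leak future information into training.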

The independent features in time series terms are called exogenous features, and we will now fit the model on the training dataset. In the endog argument we provide the dependent variable, while in the exog argument we provide the exogenous/independent features.

#the first three columns predict the Close column
exogenous_features = ['Open', 'High', 'Low']
#defining the model
model = sm.tsa.arima.ARIMA(endog=train['Close'],
                           exog=train[exogenous_features],
                           order=(1, 1, 2))  # (p, d, q)
#fitting the model
model_fit = model.fit()
#making predictions on the training set
train['Predictions'] = model_fit.predict()
#Plotting the actual values and the predictions
train[['Close', 'Predictions']][-50:].plot()
plt.title('Predictions on Training Set')
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)
plt.show()

Looks like the predictions are quite close to the actual values. Let’s go ahead and make the predictions on the test set.

#one-step forecast for each row of the test set (double brackets keep a one-row frame for exog)
forecast = [model_fit.forecast(steps=1, exog=test[exogenous_features].iloc[[i]]).values[0]
            for i in range(len(test))]
#assigning the values to a new column in the dataframe
test['Forecast'] = forecast

#Plotting the actuals and the forecast
test[['Close','Forecast']][-50:].plot()
plt.title('Predictions on Test Set')
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)
plt.show()

When we compute the RMSE between the actual values and the forecast, it comes out to be

#Calculating the root mean squared error
rmse = np.sqrt(mean_squared_error(test['Close'],test['Forecast']))
print(f'The RMSE for Multivariate ARIMAX is {round(rmse,4)}')

CONCLUSION

  1. A univariate time series consists of a single variable observed over a period of time, such as daily temperatures, the sales of a product, or a company’s stock prices.
  2. A multivariate time series involves multiple variables, each with its own set of observations over time; it is commonly used to study systems where several variables interact.
  3. Splitting time series data differs from the random splits used for typical machine learning datasets: because the observations form a sequence, we split by year so the model never has to learn across gaps.
  4. In time series terms, the independent features are called exogenous features; the endog argument receives the dependent variable and the exog argument receives the exogenous/independent features.

Final Thoughts and Closing Comments

There are some vital points many people fail to understand while they pursue their Data Science or AI journey. If you are one of them and looking for a way to counterbalance these cons, check out the certification programs provided by INSAID on their website.

If you liked this article, I recommend you go with the Global Certificate in Data Science & AI because this one will cover your foundations, machine learning algorithms, and deep neural networks (basic to advanced).

Daksh Bhatnagar, Accredian
Data Analyst who talks about #datascience, #dataanalytics and #machinelearning