Forecasting Apple Stock Price Using SARIMA

Arnab Chakraborty
Published in AI Skunks
11 min read · Apr 16, 2023

Introduction

We will explore the exciting world of time series analysis, its significance in data-driven decision making, and how to effectively visualize and forecast data. The use case we will focus on is stock price analysis, a domain where time series plays a crucial role in understanding market trends and predicting future stock movements.

Overview of Time Series Analysis

A time series is a sequence of data points indexed in time order. Time series analysis is the process of using statistical techniques to model and understand the underlying patterns and structures within the data. Time series data is prevalent across various domains, such as finance, economics, weather forecasting, and even in monitoring vital signs in healthcare.

The primary components of a time series are (illustrated with a short synthetic sketch after this list):

  • Trend: The long-term movement of the data.
  • Seasonality: Regular and predictable fluctuations that repeat over a specific period.
  • Cyclicity: Fluctuations that do not follow a fixed pattern or period but are affected by external factors.
  • Irregular/Noise: Random fluctuations that are unpredictable and not associated with any pattern.
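
As a quick illustration, here is a minimal synthetic sketch (made-up numbers, not the AAPL data) of how these components combine into one series:

import numpy as np
import matplotlib.pyplot as plt

# Synthetic monthly series: trend + seasonality + noise (illustrative only)
rng = np.random.default_rng(42)
t = np.arange(120)                             # 10 years of monthly steps
trend = 0.5 * t                                # long-term upward movement
seasonality = 5 * np.sin(2 * np.pi * t / 12)   # repeating 12-month cycle
noise = rng.normal(0, 2, size=t.shape)         # irregular component
series = trend + seasonality + noise

fig, axes = plt.subplots(4, 1, figsize=(10, 8), sharex=True)
for ax, (name, comp) in zip(axes, [("Series", series), ("Trend", trend),
                                   ("Seasonality", seasonality), ("Noise", noise)]):
    ax.plot(comp)
    ax.set_title(name)
plt.tight_layout()
plt.show()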

Importance of Time Series Visualization and Forecasting

Visualization and forecasting are vital aspects of time series analysis. They provide valuable insights and allow us to make data-driven decisions.

Time Series Visualization helps us:

  • Understand the data’s underlying patterns, trends, and seasonality.
  • Identify anomalies or outliers in the data.
  • Communicate insights effectively to stakeholders.

Time Series Forecasting enables us to:

  • Predict future values based on historical data.
  • Make informed decisions, such as investment strategies or resource allocation.
  • Evaluate the effectiveness of interventions or policy changes.

Use Case: Stock Price Analysis

Stock price analysis is a perfect example of how time series analysis can be applied in the real world. In finance, stock prices represent a series of data points over time, and understanding the trends, seasonality, and irregularities can provide valuable insights for investors.

One popular stock that has consistently engaged audiences and generated interest is Apple Inc. (AAPL). Apple’s stock price movements often attract significant attention from both investors and the general public. The company’s stock price history can reveal interesting insights into how the market has responded to various events, such as product launches, earnings reports, and global economic trends.

Data Description

The Apple stock data (AAPL) available on Yahoo Finance offers historical and up-to-date stock prices for Apple Inc. Link: https://finance.yahoo.com/quote/AAPL/history/

Here’s a description of each column in the dataset:

  • Date: The date when the stock market data was recorded (YYYY-MM-DD format).
  • Open: The opening price of Apple’s stock on that specific day.
  • High: The highest price at which Apple’s stock was traded during the day.
  • Low: The lowest price at which Apple’s stock was traded during the day.
  • Close: The closing price of Apple’s stock on that specific day.
  • Adj Close: The adjusted closing price, which accounts for dividends, stock splits, and new stock offerings, providing a more accurate reflection of the stock’s value over time.
  • Volume: The number of Apple’s shares that were traded during the day.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Load the AAPL daily price history
df = pd.read_csv('https://raw.githubusercontent.com/chakraborty-arnab/DataSphere/main/AAPL.csv')
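
A quick sanity check that the file loaded as expected:

# Inspect the first rows and the column types
print(df.head())
print(df.dtypes)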

Data Preprocessing

First, we convert the Date column to a datetime object and plot the entire history of the stock price over time.

df['Date'] = pd.to_datetime(df.Date)
plt.figure(figsize=(20, 6))
plt.plot(df["Adj Close"])
plt.xlabel("Date")
plt.ylabel("Adj Closing Price ($)")
plt.title("Apple Stock Prices Over Time")
plt.show()

Key Moments in History

Monthly Annualized Volatility is a financial metric used to measure the variability or dispersion of an asset’s returns over a one-month period, scaled up to represent a full year.

  • It is commonly used to quantify the level of risk associated with an investment, with higher volatility indicating greater uncertainty and potential price fluctuations.
  • This metric is calculated by taking the standard deviation of the returns within each month, then multiplying it by the square root of 12 (the number of months in a year) to annualize the figure: annualized volatility = monthly volatility × √12.

Here we will use the Monthly Annualized Volatility metric to see the impact of key economic events on the stock price.

# Set the Date column as the index
newdf = df.set_index('Date')

# To model returns we use the daily percentage change
daily = newdf['Adj Close'].pct_change()
daily.dropna(inplace=True)

# Resample to monthly frequency: the std of returns within each month,
# scaled by sqrt(12), is our measure of annualized volatility
monthly = daily.resample("M").std() * np.sqrt(12)

import matplotlib.patches as mpatches

# Visualize how major market events show up in the volatility
plt.figure(figsize=(20, 6))
plt.plot(monthly)
plt.axvspan('1987', '1988', color='grey', alpha=.5)
plt.axvspan('2000', '2002', color='purple', alpha=.5)
plt.axvspan('2008', '2009', color='r', alpha=.5)
plt.axvspan('2020', '2021', color='g', alpha=.5)
plt.title("Monthly Annualized Volatility")
l1 = mpatches.Patch(color='grey', alpha=.5, label="Black Monday")
l2 = mpatches.Patch(color='purple', alpha=.5, label="Dot-com bubble & Sept 11 attacks")
l3 = mpatches.Patch(color='red', alpha=.5, label="2007-2008 Financial Crisis")
l4 = mpatches.Patch(color='green', alpha=.5, label="2020 Stock Market Crash")
plt.legend(handles=[l1, l2, l3, l4])
plt.show()

Monthly Resampling

Monthly sampling of daily stock prices can be advantageous for prediction and analysis in certain contexts. Some reasons for using monthly data include:

  • Reduced noise
  • Lower computational complexity
  • Easier interpretation
  • Reduced impact of micro-events
  • Smoothing seasonality
  • Aligning with reporting periods
# Resample the data to the monthly level
monthly_mean = newdf['Adj Close'].resample('M').mean()
monthly_data = monthly_mean.to_frame()
##Monthly Stock Price
fig = plt.figure(figsize=(20,6))
plt.plot(monthly_data['Adj Close'],label='Monthly Averages Apple Stock')
plt.legend(prop={'size': 12})
plt.show()

Observation:

  • We can observe an overall uptrend from 2007 to 2021.
  • However, the stock entered a consolidation phase in 2022.
monthly_data['Year'] = monthly_data.index.year
monthly_data['Month'] = monthly_data.index.strftime('%B')
monthly_data['Quarter'] = monthly_data.index.quarter

fig, ax = plt.subplots(figsize=(20, 6))
a = sns.barplot(x="Year", y="Adj Close", hue='Month', data=monthly_data, ax=ax)
a.set_title("Stock Prices Year & Month Wise", fontsize=15)
plt.legend(loc='upper left')
plt.show()
quarter = monthly_data.groupby(["Year", "Quarter"])["Adj Close"].mean().unstack()
plt.figure(figsize=(14, 8))
sns.heatmap(quarter, cmap="coolwarm", annot=True, fmt=".2f", linewidths=.5)
plt.title("Apple Stock Prices: Quarterly Averages")
plt.show()

Observation:

  • Quarters 3 & 4 generally show higher prices than Quarters 1 & 2.
  • The primary reason is Apple's product release cycle: Wall Street gets excited about upcoming products and the holiday shopping period.
  • We can observe a decline in the stock price in 2022, driven by production issues and valuation adjustments across Big Tech. We will exclude the 2022 data from the forecasting exercise, as that move is driven by external factors.

Decomposition of Time Series

from statsmodels.tsa.seasonal import seasonal_decompose as sd

# seasonal_decompose returns its own figure, so we resize it after plotting
decomposed_series = sd(monthly_data['Adj Close'])
fig = decomposed_series.plot()
fig.set_size_inches(30, 12)
plt.show()

Inferences:

  • Trend: Overall an upward trend.
  • Seasonality: There is visible seasonality; AAPL tends to rally into the holiday season, which has historically been a strong sales period for Apple.
##Drilling Down and Observing Seasonality
fig = plt.figure(figsize=(9,5))
decomposed_series.seasonal['2007':'2008'].plot()

Stationarity Test of Time Series

For ease of automation, we will use the Augmented Dickey-Fuller (ADF) test.

Null Hypothesis : Time series is non-stationary

Alternate Hypothesis : Time series is stationary

A time series is stationary if it has a constant mean, constant variance, and no trend or seasonality.

from statsmodels.tsa.stattools import adfuller

def ad_fuller_func(X):
    result_ad_fuller = adfuller(X)
    print('ADF Statistic: %f' % result_ad_fuller[0])
    print('p-value: %f' % result_ad_fuller[1])
    print('Critical Values:')
    for key, value in result_ad_fuller[4].items():
        print('\t%s: %.3f' % (key, value))

    # Compare the test statistic against the 5% critical value
    if result_ad_fuller[0] < result_ad_fuller[4]['5%']:
        print('Reject Null Hypothesis - Time Series is Stationary')
    else:
        print('Failed to Reject Null Hypothesis - Time Series is Non-Stationary')

ad_fuller_func(monthly_data['Adj Close'])
ADF Statistic: 2.422043
p-value: 0.999020
Critical Values:
	1%: -3.478
	5%: -2.882
	10%: -2.578
Failed to Reject Null Hypothesis - Time Series is Non-Stationary

Observation

The time series is not stationary, consistent with what we observed earlier in the decomposition (both trend and seasonality are present).

Autocorrelation Function (ACF)

  • ACF measures the linear dependence between data points in a time series at different time lags. It calculates the correlation coefficient between a variable and its lagged version for various time lags. A high autocorrelation value indicates that the observations are closely related to their preceding observations.
  • ACF is used to identify the appropriate order of the moving average (MA) component in a time series model.
from statsmodels.graphics.tsaplots import plot_acf

fig,(ax1,ax2) = plt.subplots(2,figsize=(12,9))
acf = plot_acf(monthly_data['Adj Close'],lags=90,ax=ax1)
ax1.set_title('AutoCorrelation Long Term')
acf = plot_acf(monthly_data['Adj Close'],lags=30,ax=ax2)
ax2.set_title('AutoCorrelation Short Term')
ax1.set_ylabel('Correlation')
ax1.set_xlabel('Lags')
ax2.set_ylabel('Correlation')
ax2.set_xlabel('Lags')

Partial Autocorrelation Function (PACF)

  • PACF measures the direct linear dependence between data points in a time series at a specific time lag, after removing the effect of any correlations at shorter lags. In other words, it calculates the correlation between a variable and its lagged version, while controlling for the influence of all intervening data points.
  • PACF is used to identify the appropriate order of the autoregressive (AR) component in a time series model.
from statsmodels.graphics.tsaplots import plot_pacf

fig,(ax1,ax2) = plt.subplots(2,figsize=(12,9))
pacf = plot_pacf(monthly_data['Adj Close'],lags=60,ax=ax1)
ax1.set_title('Partial AutoCorrelation Long Term')
pacf = plot_pacf(monthly_data['Adj Close'],lags=30,ax=ax2)
ax2.set_title('Partial AutoCorrelation Short Term')
ax1.set_ylabel('Correlation')
ax1.set_xlabel('Lags')
ax2.set_ylabel('Correlation')
ax2.set_xlabel('Lags')
plt.tight_layout(pad=1)

Transformation to make series stationary

We will use first-order differencing to detrend the series.

##Differencing By 1
monthly_diff = monthly_data['Adj Close'] - monthly_data['Adj Close'].shift(1)
monthly_diff[1:].plot(c='grey')
monthly_diff[1:].rolling(20).mean().plot(label='Rolling Mean',c='orange')
monthly_diff[1:].rolling(20).std().plot(label='Rolling STD',c='yellow')
plt.legend(prop={'size': 12})

Observation

The series now looks stationary, as it has a roughly constant mean and variance.

##Checking if Time Series is Stationary by Running ADF Test
ad_fuller_func(monthly_diff[1:])
fig,(ax1,ax2) = plt.subplots(2,figsize=(10,6))
acf = plot_acf(monthly_diff[1:],lags=30,ax=ax1)
pacf = plot_pacf(monthly_diff[1:],lags=30,ax=ax2)
ax1.set_title('Autocorrelation For Differenced(1)')
ax1.set_ylabel('Correlation')
ax1.set_xlabel('Lags')
ax2.set_title('Partial Autocorrelation For Differenced(1)')
ax2.set_ylabel('Correlation')
ax2.set_xlabel('Lags')
plt.tight_layout(pad=1)

Seasonal ARIMA

Seasonal ARIMA (Autoregressive Integrated Moving Average) is a time series forecasting method that extends the basic ARIMA model to account for seasonality. A seasonal ARIMA model is denoted ARIMA(p, d, q)(P, D, Q)m, where m is the length of the seasonal period.

modelling_series = monthly_data['Adj Close']
from sklearn.model_selection import train_test_split as split
train,test = split(modelling_series,train_size=0.6,shuffle=False)

Hyperparameters

SARIMA (p,d,q) (P,D,Q)m

There are three trend elements that require configuration.

p: Trend autoregression order.
d: Trend difference order.
q: Trend moving average order.

There are four seasonal elements that are not part of ARIMA that must be configured; they are:

P: Seasonal autoregressive order.
D: Seasonal difference order.
Q: Seasonal moving average order.
m: The number of time steps for a single seasonal period.
import itertools
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Grid of candidate (p, d, q) and seasonal (P, D, Q, 12) configurations
p = d = q = range(0, 3)
pdq = list(itertools.product(p, d, q))
seasonal_pdq = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p, d, q))]

list_param = []
list_param_seasonal = []
list_results_aic = []

for param in pdq:
    for param_seasonal in seasonal_pdq:
        try:
            model = SARIMAX(train,
                            order=param,
                            seasonal_order=param_seasonal,
                            enforce_stationarity=False,
                            enforce_invertibility=False)
            results = model.fit()

            print('ARIMA{}x{}12 - AIC:{}'.format(param, param_seasonal, results.aic))

            list_param.append(param)
            list_param_seasonal.append(param_seasonal)
            list_results_aic.append(results.aic)
        except:
            continue
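
The loop records every configuration it manages to fit; a short follow-up sketch (using the lists defined above) extracts the one with the lowest AIC:

# Pick the configuration with the lowest AIC
best_idx = int(np.argmin(list_results_aic))
print('Best order:', list_param[best_idx])
print('Best seasonal order:', list_param_seasonal[best_idx])
print('AIC:', list_results_aic[best_idx])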

Observation

Selecting the lowest AIC gives a seasonal order of (2,2,0,12) and a non-seasonal order of (1,1,1), consistent with what we derived earlier from the correlograms.

Seasonal ARIMA is used because a seasonality component is present: during the fall period (July-November) the stock tends to rally on product launch news and the releases in that cycle of the year.

Back-Testing Training and Testing Data

  • We cannot use standard cross-validation on time series datasets, because random folds would jumble the temporal order of the observations.
  • The time dimension means observations cannot be randomly split into groups; instead, we can use backtesting.
  • In backtesting we create multiple train-test splits that respect the temporal order of the data. For example, with data running from January to December, each successive split trains on an expanding window of earlier months and tests on the months that immediately follow.
## Using TimeSeriesSplit from the sklearn library
from sklearn.model_selection import TimeSeriesSplit

time_series_splits = TimeSeriesSplit(n_splits=4)
X = modelling_series.values
fig = plt.figure(figsize=(24, 10))

index = 1
for train_index, test_index in time_series_splits.split(X):
    train = X[train_index]
    test = X[test_index]
    print('Observations: %d' % (len(train) + len(test)))
    print('Training Observations: %d' % (len(train)))
    print('Testing Observations: %d' % (len(test)))

    # Plot each split: the training segment first, then the test segment
    plt.subplot(360 + index)
    plt.plot(train)
    plt.plot([None for i in train] + [x for x in test])
    index += 1
plt.show()
import statsmodels.api as sm

def backtest_model(train, test):
    model = sm.tsa.SARIMAX(train, order=(1,1,1), seasonal_order=(2,2,0,12))
    results = model.fit()

    # In-sample predictions over the training window, out-of-sample over the test window
    forecasts_train = results.predict(start=0, end=len(train))
    forecasts_test = results.predict(start=len(train), end=len(train)+len(test))

    fig, (ax1, ax2) = plt.subplots(2, figsize=(14,6))

    train = pd.DataFrame(train)
    test = pd.DataFrame(test)

    forecasts_train = pd.DataFrame(forecasts_train)
    forecasts_test = pd.DataFrame(forecasts_test)

    forecasts_train.plot(label='Forecasts', ax=ax1, title='SARIMA Forecasting - Train Data')
    train.plot(label='Actual', ax=ax1)
    ax1.set_ylabel('Stock Price')
    ax1.set_xlabel('Time')

    forecasts_test.plot(label='Forecasts', ax=ax2, title='SARIMA Forecasting - Test Data')
    test.plot(label='Actual', ax=ax2)
    ax2.set_ylabel('Stock Price')
    ax2.set_xlabel('Time')

    ax1.legend()
    ax2.legend()
    plt.tight_layout(pad=2)

Using backtesting, we can validate our model on multiple train-test splits; each figure below compares forecasts against actuals for one split's training and test windows.
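
The call that produces these figures is not shown explicitly above; a minimal way to run it, reusing time_series_splits and X from earlier, would be:

# Run the backtest on each temporal split
for train_index, test_index in time_series_splits.split(X):
    backtest_model(X[train_index], X[test_index])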

Backtest 1
Backtest 2
Backtest 3

Let’s Forecast Now

We will set the non-seasonal order to (1,1,1) and the seasonal order to (2,2,0,12), and use the SARIMA model to forecast the stock price.

model = sm.tsa.SARIMAX(modelling_series, order=(1,1,1), seasonal_order=(2,2,0,12))
results = model.fit()

# Training and test windows by date
sd = '2007-01-31'
ed = '2016-09-30'
sd2 = '2016-10-31'
ed2 = '2019-12-31'

forecasts_train = results.predict(start=sd, end=ed)
forecasts_test = results.predict(start=sd2, end=ed2)

fig,(ax1,ax2) = plt.subplots(2,figsize=(18,10))

forecasts_train.plot(label='Forecasts',ax=ax1,title='SARIMA Forecasting -Train Data')
modelling_series.loc[sd:ed].plot(label='Actual',ax=ax1)
ax1.set_ylabel('Stock Price')

forecasts_test.plot(label='Forecasts',ax=ax2,title='SARIMA Forecasting -Test Data')
modelling_series.loc[sd2:ed2].plot(label='Actual',ax=ax2)
ax2.set_ylabel('Stock Price')

ax1.legend()
ax2.legend()
plt.tight_layout(pad=2)

Let’s evaluate the forecasts

from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

## Function to calculate result metrics
def result_metrics(test_series, forecast_series, model_name):
    print('Result Metrics for {}'.format(model_name))
    print('R2 Score : ', round(r2_score(test_series, forecast_series), 3))
    print('Mean Squared Error : ', round(mean_squared_error(test_series, forecast_series), 3))
    print('Mean Absolute Error : ', round(mean_absolute_error(test_series, forecast_series), 3))

result_metrics(modelling_series[sd:ed], forecasts_train, 'SARIMA-Train Data')
print('----')
result_metrics(modelling_series[sd2:ed2], forecasts_test, 'SARIMA-Test Data')
Result Metrics for SARIMA-Train Data
R2 Score : 0.964
Mean Squared Error : 2.426
Mean Absolute Error : 1.149
----
Result Metrics for SARIMA-Test Data
R2 Score : 0.828
Mean Squared Error : 15.602
Mean Absolute Error : 3.068
print(results.summary())
                                     SARIMAX Results
===========================================================================================
Dep. Variable:                           Adj Close   No. Observations:                  156
Model:             SARIMAX(1, 1, 1)x(2, 2, [], 12)   Log Likelihood                -317.125
Date:                             Mon, 10 Apr 2023   AIC                            644.250
Time:                                     03:56:32   BIC                            658.626
Sample:                                 01-31-2007   HQIC                           650.092
                                      - 12-31-2019
Covariance Type:                               opg
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.1173      0.307      0.383      0.702      -0.484       0.718
ma.L1          0.1468      0.313      0.469      0.639      -0.466       0.760
ar.S.L12      -1.2045      0.062    -19.322      0.000      -1.327      -1.082
ar.S.L24      -0.5728      0.079     -7.285      0.000      -0.727      -0.419
sigma2         6.3563      0.602     10.567      0.000       5.177       7.535
===================================================================================
Ljung-Box (L1) (Q):                   0.00   Jarque-Bera (JB):                34.68
Prob(Q):                              0.99   Prob(JB):                         0.00
Heteroskedasticity (H):              24.47   Skew:                            -0.23
Prob(H) (two-sided):                  0.00   Kurtosis:                         5.48
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Conclusion

We have fit a SARIMA model that captures both the trend and the seasonality in Apple's monthly stock prices, and it fits the held-out data reasonably well. Adding external regressors (the X in SARIMAX) should further improve the model, as sketched below.
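
As a minimal sketch of that next step (the market_factor series here is hypothetical; any monthly indicator aligned with the stock's index could be substituted):

# Hypothetical exogenous regressor aligned to the monthly index,
# e.g. a market index or macro series (illustrative values only)
exog = pd.Series(np.random.default_rng(0).normal(size=len(modelling_series)),
                 index=modelling_series.index, name='market_factor')

# Same SARIMA orders as before, now with an exogenous term
model_x = sm.tsa.SARIMAX(modelling_series, exog=exog,
                         order=(1, 1, 1), seasonal_order=(2, 2, 0, 12))
results_x = model_x.fit()

# Out-of-sample forecasts also need future values of the regressor
future_exog = pd.Series(np.zeros(12))  # placeholder: 12 assumed future values
forecast_x = results_x.forecast(steps=12, exog=future_exog)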

Arnab Chakraborty
AI Skunks

I'm currently pursuing a Master's in Information Systems at Northeastern University and have 4 years of experience as a Data Scientist at Quantiphi.