Predicting Stock Prices is as easy as 123

Jumei Lin
8 min read · Aug 1, 2021

A trader’s goal is to predict the stock price and then sell the stock before the price decreases or buy it before the price increases. This article will show you how to predict the stock price using the most popular time series model: ARIMA.

Disclaimer: This article is purely educational and should not be taken as professional investment advice.

Stock Market Prediction by Lumei Digital

Stock price data is a financial time series that is difficult to predict due to its characteristics and dynamic nature. If you are just getting started with AI and plan to ditch the traditional way of investing, please don’t. If you are a casual investor, you are not missing out on anything big.

A large percentage of high-frequency trading algorithms predict milliseconds into the future, and lots of them use simple methods like hardcoded rules or a simple linear regression model. If you are an institutional investor, you certainly need a machine to race against milliseconds and analyze trends in investing strategies. After all, an institutional investor buys in large amounts, and the way you execute the trades can make a huge difference. A machine learning model can help decide how to split up the sales over time to avoid causing big price movements. Also, a model like ARIMA or an LSTM can help you predict the price movements.

Today we will use Apple’s stock data as an example to build an ARIMA model and predict its prices for the next 30 days.

ARIMA, AutoRegressive Integrated Moving Average, is an algorithm that forecasts future values based on the information in the past values of the time series. It’s widely used because of its simplicity and its ability to generalize to non-stationary series.

It’s important to know that plain ARIMA does not model seasonality. For seasonal data, a seasonal ARIMA (SARIMA), which adds seasonal terms to the model, is the usual suggestion. If you are interested in more detail about ARIMA, please refer to this page.

Your support would be awesome❤️

Please help me get 100 followers.

When it comes to time series analysis, you will first hear about three models:

  1. Autoregressive model (AR)
  2. Moving Average model (MA)
  3. Autoregressive Integrated Moving Average model (ARIMA): combines the two models above, which assume the series is stationary, with a differencing (“integrated”) step that handles non-stationary series. The ARIMA model is designed to process any non-seasonal time series that shows patterns and is not random white noise.

An ARIMA model is written as ARIMA(p, d, q), where:
p is the lag order (the number of autoregressive terms)
d is the order of differencing
q is the size of the moving average window
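
For readers who like to see the formula, here is one common textbook way to write ARIMA(p, d, q). The notation is an assumption introduced here, not something taken from the article's code: B is the backshift operator (B y_t = y_{t-1}), the φ and θ terms are the AR and MA coefficients, c is a constant, and ε_t is white noise.

```latex
\left(1 - \sum_{i=1}^{p} \phi_i B^{i}\right) (1 - B)^{d} \, y_t
  = c + \left(1 + \sum_{j=1}^{q} \theta_j B^{j}\right) \varepsilon_t
```

The factor (1 - B)^d is the differencing step, so with d = 0 the expression reduces to an ordinary ARMA(p, q) model.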

A time series is considered to be “stationary” when it satisfies three conditions:

  1. Mean is constant
  2. Standard Deviation is constant
  3. Seasonality does not exist

As you can imagine, most real-world data is not stationary. For example, the stock prices below show an obvious increase in the mean over time.
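
To make the idea concrete, here is a minimal sketch using only the Python standard library. It assumes a made-up random walk with drift in place of real stock data, and compares the mean of the first and second half of the series before and after taking first differences. The names `walk`, `diffs` and `half_means` are illustrative.

```python
import random
import statistics

random.seed(42)  # fixed seed so the illustration is repeatable

# Hypothetical series: a random walk with upward drift (not real stock data).
n = 1000
drift = 0.5
walk = [0.0]
for _ in range(n - 1):
    walk.append(walk[-1] + drift + random.gauss(0, 1))

# First difference: the change from one observation to the next.
diffs = [b - a for a, b in zip(walk, walk[1:])]

def half_means(values):
    """Return the mean of the first half and of the second half."""
    mid = len(values) // 2
    return statistics.mean(values[:mid]), statistics.mean(values[mid:])

walk_first, walk_second = half_means(walk)
diff_first, diff_second = half_means(diffs)

print(f"level  : first half {walk_first:8.2f}, second half {walk_second:8.2f}")
print(f"changes: first half {diff_first:8.2f}, second half {diff_second:8.2f}")
```

The mean of the level drifts far apart between the two halves, while the mean of the differenced series stays roughly the same. That is the behaviour the d term in ARIMA relies on.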

Apple stock value over time by Lumei Digital

There are several methods to remove the trend and seasonality from time-series data and make the data “stationary”:

✓ There are many ways to de-trend a time series

  • Log Transformation.
  • Power Transformation.
  • Local smoothing: applying moving window functions to time-series data.
  • Differencing a time series.
  • Linear Regression.

✓ There are various ways to remove seasonality.

  • Average de-trended values.
  • Differencing a time series.
  • Use the loess method.
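
Three of the de-trending methods listed above can be sketched in a few lines. This is a rough illustration under stated assumptions: the prices are a made-up series growing 1% per step, and plain Python lists stand in for the pandas objects used in the rest of the article.

```python
import math

# Hypothetical prices growing 1% per step (compounding), not real data.
prices = [100.0 * (1.01 ** t) for t in range(10)]

# Log transformation turns multiplicative growth into a straight line.
log_prices = [math.log(p) for p in prices]

# First-order differencing of the log series removes that linear trend.
log_diffs = [b - a for a, b in zip(log_prices, log_prices[1:])]

# A 3-point moving average is one simple form of local smoothing.
window = 3
smoothed = [
    sum(prices[i - window + 1 : i + 1]) / window
    for i in range(window - 1, len(prices))
]

print([round(d, 5) for d in log_diffs])
```

Every difference equals log(1.01), about 0.00995, so the differenced log series is flat even though the prices themselves keep growing.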

In this article, we will use Log Transformation to stabilize the series and the ADF Test to check if the data is stationary.

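Firstly, we collect the data by using DataReader to access Yahoo Finance stock data.

from datetime import datetime
from pandas_datareader.data import DataReader

# Ticker symbol of the stock to download
stockNo = 'AAPL'
# Use the last 15 years of daily data
end = datetime.now()
start = datetime(end.year - 15, end.month, end.day)
# Note: the 'yahoo' source worked when this was written; newer setups may need another data source
df = DataReader(stockNo, 'yahoo', start, end)
print(df.shape)
df.head()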
Apple’s stock price from 2006–07–31 to 2021–07–30 by Lumei Digital

Before we get started, a time series consists of 3 systematic components including level, trend, seasonality, and one non-systematic component called noise.

The components are defined as follows:

  1. Level: The average value in the series.
  2. Trend: The increasing or decreasing value in the series.
  3. Seasonality: The repeating short-term cycle in the series.
  4. Noise: The random variation in the series.
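
As a rough illustration of how these components add up, and how “average de-trended values” can recover the seasonality, here is a sketch on a made-up quarterly series. All numbers are hypothetical, noise is left out to keep the arithmetic exact, and the moving average is hardcoded for a cycle of period 4.

```python
# Hypothetical quarterly series built from known components (not real data).
level = 10.0
slope = 0.5                        # trend: increase per step
season = [2.0, -1.0, -2.0, 1.0]    # repeating cycle of period 4, sums to zero
period = len(season)

# Additive series: level + trend + seasonality (noise omitted).
series = [level + slope * t + season[t % period] for t in range(24)]

def centered_moving_average(values, t):
    """2x4 centered moving average: estimates level + trend at position t."""
    return (0.5 * values[t - 2] + values[t - 1] + values[t]
            + values[t + 1] + 0.5 * values[t + 2]) / 4

# Estimate the trend wherever the window fits, then subtract it (de-trend).
detrended = {t: series[t] - centered_moving_average(series, t)
             for t in range(2, len(series) - 2)}

# Average the de-trended values for each position in the cycle.
estimated_season = []
for k in range(period):
    values = [v for t, v in detrended.items() if t % period == k]
    estimated_season.append(sum(values) / len(values))

print([round(s, 6) for s in estimated_season])
```

Because the seasonal values sum to zero over one cycle, the moving average cancels them out and leaves only level plus trend, so the averaged de-trended values match the seasonal pattern that was put in.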

Before building the model, it’s essential to know whether the series is stationary, because the autoregressive and moving average parts of ARIMA assume stationary data. When applying an ARIMA model, it’s critical to find the order of differencing (d), since the purpose of differencing is to make the time series stationary.

We will only need differencing if the series is non-stationary, otherwise, the d is 0 (zero) in ARIMA. In this article, we will use ADF (Augmented Dickey-Fuller) Test to check if a series is stationary or not.

The ADF test is one of the most popular statistical tests for determining the presence of a unit root in a series, which helps us understand whether the series is stationary. The null and alternate hypotheses of this test are:

Null Hypothesis: The series has a unit root (the coefficient a on the previous value equals 1).

The null hypothesis of the test is that the time series can be represented by a unit root. If it fails to be rejected, this suggests the time series has a unit root, meaning it is non-stationary and has some time-dependent structure.

Alternate Hypothesis: The series has no unit root.

The alternate hypothesis (rejecting the null hypothesis) is that the time series is stationary. If the null hypothesis is rejected, this suggests the time series does not have a unit root, meaning it is stationary and does not have a time-dependent structure.

If the p-value < 0.05 (significance level), we reject the null hypothesis and infer the series is stationary.

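from statsmodels.tsa.stattools import adfuller
from numpy import log

df_close = df['Close']
# adfuller returns the test statistic, p-value, lags used, observation count and critical values
result = adfuller(df_close.dropna())
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])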

In this case, the p-value > 0.05, so we fail to reject the null hypothesis and conclude that the series is non-stationary. This means the series may be trend stationary or difference stationary.
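
The decision rule can be wrapped in a small helper. This is a sketch, not part of statsmodels: `interpret_adf` is a hypothetical name, and the example tuple is made up to mimic the layout of what `adfuller` returns (statistic, p-value, lags used, number of observations, critical values, information criterion). It is not the actual result for Apple.

```python
def interpret_adf(result, alpha=0.05):
    """Summarise a tuple laid out like the return value of statsmodels' adfuller()."""
    adf_stat, p_value, used_lag, n_obs, critical_values = result[:5]
    is_stationary = p_value < alpha
    verdict = ("reject the null hypothesis: the series looks stationary"
               if is_stationary else
               "fail to reject the null hypothesis: the series looks non-stationary")
    lines = [f"ADF statistic: {adf_stat:.4f}",
             f"p-value: {p_value:.4f}",
             f"lags used: {used_lag}, observations: {n_obs}"]
    lines += [f"critical value ({level}): {value:.4f}"
              for level, value in critical_values.items()]
    lines.append(verdict)
    return is_stationary, "\n".join(lines)

# Made-up numbers in the same layout as adfuller's output, for illustration only.
example_result = (-0.5, 0.89, 5, 3770, {"1%": -3.43, "5%": -2.86, "10%": -2.57}, 0.0)
stationary, report = interpret_adf(example_result)
print(report)
```

With a real series you would pass the `result` variable from the ADF code directly, for example `interpret_adf(result)`.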

We then go ahead with finding the order of differencing. Instead of choosing the value of p, d, and q by observing the plots of ACF and PACF, we use Auto ARIMA to get the best parameters.

Auto ARIMA automatically discovers the optimal order for an ARIMA model. The auto_arima function used here comes from the pmdarima library and is modelled on the auto.arima function in R.

import matplotlib.pyplot as plt
from pmdarima.arima import auto_arima

# train_data is the log-transformed training split created in the next code block
model_autoARIMA = auto_arima(train_data, start_p=0, start_q=0,
                             test='adf',        # use the ADF test to find the optimal 'd'
                             max_p=3, max_q=3,  # maximum p and q
                             m=1,               # frequency of the series
                             d=None,            # let the model determine 'd'
                             seasonal=False,    # no seasonality
                             start_P=0,
                             D=0,
                             trace=True,
                             error_action='ignore',
                             suppress_warnings=True,
                             stepwise=True)
print(model_autoARIMA.summary())
model_autoARIMA.plot_diagnostics(figsize=(15,8))
plt.show()

Standardized residual: fluctuates around a mean of zero and has a uniform variance.

Histogram plus estimated density: suggests normal distribution with mean zero.

Normal Q-Q: the points fall in line with the red line. If there are any deviations, it implies the distribution is skewed.

Correlogram: the ACF plot displays residual errors that are not autocorrelated. If there is any autocorrelation, it implies there are some patterns in the residual error that are not explained in the model and we need to find more predictors for the model.
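
To show what the correlogram is measuring, here is a minimal sketch of the sample autocorrelation and the approximate 95% band (±1.96/√n) that such plots usually draw. The function names are illustrative, and the alternating “residuals” are a made-up worst case rather than output from the model.

```python
import math

def autocorrelation(values, lag):
    """Sample autocorrelation of `values` at the given lag."""
    n = len(values)
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values)
    numer = sum((values[t] - mean) * (values[t - lag] - mean)
                for t in range(lag, n))
    return numer / denom

def significant_lags(residuals, max_lag=10):
    """Lags whose autocorrelation falls outside the approximate 95% band."""
    bound = 1.96 / math.sqrt(len(residuals))
    return [lag for lag in range(1, max_lag + 1)
            if abs(autocorrelation(residuals, lag)) > bound]

# Hypothetical residuals that alternate in sign: strongly patterned, not white noise.
patterned = [1.0, -1.0] * 50
print(autocorrelation(patterned, 1))
print(significant_lags(patterned))
```

For a well-fitted model, few or no lags should be flagged. A patterned series like this one is flagged at every lag, which is the situation the correlogram is meant to reveal.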

Overall, it seems to be a good fit. Let’s create an ARIMA model with the optimal parameters p, d, and q provided above and start forecasting the stock prices.

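import numpy as np
from statsmodels.tsa.arima_model import ARIMA  # legacy API, removed in statsmodels 0.13

# Log transformation
df_log = np.log(df_close)
# Split data into training and test sets (first 90% for training)
train_data, test_data = df_log[3:int(len(df_log)*0.9)], df_log[int(len(df_log)*0.9):]
# Build the model with the order suggested by auto_arima
ARIMA_model = ARIMA(train_data, order=(0,1,1))
fitted = ARIMA_model.fit(disp=-1)
print(fitted.summary())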
ARIMA Model Results by Lumei Digital

In the above results, the coefficients table in the middle shows the weights of the respective terms.

How to interpret ARIMA Results

▹Determining if it’s statistically significant: if the P-value < 0.05, the coefficient is statistically significant.

▹How well the model fits the data: use the mean squared error (MSE) to judge the fit. Smaller values imply a better-fitting model.

▹Reviewing assumptions: make sure the residuals are independent, known as white noise.

import pandas as pd

# Forecast one value per test observation, with a 95% confidence interval
fc, se, conf = fitted.forecast(len(test_data), alpha=0.05)
# Convert the forecast and its bounds to pandas series aligned with the test dates
fc_series = pd.Series(fc, index=test_data.index)
lower_series = pd.Series(conf[:, 0], index=test_data.index)
upper_series = pd.Series(conf[:, 1], index=test_data.index)
# Plot training data, actual prices, forecast and confidence band (all on the log scale)
plt.figure(figsize=(10,5), dpi=100)
plt.plot(train_data, label='training data')
plt.plot(test_data, color='blue', label='Actual Stock Price')
plt.plot(fc_series, color='orange', label='Predicted Stock Price')
plt.fill_between(lower_series.index, lower_series, upper_series,
                 color='k', alpha=.10)
plt.title('%s Stock Price Prediction' % stockNo)
plt.xlabel('Time')
plt.ylabel('%s Stock Price' % stockNo)
plt.legend(loc='upper left', fontsize=8)
plt.show()
Stock Price Prediction by Lumei Digital
import math
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Report performance (errors are measured on the log-price scale)
mse = mean_squared_error(test_data, fc)
print('MSE: '+str(mse))
mae = mean_absolute_error(test_data, fc)
print('MAE: '+str(mae))
rmse = math.sqrt(mean_squared_error(test_data, fc))
print('RMSE: '+str(rmse))
mape = np.mean(np.abs(fc - test_data)/np.abs(test_data))
print('MAPE: '+str(mape))

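A MAPE of around 3.4% implies the model is about 96.6% accurate over the forecast period. Note that this error is measured on log prices, so the percentage error in actual prices is typically larger. For sure, there are several things we can do to improve this model.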
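
One thing to keep in mind when reading these numbers is that the model was fitted on log prices, so the metrics are on the log scale too. The sketch below uses made-up prices and a flat forecast (both hypothetical) to show how the same forecast scores differently once it is converted back to prices with exp(). The helper names `mape` and `rmse` are illustrative.

```python
import math

def mape(actual, predicted):
    """Mean absolute percentage error, as a fraction (0.05 means 5%)."""
    return sum(abs(p - a) / abs(a) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical prices and a flat forecast, for illustration only.
actual_prices = [100.0, 110.0, 120.0]
forecast_prices = [105.0, 105.0, 105.0]

# The same values on the log scale, which is the scale the model was fitted on.
actual_logs = [math.log(p) for p in actual_prices]
forecast_logs = [math.log(p) for p in forecast_prices]

mape_log = mape(actual_logs, forecast_logs)
# Back-transform with exp() before scoring to get errors in price terms.
mape_price = mape(actual_prices, [math.exp(f) for f in forecast_logs])

print(f"MAPE on log prices: {mape_log:.4f}")
print(f"MAPE on prices:     {mape_price:.4f}")
```

In this toy case the error in price terms is several times larger than the error on the log scale, so it is worth checking both before calling a model accurate.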

There are several ways you can try to predict the stock market, for example, analyzing text from financial news websites or Twitter to learn whether a stock will go up or down, or carrying out a sentiment analysis of the way a piece of information is announced. However, the efficient market hypothesis states that asset prices reflect all available information. In other words, any information that can impact the price is already incorporated in it by the time the machine learning model gets the information from the web and predicts the price.

If this hypothesis is true, then using any data beyond the prices could be excessive. We should use nothing but historical price information to build a predictive model. This just makes a data scientist's life a lot easier!

Day traders use technical analysis to predict the direction of prices by finding patterns in past market data, and finding patterns in data is something neural networks are very good at. So who can blame us for building a model and playing around with the market data? However, will the market patterns of the past show up and generalize in the future?

Two useful Python libraries for stock trading:

1️⃣ PyAlgoTrade: helps you backtest stock trading strategies

2️⃣ TA-Lib: is widely used by trading software developers who carry out technical analysis of financial market data

Having more followers will encourage me to write more articles.

Keywords: Stock Market, Price Prediction, Time Series Forecasting

Jumei Lin

Entrepreneur, Writes about artificial intelligence, AWS, and mobile app. @Lumei Digital