Crash course in Forecasting

Published in AI Skunks · 10 min read · Apr 10, 2023

by Venkata Bhargavi Sikhakolli & Krishnakanth Jarapala

Introduction

Forecasting the future value of stocks is an important aspect of financial analysis that enables investors to make informed decisions about buying or selling shares. In this article, we will explore the process of predicting the closing price of a stock using New York Stock Exchange data. We will cover the essential steps involved in this process, including data exploration, data cleaning, data preprocessing, feature engineering, and model building. By the end of this article, you will have a clear understanding of the techniques used to predict stock prices and how they can be applied to real-world scenarios. Whether you are a beginner or an experienced data scientist, this article will provide you with valuable insights into the world of stock price forecasting.

  • Dataset: New York Stock Exchange Data
  • Dataset has 851264 Observations and 7 Features
  • We have stock data from 2010-01-04 to 2016-12-30
import numpy as np 
import pandas as pd
import os
import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import warnings
import seaborn as sns
warnings.filterwarnings("ignore")
# The dataset is large (~58 MB), so it was split into 3 parts. Let's load all three from GitHub and concatenate them.

stocks1 = pd.read_csv("https://raw.githubusercontent.com/jkkn31/KrishnakanthNaik/main/prices1.csv",parse_dates=['date'])
stocks2 = pd.read_csv("https://raw.githubusercontent.com/jkkn31/KrishnakanthNaik/main/prices2.csv",parse_dates=['date'])
stocks3 = pd.read_csv("https://raw.githubusercontent.com/jkkn31/KrishnakanthNaik/main/prices3.csv",parse_dates=['date'])
df = pd.concat([stocks1, stocks2, stocks3], ignore_index=True)  # ignore_index avoids duplicated row indices across the three parts

cols = ['date', 'symbol', 'open', 'close', 'low', 'high', 'volume']
stocks = df[cols]
stocks.head()
stocks.shape

(851264, 7)

The dataset has 851,264 observations with 7 features, including the timestamp.

Check the Datatypes of the Features

stocks.info()
stocks.isnull().sum()

As we can see, there is no missing data in the dataset.

stocks["symbol"].nunique()

We have 501 stocks in the dataset.

stocks.date.min(), stocks.date.max()

We have stock data from 2010-01-04 to 2016-12-30.

We have enough data for analysis. Since there are many companies in our dataset, we will pick one company at a time (Google, then Pfizer) for further analysis and model building.

Let’s Analyze Google Stock Price Data

google = stocks[stocks["symbol"] == 'GOOG'].copy()  # .copy() avoids SettingWithCopyWarning when we add columns later
google.head()
px.line(google,x="date",y=["open","close"],title="Difference between open and close prices of Google stocks")

We can see a sharp dip in the share price during March and April 2014; this is likely due to Google's stock split in April 2014.
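As a quick sanity check on the split hypothesis, we can flag unusually large day-over-day moves, since a drop of this size rarely comes from ordinary trading. This is a minimal sketch, and the -40% threshold is an arbitrary choice for illustration.

# Flag days where the close falls more than 40% versus the previous day,
# a pattern characteristic of a stock split rather than normal trading
daily_change = google.sort_values("date")["close"].pct_change()
print(google.loc[daily_change < -0.4, ["date", "close"]])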

px.line(google,x="date",y=["high","low"],title="Difference between high and low prices of Google stocks")

Plotting Bollinger Bands

Bollinger Bands represent the Volatility of a Stock. If a stock is more volatile it would have a wider Bollinger band.

  • BOLU=MA(TP,n)+m∗σ[TP,n]
  • BOLD=MA(TP,n)−m∗σ[TP,n]
  • BOLU — Upper Bollinger Band
  • BOLD — Lower Bollinger Band
  • MA — Moving Average

TP (typical price)=(High+Low+Close)÷3

σ[TP,n] = standard deviation of TP over the last n periods

  • n — Number of days in smoothing period (typically 20)
  • m — Number of standard deviations (typically 2)
google['TP'] = (google['close'] + google['low'] + google['high'])/3
google['std'] = google['TP'].rolling(20).std(ddof=0)
google['MA-TP'] = google['TP'].rolling(20).mean()
google['BOLU'] = google['MA-TP'] + 2*google['std']
google['BOLD'] = google['MA-TP'] - 2*google['std']

# Pass figsize to .plot() directly; a separate plt.figure() would create an unused empty figure
ax = google[['close', 'BOLU', 'BOLD']].plot(figsize=(20, 10), color=['blue', 'orange', 'yellow'])
ax.fill_between(google.index, google['BOLD'], google['BOLU'], facecolor='orange', alpha=0.1)
plt.show()

The Bollinger Band for Google is not very wide, which suggests the stock is not very volatile and could be a relatively stable investment.
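To put a number on "wide", we can compute the standard Bollinger band width, (BOLU − BOLD) / MA: higher values mean higher relative volatility. A quick sketch (the metric is standard; the column name is our own):

# Band width relative to the middle band: a simple relative-volatility gauge
google["band_width"] = (google["BOLU"] - google["BOLD"]) / google["MA-TP"]
px.line(google, x="date", y="band_width",
        title="Bollinger band width of Google (relative volatility)")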

Let’s Understand the Volume of the Stock

  • Stocks can be categorized as high volume or low volume, based on their trading activity.
  • High volume stocks trade more often.
  • Meanwhile, low volume stocks are more thinly traded. There’s no specific dividing line between the two.
  • However, high volume stocks typically trade at a volume of 500,000 or more shares per day.

If a stock with a high trading volume is rising, there is buying pressure, as investor demand pushes the stock to higher and higher prices. On the other hand, if the price of a stock with a high trading volume is falling, it means more investors are selling their shares.
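As a rough illustration of the 500,000-share rule of thumb above, we can check what fraction of Google's trading days would count as high volume (the threshold is the one quoted above, not a universal constant):

# Fraction of trading days with volume above the 500,000-share threshold
high_volume = google["volume"] > 500_000
print(f"High-volume days: {high_volume.mean():.1%} of all trading days")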

px.line(google,x="date",y=["volume"],title="Volume of stock traded")

We see a reduction in the volume of Google stock traded after 2014.

Let’s Deep Dive into Pfizer Stock

pfizer = stocks[stocks["symbol"] == "PFE"].copy()  # .copy() again, since we add columns below
px.line(pfizer,x="date",y=["open","close"],title="Difference between open and close prices of Pfizer stocks Since 2010")

Pfizer's stock price reached its highest level of the 2010-2016 window toward the end of the series.

pfizer['TP'] = (pfizer['close'] + pfizer['low'] + pfizer['high'])/3
pfizer['std'] = pfizer['TP'].rolling(20).std(ddof=0)
pfizer['MA-TP'] = pfizer['TP'].rolling(20).mean()
pfizer['BOLU'] = pfizer['MA-TP'] + 2*pfizer['std']
pfizer['BOLD'] = pfizer['MA-TP'] - 2*pfizer['std']
# As above, pass figsize to .plot() directly instead of opening a separate figure
ax = pfizer[['close', 'BOLU', 'BOLD']].plot(figsize=(20, 10), color=['blue', 'orange', 'yellow'])
ax.fill_between(pfizer.index, pfizer['BOLD'], pfizer['BOLU'], facecolor='orange', alpha=0.1)
plt.show()

We see a similar trend with Pfizer: the Bollinger Band is tight, indicating limited volatility.

  • Stock prices have been steadily increasing; let's look at how the trading volume has evolved over the years.
pfe = pfizer.copy().reset_index()
pfe["date"] = pd.to_datetime(pfe["date"])
pfe.head()
plt.figure(figsize=(15,4))
pfe = pfe.set_index("date")
pfe['volume'].resample('Y').mean().plot.bar(title="Volume of pfe stocks over the years")
plt.show()

Volumes of the stock have seen an upward trend in the last few years.

Univariate Analysis:

  • Let's analyze each feature's distribution for Pfizer to better understand the data.
df1 = pfe.copy()
colors = ['orange','green']

fig=plt.figure(figsize=(20,8), tight_layout=True)
plt.suptitle("Distribution of the Continuous variables", size=20, weight='bold')
ax=fig.subplot_mosaic("""AB
CC
DE""")
sns.kdeplot(df1['high'], ax=ax['A'], color=colors[0], fill=True, linewidth=2)
sns.kdeplot(df1['low'], ax=ax['B'], color=colors[1],fill=True, linewidth=2)
sns.kdeplot(df1['open'], ax=ax['C'], color=colors[0],fill=True, linewidth=2)
sns.kdeplot(df1['close'], ax=ax['D'], color=colors[1],fill=True, linewidth=2)
sns.kdeplot(df1['volume'], ax=ax['E'], color=colors[0],fill=True, linewidth=2)


ax['B'].yaxis.set_visible(False)
ax['E'].yaxis.set_visible(False)
ax['A'].yaxis.label.set_alpha(0.5)
ax['C'].yaxis.label.set_alpha(0.5)
ax['D'].yaxis.label.set_alpha(0.5)

for s in ['top', 'left', 'bottom', 'right']:
    ax['A'].spines[s].set_visible(False)
    ax['B'].spines[s].set_visible(False)
    ax['C'].spines[s].set_visible(False)
    ax['D'].spines[s].set_visible(False)
    ax['E'].spines[s].set_visible(False)

Here we can see that volume appears to have outliers; let's verify this with box plots.

# Box plots of the continuous columns
fig=plt.figure(figsize=(20,8), tight_layout=True)
plt.suptitle("Boxplot of the Continuous variables", size=20, weight='bold')
ax=fig.subplot_mosaic("""AB
CC
DE""")
sns.boxplot(x= df1['high'], ax=ax['A'], color=colors[0], orient="h")
sns.boxplot(x= df1['low'], ax=ax['B'], color=colors[1], orient="h")
sns.boxplot(x=df1['open'], ax=ax['C'], color=colors[0], orient="h")
sns.boxplot(x=df1['close'], ax=ax['D'], color=colors[1], orient="h")
sns.boxplot(x=df1['volume'], ax=ax['E'], color=colors[1], orient="h")
ax['B'].yaxis.set_visible(False)
ax['E'].yaxis.set_visible(False)
ax['A'].yaxis.label.set_alpha(0.5)
ax['C'].yaxis.label.set_alpha(0.5)
ax['D'].yaxis.label.set_alpha(0.5)
for s in ['left', 'right', 'top', 'bottom']:
    ax['A'].spines[s].set_visible(False)
    ax['B'].spines[s].set_visible(False)
    ax['C'].spines[s].set_visible(False)
    ax['D'].spines[s].set_visible(False)
    ax['E'].spines[s].set_visible(False)

The volume data does seem to have extreme points, but these can't necessarily be treated as outliers: they may be genuine extremes from peak trading days.
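To quantify this, here is a small sketch that counts the points beyond the usual 1.5 × IQR whiskers, the same rule the box plot itself uses:

# Count volume observations outside the standard 1.5*IQR boxplot whiskers
q1, q3 = df1["volume"].quantile([0.25, 0.75])
iqr = q3 - q1
outside = (df1["volume"] < q1 - 1.5 * iqr) | (df1["volume"] > q3 + 1.5 * iqr)
print(f"{outside.sum()} of {len(df1)} volume observations fall outside the whiskers")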

Bivariate Analysis

df1.head()
df1.reset_index().columns
sns.pairplot(df1[['open', 'close', 'low', 'high', 'volume']], corner=True)  # 'symbol' is non-numeric, so drop it

Visually, we can see that open, close, low, and high are linearly correlated with each other.

df1[['open', 'close', 'low', 'high', 'volume']].corr()['close']  # correlation of each numeric feature with close

Let's Split the Dataset into Train and Test

  • Since we are working with time series data, we should split it chronologically: the test data must come after the training data in time, so that we predict the future from historical data. Here we hold out the last 180 trading days as the test set.
pfe.head()
pfe.reset_index().columns
columns = ['date', 'symbol', 'open', 'close', 'low', 'high', 'volume']
train = pfe.reset_index()[columns][:-180].set_index("date")
test = pfe.reset_index()[columns][-180:].set_index("date")


test.head()
plt.plot(train["close"], label='Training set', color='orange')
plt.plot(test["close"], label='Test set', color='red')
plt.legend();
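If you want more than a single chronological split, scikit-learn's TimeSeriesSplit offers cross-validation that respects the same ordering constraint: every fold trains on an expanding window of past data and validates on the block that immediately follows. A minimal sketch (the choice of 5 splits is arbitrary):

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(tscv.split(pfe)):
    # validation indices always come after the training indices
    print(f"Fold {fold}: train rows 0-{train_idx[-1]}, validate rows {val_idx[0]}-{val_idx[-1]}")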

Let's Fit a Linear Regression Model

Standardize the data so that each feature has zero mean and unit standard deviation before model building.

# Standardize each feature to zero mean and unit standard deviation.
# Note: the plain linear regression below is fit on the raw features;
# OLS predictions are unchanged by feature scaling, so X_train and X_test
# are shown for completeness.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(train.drop(['symbol'], axis=1))
X_test = scaler.transform(test.drop(['symbol'], axis=1))

xtrain = train[['volume','open']]
xtest = test[['volume','open']]
ytrain = train["close"]
ytest = test["close"]

Load Linear Regression Model

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
model = LinearRegression()
model.fit(xtrain,ytrain)
pred=model.predict(xtest)

Checking the Goodness of the Fit and Model Metrics

sc = np.round(model.score(xtest, ytest), 2) * 100  # .score() for a regressor returns R², not classification accuracy
r2 = np.round(r2_score(ytest,pred),2)
mse = np.round(mean_squared_error(ytest,pred),2)
mae = np.round(mean_absolute_error(ytest,pred),2)

fig=plt.figure(figsize=(15,6))
p=pd.Series(pred, index=ytest.index)
plt.plot(ytest)
plt.plot(p)
plt.legend(['Test Set','Predicted Set'])
plt.title("Compare Test and Predicted Values", size=20, weight='bold')
print("\n--------------------- Here are the Model Metrics ---------------------")
print('Model score (R²) : {} %'.format(sc))
print('R2 Score : {}'.format(r2))
print('Mean Squared error : {}'.format(mse))
print('Mean Absolute error : {}\n'.format(mae))
--------------------- Here are the Model Metrics ---------------------
Model score (R²) : 97.0 %
R2 Score : 0.97
Mean Squared error : 0.07
Mean Absolute error : 0.2

We can see that the linear model's performance is very good and stable. Bear in mind, though, that we are predicting the close from the same day's open and volume, and open and close are strongly correlated, so a high R² is expected.
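A useful sanity check is a naive persistence baseline that predicts today's close as yesterday's close; this baseline is our addition, not part of the original pipeline. If its error is close to the model's, the regression adds little beyond what the most recent price already tells us.

# Naive persistence baseline: predict close(t) as close(t-1)
naive_pred = test["close"].shift(1).dropna()
naive_mae = mean_absolute_error(test["close"].iloc[1:], naive_pred)
print(f"Persistence baseline MAE: {naive_mae:.2f}")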

Let’s Check the Stationarity of Time Series Data

Stationarity means that the time series has:

  • a constant mean that does not depend on time
  • a constant variance that does not depend on time
  • an autocovariance that depends only on the lag, not on time
sns.lineplot(x=pfe.index,y=pfe["open"])
sns.lineplot(x=pfe.index,y=pfe["open"].rolling(52).mean(), color='black', label='Rolling Mean (52)')
sns.lineplot(x=pfe.index,y=pfe["open"].rolling(52).std(), color='orange', label='Rolling STD (52)')
plt.title("Checking stationary of Timeseries Data by smoothing for last 52 Days")

We can clearly see that our data is not stationary. Models like ARIMA would not work well on it directly, since they assume the underlying series is stationary.

Let’s Transform the data to see if we can make the series stationary

# np.log of negative values returns NaN, so take the absolute value first
pfe['Open_log'] = np.log(abs(pfe['open']))

sns.lineplot(x=pfe.index,y=pfe["Open_log"])
sns.lineplot(x=pfe.index,y=pfe["Open_log"].rolling(52).mean(), color='black', label='Rolling Mean (52)')
sns.lineplot(x=pfe.index,y=pfe["Open_log"].rolling(52).std(), color='orange', label='Rolling STD (52)')
plt.title("Checking stationary of Timeseries Data for log(open)")

We are getting closer to a constant mean and standard deviation, so now let's difference the series, first once and then a second time, and check again!

# First-order difference of the open price, padded with a leading 0
ts_diff_1 = np.diff(pfe['open'])
pfe['Open_diff_1'] = np.append([0], ts_diff_1)

# Second-order difference: difference the first-order series again
ts_diff_2 = np.diff(pfe['Open_diff_1'])
pfe['Open_diff_2'] = np.append([0], ts_diff_2)

sns.lineplot(x=pfe.index,y=pfe["Open_diff_2"])
sns.lineplot(x=pfe.index,y=pfe["Open_diff_2"].rolling(52).mean(), color='black', label='Rolling Mean (52)')
sns.lineplot(x=pfe.index,y=pfe["Open_diff_2"].rolling(52).std(), color='orange', label='Rolling STD (52)')
plt.title("Checking stationary of Timeseries Data for SOD(open)")

We can clearly see that with second-order differencing, our data has an almost constant mean and standard deviation.

  • Let's confirm stationarity with the Augmented Dickey-Fuller (ADF) test.
from statsmodels.tsa.stattools import adfuller
result = adfuller(pfe["Open_diff_2"].values)
print(f"ADF statistic  : {result[0]:.2f}")   # more negative = stronger evidence of stationarity
print(f"p-value        : {result[1]:.4f}")
print(f"Critical values: {result[4]}")       # thresholds at the 1%, 5%, and 10% levels

The ADF test on the second-order differenced series gives a p-value of 0.00, which is below the 0.05 significance level, and the test statistic is below the 1% critical value, so we conclude that the series is stationary.
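Since second-order differencing makes the series stationary, a natural next step is an ARIMA model with d = 2. Here is a minimal sketch using statsmodels; the (1, 2, 1) order is an illustrative assumption, and in practice p and q would be chosen from ACF/PACF plots or by AIC.

# A minimal ARIMA sketch: d=2 matches the differencing above, while
# p=1 and q=1 are illustrative guesses rather than tuned values
from statsmodels.tsa.arima.model import ARIMA

arima_train = pfe['open'][:-180]          # hold out the last 180 days, as before
arima_fit = ARIMA(arima_train, order=(1, 2, 1)).fit()
forecast = arima_fit.forecast(steps=180)  # forecast over the held-out horizon
print(arima_fit.summary())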

pfe.head()

References:

  1. Scikit-learn documentation
  2. Medium articles
  3. Analytics Vidhya articles
  4. Towards Data Science articles
  5. Kaggle notebooks

I used Python libraries and developed my own functions to plot the charts, compute the required metrics, and calculate the required information; every line of code was written by me and not copied from anywhere. For data exploration, I referred to a few Towards Data Science, Kaggle, and Analytics Vidhya articles and built my understanding from them. For feature scaling, I referred to an Analytics Vidhya article to learn more about encoders.

License

All code in this article is available as open source through the MIT license.

All text and images are free to use under the Creative Commons Attribution 3.0 license. https://creativecommons.org/licenses/by/3.0/us/

These licenses let people distribute, remix, tweak, and build upon the work, even commercially, as long as they give credit for the original creation.

Copyright 2023 AI Skunks https://github.com/aiskunks

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
