Predicting the Trading Volume of the Google Stock

Sachchithananthan Thanusan
Published in Analytics Vidhya · 11 min read · Jun 20, 2021

Dataset: Click

Setup for Reproducible Results

To get reproducible results, you will have to follow the steps below; otherwise you will get different results on every run:

1 ) Set the PYTHONHASHSEED environment variable at a fixed value

2 ) Set the python built-in pseudo-random generator at a fixed value

3 ) Set the numpy pseudo-random generator at a fixed value

4 ) Set the tensorflow pseudo-random generator at a fixed value

5 ) Configure a new global tensorflow session

# Seed value
# Apparently you may use different seed values at each stage
seed_value= 0
# 1. Set the `PYTHONHASHSEED` environment variable at a fixed value
import os
os.environ['PYTHONHASHSEED']=str(seed_value)
# 2. Set the `python` built-in pseudo-random generator at a fixed value
import random
random.seed(seed_value)
# 3. Set the `numpy` pseudo-random generator at a fixed value
import numpy as np
np.random.seed(seed_value)
# 4. Set the `tensorflow` pseudo-random generator at a fixed value
import tensorflow as tf
tf.random.set_seed(seed_value)
# tf.random.set_seed is the TF 2.x API; for TF 1.x-style graph code use:
# tf.compat.v1.set_random_seed(seed_value)
# 5. Configure a new global `tensorflow` session
from keras import backend as K
session_conf = tf.compat.v1.ConfigProto(intra_op_parallelism_threads=1, inter_op_parallelism_threads=1)
sess = tf.compat.v1.Session(graph=tf.compat.v1.get_default_graph(), config=session_conf)
K.set_session(sess)
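# Note: if your Keras version no longer provides `K.set_session`, the
# tf.keras equivalent (an assumption, not from the original post) is:
# tf.compat.v1.keras.backend.set_session(sess)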

By following the above code you can get reproducible results.

Next, we have to import some libraries. Most Python libraries provide a set of useful functions that eliminate the need to write code from scratch, so I have imported the libraries below.

import warnings
warnings.filterwarnings('ignore')
import os
import numpy as np
import pandas as pd
import scipy.stats as stats
from matplotlib import pyplot as plt
%matplotlib inline
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn import preprocessing

Mount Google Drive and Load the Data Set

I have added the code below to mount my Google Drive account in Google Colaboratory so that the notebook can access the files stored on Drive. When we execute it, we have to open the generated URL and paste the authorization code back into the notebook.

from google.colab import drive
drive.mount('/content/drive')

After entering the authorization code, you will see the message 'Mounted at /content/drive'. Thereafter, the code below can read the files available on your connected Google Drive.

import pandas as pd
df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/data/Google_Stock_Price.csv')

The code below shows the top 10 rows stored in that variable. The pandas library provides a method called head(), which is widely used to return the top n rows of a DataFrame or Series; by default it returns the top 5 rows.

df.head(10)

With the code below we can check the number of records available in the data set.

len(df)

The output of the above code is 1258, so there are 1258 records in the data set.

Step 01 — Data Pre-Processing

Data pre-processing is a key step: the useful information that can be derived from the data set directly affects model quality, so it is extremely important to do at least the necessary pre-processing before feeding the data into the model.

Step 01 A — Handle Missing Values

With the code below we can check for missing values in the data set.

df.isnull().any()

From the above output we can conclude that there are no missing values in the data set, so we can move on to the next step.
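If the check had reported missing values, a minimal, hypothetical way to handle them for this kind of time-series data would be a forward fill (not needed for this data set):

# Hypothetical handling, only needed if df.isnull().any() had reported True:
# forward-fill each gap with the previous day's value, then drop any rows
# that are still incomplete (for example a missing first row).
df = df.ffill().dropna()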

Step 01 B — Handle Duplicate Values

With the code below we can check for duplicate rows in the data set.

print(df.duplicated().value_counts()) # To check duplicated values

From the above output we can conclude that there are no duplicated rows in the data set.

Step 01 C — Handle Outlier Values

A) Check for outliers in the 'Open' column

The opening price is the price at which a security first trades when an exchange opens for the day. An opening price is not identical to the previous day’s closing price.

plt.rcParams["figure.figsize"] = (24, 3)
temp_df = pd.DataFrame(df, columns=['Open'])
temp_df.boxplot(vert=False)
from matplotlib import pyplot
plt.rcParams["figure.figsize"] = (24, 8)
plt.plot(df['Open'])
plt.title("Google Stock Open Price Changes")
plt.xlabel("Time")
plt.ylabel("Open Price")
plt.show()

Looking at the above outputs, we can say that there are no considerable issues in this feature.

B) Check for outliers in the 'High' column

The high is the highest price at which a stock traded during a period.

plt.rcParams["figure.figsize"] = (24, 3)
temp_df = pd.DataFrame(df, columns=['High'])
temp_df.boxplot(vert=False)
plt.rcParams["figure.figsize"] = (24, 8)
plt.plot(df['High'])
plt.title("Google Stock High Price Changes")
plt.xlabel("Time")
plt.ylabel("High Price")
plt.show()

Looking at the above outputs, we can say that there are no considerable issues in this feature.

C) Check for outliers in the 'Low' column

The low is the lowest price of the period.

plt.rcParams["figure.figsize"] = (24, 3)
temp_df = pd.DataFrame(df, columns=['Low'])
temp_df.boxplot(vert=False)
plt.rcParams["figure.figsize"] = (24, 8)
plt.plot(df['Low'])
plt.title("Google Stock Low Price Changes")
plt.xlabel("Time")
plt.ylabel("Low Price")
plt.show()

Looking at the above outputs, we can say that there are no considerable issues in this feature.

D) Check for outliers in the 'Close' column

The closing price is the last price at which a security traded during the regular trading day. A security’s closing price is the standard benchmark used by investors to track its performance over time. The closing price will not reflect the impact of cash dividends, stock dividends, or stock splits.

df['Close'] = df['Close'].str.replace(',','')
df['Close'] = df['Close'].astype('float')
plt.rcParams["figure.figsize"] = (24, 3)
temp_df = pd.DataFrame(df, columns=['Close'])
temp_df.boxplot(vert=False)
plt.rcParams["figure.figsize"] = (24, 8)
plt.title("Google Stock Close Price Changes")
plt.xlabel("Time")
plt.ylabel("Close Price")
plt.plot(df['Close'])
plt.show()

Looking at the above outputs, we can see that there are considerable issues in this feature.

So I checked the condition below:

df[ df['High']< df['Close']]

I dropped the Close column because most of its values are greater than the High price, which is practically impossible.

df = df.drop('Close', axis = 1)

E) Check for outliers in the 'Volume' column

plt.rcParams["figure.figsize"] = (24, 3)
df['Volume'] = df['Volume'].str.replace(',','')
df['Volume'] = df['Volume'].astype('float')
temp_df = pd.DataFrame(df, columns=['Volume'])
temp_df.boxplot(vert=False)
plt.rcParams["figure.figsize"] = (24, 8)
plt.title("Google Stock Volume Changes")
plt.xlabel("Time")
plt.ylabel("Volume")
plt.plot(df['Volume'])
plt.show()

Looking at the above outputs, we can say that there are no considerable issues in this feature.

Step 02 — Feature Coding

ANN models require all input and output values to be numerical. So if your dataset has categorical data, you must encode it into numbers before fitting and evaluating a model. There are several methods available for this task, such as one-hot encoding and integer (label) encoding. Here I have used one-hot encoding.
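As a quick illustration of what one-hot encoding does (a toy example, not part of this pipeline), pd.get_dummies turns a single categorical column into one binary column per category:

import pandas as pd
# Toy day-of-week column, one-hot encoded the same way as Day_week later on
toy = pd.DataFrame({'Day_week': ['Monday', 'Tuesday', 'Monday', 'Friday']})
print(pd.get_dummies(toy['Day_week'], prefix='W').astype(int))
#    W_Friday  W_Monday  W_Tuesday
# 0         0         1          0
# 1         0         0          1
# 2         0         1          0
# 3         1         0          0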

What Is the Weekend Effect?

The weekend effect is a phenomenon in financial markets in which stock returns on Mondays are often significantly lower than those of the immediately preceding Friday. The weekend effect is also known as the Monday effect.

What is the end of month Effect?

The ‘End of Month’ effect has been the subject of many scientific studies. Statistics show that stock prices, and in particular US stock prices, tend to go up during the last days and the first days of the month.

What Is the Best Month to Buy Stocks?

The markets tend to have strong returns around the turn of the year as well as during the summer months. September is traditionally a down month. The average return in October is positive historically, despite the record drops of 19.7% and 21.5% in 1929 and 1987.

Considering the above patterns, I have added an additional column, Day_week, to show the day of the week. In addition, I have also derived Month and Day columns.

df['Date'] = pd.to_datetime(df['Date'])
df['Day_week'] = df['Date'].dt.day_name()
#df['Year'] = pd.DatetimeIndex(df['Date']).year
df['Month'] = pd.DatetimeIndex(df['Date']).month
df['Day'] = pd.DatetimeIndex(df['Date']).day
df = df.drop('Date', axis = 1)
df.head(10)

After deriving some additional features from the date,

month_dummies = pd.get_dummies(df['Month'],prefix='M')
df=pd.concat([df, month_dummies], axis=1)
day_dummies = pd.get_dummies(df['Day_week'],prefix='W')
df=pd.concat([df, day_dummies], axis=1)
dayNum_dummies = pd.get_dummies(df['Day'],prefix='D')
df=pd.concat([df, dayNum_dummies], axis=1)
df = df.drop('Month', axis = 1)
df = df.drop('Day_week', axis = 1)
df = df.drop('Day', axis = 1)
df.head(10)

Modified data set,

Step 03 — Data Transformations

To check whether any transformation is needed, the code below plots the distributions of the continuous price features. (Here X refers to the working feature DataFrame, i.e. the pre-processed data set from the previous steps.)

plt.rcParams["figure.figsize"] = (24, 12)
X[['High','Low','Open']].hist()

No transformation is applied to the Volume column because it is the target (output) value.
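As a complementary numeric check to the histograms (a small sketch of my own, assuming df still holds the continuous columns), the skewness of each continuous feature can be printed; values far from 0 would suggest that a transformation could be worthwhile:

# scipy.stats was imported earlier as `stats`; skewness near 0 means roughly symmetric
for col in ['Open', 'High', 'Low', 'Volume']:
    print(col, 'skewness:', stats.skew(df[col]))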

Step 04 — Scale and/or Standardize the Features

With the code below I have removed the categorical (dummy) columns from the data set for scaling purposes:

dummies = day_dummies.columns.tolist() + month_dummies.columns.tolist() + dayNum_dummies.columns.tolist()
Remove_columns_values = dummies
X_without_Cat = X.drop(Remove_columns_values, axis = 1)
X_without_Cat.head(5)

After removing categorical columns,

With the code below I have scaled the continuous values using MinMaxScaler from sklearn:

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
data_training = scaler.fit_transform(X_without_Cat)
data_training
columns_value_new=X_without_Cat.columns
X_Scaled_Except = pd.DataFrame(data_training, columns=columns_value_new)
plt.rcParams["figure.figsize"] = (24, 12)
X_Scaled_Except.hist()

The scaling/standardizing effect:

With the code below I have fitted a separate scaler on the Volume column so that the model output can later be inverse-transformed back to an actual volume:

scalerVol= MinMaxScaler()
data_trainingVol= scalerVol.fit_transform(X_without_Cat.iloc[:,3:4])
data_trainingVol

Step 05 — Correlation Matrix

import seaborn as sns
plt.rcParams["figure.figsize"] = (24, 8)
sns.heatmap(X_Scaled_Except.corr(),annot=True);
X_Scaled_Except.corr()

From the above table and correlation matrix we can see that there is a high correlation between Low, High, and Open. In principle we could keep only one of them, but because the variance between these prices is small, I have chosen to keep all of the features.

data_Final = X_Scaled_Except
for f in dummies:
    data_Final = data_Final.join(X[f])

Step 06 — Recurrent Neural Network

The function below splits a multivariate sequence into samples:

def split_series(series, n_past):
    X, y = list(), list()
    for window_start in range(len(series)):
        past_end = window_start + n_past
        if past_end >= len(series):
            break
        # slicing the past and future parts of the window
        past, future = series[window_start:past_end, 0:4], series[past_end, 3]
        X.append(past)
        y.append(future)
    return np.array(X), np.array(y)

I have used the above function to create the samples with the code below.

X, y = split_series(data_Final.to_numpy(), 6)

I have considered the past 6 days' records to predict the 7th day's Volume.
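Each sample X[i] is therefore a window of the last 6 days of the four scaled continuous features (Open, High, Low, Volume), and y[i] is the scaled Volume of the following day. A quick shape check (the counts assume the 1258-row data set):

print(X.shape, y.shape)   # (1252, 6, 4) (1252,) -> 1258 rows give 1258 - 6 = 1252 windows
print(X[0].shape)         # (6, 4): 6 past days x 4 features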

import math
n_test = math.floor(len(y)*0.2)
X_train, X_test, y_train, y_test = X[:-n_test], X[-n_test:], y[:-n_test], y[-n_test:]
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

I have split the data set into 20% for testing and 80% for training. I used the code above rather than a random split because, for time-series data, we have to split the data set without changing the order.

Long Short-Term Memory (LSTM)

Long Short-Term Memory (LSTM) networks are a type of recurrent neural network capable of learning order dependence in sequence prediction problems.

With the code below you can define the model.

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout
from tensorflow.keras.layers import RNN
regressor = Sequential()
regressor.add(LSTM(units = 60, activation = 'relu', return_sequences = True, input_shape = (6, 4)))
regressor.add(Dropout(0.2))
regressor.add(LSTM(units = 60, activation = 'relu', return_sequences = True))
regressor.add(Dropout(0.2))
regressor.add(LSTM(units = 80, activation = 'relu', return_sequences = True))
regressor.add(Dropout(0.2))
regressor.add(LSTM(units = 120, activation = 'relu'))
regressor.add(Dropout(0.2))
regressor.add(Dense(units = 1))
regressor.summary()

Compilation of the model

The code below compiles the model. I have used the 'adam' optimizer and 'mean squared error' as the loss. As part of the optimization algorithm, the error for the current state of the model has to be estimated repeatedly; this loss estimate is then used to update the model's weights so that the loss is reduced at the next step.

regressor.compile(optimizer='adam', loss = 'mean_squared_error')

Fit the Model

The batch size controls the number of training samples to work through before the model's internal parameters are updated. The number of epochs controls the number of complete passes through the training dataset. I have set the number of epochs to 150, so 150 complete passes through the training dataset will happen.

history=regressor.fit(X_train, y_train, epochs=150, batch_size=32, verbose=2,validation_data=(X_test, y_test) )

Evaluate the model

With the code below we can get the mean squared error (MSE), which tells you how close the predictions are to the actual values: it takes the differences between them and squares them. Root mean squared error (RMSE) is the square root of the MSE.

from numpy import sqrt
mse = regressor.evaluate(X_test, y_test, verbose=0)
print('MSE: %.9f, RMSE: %.9f' % (mse, sqrt(mse)))

Here I have received MSE: 0.000936066, and RMSE: 0.030595192

Plot learning curves

With the code below we can draw the learning curves for the model. We can use the learning curves to diagnose problems with learning, such as underfitting or overfitting of the model during training.

plt.figure(figsize=(14,5))
from matplotlib import pyplot
pyplot.title('Learning Curves')
pyplot.xlabel('Epoch')
pyplot.ylabel('Loss (Mean Squared Error)')
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='val')
pyplot.legend()
pyplot.show()

On a learning curve, a good fit lies between the overfit and underfit conditions. We can say the fit is good when both the training and validation loss decrease to a point of stability with a minimal gap between the two final loss values.

Prediction

With the code below we can predict the volume for the given inputs. However, the output is not yet the actual volume, because the data was scaled before being fed to the model, so we have to rescale the output value to get the predicted volume.

y_pred = regressor.predict(X_test)

Rescaling Output

We can rescale the model output to get the volume with the code below. I have used the scaler's inverse_transform function to do this.

y_predVol = scalerVol.inverse_transform(y_pred)
y_test = y_test.reshape(y_test.shape[0], 1)
y_testVol = scalerVol.inverse_transform(y_test)
volume_pred = []
for i in range(len(y_predVol)):
    volume_pred.append(y_predVol[i, 0])
volume_test = []
for i in range(len(y_testVol)):
    volume_test.append(y_testVol[i, 0])

Visualizing the results

With the code below we can plot the real Google volume against the predicted Google volume. The real volume is shown as a red line and the predicted volume as a blue line.

plt.figure(figsize=(14,5))
plt.plot(volume_test, color = 'red', label = 'Real Google Volume')
plt.plot(volume_pred, color = 'blue', label = 'Predicted Google Volume')
plt.title('Google Stock Volume Prediction')
plt.xlabel('Time')
plt.ylabel('Google Stock Volume')
plt.legend()
plt.show()

Calculate the accuracy of the model using the R² statistic

With the code below we can get the R² score of the model. The most common interpretation of R² is how well the regression model fits the observed data; generally, a higher R² indicates a better fit.

from sklearn import metrics
accuracy = metrics.r2_score(volume_test, volume_pred)
print(" Accuracy:", accuracy)

Accuracy: 0.33281983814745664
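For reference, the same value can be reproduced from the definition R² = 1 - SS_res / SS_tot (a small sketch using the lists built above):

import numpy as np
actual = np.array(volume_test)
predicted = np.array(volume_pred)
ss_res = np.sum((actual - predicted) ** 2)        # residual sum of squares
ss_tot = np.sum((actual - actual.mean()) ** 2)    # total sum of squares
print("R^2:", 1 - ss_res / ss_tot)                # should match metrics.r2_score above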
