Predicting Stocks

Rishab Das
Published in The Deep Hub · Jun 17, 2024

My goal for this project is to be able to accurately predict the price of a stock. At the end of the code, I want there to be a function that predicts the value of any stock at the user's request. The request includes the starting date and the end date, so it could span anywhere from the beginning of the company to the day of the request.

So how will we do this? There are a few key parts to my project.

  • Get the data
  • Get to know how to use the data
  • Predict on the data (using the best model out of five model choices)
  • Make a big function that someone could use to just do all the little things my project does at once

Along the way, I intend to learn how to analyze time series data, and also understand how time series data works.

Let us Begin!!!

Get the Data

For this, I will write a simple function that uses the pandas_datareader library. It just makes life easy: you give it a start date and an end date and it gets the data. I don't know how often it's updated, I think it's just based on the most recent data, but frankly I don't care.

import datetime as dt
import yfinance as yfin
import pandas_datareader.data as web

def save_to_csv(ticker, syear, smonth, sday, eyear, emonth, eday):
    start = dt.datetime(syear, smonth, sday)
    end = dt.datetime(eyear, emonth, eday)

    yfin.pdr_override()  # patches pandas_datareader so get_data_yahoo is served by yfinance (the fix I found on Stack Overflow)

    df = web.get_data_yahoo(ticker, start, end)  # we overrode with yfinance
    df.to_csv("/Users/rish/Finance/Project" + ticker + ".csv")  # e.g. saves ProjectAMZN.csv in that folder

    return df

Now that we have this function, let's get the data and analyze it.

save_to_csv("AMZN", 2020, 9, 5, 2024, 1, 1)

EDA (Exploratory Data Analysis)

Analyzing stock market data, in my opinion, is unlike analyzing any other type of data. Other types of data don't include time, and if they do, they aren't really affected by it.

And with stock market data, the trends are clearly visible and there isn't much to analyze, but we will figure out how many samples we have, and we will reduce the data to just the date and the closing price.

import pandas as pd

df = pd.read_csv("DataAMZN.csv", usecols=["Date", "Close"])

len(df) # 587 samples

It only has 587 samples, but the difference in days between September 5th 2020 and January 1st 2024 is 1,213 days. Weekends and market holidays explain a lot of those missing days, but probably not all of them. Who knows, 2020 was a weird year, and maybe this API doesn't have all the data. Who knows, but I don't care.
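If you want to sanity check that gap, pandas can count the weekdays in the window. This is only a rough check, since it ignores exchange holidays:

import pandas as pd

# Rough sanity check: number of weekdays between the two dates (ignores market holidays)
weekdays = pd.bdate_range("2020-09-05", "2024-01-01")
print(len(weekdays))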

Now that I think about it, and after reading an article with examples of EDA, I don't really know how to do EDA on stock market data. Also, do I really need to get to know the data? I know the dimensions and all of that, and there is just a date column which will be of type object. That's all simple. I don't feel as if EDA is necessary here, and there isn't much benefit to it; all I really need to do is display the graph of the data, and then we can just see. There is no point in trying to figure out the shape of the data when there is no trend except (usually) going up. Let me know if this is a wrong way of thinking.

import matplotlib.pyplot as plt
import seaborn as sns


plt.figure(figsize=(10,6))
sns.lineplot(x="Date", y="Close", data=df)
Graph of Amazon Data

Something I can see right off the top is that the faster the numbers climb, the faster they come down. But that doesn't matter, I just want to predict the future, I think. So let's split into train and test. But first I need to explain how.

Train Test Split

Explanation: We can't just train test split this problem with scikit-learn. The data we are using doesn't really have a feature/target setup. Well, the closing price is the target variable, I guess, but the date isn't really an independent variable. Think about it: there isn't really an independent variable here (with the data we have, anyway; maybe if we analyzed the sentiment of the news and the reports that the company has given, then we would have one), but for now we don't.

Also, this data has time in it, and most other data doesn't. Since train test split shuffles the data randomly, the time order would get messed up, and we wouldn't be able to "predict" the future since we would have random data points in random places. Anyways, let's get to coding.

The Wrong Way

from sklearn.model_selection import train_test_split

# X_train, X_test, y_train, y_test = train_test_split(...)  # what do we even use as a target here?

The Correct Way

dates = df["Date"]
close = df["Close"]
split_size = round(len(dates) * 0.8)  # 80% for training, 20% for testing

dates_train = dates[:split_size]
close_train = close[:split_size]

dates_test = dates[split_size:]
close_test = close[split_size:]

print(len(dates_train), len(close_train), len(dates_test), len(close_test))

plt.scatter(dates_train, close_train)
plt.scatter(dates_test, close_test)

Now, when I look at the code above, I throw up a little; holy crap is that ugly looking code. We can write it in a much shorter way, but the idea is there, and below is the better way. They both do the exact same thing, one is just better looking code (something I'm trying to work on).

plt.figure(figsize=(10, 7))

train_dates, train_close = dates[:split_size], close[:split_size]
test_dates, test_close = dates[split_size:], close[split_size:]

plt.scatter(train_dates, train_close, s=5)
plt.scatter(test_dates, test_close, s=5)
plt.title("AMZN")

That looks so much better. Copying and pasting that is much easier than the version above. Here are the results:

Plot for train test split of AMZN data

Now, the next step is to find a good model, and I will test a few out. I am reading through a textbook: https://otexts.com/fpp3/. Hopefully it will give me a few models; if it can't, then I will use GPT, but I will try and learn. It will take me a while.

Models

Here are the algorithms I have chosen (I don't know the right name to call them, models??):

  1. Naive Method (naive forecast)
  2. Drift Method
  3. AR Model (Autoregression)

I will explain each one and then I will implement them. I have used the naive method before, drift is a variation of the naive method, and the AR model is something I found in the reading (and by asking ChatGPT). I will explain what I asked ChatGPT when I am explaining the AR model.

Naive Method

I have used the naive method before, and it's really popular because of its simplicity and usability. Life is smooth with the naive method. It is an algorithm that simply uses the previous value as the next forecast value. It's represented by the equation:

Equation for the Naive Method
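For reference, the naive forecast as written in the fpp3 textbook is:

\hat{y}_{T+h|T} = y_T

In words: every forecast, no matter how many steps h ahead, is just the last observed value y_T.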

And it's super easy to implement, I think. I think you just set the previous value as the next one by shifting the whole thing by 1 day so that it just continues its trend. Or you put the beginning of the forecast at the end of the data. I think, but I will look at an example to figure out how to do it. Before we do that, the notebook I was looking through has a nice plot function, so I'm going to be using it. Credit to the great Daniel Bourke for this.


# Create a function to plot time series data
def plot_time_series(timesteps, values, format='.', start=0, end=None, label=None):
    """
    Plots timesteps (a series of points in time) against values (a series of values across timesteps).

    Parameters
    ----------
    timesteps : array of timesteps
    values : array of values across time
    format : style of plot, default "."
    start : where to start the plot (setting a value will index from start of timesteps & values)
    end : where to end the plot (setting a value will index from end of timesteps & values)
    label : label to show on plot of values
    """
    # Plot the series
    plt.plot(timesteps[start:end], values[start:end], format, label=label)
    plt.xlabel("Time")
    plt.ylabel("AMZN Price")
    if label:
        plt.legend(fontsize=14)  # make label bigger
    plt.grid(True)

Now let's get to doing the naive forecast. Again, the notebook is by Daniel Bourke. We just need to set the index back by one? I think that's what this code does.

dates = df["Date"]
close = df["Close"]

split_size = round(len(dates) * 0.8)

train_dates, train_close = dates[:split_size], close[:split_size]
test_dates, test_close = dates[split_size:], close[split_size:]

# This code is from the notebook
naive_forecast = test_close[:-1]

plt.figure(figsize=(10, 7))
plot_time_series(timesteps=train_dates, values=train_close, label="Train data")
plot_time_series(timesteps=test_dates, values=test_close, label="Test data")
plot_time_series(timesteps=test_dates[1:], values=naive_forecast, format="-", label="Naive forecast");
Graph of Time series

We can see that it's really, really accurate. Nothing else will be this accurate, but we can try and see, and even if something is, nothing will be this simple to build while being this accurate. This is a very powerful algorithm. Next up is the Drift Method.

Drift Method

The drift method is a variation of the naive method. It "drifts" the forecast, allowing it to "increase or decrease over time". The equation for this is much, much more complicated and I have literally no clue how it works, but I will use ChatGPT to figure out what it means. Here is the equation:

Equation for Drift Method
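For reference, here is the drift forecast as the fpp3 textbook writes it:

\hat{y}_{T+h|T} = y_T + h\left(\frac{y_T - y_1}{T-1}\right)

So the forecast starts from the last observed value y_T and moves up or down by the average historical change for every step h into the future.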

You basically just use a line and predict into the future with some rate of change up or down. I think a better wording would be: a number that allows for some change in the forecast. I don't know, it's just a number to push the individual forecasts (I think) up or down. I don't know yet which part of the equation that value is represented by, but I think it's the h(blah blah blah) part. I think, but you can correct me if I'm wrong.

Anyways, let us implement it. I think I just need to do what I did before and then add that (y_T − y_1)/(T − 1) term to each value, or something. But I need to know what those values are. After looking it up with GPT, I was correct: h is just the number of forecasts you want, which makes sense. So we could set the h value to the length of our dataset minus 1 and then find the average change over time, which is the (y_T − y_1)/(T − 1) part. But I don't think we need to do that; if you look at the naive method equation, I think we just add the drift value to it. So let us do that in code. We just copy the code and then find the average change by doing something in pandas or numpy, I don't know. Let's code it!

df["shifted_column"] = df["Close"].shift()
df["difference"] = df['Close'] - df["shifted_column"]
df['difference'] = df['difference'].abs()
df["difference"].abs()
average_roc = df['difference'].mean()

The above code finds the average change over time. Since we are just doing the entire test dataset we don't need an h value. I think. Now let's do the drift forecast.

drift_forecast = test_close[:-1] + average_roc

As you can see, we just add the average rate of change (roc) to the naive forecast. Let's plot.

plot_time_series(timesteps=train_dates, values=train_close, label="Train data")
plot_time_series(timesteps=test_dates, values=test_close, label="Test data")
plot_time_series(timesteps=test_dates[1:], values=drift_forecast, format="-", label="Drift forecast");
Plot of the Drift Forecast

As you can see from the plot, it just shifted the whole predicted forecast up by a certain amount, and that certain amount is the average rate of change. I think this is correct; correct me if I am wrong. Now let us go on to our AR model (autoregression).
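For comparison, the textbook version of drift doesn't shift everything by a constant; it draws a line out from the last training value. Here is a minimal sketch of that, reusing the train_close and test_close variables from earlier and using signed (not absolute) changes:

import numpy as np

# Textbook-style drift: a straight line starting at the last training value,
# rising (or falling) by the average historical change per step.
y_T, y_1, T = train_close.iloc[-1], train_close.iloc[0], len(train_close)
slope = (y_T - y_1) / (T - 1)            # average change per day over the training data
h = np.arange(1, len(test_close) + 1)    # forecast horizons 1..len(test_close)
textbook_drift = y_T + h * slope         # one forecast per test day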

AR Model (Autoregression)

Autoregression is a regression technique that uses past values to predict future values. The main idea is that there is a linear relationship between the past observations (the past values) and the current value.

Equation representing AR
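For reference, the AR(p) model is usually written like this:

y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t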

So, this looks really complex, and I will try (to the best of my understanding) to explain it; please correct me if I am wrong. We have c, the φ (phi) coefficients, and the lagged values. y_t is the current time series value, the thing we want to predict, and y_{t−1}, y_{t−2}, ..., y_{t−p} are just past time series values. These are called lagged values, but they just mean past values. The φ symbols are the coefficients of the model. They quantify the influence of the past values on the current (to be predicted) value; they scale the past values up or down. I think. Then we have c, which just moves the whole thing up or down. The whole graph. Then we have ε, which is the error term, the shock value, the random shock that affects the whole thing.

Now the whole premise (I think) of this AR model hubba jubba is to compute an autocorrelation coefficient. This coefficient is between a lagged value and the current value. The higher the number, the greater the correlation between those two points in time. This value provides insight into the temporal dependence of the data. Wow, that sounds cool to say.

Now I have zero clue how to do this in code. I kind of understand the concept: using a bunch of coefficients that show how much the past affects the future, or something along those lines. I think. I doubt that this model will do any better than the other two, but let's give it a try.

# Work with a smaller sample of 50 data points (here I assume the last 50 rows of df)
df_for_use = df[["Close"]].tail(50).copy()

for i in range(1, 20):
    df_for_use[f"Lag_{i}"] = df_for_use["Close"].shift(i)

df_for_use.dropna(inplace=True)

train_size = int(0.8 * len(df_for_use))
train_data = df_for_use[:train_size]
test_data = df_for_use[train_size:]

y_train = train_data["Close"]
y_test = test_data["Close"]

So we used a smaller sample of 50 data points. The first lines of code shift the dataframe and give us those lagged values; the rest is just splitting the data. So what I am getting from this is we just need to create lagged values by shifting the whole thing by i. Now we just plot the ACF (autocorrelation function) using statsmodels. It's gonna be magical.

from statsmodels.graphics.tsaplots import plot_acf
series = df_for_use["Close"]
plot_acf(series)
plt.show()
ACF Plot

I have zero clue what that plot means; let me read and find out really quickly. This part is after I have read a little bit about it. So, the horizontal axis (which I probably should have labelled) shows the lagged values. So past values: 16 means (I think) 16 values in the past. The little line-dot thing that looks like it's just sprouting from the horizontal axis is the autocorrelation coefficient. I believe (at 16) this means there is a negative correlation between the lagged value (at 16) and the current value, meaning it probably pulls the current value down, like a line with a negative slope. Maybe I'm thinking of this completely wrong, but I can visualize a line with a negative slope, and this line is the relationship between the lagged values and the current value. So now we need to just train a model, but before we do that I have more information that I think is cool.

So we see this blue shaded region. From my reading, this is a confidence band (by default a 95% interval around zero), not a "good" zone; lags whose bars stay inside it aren't significantly different from zero. The autocorrelation coefficients (so far I have forgotten to write the word auto) that poke way above or below the band are the significant lags, the ones where the correlation is strong enough that we can be fairly confident it's real. We can find the exact value: for example, with our graph we see that lag 1 has a really high autocorrelation coefficient, and we can check this in code.

df_for_use["Close"].corr(df_for_use["Close"].shift(1))
# 0.6932489805617724 (output)

We see that this autocorrelation coefficient is much higher than the confidence band. So now that I've got that out of the way, let us train our model; of course, we will do it on the small dataset.

import numpy as np
from statsmodels.tsa.ar_model import AutoReg
from sklearn.metrics import mean_absolute_error, mean_squared_error

lag_order = 1
ar_model = AutoReg(y_train, lags=lag_order)
ar_results = ar_model.fit()

We import all the necessary dependencies, and then we set the lag_order value to one, since lag 1 had the highest autocorrelation coefficient. We then just plug in the y_train values (there is no X_train in this time series setup) and give the model that lag of one, because of what I explained earlier. Then we fit the model.

y_pred = ar_results.predict(start=len(train_data), end=len(train_data) + len(test_data) - 1, dynamic=False)

mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f'Mean Absolute Error: {mae:.2f}')
print(f'Root Mean Squared Error: {rmse:.2f}')
MAE and RMSE for AR Model

We see that we start the prediction at the beginning of the test data, since we trained our model on the training data (obviously), and we end at the end of the test data (obviously). The dynamic=False, I am told, means we are doing "out-of-sample forecasting", which I think just means these are values the model hasn't seen; out of sample meaning not in the training sample, I think, correct me if I am wrong. Finally we find our MAE and RMSE, and they're not good, but once we look at the graph you can see why. I just think our data is kind of hard to predict on.
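If you want to actually see that graph, here is a minimal sketch reusing the plot_time_series helper from earlier; since df_for_use only kept the Close column, I'm just using step numbers on the x-axis instead of dates:

import numpy as np

# Plot train data, test data, and the AR forecast on a shared step-number axis
steps_train = np.arange(len(y_train))
steps_test = np.arange(len(y_train), len(y_train) + len(y_test))

plt.figure(figsize=(10, 7))
plot_time_series(timesteps=steps_train, values=y_train.values, label="Train data")
plot_time_series(timesteps=steps_test, values=y_test.values, label="Test data")
plot_time_series(timesteps=steps_test, values=np.asarray(y_pred), format="-", label="AR forecast")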

It DID NOT go well, haha. I don't really care, but it still seems really cool to me. We could apply it to the larger data, but I don't think it would do much better. So for now, we are done.

Summary

Summary, summary, summary. I don't really have a summary; if you scrolled here, unfortunately, you are going to need to read the entire thing. But I do have a few questions, specifically about the AR model.

  • What is confidence?
  • Exactly how is this done, like down to the bone?
  • Does dataset size have anything to do with it?
  • What is lag order?

That's all for this one. It took me a full two months; this article was partly for me to focus on myself and focus on code, and wow did it take a long time, but I am happy that it is done, so now I can just do another one. Haha. If I have made any errors, please let me know, and be mean if you need to. Goodbye for now, I will be back.
