Getting Started with Machine Learning in Python: Stock Predictions

Joseph Arch
14 min read · Sep 6, 2022


Learn how to train a model to predict the performance of individual stocks using Python machine learning.

About the Authors

Joe Arch is an intern software engineer at NCR Corporation. He is a rising freshman at Duke University planning on majoring in Math and Computer Science.

Cameron Kaplinger is an intern software engineer at NCR Corporation. He is a rising freshman at the University of North Carolina at Chapel Hill planning on majoring in Statistics and Analytics.

Background

Machine learning is quickly becoming one of the most prominent fields within computer science thanks to its wide range of interesting applications, from self-driving cars to image recognition. We wanted to get some hands-on experience with machine learning with a beginner-friendly project that would only require a basic background in Python and decided on creating a predictive model for the stock market. (Disclaimer: this article is about a fun project to get familiar with machine learning, not investment strategies. Be careful with your investment decisions.)

Decision Trees and Random Forests

Before diving into the code, let’s begin with an overview of the machine learning model being used. A decision tree is a way to conceptualize an algorithm as a step-by-step process of dividing inputs into different branches based on their characteristics until they’re eventually categorized. We call each individual set of information fed to the algorithm a sample; in our case, each sample will be the information about a single trading day.

A common approach to machine learning is taking a set of samples previously assigned to certain categories and generating a decision tree that sorts new samples into the same categories based on their characteristics. This strategy is known as supervised machine learning, since it involves feeding a model a set of training data with previously classified sample data and therefore involves some level of “supervision”. Because we have access to historical stock performance data, it is easy to classify samples as either a day where the stock increased or a day where the stock decreased. We can then use a number of statistics, such as the open, close, high, and low of a stock on each trading day, to start fitting a model.

It doesn’t stop at just one decision tree, though. We can actually train a number of decision trees on the same data, each randomly generated to fit slightly differently, in order to improve the overall consistency of the model. Even if one decision tree happens to fit in a way that poorly predicts certain samples, in the aggregate the collection of trees should be able to reliably categorize samples into the categories it was trained on. Because this model is composed of a large set of randomly generated decision trees, it is referred to as a random forest.
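
To make the voting idea concrete, here is a minimal, self-contained sketch, separate from the stock project: it fits a tiny forest on made-up samples and labels, purely for illustration.

#a minimal sketch, separate from the stock project: a tiny forest fit
#on made-up samples just to illustrate the voting idea
from sklearn.ensemble import RandomForestClassifier

features = [[1, 2], [2, 1], [8, 9], [9, 8]] #four made-up samples
labels = [0, 0, 1, 1] #their categories

forest = RandomForestClassifier(n_estimators=10, random_state=1)
forest.fit(features, labels)

#each class probability is the share of trees voting for that class
print(forest.predict([[2, 2], [9, 9]])) #expected: [0 1]
print(forest.predict_proba([[2, 2], [9, 9]])) #per-class vote fractions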

Packages

This project has a number of dependencies you might need to install if you do not already have them.

#make sure to install any packages you don’t have through pip
import pandas as pd
import pandas_datareader #installed through pip as pandas-datareader
import datetime as dt
from sklearn.ensemble import RandomForestClassifier #installed through pip as scikit-learn
import matplotlib.pyplot as plt
import finnhub #installed through pip as finnhub-python
import time
from textblob import TextBlob

Processing Sample Data

To train a model, we first need to collect some historical data using Pandas data-reader. Start by identifying the ticker symbol of the stock you want to train the model on (today we’ll use Apple), as well as the start and end dates of the data you want to access.

ticker = 'AAPL'
start_date = dt.datetime(2020,1,1)
end_date = dt.datetime(2022,7,30)
stock_info = pandas_datareader.DataReader(ticker, 'yahoo', start_date, end_date)

Pandas data-reader will return a DataFrame containing information about the stock’s history, such as the daily open and close price, as well as other useful statistics. We will also need to create a column of target values to designate what the model should be trying to predict for each row of data. Since we’re interested in predicting whether a stock will rise or fall the next day, the target column is calculated by rolling over pairs of consecutive rows and comparing their closing prices. If the closing price of the second day is higher than that of the first, the target is recorded as a 1 to indicate a price increase; if it’s lower than or equal to the first, the target is recorded as a 0 to indicate a price decrease.

data = stock_info[['Close']]
data = data.rename(columns={'Close': 'True_Close'})
data['Target'] = stock_info.rolling(2).apply(lambda x: x.iloc[1] > x.iloc[0])['Close']
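
An equivalent way to build the same target column, sketched below, is a direct comparison with the shifted close; it differs from the rolling version only in the very first row, where there is no previous close to compare against.

#equivalent alternative: compare each close to the previous day's close
#(the first row becomes 0 here instead of NaN, and is cut off later anyway)
data['Target'] = (stock_info['Close'] > stock_info['Close'].shift(1)).astype(int)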

We need to be careful to only train the model on the data we would have available to make predictions in the present. As the table is formatted now, each row has statistics about a given day, as well as a target value that indicates if the stock value went up that very same day. This raises an issue, since we want our model to make predictions about the behavior of the stock tomorrow based only on the information we have today. We fix this issue in the code by shifting the predictor data forward by one row, so that for each date the target value still indicates whether that day was an increase, but the predictors are the stock statistics from the day before. Pandas data-reader returns many columns of information that could be used in such a model, but there are also excess columns that we can do without, so we only add the predictors we’re truly interested in into the training data.

predictors = ['Close', 'High', 'Low', 'Open', 'Volume']
stock_prev = stock_info.copy()
stock_prev = stock_prev.shift(1)
cut_off = 1
data = data.join(stock_prev[predictors]).iloc[cut_off:]
#the cut off removes the first row, where the shifted predictor data
#is missing and the model therefore cannot be trained

By the end of it all, each row of our data should hold the true close and target for a given day alongside the previous day’s open, high, low, close, and volume.

Training the Model

Now that we have a table of properly formatted data, we can plug it into Scikit-learn’s random forest classifier to train a model. When instantiating the model we must provide a few parameters: the number of decision trees we want to include in the forest, the minimum number of samples a node must contain before we allow it to be split further, and the random state of the model, which we can set to a constant value to ensure that running the same input through training will always produce the same model.

estimators = 500 # number of decision trees
samples_split = 3 # minimum number of samples required to split a node
model = RandomForestClassifier(n_estimators=estimators, min_samples_split=samples_split, random_state=1)

The next step is to split our data into a training portion and a prediction portion. If you were only interested in predicting the next day, you could simply set pred_days to 1, but since we’d like to see how our model performs over time, we’ll set aside the last 75 days of the stock information to serve as testing data.

pred_days = 75
train = data.iloc[:-pred_days]
test = data.iloc[-pred_days:]

All that’s left to do is fit the model to the data, which is done with one simple line:

model.fit(train[predictors], train['Target'])

Now that the model is trained, we can use it to make predictions on the testing data using the predict_proba() method, which returns a list of the proportion of decision trees that predicted an increase for each testing day. For example, if 237 of the 500 decision trees predicted an increase in the stock price on a given day, the corresponding value in the list would be 0.474, indicating that 47.4% of the trees had a positive prediction. From there we can convert the list into a series and interpret the proportions into actual predictions. For now, we’ll side with the majority of the trees, so if 50% or more of them predict an increase, the model will predict an increase.

preds = (model.predict_proba(test[predictors]))[:, 1]
preds = pd.Series(preds, index=test.index)
target_precision = 0.5
preds[preds >= target_precision] = 1
preds[preds < target_precision] = 0

With the predictions made, it’s time to compare them to the actual stock performance by creating a combined DataFrame which we can then plot with Matplotlib.

combined = pd.concat({'Target': test['Target'], 'Predictions': preds}, axis=1)
plt.plot(combined['Predictions'])
plt.plot(combined['Target'])
plt.show()

The orange line represents the actual performance of the stock, with a value of 1 showing a day the stock increased and a value of 0 showing a day it decreased. The blue line represents the model’s day-to-day predictions of the stock. Whenever the two lines are in sync, the model accurately predicted the performance of the stock for that day. Ideally, if you’ve followed along exactly with our code up to this point, your chart will look the same as ours, but it’s always possible that changes made to the packages used in this project since the publication of this article will make your results look slightly different from the ones shown here. If any of your charts or numbers are different, don’t worry; just make sure the code is functioning as intended.
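
If you’d like the chart to be self-explanatory, matplotlib can label the lines directly; a small optional tweak to the plotting code above:

#optional: label each line and add a legend
plt.plot(combined['Predictions'], label='Predictions')
plt.plot(combined['Target'], label='Target')
plt.legend()
plt.show()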

While the chart is interesting to look at, it can be a little difficult to gauge how successful the model was from the visual alone. It would be helpful to calculate a few quantifiable statistics to measure our model’s success. There are four possible outcomes for each individual day:

  • True positive: when the model predicted an increase and the stock actually increased
  • False positive: when the model predicted an increase but the stock actually decreased
  • True negative: when the model predicted a decrease and the stock actually decreased
  • False negative: when the model predicted a decrease but the stock actually increased

We can calculate the count of each outcome quite easily.

true_p = 0
false_p = 0
true_n = 0
false_n = 0
for i in combined.index:
    if combined.loc[i, 'Predictions'] == 1 and combined.loc[i, 'Target'] == 1:
        true_p += 1
    elif combined.loc[i, 'Predictions'] == 1 and combined.loc[i, 'Target'] == 0:
        false_p += 1
    elif combined.loc[i, 'Predictions'] == 0 and combined.loc[i, 'Target'] == 0:
        true_n += 1
    elif combined.loc[i, 'Predictions'] == 0 and combined.loc[i, 'Target'] == 1:
        false_n += 1
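
If you’d rather skip the explicit loop, the same four counts can be computed with boolean masks; this sketch should produce identical numbers.

#vectorized alternative: combine boolean masks and count the matches
true_p = int(((combined['Predictions'] == 1) & (combined['Target'] == 1)).sum())
false_p = int(((combined['Predictions'] == 1) & (combined['Target'] == 0)).sum())
true_n = int(((combined['Predictions'] == 0) & (combined['Target'] == 0)).sum())
false_n = int(((combined['Predictions'] == 0) & (combined['Target'] == 1)).sum())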

The first and most obvious choice would be the accuracy of the model, which is simply the proportion of predictions that matched the actual stock performance out of the total number of predictions.

accuracy = (true_p + true_n) / (true_p + true_n + false_p + false_n)

When choosing what metrics to analyze, we should consider the practical applications of the model and which type of error is more costly to make. Although of course you should be very careful with your money, you could imagine using this type of model to inform buying decisions when investing. If the model predicts a false negative we do nothing, since of course we would not buy a stock we expect to lose value. While we would miss out on the potential profit of having bought the stock, we aren’t losing any money. On the other hand, a false positive would result in us buying a stock because we believed it was rising in value, but then when the stock actually falls we’ve lost our investment.

Since false positives are more costly than false negatives in this context, we want to maximize another metric called precision, which is the ratio of true positives to the total number of predicted positives.

precision = true_p / (true_p + false_p)
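
Scikit-learn also ships ready-made versions of both metrics if you’d rather not compute them by hand; a sketch of the equivalent calls:

#equivalent results using scikit-learn's built-in metric functions
from sklearn.metrics import accuracy_score, precision_score

accuracy = accuracy_score(combined['Target'], combined['Predictions'])
precision = precision_score(combined['Target'], combined['Predictions'])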

With the input we gave the model so far, it has an accuracy of 0.547 and a precision of 0.581, not bad for a start. (Again, don’t worry too much if your numbers vary from ours.)

Adjusting Parameters of the Model

If we’re interested in maximizing the precision of the model we can make a few simple changes to the way we interpret the results of the random forest to minimize the number of false positives we see. Earlier we chose to side with the majority of decision trees in the forest, but if we’re doing our best to maximize precision, we might want to raise the bar on how sure we want the model to be of its positive predictions before we would actually buy a stock. We only need to change the value of the target_precision variable from earlier.

target_precision = 0.7
preds[preds >= target_precision] = 1
preds[preds < target_precision] = 0

Now when we run our predictions, we only count days where 70% or more of the decision trees in the forest predicted an increase as a positive prediction. The result is a more cautious model that predicts positives less often, and usually only when the signs are strongly in favor of an increase.

Accuracy: 0.520

Precision: 0.647

As you can see, accuracy took a minor hit, but precision jumped up by 0.066 points. This is because the model began returning more false negatives, lowering accuracy, but far fewer false positives, raising precision.

You might wonder why we don’t raise the target_precision even higher and see if we can push precision further. Let’s try something like this and see what happens.

target_precision = 0.9
preds[preds >= target_precision] = 1
preds[preds < target_precision] = 0

Accuracy: 0.467

Precision: 1.000

Looking at the precision alone, we might be tempted to think the model has achieved some kind of miracle, but the graph reveals that this perfect precision is something of a fluke. In the entire 75 days, the model only predicted one increase. Run it on a different set of days with such a high target precision, and that number will likely drop to zero. While we wanted the model to be more cautious with its predictions, this extreme standard makes it difficult to ever use the model in any practical sense. We’ve found that keeping the target precision anywhere between 0.6 and 0.8 strikes a mostly effective balance between high precision and a reasonable number of positive predictions, though results may vary.
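
One way to find a threshold that suits your own data is to sweep over candidate values and watch how precision and the number of positive predictions trade off; a rough sketch (the candidate values are arbitrary):

#rough sketch: sweep candidate thresholds and compare precision against
#how many positive predictions the model still makes
probs = pd.Series(model.predict_proba(test[predictors])[:, 1], index=test.index)
for threshold in [0.5, 0.6, 0.7, 0.8, 0.9]:
    candidate_preds = (probs >= threshold).astype(int)
    positives = int(candidate_preds.sum())
    hits = int(((candidate_preds == 1) & (test['Target'] == 1)).sum())
    precision_at = hits / positives if positives > 0 else float('nan')
    print(threshold, positives, round(precision_at, 3))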

Other parameters of the model can also be tweaked to impact performance. Adding more estimators generally increases the accuracy and precision of the model, up to a certain point; beyond 1000 decision trees we only saw marginal improvements to the effectiveness of the model. Reducing the minimum sample split can also allow the decision trees to grow deeper and possibly become more accurate, but this can lead to a phenomenon known as overfitting, where the model fits its limited training data so closely that it becomes ineffective at predicting trends in other data. For this reason, sometimes raising the minimum sample split is actually more effective in improving the model, especially when it’s being trained on larger data sets that go back further in the stock’s history.
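
If you want to experiment with these parameters systematically, a small loop comparing a few settings on the same train/test split is enough to get a feel for them; a sketch, with arbitrary value grids:

#sketch: retrain with a few parameter combinations and compare plain
#accuracy on the same split (model.score returns accuracy)
for n_trees in [100, 500, 1000]:
    for min_split in [2, 3, 10]:
        candidate = RandomForestClassifier(n_estimators=n_trees, min_samples_split=min_split, random_state=1)
        candidate.fit(train[predictors], train['Target'])
        print(n_trees, min_split, round(candidate.score(test[predictors], test['Target']), 3))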

Adding Additional Predictors

The next step would be to add additional predictors into the model, giving it more information to train on and predict with. Anything could be a predictor, as long as you can find data day by day to include in the data table. If you wanted to try to train the model using the global penguin population each day then you could, but of course, it makes much more sense to use information that actually correlates to the performance of the stock.

For example, you could easily calculate rolling averages of the stock price to give the model a wider picture of the stock’s performance history with just a few lines of code placed before the training of the model.

predictors = ['Close', 'High', 'Low', 'Open', 'Volume', 'Weekly Average', 'Quarterly Average']
#make sure to add any new predictors into the predictors variable

stock_info['Weekly Average'] = stock_info['Close'].rolling(7).mean()
stock_info['Quarterly Average'] = stock_info['Close'].rolling(91).mean()

cut_off = 91 #updated to account for the fact that the first quarter of dates will have a missing value (NaN) for Quarterly Average
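
If you’d rather not update cut_off by hand every time a new rolling predictor is added, one alternative, sketched here, is to let pandas drop the incomplete rows after all predictors have been joined:

#alternative to the manual cut off: drop every row that still has a
#missing value after the join, including the early rows with no averages
data = data.join(stock_prev[predictors]).dropna()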

When it comes to what predictors can be added to predict a stock’s performance, the sky’s the limit. One exciting possibility is using sentiment analysis to gauge the public perception of a stock and then quantify it so the model can use it in its training and predictions. There are many ways to go about adding sentiment analysis to a project like this one, from using Twitter’s API to gather Tweets to web scraping comments from the Yahoo Finance comment sections. The important thing is that every piece of sentiment data (be it Tweet, comment, or whatever else) is sorted by the date of its publication and that we have access to enough historical data to train the model on. Twitter’s API (the free version) is unfortunately limited to accessing only Tweets made in the last seven days, and web scraping can be finicky and difficult for beginners, so for the sake of simplicity we’ll be using the Finnhub Stock API, which will allow us to easily access a whole year’s worth of headlines relating to individual stocks, day by day.

The first step is to get your own personal API key from https://finnhub.io/ by making a free account.

As long as you aren’t planning on sharing your code with anyone, you can store the key in a variable in the code, but if you are planning on uploading your project anywhere on the internet, it’s best to look into using environment variables to avoid accidentally sharing your key with others. Using our API access and the finnhub-python package, we’ll write a function directly under our import statements that gathers headlines about a particular stock and returns them in a DataFrame.

FINNHUB_TOKEN = 'your api key here'
#only hardcode your token into your code like this if you’re sure it won’t be shared with others
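
If you do plan to share your code, a minimal sketch of the environment-variable approach (the variable name FINNHUB_TOKEN is just the convention used here):

import os

#safer alternative if the code will be shared: read the key from an
#environment variable set beforehand, e.g. export FINNHUB_TOKEN=...
FINNHUB_TOKEN = os.environ.get('FINNHUB_TOKEN')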

def finnhub_data(ticker, start, end, delta):
    #gathers headlines about a stock between start and end, querying
    #the API in chunks of delta days
    client = finnhub.Client(api_key=FINNHUB_TOKEN)

    delta = dt.timedelta(days=delta)

    df = pd.DataFrame(columns=['date', 'headline', 'summary'])
    calls = 0

    while start <= end:
        news = client.company_news(ticker, _from=start.strftime('%Y-%m-%d'), to=(start + delta - dt.timedelta(days=1)).strftime('%Y-%m-%d'))
        calls += 1
        for item in news:
            sub_dict = {'date': [dt.date.fromtimestamp(item['datetime'])], 'headline': [item['headline']], 'summary': [item['summary']]}
            row = pd.DataFrame.from_dict(sub_dict)
            df = pd.concat([df, row], ignore_index=True)
        start += delta
        if calls % 10 == 0:
            print(str(calls) + ' api calls')
            time.sleep(10) #pause to stay under the rate limit

    return df

You might have noticed that the function keeps track of the number of times the API is called and will sleep for 10 seconds after every 10 calls. This is because the free version of the API, designed for personal use, is limited to 60 calls per minute. Each individual call is also limited in the number of articles it can return, so it’s best to increment over the range of dates in small chunks, making individual calls to gather headlines for each chunk. The delta parameter dictates the size of these chunks, and we’ve found that 3 days is normally sufficient to gather enough headlines for each individual day without taking too long.

We can use the package TextBlob to run some basic sentiment analysis on our headlines by writing a simple function to gather the average sentiment of multiple headlines for a given day.

def avg_sentiment(strings):
    #returns the average TextBlob polarity of a list of strings,
    #or 0 if the list is empty
    total = 0
    for string in strings:
        blob = TextBlob(string)
        total += blob.sentiment.polarity
    if len(strings) != 0:
        return total / len(strings)
    else:
        return 0
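
As a quick sanity check, you can run the function on a couple of invented headlines; TextBlob’s polarity score ranges from -1 (most negative) to 1 (most positive).

#quick check on two invented headlines; the result falls between -1 and 1
sample_headlines = ['Apple reports record quarterly profits', 'Apple faces lawsuit over alleged patent infringement']
print(avg_sentiment(sample_headlines))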

Bringing together these two functions, we can add daily sentiment values to our set of predictions right before calculating the rolling closing averages.

predictors = ['Close', 'High', 'Low', 'Open', 'Volume', 'Weekly Average', 'Quarterly Average', 'Sentiment']
#make sure to add any new predictors into the predictors variable

stock_info['Sentiment'] = 0.0
news = finnhub_data(ticker, start_date, end_date, 3)
for date in stock_info.index:
    daily = news.copy().loc[news['date'] == date.date()]
    if daily.empty:
        stock_info.loc[date, 'Sentiment'] = 0 #no headlines: record a neutral sentiment
    else:
        stock_info.loc[date, 'Sentiment'] = avg_sentiment(daily['headline'].tolist())

As implemented in this project, the sentiment analysis can only be used for a select few companies that have news published around them daily. Companies such as Amazon, Apple, or Alphabet are good examples, but there are many large companies that are not as commonly written about in the news. Trying to include sentiment analysis as a predictor on companies that have insufficient headlines in the Finnhub database will be detrimental to the performance of the model, since the sentiment value for days without headlines will be recorded as a 0, which might be an inaccurate reflection of the actual sentiment around the stock. The free version will only return headlines published within the last year, so make sure to adjust the start date to account for this.
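
Before trusting sentiment as a predictor for a given company, it’s worth checking what fraction of trading days actually have headlines; a rough sketch using the news DataFrame gathered earlier:

#rough sketch: fraction of trading days with at least one headline
trading_days = {d.date() for d in stock_info.index}
days_with_news = sum(1 for d in news['date'].unique() if d in trading_days)
print(round(days_with_news / len(trading_days), 2))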

Going Beyond

There are many directions this project could be taken once the basic framework for training the model is established. You could decide to continue processing additional information to use as predictors when training the model, or perhaps compare the effectiveness of other machine learning models when using the same training data. Whatever ideas come to mind, the same underlying principles used in machine learning in this article can be leveraged to accomplish even greater things.
