Building a Stock Price Predictor using Python

Published in CodeX · 10 min read · Aug 18, 2021

1. Project Overview

Financial institutions around the world are trading in billions of dollars on a daily basis. Investment firms, hedge funds and even individuals have been using financial models to better understand market behavior and make profitable investments and trades. A wealth of information is available in the form of historical stock prices and company performance data, suitable for machine learning algorithms to process.

2. Problem Statement

In this project, I will be using Yahoo Finance data to build a stock price predictor that takes daily trading data over a certain date range as input and outputs projected estimates for given query dates. Note that the inputs contain multiple metrics, such as the opening price (Open), the highest price the stock traded at (High), the number of shares traded (Volume) and the closing price adjusted for stock splits and dividends (Adjusted Close).

Some questions I would like to answer from this project are the following:

Can one really predict the price of stocks? Or do stock prices depend on other factors, e.g. economic factors?
Are there good models in practice that one can use to predict stock prices?

In this project, I will try out different models and compare their performance. However, I will only be predicting the Adjusted Close price of stocks here.

3. Metrics

I will be using the mean absolute error to compare three different models and their performance. The formula is given as follows:
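In the usual notation, with n the number of predictions, y_i the actual price and ŷ_i the predicted price (these symbol names are chosen here for illustration), the mean absolute error is:

```latex
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
```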

The model with the lowest mean absolute error is said to perform better than others with higher values. The goal is to select the model with the least error.
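As a minimal sketch of how this comparison works, the snippet below computes the mean absolute error for two sets of predictions and picks the lower one. The model names and numbers are made up for illustration and are not results from this project.

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("need at least one value")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical values, for illustration only.
actual = [1.0, 2.0, 3.0, 4.0]
predictions = {
    "model_a": [1.5, 2.5, 2.5, 4.5],  # off by 0.5 on every point
    "model_b": [1.0, 2.0, 3.0, 5.0],  # off by 1.0 on the last point only
}

# Score every model, then select the one with the lowest error.
scores = {name: mean_absolute_error(actual, pred) for name, pred in predictions.items()}
best_model = min(scores, key=scores.get)
```

Here `model_b` is selected because its average error (0.25) is lower than that of `model_a` (0.5).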

4. Data Exploratory Analysis and Visualization

Before I begin the modelling process, I will take some time to view the data and compute some statistics and plots to better understand it. This is what the first 5 rows of the data look like for the Apple, Amazon, Ford, Google, Johnson & Johnson, Pfizer and S&P 500 stocks, which I will be using for my analysis.

From the above, we can see that Google has some missing values, and I will be using the pandas fillna method to handle them. More details can be found in my Jupyter notebook. We will need to normalize the data, but before that, let us have a look at some statistics for the adjusted closing price.
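A minimal sketch of filling missing prices with pandas is shown below. The fill strategy (forward fill, then backward fill for any leading gaps) and the sample values are assumptions for illustration; the notebook may use a different strategy. `ffill()` and `bfill()` are the current pandas equivalents of calling fillna with a fill method.

```python
import numpy as np
import pandas as pd

# Hypothetical prices with gaps in one column, for illustration only.
dates = pd.date_range("2021-01-04", periods=5, freq="B")
prices = pd.DataFrame(
    {
        "GOOG": [np.nan, 101.0, np.nan, 103.0, 104.0],
        "AAPL": [50.0, 51.0, 52.0, 53.0, 54.0],
    },
    index=dates,
)

# Count missing values per column before filling.
missing_before = prices.isna().sum()

# Carry the last known price forward, then fill any leading gap
# with the first known price.
filled = prices.ffill().bfill()
```

Forward filling first avoids leaking future prices into earlier dates wherever a previous value exists.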

Fig. 2 Adjusted Closing Price Statistics

Normalizing the Data

We want to know how the different stocks moved up and down relative to one another. In order to do this, we will normalize the data. We do this by dividing the values in each column by that column's value on day one, so that each stock starts at $1.
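The normalization step can be sketched in one line with pandas. The tickers and prices below are made up for illustration.

```python
import pandas as pd

# Hypothetical adjusted closing prices, for illustration only.
prices = pd.DataFrame(
    {"AAPL": [50.0, 55.0, 60.0], "F": [10.0, 9.0, 12.0]},
    index=pd.to_datetime(["2021-01-04", "2021-01-05", "2021-01-06"]),
)

# Divide every column by its first-day value so each stock starts at 1.0.
normalized = prices / prices.iloc[0]
```

After this step a value of 1.2 means the stock is worth 20% more than on the first day, regardless of its absolute price.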

From the above cumulative return plot, we can see that Apple has the highest return over the years, with Amazon second, Google third and Microsoft fourth. The growth of Google and Microsoft looks much more stable than that of Apple and Amazon. Taking a closer look at the plot, we can see that Apple's stock has been quite volatile, and therefore riskier, especially in recent years.

Cumulative Returns

I will compute cumulative returns to see how the pandemic affected stock prices for these companies.
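A minimal sketch of a per-year cumulative return is given below, assuming the common definition price(t) / price(start of period) - 1. The helper name `cumulative_return` and the sample prices are hypothetical.

```python
import pandas as pd


def cumulative_return(prices: pd.DataFrame) -> pd.DataFrame:
    """Return relative to the first row of the period: price(t) / price(0) - 1."""
    return prices / prices.iloc[0] - 1


# Hypothetical prices spanning two calendar years, for illustration only.
dates = pd.to_datetime(
    ["2019-12-30", "2019-12-31", "2020-01-02", "2020-03-20", "2020-12-31"]
)
prices = pd.DataFrame({"MSFT": [100.0, 102.0, 104.0, 78.0, 130.0]}, index=dates)

# Select one calendar year via partial string indexing on the date index,
# then measure returns from that year's first trading day.
returns_2020 = cumulative_return(prices.loc["2020"])
```

Restarting the calculation at the beginning of each year makes the years directly comparable.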

From the above plots, let’s take note of the following:

2019: Before the pandemic, most of the companies' stocks were doing relatively well, with Apple and Microsoft taking the lead and Pfizer trailing behind.
2020: At the onset of the pandemic around spring, stock prices fell for all the companies, but afterwards technology companies like Amazon, Apple, Microsoft and Google started to grow again. Pfizer, Ford and the S&P 500 index did not do as well, Ford in particular.
2021: As the vaccine rollout began and lockdowns started to be lifted, we can see significant growth in Ford's stock price in particular, given how low it was in 2020 due to the pandemic. Google, Microsoft and the S&P 500 also grew. In general, stock prices improved for all the companies we considered.

Rolling mean and Bollinger Bands

The rolling mean may give us some idea of the true underlying price of a stock. A significant deviation below or above the rolling mean may signal a potential buying or selling opportunity. Bollinger Bands are a charting tool that captures the volatility of a financial instrument over time. Bollinger's observation was that recent volatility tells us how to read such deviations: if the stock has been very volatile, we might want to disregard movements above and below the mean, but if it has not been very volatile, we may want to pay attention to them.
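A minimal sketch of the rolling mean and Bollinger Bands with pandas follows. The 20-day window matches the one used in this article; placing the bands two rolling standard deviations from the mean is the common convention and is an assumption here, as are the helper name `bollinger_bands` and the synthetic prices.

```python
import numpy as np
import pandas as pd


def bollinger_bands(prices: pd.Series, window: int = 20, num_std: float = 2.0) -> pd.DataFrame:
    """Rolling mean plus upper and lower bands at num_std rolling standard deviations."""
    rolling = prices.rolling(window=window)
    mean = rolling.mean()
    std = rolling.std()  # pandas uses the sample standard deviation (ddof=1)
    return pd.DataFrame(
        {
            "rolling_mean": mean,
            "upper": mean + num_std * std,
            "lower": mean - num_std * std,
        }
    )


# Synthetic random-walk prices, for illustration only.
rng = np.random.default_rng(0)
prices = pd.Series(
    100 + rng.normal(0, 1, 60).cumsum(),
    index=pd.date_range("2021-01-04", periods=60, freq="B"),
)
bands = bollinger_bands(prices)
```

The first 19 rows of `bands` are NaN because a full 20-day window is not yet available for them.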

From the above plots, we can see that the initial values for the rolling mean are missing. This is because I used a 20-day window, so the first 19 days do not have enough prior values to compute a mean. We can also observe that the rolling mean follows the movement of the raw stock prices and is less spiky. We can also see that Ford has lower stock prices than Microsoft in 2020, as expected.

Daily Returns

Daily returns tell us how much the stock price went up or down on a particular day. We can compute them using the following formula:

daily_return(t) = price(t) / price(t-1) - 1

where price(t) is the stock's price today and price(t-1) is its price yesterday.
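A minimal sketch of the daily return computation, assuming the usual definition price(t) / price(t-1) - 1 and using made-up prices:

```python
import pandas as pd

# Hypothetical closing prices, for illustration only.
prices = pd.Series(
    [100.0, 110.0, 99.0],
    index=pd.to_datetime(["2021-01-04", "2021-01-05", "2021-01-06"]),
)

# shift(1) lines each day up with the previous day's price.
# The first day has no previous price, so its return is NaN.
daily_returns = prices / prices.shift(1) - 1
```

Here the stock gains 10% on the second day and loses 10% on the third. pandas also provides `Series.pct_change` for the same calculation.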

From the above plots, we can see that the range of daily returns (the volatility) for Ford is wider than for Microsoft. This could be a result of technology companies like Microsoft bouncing back faster during the pandemic.

5. Modelling Methodology and Results

In this section I will try out some models to predict the Adjusted Close price of a stock. Before starting the modelling, I used the pandas fillna method to handle missing data. More details can be seen in my Jupyter notebook.

Prediction using Long Short-Term Memory (LSTM):

LSTM is a recurrent neural network (RNN) architecture used in deep learning that is capable of learning long-term dependencies. It has a chain-like structure and processes data by passing information along as it propagates forward. I used the Adam optimizer for my model and the mean squared error as my loss function. Below is my LSTM model summary.
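As a rough sketch of what such a setup can look like in Keras, the snippet below shapes a price series into windows and builds a small LSTM compiled with the Adam optimizer and mean squared error loss. The window length, the number of units, the single LSTM layer and the helper names `make_windows` and `build_lstm` are all assumptions for illustration and do not reproduce the model summary from my notebook.

```python
import numpy as np


def make_windows(series, window):
    """Turn a 1-D series into (samples, window, 1) inputs and next-step targets."""
    values = np.asarray(series, dtype="float32")
    X = np.stack([values[i:i + window] for i in range(len(values) - window)])
    y = values[window:]  # the value that follows each window
    return X[..., np.newaxis], y


def build_lstm(window, units=50):
    """Small LSTM regressor; the layer sizes are illustrative assumptions."""
    # Imported here so the windowing helper works without TensorFlow installed.
    from tensorflow import keras

    model = keras.Sequential(
        [
            keras.Input(shape=(window, 1)),
            keras.layers.LSTM(units),
            keras.layers.Dense(1),  # one output: the next (scaled) price
        ]
    )
    model.compile(optimizer="adam", loss="mean_squared_error")
    return model


# Synthetic, already scaled series, for illustration only.
X, y = make_windows(np.linspace(0.0, 1.0, 30), window=10)
```

With TensorFlow installed, training would then look like `build_lstm(10).fit(X, y, batch_size=1, epochs=5)`, using the batch size and epoch count of the initial model.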

For my initial model, I used a batch size of 1 and 5 epochs, which gave me a mean absolute error of 0.0942. This isn’t so bad, but there is room for improving my model by tuning the parameters to hopefully get better predictions.

Refinement

I will now tune a couple of my model parameters to see how my model performs. Below is a table of the different parameters I tuned for Microsoft stocks and their corresponding mean absolute errors.

From the above table, we can see that as the batch size and number of epochs increased, the model performed better (i.e. had a lower mean absolute error). Also, including an activation function (ReLU) did not improve the model's performance.

The results from my final LSTM model (the 5th trial in the refinement table above), which has a batch size of 800 and 50 epochs, are given below:

From the above, we can see that the predicted and actual adjusted closing price plots are relatively similar with little variation, with a mean absolute error of 0.0591, which isn't too bad. We can also conclude that spending more time tuning the parameters does improve the model, as shown in the above table. However, there is still room for improvement, and other models are worth trying for comparison.

Prediction using Linear Regression

Linear Regression attempts to model the relationship between a response and one or more explanatory variables by fitting a linear equation to the observed data. The results from my Linear Regression prediction are given below:
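A minimal sketch of fitting a linear regression with scikit-learn and scoring it with the mean absolute error is shown below. The choice of features (Open, High, Volume), the 80/20 chronological split and the synthetic data are assumptions for illustration, so the resulting error is not comparable to the figures in this article.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Synthetic stand-ins for Open, High, Volume and Adjusted Close.
rng = np.random.default_rng(42)
n = 200
open_price = 100 + rng.normal(0, 1, n).cumsum()
high_price = open_price + rng.uniform(0, 2, n)
volume = rng.uniform(1e6, 2e6, n)
adj_close = open_price + rng.normal(0, 0.5, n)

X = np.column_stack([open_price, high_price, volume])

# Chronological split: train on the first 80% of days, test on the rest,
# so the model is never evaluated on days that precede its training data.
split = int(0.8 * n)
model = LinearRegression().fit(X[:split], adj_close[:split])

predicted = model.predict(X[split:])
mae = mean_absolute_error(adj_close[split:], predicted)
```

The fitted `model.coef_` holds one weight per feature, which is what makes the model a linear equation in the inputs.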

From the above, we can see that the predicted and actual adjusted closing price plots show some variation, with a mean absolute error of 0.215, which is worse than the LSTM model. Let's try one more model and see how it performs.

Prediction using Random Forest Regression

Random Forest Regression is a supervised learning algorithm that uses ensemble learning methods for regression. A Random Forest operates by constructing a multitude of decision trees at training time and, for regression tasks, outputting the average prediction of the individual trees. For classification tasks, it outputs the class selected by most trees.
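The averaging behaviour described above can be checked directly with scikit-learn's RandomForestRegressor. This is a small sketch on synthetic data; the number of trees and the features are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = X[:, 0] * 2.0 + np.sin(X[:, 1]) + rng.normal(0, 0.1, 200)

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

# Predict with every individual tree, then with the forest as a whole.
X_new = rng.uniform(0, 10, size=(5, 3))
per_tree = np.stack([tree.predict(X_new) for tree in forest.estimators_])
forest_prediction = forest.predict(X_new)

# The forest's output is the mean of the individual tree predictions.
matches_average = np.allclose(per_tree.mean(axis=0), forest_prediction)
```

Averaging over many trees, each trained on a different bootstrap sample, is what smooths out the errors of any single tree.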

Below is a table of actual and predicted values of Adjusted closing stock price for Microsoft using a Random Forest Regressor.

From the table, we can see that the Random Forest Regressor performed very well: the actual and predicted Adjusted Close values are fairly close. Let us now view the plots.

Fig. 13 Actual and Predicted Values of Microsoft Stocks

From the above, we can see that the predicted and actual adjusted closing price plots are relatively similar, with a mean absolute error of 0.0497, which is good. Let us see how it performs with the Google stocks.

Fig. 14 Actual and Predicted Values of Google Stocks

From the above, we can see that the predicted and actual adjusted closing price plots are very similar, with a mean absolute error of 0.000824, which is very good. Given that the plots overlap, I plotted them separately so we can see their similarity clearly.

Model Evaluation and Results

From my investigation of three different models, I observed that Random Forest Regressor delivered a much lower mean absolute error than the LSTM or Linear Regression for both Microsoft and Google stocks (see Fig. 15 below). I also observed that taking time to tune the parameters for the LSTM model (e.g. the number of epochs and batch size) resulted in a better prediction.

Justification

From my analysis, we can see that one can actually predict the price of stocks and that economic factors do have some effect on stock prices. Secondly, there are several models that deliver good results in practice and can be used to predict stock prices.

The Random Forest Regressor, an ensemble method that combines many decision trees, is a good fit here, as it made more accurate predictions than the other models in my analysis above.

6. Conclusion

In conclusion, the Random Forest Regressor delivered a much lower mean absolute error than the LSTM or Linear Regression for both Microsoft and Google stocks. I also observed that tuning the parameters for the LSTM (e.g. the number of epochs and batch size) resulted in better predictions, although this can take some time.

When exploring the data, it was interesting to see how the stock prices of different companies changed due to the pandemic and how the technology companies' stock prices bounced back more quickly than those of the other companies considered. It was also interesting to see how Pfizer stocks improved as the vaccine rollout began.

Potential Improvements

Some potential improvements to my work could be the following:

Spend significantly more time tuning the model parameters, and include more features that might be relevant for stock price prediction.
Try out more models to see if one performs better than Random Forest Regression. I only tried three models, for simplicity and because of time constraints.
Explore other companies' stocks to see how well their prices can be predicted with different models.

For a more detailed analysis including the code, check out my GitHub page.