Predicting Stock Prices with Machine Learning
In the world of finance, predicting stock prices has always been a challenge that captures the imagination of investors, researchers, and data scientists alike. The ability to anticipate future price movements could potentially lead to significant gains, but it’s no secret that the stock market is notoriously unpredictable. In this blog post, we delve into a machine learning project aimed at predicting stock prices using historical data and the insights gained from the process.
The Project’s Purpose
This project’s main goal was to develop a predictive model that could forecast stock prices for a given future date. To achieve this, we turned to historical stock data available from Yahoo Finance. With this data, we embarked on a journey of data analysis, preprocessing, model selection, and evaluation.
Strategy for Solving the Problem
Our approach involved several key steps: acquiring data from a trustworthy source, identifying pertinent features for training, selecting the optimal model, and fine-tuning its parameters to achieve the highest accuracy score. This strategy was devised to effectively tackle the intricate task of predicting stock prices.
Description of Input Data
As mentioned before, we turned to Yahoo Finance for historical stock data, using the yfinance API to download the information related to each stock. The definition of each input field is presented in the table below.
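A minimal sketch of the download step is shown below, assuming the yfinance package is installed; the ticker symbol and date range are placeholders rather than the exact values used in the project.

```python
import yfinance as yf

# Download daily historical data; "AAPL" and the date range are placeholders.
# auto_adjust=False keeps both the raw Close and the Adj Close columns.
df = yf.download("AAPL", start="2015-01-01", end="2023-01-01", auto_adjust=False)

# The resulting DataFrame is indexed by trading date and contains:
# Open, High, Low, Close, Adj Close (close adjusted for splits and dividends)
# and Volume (number of shares traded).
print(df.head())
```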
Data Preprocessing
The data retrieved from Yahoo Finance already boasts a time-series format, with dates acting as the index. However, a challenge arises from the stock market operating solely on working days, leading to gaps in the dataset due to weekends and holidays. In the table below, we can see that the time series jumps from 2022-01-07 to 2022-01-10, skipping the weekend.
To address this, the .asfreq() method was applied, converting the data into a regular daily frequency. Additionally, the missing data points introduced by this step were addressed through forward filling.
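A minimal sketch of these two steps, assuming df is the DataFrame downloaded above:

```python
# Reindex to a regular daily frequency; the newly created weekend and
# holiday rows are filled with NaN.
df = df.asfreq("D")

# Forward-fill the gaps so each missing day carries the last known prices.
df = df.ffill()
```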
Furthermore, as part of preparing the data for the model architecture, we reshaped the input data and scaled the preprocessed values. This scaling operation brings the input features into a consistent and manageable range, facilitating effective training of the predictive model.
Snippet for data scaling: we utilized the MinMaxScaler to perform the scaling operation. The training data was scaled with the fit_transform method, while the test data was only transformed, without fitting, to prevent any data leakage.
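A sketch of that step is shown below; the 80/20 chronological split is an assumption for illustration, and only the adjusted closing price is kept, anticipating the feature selection discussed in the next section.

```python
from sklearn.preprocessing import MinMaxScaler

# Keep only the adjusted closing price (see the feature selection section below).
prices = df[["Adj Close"]].values

# Chronological train/test split; the 80/20 ratio is an assumption.
split = int(len(prices) * 0.8)
train, test = prices[:split], prices[split:]

scaler = MinMaxScaler(feature_range=(0, 1))
train_scaled = scaler.fit_transform(train)  # fit on the training data only
test_scaled = scaler.transform(test)        # transform only, to avoid data leakage
```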
Feature Selection and Further Data Reshaping
A crucial decision in this project was selecting the most relevant features for modeling. Through experimentation, it was discovered that using only the adjusted closing price as a feature yielded the best prediction results. This choice was motivated by the fact that other features are price-related and remain unknown before the future date in question.
Please note that in the graph below, the variable ‘Volume’ has been excluded due to its magnitude, which could distort the visualisation of the other variables’ values.
Further visualisation confirms that the other features are highly correlated with the target variable ‘Adj Close’. As for the ‘Volume’ feature, we decided to exclude it from our machine learning model due to its weak correlation of around -0.3 with the other variables. Moreover, the remaining price-related features are excluded because they only become known once the target variable is determined; this inherent unavailability renders them ineffective for making predictive inferences.
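One simple way to inspect these relationships is a correlation heatmap; the sketch below assumes the seaborn and matplotlib packages and the df DataFrame from the preprocessing steps above.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlations between the price-related features, the volume,
# and the target variable 'Adj Close'.
corr = df[["Open", "High", "Low", "Close", "Adj Close", "Volume"]].corr()

sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.title("Feature correlations")
plt.show()
```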
Creating Training Sequences: to enhance our model’s ability to learn patterns from the time series data, we employed a technique known as sequence creation. This approach involved organizing the data into sequences of a specified length, with each sequence representing a set of past time steps. By doing so, we aimed to provide the model with context and history, enabling it to capture temporal dependencies and make informed predictions.
The code snippet below demonstrates how we implemented this sequence creation process and then transformed the data into numpy arrays, preparing it for training:
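A minimal version of that helper might look as follows; the window length of 60 time steps is an assumption, as the exact value is not reproduced here.

```python
import numpy as np

def create_sequences(data, seq_length):
    """Split a scaled series of shape (n, 1) into overlapping samples:
    each X holds `seq_length` past steps and y is the value that follows."""
    X, y = [], []
    for i in range(seq_length, len(data)):
        X.append(data[i - seq_length:i])
        y.append(data[i])
    return np.array(X), np.array(y)

SEQ_LENGTH = 60  # assumed window size

X_train, y_train = create_sequences(train_scaled, SEQ_LENGTH)
X_test, y_test = create_sequences(test_scaled, SEQ_LENGTH)
# X_train has shape (samples, SEQ_LENGTH, 1), ready for an RNN input layer.
```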
Model Selection and Evaluation
Our journey led us to explore two powerful types of recurrent neural networks (RNNs): Long Short Term Memory (LSTM) and Gated Recurrent Unit (GRU). These RNNs are well-suited for sequential data like time series.
We employed a grid search to find the best hyperparameters for both LSTM and GRU models. The LSTM model emerged as the clear winner, offering superior prediction accuracy. To quantify our model’s performance, we used the Mean Absolute Percentage Error (MAPE), a metric that calculates the percentage difference between predicted and actual values.
Visualizing Model Performance: Comparing Predicted and True Values
To assess the accuracy and effectiveness of our LSTM model, we employed visualizations to juxtapose the predicted values generated by the model with the actual true test values. This graphical representation provided a clear and intuitive means of understanding how well our model’s predictions aligned with the ground truth data.
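A sketch of that comparison plot is shown below; the arrays true_prices and predicted_prices are assumed to hold the test targets and model predictions after the scaling has been inverted (see the MAPE section further down).

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 5))
plt.plot(true_prices, label="True Adj Close")
plt.plot(predicted_prices, label="Predicted Adj Close")
plt.xlabel("Test set time step")
plt.ylabel("Price")
plt.legend()
plt.title("LSTM predictions vs. true test values")
plt.show()
```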
Model Architecture — LSTM
The LSTM model chosen for this project is well-suited for sequential data like time series. LSTM networks contain memory cells that can retain information over long sequences, making them apt for capturing patterns in stock price movements.
To make our LSTM model as accurate as possible, we tested it with different hyperparameter settings using a grid search. This helped us find the best way to configure the model for predicting stock prices. Once the best settings were identified, we adjusted the model accordingly and evaluated how well it predicts stock prices with that configuration, ensuring the model performs at its best.
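The exact tooling and parameter grid are not reproduced here, so the sketch below uses a simple manual grid search over a few assumed hyperparameters, keeping the configuration with the lowest validation loss.

```python
from itertools import product
from tensorflow import keras

def build_lstm(units, dropout, learning_rate, seq_length=SEQ_LENGTH):
    """Build and compile a small LSTM regressor for one-step price prediction."""
    model = keras.Sequential([
        keras.layers.LSTM(units, input_shape=(seq_length, 1)),
        keras.layers.Dropout(dropout),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate), loss="mse")
    return model

# Assumed parameter grid, for illustration only.
param_grid = {
    "units": [32, 64],
    "dropout": [0.0, 0.2],
    "learning_rate": [1e-3, 1e-4],
}

best_loss, best_params = float("inf"), None
for units, dropout, lr in product(*param_grid.values()):
    model = build_lstm(units, dropout, lr)
    history = model.fit(X_train, y_train, validation_split=0.1,
                        epochs=20, batch_size=32, verbose=0)
    val_loss = min(history.history["val_loss"])
    if val_loss < best_loss:
        best_loss, best_params = val_loss, (units, dropout, lr)

print("Best parameters (units, dropout, learning rate):", best_params)
```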
We then proceeded to implement the optimal parameters obtained through grid search on the final LSTM model shown below:
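Reusing the build_lstm helper from the grid-search sketch above, the final model might be assembled and trained as follows; the epoch and batch-size values are assumptions rather than the project’s exact settings.

```python
# Rebuild the LSTM with the best configuration found by the grid search.
best_units, best_dropout, best_lr = best_params
final_model = build_lstm(best_units, best_dropout, best_lr)

final_model.fit(X_train, y_train, epochs=50, batch_size=32, verbose=0)

# Predictions are still in the scaled [0, 1] range at this point; the scaling
# is inverted before computing the MAPE (see below).
y_pred_scaled = final_model.predict(X_test)
```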
Similarly, we applied the identical process to the GRU model; however, the LSTM model performed slightly better, leading to a lower MAPE score. This small performance advantage led us to select the LSTM model as the final choice for our stock price prediction project.
Metrics and Justification
The success of the project hinges on the choice of evaluation metrics. The Mean Absolute Percentage Error (MAPE) was selected as the primary metric. MAPE measures the average absolute percentage difference between predicted and actual values, providing an intuitive measure of prediction accuracy.
MAPE is an appropriate metric for a stock prediction model because it suits regression problems with continuous outcome values, such as prices, and it provides a clear, interpretable, and scale-invariant measure of prediction accuracy.
The code snippet below calculates the MAPE for the LSTM model. It’s important to mention that, prior to computing the MAPE, a crucial step was to reverse the scaling applied to the predicted output, restoring it to its original price magnitude. Since we had earlier scaled the input data to ensure compatibility with the model, that transformation must be inverted now.
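A sketch of that calculation, assuming the scaler, y_test, and y_pred_scaled defined in the earlier snippets; scikit-learn’s mean_absolute_percentage_error returns a fraction, so it is multiplied by 100 to report a percentage.

```python
from sklearn.metrics import mean_absolute_percentage_error

# Undo the MinMax scaling so predictions and targets are back in price units.
predicted_prices = scaler.inverse_transform(y_pred_scaled)
true_prices = scaler.inverse_transform(y_test.reshape(-1, 1))

# MAPE = mean(|true - predicted| / |true|) * 100
mape = mean_absolute_percentage_error(true_prices, predicted_prices) * 100
print(f"LSTM MAPE: {mape:.2f}%")
```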
Likewise, we followed the same procedure for the GRU model. However, its results were not quite as strong as those achieved with the LSTM model: the LSTM yielded a MAPE of 0.5800, while the GRU achieved 0.58648. Both models performed quite similarly, with the LSTM slightly ahead by a small margin.
Results and Key Insights
The culmination of the project lies in evaluating the model’s performance. Through rigorous testing and evaluation, the LSTM model achieved a MAPE of 0.58 — a promising indication of its predictive capabilities.
We discovered a particularly intriguing aspect during the project: the predictive model demonstrated improved performance when solely considering the target variable, the adjusted closing price, rather than incorporating additional features such as open, high, and low prices. This observation prompted us to consider whether the high level of correlation between these features and the target variable might be contributing to this phenomenon. It’s possible that the inclusion of closely related features could introduce multicollinearity or noise, potentially impacting the model’s predictive accuracy.
Conclusion
In the world of stock market prediction, success is measured by the ability to harness data-driven insights to make informed decisions. While the LSTM model showcased commendable performance, it’s crucial to acknowledge the intricacies of the stock market. Factors like economic indicators, geopolitical events, and market sentiment are just a few of the variables that influence stock prices, contributing to their inherent unpredictability.
Future Enhancements
As the project concludes, the path forward becomes clear. Enhancements could involve integrating a wider range of financial and economic indicators into the model. This holistic approach could result in more robust and accurate predictions, providing investors with greater confidence in their decision-making.
Acknowledgment
We extend our appreciation to Yahoo Finance for providing access to valuable historical data. Their contribution played a pivotal role in the project’s success.