StockProphet: Trying to predict the stock market

Pouya Hallaj
11 min read · Mar 29, 2023

Stock market prediction has long been a topic of great interest for investors and traders around the world. Everyone wants to know if they can predict what the market will do next, and if they can make profitable trades based on those predictions.

However, the stock market is not simply a collection of numbers and charts — it is the collective result of human intelligence and our reactions to it. From the decisions of individual investors to the policies of entire nations, the stock market reflects a vast array of social, economic, and political factors that can be difficult to predict.

In recent years, machine learning has emerged as a powerful tool for analyzing financial data and making predictions about future market trends. In this blog post, we’ll explore the process of building a machine learning model to predict stock prices using Python and decision trees (specifically a random forest, which is an ensemble of decision trees).

We’ll start by downloading historical stock price data from Yahoo Finance, and then use this data to train and test our machine learning model. Along the way, we’ll explore different techniques for data pre-processing, feature engineering, and model evaluation.

By the end of this blog post, you’ll have a better understanding of how machine learning can be used to predict stock prices, as well as a working example of a Python application for doing just that. So let’s dive in and see if we can become a StockProphet!

Photo by Nick Chong on Unsplash

Be sure to check out this article’s GitHub repository to see the full code!

Give me a Star ⭐ if you liked it!

Setting up your environment

Before we dive into the project, let’s first make sure we have all the necessary tools and packages installed. Here are the steps you need to follow:

1. Install Anaconda or Miniconda: These are Python distributions that come with many pre-installed packages and make it easy to manage your environment. You can download either one from its official website.

2. Create a new environment: Open your terminal or Anaconda prompt and create a new environment by running the following command:

conda create --name stockprophet python=3.10.6

This will create a new environment called stockprophet with Python version 3.10.6.

3. Activate the environment: Once the environment is created, activate it by running:

conda activate stockprophet

4. Install the required packages: Finally, install the packages we will be using in this project by running:

conda install -c conda-forge numpy scikit-learn pandas matplotlib seaborn yfinance

Note that yfinance is published on the conda-forge channel; if conda cannot find a package, pip install works as well.

Exploring the Data

Now that we have set up our environment, let’s start exploring the data we will be working with. In this example, we will be using the yfinance library to download stock price data from Yahoo Finance. Here's an example of how to download and visualize the data for Apple (AAPL) for the last week of trading, in 5-minute intervals:

import yfinance as yf
import matplotlib.pyplot as plt

# Download the data for the last week, in 5-minute intervals
data = yf.download("AAPL", period="1wk", interval="5m")

# Plot the closing price
data['Close'].plot()
plt.title('AAPL Stock Price (last week, 5-minute intervals)')
plt.xlabel('Date')
plt.ylabel('Price ($)')
plt.show()

This will produce a plot of the closing price for Apple stock over the last week, in 5-minute intervals.
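
Before moving on, it’s worth a quick look at what yfinance actually returned. A small check (the exact row count depends on when you run the download):

# Inspect the downloaded DataFrame
print(data.shape)   # e.g. (390, 6): a trading week has 78 five-minute bars per day
print(data.head())  # columns: Open, High, Low, Close, Adj Close, Volume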

Prediction

Predicting stock prices is a challenging task that has attracted the attention of many researchers and investors. There are many different approaches to predicting stock prices, but one common technique is to use historical data to identify trends and patterns that can be used to make predictions about future prices.

One important concept in stock price prediction is the concept of “lag”. In simple terms, a lag is a time delay between two related phenomena. In the context of stock price prediction, a lag refers to the relationship between the current price of a stock and its past prices.

Lags are important because they can help us identify patterns in the data that may be useful for predicting future prices. For example, if we observe that a stock’s price tends to increase after a period of low prices, we might use this information to predict that the stock’s price will continue to rise in the future.

To incorporate lags into a stock price prediction model, we typically create lagged features that capture the relationship between the current price and past prices. These features might include the difference between the current price and the price from a specified number of periods ago, or the percentage change in price over a specified number of periods.

Once we have created lagged features, we can use them as inputs to a machine learning model that is trained to predict future prices based on past prices and other relevant information.
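
To make the idea concrete, here is a tiny sketch of lagging on a made-up price series (the numbers are purely illustrative):

import pandas as pd

# A toy closing-price series
prices = pd.Series([100.0, 101.5, 101.0, 102.5, 103.0])

# lag-1 feature: the previous period's price aligned with the current row
lag_1 = prices.shift(1)      # NaN, 100.0, 101.5, 101.0, 102.5

# Two common lagged features: one-period difference and percentage change
diff_1 = prices.diff()       # price change versus one period ago
pct_1 = prices.pct_change()  # percentage change versus one period ago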

Preprocessing the Data

Before we can use the data for training a model, we need to preprocess it to extract relevant features and split it into training and testing sets. We’ll be using the Yahoo Finance API to obtain stock data for Apple Inc. (AAPL) over the last week, with 5-minute intervals.

import yfinance as yf
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer

# Load the data
df = yf.download("AAPL", period="1wk", interval="5m")

# Preprocess the data
df['Close_diff'] = df['Close'].diff()
df = df.dropna()

# Number of lags to use
n_lags = 10

# Create lagged features
for i in range(1, n_lags + 1):
    df.loc[:, f'lag_{i}'] = df.loc[:, 'Close_diff'].shift(i)


# Split the data into training and testing sets
X = df.drop(['Close', 'Close_diff'], axis=1)
y = np.where(df['Close_diff'] > 0, 1, 0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

# Fill in missing values with the mean
imp = SimpleImputer(strategy='mean')
X_train = imp.fit_transform(X_train)
X_test = imp.transform(X_test)


# Scale the data
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

This code loads the AAPL data for the last week in 5-minute intervals and preprocesses it by calculating the interval-to-interval difference in closing prices and dropping any rows with missing data. It then creates lagged features by shifting those differences by a specified number of lags. It splits the data into training and testing sets, with the target variable being whether the current difference is positive or negative. Note that shuffle=False in train_test_split keeps the rows in chronological order, which matters for time-series data: shuffling would let the model peek at the future. Finally, it scales the data using a MinMaxScaler to bring all the features into the same range.

Note that we added two lines of code before scaling the data to fill in the missing values in X_train and X_test with the mean value using SimpleImputer (the lagged columns begin with NaNs, since there is no earlier data to shift in). We then transformed the data with MinMaxScaler and trained a random forest classifier on it.
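
The training step itself isn’t shown in the snippet above; a minimal sketch that fits the rest of the code could look like this (the hyperparameters n_estimators=100 and random_state=42 are assumptions, not values from the original):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Train a random forest on the scaled training data
# (n_estimators and random_state are assumed values)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_scaled, y_train)

# Evaluate on the held-out test set
y_pred = rf.predict(X_test_scaled)
print(classification_report(y_test, y_pred))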

The output would be something like this:

              precision    recall  f1-score   support

           0       0.49      0.51      0.50        37
           1       0.54      0.51      0.53        41

    accuracy                           0.51        78
   macro avg       0.51      0.51      0.51        78
weighted avg       0.51      0.51      0.51        78

The output is showing the performance metrics of the random forest classifier on the test set.

  • precision measures the percentage of correct positive predictions (true positives) out of all positive predictions.
  • recall measures the percentage of true positive predictions out of all actual positives.
  • f1-score is the harmonic mean of precision and recall, which provides a balance between the two metrics.
  • support is the number of samples in each class.

If you want to learn more about these metrics, check out this article by Teemu Kanstrén:
A Look at Precision, Recall, and F1-Score | by Teemu Kanstrén | Towards Data Science

The macro avg is the unweighted mean of precision, recall, and f1-score across the two classes, while the weighted avg weights each class’s metrics by its support. accuracy is the percentage of correctly classified samples out of all samples.

In this case, the overall accuracy of the classifier is 0.51, meaning that it classified 51% of the test samples correctly. The precision, recall, and f1-score are all similar between the two classes, indicating that the classifier is not strongly biased towards one class. However, the low accuracy suggests that the model is not performing very well, and it may need further optimization or feature engineering to improve its performance.

There are several ways to optimize and improve the performance of a random forest classifier:

  1. Feature selection: Not all features may be equally important for classification. You can use feature selection techniques such as correlation matrix, recursive feature elimination, or principal component analysis (PCA) to select the most important features.
  2. Hyperparameter tuning: Random forest classifiers have several hyperparameters such as the number of trees, the number of features to consider at each split, the maximum depth of the trees, and the minimum number of samples required to split an internal node. You can use grid search or random search techniques to find the best combination of hyperparameters that maximizes the performance of the classifier on a validation set (see the sketch after this list).
  3. Ensemble methods: You can combine multiple random forest classifiers or other classifiers such as support vector machines (SVMs) or neural networks to improve the overall performance. This can be done using techniques such as bagging, boosting, or stacking.
  4. Imbalanced data: If the classes are imbalanced, that is, if one class has far fewer samples than the other, you can use techniques such as oversampling or undersampling to balance the classes.
  5. Data preprocessing: You can preprocess the data by scaling, normalizing, or standardizing the features, or by applying other transformations such as PCA or Fourier transforms. This can help the classifier to better capture the underlying patterns in the data.

You can try these methods and see what works best for you!
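
As promised, here is a minimal grid-search sketch for point 2, using scikit-learn’s GridSearchCV (the grid values are arbitrary assumptions, and TimeSeriesSplit is used so that validation folds respect time order instead of leaking future data into the past):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Candidate hyperparameter values (assumed purely for illustration)
param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [3, 5, None],
    'min_samples_split': [2, 10],
}

# Time-ordered folds: validation data always comes after training data
cv = TimeSeriesSplit(n_splits=5)

search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=cv, scoring='accuracy')
search.fit(X_train_scaled, y_train)

print(search.best_params_)
rf = search.best_estimator_  # continue with the tuned model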

However, since we have access to lots of data, we can try and add more samples for our model to learn from.

To increase the sample size for our model, we can adjust the time period and frequency of the data we’re downloading. For example, changing the period parameter in the yf.download() function to a longer time period such as “1y” instead of “1wk” covers far more trading history. Alternatively, changing the interval parameter to a shorter interval such as “1h” instead of “1d” packs more samples into the same period (yfinance only supports specific intervals, e.g. 1m, 5m, 15m, 1h, and 1d). However, this would also increase the amount of time it takes to download the data.

df = yf.download("AAPL", period="1y", interval="1d")

which will give us these results:

              precision    recall  f1-score   support

           0       0.53      0.89      0.67        19
           1       0.89      0.52      0.65        31

    accuracy                           0.66        50
   macro avg       0.71      0.71      0.66        50
weighted avg       0.75      0.66      0.66        50

It looks like switching to a year of daily data has improved the model’s performance considerably: the overall accuracy has increased from 51% to 66%, and the precision, recall, and f1-scores have improved as well. Strictly speaking, the total number of rows actually went down (compare the test supports: 50 versus 78), so the gain likely comes from daily bars being less noisy than 5-minute bars rather than from sheer sample size. It’s important to keep in mind that adding more data may not always improve performance, and there may be other factors at play that affect the model’s accuracy.

Visualizing the results

One way to visualize the results of a binary classification model is by creating a confusion matrix and a ROC curve.

To create a confusion matrix, you can use the confusion_matrix function from the sklearn.metrics module. This will give you a matrix that shows the number of true positives, false positives, true negatives, and false negatives.

To create a ROC curve, you can use the roc_curve function from the sklearn.metrics module. This will give you the false positive rate (FPR) and true positive rate (TPR) for different threshold values. You can then plot the FPR against the TPR to create the ROC curve:

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, roc_curve

# Predict class labels on the test set
y_pred = rf.predict(X_test_scaled)

# Create a confusion matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Create a ROC curve from predicted probabilities rather than hard
# 0/1 labels, so roc_curve has meaningful thresholds to sweep over
y_prob = rf.predict_proba(X_test_scaled)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()
[[17  2]
 [15 16]]

With scikit-learn’s convention, the confusion matrix shows the number of true negatives (top left), false positives (top right), false negatives (bottom left), and true positives (bottom right).

This means that your model correctly predicted 17 true negatives (actual negatives that were predicted as negative) and 16 true positives (actual positives that were predicted as positive). However, it also made 2 false positives (actual negatives that were predicted as positive) and 15 false negatives (actual positives that were predicted as negative). As a sanity check, (17 + 16) / 50 = 0.66, which matches the reported accuracy.
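
Since seaborn is already installed in our environment, an optional way to make the matrix easier to read is an annotated heatmap (a small sketch, not part of the original code; here class 0 means down and class 1 means up):

import seaborn as sns
import matplotlib.pyplot as plt

# Plot the confusion matrix as an annotated heatmap
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Down (0)', 'Up (1)'],
            yticklabels=['Down (0)', 'Up (1)'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()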

But when to buy/sell?!

To determine when to buy or sell using this model, you can use the predicted class labels from the model as a signal. If the model predicts a class label of 1, it indicates that the model believes the stock price will go up, and you may want to consider buying. Conversely, if the model predicts a class label of 0, it indicates that the model believes the stock price will go down, and you may want to consider selling.

To test the model on a new data point, you need to create a feature vector for the data point and then pass it to the model’s predict method. Here's an example:

Let’s say you want to test the model on the following data point:

{
    "Open": 133.5,
    "High": 135.2,
    "Low": 132.5,
    "Close": 134.2,
    "Adj Close": 134.2,
    "Volume": 89347166
}

You can create a feature vector for this data point as follows:

# Test the model on a real-world data point
import numpy as np

# The model was trained on 15 features: Open, High, Low, Adj Close,
# Volume, plus the ten lagged features lag_1 ... lag_10
# (Close and Close_diff were dropped from X during preprocessing)
feature_names = ["Open", "High", "Low", "Adj Close", "Volume"] + [f"lag_{i}" for i in range(1, 11)]

# For a single new quote we don't know the lag values, so we fill them
# with zeros as placeholders; in practice you would compute them from
# the most recent price history. Note that "Close" from the quote above
# is not a model feature, so only five of its six values are used.
new_data = np.array([133.5, 135.2, 132.5, 134.2, 89347166] + [0.0] * 10)


# Reshape the feature vector
new_data = new_data.reshape(1, -1)

# Scale the feature vector
new_data_scaled = scaler.transform(new_data)

# Make a prediction using the trained model
prediction = rf.predict(new_data_scaled)

print(f"Prediction: {prediction}")

The prediction variable will contain the predicted label for the new data point. If the predicted label is 1, it means the model predicts a buy signal, and if the predicted label is 0, it means the model predicts a sell signal.

Note that if we run the code without the lag placeholders (say, by passing just the six raw values from the dictionary above), we will get an error:

    360         f"X has {n_features} features, but {self.__class__.__name__} "
    361         f"is expecting {self.n_features_in_} features as input."
    362         )

ValueError: X has 6 features, but MinMaxScaler is expecting 15 features as input.

The error message says that the MinMaxScaler expects 15 features as input, but this input only has 6. That is because the scaler was fitted on a dataset with 15 features: the five price/volume columns plus the ten lag features.
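
If you’re ever unsure how many features a fitted transformer expects, scikit-learn stores it on the fitted object, and the column order can be read off the X DataFrame from the preprocessing step:

# How many features the fitted scaler expects
print(scaler.n_features_in_)  # 15

# The column order the model was trained on
print(list(X.columns))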

Conclusion

We have used a Random Forest classifier to predict whether to buy or sell a stock based on historical data. We have evaluated the model’s performance on a test dataset and achieved an accuracy of 66%. While this accuracy is not high enough to make trading decisions solely based on the model’s predictions, it is still a promising result. We have also discussed the limitations of this model and the importance of considering additional factors such as market trends, news, and company fundamentals when making investment decisions. Overall, this project provides a good starting point for those interested in using machine learning to predict stock prices and serves as a reminder that no single model can capture the complexity of the stock market.

In the future, we plan to explore other machine learning models and techniques to predict stock prices. One such technique is using Long Short-Term Memory (LSTM) networks, a type of neural network that is particularly well-suited for sequential data like time series. LSTM networks have shown promising results in predicting stock prices and we hope to implement them in a future iteration of this project.

Link to the GitHub repo:

Pouyaexe/StockProphet (github.com)


Pouya Hallaj

Tech addict, fitness fanatic, and ML devotee. Fueled by a love for programming and cutting-edge tech! Let's take on the world one algorithm at a time!