Stock Buy-Sell-Hold Prediction using CNN

Aidan Thompson
15 min read · Apr 29, 2024

What are CNNs?

Convolutional Neural Networks (CNNs) are a class of deep learning (DL) models specifically designed for processing and understanding visual data, like images and videos. They are inspired by how our brains, which are composed of many layers of neurons, process information and recognize patterns in the visual world.

At the heart of CNNs is an operation called convolution. So, what is convolution, especially in the realm of image-processing?

What are Convolutions?

Let us take a simple example of convolving two 1D arrays. Suppose the arrays are a = [1, 2, 3, 4, 5] and b = [0, 1, -1, 2, -2]. Then the convolution of a and b, denoted by (a * b), is another array whose i-th element (indexing starts from 0) is given by:

(a * b)_i = Σ_j a_j · b_(i−j),

where the sum runs over every j for which both a_j and b_(i−j) exist. The result has len(a) + len(b) − 1 elements and turns out to be [0, 1, 1, 3, 3, 3, -3, 2, -10] in this case.
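If you want to verify this yourself, NumPy's np.convolve computes exactly this full convolution:

import numpy as np

a = np.array([1, 2, 3, 4, 5])
b = np.array([0, 1, -1, 2, -2])

# The default "full" mode returns all len(a) + len(b) - 1 overlap positions.
print(np.convolve(a, b))  # [  0   1   1   3   3   3  -3   2 -10]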

In image processing too, convolution works similarly, albeit with 2D matrices. The original image is convolved with a smaller 2D matrix, called a kernel, to produce a new image with the desired modifications.

Every colour unit of an image is called a pixel. Each pixel can be viewed as a vector containing 3 values, for red, green and blue respectively, with each value varying from 0 to 255.

When we convolve an image with a 2D (ideally an n x n square) kernel, we go over each pixel; in each round the current pixel is aligned with the central element of the kernel, and its immediate neighbours with the corresponding kernel elements.

Each pixel's RGB vector is then multiplied by the corresponding kernel element. The scaled RGB vectors are summed, divided by the number of kernel elements, and assigned to the pixel under consideration. Here is a simple visualization of it:

One of the simplest kernels is the 3 x 3 matrix with all elements 1. If we convolve an image with this matrix, each pixel is replaced by the average of itself and its immediate neighbours, which gives the effect of each pixel "bleeding" into its neighbours, causing a blurring effect on the image. As an alternative to RGB vectors, pixels can sometimes be assigned scalar values based on certain predetermined criteria.
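To make the blur kernel concrete, here is a minimal sketch using SciPy's convolve2d on a single-channel (grayscale) image; the pixel values are made up for the example, and a real RGB image would get the same treatment per colour channel:

import numpy as np
from scipy.signal import convolve2d

# A tiny grayscale "image" (values 0-255).
image = np.array([
    [10, 10, 10, 200, 200],
    [10, 10, 10, 200, 200],
    [10, 10, 10, 200, 200],
    [10, 10, 10, 200, 200],
])

# 3x3 box-blur kernel: all ones divided by the number of elements,
# so each output pixel is the average of itself and its 8 neighbours.
kernel = np.ones((3, 3)) / 9.0

blurred = convolve2d(image, kernel, mode='same', boundary='symm')
print(blurred.round(1))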

Back to CNNs

The new numbers obtained are mapped onto another matrix, which is called a feature map. This feature map represents various aspects of the image, like brightness, edges or sharpness, based on the kernel used and how pixels are assigned values.

The elements of the feature map are then passed through an activation function, usually ReLU (Rectified Linear Unit). This introduces non-linearity and allows the network to learn and model complex data by specifying which elements of the feature map are to be activated. The ReLU function is defined as max(0, x), which activates only those elements which are strictly greater than 0.

Now, for the usual images we encounter (with resolutions like 300 PPI), the feature maps would be ENORMOUS, with literally millions of elements. We would like to summarise this vast amount of information into something compact. This can be done with a process called pooling, which runs a filter over each section of the image (without covering any section twice, moving in fixed steps called the stride) and summarises the information in a specified manner. Pooling helps reduce the dimensionality of the problem. We could have:

  1. Max Pooling: From each section the filter passes through, we take only the maximum value. This works well when ReLU returns a lot of zeroes, so that the few positive entries become the defining characteristics of that section (see the sketch after this list).
  2. Average Pooling: This works as above, but from each section we take the average value.
  3. Global Pooling: Returns the average or max value of the entire feature map. Typically used near the end of a CNN to summarise what the network has learned.
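As a concrete illustration of the max pooling described above, here is a minimal NumPy sketch of 2 x 2 max pooling with a stride of 2 on a small, made-up feature map:

import numpy as np

feature_map = np.array([
    [1, 3, 2, 0],
    [4, 6, 1, 2],
    [0, 2, 5, 7],
    [1, 1, 3, 8],
])

# Split the 4x4 map into non-overlapping 2x2 blocks (stride 2)
# and keep only the maximum of each block.
blocks = feature_map.reshape(2, 2, 2, 2).swapaxes(1, 2)
pooled = blocks.max(axis=(2, 3))
print(pooled)
# [[6 2]
#  [2 8]]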

This sequence of events can be repeated multiple times. Finally we feed this filtered feature map as a flattened set of nodes into the Dense Layer.

Dense Layer: this layer consists of multiple neurons and builds the function that best fits the data for predicting the classes. The working of a single neuron is explained below.

WORKING OF THE MODEL: HOW WE USE A CNN FOR SIGNAL PREDICTION

The idea is fairly simple: calculate multiple technical indicators with different period lengths (explained below), giving a total of 116 features. Then reduce the number of features by keeping the best 81 among them and convert those 81 features into 9x9 images. Label the data as buy/sell/hold based on an algorithm (discussed further below). Then train a Convolutional Neural Network like any other image-classification problem.

We will break our model into the following parts:

  1. Feature Extraction
  2. Feature Engineering and Feature Selection
  3. Labeling Strategy
  4. Feature importance in the prediction
  5. Image Creation
  6. Convolutional Neural Network Architecture
  7. Training, Model Evaluation and Results
  8. Reason for bad performance in Confusion matrix
  9. Dealing with unbalanced classes
  10. The final performance of the model
  11. References

Now we will proceed through each of these parts to understand the whole process of the model.

Feature Extraction:

Here we will understand how we prepare our data for training and testing of our model. Basically, we need to preprocess our raw data to extract meaningful insights.

Data Source: Our data processing begins with gathering historical stock price data. Here we have used Google stock price data, taken from Alpha Vantage.

Technical Indicators: To create features for our model we have used various indicators with different time periods. This gives nearly 106 features, and adding OHLC (Open, High, Low, Close), Volume and Adjusted Close brings the total to 112 features.

We will explain the concept of technical indicators and time periods using the Simple Moving Average (SMA), since it is the simplest. This should be enough to convey the idea.

A moving average of a list of numbers is like an arithmetic average, but instead of averaging all the numbers, we average only the first n numbers (n is referred to as the window size or time period), then slide the window by one index, excluding the first element and including the (n+1)-th, and average again. This process continues. Here is an example to drive the point home:

This is an example of SMA on a window size of 6.

The SMA of the first six elements is shown in orange.
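For reference, here is a minimal sketch of the same idea in pandas; the price series is made up, and rolling(window=6).mean() gives the SMA with a window size of 6:

import pandas as pd

close = pd.Series([10, 12, 11, 13, 15, 14, 16, 18, 17, 19])

# Each value is the average of the current close and the 5 preceding closes;
# the first 5 entries are NaN because the window is not yet full.
sma6 = close.rolling(window=6).mean()
print(sma6)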

Feature Engineering and Feature Selection

The reason for having this step in our pipeline is that feature engineering involves creating, transforming, or selecting the most relevant variables in the dataset to improve model performance. This process is critical because it enables the model to learn from the data more effectively, leading to better predictions and insights. Feature selection, in particular, trims the feature set down to the most valuable attributes, reducing complexity, computation time and the risk of overfitting. In essence, these techniques streamline the modelling process, making it more accurate, interpretable and efficient, while also saving time and resources.

First, we normalize our dataset. Normalizing is crucial to ensure that all features are on a consistent scale: it prevents any one feature from dominating the analysis, improves model convergence, aids in interpreting feature importance, makes distance-based algorithms more reliable and aligns with the assumptions of regularisation techniques, all of which contribute to more effective and robust analysis and machine learning.

Then we apply a small variance filter with a threshold of 0.1, i.e. any feature whose variance is less than or equal to 0.1 will be removed from the dataset, since such low variance does not contribute much to detecting trends in the data.
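A minimal sketch of this variance filter with scikit-learn's VarianceThreshold, assuming the normalized features sit in a DataFrame named features (a name used here purely for illustration):

from sklearn.feature_selection import VarianceThreshold

# Drop every feature whose variance is less than or equal to 0.1.
selector = VarianceThreshold(threshold=0.1)
reduced = selector.fit_transform(features)

# Names of the columns that survive the filter.
kept_columns = features.columns[selector.get_support()]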

Now for the main step: we calculate the importance of each feature using a Random Forest Classifier and then reduce the feature set to 81. Here is a bar graph of the features and their importance scores.

A rough idea of how this works: at each decision point (split) in each tree, the Random Forest algorithm measures the decrease in impurity (Gini impurity) resulting from the split. The impurity reduction brought about by each feature is averaged across all the trees in the forest; features that consistently reduce impurity more when used in splits are considered more important.
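A rough sketch of how this selection could look with scikit-learn, assuming X holds the variance-filtered features as a DataFrame and y holds the buy/sell/hold labels (both names are placeholders for illustration):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Fit a random forest and read off its impurity-based feature importances.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

importances = pd.Series(rf.feature_importances_, index=X.columns)

# Keep the 81 most important features, which later form the 9x9 images.
top81 = importances.sort_values(ascending=False).head(81).index
X_selected = X[top81]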

We can also visualise how interpretable each feature is: a good feature gives a clear separation between the classes, making it easier for the model to classify them.

Here are three examples of features with their density distributions: WILLR3, RSI1 and OPEN. Of these, WILLR3 has the highest importance, and from its distribution we can easily read an interpretation: below -60 it indicates a buy signal, above it a sell, with hold distributed fairly evenly. RSI1 has a moderate importance score but is still somewhat distinguishable, whereas OPEN, which has the lowest score, is very difficult to interpret, hence its lowest importance.

Labeling Strategy:

Our labeling strategy uses a window of 11 days on the close price: if the middle value is the maximum within the window, we label the middle day as 'sell'; if the middle value is the minimum, we label that day as 'buy'; otherwise we label it as 'hold'. The idea is to buy at troughs and sell at crests within any 11-day window. Here is the direct implementation of it:

windowSize = 11

def labl(df, windowSize=11):
    labels = []
    values = []
    for i in range(len(df.close) - windowSize):
        # Track the minimum and maximum close over df.close[i : i + windowSize + 1]
        mx = df.close[i]
        mn = df.close[i]
        mxIndex, mnIndex = i, i
        for j in range(i + 1, i + windowSize + 1):
            if df.close[j] > mx:
                mx = df.close[j]
                mxIndex = j
            if df.close[j] < mn:
                mn = df.close[j]
                mnIndex = j
        # Label day i + windowSize: 1 (buy) if it is the window minimum,
        # 0 (sell) if it is the window maximum, 2 (hold) otherwise.
        if mnIndex == i + windowSize:
            labels.append(1)
            values.append(i + windowSize)
        elif mxIndex == i + windowSize:
            labels.append(0)
            values.append(i + windowSize)
        else:
            labels.append(2)
            values.append(i + windowSize)
    return labels, values

After adding all these features and labeling, our data looks somewhat like this:

Feature importance in prediction:

We used mutual information classification to create a heat map for our images. Basically, it tells us which parts of the image are most useful to the model (explained further below).
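A minimal sketch of this step with scikit-learn's mutual_info_classif, assuming X_selected and y from the sketch above; reshaping the scores onto the image grid (here assumed to be 9 x 9 for 81 features) gives the heat map:

from sklearn.feature_selection import mutual_info_classif

# Mutual information between each feature (future pixel) and the label.
mi_scores = mutual_info_classif(X_selected, y)

# Arranged on the image grid, the scores show which parts of the image
# carry the most information about the buy/sell/hold label.
heatmap = mi_scores.reshape(9, 9)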

IMAGE CREATION:

As of now we have tabular data of 20 features, which we need to convert into 5 x 4 images. To do this we transform the data from a pandas DataFrame into a NumPy array and then reshape that array to (5, 4). The implementation is here:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X = d[d.columns[0:20]]
Y = d['labeldd']

# TRANSFORMING DATA
scaler = StandardScaler()
X = scaler.fit_transform(X)

# SPLITTING THE DATA
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

# REPLACING THE NaN VALUES WITH THE MEAN OF X_TRAIN / X_TEST
X_train = np.nan_to_num(X_train, nan=np.nanmean(X_train))
X_test = np.nan_to_num(X_test, nan=np.nanmean(X_test))

X_a = np.array(X_train)
X_b = np.array(X_test)
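The flat rows then still need to be reshaped into the 5 x 4 "images" the network expects, and the labels one-hot encoded for categorical_crossentropy; a minimal sketch of that final step, assuming 20 scaled features per row and a leading channel dimension to match the input_shape=(1, 5, 4) used below:

from tensorflow.keras.utils import to_categorical

# Each row of 20 scaled features becomes a single-channel 5x4 "image".
X_a = X_a.reshape(-1, 1, 5, 4)
X_b = X_b.reshape(-1, 1, 5, 4)

# One-hot encode the three labels (0 = sell, 1 = buy, 2 = hold).
y_train = to_categorical(y_train, num_classes=3)
y_test = to_categorical(y_test, num_classes=3)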

Here is what the images look like:

Convolutional Neural Network Architecture:

Now that we are done with all the steps necessary to process the data to feed into our model, in this step we will look at the model architecture and the various layers we use to predict the classes. Here is the flow chart of our whole model with its implementation in Python:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dense, Dropout, Flatten

def simple_model():
    model = Sequential()
    model.add(Conv2D(32, (5, 5), padding='same', input_shape=(1, 5, 4), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))
    model.add(Dense(10, activation='relu'))
    model.add(Dropout(0.1))
    model.add(Flatten())
    model.add(Dense(10, activation='relu'))
    model.add(Dense(3, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='Adam', metrics=['accuracy'])
    return model

Since we have covered all the steps in detail above, here we will go through the layers briefly and then proceed to the evaluation part. The layers we used are:

  • Conv2D (32 filters, 5x5): Detects features using 32 filters, each with a 5x5 kernel.
  • MaxPooling2D (2x2): Reduces data dimensions while preserving important features.
  • Dense (10 neurons, ReLU): Connected layer for complex feature representation.
  • Dropout (0.1): Prevents overfitting by randomly disabling 10% of neurons.
  • Flatten: Reshapes data for fully connected layers.
  • Dense (10 neurons, ReLU): Further processing for higher-level features.
  • Dense (3 neurons, softmax): Output layer for class probabilities.
  • Compile (categorical_crossentropy, Adam): Prepares the model for training with a loss function and optimizer.

Here is the summary of our model :

Training, Model Evaluation and Results:

We split our data into train and test sets, with 80% for training and the rest for testing. Now we train the model on the training set and evaluate it on the test set. We set the number of epochs to 20 with a batch size of 32; here is the implementation:
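A minimal sketch of the training and evaluation calls, assuming the reshaped image arrays and one-hot labels from the earlier sketches and the simple_model() defined above:

model = simple_model()

# Train for 20 epochs with a batch size of 32 on the 80% training split.
history = model.fit(X_a, y_train, epochs=20, batch_size=32, verbose=1)

# Evaluate on the held-out 20% test split.
loss, accuracy = model.evaluate(X_b, y_test, verbose=1)
print("Test acc:", accuracy * 100)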

And here is the confusion matrix:

As you can see, the model's performance in predicting class 2, i.e. hold, is good, but for classes 0 and 1 it is much worse. What is the reason for this performance and how do we resolve it? We will explore that next.

Reason for bad performance in Confusion matrix:

The main reason behind the low number of correct predictions for classes 0 and 1 is class imbalance, and here is how it affects our model.

An imbalanced dataset occurs when the distribution of class labels is uneven, meaning that one or more classes have significantly fewer instances than others. Training a Convolutional Neural Network (CNN) on such imbalanced data can lead to various issues and challenges, which may explain the poor performance of our model and the resulting confusion matrix. Here are some reasons behind the poor performance:

Bias Towards Majority Class: In an imbalanced dataset, the majority class has more representation, and the model may become biased towards predicting this majority class more often. This bias can lead to poor generalization and lower performance on minority classes.

Limited Learning from Minority Classes: With fewer examples of minority classes, the model may struggle to learn their distinguishing features and variations. As a result, it may fail to correctly classify instances from these classes.

Difficulty in Decision Boundary: When classes are imbalanced, the decision boundary learned by the model may be biased towards the majority class. As a result, it might not effectively capture the underlying patterns of the minority class, leading to misclassifications.

Loss Function Imbalance: If using a standard loss function like cross-entropy, the model may prioritize minimizing the error on the majority class due to its higher representation. This can lead to weaker learning in minority classes.

Now that we have a clear idea of why our model is not performing well, we will resolve this in the next steps.

Dealing with unbalanced classes:

The issue of imbalanced classes is very common when training models. For example, if you were building a model to predict credit card fraud, the number of fraudulent transactions would be very low, making it difficult for a model to learn from them. Here are some of the ways to tackle class imbalance.

Data Augmentation: Generate new examples for the minority classes using techniques like rotation, flipping, scaling, and cropping. This can help balance the dataset and provide the model with more data to learn from.

Resampling: Either oversample the minority class (duplicate instances) or undersample the majority class (remove instances) to balance class distribution.

Weighted Loss Function: Modify the loss function to assign higher weights to the minority class during training. This can help the model pay more attention to the minority class and reduce the bias towards the majority class.

Transfer Learning: Utilize pre-trained models on a larger dataset and fine-tune them on your imbalanced dataset. This can help the model learn more general features before focusing on the imbalanced classes.

Synthetic Data Generation: Use techniques like Synthetic Minority Over-sampling Technique (SMOTE) to generate synthetic instances for the minority class, helping to balance the data distribution.
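Although we use oversampling below, the weighted-loss idea above can also be expressed directly in Keras by passing per-class weights to fit; a minimal sketch, assuming y_train_labels holds the integer labels before one-hot encoding (a placeholder name for illustration):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Weights inversely proportional to class frequency, so rarer classes count more in the loss.
classes = np.unique(y_train_labels)   # e.g. [0, 1, 2]
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train_labels)

model.fit(X_a, y_train, epochs=20, batch_size=32, class_weight=dict(zip(classes, weights)))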

In our model we have used Random Over Sampling for our data. Here is a brief overview of how random oversampling works:

  1. Identify the minority class: Determine which class has fewer samples compared to the other class(es).
  2. Randomly duplicate samples: Randomly select samples from the minority class and create duplicate copies of them. This is done to increase the number of samples in the minority class until it reaches a desired balance.
  3. Adjust the class distribution: The Random Over Sampler continues to randomly duplicate samples from the minority class until its size is close to that of the majority class. This helps to create a more balanced dataset.

Here is how we applied it, using the 'imblearn' library:

import numpy as np
from imblearn.over_sampling import RandomOverSampler
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X = d[d.columns[0:20]]
Y = d['labeldd']

# TRANSFORMING DATA
scaler = StandardScaler()
X = scaler.fit_transform(X)

# APPLYING THE RANDOM OVER SAMPLER
os = RandomOverSampler()
Xs, Ys = os.fit_resample(X, Y)

# SPLITTING THE DATA
X_train, X_test, y_train, y_test = train_test_split(Xs, Ys, test_size=0.2)

# REPLACING THE NaN VALUES WITH THE MEAN OF X_TRAIN / X_TEST
X_train = np.nan_to_num(X_train, nan=np.nanmean(X_train))
X_test = np.nan_to_num(X_test, nan=np.nanmean(X_test))

X_a = np.array(X_train)
X_b = np.array(X_test)

After running the Random Over Sampler our data looks like fig(1): as you can see, the number of samples in classes 0, 1 and 2 is now the same. This should help improve performance, since the model now gets more samples with labels 0 and 1 and therefore a more diverse range of data to learn from, increasing accuracy and performance.

The final performance of the model

Now we will train our model with this modified data and see how it performs.

The CNN model is the same, with a slight change in the number of neurons in the dense layers.

Here is the model:

def CNN_model():
    model = Sequential()
    model.add(Conv2D(32, (5, 5), padding='same', input_shape=(1, 5, 4), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))
    model.add(Dense(200, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Flatten())
    model.add(Dense(200, activation='relu'))
    model.add(Dense(3, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='Adam', metrics=['accuracy'])
    return model

And the summary:

Now for the final part, i.e. the evaluation of our model. We train the model for 500 epochs (an epoch refers to a single pass through the entire training dataset during the model training process):

modeld = CNN_model()
history = modeld.fit(X_a, y_train, epochs=500, batch_size=32, verbose=1)
loss, accuracy = modeld.evaluate(X_b, y_test, verbose=1)
print("Test acc:", accuracy * 100)

And after 500 epochs, the final accuracy of our model comes out to be 93%.

Now for the main metric, the confusion matrix:

As you can see, this time the model has performed much better than it did on the imbalanced dataset.

REFERENCES:

  1. Omer Berat Sezer, Murat Ozbayoglu, 'Algorithmic Financial Trading with Deep Convolutional Neural Networks: Time Series to Image Conversion'.
