ML Algorithms in the Markets. Part 5: Using the k-Nearest Neighbour algorithm to Improve a Mean Reversion Strategy in Python

Thornexdaniel
May 30, 2023 · 9 min read


This article presents how a k-Nearest Neighbour (k-NN) model can improve the performance of a mean reversion strategy. The result is then backtested to provide a performance indicator. This is part 5 of 5 in the ML Algorithms in the Markets series; at the end of the series, the ML methods will be evaluated against each other.

Intro
The main idea behind the k-NN algorithm is that similar things are near each other. When a prediction is needed for an unseen data point, the algorithm looks at the ‘k’ closest data points, or the ‘k’ nearest neighbours, in the training set and assigns a value or class based on these neighbours.

It works in two main modes:

  1. Classification: For a given data point that needs to be classified, the algorithm identifies the ‘k’ nearest neighbours from the training set. The data point is then assigned to the class with the most representatives among those ‘k’ neighbours. For example, if k=5, and 3 out of the 5 nearest neighbours are class A and 2 are class B, the data point would be classified as class A (see the toy sketch after this list).
  2. Regression: For regression problems, the algorithm calculates the average (or sometimes the median) of the ‘k’ nearest neighbours.
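
As a minimal illustration of the majority vote (toy, made-up data, not market data):

from sklearn.neighbors import KNeighborsClassifier

# Toy one-dimensional feature with two classes (hypothetical values)
X = [[1.0], [1.2], [1.4], [3.0], [3.2]]
y = ['A', 'A', 'A', 'B', 'B']

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X, y)

# With k=5, 3 of the 5 neighbours are class A, so the point is labelled A
print(model.predict([[2.0]]))  # ['A']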

The distance between data points can be calculated in various ways, but the most common method is the Euclidean distance. Other methods include Manhattan distance, Minkowski distance, etc.
A key parameter to the k-NN algorithm is the value of ‘k’. If ‘k’ is too small, the model may be overly sensitive to noise in the data. If ‘k’ is too large, the model may include points that are too far away and thus less relevant. Selecting the right ‘k’ often involves some trial and error, or techniques such as cross-validation.
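
In scikit-learn the distance measure is just a constructor argument, so swapping metrics is a one-line change (a sketch):

from sklearn.neighbors import KNeighborsClassifier

# The default metric is Minkowski with p=2, i.e. Euclidean distance
knn_euclidean = KNeighborsClassifier(n_neighbors=5)

# Manhattan distance is Minkowski with p=1
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')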

The k in the k-Nearest Neighbours (k-NN) algorithm is a hyperparameter that you choose, and it represents the number of nearest neighbours to consider when making a prediction. k can take any integer value, from 1 up to the total number of observations in your training set.
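
A common way to search for k is cross-validation; a sketch follows, assuming the features_train and target_train arrays built in the scripts below. Note that plain k-fold shuffles the data, so for time-ordered data a chronological scheme such as scikit-learn's TimeSeriesSplit is the safer choice.

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Mean cross-validated accuracy for each candidate k
scores = {}
for k in range(1, 20):
    model = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(model, features_train, target_train, cv=5).mean()

best_k = max(scores, key=scores.get)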

By construction, k-NN is not a ‘forecasting’ algorithm: it takes a data point that has already occurred and then classifies it. Therefore, in this case, I have decided that the k-NN will be used to decide whether the data (price) is in a downtrend or an uptrend, and a trade will be placed accordingly. In theory, it will operate as follows:
For a given interval, find the ‘k’ intervals in the training set whose feature values are most similar to that interval’s. Then take the average of the stock’s direction over those ‘k’ intervals (a majority vote, in effect) as the predicted movement for the prediction interval.
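
A sketch of that procedure using scikit-learn's NearestNeighbors directly, assuming the scaled feature matrices and direction labels built in the script below:

from sklearn.neighbors import NearestNeighbors

# Index the training intervals by feature similarity
nn = NearestNeighbors(n_neighbors=5)
nn.fit(features_train)

# Find the 5 training intervals most similar to one unseen interval
distances, indices = nn.kneighbors(features_test[:1])

# Average the direction of those neighbours; above 0.5 means 'mostly up'
predicted_up = target_train.iloc[indices[0]].mean() > 0.5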

import yfinance as yf
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.preprocessing import StandardScaler
import numpy as np

# Download the data
tesla = yf.download('TSLA', start='2023-04-14', end='2023-04-28', interval='15m')

# Create labels: 1 if the price increased in the next interval, 0 if it decreased or stayed the same
tesla['Direction'] = np.where(tesla['Close'].shift(-1) > tesla['Close'], 1, 0)

# Define the features and target
features = tesla.drop(['Direction', 'Close'], axis=1) # removing Close as it directly influences the label
target = tesla['Direction']

# Normalize the features
scaler = StandardScaler()
features = scaler.fit_transform(features)

# Split the data into a training set and a test set
features_train, features_test, target_train, target_test = train_test_split(features[:-1], target[:-1], test_size=0.2, random_state=42)

# Test different k values
for k in range(1, 20):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(features_train, target_train)
    predictions = model.predict(features_test)
    accuracy = accuracy_score(target_test, predictions)
    print(f"For k = {k}, accuracy = {accuracy}")

For k = 1, accuracy = 0.5961538461538461
For k = 2, accuracy = 0.5
For k = 3, accuracy = 0.4230769230769231
For k = 4, accuracy = 0.4807692307692308
For k = 5, accuracy = 0.46153846153846156
For k = 6, accuracy = 0.5192307692307693
For k = 7, accuracy = 0.5384615384615384
For k = 8, accuracy = 0.5961538461538461
For k = 9, accuracy = 0.5961538461538461
For k = 10, accuracy = 0.5769230769230769
For k = 11, accuracy = 0.5192307692307693
For k = 12, accuracy = 0.5192307692307693
For k = 13, accuracy = 0.5384615384615384
For k = 14, accuracy = 0.5961538461538461
For k = 15, accuracy = 0.5384615384615384
For k = 16, accuracy = 0.5576923076923077
For k = 17, accuracy = 0.5
For k = 18, accuracy = 0.5192307692307693
For k = 19, accuracy = 0.5

This script first downloads the data from Yahoo Finance. Then it creates a label that indicates whether the price went up or down in the next interval. It normalizes the features, randomly splits the data into a training set and a test set, and then tests different values of ‘k’ to see which gives the highest accuracy on the test set.

import matplotlib.pyplot as plt

# Choose the best 'k' based on previous step
best_k = 1

# Retrain the model using the best 'k'
model = KNeighborsClassifier(n_neighbors=best_k)
model.fit(features_train, target_train)

# Predict the direction for all data points (note: this includes the
# training rows, so much of this prediction is in-sample)
tesla['Predicted Direction'] = model.predict(features)

# Calculate returns from the strategy
tesla['Return'] = np.where(tesla['Predicted Direction'].shift() == 1, tesla['Close'].pct_change(), 0)

# Calculate cumulative returns
tesla['Cumulative Return'] = (1 + tesla['Return']).cumprod()

# Plot cumulative returns
plt.plot(tesla['Cumulative Return'])
plt.title('Cumulative Returns from k-NN Strategy')
plt.xticks(rotation=90)
plt.show()

The trading strategy goes long when the predicted direction is up and moves to cash when the predicted direction is down. The plot shows the cumulative returns.

Cumulative returns for k = 1

This plot gives off very ‘too good to be true’ vibes. That is often a sign of overfitting, and here it comes from two sources: the training and test sets were randomised, so the model has effectively seen neighbouring intervals of the test data, and a very small ‘k’ makes the model overly sensitive to noise in the data. It ends up “memorizing” the training data rather than learning to generalize from it.
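
A quick way to confirm the memorisation is to compare in-sample and out-of-sample accuracy (a sketch, reusing the arrays from the script above):

model = KNeighborsClassifier(n_neighbors=1)
model.fit(features_train, target_train)

# k=1 is perfect on the training data it has memorised
train_accuracy = accuracy_score(target_train, model.predict(features_train))
test_accuracy = accuracy_score(target_test, model.predict(features_test))

# A large gap between the two scores is the classic overfitting signature
print(f"train: {train_accuracy:.2f}, test: {test_accuracy:.2f}")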

Cumulative returns for k = 9, with price action for the Tesla data in 15-minute intervals across the date range

For the k=9 plot, the returns are also relatively promising and suggest the model fits well. However, there are some issues with the construction of the model, which I will now address.

Instead of splitting the data randomly, the training and test sets are now created with a hard split point (80/20, see below). Randomising the data does help the training and test sets represent the overall distribution, but for financial time-series data it usually makes sense to preserve the temporal order: the goal is to predict future prices from past prices, so it makes no sense to train on data from the future.

# Calculate the split point
split_point = int(len(features) * 0.8)
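
The remaining lines then index by position rather than at random (the same pattern appears in the full mean reversion script further down):

# Split chronologically: first 80% for training, last 20% for testing
features_train = features[:split_point]
features_test = features[split_point:]
target_train = target[:split_point]
target_test = target[split_point:]

Re-running the k sweep with this chronological split gives: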

For k = 1, accuracy = 0.6153846153846154
For k = 2, accuracy = 0.5384615384615384
For k = 3, accuracy = 0.5192307692307693
For k = 4, accuracy = 0.4807692307692308
For k = 5, accuracy = 0.5769230769230769
For k = 6, accuracy = 0.5384615384615384
For k = 7, accuracy = 0.5769230769230769
For k = 8, accuracy = 0.5192307692307693
For k = 9, accuracy = 0.5192307692307693
For k = 10, accuracy = 0.46153846153846156
For k = 11, accuracy = 0.5769230769230769
For k = 12, accuracy = 0.5576923076923077
For k = 13, accuracy = 0.5769230769230769
For k = 14, accuracy = 0.5576923076923077
For k = 15, accuracy = 0.5769230769230769
For k = 16, accuracy = 0.5769230769230769
For k = 17, accuracy = 0.5769230769230769
For k = 18, accuracy = 0.5961538461538461
For k = 19, accuracy = 0.5769230769230769

Cumulative returns using k = 7, with price action for the Tesla data in 15-minute intervals across the date range

Even with the hard split between training and test sets, the returns from this strategy are quite promising, with limited downside across the forecast period.

Adding the k-NN to a Mean Reversion Strategy

import yfinance as yf
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score  # needed for the k sweep below

# Download the data
tesla = yf.download('TSLA', start='2023-04-14', end='2023-04-28', interval='15m')

# Define the moving average window
window = 15

# Calculate the moving average
tesla['Moving Average'] = tesla['Close'].rolling(window=window).mean()

# Create labels: 1 if the price is below the moving average (expecting it to rise towards the mean),
# 0 if it is above (expecting it to fall towards the mean)
tesla['Below Moving Average'] = np.where(tesla['Close'] < tesla['Moving Average'], 1, 0)

# Define the features and target
features = tesla.drop(['Below Moving Average', 'Close', 'Moving Average'], axis=1)
target = tesla['Below Moving Average']

# Normalize the features
scaler = StandardScaler()
features = scaler.fit_transform(features)

# Calculate the split point
split_point = int(len(features) * 0.8)

# Split the data into a training set and a test set
features_train = features[:split_point]
features_test = features[split_point:]
target_train = target[:split_point]
target_test = target[split_point:]

# Test different k values
for k in range(1, 20):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(features_train, target_train)
    predictions = model.predict(features_test)
    accuracy = accuracy_score(target_test, predictions)
    print(f"For k = {k}, accuracy = {accuracy}")

In this script, the moving average of the closing price over the last 15 intervals is calculated. Then a new target variable, ‘Below Moving Average’, is defined: 1 if the current price is below the moving average, and 0 if it is above.
The features are normalised, the data is split chronologically into training and test sets, and a k-Nearest Neighbours classifier is trained for different values of k. For each k, the accuracy of the model on the test data is calculated.

For k = 1, accuracy = 0.5192307692307693
For k = 2, accuracy = 0.5192307692307693
For k = 3, accuracy = 0.4807692307692308
For k = 4, accuracy = 0.4807692307692308
For k = 5, accuracy = 0.46153846153846156
For k = 6, accuracy = 0.5192307692307693
For k = 7, accuracy = 0.46153846153846156
For k = 8, accuracy = 0.46153846153846156
For k = 9, accuracy = 0.46153846153846156
For k = 10, accuracy = 0.46153846153846156
For k = 11, accuracy = 0.46153846153846156
For k = 12, accuracy = 0.46153846153846156
For k = 13, accuracy = 0.46153846153846156
For k = 14, accuracy = 0.46153846153846156
For k = 15, accuracy = 0.46153846153846156
For k = 16, accuracy = 0.46153846153846156
For k = 17, accuracy = 0.46153846153846156
For k = 18, accuracy = 0.46153846153846156
For k = 19, accuracy = 0.46153846153846156

import matplotlib.pyplot as plt

# Test different k values and choose the one with the highest accuracy
accuracies = []
for k in range(1, 20):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(features_train, target_train)
    predictions = model.predict(features_test)
    accuracy = accuracy_score(target_test, predictions)
    accuracies.append(accuracy)
best_k = accuracies.index(max(accuracies)) + 1

# Train the model with the best k
model = KNeighborsClassifier(n_neighbors=best_k)
model.fit(features_train, target_train)

# Predict on the test set
predictions = model.predict(features_test)

# Create a DataFrame for calculating returns (use .copy() so adding
# columns to the slice does not raise a SettingWithCopyWarning)
test_data = tesla.iloc[split_point:].copy()
test_data['Predictions'] = predictions

# Calculate returns
test_data['Returns'] = test_data['Close'].pct_change()
test_data['Strategy Returns'] = test_data['Returns'] * test_data['Predictions'].shift()

# Calculate cumulative returns
test_data['Cumulative Market Returns'] = (test_data['Returns'] + 1).cumprod()
test_data['Cumulative Strategy Returns'] = (test_data['Strategy Returns'] + 1).cumprod()

# Plot the cumulative returns
plt.figure(figsize=(10,5))
plt.plot(test_data['Cumulative Market Returns'], color='blue', label='Market Returns')
plt.plot(test_data['Cumulative Strategy Returns'], color='green', label='Strategy Returns')
plt.title('Cumulative Returns')
plt.legend()
plt.show()
Returns from the mean reversion k-NN strategy, with price action for the Tesla data in 15-minute intervals across the date range

The plot indicates that this particular model, applied to a mean reversion strategy, does not perform particularly well. That said, there is a long flat period while the market is closed, so the strategy’s full potential is not visible here. A larger dataset would give a better read on its effectiveness.
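
For a fuller read than the equity curve alone, a few summary statistics can be computed from the strategy returns. A sketch follows; the annualisation factor assumes 26 fifteen-minute bars per 6.5-hour session and roughly 252 sessions per year.

# Summary statistics for the strategy returns
bars_per_year = 26 * 252
returns = test_data['Strategy Returns'].dropna()

total_return = (1 + returns).prod() - 1
sharpe = returns.mean() / returns.std() * np.sqrt(bars_per_year)

equity = (1 + returns).cumprod()
max_drawdown = (equity / equity.cummax() - 1).min()

print(f"Total return: {total_return:.2%}, Sharpe: {sharpe:.2f}, Max drawdown: {max_drawdown:.2%}")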

Additional Limitations

The dataset is small: two weeks of 15-minute bars, leaving only around 50 intervals in the test set. This is likely to restrict the model’s reproducibility and change its performance across different periods.

Note that this is a very simple model and might not be suitable for real trading. In particular, it does not take into account transaction costs, the impact of the model’s decisions on the market, or the risk associated with different decisions. It also has no mechanism for handling missing data or outliers. Finally, it uses a very simple set of features and does not incorporate any information about the broader market or other potentially relevant factors.

The code fits the standard scaler on the entire dataset before it is split into training and test sets. Statistics from the test period therefore leak into the scaled training features, a minor source of look-ahead bias.
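
A straightforward fix is to fit the scaler on the training window only and reuse its statistics for the test window. A sketch, where raw_features is a stand-in name for the unscaled feature columns from the script above:

# Fit the scaler on the training window only, then apply it to both windows
# (raw_features = the unscaled feature DataFrame, before StandardScaler)
scaler = StandardScaler()
features_train = scaler.fit_transform(raw_features[:split_point])
features_test = scaler.transform(raw_features[split_point:])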
