PAIRS TRADING: USING MACHINE LEARNING FOR THE SELECTION OF PAIRS

Maike R Mota
AI monks.io
May 10, 2023

Hey there! This Medium article is based on my final thesis project, which used the book A Machine Learning Based Pairs Trading Investment Strategy by Simão Moraes Sarmento and Nuno Horta as its main reference. It was a really interesting project and I learned a lot about applying machine learning techniques to financial data. Hope you find it useful too!

Summary

This paper evaluates the effectiveness of the model proposed by Sarmento and Horta (2020), which addresses the high dimensionality of financial data and the clustering of assets into candidate pairs, and develops code to implement and test the strategy. The results show that the proposed model performs well in terms of profitability and is a promising alternative for investors seeking Pairs Trading strategies. However, it is necessary to constantly assess the risks and challenges involved in implementing this model and to continuously seek improvements and adaptations to ensure its long-term effectiveness.

Introduction

Machine Learning (ML) algorithms are increasingly used in various sectors, including investment decision-making, which often involves complex data interpretation. This paper focuses on Pairs Trading within quantitative analysis, which seeks to identify asset pairs based on correlation and other non-parametric decision rules. With the rise of ML in finance and the challenges of high data dimensionality, this paper aims to evaluate the effectiveness of the ML-based Pairs Trading model proposed by Sarmento and Horta (2020). The study utilizes automated clustering techniques to handle large data volumes and complex information, testing the model’s consistency over time and comparing it to the S&P 500 index.

Methodology

The following presents the methodological procedures used to test the applicability of Sarmento and Horta’s (2020) central hypothesis for selecting asset pairs. We used a group of about 700 assets, divided between ETFs and the main stocks from the S&P 500, Dow Jones, and Nasdaq indexes. According to the authors, a method that combines dimensionality reduction and clustering should yield a set of cointegrated pairs with a higher probability of positive returns, provided certain conditions of the mathematical formulation are preserved. Confirming this is necessary, but not sufficient, to validate or discard the method described by Sarmento and Horta (2020), because there is more than one way to test the cointegration between two time series, and several parameters can be altered depending on the evaluated period and timeframe.

The article is classified as applied research, since the method is applied to real assets and the simulations have real-world applications. The approach is quantitative, since the decision-making is based entirely on the results of algorithmic calculations. As an application of the model described by Sarmento and Horta (2020), it is explanatory research, aimed at justifying the potential of the applied technologies to outperform the returns of the main stock indexes in the market. As for the procedure, it is experimental, since various parameters of the machine learning models and mathematical tests can be changed to obtain different responses to the problem.

PAIR SELECTION RULES

Sarmento and Horta’s (2020) proposal is illustrated in the following figure:

● State 0: The initial state covers the price series for all possible pair constituents. This information is assumed to be available to the investor. The main assets used in this study are ETFs (Exchange-Traded Funds), which Sarmento and Horta (2020) describe as an interesting type of security to explore with this model. For them, an ETF is a security that tracks an index, commodity, or basket of assets like an index fund but trades like a stock. According to Chan (2013), the advantage of trading pairs of ETFs instead of pairs of stocks is that, once found to be cointegrated, ETF pairs are less likely to fall apart out of sample. This is because the fundamental economics of a basket of stocks change more slowly than those of a single stock. To complement the analysis, stocks are also included in the basket of financial assets.

● State 1: Then, by reducing the data’s dimensionality, each security can be described not only by its price series but also by the compact representation that results from applying PCA to the return series;

● State 2: Using this simplified representation, the OPTICS algorithm is capable of organizing securities into clusters;

● State 3: Finally, it is possible to search for pairs combinations within the clusters and select those that verify the rules described in the following Figure.
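The four states above can be sketched in a few lines with scikit-learn. This is a minimal sketch, not the authors' code: the function names `cluster_assets` and `candidate_pairs`, the number of principal components, the `min_samples` value, and the standardization steps are all assumptions made for illustration. The selection rules of State 3 are not reproduced here; the sketch only enumerates the candidate pairs on which those rules would then be checked.

```python
from itertools import combinations

import numpy as np
from sklearn.cluster import OPTICS
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def cluster_assets(prices, n_components=5, min_samples=3):
    """Return one OPTICS label per asset (-1 marks an unclustered asset).

    prices: array of shape (n_days, n_assets), one column per asset.
    n_components and min_samples are illustrative values (assumptions).
    """
    prices = np.asarray(prices, dtype=float)
    returns = prices[1:] / prices[:-1] - 1.0           # State 0 -> return series
    scaled = StandardScaler().fit_transform(returns)    # standardize each asset
    pca = PCA(n_components=n_components).fit(scaled)    # State 1: days are samples
    features = pca.components_.T                        # (n_assets, n_components)
    features = StandardScaler().fit_transform(features)  # assumption: rescale loadings
    labels = OPTICS(min_samples=min_samples).fit_predict(features)  # State 2
    return [int(label) for label in labels]


def candidate_pairs(labels):
    """State 3 (first half): every pair of assets that shares a cluster."""
    pairs = []
    for cluster in sorted(set(labels) - {-1}):
        members = [i for i, label in enumerate(labels) if label == cluster]
        pairs.extend(combinations(members, 2))
    return pairs
```

Restricting the search to pairs inside a cluster is what keeps the number of statistical tests manageable: with several hundred assets, testing every possible pair would mean hundreds of thousands of tests.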

Transaction Costs

We took into account the transaction costs in all the results presented in this work. The transaction costs were based on the estimates of Do and Faff (2012). These authors conducted a thorough study on the impact of transaction costs on Pairs Trading. The commission and market impact costs were adjusted to account for both assets in the pair. The costs associated with a transaction are calculated as shown in Figure 5.
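As a rough illustration of how such costs can enter a backtest, the sketch below charges commission and market impact on both assets of the pair at entry and at exit, plus a rental fee on the short leg for the holding period. The function name `round_trip_cost` and all default rates are placeholder assumptions chosen for the example; they are not the values from Figure 5.

```python
def round_trip_cost(leg_notional, holding_days, commission_bps=8.0,
                    market_impact_bps=20.0, short_rental_annual=0.01,
                    days_per_year=252):
    """Estimated cost of opening and closing one pair position.

    leg_notional: capital placed in each leg of the pair.
    All rates are illustrative placeholders (assumptions).
    """
    # Commission and market impact apply to both assets in the pair.
    per_trade_rate = 2 * (commission_bps + market_impact_bps) / 10_000
    trading_cost = 2 * per_trade_rate * leg_notional   # entry plus exit
    # The short leg pays a rental fee for as long as the position is open.
    rental_cost = short_rental_annual * holding_days / days_per_year * leg_notional
    return trading_cost + rental_cost


# Example: $100,000 per leg, held for 21 trading days.
example_cost = round_trip_cost(100_000, 21)
```

With these placeholder rates, trading costs dominate short holding periods, which is why the number of trades per day matters for profitability.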

Implementation Environment

The implementation environment for this work is Python. The motivation for this choice is the vast amount of resources available that facilitate the implementation of the chosen algorithms, namely data mining and data analysis procedures. Furthermore, Python has been, at the time of writing, the language of choice for Machine Learning-related projects, which also makes it more suitable from a collaboration standpoint. Some libraries are particularly useful in this work: scikit-learn is helpful in implementing PCA and the OPTICS algorithm, while statsmodels provides a ready-made implementation of the ADF test, useful for testing cointegration.

The simulation is run on an EC2 VM on the AWS platform (AMD EPYC 7R13, 32 threads, 64 GB of RAM). These models involve a large volume of matrix multiplications that result in long processing times when using the CPU.

Results

The results of the study show that the Machine Learning-based Pairs Trading strategy presented by Sarmento and Horta (2020) was successful in identifying tradable pairs. The approach was tested on a sample of 729 assets, including companies and ETFs, using adjusted closing data collected from Alpha Vantage.

To reduce the dimensionality of the data, a PCA (Principal Component Analysis) filter was applied to find a compact representation for each stock. The following figure shows the percentage weight that each financial asset receives in each principal component.

Clusters were then generated using unsupervised learning, with the aim of grouping assets into potential pairs. Figure 7 illustrates in 3 dimensions the clusters generated by the first 3 principal components, and the colors highlight the separation of the clusters.

Finally, the selection criteria were applied to identify tradable pairs within the generated clusters. The following table presents the result of the PCA, clustering, and application of the selection rules:

Results of the Model

Based on the daily execution carried out between 2005 and the end of 2022, it was possible to evaluate the effectiveness of the Pairs Trading strategy proposed by Sarmento and Horta (2020). Each day, the model generated a list of tradable pairs based on the selection criteria, including the PCA filter and cluster generation, and used the beta (hedge ratio) and the standard deviation of the spread as parameters to define entry and exit points.
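One common way to turn a beta and a standard deviation into entry and exit points is a z-score rule on the spread, sketched below. The thresholds (enter at 2 standard deviations, exit at the mean) and the function name `zscore_positions` are assumptions for illustration, and the mean and standard deviation are computed over the whole sample for brevity, whereas a real backtest would estimate them on past data only.

```python
import numpy as np


def zscore_positions(y, x, beta, entry=2.0, exit_level=0.0):
    """Position in the spread per day: +1 long, -1 short, 0 flat."""
    spread = np.asarray(y, dtype=float) - beta * np.asarray(x, dtype=float)
    z = (spread - spread.mean()) / spread.std()   # full-sample stats (simplification)
    positions = np.zeros(len(z), dtype=int)
    current = 0
    for t, value in enumerate(z):
        if current == 0:
            if value > entry:
                current = -1      # spread too high: short y, long beta units of x
            elif value < -entry:
                current = 1       # spread too low: long y, short beta units of x
        elif current == 1 and value >= -exit_level:
            current = 0           # spread reverted: close the long
        elif current == -1 and value <= exit_level:
            current = 0           # spread reverted: close the short
        positions[t] = current
    return positions
```

Wider entry thresholds produce fewer but more selective trades, which ties directly into the trade-count and cost trade-off discussed next.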

Initially, the number of trades per day and the capital exposure in each operation were evaluated, which are crucial aspects to be considered in the implementation of a trading strategy. According to a study by Abdi et al. (2020), the appropriate selection of the number of trades per day can significantly influence the performance of the trading strategy. On the one hand, an excessive number of trades can lead to higher transaction costs, while on the other hand, an insufficient number of trades can limit the potential for profit. Thus, it is necessary to find a balance between the number of trades and the transaction cost in order to maximize the return on investment.

In addition, the capital exposure in each operation is also a critical factor to be evaluated, since it can significantly influence the risk and profitability of the trading strategy. According to a study by Albergaria et al. (2020), the capital exposure in each operation should be carefully managed, in order to avoid excessive losses and maximize the return on investment. For this, it is important to define clear limits on maximum loss and maximum capital exposure in each operation, based on a careful analysis of the risks involved and the investment objectives.

In summary, the evaluation of the number of trades per day and the capital exposure in each operation is fundamental for the implementation of an effective trading strategy. As stated by Abdi et al. (2020), “careful selection of the number of trades per day is essential to maximize the return on investment, while proper management of capital exposure in each operation can help minimize risk and maximize profitability”. Thus, it is important to consider these aspects during the implementation of the trading strategy, in order to achieve consistent and satisfactory performance.

Trading Volume

To visualize and analyze the traded pairs, we computed the 21-day moving average of the number of traded pairs over time. The moving average smooths daily fluctuations and highlights trends and patterns, making the collected data easier to analyze.
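For reference, a 21-day simple moving average can be computed with NumPy as in the sketch below. The function name `moving_average` is ours, and the choice to drop the first 20 days (where the window is incomplete) is an assumption about how the series was plotted.

```python
import numpy as np


def moving_average(values, window=21):
    """Simple moving average; output has len(values) - window + 1 points."""
    values = np.asarray(values, dtype=float)
    kernel = np.ones(window) / window            # equal weight for each day
    return np.convolve(values, kernel, mode="valid")
```

Each output point is the mean of the current day and the 20 days before it, so a single unusually busy day shifts the curve by only 1/21 of its size.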

The following figure shows a histogram of traded pairs over time, providing a more detailed view of the data distribution. Together, these graphs provide an overview of the turnover generated by the model and allow for a deeper analysis of the collected data. These data show that the model does not generate an excessive number of trades per day, avoiding the loss of profitability due to the costs associated with buying and selling assets.

Capital Exposure

To evaluate the performance of the Pairs Trading operations, it is important to analyze the financial exposure of each operation, which represents the percentage of the total equity invested in each pair of assets. Figure 8 shows the daily financial exposure of Pairs Trading operations, showing the variation of the percentage of exposure over time.

Through this figure, it is possible to verify the exposure of each operation and evaluate the risk management ability of the Pairs Trading model. In addition, the next figure presents a box plot that allows the visualization of the median and deviations of the financial exposure of each operation, allowing for a more detailed analysis of the data distribution and identifying possible outliers that may impact the strategy’s performance. The analysis of the financial exposure of each operation is essential for risk management and for decision making regarding Pairs Trading operations.

In this figure, the median financial exposure of the Pairs Trading operations is zero, indicating that the operations are relatively balanced in terms of exposure. However, the upper and lower fences of the box plot show that exposures beyond +7% and -8% are outliers. These outliers deserve attention because a high financial exposure increases the risk of the operation. The abnormal exposures can be explained by the size of the beta of each asset: assets with very high betas can generate abnormal exposures in relation to the other asset in the pair.
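The link between beta and exposure can be made explicit with a small sketch. It assumes a position that is long `units` of one asset and short `beta * units` of the other, and it assumes the box plot uses the usual 1.5 × IQR rule for its fences; both function names and both assumptions are ours.

```python
import numpy as np


def pair_net_exposure(price_y, price_x, beta, units, equity):
    """Net exposure of a pair (long y, short beta units of x) as a share of equity."""
    long_value = units * price_y
    short_value = units * beta * price_x
    return (long_value - short_value) / equity


def tukey_fences(values, k=1.5):
    """Lower and upper fences beyond which a point counts as an outlier."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr
```

When `beta * price_x` is close to `price_y`, the two legs offset and the net exposure is near zero; the further beta pushes the short leg away from the long leg's value, the larger the residual exposure.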

Profitability

To evaluate the performance of the Pairs Trading strategy, it is important to consider several factors that affect the strategy’s profitability in real life. One of these factors is the existence of fees and commissions charged by brokers and exchanges, which make it difficult to accurately evaluate the profitability of the strategy. To address this issue, it is common to approximate the charged values in order to evaluate the impact of these costs on the strategy’s profitability. Chan et al. (1996) highlight the importance of proper cost management in the evaluation of trading strategies, emphasizing the need to realistically consider these costs. In addition, there is also uncertainty about the execution of the buy or sell order at the value determined by the model. To evaluate the impact of this uncertainty on the strategy’s profitability, it is important to perform simulations by inserting a random value into the asset prices, both increasing and decreasing the prices artificially. This way, it is possible to evaluate the strategy’s robustness in the face of price variations. Chan et al. (2013) discuss the importance of Monte Carlo simulation in evaluating the robustness of trading strategies, highlighting the need to evaluate the strategy’s performance in different market scenarios.

Based on these considerations, the realistic evaluation of the profitability of the Pairs Trading strategy involves careful consideration of additional costs and simulation of random prices. By considering these factors, it is possible to obtain a more accurate evaluation of the strategy’s profitability, as well as identify potential areas for improvement.

The next figure presents the capital curve of the Pairs Trading strategy during the backtest period. The highlighted line represents the capital curve considering the actual prices of the assets, without the inclusion of random factors in the prices. The other lines represent the capital curve considering prices with random factors inserted, artificially increasing or decreasing the asset prices. The initial capital is $10MM.

As seen in the figure, the capital curve considering actual prices shows consistent growth over the backtest period. However, the inclusion of random factors in the prices can lead to significant variations in the capital curve, with periods of profit followed by periods of loss.

These variations in the capital curve highlight the importance of simulating random prices in evaluating the robustness of the Pairs Trading strategy. By considering different market scenarios, it is possible to identify potential areas of weakness in the strategy and seek improvements that increase its profitability and robustness.

Although prices with random factors were inserted in the simulations, the Pairs Trading model showed capital growth in all 100 simulations performed. This result indicates the strategy’s robustness in the face of price variations and suggests that the strategy can be effective even in challenging market scenarios.
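The simulation loop can be sketched as follows. The noise distribution (uniform) and its size (up to 0.1% of the price) are assumptions, since the exact perturbation is not specified here, and `backtest` is a hypothetical callable standing in for the full strategy.

```python
import numpy as np


def perturb_prices(prices, max_shift=0.001, rng=None):
    """Shift every price up or down by a random fraction of at most max_shift."""
    rng = np.random.default_rng() if rng is None else rng
    prices = np.asarray(prices, dtype=float)
    noise = rng.uniform(-max_shift, max_shift, size=prices.shape)
    return prices * (1.0 + noise)


def run_simulations(prices, backtest, n_sims=100, seed=0):
    """Run the (hypothetical) backtest on n_sims independently perturbed price sets."""
    rng = np.random.default_rng(seed)
    return [backtest(perturb_prices(prices, rng=rng)) for _ in range(n_sims)]
```

For example, `run_simulations(prices, my_backtest)` would return 100 results, one per perturbed scenario, which can then be plotted next to the unperturbed capital curve.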

BENCHMARK

The results presented in the next figure and table are derived from calculations performed by the Pyfolio library in Python, which is a commonly used tool for evaluating trading strategies in the financial market. The Machine Learning-based Pairs Trading strategy described by Sarmento and Horta (2020) was tested using this tool in a backtest period from 2005 to 2022.

The results indicate that the Pairs Trading strategy outperformed the S&P 500 benchmark. The annual return was 10.8% and the cumulative return was 132.8%, which suggests above-market performance. In addition, the Sharpe ratio of 1.53 indicates that the strategy is capable of generating attractive risk-adjusted returns, considering the annual volatility of 6.9%. The Calmar ratio of 1.20 suggests that the strategy generates attractive returns relative to its maximum drawdown, while the Omega ratio of 1.60 indicates that gains above the return threshold outweighed the losses below it.
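For readers who want to see what these ratios measure, the sketch below computes them from a series of daily returns using common textbook definitions. It is not Pyfolio's code: the function name `performance_summary`, a zero risk-free rate, a zero Omega threshold, and 252 trading days per year are assumptions.

```python
import numpy as np


def performance_summary(daily_returns, periods=252):
    """Common performance ratios from daily returns (risk-free rate assumed 0)."""
    r = np.asarray(daily_returns, dtype=float)
    equity = np.concatenate(([1.0], np.cumprod(1.0 + r)))   # curve starting at 1
    annual_return = equity[-1] ** (periods / len(r)) - 1.0   # compound annual growth
    annual_vol = r.std(ddof=1) * np.sqrt(periods)
    sharpe = r.mean() / r.std(ddof=1) * np.sqrt(periods)
    running_max = np.maximum.accumulate(equity)
    max_drawdown = (equity / running_max - 1.0).min()        # most negative dip
    calmar = annual_return / abs(max_drawdown)               # undefined with no drawdown
    omega = r[r > 0].sum() / -r[r < 0].sum()                 # undefined with no losses
    return {
        "annual_return": annual_return,
        "cumulative_return": equity[-1] - 1.0,
        "annual_volatility": annual_vol,
        "sharpe": sharpe,
        "max_drawdown": max_drawdown,
        "calmar": calmar,
        "omega": omega,
    }
```

Read this way, the Sharpe ratio divides return by volatility, the Calmar ratio divides it by the worst drawdown, and the Omega ratio compares the sum of gains to the sum of losses.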

Discussion

The results obtained are promising and suggest that the Machine Learning-based Pairs Trading strategy can be an interesting option for investors seeking above-average market returns. However, it is important to note that the results were obtained in a backtest period and that the strategy may not perform as well in a real trading environment.

In the practical implementation of the strategy, it is important to consider some operational risks that can affect the strategy’s profitability. One of the main risks is the uncertainty regarding the execution of buy and sell orders, which can affect the effectiveness of the strategy. In addition, the selection of inadequate asset pairs or the use of incorrect parameters can lead to unsatisfactory results.

Another important aspect to consider is the time required to implement the strategy. The Pairs Trading strategy requires constant monitoring of the selected asset pairs in order to identify trading opportunities and adjust the strategy’s parameters. This may require a significant investment of time and resources, which can affect the viability of the strategy for some investors.

Moreover, it is important to highlight that trading costs, such as transaction fees and brokerage costs, can significantly affect the strategy’s profitability. It is important to take into account that costs may vary depending on the exchange used, which can affect the viability of the strategy for some investors.

Finally, it is important to highlight that the idle capital that was not exposed in open operations could have been allocated to a risk-free asset, which could have increased the strategy’s profitability.
