An Introduction to Statistical Arbitrage for Cryptocurrencies (Part 1)

Deep in this bear market, who’s still making money, and how?

Published in TalkBitcoinTalk, Oct 20, 2018

The Rise and Fall of Day Trading

The 2017 bull run of the cryptocurrency market led to an unprecedented influx of capital from uninformed investors, which in turn gave rise to a day trading community of a kind not seen since the dot-com era.

Crypto day traders overwhelmingly used chart patterns, technical analysis indicators, signalling groups, and paid trading bots, among other questionable strategies.

Needless to say, this wizardry didn’t seem to work as well when the market crashed.

The Plight of Crypto “Hedge” Funds

2017 saw the creation of 167 crypto hedge funds, according to Bloomberg. By April 2018, nine had already shut down after crypto funds posted an average loss of 23% for the first three months of 2018. Shouldn’t funds focused on such a volatile asset class be protecting themselves from systematic risk (i.e. volatility of the entire crypto market)? In other words, shouldn’t crypto hedge funds be, well, hedged? Here’s what happened:

Many funds were long-only, investing in ICOs in addition to BTC and ETH, and thus relied on the upward trajectory of the overall market to generate returns.
As a young, developing market, crypto had (and continues to have) a notoriously poor offering of hedging instruments such as futures and options, meaning many fund managers were simply unable to take the short positions that would have hedged their risky bets.

When Long-only Falls Short

In mature markets, institutions employ market-neutral strategies in order to generate consistent returns. Such strategies are not concerned with whether the price of an asset goes up or down, only with the fact that it moves.

Statistical Arbitrage is a popular market-neutral approach to trading that was pioneered by Morgan Stanley in the 1980s, and has since evolved to become the cornerstone of many major quantitative hedge funds.

The simplest form of Statistical Arbitrage (or “stat arb”) is known as pairs trading, a type of strategy which exploits a relationship between two or more assets to profit from their mispricings. Let’s take a look at how it works.

Mean-Reverting Relationships

The theory behind pairs trading is that two companies in the same sector will experience similar market forces, which will affect their fundamentals, consequently causing their stock prices to move together. For example, two North American companies A and B that produce consumer graphics cards may have such a relationship.

Sometimes, an event may occur that causes a rapid change in the price of one asset. Consider a scenario in which the CEO of company A is accused of sexual harassment, triggering a sell-off of company A’s stock, while the price of company B’s stock remains unaffected. If the relationship between A and B’s prices is mean-reverting, we can purchase shares of company A after the sell-off, on the assumption that market forces will push up the price of A, restoring the statistical “equilibrium” between the prices of A and B, allowing us to make a profit from the difference.

Cointegration Explained

A popular way to mathematically model such a mean-reverting relationship between two assets is by using the cointegration approach.

Before introducing a formal definition for cointegration, let’s go over a non-financial example (adapted from Murray’s 1994 paper A Drunk and her Dog) to explain the concept.

Imagine the path taken by a drunk person walking out of a bar at 4AM. Likewise, consider the path taken by a dog roaming around without a leash. Both are random, or at the very least, highly unpredictable. Now, imagine that the drunk person owns the dog and they are walking around in a park. In this scenario, they will both still follow random paths, but will always stay within a certain distance of each other, provided the drunk person calls for her dog periodically. The dog might wander off to follow its nose, but will eventually return towards its owner.

The paths of the dog and the drunk person are said to be cointegrated.

Mathematically, two time series x and y are cointegrated if there exists some stationary (i.e. its mean, variance, etc. do not change over time) linear combination of x and y.

In other words, there exists:

y(t) − βx(t) = u(t), where u(t) is stationary and β is some coefficient.
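To make the definition concrete, here is a minimal sketch on synthetic data (not market prices). It builds x as a random walk and y as βx plus stationary noise, so the pair is cointegrated by construction. The value of β, the noise scale, and the sample size are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000
beta = 0.8  # hypothetical coefficient, chosen for illustration

# x is a random walk (the "drunk"): non-stationary on its own.
x = np.cumsum(rng.normal(size=n))
# u is stationary noise (the "leash"): it never drifts far from zero.
u = rng.normal(scale=0.5, size=n)
# y tracks beta * x plus the stationary term, so y is non-stationary too.
y = beta * x + u

# The linear combination y - beta * x recovers the stationary term u.
spread = y - beta * x

# Compare variances across the two halves of the sample.
half = n // 2
print("var(x)      by half:", x[:half].var(), x[half:].var())
print("var(spread) by half:", spread[:half].var(), spread[half:].var())
```

The variance of x depends on which part of the sample you look at, while the variance of the spread should come out roughly the same in both halves. That stability is what "stationary" means in practice.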

Identifying Cointegrated Asset Pairs

Before we can build a pairs trading strategy, we need to come up with a hypothesis and test it. After looking at a list of the top 100 cryptocurrencies by market cap, I came up with three asset pairs that might have a mean-reverting relationship.

1. Ethereum and Ethereum Classic

Ethereum Classic is a fork of Ethereum. Aside from recent upgrades to Ethereum (e.g. Casper), Ethereum Classic is basically just a version of Ethereum in which the founding community did not agree with the decision to refund victims of the DAO hack.

2. Tron and EOS

Tron and EOS are competing DPoS blockchains, each with a market cap over US $1 billion. Both launched their main nets around the same time and migrated their tokens away from the ERC-20 standard. Both also have controversial leadership, which may give them similar levels of volatility.

3. Monero and ZCash

Monero and ZCash are currently the leading privacy coins by market cap. Both serve the same market for anonymous transactions, and both are considered to have solid privacy tech. Furthermore, neither coin had an ICO.

The next step is gathering historical price data for all of the above assets. I will be using 1 minute data for the BTC pairs, as scraped from Binance.

Let’s scrape and clean 6 months’ worth of 1 minute OHLCV data using Binance’s API (I used this: https://github.com/Roibal/python-binance). The specifics of the data cleaning process are left as an exercise to the reader (but mostly just omitted for the sake of brevity).
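As a starting point for that exercise, here is one possible cleaning step, sketched under assumptions rather than taken from my actual pipeline. It assumes each raw row follows Binance's 12-field kline layout (open time in milliseconds first, close price fifth, prices as strings). The column labels, the function name `klines_to_close`, and the forward-fill rule for missing minutes are all choices made for this example.

```python
import pandas as pd

# Labels for the 12 kline fields; the names themselves are illustrative.
KLINE_COLUMNS = [
    "open_time", "open", "high", "low", "close", "volume",
    "close_time", "quote_volume", "trades",
    "taker_buy_base", "taker_buy_quote", "ignore",
]

def klines_to_close(rows):
    """Turn raw kline rows into a gap-free 1-minute series of close prices."""
    df = pd.DataFrame(rows, columns=KLINE_COLUMNS)
    df["open_time"] = pd.to_datetime(df["open_time"], unit="ms")
    close = (
        df.drop_duplicates("open_time")      # drop repeated candles
          .set_index("open_time")["close"]
          .astype(float)                     # prices arrive as strings
          .sort_index()
    )
    # Put the data on a complete 1-minute grid and carry the last price
    # forward over missing minutes (an assumption; other rules are possible).
    return close.asfreq("1min").ffill()

# Tiny hand-made sample: out of order, one duplicate, one missing minute.
sample_rows = [
    [120000, "1.6", "1.8", "1.5", "1.7", "12", 179999, "20", 4, "6", "9", "0"],
    [0, "1.0", "2.0", "0.5", "1.5", "10", 59999, "15", 3, "5", "7", "0"],
    [0, "1.0", "2.0", "0.5", "1.5", "10", 59999, "15", 3, "5", "7", "0"],
]
clean = klines_to_close(sample_rows)
print(clean)
```

Forward-filling keeps both legs of a pair on the same time grid, but it also hides outages, so long gaps may deserve separate treatment.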

Testing for Cointegration

Now that we have our closely related pairs of cryptoassets, we need to statistically verify the existence of a cointegrating relationship. For this task, we will employ the Cointegrated Augmented Dickey-Fuller (CADF) test, which can be broken down into the following steps:

Perform a linear regression between the two time series
Calculate the residuals (the differences between the observed values of the dependent variable and the values predicted by the linear fit)
Run the Augmented Dickey-Fuller (ADF) test to determine whether the residuals are stationary or follow a random walk (the null hypothesis)

Let’s dive into the code. We’ll start by importing all the libraries we need and loading our cleaned data into a Pandas dataframe.

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.tsa.stattools as ts
import pandas as pd
from sklearn.linear_model import LinearRegression

asset_pairs = pd.read_pickle("price_data/binance_pairs.pkl")
eth = asset_pairs['pETH']
etc = asset_pairs['pETC']
trx = asset_pairs['pTRX']
eos = asset_pairs['pEOS']
xmr = asset_pairs['pXMR']
zec = asset_pairs['pZEC']

Now let’s write a function to handle the linear regression step. Note that regression is a non-commutative operation, which means the regression of X on Y is not the same as the regression of Y on X — so we need to run the CADF test twice for each pair.

def regress_prices(x_data, y_data):
    reg = LinearRegression(fit_intercept=True)
    reg.fit(x_data.reshape(-1,1), y_data.reshape(-1,1))
    r_c, r_i = reg.coef_[0,0], reg.intercept_[0]
    return r_c, r_i
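To see why the direction of the regression matters, here is a small sketch on synthetic data. It uses `np.polyfit` instead of the scikit-learn function above purely to keep the example short. The data-generating numbers are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2.0 * x + rng.normal(size=1000)  # synthetic data, not market prices

slope_y_on_x = np.polyfit(x, y, 1)[0]  # regress y on x
slope_x_on_y = np.polyfit(y, x, 1)[0]  # regress x on y

r = np.corrcoef(x, y)[0, 1]
print(slope_y_on_x, 1.0 / slope_x_on_y)     # differ unless |r| = 1
print(slope_y_on_x * slope_x_on_y, r ** 2)  # these two agree
```

If regression were symmetric, the second slope would simply be the reciprocal of the first. Instead the product of the two slopes equals the squared correlation, so the two fits only coincide when the series are perfectly correlated.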

As a sanity check, you can use matplotlib at any time to make sure you didn’t make any major mistakes during the previous step. Next, we’ll write a function to calculate the residuals.

def residuals(x_vals, y_vals, coeffs):
    return y_vals - (coeffs[0] * x_vals + coeffs[1])

At this point we’re ready to perform the first two steps of the CADF test for our first pair.

eth_etc_coeffs = regress_prices(eth.values, etc.values)
etc_eth_coeffs = regress_prices(etc.values, eth.values)

eth_etc_resids = residuals(eth.values, etc.values, eth_etc_coeffs)
etc_eth_resids = residuals(etc.values, eth.values, etc_eth_coeffs)

Let’s plot the residuals and see if we can observe any stationarity.

eth_on_etc = plt.figure()
plt.plot(eth_etc_resids)
plt.title('Residuals ETH_ETC')
etc_on_eth = plt.figure()
plt.plot(etc_eth_resids)
plt.title('Residuals ETC_ETH')
eth_on_etc.show()
etc_on_eth.show()

Although the mean doesn’t seem to change much, the variance of both series is wildly inconsistent. Regardless, we’ll run the ADF test and analyze the results.

eth_etc_adf = ts.adfuller(eth_etc_resids, 2)
etc_eth_adf = ts.adfuller(etc_eth_resids, 2)

print("ETH_ETC: \n Test Statistic: ", eth_etc_adf[0], "\n p-value: ", eth_etc_adf[1],
     "\n 1% threshold: ", eth_etc_adf[4]['1%'])
print("ETC_ETH: \n Test Statistic: ", etc_eth_adf[0], "\n p-value: ", etc_eth_adf[1],
     "\n 1% threshold: ", etc_eth_adf[4]['1%'])

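In both directions the test statistic is greater (less negative) than the 1% critical value, so we cannot reject the null hypothesis that the residual series is a random walk at the 1% level. ETH_ETC would pass at the 5% level (p ≈ 0.04), but ETC_ETH is nowhere close (p ≈ 0.32). In other words, we did not find the stationary series we were looking for.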

ETH_ETC: 
 Test Statistic:  -2.9522775899006417 
 p-value:  0.0395978455904137 
 1% threshold:  -3.430374931237981
ETC_ETH: 
 Test Statistic:  -1.9226050331183062 
 p-value:  0.3215028443564184 
 1% threshold:  -3.430374931237981

Let’s try again with TRX and EOS. The residuals look a bit better this time — at least for the most recent half of our dataset.

How about the ADF test results?

TRX_EOS: 
 Test Statistic:  -4.219363538265842 
 p-value:  0.0006108975322151075 
 1% threshold:  -3.430374931237981
EOS_TRX: 
 Test Statistic:  -4.106029762946809 
 p-value:  0.0009465346777632221 
 1% threshold:  -3.430374931237981

In both directions the ADF test statistic is below the 1% critical value, so we reject the null hypothesis of a random walk at the 1% level and treat the residual series we observed above as stationary.

Finally, we test the Monero and ZCash pair, which yields results similar to Ethereum/Ethereum Classic. That leaves Tron/EOS as the only asset pair we can build a strategy for.

Pairs Trading Algorithm

Now that we’ve chosen our asset pair, it’s time to build the pairs trading strategy. Our algorithm must handle three key tasks:

Process new information
Generate entry/exit signals
Manage risk

In order to determine entry and exit, we need to keep track of the spread z between two assets x and y, defined as:

z(t) = y(t) − βx(t), where β is once again the coefficient obtained from the linear regression function we wrote earlier.

We propose a strategy that involves opening a position whenever the current spread z exceeds a certain threshold, which we measure using the z-score, defined as:

z-score = [z(t)-µ]/σ, where µ is the rolling mean of z and σ is the rolling standard deviation of z
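The rolling z-score can be computed with pandas in a few lines. This is a minimal sketch: the function name `rolling_zscore` and the window length are assumptions, since the article does not fix a lookback period, and the numbers in the demo are made up.

```python
import pandas as pd

def rolling_zscore(x, y, beta, window):
    """z-score of the spread y - beta * x over a rolling lookback window."""
    spread = y - beta * x
    mu = spread.rolling(window).mean()    # rolling mean of the spread
    sigma = spread.rolling(window).std()  # rolling sample standard deviation
    return (spread - mu) / sigma

# Toy prices, chosen so the spread alternates between 0 and 0.5.
toy_x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
toy_y = pd.Series([2.0, 4.5, 6.0, 8.5, 10.0])

z = rolling_zscore(toy_x, toy_y, beta=2.0, window=3)
print(z)
```

The first `window - 1` values are NaN because there is not yet enough history, so the strategy should stay flat until the window has filled.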

Entry

If z-score ≥ 1.5, short y(t) and buy βx(t)
If z-score ≤ -1.5, buy y(t) and short βx(t)

Exit

If z-score ≤ 0 and we are short y(t), buy back y(t) and sell βx(t)
If z-score ≥ 0 and we are long y(t), sell y(t) and buy back βx(t)
If |z-score| ≥ 4, close any open positions
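The entry and exit rules above translate into a small state machine. The sketch below is one reading of them, with hypothetical function names. It adds one assumption that the rules do not state: it will not open a new position while |z-score| is already at or beyond the stop level, since otherwise a stopped-out trade would be reopened on the very next bar.

```python
def next_position(position, z, entry=1.5, stop=4.0):
    """Return the new position given the current one and the latest z-score.

    position: +1 = long y / short beta*x, -1 = short y / long beta*x, 0 = flat.
    """
    if position == 0:
        # Assumption: never open a new trade beyond the stop level.
        if entry <= z < stop:
            return -1   # spread is rich: short y, buy beta*x
        if -stop < z <= -entry:
            return 1    # spread is cheap: buy y, short beta*x
        return 0
    if abs(z) >= stop:
        return 0        # stop-loss: close whatever is open
    if position == -1 and z <= 0:
        return 0        # short spread has reverted to the mean
    if position == 1 and z >= 0:
        return 0        # long spread has reverted to the mean
    return position     # otherwise keep holding


def run_signals(z_scores):
    """Apply the rules to a sequence of z-scores, starting flat."""
    position, path = 0, []
    for z in z_scores:
        position = next_position(position, z)
        path.append(position)
    return path


example = run_signals([0.2, 1.6, 1.0, -0.1, -1.7, -4.2, -2.0, 0.3])
print(example)  # [0, -1, -1, 0, 1, 0, 1, 0]
```

A NaN z-score fails every comparison, so the function stays flat (or keeps holding) while the rolling window is still filling.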

Challenges

The algorithm described above is quite primitive. A few of its weaknesses are described below:

Assumes a constant β instead of using recent data to update it
Uses arbitrary z-score thresholds for entry and exit, instead of optimized parameters or even dynamic thresholds based on a rolling lookback window
Does not account for total portfolio loss in risk management strategy, allowing for the possibility of several consecutive losses
Does not allow for multiple pairs to be traded
Does not close positions and halt trading if the cointegrating relationship breaks down in the future
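As a preview of the first point, here is one simple way β could be re-estimated from recent data. It is a sketch rather than the approach Part 2 will necessarily take. It uses the fact that the OLS slope of y on x equals cov(x, y) / var(x), and the function name and window length are assumptions.

```python
import numpy as np
import pandas as pd

def rolling_beta(x, y, window):
    """OLS slope of y on x over a rolling window: cov(x, y) / var(x)."""
    return x.rolling(window).cov(y) / x.rolling(window).var()

# Toy series with an exact linear relationship, so beta should be 3.
line_x = pd.Series(np.arange(10.0))
line_y = 3.0 * line_x + 1.0

betas = rolling_beta(line_x, line_y, window=4)
print(betas)
```

A shorter window reacts faster to a changing relationship but gives a noisier β, so the window length becomes one more parameter to tune.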

In Part 2, I will address the above issues, and continue building and refining the strategy. Feel free to leave questions and comments below. Stay tuned!

None of this article should be construed as investment advice and is posted for informational purposes only.
