The new playground of statistical arbitrage: the fall and rise of cryptocurrencies

Peng Wang
Published in amberdata
Aug 1, 2019 · 10 min read

Preface: If you are an individual day trader, please read on, because this blog post could show you a quick and easy way to look for profits in day trading. If you manage money for an institutional investor, please read on, because it could show you a fast and accurate way of obtaining real-time market data for decision making. If you are a cryptocurrency broker/dealer, please read on, because it could give you a working example of a trading bot that you could plug into your fabulous business!


Bitcoin has left a permanent imprint on the financial world since its dramatic fall in price near the beginning of 2018. Many investors have cursed cryptocurrencies ever since, and some are still licking their wounds. Recently, however, there seems to be a strong rebound on the cryptocurrency side, thanks to Facebook’s ambition of creating a new cryptocurrency and to strongly revived interest from mainstream investors. As a result, this market is recovering at an amazing speed. Wherever there is volatility and uncertainty, there are opportunities: opportunities for gold digging. Thanks to the open and decentralized nature of blockchain technology (the technology underpinning cryptocurrencies), professional-grade trading of cryptocurrencies is available to everyone, because it is orders of magnitude easier to gain access to high-quality market data on cryptocurrencies than on traditional exchange-traded assets (which are controlled by only a handful of powerful exchanges such as NYSE, NASDAQ, etc.).

If you are a fan of statistical arbitrage and have some interest in trading bots, let us take a joint venture (no pun intended :) on it! We will do some fun coding together to discuss the what/why/how of statistical arbitrage and cryptocurrencies. For this purpose we will use a straightforward programming language: Python (specifically Python 3). The ideas and logic could easily be ported to any other programming language. The major dependencies that we are going to leverage are Python’s “requests” library, which will be used to make HTTP requests, and the “websocket-client” library, which will be used to create WebSocket clients.

There is a golden rule in data science: garbage-in-garbage-out. To put it in layman’s terms, if we feed a data science model with garbage data, what we get out of the model will be garbage results. Therefore, it is vitally important to obtain high-quality market data if you are going to bet serious money on the trading model we are going to build. Sometimes the data matter more than the mathematical model. At the time of this writing, almost every cryptocurrency exchange on the planet provides its market data to the public free of charge (although rate limits are imposed on their public endpoints). Anyone could write a program to collect and store those data, although that would be a ton of data to collect and store. For a handful of trading pairs, you could certainly do the data collection and storage yourself. If we want to analyze patterns, discover trends, and/or exploit opportunities among many cryptocurrency pairs (nowadays there are thousands of them), we had better get the data from a market data vendor/provider. Besides, the focus of this blog post is trading rather than data warehousing. Therefore we will obtain our data from a third-party market data vendor: amberdata.io. The lovely trading robot that we are going to build in this blog post depends on such high-quality market data.

To use Amberdata’s data, we need to sign up to get an API key, which I have already done. There are different tiers/plans to choose from: some endpoints/capabilities are completely free, while others are paid but offer free trials. For me, there wasn’t any hesitation: remember the rule in data science? (Garbage-in-garbage-out, or to put it another way: gold-in-gold-out.) So, let us set off on our gold digging journey and start by looking at some basic information about our market data, so that we know which crypto exchanges and which crypto trading pairs are supported by Amberdata (the complete repository is located at https://github.com/amberdata/ethereum-arbitrage):

    import requests
    from log import logger
    from constants import Constants

    if __name__ == '__main__':
        logger.info('main start')
        # The payload maps each exchange name to the pairs it supports.
        url = Constants.AMBERDATA_BASE_MARKET + '/exchanges'
        headers = {'x-api-key': Constants.AMBERDATA_API_KEY}
        response = requests.get(url, headers=headers)
        payload = response.json()['payload']

        # Top 10 exchanges by number of supported pairs.
        exchange_count_pair = [(k, len(v)) for (k, v) in payload.items()]
        exchange_count_pair.sort(key=lambda x: x[1], reverse=True)
        logger.info(exchange_count_pair[:10])

        # Top 10 pairs by number of exchanges that support them.
        pair_count_exchange_dict = {}
        for k, v in payload.items():
            for pair in list(v.keys()):
                if pair not in pair_count_exchange_dict:
                    pair_count_exchange_dict[pair] = 0
                pair_count_exchange_dict[pair] += 1
        pair_count_exchange = [(k, v) for (k, v) in pair_count_exchange_dict.items()]
        pair_count_exchange.sort(key=lambda x: x[1], reverse=True)
        logger.info(pair_count_exchange[:10])
        logger.info('main end')

Needless to say, each crypto exchange supports a different set of trading pairs, and each trading pair is supported by a different set of exchanges. The results from the above code snippet give us a basic idea of which exchanges support more pairs and which pairs are supported by more exchanges: the more/wider, the better, because our end goal is to find statistical arbitrage opportunities. The results are:

    [('binance', 502), ('huobi', 440), ('bitfinex', 349), ('zb', 217), ('bithumb', 82), ('kraken', 70), ('gdax', 38), ('bitstamp', 15), ('gemini', 15), ('bitmex', 8)]
    [('eth_btc', 7), ('ltc_btc', 7), ('xrp_btc', 6), ('eth_usd', 6), ('ltc_usd', 6), ('etc_btc', 5), ('xlm_btc', 5), ('zrx_btc', 5), ('btc_usd', 5), ('xrp_usd', 5)]

Great. We can clearly see that binance and huobi (by the way, these two exchanges only support crypto-crypto trading pairs) have a lot of pairs available for trading: significantly more than the other exchanges. On the other hand, the most widely supported pairs (eth_btc, ltc_btc, etc.) differ little from each other in how many exchanges list them. Let us pick binance/huobi and eth_btc for further analysis.
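The same two rankings can be written more compactly with `collections.Counter`. The sketch below is only an illustration: the payload is a made-up stand-in with the shape that the snippet above iterates over (exchange name mapped to a dictionary of pairs), the exchange names are hypothetical, and the helper `common_pairs` is not part of the original repository.

```python
from collections import Counter

# Hypothetical payload: exchange name -> {pair name -> pair details}.
# The names and contents are invented for illustration only.
payload = {
    'exchange_a': {'eth_btc': {}, 'ltc_btc': {}, 'xrp_btc': {}},
    'exchange_b': {'eth_btc': {}, 'ltc_btc': {}},
    'exchange_c': {'eth_btc': {}},
}


def pairs_per_exchange(payload):
    """Rank exchanges by how many pairs each one supports."""
    counts = Counter({exchange: len(pairs) for exchange, pairs in payload.items()})
    return counts.most_common()


def exchanges_per_pair(payload):
    """Rank pairs by how many exchanges list each one."""
    counts = Counter(pair for pairs in payload.values() for pair in pairs)
    return counts.most_common()


def common_pairs(payload, exchange_x, exchange_y):
    """Pairs listed on both exchanges, i.e. candidates for cross-exchange arbitrage."""
    return sorted(set(payload[exchange_x]) & set(payload[exchange_y]))


if __name__ == '__main__':
    print(pairs_per_exchange(payload))
    print(exchanges_per_pair(payload))
    print(common_pairs(payload, 'exchange_a', 'exchange_b'))
```

Intersecting the pair sets of two exchanges is a quick way to confirm that a pair such as eth_btc is actually tradable on both venues before analyzing it further.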

We will use the one-minute open-high-low-close price time series as the data foundation for our algorithm development. Let us first take a closer look at the details for binance and huobi:

    import requests
    from datetime import datetime
    from log import logger
    from constants import Constants

    if __name__ == '__main__':
        logger.info('main start')
        url = Constants.AMBERDATA_BASE_MARKET + '/ohlcv/information?exchange=binance,huobi'
        headers = {'x-api-key': Constants.AMBERDATA_API_KEY}
        response = requests.get(url, headers=headers)
        payload = response.json()['payload']
        for exchange in ['binance', 'huobi']:
            # startDate is a Unix timestamp in milliseconds, hence the division by 1000.
            start_ms = payload[exchange]['eth_btc']['startDate']
            start = datetime.utcfromtimestamp(start_ms / 1000.0).isoformat()
            logger.info('{} eth_btc startDate is {}'.format(exchange, start))
        logger.info('main end')

This snippet tells us how much data are available from Amberdata (i.e. what is the time range of one-minute open-high-low-close price data for the trading pair that we are focusing on). We get:

    binance eth_btc startDate is 2017-07-14T00:00:00
    huobi eth_btc startDate is 2019-04-24T00:00:00

Good: for one-minute open-high-low-close price data, we have at least 50K data points available for eth_btc on both binance and huobi, which is enough to be statistically meaningful. The statistical metric that we are targeting is the standard deviation of the difference between the close prices on the two exchanges. Before proceeding, let me explain a little more about why we are targeting the standard deviation. For simplicity, we make an important assumption: the price difference for the same asset between two exchanges follows a normal distribution with a mean close to zero. Standard deviations are widely used in many data science fields, and in particular for normally distributed data: the probability of a data point lying more than one standard deviation from the mean is 31.73%, more than two standard deviations is 4.55%, and more than three standard deviations is 0.27%. Let us call the standard deviation “sigma”. This means that most of the time (about 95%) the difference between the close prices on the two exchanges will stay between -2*sigma and +2*sigma (we choose two standard deviations rather than three because classical pair arbitrage in quantitative finance usually chooses two as the threshold), but occasionally it might “walk” outside that range. However, as time elapses, the difference should eventually be pulled back into that range by natural market forces. We will use historical data to determine the value of this important parameter sigma, and we will search for future occurrences of such rare events that lie outside of two sigmas.
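The three probabilities quoted above can be checked with a few lines of standard-library Python. This is just a sanity check under the normality assumption; the function name `two_sided_tail_probability` is made up for this illustration.

```python
import math


def two_sided_tail_probability(k):
    """Probability that a normally distributed value lies more than
    k standard deviations away from its mean (both tails combined)."""
    return math.erfc(k / math.sqrt(2))


if __name__ == '__main__':
    for k in (1, 2, 3):
        print('outside {} sigma: {:.2%}'.format(k, two_sided_tail_probability(k)))
    # outside 1 sigma: 31.73%
    # outside 2 sigma: 4.55%
    # outside 3 sigma: 0.27%
```

Real price differences often have fatter tails than a normal distribution, so in practice excursions beyond two sigmas may occur more often than 4.55% of the time.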

Now let us retrieve the historical one-minute open-high-low-close price data from 2019-04-24 to 2019-05-24.

    import requests
    import statistics
    from urllib.parse import urlencode
    from log import logger
    from constants import Constants


    def join(data_a, data_b, key_index, value_index):
        """Inner-join two lists of rows on the column at key_index.

        Returns rows of [key, value_from_a, value_from_b]."""
        data_b_dict = {x[key_index]: x[value_index] for x in data_b}
        data_joined = []
        for x in data_a:
            if x[key_index] in data_b_dict:
                data_joined.append([x[key_index], x[value_index], data_b_dict[x[key_index]]])
        return data_joined


    if __name__ == '__main__':
        logger.info('main start')
        url = Constants.AMBERDATA_BASE_MARKET + '/ohlcv/eth_btc/historical?'
        filters = {}
        filters['exchange'] = 'binance,huobi'
        filters['timeInterval'] = 'minutes'
        filters['startDate'] = 1556064000  # 2019-04-24 00:00:00 UTC, in seconds
        filters['endDate'] = 1558656000  # 2019-05-24 00:00:00 UTC, in seconds
        url += urlencode(filters)
        logger.info('url = {}'.format(url))
        headers = {'x-api-key': Constants.AMBERDATA_API_KEY}
        response = requests.get(url, headers=headers)
        data = response.json()['payload']['data']
        # Column 0 is the timestamp and column 4 is the close price.
        data_joined = join(data['binance'], data['huobi'], 0, 4)
        # Standard deviation of (binance close - huobi close).
        diff_price_stdev = statistics.stdev([x[1] - x[2] for x in data_joined])
        logger.info('diff_price_stdev = {}'.format(diff_price_stdev))
        logger.info('main end')

From the above snippet, we get the standard deviation (i.e. sigma) of the price difference (binance minus huobi) for eth_btc. The classical pair arbitrage in quantitative finance goes long X eth (note: in the eth_btc trading pair, not eth_usd) on binance and short X eth on huobi when the price difference between the two exchanges falls below -2*sigma; when that difference is pulled back to near 0 (e.g. -0.1*sigma), we exit both positions and realize a profit. Similarly, it goes short X eth on binance and long X eth on huobi when the price difference rises above 2*sigma; when that difference is pulled back to near 0 (e.g. 0.1*sigma), we exit both positions and realize a profit. Hmm… did we just notice the beauty of this classical strategy? ;) That’s right: it is market neutral, meaning it is independent of whether the actual prices are going up or down. Our profits depend only on the fact that the market fluctuates. (Note: the classical pair arbitrage in quantitative finance was pioneered at Morgan Stanley in the 1980s. Originally it was applied to a pair of assets traded on the same exchange. Here we apply it to the same asset on two different exchanges, thanks to the decentralized nature of cryptocurrencies.)
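The entry and exit rules just described amount to a small state machine. The sketch below is one possible way to write it down, not the implementation in the repository: the position labels, the function `next_position`, and the toy series of differences are all made up for illustration, and `diff` is assumed to be the binance price minus the huobi price.

```python
ENTRY_MULTIPLE = 2.0  # open positions when |diff| exceeds 2 * sigma
EXIT_MULTIPLE = 0.1   # close positions when diff is back within 0.1 * sigma of zero

FLAT = 'flat'
LONG_BINANCE = 'long_binance_short_huobi'
SHORT_BINANCE = 'short_binance_long_huobi'


def next_position(position, diff, sigma):
    """Return the position to hold after observing diff (binance - huobi)."""
    if position == FLAT:
        if diff < -ENTRY_MULTIPLE * sigma:
            return LONG_BINANCE   # binance is unusually cheap relative to huobi
        if diff > ENTRY_MULTIPLE * sigma:
            return SHORT_BINANCE  # binance is unusually expensive relative to huobi
        return FLAT
    if position == LONG_BINANCE and diff >= -EXIT_MULTIPLE * sigma:
        return FLAT               # difference has reverted: close both legs
    if position == SHORT_BINANCE and diff <= EXIT_MULTIPLE * sigma:
        return FLAT
    return position               # otherwise keep holding


if __name__ == '__main__':
    sigma = 1.0  # toy value; the real sigma comes from the historical data
    position = FLAT
    for diff in [0.5, -2.5, -1.0, -0.05, 2.2, 0.3, 0.05]:
        position = next_position(position, diff, sigma)
        print('diff = {:+.2f} -> {}'.format(diff, position))
```

Keeping the entry threshold (2*sigma) well apart from the exit threshold (0.1*sigma) means a position is held through the whole reversion instead of being closed as soon as the difference dips back under the entry level.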

Now that we have our trading strategy, let us set up the trading bot. This time we need websocket connections to get real-time best-bid-offers for eth_btc on binance and huobi, so that we can capture entry and exit points for our positions.

    import json
    import ssl
    import websocket
    from log import logger
    from constants import Constants

    prices = {}  # latest best bid/ask per exchange, e.g. prices['binance_bid']
    sigma = 3.862523415579312e-05  # computed from the historical data above


    def on_open(ws):
        logger.info('websocket {} was connected'.format(ws.url))
        # Subscribe to best-bid-offer updates for eth_btc on both exchanges.
        ws.send(json.dumps({'jsonrpc': '2.0', 'id': 1, 'method': 'subscribe',
                            'params': ['market:bbos', {'pair': 'eth_btc', 'exchange': 'binance'}]}))
        ws.send(json.dumps({'jsonrpc': '2.0', 'id': 2, 'method': 'subscribe',
                            'params': ['market:bbos', {'pair': 'eth_btc', 'exchange': 'huobi'}]}))


    def place_order_if_condition_met(prices):
        # Use .get() so that a side we have not received yet is simply skipped.
        # The "exit" branch is a placeholder: the strategy described above would
        # wait until the difference is near zero (e.g. 0.1 * sigma) before exiting.
        if prices.get('binance_bid') and prices.get('huobi_ask'):
            diff = abs(prices['binance_bid'] - prices['huobi_ask'])
            if diff > 2 * sigma:
                logger.info('here we need to take some positions')
            elif diff < 2 * sigma:
                logger.info('here we need to exit some positions')
        if prices.get('binance_ask') and prices.get('huobi_bid'):
            diff = abs(prices['binance_ask'] - prices['huobi_bid'])
            if diff > 2 * sigma:
                logger.info('here we need to take some positions')
            elif diff < 2 * sigma:
                logger.info('here we need to exit some positions')


    def on_message(ws, message):
        logger.info('message = {}'.format(message))
        json_message = json.loads(message)
        # Subscription acknowledgements carry no params/result, so ignore them.
        result = (json_message.get('params') or {}).get('result')
        if not result or result['exchange'] not in ('binance', 'huobi'):
            return
        side = 'bid' if result['isBid'] else 'ask'
        prices['{}_{}'.format(result['exchange'], side)] = result['price']
        place_order_if_condition_met(prices)


    if __name__ == '__main__':
        logger.info('main start')
        ws = websocket.WebSocketApp(
            Constants.AMBERDATA_WEBSOCKET_BASE,
            header={'x-api-key': Constants.AMBERDATA_API_KEY},
            on_open=on_open,
            on_message=on_message,
        )
        # CERT_NONE disables TLS certificate verification (kept from the original
        # example); drop sslopt to restore verification.
        ws.run_forever(sslopt={'cert_reqs': ssl.CERT_NONE})

For those of you who are conservative traders and think that a websocket feed might be somewhat dangerous, because a tiny bug could very quickly trigger unforeseeable disasters (a risk that can be mitigated by thoroughly testing the code with paper trading), we could instead poll the ticker endpoint periodically to replace the websocket’s continuous feed:

    import time
    import requests
    from log import logger
    from constants import Constants

    sigma = 3.862523415579312e-05  # computed from the historical data above


    def place_order_if_condition_met(prices):
        # Skip a comparison when either side is missing from the ticker payload.
        # The "exit" branch is a placeholder: the strategy described above would
        # wait until the difference is near zero (e.g. 0.1 * sigma) before exiting.
        if prices.get('binance_bid') and prices.get('huobi_ask'):
            diff = abs(prices['binance_bid'] - prices['huobi_ask'])
            if diff > 2 * sigma:
                logger.info('here we need to take some positions')
            elif diff < 2 * sigma:
                logger.info('here we need to exit some positions')
        if prices.get('binance_ask') and prices.get('huobi_bid'):
            diff = abs(prices['binance_ask'] - prices['huobi_bid'])
            if diff > 2 * sigma:
                logger.info('here we need to take some positions')
            elif diff < 2 * sigma:
                logger.info('here we need to exit some positions')


    if __name__ == '__main__':
        logger.info('main start')
        url = Constants.AMBERDATA_BASE_MARKET + '/tickers/eth_btc/latest?exchange=binance,huobi'
        headers = {'x-api-key': Constants.AMBERDATA_API_KEY}
        # Poll the latest ticker once a minute; stop the bot with Ctrl+C.
        while True:
            response = requests.get(url, headers=headers)
            payload = response.json()['payload']
            prices = {}
            prices['binance_bid'] = payload['binance'].get('bid')
            prices['binance_ask'] = payload['binance'].get('ask')
            prices['huobi_bid'] = payload['huobi'].get('bid')
            prices['huobi_ask'] = payload['huobi'].get('ask')
            place_order_if_condition_met(prices)
            time.sleep(60)

Before we put our trading bot onto the real battlefield with large sums of real money, several things deserve further investigation and scrutiny:

a. In our code, we have largely omitted error handling for simplicity. For safety, it is important to add timeouts to requests for external calls, surround that logic with try-except, and handle the errors/exceptions accordingly (depending on whether you are an offensive or defensive programmer).

b. There are a few places where we’ve chosen threshold values such as 2*sigma and 0.1*sigma. These values are adjustable, and the choices affect the confidence level of the strategy. Feel free to change them according to your risk tolerance.

c. A trading bot handling real money needs components for stop-loss logic, portfolio rebalancing, order management, and so on. In my opinion, these are indispensable rather than optional components.

d. Cryptocurrencies usually (e.g. under a proof-of-work protocol) have some latency for transaction confirmation. If you know the wallet’s address, you had better check the transaction status yourself rather than depend solely on the exchange’s word, because sometimes there are unexpected dragons (remember Mt. Gox?) on the blockchain, and the consensus on whether a transaction is ultimately successful comes from the blockchain itself rather than from a particular exchange, thanks to the decentralized nature of cryptocurrencies.
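Point a above can be sketched as a small retry wrapper. This is a minimal illustration rather than a production design: `call_with_retries` and its parameters are hypothetical names, and the wrapper is deliberately generic so that it can wrap any external call.

```python
import time


def call_with_retries(fetch, attempts=3, backoff_seconds=1.0, sleep=time.sleep):
    """Call fetch() and return its result, retrying when it raises.

    fetch is any zero-argument callable that performs one external call.
    The wait grows linearly: backoff_seconds, then 2 * backoff_seconds, ...
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception as error:  # broad on purpose for this sketch; narrow it in real code
            last_error = error
            if attempt < attempts:
                sleep(backoff_seconds * attempt)
    raise RuntimeError('all {} attempts failed'.format(attempts)) from last_error


if __name__ == '__main__':
    outcomes = iter([TimeoutError('simulated timeout'), {'bid': 0.0271, 'ask': 0.0272}])

    def fake_fetch():
        # Stand-in for a network call: fails once, then succeeds.
        outcome = next(outcomes)
        if isinstance(outcome, Exception):
            raise outcome
        return outcome

    print(call_with_retries(fake_fetch, backoff_seconds=0.01))
```

In the polling loop, the wrapped callable could be something like `lambda: requests.get(url, headers=headers, timeout=10).json()['payload']`, so that a hung connection raises after ten seconds instead of blocking the bot indefinitely.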

Disclaimer: The content in this blog post is for informational purposes only; you should not construe any such information and/or other linked materials as investment, financial, or other advice.
