Analyze Trade Arrivals in Crypto Markets

Crypto Chassis
Open Crypto Trading Initiative
Nov 8, 2020

Over the years, High Frequency Trading (HFT) has become a popular and mature technology in traditional asset trading, and it is gaining substantial attention and momentum in the rapidly growing venue of crypto asset trading. As the race to zero latency continues, high frequency data (e.g. tick data), a key component of HFT, remain a focus of researchers and quantitative analysts/strategists across the world. Unlike low frequency data, which are recorded at regular time intervals, tick data arrive at irregularly spaced times. As a result, many of the classical analyses based on regular time series are no longer applicable to tick data. With the open nature of cryptocurrencies and the ready availability of crypto trading tick data, can we apply modern HFT data analysis tools and techniques to these datasets, extract some unique insights from them, and perhaps even spy a few lucrative opportunities?

In this article, we are going to do an in-depth analysis of an HFT metric called "Trade Arrival Frequency" for a variety of cryptocurrencies such as Bitcoin, Ethereum, and Litecoin, using accurate data from a variety of exchanges such as Coinbase, Gemini, and Kraken. The trade arrival frequency is one of the foundational pieces of information needed to build a full-fledged HFT strategy.

Brief Literature Survey

Diamond and Verrecchia (https://www.sciencedirect.com/science/article/abs/pii/0304405X87900420) and Easley and O’Hara (https://www.sciencedirect.com/science/article/abs/pii/0304405X87900298) were the first to suggest that the duration between subsequent data arrivals carries information. In particular, their models showed that in securities markets where short selling is disallowed, the shorter the inter-trade duration, the higher the likelihood of unobserved good news. The reverse also holds: in markets with limited short selling and normal liquidity levels, the longer the duration between subsequent trade arrivals, the higher the probability of yet-unobserved bad news. A complete absence of trades, however, indicates a lack of news. The authors further pointed out that trades separated by a large time interval carry very different information content than trades occurring in close temporal proximity.

Duration Models

The process of information arrival is modeled using so-called duration models. Duration models estimate the factors affecting the duration between two sequential quotes or two sequential trades; such models are known as quote processes and trade processes, respectively. Duration models are also used to measure the time elapsed between price changes of a pre-specified size (price durations), as well as the time interval between pre-determined trade volume increments (volume durations).
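To make the price-duration idea concrete, here is a minimal sketch (the function and column names are our own illustrative assumptions, not part of any established library): it records the time elapsed between consecutive price moves of at least a pre-specified size.

import pandas as pd

# Illustrative sketch: "price durations" are the times elapsed between
# consecutive price changes of at least `threshold` (absolute price units).
def price_durations(df: pd.DataFrame, threshold: float) -> pd.Series:
    durations = []
    last_time = df['time'].iloc[0]
    last_price = df['price'].iloc[0]
    for time, price in zip(df['time'].iloc[1:], df['price'].iloc[1:]):
        if abs(price - last_price) >= threshold:
            durations.append(time - last_time)
            last_time, last_price = time, price
    return pd.Series(durations)

Volume durations can be computed analogously by accumulating trade sizes until a pre-determined increment is reached.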

Information arrivals are often modeled using Poisson processes. The Poisson process is one of the most widely used counting processes. It is usually used in scenarios where we count occurrences of events that appear to happen at a certain rate but otherwise completely at random (without any discernible structure). For example, suppose that from historical data we know that earthquakes occur in a certain area at a rate of 22 per month. Other than this information, the timings of earthquakes seem to be completely random. Thus, we conclude that the Poisson process might be a good model for earthquakes. In practice, the Poisson process or its extensions have been used to model the number of car accidents at a site or in an area, the locations of users in a wireless network, requests for individual documents on a web server, the outbreak of wars, photons landing on a photodiode, and even the lifetime distribution of alien civilizations (https://www.youtube.com/watch?v=LrrNu_m_9K4).

If we assume that trade arrivals can be modeled by a Poisson process, then from the statistical properties of a Poisson process we know that the time intervals between adjacent trades follow the exponential distribution (https://en.wikipedia.org/wiki/Exponential_distribution). Therefore, if we notice that in some markets the inter-trade durations do not follow an exponential distribution, then trade arrivals in those markets cannot be adequately modeled by a Poisson process. In fact, several recent research articles study generalizations and/or replacements of the Poisson process in these models, although they require much more sophisticated mathematical tools. So for our study below, we will stick with the widely accepted Poisson process assumption.
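To see the Poisson/exponential connection concretely before touching real data, here is a minimal simulation sketch (pure numpy; all names are ours): we draw exponential inter-arrival durations, accumulate them into arrival times, and verify that the counts per unit interval behave like a Poisson random variable (mean approximately equal to variance).

import numpy as np

rng = np.random.default_rng(42)
rate = 1.0  # average arrivals per second
# Exponential inter-arrival durations with mean 1/rate ...
durations = rng.exponential(scale=1 / rate, size=100_000)
# ... accumulate into arrival timestamps ...
arrival_times = np.cumsum(durations)
# ... and count arrivals within each 1-second window.
counts = np.bincount(arrival_times.astype(int))
# For a Poisson process, mean and variance of the counts both equal the rate.
print(counts.mean(), counts.var())  # both should be close to 1.0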

Cryptocurrency Market Data API

Now that we have a simple model in hand, we can proceed to extract cryptocurrency data for our analysis. There are many market data providers, and a handful of them have tick-by-tick trade data available. Ultimately we chose to use our in-house data because, obviously, it is completely free for us (and, less obviously, it is also completely free for anyone else, i.e. we’ve taken a completely open and inclusive approach). Head over to https://api.cryptochassis.com/v1/trade/coinbase/btc-usd?startTime=2020-10-20, click on the pre-signed AWS S3 URL to download the file, and unzip it; we get a standard CSV file:

time_seconds,time_nanoseconds,price,size,is_buyer_maker,trade_id
1603152000,38000000,11760.45,0.00261565,0,106089510
1603152000,637000000,11760.07,0.00141045,0,106089511
...

These rows represent the tick-by-tick trade data for coinbase’s btc-usd on 2020-10-20. The columns time_seconds and time_nanoseconds represent the seconds and nanoseconds portions of the unix timestamp at which the trade was transacted. The price and size columns are self-explanatory. The column is_buyer_maker tells us whether the buyer is the maker (i.e. the seller is the taker) or not. The trade_id column is the unique integer id of the trade reported by the exchange (if there isn’t such an id from the exchange, we use the timestamp with nanosecond resolution as the trade id). The columns that we are going to use in this study are time_seconds and time_nanoseconds, because the time difference between adjacent rows is the inter-trade duration. In a future post, we will discuss and highlight the usefulness and importance of the is_buyer_maker column, which by itself deserves a full article.
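As a quick preview of how these two columns will be used (the full analysis code follows in the next section), the seconds and nanoseconds parts can be combined into a single nanosecond-resolution timestamp, and differencing adjacent timestamps yields the inter-trade durations (the file path here is illustrative):

import pandas as pd

df = pd.read_csv('coinbase_btc-usd.csv')
# One integer unix timestamp in nanoseconds per row.
timestamp_ns = df['time_seconds'].astype('int64') * 1_000_000_000 + df['time_nanoseconds'].astype('int64')
# Inter-trade durations in seconds.
inter_trade_duration = timestamp_ns.diff() / 1e9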

Write the Code and Analyze the Data

In this section, we will walk you through the Python code needed to download, analyze, and visualize the relevant data for validating the inter-trade duration models. Here we would really like to highlight how easy it is to use our data, because they shine when it comes to usability and simplicity. Let’s roll up our sleeves:

import gzip
import os
import requests

saveCsvLocalDir = os.environ['SAVE_CSV_LOCAL_DIR']
session = requests.Session()
startTime = '2020-10-23'
for exchange in ['coinbase', 'gemini', 'kraken']:
    for instrument in ['btc-usd', 'eth-usd', 'ltc-usd']:
        # Fetch the metadata containing the pre-signed AWS S3 url for this dataset.
        r1 = session.get(f'https://api.cryptochassis.com/v1/trade/{exchange}/{instrument}?startTime={startTime}')
        awsS3SignedUrl = r1.json()['urls'][0]['url']
        # Download the gzipped csv and save it locally.
        r2 = session.get(awsS3SignedUrl)
        csvString = gzip.decompress(r2.content).decode('utf-8')
        with open(f'{saveCsvLocalDir}/{exchange}_{instrument}.csv', 'w') as csvFile:
            csvFile.write(csvString)

With a mere dozen or so lines of code, we have saved the tick-by-tick trades from coinbase, gemini, and kraken for btc-usd, eth-usd, and ltc-usd on 2020-10-23. Running this code takes only a few seconds: you don’t need complicated API key setups, you don’t need complicated pagination logic, you don’t need long download waiting times, and you don’t need to read lengthy how-to-use documentation. Our APIs were designed for dummies and professionals alike with one philosophy in mind: simplicity, simplicity, simplicity!

From now on, we will focus our data analysis and visualization efforts on btc-usd from coinbase: compared to the other exchanges and trading pairs, btc-usd from coinbase has the highest number of trades per day. We first want to examine some basic statistical properties of the inter-trade duration distribution: mean, median, standard deviation, skewness, etc. Let’s roll up our sleeves:

import os
import pandas as pd

saveCsvLocalDir = os.environ['SAVE_CSV_LOCAL_DIR']
df = pd.read_csv(f'{saveCsvLocalDir}/coinbase_btc-usd.csv', usecols=['time_seconds', 'time_nanoseconds', 'price'])
# Trades sharing the same nanosecond timestamp are collapsed into one arrival event.
df2 = df.groupby(['time_seconds', 'time_nanoseconds']).size().to_frame('number_of_trades').reset_index()
df2['time'] = df2['time_seconds'].astype(int) + df2['time_nanoseconds'].astype(int) / 1e9
df2['inter_trade_duration'] = df2['time'].diff()
df2['pd_time'] = pd.to_datetime(df2['time_seconds'].astype(int), unit='s')
df2['hour'] = df2['pd_time'].map(lambda x: x.hour)
# Aggregate per-hour statistics of the inter-trade duration.
df3 = df2.groupby('hour').agg(
    number_of_trades=('number_of_trades', 'sum'),
    average=('inter_trade_duration', 'mean'),
    median=('inter_trade_duration', 'median'),
    stdev=('inter_trade_duration', 'std'),
    skewness=('inter_trade_duration', 'skew')
).reset_index()
print(df3.to_string(index=False))

Output: a table with one row per hour (0-23) showing number_of_trades, average, median, stdev, and skewness of the inter-trade duration.

Here we have divided the whole day’s data into 24 hours and computed the statistics for each hour. Regardless of the hour of the day, the average inter-trade duration is about 1 second and the number of trades per hour is in the several thousands: this represents fairly vibrant and intense trading activity, comparable even to some popular ETFs on traditional exchanges. If we look at similar data points from several years ago, we can see that trading activity has been steadily increasing over the past several years (as have the acceptance and popularity of bitcoin 😆; by the way, we do have 5 years of historical data available at your fingertips: https://github.com/crypto-chassis/cryptochassis-api-docs#public-rest-api-from-cryptochassis).

Next, we would like to fit our data points to the exponential distribution so that we can validate (or invalidate) our theory. Using maximum likelihood estimation (https://www.statlect.com/fundamentals-of-statistics/exponential-distribution-maximum-likelihood), it is straightforward to derive that the maximum likelihood estimate of the rate parameter lambda is the reciprocal of the sample mean, so the inter-trade duration distribution should follow the probability density function

f(x) = lambda * exp(-lambda * x), with lambda = 1 / 1.1983892418235191.

Here 1.1983892418235191 is the mean value of the inter-trade duration for the whole day.
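If you would like to reproduce this estimate yourself, here is a minimal sketch (df2 is assumed to be the DataFrame constructed in the previous code block): for the exponential distribution, the MLE of the scale parameter 1/lambda is simply the sample mean, and scipy recovers the same value when the location is pinned at 0.

from scipy.stats import expon

durations = df2['inter_trade_duration'].dropna()
# MLE of the exponential scale (i.e. 1/lambda) is the sample mean.
scale_mle = durations.mean()
# scipy's fit with the location fixed at 0 recovers the same scale.
loc, scale = expon.fit(durations, floc=0)
print(scale_mle, scale)  # ~1.1983892418235191 for the whole day's data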

With the above equation and the sample data points, we can make a plot to visually examine how well our data points fit the theoretical model. Let’s roll up our sleeves:

import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
from scipy.stats import expon

saveCsvLocalDir = os.environ['SAVE_CSV_LOCAL_DIR']
df = pd.read_csv(f'{saveCsvLocalDir}/coinbase_btc-usd.csv', usecols=['time_seconds', 'time_nanoseconds'])
df2 = df.groupby(['time_seconds', 'time_nanoseconds']).size().to_frame('number_of_trades').reset_index()
df2['time'] = df2['time_seconds'].astype(int) + df2['time_nanoseconds'].astype(int) / 1e9
df2['inter_trade_duration'] = df2['time'].diff()
df2 = df2.iloc[1:]  # drop the first row, whose diff is NaN
# Empirical distribution: normalized histogram of the inter-trade durations.
df2['inter_trade_duration'].plot(kind='hist', density=True, bins=100)
# Theoretical exponential pdf with scale equal to the sample mean.
arange = np.arange(0, 10, 0.01)
plt.plot(arange, expon.pdf(arange, 0, 1.1983892418235191))
plt.xlabel('Inter-trade duration (seconds)', fontsize=20)
plt.xlim(0, 6)
plt.ylabel('Probability density', fontsize=20)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.show()
Probability density function for inter-trade duration

Clearly, the empirical histogram matches the fitted exponential curve remarkably well! The inter-trade duration in this market closely follows the exponential distribution, which implies that trade arrivals in this market are well described by a Poisson process!
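For readers who prefer numbers over pictures, the visual fit can be complemented with a quantitative goodness-of-fit check. Here is a minimal sketch using a Kolmogorov-Smirnov test against the fitted exponential (df2 as constructed in the plotting code above; strictly speaking, estimating the scale from the same sample biases the test slightly, so treat this as a sanity check rather than a rigorous test):

from scipy.stats import kstest

durations = df2['inter_trade_duration'].dropna()
# Compare the empirical distribution against an exponential with loc=0 and
# scale equal to the sample mean; a large p-value means we cannot reject
# the exponential hypothesis.
result = kstest(durations, 'expon', args=(0, durations.mean()))
print(result.statistic, result.pvalue)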

Phew, enough math and coding 😐. To conclude: in this article, we have briefly reviewed the statistical theories and models used to describe trade arrival events, learned how to quickly download tick-by-tick trade data for various cryptocurrencies and exchanges, examined some basic statistical properties in the context of crypto trades, and performed statistical parameter estimation and plotting to verify the proposed theories. As a reader and/or trader, how would you like to leverage the data to your advantage? 😄 The sky is the limit!
