Storing and Processing Billions of Cryptocurrency Market Data Using InfluxDB

Published in

Coinograph

8 min readJun 4, 2018

Recently, I’ve been interested in historical market data across different cryptocurrency exchanges for data analysis and automated trading. I began looking for a source to get OHLCV and raw trade data and soon I realised finding high resolution historical market data is not very straightforward. To be fair, most of the known exchanges provide a rather complete API allowing you to fetch market data through their public endpoints and websockets. However, when it comes to historical data it could be challenging because many of them do not support fetching the history of raw trades or high resolution OHLCV. I decided to collect this data myself by finding this data from around the web, APIs, websockets and eventually, I turned it into a service named Coinograph which provides API and historical data offering across multiple exchanges. At the moment Coinograph has nearly 3B data points for 6major exchanges.

The technical challenge is that cryptocurrency exchanges produce a massive volume of trading on a daily basis. For example, at the time of writing this post, only on Binance the number of trades exceed 3.5M per day.

Enter InfluxDB

InfluxDB is a fast time-series database written in GO. It is a great product to store timestamped data such as application metrics, sensor data and real-time analytics. It offers a SQL-like query language for data exploration.

Originally, I used PostgreSQL to store time-series data. However, once I reached several hundred millions of records, querying the data increasingly became slow and scaling it was more difficult. Also, crypto market data are immutable time-series and can be piled up using a schemaless storage such as InfluxDB.

In InfluxDB, the data is stored in measurements which is conceptually the equivalent of a table in an RDBMS. Each data point has a timestamp and can have fields and tags. Fields are not index and tags are indexed. InfluxDB organises data into series based on measurement name and tags for fast retrieval. If a point does not have any tags the timestamp is the unique key and will be overridden if a new point with the same timestamp is inserted. You can read more about the key concepts in the official documentation.

InfluxDB has an HTTP API to read and write data and we use influxdb-python as a wrapper on top it. The data can be written in batches to avoid too many requests.

Cryptocurrency Market Data

In a cryptocurrency or a traditional exchange, every trade essentially has a few basic information such as timestamp, pair, trade_id, side (sell/buy) and amount. A simple trade can be inserted in InfluxDB interactive CLI:

> CREATE DATABASE mydb;> USE mydb;> INSERT trades,exchange=binance,pair=btcusdt trade_id=100,price=8200.0,side="sell",amount=0.21 1527867107000ms

This will add a sample trade to the trades measurement with exchange and pair as tags while trade_id, price, side and amount as fields (notice the space after btcusdt) because we don’t need those fields to be indexed. At the end, we can also set the timestamp of the data. If that is not provided InfluxDB automatically set the present time as the timestamp. In our case, the exchanges usually provide the timestamp and therefore we can store the trade with that timestamp. The timestamp is epoch in nano seconds but one can use hour(h), minute(m), second(s), millisecond(ms), microsecond(u) by simply adding it to the number. For example 1527867107000ms is the same as 1527867107s.

There is no need to define any schema and our data will be available right away. You can query it with:

> select * from trades
name: trades
time                amount exchange  pair      price side trade_id
----                ------ --------  ----      ----- ---- --------
1527867107893000000 0.21   binance   btcusdt   8200  sell 100

In case you need to know which columns are defined as tags you can run:

> show tag keys
name: trades
tagKey
------
exchange
pair

If you run show series you will also see a new series is created:

> show series
key
---
trades,exchange=binance,pair=btcusdt

If we add another data point with different tags, a new series will be created.

> INSERT trades,exchange=bitfinex,pair=btcusd trade_id=23100,price=8100.0,side="buy",amount=1.43 1527121437000000000
> select * from trades
name: trades
time                amount exchange pair    price side trade_id
----                ------ -------- ----    ----- ---- --------
1527121437000000000 1.43   bitfinex btcusd  8100  buy  23100
1527867107893000000 0.21   binance  btcusdt 8200  sell 100

If you want to see readable timestamp you can run > precision rfc3339 to see standard ISO time format.

What is now interesting is that if you now check the list of the series you will see this:

> show series
key
---
trades,exchange=binance,pair=btcusdt
trades,exchange=bitfinex,pair=btcusd

A new series is created because the new data point was added with different tags. InfluxDB allows you to have millions of series based on tags and when querying the database by tags it can find the right series based on measurement name and tags very quickly.

An important indicator in financial market data is to run aggregation on raw trade data to construct OHLCV or candlestick data points on a given time interval. For example, the following charts shows the candlestick chart for BTCUSD on Bitfinex on a daily interval.

A candlestick is constructed from the following information for a given time interval:

Open: The price of the first trade
High: The highest traded price
Low: The lowest traded price
Close: The price of the last trade
Volume: Total amount of trades

The color indicates if the close of a candle was above (green) or below (red) the opening price.

InfluxDB supports aggregate functions and makes it extremely easy to construct candlestick data from raw trade data. Aggregating trade data to construct candlesticks is as easy as the following query:

SELECT
first(price) AS open,
last(price) AS close,
max(price) AS high,
min(price) AS low,
sum(amount) AS volume 
FROM trades 
WHERE exchange='binance' AND pair='btcusdt' AND time > 1525777200000ms and time < 1525831200000ms 
GROUP BY time(1h), pair, exchange

The group by time(1h) groups the trades in one hour intervals and the aggregate functions operate in that interval. Also, grouping by pair and exchange converts them to tags and makes them indexed. The result of the query above would be:

name: trades
tags: exchange=binance, pair=btcusdt
time          open    close   high    low     volume
----          ----    -----   ----    ---     ------
1525777200000 9295    9198    9305    9185    1780.9619940000014
1525780800000 9200    9140    9200.01 9060.54 3107.177060999968
1525784400000 9140    9128.02 9194.08 9102    1214.1127129999957
1525788000000 9129.99 9135    9187    9100    1134.3902329999914
1525791600000 9135    9199.6  9199.99 9129.31 873.3965249999952
1525795200000 9199.64 9135    9199.99 9120    725.5548289999944
1525798800000 9134    9136.93 9181    9123.78 615.2397540000007
1525802400000 9136.95 9217    9259.23 9080    1652.4270649999996
1525806000000 9217    9255.01 9271    9201.01 757.6078589999971
1525809600000 9255    9163    9257    9163    778.7147539999921
1525813200000 9165    9190    9247.01 9163    637.2860459999731
1525816800000 9190    9199.02 9250.32 9185.04 679.3364369999649
1525820400000 9200    9187.56 9254    9170.57 608.9023279999975
1525824000000 9178    9123.99 9199    9109.61 1180.1364439999986
1525827600000 9115.1  9166    9175    9115.1  666.0184420000018

For a large number of trades, it is not very efficient to construct the candles on runtime. It is possible to use the INTO clause to insert them in a separate measurement. Afterwards you can build higher interval candlesticks from smaller candles. For example if you have 1m candles and you want to build the 5m candles you will have:

SELECT max(high) AS high, min(low) AS low, first(open) AS open, last(close) AS close, sum(volume) AS volume FROM candles_1m GROUP BY time(5m), pair, exchange

Continuous Queries and Real-time data Aggregation

For any financial market it is necessary to get real-time data as they are produced. The same goes for any cryptocurrency exchange. You can either get the latest data through API or websockets. When receiving trade data, you would want to aggregate them on different time intervals to update the latest candle. InfluxDB has a feature named continuous queries which allows you to run queries periodically on real-time data and store them in a specific measurement. For example, in our case as new trades come in we want to make sure that the candlestick for the latest interval is updated accordingly. Moreover, you can also aggregate a few recent intervals in case you had some connection loss and you’ve updated the trades with the missing trades. Therefore you can create a continuous query such as:

CREATE CONTINUOUS QUERY trade_to_candles_60 ON mydb 
RESAMPLE EVERY 10s FOR 10m 
BEGIN SELECT first(price) AS open, last(price) AS close, max(price) AS high, min(price) AS low, sum(amount) AS volume INTO 
candles_1m FROM trades WHERE 
GROUP BY time(1m), pair, exchange END

This is an advanced example of continuous queries. They query between BEGINand END creates candlesticks on 1m interval (group by time(1m) ) and inserts it into candle_1m measurement. RESAMPLE EVERY 10s executes this query once every 10 seconds. FOR 10m runs this query in the range of now and 10 minutes ago meaning in this case the last 10 candles will be reconstructed and overwritten. So with real-time data, the aggregation of candlesticks happens every 10 seconds.

Once you have 1m candlesticks you can easily aggregate them with more continuous queries to construct bigger time intervals.

Continuous queries are very powerful to deal with real-time data and saves a lot of time and resources to run your own periodic jobs.

Hardware Considerations

Depending on your requirements, InfluxDB needs different amount of resources to run smoothly. The official documentations provide some hardware sizing guidlines to help you setup your instance. Since InfluxDB version 1.5 and the release of Time Series Index our memory footprint has gone to around 1.5GB down from 2.5GB. However, importing a lot of data at once can still consume high amount of memory up to 4GB or 5GB. For real-time data, we batch the writes to reduced the number of request and has worked very well so far. Our read request are growing but still not significant therefore we haven’t reached any limit.

InfluxDB also supports clustering but it is a paid service and so far we didn’t need to use that.

Final Thoughts

Overall, InfluxDB is quite suitable for our requirements. We will be scaling up to more exchanges and also support decentralised exchanges and the volume of data will increase significantly. Data aggregation functions along with continuous queries makes processing real-time data very easy and the SQL-like query language allows data exploration as easy as relational databases. We are expecting larger volumes in the upcoming month and we can assess how it is performing in the future.

If you find this article interesting you can give us a clap and If you’re interested about real-time and historical cryptocurrency data, checkout https://coinograph.io. We provide API as a service and historical data dumps for different markets. Also we have a bot for cryptocurrency traders to get technical analysis signals. See it on https:/t.me/coinographbot

You can also follow me on twitter on @n4cr to get in touch!