Navigating the Depths of Cryptocurrency Order Books

Data Mining, Order Book Processing, Dimension Standardization, and Data Labeling

Tikhon Radkevich
Oct 15, 2023

Introduction

Data Science is not just about math and coding; it’s also about effective communication. Sharing results and insights is just as important as the technical skills required for the job. That’s why I’m here.

This blog is primarily a place for me to improve my communication skills and showcase my data science projects. However, I also hope it can provide insights and guidance for beginners in the field. Through my writing, I will share the entire process, from data mining to model fitting and running the service, and detail the challenges I've encountered, the mistakes I've made, and the results I've achieved.

Idea

The project idea revolves around developing a trading bot for cryptocurrency markets. Utilizing insights derived from a machine learning model, the bot will place limit orders while considering take-profit and stop-loss parameters. The primary objective of such a bot is to identify entry points in the market.

The uniqueness of this project lies in using order book depth as the key feature for a classification model. Instead of a complex model for precise price forecasting, the project relies on a simple classifier: if the expected growth exceeds a specified threshold, the model recommends buying; otherwise, it suggests waiting.

Order book depth is an indicator that reflects the current volumes and prices at which traders are willing to buy or sell a coin. My hypothesis is that analyzing changes in order book depth over a certain time window can reveal the market trend for a short period ahead, approximately 10–15 minutes.

Data Mining

The trading strategy revolves around short-term limit orders, so data was collected at short intervals. Coins with low market capitalization and price volatility were chosen for this purpose.

I chose two primary sources for data mining: Binance, one of the most popular cryptocurrency exchanges, and TradingView, a widely used platform for technical analysis and charting. From Binance, I obtained order books through a WebSocket connection; for TradingView, I used the asynchronous library aiohttp to retrieve indicators.
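
As a minimal sketch of what the collector does (the real one keeps the book updated over a WebSocket and also merges in TradingView data), here is a simplified version that just polls Binance's public REST depth snapshot with aiohttp every 10 seconds; the symbol, snapshot count, and output file are placeholders:

import asyncio
import json
import time

import aiohttp

# Simplified collector: the real project keeps the order book up to date over a
# Binance WebSocket; here we simply poll the public REST snapshot endpoint.
DEPTH_URL = "https://api.binance.com/api/v3/depth"


async def collect_order_books(symbol="BTCUSDT", depth=2000,
                              interval_s=10, n_snapshots=6):
    snapshots = {}
    async with aiohttp.ClientSession() as session:
        for _ in range(n_snapshots):
            params = {"symbol": symbol, "limit": depth}
            async with session.get(DEPTH_URL, params=params) as resp:
                book = await resp.json()
            ts_ms = int(time.time() * 1000)
            snapshots[str(ts_ms)] = {
                "symbol": symbol,
                "bids": [[float(p), float(v)] for p, v in book["bids"]],
                "asks": [[float(p), float(v)] for p, v in book["asks"]],
                # "TA": {...}  # TradingView analysis would be merged in here
            }
            await asyncio.sleep(interval_s)
    return snapshots


if __name__ == "__main__":
    data = asyncio.run(collect_order_books())
    with open("order_books.json", "w") as f:
        json.dump(data, f)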
As for storage, I used JSON with the following structure:

{
  "time_1_ms": {
    "symbol": "SYMBOL/PAIR",
    "bids": [[price_1, volume_1], ..., [price_2000, volume_2000]],
    "asks": [[price_1, volume_1], ..., [price_2000, volume_2000]],
    "TA": {
      "1m": {
        "analysis": {"RECOMMENDATION": str, "BUY": int, "SELL": int, "NEUTRAL": int},
        "indicators": {...}
      },
      ...,
      "1d": {
        "analysis": {...},
        "indicators": {...}
      }
    }
  },
  ...,
  "time_n_ms": {...}
}

Here, "time_1_ms" is the timestamp, in milliseconds, at which the snapshot was received from Binance. The data was collected at 10-second intervals. Additionally, data from TradingView was included: analysis and indicators for eight timeframes ranging from 1 minute to 1 day.

bids: a list of [price, volume] pairs representing buy orders, i.e., the prices and quantities at which traders are willing to purchase the coin.
asks: correspondingly, a list of pairs representing sell orders, at which traders are willing to sell the coin.
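
The analysis block above mirrors the rating summary TradingView exposes. In the project, I fetched it asynchronously with aiohttp; purely as an illustration, the tradingview_ta package returns the same fields synchronously (the package choice, symbol, and exchange here are my assumptions for the sketch):

from tradingview_ta import TA_Handler, Interval

# One timeframe for one symbol; the project repeats this for eight intervals
# from 1 minute to 1 day. Symbol and exchange are placeholders.
handler = TA_Handler(
    symbol="BTCUSDT",
    screener="crypto",
    exchange="BINANCE",
    interval=Interval.INTERVAL_1_MINUTE,
)
analysis = handler.get_analysis()
print(analysis.summary)     # {"RECOMMENDATION": ..., "BUY": ..., "SELL": ..., "NEUTRAL": ...}
print(analysis.indicators)  # dict of raw indicator values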

I collected data on 22 coins over a period of 24 days, resulting in a total of 48 zip files (each covering a 12-hour interval), amounting to 90GB of data.

Additionally, I obtained OHLCV (open, high, low, close, volume) data from Binance:

+=========+======+======+=====+=======+========+
| time_ms | open | high | low | close | volume |
+=========+======+======+=====+=======+========+
|   ...   | ...  | ...  | ... |  ...  |  ...   |
+---------+------+------+-----+-------+--------+
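
This kind of table can be pulled from Binance's public klines endpoint; a small sketch with requests and pandas (symbol, interval, and limit are placeholders):

import pandas as pd
import requests

# The klines endpoint returns [open_time, open, high, low, close, volume, ...]
# per candle; we keep only the columns shown in the table above.
def fetch_ohlcv(symbol="BTCUSDT", interval="1m", limit=1000):
    resp = requests.get(
        "https://api.binance.com/api/v3/klines",
        params={"symbol": symbol, "interval": interval, "limit": limit},
    )
    resp.raise_for_status()
    rows = [row[:6] for row in resp.json()]
    df = pd.DataFrame(rows, columns=["time_ms", "open", "high", "low", "close", "volume"])
    return df.astype({"time_ms": "int64", "open": float, "high": float,
                      "low": float, "close": float, "volume": float})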

Processing

Let’s focus on order books and take a look at a specific cryptocurrency:

In the chart, I’ve marked the boundaries with purple dashed lines, representing ±3.2% from the current price levels. It’s evident that significant changes are occurring in the central region. After visually analyzing various coins, I concluded that it’s possible to trim the order book to the active zones of change to clean the data for the model.

For different coins, the ±3.2% corridor may contain a varying number of (price, volume) values, ranging from 40 to 200. This depends on the coin's price and its minimum price increment (tick size) on the exchange.

For example, let's take a price step of 0.01 and a 50% price change:

For a price of $0.50, that change spans 25 values (from 0.50 to 0.75)
For a price of $1.00, it spans 50 values (from 1.00 to 1.50)
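
The same arithmetic in code (illustrative values from above; round() absorbs floating-point noise):

def levels_in_range(price: float, pct: float, tick: float) -> int:
    """How many tick-sized price levels fit inside a price corridor of width pct."""
    return round(price * pct / tick)

print(levels_in_range(0.50, 0.50, 0.01))  # 25
print(levels_in_range(1.00, 0.50, 0.01))  # 50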

The goal is to create identical inputs for the model: to place the bids and asks into arrays of the same length that cover the same percentage price range. Then I can drop the prices and keep only the volume values.

Order Book Processing
Asks and Bids Filtering

These functions filter the sell-side and buy-side data based on a price-change corridor and a specified order book length. In filter_asks, note the calculation of max_price and the comparison asks[i][0] <= new_price, which relies on asks being sorted in ascending order. filter_bids uses min_price and bids[i][0] >= new_price, since bids are sorted in descending order.
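
The original functions are not reproduced here; a minimal sketch consistent with the description above, which trims the book to the ±3.2% corridor and merges the remaining levels into a fixed number of buckets (function and parameter names are my assumptions):

def filter_asks(asks, max_price_change=0.032, book_length=64):
    """Trim asks to a +max_price_change corridor above the best ask and merge
    them into book_length equal-width price buckets.
    asks: [[price, volume], ...] sorted by ascending price."""
    best_ask = asks[0][0]
    max_price = best_ask * (1 + max_price_change)
    step = (max_price - best_ask) / book_length

    filtered = [[0.0, 0.0] for _ in range(book_length)]
    i = 0
    for level in range(book_length):
        new_price = best_ask + step * (level + 1)
        filtered[level][0] = new_price
        # Asks are sorted in ascending order, so accumulate volume until the
        # price passes the current bucket boundary.
        while i < len(asks) and asks[i][0] <= new_price:
            filtered[level][1] += asks[i][1]
            i += 1
    return filtered


def filter_bids(bids, max_price_change=0.032, book_length=64):
    """Same idea for the buy side; bids are sorted by descending price."""
    best_bid = bids[0][0]
    min_price = best_bid * (1 - max_price_change)
    step = (best_bid - min_price) / book_length

    filtered = [[0.0, 0.0] for _ in range(book_length)]
    i = 0
    for level in range(book_length):
        new_price = best_bid - step * (level + 1)
        filtered[level][0] = new_price
        while i < len(bids) and bids[i][0] >= new_price:
            filtered[level][1] += bids[i][1]
            i += 1
    return filtered

Merging neighboring levels into wider buckets is exactly what produces the larger per-level volumes discussed next.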

Analysis of Changes

Now we can analyze the differences between the original order book and its filtered version. Let’s examine this through the example of compression:

[Interactive chart available on Chart Studio]

A noticeable change is the increase in individual volume values (from 50k to 60k), since neighboring levels were combined. To ensure this hasn't distorted the underlying data, we can build a cumulative (sum) graph similar to what you might see on exchange platforms:

[Interactive chart available on Chart Studio]
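
The sum graph is just the cumulative volume at each price level, the way exchanges draw their depth charts. A minimal sketch of that transformation, assuming [price, volume] lists as before:

import numpy as np

def depth_curve(levels):
    """Cumulative volume per price level, as drawn in exchange depth charts."""
    prices = [price for price, _ in levels]
    cumulative_volume = np.cumsum([volume for _, volume in levels]).tolist()
    return prices, cumulative_volume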

During the compression process, the graph becomes smoother. Now, I have the prepared data for training the model — volume values of order books for different coins, all of the same length and with an equal price increment:

"bids": [[price_1, volume_1], ..., [price_64, volume_64]],
"asks": [[price_1, volume_1], ..., [price_64, volume_64]],
"max_bids_price": max_bid_price, # current selling price
"min_asks_price": min_asks_price, # current purchase price

Intermediate Summary

After data processing, the scale and volume of information were significantly reduced. Initially, I had archives totaling 90 GB (already compressed 4–5 times by zip), but after filtering, only 5 GB remained (excluding TradingView indicators).
This happened because, when developing the socket for retrieving data from Binance, I set the order book length to 2000 levels. During filtering, I first trimmed the book (to within ±3.2%), reducing the number of levels to 100–200, and then compressed it to 64.
Beyond storage issues, difficulties also arose in processing. For example, filtering a single archive (spanning 12 hours of data) took 20 to 30 minutes, resulting in a total processing time of more than 10 hours.

I didn’t dedicate enough time to the preliminary analysis of order books. Consequently, I set its length to 2000 (thinking that more is better). Here, I made a mistake.
Initially, I thought that data mining was an intermediate stage that should be quickly completed to start working with the data. However, I now understand that the data collection process is crucial, and it should have been more thoroughly explored. This would have saved a lot of time in the subsequent stages.

At the moment, I’m developing an application for data collection and integrating preprocessing algorithms so that only the necessary information can be stored. I will provide a link to the project at the end.

Data Labeling

Before moving on to the data labeling algorithm, let's revisit the goal of the trading bot. The essence is to analyze a window of market data and tag the price dynamics for a short period ahead. Ultimately, the machine learning model performs classification with two classes: "buy" (1) and "wait" (0). From this, the first parameters follow:

  1. timeline: This is the time interval on which I will make predictions. In the context of finding target values, this interval is the "working" time during which I analyze market data.
  2. required_growth (future take-profit): This parameter defines the percentage price increase considered positive. If there is growth exceeding this threshold, the data is tagged as "buy" (1).
  3. acceptable_loss (future stop-loss): This parameter is relevant for situations where the price initially drops and then begins to rise. It allows you to determine the point at which it's worth starting to buy after a price drop. It prevents buying during a decline. The data is tagged as "wait" (0).
  4. required_positive_duration: This parameter sets the minimum duration for which the price must exceed the threshold value (required_growth) for the data to be tagged as "buy" (1).

So, the first thing that comes to mind: we take values at each time interval and check whether the price will increase above the threshold value or not. Let’s call such an algorithm “simple.” However, in practice, there are nuances to consider. To better understand, let’s look at one of them:

Below the dotted gray line is the maximum negative threshold (stop-loss) for the blue points; above the dotted green line is the minimum positive value (take-profit) for them. Point "1" and the points in region "3" are suitable for buying, as the price will rise in the future. However, note point "2": it is not suitable for buying. The red dotted line represents its stop-loss.

The "simple" algorithm will label the data exactly as described. I believe such an approach could hurt data quality, since the samples form a time sequence, and discontinuities like point "2" can slow down model training. Therefore, within one timeline, I will label continuous runs of points, like those in region "3", as "buy". This mitigates interruptions and makes the data more consistent, ultimately improving the quality of analysis and model training.
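
For reference, the per-point ("simple") rule can be sketched in a few lines; the segment-based grouping described above builds on it. The defaults match the example values in the next section, and the function name is arbitrary:

def label_point(prices, t, timeline_length=30, required_growth=0.004,
                acceptable_loss=0.0019, required_positive_duration=3):
    """The 'simple' rule: return 1 ('buy') if, within the next timeline_length
    steps, the price stays above take-profit for required_positive_duration
    consecutive steps before ever touching stop-loss; otherwise 0 ('wait')."""
    entry = prices[t]
    take_profit = entry * (1 + required_growth)
    stop_loss = entry * (1 - acceptable_loss)

    streak = 0
    for price in prices[t + 1 : t + 1 + timeline_length]:
        if price <= stop_loss:
            return 0  # stop-loss would trigger before the growth target
        streak = streak + 1 if price >= take_profit else 0
        if streak >= required_positive_duration:
            return 1
    return 0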

To achieve this, I divide the timeline into segments and analyze each of them.

Let’s consider the outcome using specific values:

  • timeline_length: 30 (corresponding to 5 minutes)
  • required_growth: 0.4%
  • acceptable_loss: 0.19%
  • required_positive_duration: 3 time intervals (30 seconds)
[Interactive chart available on Chart Studio]

Whole groups of points form at the bases of the rising segments of the chart. These groups lie within time intervals where the asset's price shows stable growth, and the algorithm labels such segments as "buy".

It's worth noting the quantity of these points. In the chart title, I've displayed their percentage relative to the total number of points: Profit Markers (5.07%). This reflects the class imbalance between "buy" and "wait".

[Chart: the ratio for other coins with the same parameters]

The question is how to choose these parameters, since they affect both the required accuracy of the future model and the class imbalance:

  • Extending the timeline_length will increase the number of 'buy' points, reducing the class imbalance. However, it's important to understand that the current state of the order books is unlikely to reflect the price trends of the coin for a long period in the future. Therefore, excessive extension could adversely affect the model's quality. Moreover, as the interval increases, the number of points decreases:
[Chart: the number of "buy" points for each time interval]
  • Increasing the required_positive_duration will improve data quality and eliminate random fluctuations. However, it will reduce the number of points, leading to an increase in class imbalance.
  • When choosing required_growth and acceptable_loss, it's essential to consider the exchange's fees. I'll be a taker when buying and a maker when selling; currently, on Binance, the base fee for both is 0.1% (see the quick calculation below).
[Chart: calculating model accuracy without considering class imbalance]
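
As a rough back-of-the-envelope check of how the fees interact with these thresholds (fee values are the 0.1% base rate mentioned above; slippage is ignored):

# Assumed fees: 0.1% taker on entry, 0.1% maker on exit.
taker_fee, maker_fee = 0.001, 0.001
required_growth, acceptable_loss = 0.004, 0.0019

round_trip_fees = taker_fee + maker_fee          # ~0.2% per trade
net_win = required_growth - round_trip_fees      # ~0.2% kept on a winning trade
net_loss = acceptable_loss + round_trip_fees     # ~0.39% lost on a losing trade

# Win rate at which the strategy roughly breaks even, ignoring class imbalance:
breakeven_win_rate = net_loss / (net_win + net_loss)
print(round(breakeven_win_rate, 2))  # ~0.66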

Conclusion

In this article, I’ve explored methods for achieving uniform dimensions in order books across various cryptocurrencies and provided an example of my data labeling approach. While my primary focus has been on data processing and labeling rather than machine learning models, I’m currently exploring the integration of ML algorithms to create predictive models for the future.

I’ve made strides in the development of a real-time cryptocurrency data collection tool that uses the power of Binance and TradingView. Beyond data mining, the project offers robust data processing capabilities. You can efficiently process, analyze and label the collected data.
👉 You can find code here: DigitalAssetFlow

You can find all the plots on Chart Studio 📊

As I continue my journey, I look forward to sharing my future results and insights.
Thanks for reading!
