An Order Book Implementation in Matlab and Applied Machine Learning

Andrei B
16 min read · May 22, 2020


Photo by Kehn Hermano from Pexels

In this presentation we first take a short journey through the Order Book's ancestry, then we build a Depth of Market application in Matlab, and lastly we apply some Machine Learning models to the open order prices on the price ladder and their dance along the timeline, in an effort to build a short-term Order Book market predictability system: a concept currently applied by Level 2 Market Data generators and consumers on the busiest centralized exchanges in the world. We stop and enrich the story in a couple of places with relevant background information from the industry.

Overview of Two DOMs (Depth of Market Apps) in Action.


Order Book, an interesting technological evolution of Tape Reading?

Across the whole spectrum of market-related tools and software, the Order Book, or DOM (Depth of Market) as many brokers call it, presents some of the most interesting aspects.

First, it is rooted in time. To the trading community, it is the rootkit no modern antivirus software can remove from our financial literature or render inactive. It is the reincarnation and evolution of an old technique called tape reading, from the days when a telegraph line would print a symbol, volume and price on ticker tape, which would then end up on a quotation board in front of the so-called trading audience crowded inside a bucket shop.

That was happening, give or take, 100 years ago, and some well-known names such as the Boy Wonder or Boy Plunger, Rollo Tape (Wyckoff's pen name), Gann or Jim Rogers survived the test of time through the literature.

Some will say the teachings of Reminiscences of a Stock Operator are as applicable to the markets as Portfolio Optimization's Efficient Frontier. Gann angles are still being taught and written about as if they were a perfectly valid investment technique. Or, even more compelling, Richard Wyckoff has a Stock Market Institute to his name. If nothing else, going through that literature in your spare time, while you are on the sidelines during these unprecedented volatility times, is as entertaining as it gets. Entertaining, that is, on multiple levels. First, these writings send you back in time, wearing an invisible coat and sitting beside Jesse, or Larry Livingston, while he corners the stock market on various occasions; second, by looking at today's Technical Analysis aficionados: there are people out there who would challenge you to a real sword fight should you dare say there are no significant price levels on a chart.

Today’s Order Book mirage at its finest

Fast-forwarding to today's Order Book concept, we deal with a set of many open buy and sell orders placed at different price levels, and a matching engine doing its best to generate as much transaction volume as possible. All this is made visible through Level 2 data, which is available from brokers or specialized data feeds for an add-on cost.

The concept is captivating: a principal price ladder, the volume columns (one for the total, to the left of the price, and one for each side on the far right), the two main Level 2 Limit Order columns (the inner blue/red ones corresponding to the bid/ask respectively), the limit orders' relative volume changes (the outer blue/red columns) and the P&L (Profit & Loss) column, should you have an open position.

Once you get the basics, you are hooked. Just watch it for 5 minutes into a peak-season Monday morning open on the Chicago Mercantile Exchange's ES (E-mini S&P) current futures contract, and the price dance dynamics get stuck in your memory like an annoying grease stain on your latest iPhone's screen, in the forever brain-carved belief that "there is an edge in that dance, and I must find it". And you are right. There is an edge to that dance. Who would go dancing on Britain's Got Talent if there wasn't a consistent dose of fun and fame waiting on the other side? The question is: how good is your voice, how good are your dancing skills, can you compete?

Diving Deeper into Level 2 Data and Matching Algorithms

Diving deeper, and I feel the need to, there are certain aspects of how the Order Flow works that many are not aware of. Several key aspects to consider before descending further into the rabbit hole:

1) MBOs. Level 2 data already places a large strain on your internet connection and your software/hardware combo: not when the market is idle, but when network congestion or data processing bottlenecks kick in. On top of that, CME is introducing a new (or old?) concept: Market by Order (https://www.cmegroup.com/education/market-by-order-mbo.html). This will add value not to the retailer, who will not have the processing power to consume the full flow in real time, but to the professional near-exchange market players, HFTs and Market Makers.

2) FIFO? Maybe, maybe not! Contrary to intuition, the order depth matching algorithm is not always FIFO. Meaning someone places a buy order AFTER you at the same price, and when the market visits your level, she gets filled before you, leaving you stunned with surprise (https://www.cmegroup.com/confluence/display/EPICSANDBOX/Supported+Matching+Algorithms). See the sketch below for how a pro-rata split plays out.
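As a simplified illustration, here is roughly how a pro-rata allocation, one of CME's supported alternatives to FIFO, splits an incoming order across resting orders at the same price. The real algorithm adds top-order priority and rounding rules; this is only the core idea:

resting  = [300; 100];                              % resting bid quantities at one price level
incoming = 100;                                     % incoming sell order size
fills = floor(incoming * resting / sum(resting));   % -> [75; 25], regardless of arrival order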

3) Missing data already! There is movement in the current Order Book price and limit order levels that you don't see. The DOMs available on the retail market are set to ignore what they cannot handle in speed, or what would escape our eyesight anyway. And that's without MBOs. For example, below are two DOMs, one next to the other, the right one missing a 2303.25 Ask update. The diff is highlighted in yellow.

Level 2 Data Update missed by the right DOM App. In the center, my very own DomRunner Depth Of Market App implementation, available on Windows Store

Matlab Depth of Market Implementation

Matlab is, for many, the best rapid-prototyping data science environment. I know many would say this other thing is better for production, that other thing is so much richer in statistical analysis capabilities, etc. But in terms of rapid prototyping, I still stick to my guns. Especially since it has the BEST and most concise documentation out there. You don't have to agree though; after all, the best tool is the one you are faster and more proficient in.

Let the deep dive begin. We are going to use a Matlab Live Script (which is similar to a Jupyter Notebook), as it is much more convenient for code aggregation and presentation. In our case, we reference the file "02.mlx" from the repository mentioned below, which you can open in Matlab.

Due to the speed constraints and fast data processing requirements, for the DOM representation we are forced to use the fastest data structure Matlab has. And this is not a table, timetable, structure or object. I tried them, since it is much nicer to work with a timetable than with raw arrays, but there is a significant delay in processing updates compared to the plain old matrix, the two-dimensional array, which is the fastest thing Matlab has. This enables fast follow-up actions based on potential data triggers / discoveries on the matrix. In my experience, nothing in Matlab beats a simple matrix; a rough benchmark is sketched below.
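A rough micro-benchmark sketch of the kind of comparison that led me there. The sizes are illustrative and the numbers are machine-dependent, but the ordering has been consistent in my runs:

n = 1e5;
dom = zeros(20, 4);                                % plain matrix DOM
tt  = array2timetable(zeros(20, 4), 'RowTimes', datetime('now') + seconds(1:20)');

tic
for i = 1:n
    dom(10, 2) = dom(10, 2) + 1;                   % one Top-of-Book update on the matrix
end
tMatrix = toc;

tic
for i = 1:n
    tt{10, 2} = tt{10, 2} + 1;                     % the same update on a timetable
end
tTimetable = toc;

fprintf('matrix: %.3fs, timetable: %.3fs\n', tMatrix, tTimetable);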

Preparation Steps for Setup, and Implementation

Setup

I've used Windows 10 for developing this, as I'm most familiar with running Matlab in this environment. I'm fairly sure any other OS will do, if you already have Matlab running on it. The GitHub repository where you can download all the files is:

https://github.com/andreireiand/MatlabZMQ3.git

Download a JeroMQ jar file. I've used version 0.5.1 and placed it under "C:\Users\username\Documents\JeroMQ\jeromq-0.5.1.jar". This path is present in the "javaclasspath.txt" file, instructing Matlab to include the jar in its Java class path (Matlab actually copies it under its root directory; more about "javaclasspath.txt" can be found on the Matlab website, as the dynamic javaclasspath command did not work for me).
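For reference, the whole "javaclasspath.txt" can be as small as a single line holding that path:

C:\Users\username\Documents\JeroMQ\jeromq-0.5.1.jar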

This presentation handles the receiving side of Zero-MQ, the Zero-MQ integration with Matlab, and building the DOM in Matlab. The sending side of Zero-MQ, the Level 2 data sourcing and formatting, is the subject of a subsequent article. Nonetheless, to see your DOM populated, you need the source implemented too. Stay tuned.

Implementation

In the first two Live Editor sections, we lay out the pipework required for the executionDom matrix to look like a DOM. Being a plain matrix, it will never look aesthetically pleasing, but we sacrifice looks for speed. A hypothetical initialization is sketched below.
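A minimal initialization sketch; the real layout lives in "02.mlx", and the column order here ([price, bidQty, askQty, totalVolume]) is an assumption for illustration:

tickSize = 0.25;                                        % ES futures tick size
nLevels  = 40;                                          % ladder depth
topPrice = 2310.00;                                     % illustrative upper edge of the ladder
prices   = (topPrice:-tickSize:topPrice-(nLevels-1)*tickSize)';
executionDom = [prices, zeros(nLevels, 3)];             % quantities start at zero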

The 3rd section handles processing updates and populating the executionDom matrix. Here things get a bit tricky:

queMkt = parallel.pool.DataQueue();
lisMkt = afterEach(queMkt, @getMktToken);

The first line builds a DataQueue, which in Matlab enables sending data from workers back to the client in a parallel pool while computation is carried out (in our case, receiving Level 2 market updates). Now we can use queMkt to send or listen for messages (or data) from different workers.

The second line defines a function to call when new data arrives in the DataQueue. The Level 2 market data handling happens in the getMktToken function. The result is that executionDom gets updated at the "Top of Book", i.e. the innermost bid/ask levels.

The following two lines do exactly the same, but for the Market Depth tokens.
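Presumably the depth-side pair mirrors the market-side one (the callback name is my assumption; the real one lives in the repository):

queDep = parallel.pool.DataQueue();
lisDep = afterEach(queDep, @getDepToken);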

Sections 4 and 5 of the document handle function execution asynchronously on a parallel pool worker through “parfeval”:

fObjMkt = parfeval(@GetMktMessageLoop, 0, queMkt, 'ES 06-20 Globex', '5550');
fObjDep = parfeval(@GetDepMessageLoop, 0, queDep, 'ES 06-20 Globex', '5551');

This is a powerful function in Matlab's arsenal, easy to use yet a bit less known. The second argument (0) is the number of outputs we expect back from the worker; the remaining arguments, the full name of the ticker we expect in the queue and the TCP listening port, are passed to the loop functions. There are many other helper functions and files you may want to dive into. I welcome your feedback.

The last section of the document cleans up variables and prepares the scene for a new run.
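A plausible shape of that cleanup (the exact lines are in "02.mlx"):

cancel(fObjMkt);                  % stop the asynchronous worker loops
cancel(fObjDep);
delete(lisMkt);                   % detach the DataQueue listeners
delete(lisDep);
clear queMkt queDep executionDom  % wipe the state for a fresh run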

So far we have handled various aspects of data retrieval through Zero-MQ, which can be sourced from any API-enabled broker able to provide Level 2 data and integrate with ZMQ.

The output so far is a fast data structure that looks pretty much like a DOM and fits well as input for further processing, something like below.

Matlab Depth of Market

Some possible improvement areas we can already spot:

  1. Drop TCP in favor of UDP, which is lighter-weight and lower-latency on the network. Even if everything goes through the host's loopback, using UDP should increase speed.
  2. Assess Zero-MQ vs other messaging libraries or vs using a naked OSI Layer 4 Transport protocol. Perhaps Zero-MQ does not provide the best performance given the task at hand.
  3. Speed is a key factor. Even if one gets a valid edge by analyzing Level 2 data, and they overcome all major traps such as look-ahead bias, speed of execution is a technical specialization on its own.

Let us now expand on the Machine Learning section. It is not for us to expand here on what ML is, the types of Machine Learning, or what curve-fitting and Activation Functions are. These are all great topics with plenty of coverage in other articles. I feel a bit of clarification is needed, though, around some great comments on Machine Learning problem formulation and training flaws that have been expressed in other fine articles, more focused on using some of the latest deep learning advancements to predict price movements.

Excess model complexity. Complexity is common, and a fair point. Complexity is good only if the big picture is clear, there are clear demarcation lines between the various pieces of functionality, and the complexity helps break down and solve the hard pieces of the puzzle.

AI not for price prediction. AI is probably best not applied directly to price movement. It might be more fit for predicting market regime, volatility clustering or order flow imbalance.

Look-ahead bias. This is further aggravated by the first point above. Show me a complex ML model you do not fully grasp, and I can almost guarantee data leakage. It's like the infamous Matlab moving standard deviation:

M = movstd(A,k)

Apply this, and a simple training run gets your P&L to the moon. You've made it. Then, in forward testing, you get the surprise, and spend another day just to discover that k is not what you thought it was:

User guide: "Each standard deviation is calculated over a sliding window of length k across neighboring elements of A." Crucially, that window is centered on the current element, so part of it lies in the future, and tomorrow's values leak into today's feature.

You pull your hair out, cry out loud, and correct the data leakage problem by doing:

M = movstd(A,[kb 0])

According to the definition, M = movstd(A,[kb kf]) computes the standard deviation over a window of length kb+kf+1 (kb periods backward of the current position, kf forward). Setting kf = 0 keeps the window strictly backward-looking.

I feel a bug should be filed to change this function's default (and others like it), so that look-ahead-bias protection gets built into every time-series-related function. Apply the new form, and suddenly the previous stellar performance vanishes into thin air, sending us back to basics.
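To make the leak concrete, a minimal illustration:

A = randn(10, 1);
centered = movstd(A, 5);       % window spans 2 past samples, the current one, and 2 future ones
causal   = movstd(A, [4 0]);   % window spans 4 past samples and the current one
% centered(t) already "knows" A(t+1) and A(t+2): backtest gold, live fool's gold.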

Walk-Forward Analysis. This is a good validation technique for your finished theoretical model, if used wisely. It falls into a specialized topic area, along with Monte Carlo simulation and portfolio management, to be covered in a separate study. A broader talk on this topic demands a different kind of glance into the universe of backtest iterations, feature selection and principal component analysis. Here we are covered, somewhat, by the network classification accuracy, the in-sample / out-of-sample data separation and a huge data universe in which we can green-field test the model.

We now define the model input and classification criteria.

Bid side: what needs to be predicted is a downward move of the bid-ask spread, such that an immediate market sell order hitting the current bid will turn into a profitable position once the spread falls N price-ladder steps, allowing us to exit through a buy market order hitting the new ask level.

The same, mirrored, applies on the Ask side.

Our input file will consist of the state of the order flow before one of two things happens:

The “before” bid is higher than the “now” ask, within a small time-frame.

The “before” ask is lower than the “now” bid, within a small time-frame.

The time-frame for the first point above is likely to be a few seconds long: enough for a market order to get executed before the move. We will assume the "now" state is persistent enough to allow us an exit, though this could be a topic for further refinement. After all, a golden saying about Machine Learning applied to financial data is that gathering, transforming, processing and feature engineering take 80% of a development cycle. If that is not the case, you know you are doing something wrong. In code, the labeling rule might look like the sketch below.
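A minimal sketch of the labeling rule; the variable names and the +1/-1 mapping are my assumptions for illustration:

bidA = 2302.75; askA = 2303.00;    % Top of Book at snapshot A ("before")
bidB = 2303.25; askB = 2303.50;    % Top of Book at snapshot B ("now"), seconds later

if askA < bidB          % market slid up: buy at askA, sell at bidB, >= 1 tick profit
    signal = +1;
elseif bidA > askB      % market slid down: sell at bidA, buy back at askB
    signal = -1;
else
    signal = 0;         % no tradable move within the window
end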

After some rather strenuous data handling and refinement effort, which meant capturing full snapshots of the Level 2 data along the timeline and then formatting them properly, we end up with content similar to the below, saved in our case to a file for exemplification:

Input Data

These are small but numerous market opportunities, each having a just-before snapshot (the first line of a pair, or line A, with signal = +1/-1) and an after snapshot (the second line of a pair, or line B, with signal = 0) of price and Level 2 data. Snapshot line B is captured just when the price has moved enough for us to take advantage of a minimum one-tick profit, such that:

  1. When the market slides up, we want to see the "Before" (or line A) Top of Book Ask lower than the "Now" (or line B) Top of Book Bid. Meaning a "Before" long position opened at the Ask price (through a market buy order) would be in profit when closed at the "Now" Bid price (through a market sell order).
  2. Similarly, when the market slides down, we want to see the line A Bid higher than the line B Ask.

Multiple considerations must be taken into account at this point:

  1. We don't factor in slippage, commissions or liquidity. Liquidity should not be a problem for this instrument when one contract is used per side.
  2. We don't feed the model any history beyond line A. Improvements can be made by feeding full snapshots of data for, say, the 5 seconds before moment B.
  3. We make the assumption a market order sent right when event B is spotted will get filled at the spotted limit price. In other words, the Limit Top of Book levels don’t just disappear in the next second for us to get filled elsewhere.

Importing the "xlsx" input data file back into Matlab is very easy through the Import Data wizard, which recognizes our time column and takes care of all the required formatting. It even lets us save the process either as an object output or as a function for automatic subsequent imports.

Imported Data through the Import GUI Function

Having saved it as an import function, we can now write some boilerplate code for import and data transformation.

obsubsetM = importfileM('obsubset.xlsx');

Let us remove lines with y = 0, as we won’t use them for Level 2 prediction.

obsubsetM(obsubsetM(:,23)==0,:) = [];

We will also remove the Bid/Ask price columns; as countless studies have shown, predicting price directly is doomed to failure.

obsubsetM(:,[1, 12]) = [];

We end up with a clean matrix having 20 predictors and one predicted variable (typically noted with y, and placed as the last column in the matrix).

A first trial can be done with the Matlab Classification Learner app, which trains models to classify data. Using this app, you can explore supervised machine learning with various classifiers: explore your data, select features, specify validation schemes, train models, and assess the results. We first standardize the predictors to zero mean and unit variance, as per below.

predictors = obsubsetM(:,1:end-1);
mu = mean(predictors);
mu = repmat(mu, [size(predictors, 1), 1]);
sig = std(predictors);
sig = repmat(sig, [size(predictors, 1), 1]);
predictors = (predictors - mu) ./ sig;
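As a side note, on R2016b and later, implicit expansion makes the repmat calls unnecessary, and the block collapses to:

predictors = obsubsetM(:,1:end-1);
predictors = (predictors - mean(predictors)) ./ std(predictors);   % implicit expansion

or simply normalize(predictors), which z-scores column-wise by default.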

We start a session, provide the matrix having the predicted value as its last column, and everything else is taken care of automatically. Some 25 classifiers are thrown at the data once the Parallel Pool is up and running, and results start arriving almost immediately, as per below. Before clicking the Run button, we enable PCA and choose "All" models to run.

Classification Learner for straight Level 2 Bid/Ask Data

There is little encouragement to follow this path. We can see many models hover around the 50% accuracy level, which is no better than a coin flip.

Let us instead construct the sums of the bid and ask open orders; combined with the predicted column (our long/short signal), we can check whether the total volume on each side can predict the immediate move direction.

predictors = obsubsetM(:,1:end-1);
obsubs = [sum(predictors(:,1:end/2), 2), sum(predictors(:,end/2+1:end), 2), obsubsetM(:, end)];

The new matrix looks similar to the below

Bid/Ask Total Volume on each Side

Starting a new session in the Classification Learner and launching the training process provides a new set of results.

Classification Learner for Total Volume Level 2 Bid/Ask Data

One of the ensemble models shows us a clear strength balance: when the bid side is bigger in aggregate volume, the ask side tends to match it. The dots and crosses do show, however, that imbalance between the two won't have much predictive power either. A confusion matrix, as presented below, tells us we would have been wrong 1210 times on the short side and 603 times on the long side had we followed the model, or roughly 49% of the total executions.

Confusion Matrix

Not too encouraging, but let us quickly give it a go with an LSTM network. In the code below we create some of the parameters required for the training process. Features, response, hidden units, layers and options are all required, and expressed in the "03.mlx" file.
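The exact values live in "03.mlx"; the layer sizes and options below are illustrative assumptions for a sequence classifier over the two aggregate-volume features:

numFeatures = 2;                       % total bid volume, total ask volume
numHiddenUnits = 50;
numClasses = 2;                        % long / short

layers = [ ...
    sequenceInputLayer(numFeatures)
    lstmLayer(numHiddenUnits, 'OutputMode', 'last')
    fullyConnectedLayer(numClasses)
    softmaxLayer
    classificationLayer];

options = trainingOptions('adam', ...
    'MaxEpochs', 100, ...
    'MiniBatchSize', 64, ...
    'Plots', 'training-progress', ...
    'Verbose', false);

% net = trainNetwork(XTrain, YTrain, layers, options);   % XTrain: cell array of sequences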

The learning process flattens out again with little to no hope for improvement.

Experimenting with the option values and network layers is a perfect fit for the Matlab Experiment Manager app, which enables you to create a deep learning experiment to train networks under various initial conditions and compare the results.

Observations.

  1. It is feasible to bring Level 2 data into a specialized computational environment such as Matlab. This offers countless advantages over the traditional DOM applications the retail space offers.
  2. Even when using a simple matrix to receive and hold the data, retail tools are still guaranteed to experience some delays in processing Level 2 updates during peak hours.
  3. A naive look at the absolute values of the Order Book Level 2 data will not, by itself, provide any predictive power. Some further exploratory paths to follow: take into account the relative movement of Level 2 orders rather than their absolute values; and, considering that important players "work" both sides of the book, bake the correlation of "considerable" movement between the Bid and Ask sides into the algorithm design. It is not always the case that the most "manipulative" order book player wins; a system tuned to successfully spot such behavior will sometimes lose. Be prepared for it.
  4. A bigger picture should be pursued. Monitoring the overall market condition is as important as the microscopic level. In the effort to front-run the "about to happen" events, we see what might be deemed "erratic behavior".
  5. The Order Flow presents considerable challenges for the retail space, considering the advantages centralized exchanges offer to institutional investors through preferential queuing and MBOs.
