Time Series Modeling and Deep Reinforcement Learning (DRL) on High-Frequency Crypto Exchange Data
This article was produced as part of the final project for Harvard’s AC215 Fall 2021 course.
Background
Time series analysis in finance is one of the most difficult data science problems around: the datastream is effectively infinite, and so is the state space (and, for DRL, the action space as well if we want it to be continuous). This makes it an interesting but difficult task to tackle.
The cryptocurrency space, being a new financial sector, is highly volatile and is booming in trade volume and frequency. Capital has been rushing in at an ever higher pace in recent years, which makes it an interesting environment not only for training predictive models, but also for deploying DRL agents to navigate. Our implementation strikes a balance between experimentation and deployment. The overall goal is not only to test state-of-the-art models on raw exchange data, but also to deploy them in a production-like setting where computation can be scaled and parallelized.
The Data
The dataset is composed of high-frequency depth and tick data (millisecond intervals) collected from the 7 largest exchanges (by trading volume) around the globe. It acts as a log that tracks 33 crypto-assets (Bitcoin, Ethereum, Cardano, etc.) through 127 different instruments (FTX:BTC/USD.SWAP, BINANCE:ETH/USD.SPOT, etc.). An example of the log is given below:
The depth data comprises bid and ask prices and sizes representing the depth of the limit order book for an instrument ID at the timestamp of each occurrence. These are given in the features ask_price1…5, bid_price1…5, ask_size1…5, and bid_size1…5. The tick data records the actual trades that took place in the market for the given instrument ID: the price and size of each trade that occurred and the side that initiated it. These are given in the features size, price, and side.
The logs are temporally asynchronous in nature, which calls for substantial data transformation and preprocessing. Not only do we need to rescale and reduce the dimensionality of the data, we also need to convert it into time series with a fixed, evenly spaced time interval. The interval cannot be so fine-grained that the resulting dataset becomes too large to process, yet it must still capture the minute volatility and trends of all asset classes. The dataset also needs an extra dimension to group the logs by instrument ID.
Hence the pipeline applies a series of functions (groupby, resample, etc.) using Dask and Dask-ML to convert the log into a rescaled three-dimensional dataset of shape (instrument ID, time series, features), while remaining parallelizable at every step.
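As a rough illustration of this step, the sketch below converts an asynchronous event log onto a fixed 100 ms grid per instrument and stacks the groups into a three-dimensional array. The column names, bucket path, and forward-fill choice are our assumptions for illustration, not the project's exact schema:

```python
import dask.dataframe as dd
import numpy as np

log = dd.read_parquet("gs://raw-exchange-data/log")  # hypothetical location

def to_fixed_grid(df):
    """Resample one instrument's asynchronous events onto a 100 ms grid."""
    df = df.set_index("timestamp").sort_index()
    # carry the latest known book state forward into each 100 ms step
    return df.resample("100ms").last().ffill()

# group by instrument, regularize each group, then stack into an array of
# shape (instrument ID, time series, features); computed eagerly here for
# brevity, whereas the real pipeline keeps every step lazy and parallel
groups = {inst: to_fixed_grid(g) for inst, g in log.compute().groupby("instrument_id")}
data = np.stack([g.to_numpy() for g in groups.values()])  # assumes aligned time ranges
```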
Exploratory Data Analysis
Stationarity
One of the key components of the EDA is to explore the stationarity of our time series data. Informally, a series is stationary when its auto-covariance is independent of time. Failure to establish stationarity will almost certainly lead to misinterpretation of model identification and diagnostic tests. Moreover, stationarity is decisive in characterizing the prediction problem and in deciding whether a more advanced architecture is warranted. A non-stationary time series is one whose properties do depend on the time at which the series is observed; such series are often referred to as random walks and are not ideal for prediction. Time series with trends or seasonality are therefore not stationary, since the trend and seasonality affect the value of the series at different times. More precisely, if {y_t} is a stationary time series, then for every s the joint distribution of (y_t, …, y_{t+s}) does not depend on t.
Below is an example of a non-stationary time series, consistent with what we have observed in our dataset. We also performed ADF (adfuller) and KPSS tests on the dataset to check for stationarity. These are statistical hypothesis tests designed to determine whether differencing is required, and we ran them both before and after detrending. Prior to detrending, the KPSS test rejects its null hypothesis that the series is stationary.
Hence, detrending is needed for the entire snippet of our time series, and we use first-order differencing, computing the differences between consecutive observations. This helps stabilise the mean of the time series by removing changes in its level, thereby reducing trend and seasonality. The effect is illustrated by the graph below for a given instrument.
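A minimal sketch of these checks with statsmodels, assuming a one-dimensional price series for a single instrument (the random-walk stand-in below is only for illustration):

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller, kpss

def check_stationarity(series):
    """Run both tests: ADF's null hypothesis is a unit root (non-stationary),
    while KPSS's null is stationarity, so the two complement each other."""
    adf_p = adfuller(series)[1]
    kpss_p = kpss(series, regression="c", nlags="auto")[1]
    print(f"ADF p-value:  {adf_p:.4f} (small => evidence of stationarity)")
    print(f"KPSS p-value: {kpss_p:.4f} (small => evidence of non-stationarity)")

price = np.cumsum(np.random.randn(5000))  # random-walk stand-in for one instrument
check_stationarity(price)           # raw level: typically non-stationary
check_stationarity(np.diff(price))  # first-order differences: typically stationary
```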
Dimension Reduction
To examine which features are redundant and which are worth keeping for model training, we used autocorrelation analysis. For example, to see whether ask_price2 adds information beyond ask_price1, we can look at the autocorrelation of ask_price2 - ask_price1 (shown in the plot below). The lag time step in the plot is 100 ms. The autocorrelation becomes insignificant after around 100 time steps, i.e., 10 s, meaning the value of ask_price2 - ask_price1 is independent of its value 10 s earlier. Because we want to use more than 10 s of data to train our predictive model, ask_price2 - ask_price1 does provide additional information at time scales longer than 10 s. It is therefore worthwhile to keep all of ask_price1…5 and bid_price1…5.
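As a sketch, this check can be reproduced with statsmodels; `df` here is assumed to hold one instrument's 100 ms series with the quote columns named as in the text:

```python
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

# autocorrelation of the level-2/level-1 spread; with 100 ms steps,
# 100 lags correspond to 10 s
spread = (df["ask_price2"] - df["ask_price1"]).dropna()
plot_acf(spread, lags=200)
plt.title("Autocorrelation of ask_price2 - ask_price1")
plt.show()
```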
Choosing the Optimal Time Interval
There are trade-offs to consider when picking a time step size for converting the raw data into time series. A large step size leads to heavy information loss after the conversion, while a small step size makes the time-series data large in memory and demands more computational power to process and model. In this section, we explore the time step size that best balances these two trade-offs.
To measure the information loss from the conversion, we take the difference between the true price and the best price estimate recoverable from the time-series data. We average these price differences across 10,000 randomly sampled timestamps and use this as our information-loss metric. For each candidate step size, the time series is generated by linear interpolation between the two nearest raw data points on either side. When calculating the best estimated price, we assume only the time-series data is known and estimate the price by linear interpolation between the two nearest time-series points on either side.
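A sketch of this metric, assuming `raw_t` and `raw_price` hold the raw timestamps (in seconds) and prices of one instrument:

```python
import numpy as np

def information_loss(raw_t, raw_price, step, n_samples=10_000):
    """Mean absolute percentage error between true prices and the best
    estimates recoverable from a fixed-step grid."""
    # build the time-series data by linear interpolation onto the grid
    grid_t = np.arange(raw_t[0], raw_t[-1], step)
    grid_price = np.interp(grid_t, raw_t, raw_price)
    # sample random timestamps and compare true vs. grid-reconstructed prices
    sample_t = np.random.uniform(raw_t[0], raw_t[-1], n_samples)
    true = np.interp(sample_t, raw_t, raw_price)
    est = np.interp(sample_t, grid_t, grid_price)
    return np.mean(np.abs((true - est) / true))

# compare candidate step sizes in seconds, e.g.:
# for step in (0.1, 0.3, 1.0):
#     print(step, information_loss(raw_t, raw_price, step))
```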
The plot below shows the mean absolute percentage error between the true price and the best estimated price for different step sizes. Information loss increases as the step size grows, but there is a significant jump from 0.1 s to 0.3 s and another from 0.3 s to 1 s. This led us to select 0.1 s as our time step size, just before the information loss starts to climb steeply.
Baseline Model
For the baseline model, we chose the simplest architecture, a feedforward neural network (FFNN). An FFNN contains no loops, in contrast to recurrent network structures. Below we show 3,000 time steps of price changes predicted by the FFNN (red), compared with the true values (blue). The FFNN essentially forecasts a constant price (a predicted price change of zero throughout), while the true price does vary occasionally.
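A minimal PyTorch sketch of such a baseline; the layer sizes and the input window d are our assumptions, not the exact configuration used:

```python
import torch
import torch.nn as nn

class FFNN(nn.Module):
    """Flatten the last d time steps of features and regress the next price change."""
    def __init__(self, n_features, d, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                       # (batch, d, n_features) -> (batch, d*n_features)
            nn.Linear(d * n_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),               # predicted price change at t+1
        )

    def forward(self, x):
        return self.net(x)

model = FFNN(n_features=20, d=50)
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```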
Prediction Model
Several prediction models were tested for their merits based on predictive accuracy during the initial exploratory phase. Below we go in depth into what they are and the results we obtained with them.
Echo State Network
The Echo State Network (ESN) belongs to the Recurrent Neural Network (RNN) family, from which it inherits its architecture and supervised learning principles. Unlike feedforward neural networks, recurrent neural networks are dynamic systems rather than functions. The characteristics of an ESN are described quite clearly in a Medium post:
- the weights between the input and the hidden layer (the ‘reservoir’), W_in, as well as the weights of the reservoir itself, W_r, are randomly assigned and not trainable
- the weights of the output neurons (the ‘readout’ layer) are trainable and can be learned so that the network reproduces specific temporal patterns
- the hidden layer (the ‘reservoir’) is very sparsely connected (typically < 10% connectivity)
- the reservoir architecture creates a recurrent non-linear embedding (H in the image below) of the input, which can then be connected to the desired output; these final weights are trainable
- it is possible to connect the embedding to a different predictive model (a trainable NN, or a ridge regressor/SVM for classification problems)
This allows the network to train and re-train very quickly, which is ideal for minimizing the prediction error caused by the delay between when the model starts training and when the trained model is actually used for prediction. Below is the result of using the ESN to predict 1,000 time steps of Bitcoin prices.
As you can see, the results are quite noisy; we suspect this is due to the randomness of the weights in the initialized reservoir and the limited patterns a shallow FFNN can capture. We used RcTorch for hyperparameter search. The tool leverages Bayesian optimization to reduce the optimization cost of ESNs and converges to hyperparameters that perform well not just on individual time series but on groups of similar time series, without sacrificing predictive performance significantly.
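To make the list above concrete, here is a minimal NumPy sketch of an ESN: a fixed, sparse random reservoir with a ridge-regression readout. All hyperparameters are illustrative, `prices` stands in for one instrument's series, and the project itself used RcTorch rather than this hand-rolled version:

```python
import numpy as np

rng = np.random.default_rng(0)
n_res, sparsity, rho, ridge = 500, 0.1, 0.9, 1e-6

W_in = rng.uniform(-0.5, 0.5, (n_res, 1))            # fixed, untrained input weights
W_r = rng.uniform(-0.5, 0.5, (n_res, n_res))         # fixed, untrained reservoir weights
W_r[rng.random((n_res, n_res)) > sparsity] = 0.0     # ~10% connectivity
W_r *= rho / np.max(np.abs(np.linalg.eigvals(W_r)))  # scale spectral radius below 1

def run_reservoir(u):
    """Drive the reservoir with input sequence u and collect hidden states H."""
    h, H = np.zeros(n_res), []
    for u_t in u:
        h = np.tanh(W_in[:, 0] * u_t + W_r @ h)
        H.append(h.copy())
    return np.array(H)

# the readout is the only trained part: ridge regression from states to next price
H = run_reservoir(prices[:-1])
W_out = np.linalg.solve(H.T @ H + ridge * np.eye(n_res), H.T @ prices[1:])
preds = H @ W_out
```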
WaveNet
Another state-of-the-art sequence model is WaveNet, first proposed by DeepMind as a deep generative model of raw audio waveforms. Only the encoder part of WaveNet is needed here, as we do not need to generate any additional data. The WaveNet model can be abstracted to any time series forecasting problem, providing a nice structure for capturing long-term dependencies without an excessive number of learned weights.
The core building block of the WaveNet model is the dilated causal convolution layer. This convolution respects temporal ordering and lets the receptive field of the outputs grow exponentially with the number of layers. This structure is nicely visualized by the diagram below from the WaveNet paper.
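A sketch of the idea in PyTorch; the kernel size, channel counts, and depth are illustrative choices, and the gated activations of the full WaveNet are omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """Dilated convolution that pads only on the left, so no output sees the future."""
    def __init__(self, in_ch, out_ch, kernel_size=2, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):              # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.pad, 0)))

# doubling the dilation each layer grows the receptive field exponentially:
# with kernel_size=2 and dilations 1, 2, ..., 32, six layers see 64 time steps
stack = nn.Sequential(*[
    CausalConv1d(32 if i else 20, 32, dilation=2 ** i) for i in range(6)
])
out = stack(torch.randn(1, 20, 5000))  # (batch=1, 20 features, 5000 time steps)
```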
Additionally, the model utilizes other key techniques such as gated activations, residual connections, and skip connections. Below are the results of using the WaveNet model to predict the same 1,000 time steps as the ESN. Although WaveNet produces some spontaneous outlier predictions, there seems to be observable predictability in its forecasts. Even though the MSE of the ESN is lower, we decided to use WaveNet as our prediction model of choice.
DRL Network
Now that we have a predictive model to forecast future price features of all instruments, we can define a DRL environment for a trading agent with the goal of maximizing profits in the long run. For the sake of simplicity, in this DRL environment, the agent is trading only one instrument (i.e. Bitcoin) while using available price information of all instruments. We set up the DRL environment as follows:
State:
Includes three attributes:
- Price features of all instruments at times {t, t-1, …, t-d+1}
- Current share of bitcoin
- Current balance (cash)
Action:
Represents the number of Bitcoin shares in {-k, …, -1, 0, 1, …, k} that we buy (positive), sell (negative), or hold (zero) at time t
Transition Function:
The transition function maps the current state and action to a new state:
- We use our predictive model to forecast the price features of all instruments at time t+1
- We then use these forecasted features to create a new state containing the three attributes above
- We adjust the current share of Bitcoin and the current balance based on the action and the bid/ask prices
Reward Function:
- Change in net worth from time t to time t+1
- Net worth = market value of Bitcoin + current balance
- Market value of Bitcoin = (current bid price + current ask price)/2 * current share of Bitcoin
In this environment, we have well-defined state and action spaces as well as well-defined transition and reward functions. Note that the transition and reward functions depend on the predictive model, which we can choose from the previous section.
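A gym-style sketch of this environment follows. The starting balance, penalty value, and the assumption that columns 0 and 1 hold the best bid and ask are all illustrative; for brevity it also steps on realized prices, where the real environment would use the predictive model's forecasts for t+1:

```python
import numpy as np

class TradingEnv:
    FEE = 1e-4  # the 0.01% trading fee

    def __init__(self, prices, d=50, k=5):
        self.prices, self.d, self.k = prices, d, k  # prices: (time, features)
        self.reset()

    def reset(self):
        self.t, self.shares, self.balance = self.d, 0.0, 10_000.0
        return self._state()

    def _state(self):
        window = self.prices[self.t - self.d : self.t]  # features at t-d+1 ... t
        return np.concatenate([window.ravel(), [self.shares, self.balance]])

    def _net_worth(self, t):
        mid = (self.prices[t, 0] + self.prices[t, 1]) / 2  # (bid + ask) / 2
        return self.balance + self.shares * mid

    def step(self, action):  # action in {-k, ..., k}
        worth_before = self._net_worth(self.t)
        bid, ask = self.prices[self.t, 0], self.prices[self.t, 1]
        price = ask if action > 0 else bid  # market order: buy at ask, sell at bid
        cost = action * price
        self.balance -= cost + abs(cost) * self.FEE
        self.shares += action
        self.t += 1
        reward = self._net_worth(self.t) - worth_before  # change in net worth
        if self.balance < 0 or self.shares < 0:          # penalize infeasible positions
            reward -= 100.0
        done = self.t >= len(self.prices) - 1
        return self._state(), reward, done
```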
Since the state space is continuous and high-dimensional, it is impossible to maintain a Q-table and update it with the Bellman equation. Instead, we use Deep Q-learning, which replaces the Q-table with a neural network: rather than mapping a state-action pair to a Q-value, the network maps an input state to a Q-value for each action. For more details on Deep Q-learning, please refer to the two following articles.
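For instance, a Q-network for this setup could look like the sketch below, mapping a state vector to one Q-value per discrete action in {-k, …, k}; the layer sizes and the epsilon-greedy policy are illustrative:

```python
import torch
import torch.nn as nn

k, state_dim = 5, 1002  # e.g. 50 steps x 20 features + shares + balance
q_net = nn.Sequential(
    nn.Linear(state_dim, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 2 * k + 1),  # one Q-value per action in {-k, ..., k}
)

def act(state, eps=0.1):
    """Epsilon-greedy action selection over the discrete action set."""
    if torch.rand(1).item() < eps:
        return torch.randint(0, 2 * k + 1, (1,)).item() - k
    with torch.no_grad():
        q = q_net(torch.as_tensor(state, dtype=torch.float32))
    return int(q.argmax()) - k
```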
Once we have a trained Deep Q model, we use it to find the optimal action for the state the agent is currently in. This action determines whether the agent buys, sells or holds x Bitcoin shares at that time step. To make the setting realistic, we penalize the agent for taking an action that leads to a negative balance or a negative number of shares, and we charge a trading fee of 0.01% every time the agent buys or sells shares. We also assume the agent wants its orders executed immediately, so it uses market orders: buying at the lowest ask price and selling at the highest bid price.
The Pipeline and Deployment
Our pipeline revolves around four key modules: data preprocessing, prediction model training, the DRL network, and the frontend visualization, as illustrated below:
Each module is deployed in a separate Kubernetes cluster or compute instance, as different modules have different computational needs. The DRL network is CPU-intensive, the preprocessing module leverages CPU parallelization using Dask, and prediction model training revolves around distributed training on GPUs. All deployments are automated with Ansible on the Google Cloud Platform. The entire workflow is thus built with a production setting in mind, focusing on scaling, speed, and ease of deployment. Given the data science and deployment focus, and the limited timespan and scope of the course, we concentrated on the machine learning components and their implementation at scale. We have not built a real-time server that keeps a log of live streams directly from the exchanges' APIs, as that could easily be a substantial effort of its own.
Data-Preprocessing Pipeline
We have limited the size of the dataset to roughly 12 GB in total, with the depth data being 10.1 GB and the tick data 1.6 GB. These raw exchange data are stored in a Google Cloud Storage bucket and downloaded by the data-preprocessing module into Dask DataFrames. The DataFrames are rescaled using the Dask-ML MinMaxScaler (we did not use the robust scaler, as the dataset does not have frequent outliers). They then go through a series of groupby and resample operations and are combined into an array of shape (instrument ID, time series, features) according to their timestamps, with a fixed interval of 100 ms per time step. The transformed array is 8.6 GB in size; it is saved in HDF5 format and uploaded to another Google Cloud Storage bucket to serve the other modules in the pipeline.
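A sketch of the scaling and export steps; the bucket paths, `feature_cols`, and the `build_3d_array` helper are hypothetical placeholders standing in for the groupby/resample logic sketched earlier:

```python
import dask.dataframe as dd
import h5py
from dask_ml.preprocessing import MinMaxScaler

depth = dd.read_parquet("gs://raw-exchange-data/depth")  # hypothetical bucket layout
scaler = MinMaxScaler()
scaled = scaler.fit_transform(depth[feature_cols])       # lazily rescaled to [0, 1]

# combine into (instrument ID, time series, features) -- see the earlier sketch
array = build_3d_array(scaled)

# save as HDF5 and upload the file to the serving bucket
with h5py.File("transformed.h5", "w") as f:
    f.create_dataset("data", data=array, compression="gzip")
```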
Prediction Model Training
The transformed data array is first downloaded from the bucket into a Dask array. The array is detrended by taking the first-order difference and fed into the WaveNet model for training. Upon completion, the weights are saved and uploaded to a Google Cloud Storage bucket for serving. The predictions are future first-order differences meant to be added onto the most recent time step of the transformed log. The model is currently trained on the past 5,000 time steps due to GPU memory constraints. This is a key area to tackle in the future, as the number of lookback time steps is limited by the GPU memory available. Model parallelism of the WaveNet architecture will need to be implemented to raise the number of parameters and distribute them across several GPUs, increasing the number of lookbacks and, in theory, the predictive power of the model.
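Since the model outputs first-order differences, recovering price levels means cumulatively adding the predicted differences onto the last observed value. A sketch, where `prices` and the `wavenet.predict` call are stand-ins rather than the project's actual API:

```python
import numpy as np

def reconstruct(last_observed, predicted_diffs):
    """Invert first-order differencing: level = last value + cumulative sum of diffs."""
    return last_observed + np.cumsum(predicted_diffs)

history = prices[-5000:]  # the 5,000-step lookback window
x = np.diff(history)      # first-order differences the model trains on
# pred_diffs = wavenet.predict(x)                      # hypothetical model call
# future_prices = reconstruct(history[-1], pred_diffs)
```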
Conclusion
There is nothing conclusive about the predictive power of our model, given the limited computational power we could leverage. We do, however, see a decrease in predictability as time moves forward when the model is not updated in a timely manner. As for the optimal frequency of model updates, the answer is clearly as soon as possible and as fast as the pipeline allows. In light of this need for prompt updates and more GPU memory for longer lookbacks (as mentioned above), model parallelism is the most pressing piece of future work. The ability to split the prediction model by layer and distribute the layers across multiple GPUs would not only decrease the number of parameters each GPU trains, but also increase the overall capacity of the model once scaled.