Using Deep Learning To Predict Crypto Markets — Here We Go Again.

Ipopovca
Coinmonks
25 min read · Apr 29, 2022


I can already see your eyes rolling — yet another article about predicting financial markets!

There is indeed an explosion of ideas to use machine learning to predict stock and crypto markets. Ever since the pandemic started, many people, myself included, decided to roll the dice and try to predict the future. After all, the stock market is numbers, and data science can predict numbers, right?

Credit: XKCD

However, most projects I found eventually declare that this goal isn’t attainable, either because stocks behave like a random walk or because the selected machine learning tools aren’t suited to the task at hand. I find that hard to believe, considering that many hedge funds and financial firms use data science to balance their portfolios, and that there are people who trade successfully on technical analysis (although that percentage of people is admittedly small).

I think the primary reason for such conclusions is a lack of domain knowledge in the financial sector: standard data-science approaches that might work in physics will most likely not work in finance. This requires rethinking the tools and approaches we use when undertaking this task.

In this article, I’ll list the common pitfalls that I found in other projects, and then build my own Python project using TensorFlow to demonstrate that it is possible to create a model that captures a signal upon which we can build a trading system. My goal is to show that it is viable to extract a measure of information from the available data using deep learning; however, we won’t be deriving signals or acting on this information just yet.

There is a lot of work that must be done beyond the steps shown here to have any hope of a profitable trading system, so I’ll be pointing out areas of improvement for future iterations.

Part 1: Common pitfalls

The first thing to discuss, as with all data science, is how we use our data. All crypto and stock market APIs will provide Open, High, Low, Close, and Volume information about a specific time period of an asset. If you have ever seen a stock market graph, those can be interpreted as candles — a tool to visualize how the given asset moves through time in each period. I’ll be referring to this data as OHLCV.

Example of OHLCV candles visualized.

This brings us to pitfall #1 — For the simplest type of analysis, all of this beautiful information gets discarded, keeping just the Close or Open price.

This practically always leads to a prediction that trails the real price one step behind, because the model has very little data to work with. In addition, the most commonly used loss function, MSE, isn’t going to do us any favors. So while visually it looks like we predicted the market — almost perfectly! — in actuality what we have done is generate a model that can be described by a single line of code:
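
In pandas terms, the effective behavior boils down to something like this (a sketch, assuming the close price sits in a column named close):

```python
# the "prediction" is just the previous close carried forward one step
predictions = df["close"].shift(1)
```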

Looks perfect, but is entirely useless

I’ve seen those kinds of graphs in the majority of articles, YouTube videos, and even scientific papers on using DL to predict future prices. Needless to say, there are no useful predictions to be had there.

So we have to use all five features — OHLCV — to have any chance of a successful prediction. This makes sense: any experienced trader will affirm how important volume and high/low points are to establishing a trend.

This leads to pitfall #2 — Failure to augment our data with extra features.

Now I know that DL is, in theory, supposed to be this magic bullet that can figure out patterns and trends by itself — but in practice, we often need to help it with extra information to establish a good, working model.

The most obvious choice for this kind of augmentation in finance is technical analysis. Technical analysis is a set of tools and techniques that help traders predict future price patterns and direction. Such tools include moving averages, Bollinger Bands, the Relative Strength Index, and many others.

Now, some might say that technical analysis is just “astrology for traders” — but we have to make some assumptions to proceed. The assumption here is that technical analysis IS useful to some degree in determining where the price will move.

Other information we can include is real-world data: bond yields, the VIX (the market’s “fear gauge”), commodity prices, news articles, Twitter sentiment, and so on. For my project, I’ll include only technical analysis — at least for now.

I think I see a bull flag pennant archway triangle dodecahedron forming.

There is also the option of using the dark magic of linear algebra to create new features. Libraries such as tsfel (https://pypi.org/project/tsfel/) have modules that can generate hundreds of new time-series features; this is something worth exploring in the future.
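
As a rough illustration of what that could look like (a sketch using tsfel’s standard extractor; the sampling rate and the choice of the close column are my assumptions):

```python
import tsfel

# configuration covering tsfel's statistical, temporal, and spectral feature domains
cfg = tsfel.get_features_by_domain()

# extract features from the close-price series (fs=1 sample per hour in our case)
extra_features = tsfel.time_series_features_extractor(cfg, df["close"], fs=1)
```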

This leads to pitfall #3 — The wrong objective.

We need to determine what objective we are trying to accomplish. A day or swing trader would benefit from predicting the OHLCV candles themselves, in order to know when to enter or exit a position, while a portfolio manager will be more interested in total returns over some period of time. I will focus on the second approach, although predicting the candles themselves can give some insight into how the model is doing, and potentially even help it converge better.

Quick aside: ideally, we should prepare the data above for multiple tickers (forming the so-called investment universe). This way we can sort the predictions by quantile and figure out which asset we want to invest in at any given time.

Pitfall #4 — Not enough data.

Unfortunately, without spending a lot of money, we are often limited to daily data for the stock market. Libraries such as yfinance (https://pypi.org/project/yfinance) are hugely useful for toy examples, but daily data, even over a span of 30 years, gives us no more than about 8,000 points for any given stock.

I will try to remedy this by working with crypto markets, where it is easy to obtain minute data for free. We will settle for hourly data, as that gives us a good balance between the number of points and computational cost.

Pitfall #5 — Using the wrong loss function.

This is a big one. Since we are now predicting returns, which can be both positive and negative, most standard loss functions such as MSE just will not do. The reason is simple: default loss functions don’t care about the sign of a particular value; as long as two points are “close” enough, the function is optimized.

I’ll explore various solutions for that in the next chapters — including writing our own loss functions and finding better-suited ones for our goal.

Pitfall #6 — Lack of proper analytics for the generated predictions.

Unfortunately, most standard data-science validation methods don’t work well enough with financial data.

What is needed is a way to backtest the trading decisions, to see how they would have performed in the past — which is a very complicated task! Another, simpler way is to use statistical tools to see if our signal contains any useful information. I will use the Spearman rank correlation to compute an information coefficient in my project.

We will explore both ways in the next chapters — for backtesting, we will use a very simple vectorized backtest — which, while not ideal, should at least show us roughly how our models perform.

Overall, those are the main pitfalls I found that most projects fall into when trying to achieve financial predictions. There are probably many more, and my own project will probably have plenty of its own mistakes — but no one said the task at hand is simple. After all, if it were, everyone would be a stock-market billionaire.

Part 2 — The data pipeline

Disclaimer: This project will take some shortcuts; optimizing its usage for real portfolio management will require a lot more work. The code here is trimmed to show the logic of the process, as the full project relies on many internal variables that won’t work in a standalone script.

Data is the most important part of data science (Roll credits!). So let’s acquire some.

I’ll be using a 1-minute dataset of crypto tickers from Kaggle —

This data will be used for training and validation. For testing, I’ll be using the latest 1000 points from Cryptowatch API —

This way we can easily grab the newest data for predictions while utilizing a giant array of crypto prices from Kaggle for training. Can you spot a potential problem that might arise with technical analysis here?

First things first, I’ll be resampling Kaggle data from 1-minute to 1-hour intervals. I currently don’t have a CUDA-compatible GPU, so I think this is a good tradeoff between the number of points and computational time.

Here is the code:
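
Here is a minimal sketch of that step (folder names and column names are assumptions from my setup):

```python
import glob
import pandas as pd

# aggregation rules for downsampling OHLCV candles
OHLCV_AGG = {
    "open": "first",
    "high": "max",
    "low": "min",
    "close": "last",
    "volume": "sum",
}

for path in glob.glob("1m/*.csv"):
    df = pd.read_csv(path, parse_dates=["time"], index_col="time")
    hourly = df.resample("1H").agg(OHLCV_AGG).dropna()
    hourly.to_csv(path.replace("1m", "1h"))
```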

The general gist is that we put all the 1-minute CSV files into a folder called 1m, then use glob to get a list of all the files, and then use the pandas resample method to resample our candles.

It is important to do this correctly: we must take the first value for open, the highest point for high, the lowest point for low, the last value for close, and the sum of the volume for our candles to be resampled correctly.

For the Cryptowatch data, we can just grab 1-hour candles via their API:
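
Something along these lines should do it (a sketch; the endpoint path and the layout of the JSON response are my assumptions about the Cryptowatch REST API):

```python
import pandas as pd
import requests

def get_candles(pair="btcusdt", exchange="binance", periods=3600):
    # Cryptowatch OHLC endpoint; `periods` is the candle length in seconds
    url = f"https://api.cryptowat.ch/markets/{exchange}/{pair}/ohlc"
    resp = requests.get(url, params={"periods": periods})
    resp.raise_for_status()
    candles = resp.json()["result"][str(periods)]
    cols = ["time", "open", "high", "low", "close", "volume", "quote_volume"]
    df = pd.DataFrame(candles, columns=cols)
    df["time"] = pd.to_datetime(df["time"], unit="s")
    return df.drop(columns="quote_volume")
```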

This code allows us to grab any time interval and asset pair from Binance.

We grab our data, rename the columns (for consistency), and convert the time column to DateTime so it is consistent with the Kaggle data we got above.

As mentioned in Part 1, ideally we should use multiple tickers to create an investment universe. This is something I plan to implement in the future; for now, let’s focus on the BTC-USD pair alone.

Another very important addition would be the order book. This is slightly harder to do, as the Kaggle dataset doesn’t include one, but other APIs such as Coinapi can provide that data. Unfortunately it’s a bit costly, so I’ll omit the order book for now — but I believe it would make a big difference to the results if included.

The next step is to add technical analysis. I’ll be using a Python library named ta (https://pypi.org/project/ta/) — it’s not the most comprehensive one, but it does the job in just one line:
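
That one line, plus the imports around it, looks roughly like this (column names assumed to match the resampled data above):

```python
import ta

# append every indicator ta knows about as new feature columns
df = ta.add_all_ta_features(
    df, open="open", high="high", low="low",
    close="close", volume="volume", fillna=True,
)
```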

We will add every single indicator in the library. Ideally, we should study the correlation between the indicators and our objective, but for now this is acceptable. Other TA libraries also let us add candle patterns — this would require feeding one-hot encoded vectors to our networks, which might yield additional useful features. This is definitely something to experiment with in the future.

Additionally, we should also use other time intervals for our technical analysis. Rather than using just 1-hour data, adding 30-minute, 4-hour, and daily data should improve the result; after all, technical-analysis traders rarely rely on a single time interval. This is something I’m planning to add in further iterations of this project.

In theory, either here or before applying TA we should apply some sort of de-noising, such as a wavelet or Fourier transform. Due to the scope of the project, I decided to forgo this step, but it is definitely something that might help extract additional features, or simply ensure that our data contains less noise.

Our next step is arguably the most important — creating lagged returns. I’ll be using lags of 1, 4, 12, 24, and 48 hours for these purposes.

First, we create a percent change of returns for every lag. We will use those as features, and also as targets by shifting them back by the respective lag. This ensures we don’t have any data leaks, and that we can use returns of smaller intervals for the predictions of larger ones.
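
A sketch of that step (the column naming is mine; the key detail is the negative shift that turns a past return into a future target):

```python
LAGS = [1, 4, 12, 24, 48]

for lag in LAGS:
    # past percent return over `lag` hours, used as a feature
    df[f"return_{lag}h"] = df["close"].pct_change(lag)

for lag in LAGS:
    # the same return shifted back by `lag`, i.e. the future return, used as a target
    df[f"target_{lag}h"] = df[f"return_{lag}h"].shift(-lag)

df = df.dropna()
```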

The next step is a simple data split between train, validation, and test set.
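
For example, with scikit-learn’s splitter (the 70/15/15 proportions are just an illustration):

```python
from sklearn.model_selection import train_test_split

# shuffle=False keeps the chronological order intact
train, rest = train_test_split(df, test_size=0.3, shuffle=False)
val, test = train_test_split(rest, test_size=0.5, shuffle=False)
```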

Not a lot to talk about here — just make sure that shuffle is set to False, since we are working with time series.

After this, we want to split our data into features and targets, nothing too special here as well:
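
Something like the following, assuming the five target columns sit at the end of the dataframe (which is also why the 5 ends up hard-coded):

```python
# the last 5 columns are the shifted targets, one per lag;
# everything else (with the targets dropped) becomes the features
y_train, x_train = train.iloc[:, -5:].values, train.iloc[:, :-5].values
y_val, x_val = val.iloc[:, -5:].values, val.iloc[:, :-5].values
y_test, x_test = test.iloc[:, -5:].values, test.iloc[:, :-5].values
```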

One thing that can be improved is that right now the number of targets is hard-coded to 5 (since I’m using 5 lags). In the future, this should be turned into a variable, to account for different numbers of targets.

Don’t forget to drop the targets from the features, lest you end up with a giant data leak.

Next step: removing the mean and scaling our features to unit variance. This will help our network learn, and it is useful for the step after this one, PCA feature reduction.

We will achieve this using StandardScaler from SciKit Learn library:
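
A minimal version of that step (the scalers folder name is my choice):

```python
import os
import joblib
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)  # fit on the training data only
x_val = scaler.transform(x_val)
x_test = scaler.transform(x_test)

# keep the fitted scaler around in case we need it later
os.makedirs("scalers", exist_ok=True)
joblib.dump(scaler, "scalers/standard_scaler.pkl")
```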

Notice that we fit the scaler only on the training data, and then transform all three datasets with it. This avoids any potential bias leaking in from the validation/test data. We also create a folder to store our scalers, in case we need them in the future.

I decided against scaling the y values, both here and for the min-max transform we will see later. Our loss functions are going to be sign-sensitive, so if we scaled the y values we would have to unscale them inside the loss function during training. While doable, our y values are percentage-based and generally well-behaved distribution-wise, so it’s better to leave them unscaled. Let me know in the comments if that is the right choice!

Now, for something exciting and often overlooked — PCA (Principal Component Analysis).

PCA is one of those methods that comes in really handy, allowing us to reduce the number of features by projecting them onto their principal components.

It will de-correlate our features, and potentially reduce some noise — which is nice considering we skipped that step.
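
A sketch of the reduction step (keeping 99% of the explained variance is an arbitrary choice for this example):

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=0.99)  # keep enough components to explain 99% of the variance
x_train = pca.fit_transform(x_train)
x_val = pca.transform(x_val)
x_test = pca.transform(x_test)
```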

After this, we will perform min-max scaling for our feature data:
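
Again we fit on the training set only (a sketch; joblib and the scalers folder carry over from the previous step):

```python
from sklearn.preprocessing import MinMaxScaler

minmax = MinMaxScaler(feature_range=(0, 1))
x_train = minmax.fit_transform(x_train)
x_val = minmax.transform(x_val)
x_test = minmax.transform(x_test)

joblib.dump(minmax, "scalers/minmax_scaler.pkl")
```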

This is pretty standard: we still fit on the training data and then transform the validation and test sets based on that fit. We also create a folder to dump the scaler into, in case we need it later. The (0, 1) range is used here; originally I thought about scaling the features to the (-0.5, 0.5) range, but (0, 1) simply worked better in my attempts. The discussion to be had, of course, is whether (-0.5, 0.5) would be better in general, since it keeps some of the features negative, as they were originally — but considering we removed the mean two steps ago, we have probably skewed them anyway.

The final step is to create time-series data with shifting windows. We want a (Batch, time_steps, features, channels) format for our input data since that’s what our CONV2D network will require.

The trim_dataset function trims the dataset so that the total length of the series is divisible by the batch size. This becomes useful if we decide to use stateful LSTM networks.
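
A rough sketch of both helpers (the window construction assumes each window predicts the targets of the step right after it ends):

```python
import numpy as np

def build_timeseries(features, targets, time_steps=5):
    # sliding windows shaped (batch, time_steps, features, channels) for a Conv2D-style input
    n = features.shape[0] - time_steps
    x = np.zeros((n, time_steps, features.shape[1], 1), dtype=np.float32)
    y = np.zeros((n, targets.shape[1]), dtype=np.float32)
    for i in range(n):
        x[i, :, :, 0] = features[i:i + time_steps]
        y[i] = targets[i + time_steps]
    return x, y

def trim_dataset(arr, batch_size):
    # drop the remainder so the length divides evenly by the batch size
    return arr[: (arr.shape[0] // batch_size) * batch_size]
```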

Generally, this is the gist of my data pipeline. This can be dramatically improved, but for this toy example, it will do just fine.

Part 3 — Loss functions

Before we talk about network architectures, parameters, and all that jazz, we need to address the fact that the standard loss functions most data scientists are used to using are just not going to cut it with financial data.

And the reason is simple: they (the loss functions, not the scientists) don’t care about the sign of the result. For MSE or absolute-value loss, -0.1 is equal to 0.1 if the real result is 0, but those two are very different in practice. One tells us to take a short position, while the other makes us go long — which is not ideal, at all.

For the loss functions graphs below I fixed the true value to 0.1 and plotted an array from -0.5 to 0.5 of pred values, to visualize the loss function better.

So we need some sort of loss function that does care about the signs. If we do some google-fu we can find an example of one:
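
Here is a TensorFlow adaptation of that loss (a sketch; the vectorized tf.where form is mine):

```python
import tensorflow as tf

def stock_loss(y_true, y_pred, alpha=100.0):
    # absolute error, with an extra alpha * y_pred**2 penalty when the signs disagree
    wrong_sign = alpha * tf.square(y_pred) - tf.sign(y_true) * y_pred + tf.abs(y_true)
    right_sign = tf.abs(y_true - y_pred)
    return tf.reduce_mean(tf.where(y_true * y_pred < 0, wrong_sign, right_sign))
```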

Loss from Probabilistic-Programming-and-Bayesian-Methods-for-Hackers, Chapter 5 (Adapted for Tensorflow)

This function is equal to absolute-value loss, except it adds an extra penalty of alpha * y_pred**2 should the signs of y_true and y_pred differ. It’s a nice start, but I see three problems with it:

1) Absolute value error is not the best for convergence — modifying this to use MSE should give us much better performance.

2) It relies on an alpha parameter — the value of this parameter will most likely affect what kind of results we will get. We can find the value of this parameter by using Bayesian optimization, or even better, by wrapping it into a custom layer and making it trainable. Either way, this is extra work.

3) It has a weird behavior around the sign flip — I’m not 100% confident how the network will react to this.

I’m going to address point 1 by re-writing this function for MSE loss:
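
A sketch of that rewrite, with the squared error swapped in:

```python
def asymmetric_mse(y_true, y_pred, alpha=100.0):
    # squared error everywhere, plus an alpha * y_pred**2 penalty when the sign is wrong
    mse = tf.square(y_true - y_pred)
    penalty = alpha * tf.square(y_pred)
    return tf.reduce_mean(tf.where(y_true * y_pred < 0, mse + penalty, mse))
```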

To fix point 2, we ideally would wrap the loss function in a custom Keras layer that will adjust the alpha parameter by learning on the fly. This is something I plan in the future, but for now, we will stick to alpha = 100.

Point 3 can be addressed by finding a different function, that is continuous across sign flips, and multiplying this loss by it.

Asymmetric MSE loss. Local minima where y_true = y_pred, exponential growth when the sign of y_true and y_pred are different.

That’s where cosine similarity loss comes in handy. It is a very simple loss — in essence, it measures the angle between two vectors, which is exactly what we need. Keras’s implementation gives us a value of -1 if the two vectors point in the same direction, 0 if they are orthogonal, and 1 if they point in opposite directions.

I couldn’t find cosine similarity used anywhere relevant to this kind of work, so it is definitely a good idea to try it out. The problem with it is that it doesn’t care about the magnitudes of the values — good thing we have the asymmetric loss above to help with that.

Function #3 that I propose is this one:
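
My best reconstruction of it from the description below; the exact shape may differ slightly, and the small epsilon guard is an assumption:

```python
def ratio_loss(y_true, y_pred):
    # signed ratio of prediction to truth; invert it when it leaves the [-1, 1] range
    ratio = y_pred / (y_true + 1e-8)
    ratio = tf.where(tf.abs(ratio) > 1.0, 1.0 / ratio, ratio)
    # a perfect prediction gives a ratio of 1, so penalize the distance from 1
    return tf.reduce_mean(1.0 - ratio)
```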

Minima at y_true = y_pred, horizontal parabolic increase as values diverge in the same sign direction, linear increase as values diverge in the direction of different signs.

What we do here is take the ratio of the predictions to the true values — and if they exceed 1 or -1 we take the inverse of the ratio, to keep the range between -1 and 1.

This, in theory, should help us “glue” the asymmetric MSE loss and the cosine similarity together — since this function cares about both the magnitudes and the signs of our predictions.

Now, potentially this loss might have some problems I’m not aware of — if so let me know in the comments. I don’t see any obvious problems with it (it’s differentiable, at the very least), but considering I haven’t seen it anywhere else there might be a reason for it to be so.

The total loss that I propose is this:
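
One plausible way to combine the three pieces into a single Keras-compatible loss (the equal weighting here is an assumption for the sketch; the exact mix is worth tuning):

```python
def total_loss(y_true, y_pred):
    cos = tf.reduce_mean(tf.keras.losses.cosine_similarity(y_true, y_pred, axis=-1))
    return asymmetric_mse(y_true, y_pred) + cos + ratio_loss(y_true, y_pred)
```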

This loss grows faster in the direction of the opposite sign, so we are hopefully more likely to overshoot our target than to undershoot it — making it more likely that we hit the correct sign.

This should cover the bases on all fronts — both value-wise, and sign-wise. Some experimentation with squaring the cosine similarity might be useful, but for now, that is what I will be using.

An even better approach would be a custom Keras layer acting as a loss — maybe with some pseudo-trading logic, or a vectorized backtest implemented as a loss. This is something I’m planning on experimenting with in the future.

We should also come up with some sort of metric to track how training is going. Cosine similarity and our ratio loss do act somewhat like metrics, but a more human-readable one would be nice to have.

After some tinkering, this one made the most sense to me:
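
A sketch of that metric, following the description below (the division by two, which the prose glosses over, accounts for each sign mismatch contributing a difference of 2):

```python
def sign_accuracy(y_true, y_pred):
    # percentage of predictions whose sign matches the target
    diff = tf.abs(tf.sign(y_true) - tf.sign(y_pred))  # 0 when signs match, 2 when they differ
    total = tf.cast(tf.size(y_true), tf.float32)
    wrong = tf.reduce_sum(diff) / 2.0
    return (total - wrong) / total * 100.0
```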

We grab the signs of y_true and y_pred, subtract one from the other, and take the absolute value of the result. We then subtract the sum of those values from our batch size, divide by the total batch size, and multiply the end result by a hundred. What we get is the percentage of signs in y_pred that match y_true, which gives us an easy indicator of how well the network is doing. Since we are predicting five different time lags, a single lag doing poorly might skew this indicator, but as a general metric it should do fine.

There are definitely a lot of things that haven’t been publicly explored in regard to specialized loss and metric functions. I don’t think what I have here is the peak of what could be done; it’s just something that currently makes sense for the task at hand. There is definitely room for improvement with custom layers acting as a loss — those allow us to keep accumulated values inside, so a faux-trading loss or something similar might produce an even better result.

Part 4 — Evaluating the results

Before we can go to the modeling part, we need to figure out how to confirm if our potential models are actually producing adequate results.

Full disclosure: this is my weakest area so far. The tools I’m proposing here are rudimentary, but in combination they should give us at least an idea of whether we are predicting something more than random noise.

Ideally, we would use an event-based backtesting engine to see how our strategies perform in action — this is something that I plan to do in the future, but for now let’s look at the tools at hand:

Computing an information coefficient with the Spearman rank correlation

The easiest, and most natural way to measure how well our predictions are doing is to measure the correlation between our predictions and actual returns.

We will be using a non-parametric Spearman rank correlation coefficient, which measures how well the relationship between two variables can be described using a monotonic function.

Scipy.stats has us covered in this case, and the code is very simple:
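
A sketch of the check, assuming y_pred and y_true are arrays with one column per lag, as in the pipeline above:

```python
from scipy.stats import spearmanr

for i, lag in enumerate([1, 4, 12, 24, 48]):
    coef, p_value = spearmanr(y_pred[:, i], y_true[:, i])
    print(f"{lag}h lag: Spearman IC = {coef:.3f} (p = {p_value:.3f})")
```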

We will compute the coefficient for each of the five lags individually, to see the strength of the correlation, as well as whether the signals are correlated at all.

If the samples are correlated, any correlation coefficient above 0.05 could be usable for active portfolio management. If we are lucky, we should be able to see values of 0.1 or above. Unfortunately, since this is finance, we most likely won’t see results above 0.15, due to the chaotic nature of the markets.

Vectorized Backtesting.

Vectorized backtesting is not nearly as good as an event-driven backtest, but it should still give us a reasonably decent indication of whether we are doing well or not.

We will do a very simple version: start with some amount of money, make a fixed bet based on the predictions, and then evaluate the new balance based on the real value.

Overall, it looks something like this:
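
In sketch form (the starting balance and bet size are arbitrary, and y_true is assumed to hold the realized fractional returns):

```python
import numpy as np

def vectorized_backtest(y_true, y_pred, start_balance=10_000.0, bet=1_000.0):
    # go long when the prediction is positive, short when it is negative,
    # with a fixed bet each step, and settle against the realized return
    balance = start_balance
    history = []
    for pred, real in zip(y_pred, y_true):
        balance += bet * np.sign(pred) * real
        history.append(balance)
    return history
```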

Another simple method is to count how many signs (up/down) we got right:
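
For example:

```python
correct = int(np.sum(np.sign(y_pred) == np.sign(y_true)))
incorrect = y_true.size - correct
print(f"Correct signs: {correct}, incorrect: {incorrect}, net: {correct - incorrect}")
```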

This indicates how many more (or fewer) signs were correct than incorrect. I expect this to correlate strongly with the vectorized backtest above, since it’s practically the same idea (although the backtest also depends on the magnitudes of y_true, so the results might differ). I’ll refer to this method as accuracy.

Overall, the result we are looking for is one that correlates with the realized returns with a Spearman coefficient of at least 0.05, performs decently in the vectorized backtest, and gets more signs right than wrong. If all three hold, we can conclude that there is some sort of signal in our predictions with a chance to outperform a buy-and-hold strategy, provided active portfolio management is done correctly.

Part 5 — The Modeling.

So, with all the tools above, let’s see if we can generate models that will produce a result that fits our criteria.

Let’s briefly look at the settings I used to generate my models (a minimal compile sketch follows the list):

Activation function — I’ll be using the swish activation function, x * sigmoid(x). It helps avoid exploding and vanishing gradients, and it was the only activation function that behaved well with most of the networks I tried, generally outperforming the alternatives.

Final Layer Activation — Linear. No surprise here, a classic for regression tasks.

Normalization — No normalization will be applied.

Optimizer — Adam will be used, with a typical learning rate of 0.0001.

Time Steps — 5, if applicable to the model.
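
Putting those settings together, a compile sketch might look like this (the layer sizes are placeholders rather than the actual architectures shown below; total_loss and sign_accuracy are the custom functions from Part 3):

```python
import tensorflow as tf

def build_model(time_steps, n_features):
    inputs = tf.keras.Input(shape=(time_steps, n_features, 1))
    x = tf.keras.layers.Conv2D(32, (3, 3), padding="same", activation="swish")(inputs)
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dense(64, activation="swish")(x)
    outputs = tf.keras.layers.Dense(5, activation="linear")(x)  # one output per lag

    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
        loss=total_loss,
        metrics=[sign_accuracy],
    )
    return model
```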

All other hyperparameter values were picked by trial and error. Ideally, we would use KerasTuner for this task, but as stated above, I currently don’t have the computational resources to do so. This is exciting, because it means that if this works there is infinite space to improve!

Full disclosure: I’m removing the mean (bias) from all the model outputs. Most, if not all, of my models have a bias that I cannot fix at the current time; it comes from underfitting due to the loss we are using. Can you spot the problem? There is a discussion to be had on how bias could be used to estimate the “confidence” of the model, but that is outside the scope of this article.

So, let’s look at our models:

Dense:

Structure of our dense model.
Predictions generated by the best Dense Model
Backtest for Dense model.

Surprisingly, the Dense model has a strong correlation with the 24-hour lag (albeit a negative one). It also seems to have decent accuracy after the mean removal. That is where the good news ends, though: it completely fails the backtest (even on 24-hour intervals), and the graph shows that the model isn’t much use for serious forecasting.

This is expected, as dense models are generally not suited to time-series data. I cannot explain the decent numerical result it got on the 24-hour lag — even though I did my best to avoid data leakage, that possibility remains. Another possibility is that I’m not using enough points to backtest (only 1,000). Either way, we cannot trade on the results of this model and must move on.

LSTM:

Structure for LSTM network.
LSTM predictions.
LSTM backtest vs buy and hold strategy

As we can see, the LSTM network did better than the Dense one. It has good correlations with the 12-hour and 24-hour intervals, although the backtest on the former is lacking. The correlation is still negative — something I see in a few of my models, but not always. It guessed a decent number of directions correctly on the 12-, 24-, and 48-hour intervals. All in all, with better hyperparameter selection it could be a solid pick.

Convolutional LSTM

Structure for convolutional LSTM network
ConvLSTM predictions.
Backtest for ConvLSTM.

The convolutional LSTM is strongly correlated with the 24- and 48-hour lag intervals, outperforms in the backtest on the 48-hour result, and has good accuracy on the 12-, 24-, and 48-hour lags. It’s still negatively correlated (I wonder whether that has something to do with the loss I used, or with the lack of hyperparameter optimization).

For the final showcase, I made a very simple ensemble: an average of the outputs of around 30 LSTM and ConvLSTM models. To account for the negative correlations, some predictions were inverted, and the lags of models that did not correlate well were discarded.

Predictions for the average ensemble. 1-hour predictions are missing due to not having good correlations.
Backtest for ensemble model.

Wow! What great results from such a simple ensemble. The 1-hour results are missing completely, since no models correlated strongly on that interval. Other than that, we got a strong correlation on all the other lags, and the backtest does great on every interval except 24 hours (although even there it eventually outdoes the buy-and-hold approach). We also guessed noticeably more signs correctly than incorrectly. A pretty good result!

One final backtest I’d like to do is to run predictions on old data, cutting out any future data, exactly as it looked during that hour. This way we can check how accurate our predictions would have been if we had run them n hours ago. I accomplished this by taking slices of the 1,000-hour API data, starting 50 hours ago, and moving back by 1 hour for every prediction.

I did this with the whole ensemble set of models, for 446 hours, with the following results:

This shows that the ensemble would have predicted accurately if we had run it in the past. 60% is quite good! Surprisingly, the 48-hour lag did not outdo the other lags, but that could be due to a combination of factors I’m not aware of right now.

Overall, as we can see, the individual models perform decently, but not nearly as well as the ensembled ones. This makes sense; ensemble techniques exist for a good reason. I’m still worried about the possibility of a data leak, but if one exists, I cannot find it in my code. It would be a good idea to backtest these models over a larger time interval, but that is for another time. Overall, I think we got decent results.

There are a lot of things that could be done to improve on this result — using stateful LSTMs, adding attention, or better ensembling techniques.

Part 6 — Conclusions

So, here we are. A lot of corners were cut, I know, but I believe the results are there. We CAN extract useful information from technical analysis, even when using subpar data processing and methods. It’s all a matter of applying the right tools and techniques.

Where do we go from here? Well, for starters, those corners that were cut will have to be un-cut. Having order book data is a BIG thing that I believe will help, as is having an investment universe with more than one asset to predict on. Augmenting the data with higher-resolution intervals and extra features such as candle-pattern analysis should help as well. Adding other real-world data, such as economic factors, might help too — it would be interesting to see how stock markets affect crypto assets.

Experimenting with different time horizons will help: in my testing, very few models performed well on the 1-hour horizon, while a good chunk did perform well on the 48-hour one. Would predicting the OHLCV candles themselves help the network learn better? That is definitely something to explore.

Big improvements can be made on the modeling front too: more advanced models, better ensembling, hyperparameter tuning, and experimenting with individual model structures. Introducing new, dynamic losses through custom layers (or maybe using those custom layers for the training itself) is definitely also something to explore.

Eventually, we will also need to find a way to generate buy and sell signals from our predictions. I believe GAN networks would be excellent for that, receiving the market and prediction data and outputting an appropriate signal. Alternatively, manual algorithms could be written to do so.

There are probably a million other things I’m not aware of or forgot to mention here; for a person who keeps looking, there will always be something to improve.

I believe all of those are reasons why most “entry-level” financial prediction projects fail: there is just too much complexity to address, from data to loss functions to networks, in addition to the lack of freely available information for the stock market.

I will continue to expand on this project — my next big goal is to move this project partially to cloud computing services to take advantage of their GPUs, so I can test bigger and better models. All the improvements above will need to be addressed eventually too — I’ll post an update once I accomplish enough to have something exciting. As for now, thanks for reading, let me know what I missed in the comments!
