Wavenet variations for financial time series prediction: the simple, the directional-Relu, and the probabilistic approach

Véber István
Published in Analytics Vidhya · May 31, 2020

One of the most useful features of deep neural network architectures is their capability to yield good results in very different domains. Image recognition, natural language processing, or regression tasks can be implemented with similar solutions, model parts, or models. To build a state-of-the-art model in any domain, it is important to know the new techniques of other fields.

The Wavenet model by DeepMind was originally created for generating audio but showed good results with language translation and time series prediction.

In this article, we will build models based on the Wavenet architecture to predict currency prices.

Forecasting the stock market or the foreign exchange market is very difficult. There are too many unknown effects. You will never be able to predict what the president of the United States will share on Twitter tomorrow, and a single tweet has the power to alter the course of the market. But most of the time we don't want to build a model capable of predicting the price in all situations. That would be an unachievable goal. We can aim lower: we can try to build models that predict some aspects of the future, in some situations, over a not-too-long timeframe, with acceptable skill. This skill can be very different depending on the exact problem.

What exactly do we want to predict when we are trying to forecast the future price of some asset? Are we trying to predict the real future value? No. We are trying to forecast the response of a system made of millions or billions of individual human brains and algorithms. It doesn't matter how rational the aggregated response is according to our judgment. If the aggregated response is biased, then our subjectively rational decision (which is a belief) can be bad. We could sink into a never-ending philosophical debate about the existence of the ultimate rational decision, but in my world, we have to predict the reaction of a very complex system, and not some very vague real value.

The system we aim to predict can't be totally random, it must have a pattern, but in the case of financial markets, these patterns are more and more difficult to exploit. To find these patterns we have to use every kind of tool and data our idea can exploit, and we have to have all the new data as soon as possible. Deep learning models can help to find non-linear associations and very weak signals if enough data is fed into them, but the longer the forecast range, the more probable it is that some factor outside our model-world will affect the state of the system.

Maybe you have other views. I am a meteorologist, and for more than 10 years I made ultra-short-range predictions, where we always had to use (almost) real-time information to predict the weather, and we often had to disregard the base forecast model because it was obviously unable to deal with the situation at the actual spatial and temporal resolution. This experience inevitably shaped my view of prediction methods. The weather is, of course, a very different beast than the stock market. The weather was guided by the same rules yesterday and will be tomorrow. Financial markets have aspects of a zero-sum game in the short term, where the rules can change from one day to the next. This doesn't mean that it will ever be possible to make perfect weather predictions, but that is another story.

Cumulonimbus over lake Balaton, source: https://pixabay.com/hu/photos/balaton-tihany-felh%C5%91k-term%C3%A9szet-1711867/

Back to our problem:

We will use the Wavenet architecture as the core of our models, with small variations, and will attach different inputs and outputs to this core. The Wavenet architecture is composed of great additions to the deep learning toolbox, like dilated convolutional layers, gated activations, and residual connections. You can read more about them in this article.

Finding a good architecture is necessary, but not sufficient. Without proper input data, no model will yield useful results. There are datasets where inputting simple, one-dimensional data can be enough to predict a few future steps with low error, but financial datasets aren't so easy to tame. Giving the model only the 1-minute or 1-day history of the closing ask price of an asset will not give good results (at least it didn't for me).

Not so long ago I read Daniel Kahneman's book, Thinking, Fast and Slow. This book isn't about deep learning; it is about human thinking, the effect of cognitive biases, where we can and cannot trust our intuitions or decisions, and other things. A fantastic book. I wasn't able to read it without constantly comparing the properties and biases of the human brain to deep learning models. Don't think that I believe general artificial intelligence is almost within our grasp. I don't even like the name "artificial intelligence"; it promises too much compared to our available tools. But there are some similarities, and if you do machine learning, there is a good chance that you are making something for people or about people. Even if you aren't interested in decision theory or psychology, you can be hit by great intuitions for improving your product while reading this book.

One important cognitive bias we have is briefly described as "What you see is all there is." This refers to the fact that we make decisions based on inadequate information, or rather, on all the information we have (most of the time). We don't really think about the information we don't have; that would be too much effort. This is true for deep learning models as well. It doesn't matter how many neurons and what kind of state-of-the-art architecture we build. We can give the model petabytes of data, but if that data doesn't carry enough information (different features being in some kind of association with our output) for our shiny function approximator to map it to the label, then our model will not perform well.

Here is a short summary of the data we will use. Our goal is to forecast forex bar features of the upcoming step. Forecasted pairs: EUR/USD, GBP/USD, and JPY/USD (not the more common USD/JPY). We will forecast simple price values, log-returns, and the directions of changes.

We will play a bit with the Relu activation as output. With activation functions as output, we can determine the codomain of our models. In the end, we will build the Wavenet model with a distribution output. For that, we will use the Tensorflow Probability library. An implementation of a similar model can be seen here: Time series prediction with multimodal distribution — Building Mixture Density Network with Keras and Tensorflow Probability.

We will forecast features calculated from the tick means of the bar ranges, and not OHLC. Tick means are more representative measurements of the price during the bar period and not as noisy as the closing price. It is easier for a model to find patterns if we use means.
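As a rough illustration of the idea (not the exact pipeline from the repository), tick data can be aggregated into 5-minute bars built around tick means with pandas; the column names and aggregates below are placeholders:

```python
import pandas as pd

def make_tick_mean_bars(ticks: pd.DataFrame) -> pd.DataFrame:
    """Aggregate raw ticks (datetime index, hypothetical 'bid' column)
    into 5-minute bars described by the tick mean instead of OHLC."""
    bars = ticks['bid'].resample('5min').agg(['mean', 'std', 'count'])
    bars = bars.rename(columns={'mean': 'tick_mean',
                                'std': 'tick_std',
                                'count': 'tick_count'})
    return bars.dropna()
```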

The data inputs have two main components:

  • features generated from tick data during a 5-minute range (source: Dukascopy)
  • features generated from economic news calendar (source: FXStreet)

Generating hundreds of features required compromises and arbitrary choices. Preparing the data took longer than building the models and training them. The short manual, data processing, and pipelines can be found here: https://github.com/sinusgamma/probabilistic_wavenet_fx/blob/master/data_preparation.md. In the article Financial bars at the age of deep learning, you can read about some other ideas.

To feed the data to our model we will use the Tensorflow dataset API.
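A minimal sketch of how such a windowed dataset can be built with the tf.data API (the sequence length, batch size, and shuffle buffer are placeholders, not the values used in the notebook):

```python
import numpy as np
import tensorflow as tf

SEQ_LEN = 64  # input sequence length (placeholder)

def make_dataset(features: np.ndarray, labels: np.ndarray,
                 batch_size: int = 256, shuffle: bool = True) -> tf.data.Dataset:
    """Slide a window of SEQ_LEN steps over the features and pair each
    window with the label of the step right after it."""
    ds = tf.data.Dataset.from_tensor_slices((features, labels))
    ds = ds.window(SEQ_LEN + 1, shift=1, drop_remainder=True)
    ds = ds.flat_map(lambda f, l: tf.data.Dataset.zip((f.batch(SEQ_LEN + 1),
                                                       l.batch(SEQ_LEN + 1))))
    # Inputs: the first SEQ_LEN steps; target: the label of the next step.
    ds = ds.map(lambda f, l: (f[:SEQ_LEN], l[-1]))
    if shuffle:
        ds = ds.shuffle(10_000)
    return ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)
```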

For training, we will use the 2016–2018 period, and 2019 is the validation period. For simplicity, I didn’t use separate test data.

The models were trained on the Google Cloud AI Platform.

The notebook with the code is available on Github: https://github.com/sinusgamma/probabilistic_wavenet_fx/blob/master/wavenet_fx_final.ipynb

The Core Wavenet model

Here is the code of the core model, which will be part of all models with slightly different parameters. Depending on our input data or output attachment, the parameters of the wavenet_model_setup function will change from one model to the other.
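A minimal sketch of what such a wavenet_model_setup function can look like, with dilated causal convolutions, gated activations, and residual/skip connections (the filter counts, dilation rates, and output layer below are placeholders rather than the exact parameters from the notebook):

```python
import tensorflow as tf
from tensorflow.keras import layers

def wavenet_model_setup(seq_len, n_features, n_filters=32,
                        dilation_rates=(1, 2, 4, 8, 16), n_outputs=3):
    inputs = layers.Input(shape=(seq_len, n_features))
    x = layers.Conv1D(n_filters, 1, padding='same')(inputs)   # project to filter width
    skips = []
    for d in dilation_rates:
        # Gated activation unit: a tanh "filter" gated by a sigmoid "gate".
        f = layers.Conv1D(n_filters, 2, dilation_rate=d,
                          padding='causal', activation='tanh')(x)
        g = layers.Conv1D(n_filters, 2, dilation_rate=d,
                          padding='causal', activation='sigmoid')(x)
        z = layers.Multiply()([f, g])
        z = layers.Conv1D(n_filters, 1, padding='same')(z)
        skips.append(z)
        x = layers.Add()([x, z])                               # residual connection
    out = layers.Activation('relu')(layers.Add()(skips))       # skip connections
    out = layers.Conv1D(n_filters, 1, activation='relu')(out)
    out = layers.Lambda(lambda t: t[:, -1, :])(out)            # keep the last timestep
    out = layers.Dense(n_outputs)(out)                         # output attachment varies per model
    return tf.keras.Model(inputs, out)
```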

The baseline prediction

First, we set baselines we want to outperform.

The most obvious and naive feature we can predict is the price itself. The most naive forecast, which sometimes can be very hard to surpass, is called the 'Naive Forecast': we predict that the price at the next step is the same as the price at the last known step. The MAE of this forecast can be calculated from our label dataset. Here we average the MAE of the three currency pairs.
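The calculation boils down to something like this sketch (assuming a hypothetical prices array with one column per currency pair):

```python
import numpy as np

def naive_price_mae(prices: np.ndarray) -> float:
    """MAE of the 'no change' forecast: predict the last known price.
    prices has shape (n_steps, 3), one column per currency pair."""
    return float(np.mean(np.abs(prices[1:] - prices[:-1])))
```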

0.00014286015166823933

If we do the same naive prediction with log-returns, we make a mistake. Price describes a state, while log-return describes the change of a state. So with the above naive math, we would forecast that the price change will be the same as before. (In the final dataset, log-returns are scaled in both the input and the output.) But let's see how this prediction would perform:

0.45833495595504936

When we forecast that the price in the next step will be the same as before, we assume that the log-return will be 0.0. Calculating the error of this forecast gives a better naive prediction, with a lower MAE than the one above.

0.3554945769675469

The price forecasting models

I tested two models to forecast the price of the currencies. One model was trained only with the features derived from the tick data, the other model was trained with additional features derived from economic news data.

Simplified architecture of price forecaster model

In the graph above we can see the architecture of our model with price and news data.

  • input_nosparse: This represents the data derived from the economic news dataset. The 'no-sparse' in the name means that I dropped some features which were mostly zeros because of one-hot encoding. Very sparse inputs make the training very ineffective, so I included and generated features that yield not-so-sparse tensors. This dataset includes the last surprise factor of a given news event and a counter, which tells the model how far we are from the last event. (Another counter could count the timesteps until the next known occasion of that event, but here I didn't use that.) The idea of the counter is similar to the CoordConv architecture, which "allows filters to know where they are in Cartesian space by adding extra, hard-coded input channels that contain coordinates of the data seen by the convolutional filter". The difference is that this isn't spatial but temporal help. As the Wavenet model also uses convolution, I hope this will have a similar positive effect on the model performance.
  • input_eventcur: This dataset shows if in the given timestep there is any economic event that could affect our prediction and the currency of the event.
  • input_curbars: Features generated from the tick data, like 'log-returnized' OHLC, the standard deviation or Spearman's rank correlation of the ticks, and others.
  • model_news: News data is sparse and has lots of dimensions. We use a kind of embedding before inputting it to the Wavenet part of the model. Entity embedding can't be used with this dataset, because multiple events can coexist in the same step. Instead, we use depthwise convolution to embed this data. Depthwise convolution here means a convolution where the kernel size is only one, so the kernel weights the input only along one axis. By increasing or decreasing the filter size and using different activations, we can use it to represent our data with fewer or more dimensions. Why don't we use kernels with more dimensions? Imagine an image with RGB layers. The relation of close pixels has meaning, and the task of a typical kernel in an image recognition problem is to exploit these relations. But what if the order of the pixels along a dimension is arbitrary? For example, what if one axis of our tensor represents different currency pairs and the other axis represents some features of the price during a bar range? The order of the currency pairs is arbitrary, and the order of the features is arbitrary as well. In this case, our dataset isn't similar to an image, and there is no point in trying 2D kernels. But the data isn't totally unrelated: along one axis we have all the features of the same currency pair, and along the other axis we have all the currency-pair data of a particular feature. For some of our datasets we could use depthwise convolution from both directions, but I will calculate it only from one (see the sketch after the figure below).
Possible directions of depthwise convolution of a tensor
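A rough sketch of this kind of embedding block (a kernel-size-1 Conv1D that mixes values only along the feature axis at each timestep; the dimensions and names are placeholders):

```python
import tensorflow as tf
from tensorflow.keras import layers

def news_embedding_block(seq_len, n_news_features, embed_dim=16):
    """Compress sparse, high-dimensional news features into a denser
    representation: with kernel_size=1 the convolution mixes values only
    along the feature axis, never along the time axis."""
    news_in = layers.Input(shape=(seq_len, n_news_features))
    x = layers.Conv1D(embed_dim * 2, kernel_size=1, activation='relu')(news_in)
    x = layers.Conv1D(embed_dim, kernel_size=1, activation='relu')(x)
    return tf.keras.Model(news_in, x, name='model_news')
```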

The model with the price data and the additional economic news data was better than the model with the price features alone, but I have to mention that, because of the data size differences, I used different numbers of filters in the Wavenet, so the comparison isn't totally fair.

In the end, the better model had an MAE of 0.00024, which was worse than the naive no-price-change forecast with its MAE of 0.00014. With some tuning, architecture optimization, and input data preprocessing, I think it wouldn't be too hard to outperform our base prediction, but we will pursue different goals. Forecasting the price is harder than forecasting some other, equally useful features. Even if we normalize or standardize our output, the variation of the price during a sequence can be far smaller than the variation in the whole dataset, which makes it harder for the model to capture the changes. It would be a better choice to forecast the change in price compared to the start of the sequence or compared to the earlier step, or we can forecast the return or log-return instead (which is what we will do later).

I can’t go further without a nice chart, similar to the charts we can so often see when somebody showcases the performance of a one-step forecast model.

The real and the predicted price :) HAHAHA

This always looks so good, and it is so useless and deceptive. If we examine the chart a bit longer, we can notice that the prediction lags behind the real price, and because of this lag it is worse than the naive forecast. If this were a multi-step forecast all along, that would be fantastic, and you know, I wouldn't show it to you. But this is generated from one-step forecasts, and it is very hard for a human to evaluate the performance of the model from this image.

An alternative way to show the performance of our model on a chart is to display the real change of the price, the predicted change, and the error. This doesn't look so good.

The log-return forecasting models with directional Relu

This chapter of the article would have been shorter if I hadn't made a mistake in the code. But I made one, and I was happy about it.

Originally, in the Wavenet part, I used Relu activation as the output. Relu is zero where the input is negative and keeps the original number where it is positive. For price prediction this wasn't a problem, as every currency pair has a positive value everywhere. But log-returns can be negative or positive, so my outputs were weird: there were too many zeros.

After realizing my mistake I tried to exploit this property of the Relu activation.

I developed three models with the same architecture and input data; only the last activation function of the models was different. The first model had a normal Relu activation and was able to predict only positive log-returns and zeros. The second model had an inverted Relu activation and was able to predict only negative log-returns and zeros. The third model didn't have an output activation function and was able to predict any number.

I implemented the negative Relu in a very simple way: I multiplied the Relu output by -1. Theoretically, we should implement the negative Relu differently, keeping the negative inputs and zeroing out the positive ones, but as the weights are determined by the output, they would give us the same results, only with opposite weights before the last lambda layer.
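In Keras, the three output attachments can be sketched roughly like this (output_head is a hypothetical helper; in practice the variants are three separately trained models, and the output width is a placeholder):

```python
import tensorflow as tf
from tensorflow.keras import layers

def output_head(wavenet_out, variant: str):
    """Attach one of the three output variants to the core Wavenet output.
    'pos'  -> plain Relu: only positive log-returns or zero,
    'neg'  -> inverted Relu: the Relu output multiplied by -1,
    'none' -> no activation: any real-valued log-return."""
    if variant == 'pos':
        return layers.Dense(3, activation='relu')(wavenet_out)
    if variant == 'neg':
        x = layers.Dense(3, activation='relu')(wavenet_out)
        return layers.Lambda(lambda t: -1.0 * t)(x)   # inverted Relu
    return layers.Dense(3)(wavenet_out)               # linear output
```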

The model predicts log-returns, but we implement some metrics to evaluate the directional forecast performance of the model.

(The log-returns in my dataset are scaled, divided by a standard deviation, but aren’t shifted.)

These metrics need some explanation. The directional metrics measure how often the model can predict whether the log-return will be positive or negative, regardless of the magnitude of the log-return. For example, 'direction_acc' measures the percentage of predictions on the correct side out of all predictions. With this metric, our no-activation output is higher than the positive or negative Relu outputs, as the Relu outputs are able to predict only one side. But the 'direction_acc_pos' metric counts only the positive log-returns, and 'direction_acc_neg' counts only the negative log-returns, so we can use these metrics to compare the performance of our Relu models to the no-activation model. Being on the correct side is important, but it is also important to know how often we are on the wrong side. For that, we use the '_inacc_' metrics; they measure how often we predict the wrong side.

Our directional accuracy and inaccuracy functions use > or < and not ≥ or ≤ comparisons. This way we generate an uncertain zone. The model with no activation never predicted a 0.0 log-return; it isn't really capable of that, because there is a very small chance that the weights will output exactly zero. The sum of the directional accuracy and inaccuracy of this model is 1.0 with these metrics. But if we add together the directional accuracies and inaccuracies of both the positive and negative Relu models, we get a number less than 1.0: we get 0.86. We have another metric, 'pred_zero', for counting the rate of zero predictions. The rate of zeros of the no-activation model is 0.0, and maybe you would anticipate that the rate of zeros in the Relu models is close to 0.5, with their sum close to 1.0. No: in both models the rate of zeros is above 0.5. Or, from another viewpoint, they both predict the direction available to them by the Relu in less than 50 percent of the steps. The Relu output pulled a new card into play: uncertainty. Training with Relu made the models more cautious, because with this activation function as output they are allowed to be cautious. Of course, we could determine a threshold by optimizing for maximum F1-score on the training data, or use other methods to find the best threshold for our task, and apply that threshold to all the models. This way we would give uncertain regions to the no-activation model as well, but the Relu models did that without any postprocessing.
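One plausible reading of these metrics as plain NumPy functions (the exact definitions in the notebook may differ; all rates below are fractions of all predictions, and the strict comparisons leave exact zeros in the uncertain zone):

```python
import numpy as np

def directional_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Directional metrics with strict > / < comparisons."""
    n = len(y_true)
    return {
        'direction_acc':       np.sum(((y_pred > 0) & (y_true > 0)) |
                                      ((y_pred < 0) & (y_true < 0))) / n,
        'direction_inacc':     np.sum(((y_pred > 0) & (y_true < 0)) |
                                      ((y_pred < 0) & (y_true > 0))) / n,
        'direction_acc_pos':   np.sum((y_pred > 0) & (y_true > 0)) / n,
        'direction_inacc_pos': np.sum((y_pred > 0) & (y_true < 0)) / n,
        'direction_acc_neg':   np.sum((y_pred < 0) & (y_true < 0)) / n,
        'direction_inacc_neg': np.sum((y_pred < 0) & (y_true > 0)) / n,
        'pred_zero':           np.mean(y_pred == 0.0),
    }
```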

The Relu output models predicted lots of zeros. Because of that, their directional accuracies weren't as high as the positive or negative directional accuracy of the no-activation model, but their inaccuracy scores were close or lower.

We can go further with our Relu models to forecast uncertain zones. If we examine the predictions of the Relu models together, we can have predictions where (see the sketch after this list):

  • The positive and negative Relu outputs are both zero: we forecast zero log-return.
  • One log return is zero, the other isn’t: we forecast the non-zero log-return.
  • Both predictions are non-zeros. The Relu output models predict opposite directions. This is an uncertain situation. We predict zero log-return.
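The combination rules above can be sketched like this (pred_pos and pred_neg are hypothetical prediction arrays from the positive- and negative-Relu models):

```python
import numpy as np

def joint_relu_prediction(pred_pos: np.ndarray, pred_neg: np.ndarray) -> np.ndarray:
    """Combine the positive-Relu and negative-Relu model outputs:
    - both zero               -> forecast zero,
    - exactly one is non-zero -> forecast that non-zero log-return,
    - both non-zero (the models disagree) -> uncertain, forecast zero."""
    joint = np.zeros_like(pred_pos)
    only_pos = (pred_pos > 0) & (pred_neg == 0)
    only_neg = (pred_neg < 0) & (pred_pos == 0)
    joint[only_pos] = pred_pos[only_pos]
    joint[only_neg] = pred_neg[only_neg]
    return joint
```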

After building the above joint model, we check the rate of accurate and inaccurate model directions. ('_dbl' is our joint model.)

2.690532928942808

3.035407773534037

2.4188293295362335

3.031693169537505

The joint model had fewer accurate predictions than the no-activation model, but it had fewer inaccurate predictions as well. The ratio of accurate to inaccurate predictions was 2.42 with the no-activation model and 3.03 with the joint-Relu model, considering both directions. This is very good. (It seems too good to me, so if you discover an error in my logic or data preparation, please tell me!)

The no-activation model always had to choose a side, but the joint-Relu model was able to give 0.0 log-returns in uncertain situations (about 20 percent of the steps), and this helped the model to improve the accurate/inaccurate prediction ratio.

Oh, and I've almost forgotten to mention that with the log-return forecast we outperformed the base model. MAE of the no-activation model: 0.2835; MAE of the baseline: 0.3555.

Probabilistic Wavenet

The joint Relu model helped us to discover some uncertain situations, but a more sophisticated way to describe uncertainty is to predict distributions or probabilities instead of simple values.

What does a distribution represent in our problem? It represents a belief, the subjective belief of our model about the possible outcomes of the future.

Nassim Nicholas Taleb wrote in his book, Fooled by Randomness: “Probability is not a mere computation of odds on the dice or more complicated variants; it is the acceptance of the lack of certainty in our knowledge and the development of methods for dealing with our ignorance.”

This statement is closely related to Kahneman’s “What You See Is All There Is” problem I quoted earlier.

Probability has an objectivist (or frequentist) and a subjectivist (or Bayesian) view. You can read a bit more about them on Wikipedia. All probability problems can be framed by both views, but some problems are better viewed through objectivist glasses, others through subjectivist ones.

For example, tossing a fair coin or playing most games in a casino involves uncertainty, but that uncertainty is well modeled by math, and because of the law of large numbers, the casino will never lose on average (or only if it is led by idiots). From the view of the casino, it isn't really an uncertain business (if we don't count tax, competitors, COVID-19, and politics).

But in most real-world examples we have to account for very diverse factors, some of which are hard to represent with numbers (but we have to if we want to build models on that information), and there are situations that occur only once in history. Try to train a model on that.

Predicting forex prices or the stock market is better viewed through our subjectivist glasses. There are lots of factors we don't know. Different people or algorithms have different information about the system, but nobody knows all the factors, and the system is continuously changing, so something that was true yesterday may not be true tomorrow. Or I could say that the distribution of our forecasted time range will be different from the distribution of the historical data (most of the time). Of course, we don't hope to build a perfect model, we just want to build one that is good enough for a task, and for a time.

Here we will build Wavenet models with bimodal normal distribution outputs. (Sorry, Taleb.) It is possible to use lots of other distributions from the Tensorflow Probability library to build mixture distributions, but my goal was to make predictions where the output has double peaks at some steps, so a bimodal normal seemed satisfactory. (You can see a mixture density network with multimodal forecasts here.)

For the first model I used the following architecture:

The three MixtureNormal layers are for the three currency pairs, and we have to make the Wavenet output compatible with our MixtureNormal layers.

The architecture of our Probabilistic Wavenet model
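A rough sketch of how the Wavenet output can be made compatible with the MixtureNormal layers of Tensorflow Probability (two components give the bimodal normal; the head names and sizes are placeholders):

```python
import tensorflow as tf
from tensorflow.keras import layers
import tensorflow_probability as tfp

def mixture_heads(wavenet_out, n_pairs=3, num_components=2):
    """One bimodal-normal head per currency pair: a Dense layer produces
    exactly the parameter vector the MixtureNormal layer expects."""
    params_size = tfp.layers.MixtureNormal.params_size(num_components, event_shape=[1])
    heads = []
    for i in range(n_pairs):
        params = layers.Dense(params_size)(wavenet_out)
        heads.append(tfp.layers.MixtureNormal(num_components, event_shape=[1],
                                              name=f'mixture_{i}')(params))
    return heads

# The distribution outputs are trained with the negative log-likelihood, e.g.:
# model.compile(optimizer='sgd', loss=lambda y, rv_y: -rv_y.log_prob(y))
```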

After training this model, I didn't find any prediction with double peaks, but I didn't expect that from this model.

To explain my goal I just copy/paste a part of one of my earlier articles:

Let’s distinguish three different priors:

  • We know only the earlier movement of the price.
  • We know the earlier movement of the price and the time of economic news.
  • We know the earlier movement of the price, the time of economic news, and the surprise factor.

In the first case, we know nothing about the news. Our model sees only the earlier price movement, and one step before the economic news the model will be blind to the possible up or down jump caused by the surprise. This model doesn't know that the next step can have a large up or down jump. It will probably expect a more symmetric, normal-like outcome even if it is capable of forecasting a multimodal distribution.

In the second case, our model knows the time of the news, but not its surprise factor. A model trained on this dataset will probably know one step before the news that a big jump can come, but not the direction of the jump. This model will most likely forecast a bimodal distribution, probably with peaks of different heights based on our price and news-time history.

In the third case, we know the time and the surprise of the news as well. Of course, this isn't possible before the time of the news. This knowledge will most probably reduce one peak of our bimodal distribution, as the model knows the historical effect of the surprise, and it will most probably forecast a more unimodal distribution.

In the second model, I gave the model some information about the future. This information was known before the predicted timestep, about the predicted timestep. It contained only the existence of upcoming regular economic announcements and the currency of the country.

Unfortunately, this new data didn't improve the performance of the model and didn't lead to double peaks in the output distribution. It could be that the double-peak pattern I wanted to find doesn't exist in the data, or that the model isn't able to recognize it, but in my opinion, the most probable cause is that the number of news-event timesteps is so small compared to the non-news-event steps that we would need other techniques to help the model learn. It is possible to train a model only on the periods where important news is expected in the forecasted step, or to train a model with weighted data, where the sequences before the economic news have a higher weight than the others.

Interestingly, the validation loss of the model was lower than the training loss. This can be because of a bug (as always), or the validation period can be easier to forecast. Because our naive forecast also got much better scores on the validation data than on the training data, I assume that period is easier to predict.

Another problem was that training with the standard Adam parameters overfitted the model in the first epoch. Changing the optimizer to SGD and playing a bit with the learning rate helped a lot; the performance of the model improved.

To see how uncertain the models are, we calculate the weighted sum of the standard deviations of the distribution components. In the next chart, we can see the relative magnitude of the largest, smallest, and median weighted standard deviation sums.

Weighted sum of the mixture standard deviations. noF — model without future data, F — model with future data
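The weighted sum shown in the chart can be computed roughly like this from a distribution returned by a MixtureNormal head (a sketch, assuming the standard MixtureSameFamily attributes are exposed, e.g. dist = model(x_batch)):

```python
import tensorflow as tf

def weighted_stddev_sum(dist) -> tf.Tensor:
    """Uncertainty proxy: component standard deviations weighted by the
    mixture weights, summed per example."""
    weights = dist.mixture_distribution.probs_parameter()    # (batch, n_components)
    stddevs = dist.components_distribution.stddev()[..., 0]  # (batch, n_components)
    return tf.reduce_sum(weights * stddevs, axis=-1)         # (batch,)
```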

With the available dataset and the model variations, we could play for a very long time. There are so many options: try a larger model, input more data, use a different embedding, tune the hyperparameters. Examining the output of the Probabilistic Wavenet model alone could be worth longer research, and if we want to use these outputs for trading strategies, we have a wide variety of options as well.

But I will stop here now. The only thing I want to show you is a prediction of the model and the real price around the 10,000th validation step. Don't forget, the returns are scaled but not shifted, so zero is zero. We can see the JPY/USD scaled 5-min log-return forecast. The model makes mistakes, but I think it isn't bad at taking sides. (The model (noF) trained with the Adam optimizer without the future data has a very similar forecast to the model (F) trained with the SGD optimizer that knows whether the next step has economic news or not.)

Future plans

Possible next article topics:

  • Playing with the output distribution and fitting the model with weighted or selected data around economic news time. I would like to see a double peak in the forecasted distribution :)
  • Building a time-series predictor with the transformer model. I have the intuition that the attention mechanism can exploit multivariate alternative data in the case of time series prediction. Read this for ideas: link.
  • Other things, which at the time of writing this sentence aren’t in my mind or aren’t so compelling.

Thanks for reading. If you have any remarks, critiques, or ideas you want to share, write them in the comments or send a message on LinkedIn. https://www.linkedin.com/in/istvanveber/
