Time series prediction with a multimodal distribution: building a Mixture Density Network with Keras and TensorFlow Probability

Exploring data where the mean is a bad estimator.

Véber István
Analytics Vidhya
12 min read · Mar 12, 2020


Image: https://commons.wikimedia.org/wiki/File:Guentersberg_Wasserkuppe_Tree_Road_Fork.png

When we run a regression, we hope to estimate the most probable value. But depending on our model and data, we often get only a number that minimizes the mean squared error. It may turn out that this value (or anything near it) has never happened and never will. In this article, we will examine a method that can help us discover and handle such situations.

The two most common neural network problems are regression and classification. One of the major differences between them is that classification outputs the probability of a given class, while regression outputs the value of the predicted variable without any information about the uncertainty of the forecast. Even classification models output only point numbers rather than distributions, but most of the time this is satisfactory for estimating the uncertainty of the prediction. Usually we want something like "class B has a chance of 0.73", and not something like "according to our fitted normal distribution there is a 60% chance that the probability of class B is between 0.63 and 0.8".

To address this problem we can use Monte Carlo Dropout (a very good explanation can be found here: link). Monte Carlo Dropout can be a good choice in some cases, but I will show an example where this technique won't really improve our forecast, because the typical loss functions used in regression (mostly MSE) always tend to center the output around the mean of the distribution and can't capture multimodal phenomena.

Recently I started to explore TensorFlow Probability, a library built on TensorFlow, which enables us to estimate the aleatoric uncertainty (known unknowns) and epistemic uncertainty (unknown unknowns) of our model and data. This article gives a really good basic idea of the potential of this library to estimate model uncertainty, but TensorFlow Probability has many more use cases beyond neural networks.

In this article, I will focus on the estimation of the known unknowns. Using TensorFlow Probability, I will build an LSTM-based time-series forecasting model which can predict uncertainty and capture multimodal patterns if they exist in the data. These types of networks are called Mixture Density Networks.

The notebook can be found here: https://github.com/sinusgamma/multimodal_network

The Dataset

In the chart below we can see the shape of our series. I wanted to use data as simple as possible to show some pitfalls of non-probabilistic models. Instead of a continuous time series, I generated a batch of samples with the same patterns. With this data, it is easier to show the behavior of our forecast. The input part (X) is a 30-step series without any pattern or slope; it is only white noise. The target part (Y) goes up with a 65% chance and down with a 35% chance, and has some noise as well.
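As a rough illustration, a generator for this kind of data could look like the sketch below (the sample count, trend slope and noise level are my assumptions, not necessarily the notebook's exact values).

import numpy as np

def generate_samples(n_samples=2000, n_in=30, n_out=10,
                     p_up=0.65, slope=0.04, noise_std=0.05):
    # input: pure white noise, shape (n_samples, n_in, 1)
    X = np.random.normal(0.0, noise_std, size=(n_samples, n_in, 1))
    # each target path trends up with 65% probability, down with 35%
    direction = np.where(np.random.rand(n_samples) < p_up, 1.0, -1.0)
    trend = np.arange(1, n_out + 1) * slope
    # noisy up- or down-trend, shape (n_samples, n_out)
    Y = direction[:, None] * trend + np.random.normal(0.0, noise_std,
                                                      size=(n_samples, n_out))
    return X.astype("float32"), Y.astype("float32")

X_train, Y_train = generate_samples()
X_test, Y_test = generate_samples(n_samples=500)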

For a human it is easy to recognize the bimodal nature of the target steps, and it is noticeable that the up-trend is more common than the down-trend. If we stuck one sample to the end of the other to make an ordinary continuous time series, it would be harder to recognize this bimodal nature, and with real data we are rarely able to recognize similar patterns. With neural networks, our input and output space can have multiple dimensions. Multi-dimensional datasets make it even harder, or impossible, to catch potential multimodality by looking at simple analysis charts, and these patterns can be very hard to find even with careful and extensive examination. But the power of neural networks can help us here if we build the appropriate model.

Bimodal or multimodal patterns aren't so rare that we should neglect them all the time. Some examples where this kind of pattern can occur:

  • Financial time series around regular economic news can go up or down depending on the surprise in the incoming data. As long as we don't know the direction of the surprise (whether the economic news is better or worse than expected), the movement of the price will have a bimodal distribution given our knowledge.
  • Peak traffic hours or restaurant hours, or a lot of other things in our timetable.
  • Daily average precipitation during the year in a large part of the world.

These are obvious examples, not hard to show on a histogram, but neural networks may be able to find "latent" multimodality because of their power in pattern recognition.

Among the examples above, the first one deserves closer attention. Our historical series will obviously be the same regardless of our input data. But the distribution of the forecast, and the modality of that forecasted distribution, will depend on our prior knowledge, that is, on our input data.

Here I make some assumptions about the possible forecasted distributions to show how important our prior knowledge can be and how it can alter our posterior distribution, but I have to stress that these are only my current assumptions. I will examine in an upcoming article whether the forecast distributions really behave this way or not.

In our thought experiment, we use the USD/JPY pair, which in my experience is very sensitive to regular economic news. But what counts as a surprise in economic news? Before regular economic news or indicators are released, there is a consensus estimate of the expected number, the general agreement of experts on the outcome. When the real indicator about inflation, GDP, Non-Farm Payrolls or other official data comes out, it is usually larger or smaller than the earlier consensus. Depending on the deviation from the consensus this can be a smaller or bigger surprise, and big surprises usually affect the price movement.

Let’s distinguish three different priors:

  • We know only the earlier movement of the price.
  • We know the earlier movement of the price and the time of economic news.
  • We know the earlier movement of the price, the time of economic news and the surprise factor.

In the first case, we know nothing about the news. Our model sees only the earlier price movement, and one step before the economic news it will be blind to the possible up or down jump caused by the surprise. This model doesn't know that the next step can bring a large jump in either direction, so it will probably expect a more symmetric, normal-like outcome even if it is capable of forecasting a multimodal distribution. In the second case, our model knows the time of the news but not its surprise factor. A model trained on this dataset will probably know one step before the news that a big jump may come, but not the direction of the jump. This model will most likely forecast a bimodal distribution, probably with peaks of different heights based on the price and news-time history. In the third case, we know the time of the news and the surprise as well. Of course, this isn't possible before the news is released. This knowledge will most probably suppress one peak of our bimodal distribution, as the model knows the historical effect of the surprise, and it will most likely forecast a more unimodal distribution.

A human can trace these conclusions, but a very high-dimensional dataset can hide such connections or patterns from us, though not necessarily from a neural network.

OK, let's go back to our basic example. We will see how we can implement a model capable of forecasting our peaks with Keras and TensorFlow Probability.

Forecasting with simple regression

To demonstrate the inability of the most common regression models to recognize bimodal patterns, I built a simple LSTM model. The model complexity doesn't matter here. With a better model we might predict the mean of the possible future paths more accurately, but nothing more. The problem is that in some datasets there is a chance that the mean path will never happen. Unfortunately, with non-probabilistic approaches we can't do better.

The non-probabilistic model:

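A minimal sketch of such a model (the layer sizes are illustrative assumptions; the notebook's exact architecture may differ):

import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(30, 1))              # 30 white-noise input steps
x = layers.LSTM(32, return_sequences=True)(inputs)
x = layers.LSTM(32)(x)
outputs = layers.Dense(10)(x)                       # 10 forecast steps
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")         # MSE pulls the forecast toward the mean
model.fit(X_train, Y_train, epochs=20, batch_size=32, validation_split=0.1)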

In the graph below we can see that the model did a pretty good job if our only concern is the mean squared error and we are satisfied with an estimate of the mean of the possible paths. The real paths are denoted by "x" and the forecast paths by the "+" sign. 65% of our real paths go up and 35% go down, while the forecast is an up-trend between the two. This isn't a bad forecast; depending on the problem, this can be exactly the estimate we want.

But if the data consisted of the GPS coordinates of drones that reached our destination, and we wanted to send the next drone along the best possible path, then we should definitely avoid these kinds of predictions, as we could easily hit the tree between the roads. Maybe this isn't the best example, but it is obvious that in some cases the mean can be a very improbable point, and we don't want very improbable points as our forecast.

Forecast of a non-probabilistic model — the “+” signs

Fitting a unimodal distribution to the data

Our artificial data have a very similar distribution at every future step. The added noise has the same variance; only the means of the peaks move further from zero. I will examine the 6th step (index=5) of the test data; the other steps have similar properties.

First, we fit a normal distribution to the 6th forecast step. In the graph below we can see how badly this distribution represents our data. We fitted it to the data itself, so this is the best guess we can hope for from a unimodal normal.

The fitted simple Gaussian distribution
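A minimal sketch of this unimodal fit with TensorFlow Probability, reusing the Y_test array from the data sketch above (the maximum-likelihood normal simply uses the sample mean and standard deviation):

import numpy as np
import tensorflow_probability as tfp
tfd = tfp.distributions

step = 5                          # the 6th forecast step
y_step = Y_test[:, step]
# maximum-likelihood unimodal fit: just the sample mean and standard deviation
normal_fit = tfd.Normal(loc=np.mean(y_step), scale=np.std(y_step))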

Fitting a bimodal distribution to the data

Instead of a unimodal Gaussian, we can try to fit a bimodal Gaussian. Since our artificial data is well separated, it isn’t hard to build a distribution model close to the real one.

We estimate the weights of the distributions from the occurrence of the negative and positive paths, and calculate the means and standard deviations of the positive and negative samples. With the MixtureSameFamily class it is very easy to build a mixture distribution that fits our data well, and it would be awesome if we could forecast that distribution with a neural network.
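Continuing the sketches above, the fit could be built roughly like this; splitting the step samples by sign works only because the toy data is well separated:

up = y_step[y_step > 0]           # samples from the up-trending paths
down = y_step[y_step <= 0]        # samples from the down-trending paths

bimodal_fit = tfd.MixtureSameFamily(
    mixture_distribution=tfd.Categorical(
        probs=[len(down) / len(y_step), len(up) / len(y_step)]),
    components_distribution=tfd.Normal(
        loc=[np.mean(down), np.mean(up)],
        scale=[np.std(down), np.std(up)]))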

As you may have foreseen, we can do exactly that :) These networks are called Mixture Density Networks, and here you can read an awesome article about the math behind them: link (I borrowed the style of the histogram graphs from there, thanks Oliver Borchers). In that article you can see how to implement a mixture density layer yourself. Here I will use the MixtureNormal layer from the TensorFlow Probability library.

The fitted bimodal Gaussian mixture distribution.

The Mixture Density Network

This mixture density network will use the MixtureNormal layer, but the other parts of the network are very similar to the non-probabilistic network we used earlier. There are two main differences. First, instead of the Dense output layer we use a MixtureNormal layer. Second, the LSTM layer before the MixtureNormal layer needs to have the proper number of neurons to satisfy the needs of the MixtureNormal, and I set its activation to None because the constraints of the default "tanh" activation are too restrictive for the MixtureNormal parameters.

With real datasets, we don't know how many peaks our distributions can have, and the number of submodels can change depending on the input and the forecast step. Pretending that we don't know the number of peaks, we set the number of component distributions to 3.

The parameter size for the MixtureNormal layer is easy to calculate. We have (3 components) x (10 steps) x (2 parameters of the normal distributions) + 3 component weights = 63, but it is safer to calculate it in the following way.
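The MixtureNormal layer provides a params_size helper for exactly this; a short sketch (the variable names are mine):

import tensorflow_probability as tfp

n_components = 3
event_shape = [10]                # 10 forecast steps
params_size = int(tfp.layers.MixtureNormal.params_size(n_components, event_shape))
print(params_size)                # 63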

We can estimate how probable our data is given our distribution. Log probabilities are more practical for computation, and the negative log probability gives us the loss function we want to minimize. This loss function is very simple to implement when the output of our model is a TensorFlow Probability distribution object.
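A sketch of the loss and the mixture density model under the assumptions above; the negative log-likelihood lambda is the usual TensorFlow Probability pattern, while the layer sizes before the output layer are my assumptions:

import tensorflow as tf
import tensorflow_probability as tfp
from tensorflow.keras import layers
tfpl = tfp.layers

# the model outputs a distribution object, so the loss is simply
# the negative log-probability of the observed target under it
negloglik = lambda y, rv_y: -rv_y.log_prob(y)

inputs = tf.keras.Input(shape=(30, 1))
x = layers.LSTM(32, return_sequences=True)(inputs)
# the last LSTM emits exactly params_size values, with no squashing activation
x = layers.LSTM(params_size, activation=None)(x)
outputs = tfpl.MixtureNormal(n_components, event_shape)(x)
mdn_model = tf.keras.Model(inputs, outputs)
mdn_model.compile(optimizer="adam", loss=negloglik)
mdn_model.fit(X_train, Y_train, epochs=30, batch_size=32)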

In our dataset every example is very similar to the others; they differ only in the noise, so we will examine just the first example from the test set.

Our forecasted distribution consists of different submodules. The parameters of these submodules are our forecasted variables.
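Calling the trained model on the first test example returns a distribution object rather than a tensor (the variable names below come from my sketches, not necessarily from the notebook):

yhat = mdn_model(X_test[:1])      # a MixtureSameFamily distribution, not a plain tensor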

# the components of our mixture model
yhat.submodules
>>>(<tfp.distributions.Independent 'model_mixture_normal_MixtureSameFamily_independent_normal_IndependentNormal_Independentmodel_mixture_normal_MixtureSameFamily_independent_normal_IndependentNormal_Normal' batch_shape=[1, 3] event_shape=[10] dtype=float32>,
<tfp.distributions.Categorical 'model_mixture_normal_MixtureSameFamily_Categorical' batch_shape=[1] event_shape=[] dtype=int32>,
<tfp.distributions.Normal 'model_mixture_normal_MixtureSameFamily_independent_normal_IndependentNormal_Normal' batch_shape=[1, 3, 10] event_shape=[] dtype=float32>)

One of the submodules describes the (3, 10) normal distributions fitted to our data. We will check the 6th step as we did earlier. We can see that the first two means are very close to the means of our real component distributions, and the third is close to zero.
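One possible way to read these means out of the returned distribution, producing the shape and 6th-step values shown below:

means = yhat.components_distribution.mean()   # shape (1, 3, 10): 3 components x 10 steps
print(means.shape)
print(means[:, :, 5].numpy())                 # component means at the 6th step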

(1, 3, 10)
[[-0.26222986 0.24365899 0.01705011]]

The other submodule of interest is the Categorical distribution, which contains the weights of the components: [0.29453883 0.6899422 0.01551905]. The first two weights are close to our 35% and 65%, and the third is practically negligible. The model was able to recognize that we have only two real components.
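The weights can be read from the categorical part in a similar way (again, just one possible access path):

weights = yhat.mixture_distribution.probs_parameter()   # shape (1, 3)
print(weights.shape)
print(weights.numpy())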

(1, 3)
[[0.29453883 0.6899422 0.01551905]]

In the graph below the line widths are determined by the component weights. As we expected, the upper trend is stronger, but the lower trend is apparent as well, and the third component is almost invisible.

The weighted means of the component distributions at every forecast step

The components with larger weights have small standard deviations, while the third component's is relatively large. Along with its small weight, this further confirms that our third component is redundant. If we face such a component, we should consider dropping it or retraining our model with fewer components.
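The standard deviations at the 6th step can be inspected the same way:

stddevs = yhat.components_distribution.stddev()   # shape (1, 3, 10)
stddevs[:, :, 5].numpy()                          # standard deviations at the 6th step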

array([[0.04764858, 0.04561024, 0.5772563 ]], dtype=float32)

Next, we will rebuild the forecasted distribution of the 6th step and compare it to the real distribution of the test set. The forecasted distribution fits the data well. Tuning the model could probably result in an even better fit.
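A sketch of rebuilding the 6th step as a standalone one-dimensional mixture from the parameters extracted above, which can then be compared with a histogram of Y_test[:, 5]:

step_dist = tfd.MixtureSameFamily(
    mixture_distribution=tfd.Categorical(probs=weights[0]),
    components_distribution=tfd.Normal(loc=means[0, :, 5],
                                       scale=stddevs[0, :, 5]))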

The forecasted Gaussian Mixture distribution and the test data

Probabilistic forecast visualization

With non-probabilistic neural networks, we get only one number per variable. With probabilistic models we can generate as many random forecast scenarios as we want, examine the mean of the distribution (which is comparable to the non-probabilistic result), and examine the submodule means in a multimodal case. This can be seen in the figure below. We didn't drop our underweighted submodule, and because of that we got some very random forecast paths.
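A sketch of how these quantities can be drawn from the forecast distribution:

paths = yhat.sample(100)                               # 100 random forecast paths, shape (100, 1, 10)
overall_mean = yhat.mean()                             # shape (1, 10), comparable to the MSE forecast
component_means = yhat.components_distribution.mean()  # shape (1, 3, 10), one mean path per component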

Forecasted distribution samples and the forecasted means

End of the prologue

Here we saw the power of probabilistic neural networks. If other ideas don't seduce me, I will examine the financial time series mentioned earlier. In an upcoming article or articles, I will use mixture density networks to build a more complex model to forecast multi-dimensional financial time series. Instead of LSTM layers, I plan to build a WaveNet model or some other CNN-based architecture, and to use non-Gaussian distributions. According to my current idea, the data will consist of different USD pairs and regular economic news. My focus will be on the effect of economic news and on distribution peaks that occur far from each other. I don't plan to use very large datasets, and I don't expect real-life forecasting power from such a small dataset. My only goal is to build techniques that can later be used with models of better forecasting power.

Thanks for reading. If you have any remarks, criticism or ideas you want to share, write in the comments or send a message on LinkedIn. https://www.linkedin.com/in/istvanveber/

Update

In this follow-up article, I combine the WaveNet model with a probabilistic output and predict financial data: Wavenet variations for financial time series prediction: the simple, the directional-Relu, and the probabilistic approach.
