Bayesian Time Series Analysis at Scale

Ryuta Yoshimatsu
Jul 1, 2023


Introduction to dynamic linear models, use cases and implementation.


Introduction

The Bayesian approach to inference gives us measures of uncertainty around estimates, whereas the Frequentist approach gives us point estimates that maximise the likelihood. With the Bayesian approach, we can also incorporate our prior beliefs about the model: if those assumptions are good, we can get better estimates with fewer data points. The Frequentist alternative makes no such prior assumptions and relies solely on the data at hand. These differences often lead to a discussion about which method to use for a given use case.

One important application of statistical inference is time series analysis, which is relevant in virtually all industrial settings. A better understanding of time series data allows us to adapt to changing market conditions, allocate resources effectively and make strategic decisions about our operations. In this article, I discuss the Bayesian approach to time series analysis.

This article has three parts. First, I will briefly introduce the theory of dynamic linear models, a mathematical framework for formulating time series models in which Bayesian techniques can be used to estimate the model parameters. I do not intend to give a comprehensive introduction to the theory, but I will point out some mathematical features that distinguish the Bayesian approach. In the second part, I will discuss use cases in which I think Bayesian techniques shine more than the Frequentist alternatives. The final part is about implementation: I will discuss state-of-the-art techniques for scaling these solutions.

Dynamic Linear Model

This is a simple linear regression model that we have all seen:

yᵢ = θ₀ + θ₁Xᵢ + eᵢ

where i signifies the iᵗʰ data point, θ are the model coefficients and e is the noise term. Now, this is a dynamic linear model:

yₜ = θ₀,ₜ + θ₁,ₜXₜ + eₜ

The difference between the two models is that the parameters in the dynamic linear model are time dependent, which we can see in the subscripts t. This means that the coefficients of the regressors can change over time, and this ability to change over time is where the name dynamic comes from. If we use the Bayesian approach to estimate these parameters, we obtain a probability distribution for each parameter at every time step.

Being able to model the evolution of the parameters sounds great, but there is one problem with this formulation. We usually collect only one data point at a time. At time t there is a single observation, and with one data point we can estimate only one parameter, with no notion of uncertainty around that estimate. Since we usually have more than one parameter in our model and we want to quantify uncertainties, this is a big problem.

We can overcome this by adding a constraint to the parameters: we make the state at time t depend on the state at the previous time step, t-1.

θₜ = G θₜ₋₁ + ωₜ

θ is a state vector and it holds the state components that additively contribute to the observed time series. For example, in the simplest case, the components can be an intercept and regression coefficients expressing a simple linear regression. We define the state vector based on our use cases. Usually, in industrial time series applications, trend, seasonality, external regressor and autoregressive terms are used as components of the state vector.

G is a transition matrix. It applies a linear transformation to the state vector and yields the state vector of the next time step. G is usually well defined from knowledge of the system, sometimes expressed as a set of ordinary differential equations (e.g. Newton’s equations of motion). We will discuss this more in the use case section.

ω is a noise term added to the state equation, allowing the state components to diffuse over time. The distribution we choose for this noise determines the likelihood function in the Bayesian framework. When we specify the noise to be Gaussian, the model reduces to the class of normal dynamic linear models. In this article, we assume the noise terms are Gaussian with mean zero and a non-zero covariance matrix.

To link the state vector to the observation, we have the second equation:

yₜ = F θₜ + νₜ

y is the observation and F is an output vector that maps the state vector to the observation. ν is the added Gaussian noise on the observation level. F is usually well defined based on the use case.

This set of two equations is called the state-space formulation of a normal dynamic linear model. The first equation is the state equation that describes the dynamics of the hidden states, and the second equation is the observation equation, which projects the hidden states to the observation.

The intuition behind this set of equations is the following. Suppose we knew the state of the system at time t-1. By knowing, I mean we know the distributional properties of the state components. If the distributions were Gaussian, these properties would be the mean vector θₜ₋₁ and the covariance matrix Pₜ₋₁. We apply the transition matrix G to the state vector and the covariance matrix and get the estimates at time t:

θ̂ₜ = G θₜ₋₁
P̂ₜ = G Pₜ₋₁ Gᵀ + W

We then apply the output vector F to the estimated state and get the estimates of the observation y and the observation variance s at time t:

ŷₜ = F θ̂ₜ
sₜ = F P̂ₜ Fᵀ + v

This is how we get a one-step-ahead forecast / prediction using this framework. To get a multistep-ahead forecast, you repeat this process multiple times.

Throughout the day or the week or whatever our time unit is, we will eventually collect that observation, the true yₜ. What we do then is update (or correct) the estimate of the state using this observation. To do that, we first compute a relative magnitude between the variance in the observation and the covariance in the system:

Kₜ = P̂ₜ Fᵀ / (F P̂ₜ Fᵀ + v) = P̂ₜ Fᵀ / sₜ

Then we add this fraction of the difference between the observation and its estimate to the estimated state as a correction:

θₜ = θ̂ₜ + Kₜ (yₜ − ŷₜ)
Pₜ = (I − Kₜ F) P̂ₜ

The key idea here is that if the variance in the observation equation is larger, we rely more on the previous estimate, because our measurement might be corrupted. If the system noise has the larger variance, then we put more trust in the observation. This process is called filtering.

If we assume that both of these noise terms are normally distributed with mean zero and uncorrelated, and that G and F are linear, then this filtering procedure is called the Kalman filter. The ratio between the two variances computed above, Kₜ, is called the Kalman gain.
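
To make the predict-and-correct cycle concrete, here is a minimal NumPy sketch of one Kalman filter step, following the equations above. This is my own illustrative code (not taken from any particular library), and it assumes a univariate observation with F as a 1 x n row vector.

```python
import numpy as np

def kalman_step(theta_prev, P_prev, y_obs, G, W, F, v):
    """One predict-and-correct cycle for a univariate observation.
    theta_prev: state mean at t-1 (length n), P_prev: state covariance (n x n),
    G: transition matrix, W: state noise covariance,
    F: output row vector (1 x n), v: observation noise variance (scalar)."""
    # Predict: push the state mean and covariance one step forward
    theta_pred = G @ theta_prev
    P_pred = G @ P_prev @ G.T + W

    # One-step-ahead forecast of the observation and its variance
    y_pred = (F @ theta_pred).item()
    s = (F @ P_pred @ F.T).item() + v

    # Kalman gain: relative weight given to the new observation
    K = P_pred @ F.T / s

    # Correct (filter): pull the state towards the observed forecast error
    theta_new = theta_pred + (K * (y_obs - y_pred)).ravel()
    P_new = (np.eye(len(theta_prev)) - K @ F) @ P_pred
    return theta_new, P_new, y_pred, s

# Example with a two-component state (local level + slope); numbers are made up
G = np.array([[1.0, 1.0], [0.0, 1.0]])
W = 0.01 * np.eye(2)
F = np.array([[1.0, 0.0]])
theta, P = np.zeros(2), np.eye(2)
theta, P, y_pred, s = kalman_step(theta, P, y_obs=1.3, G=G, W=W, F=F, v=0.5)
```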

As mentioned above, F, G and the structure of the state vector θ are given by the system. The noise terms ω and ν (since we assume Gaussian noise, essentially their variances W and v) are specified according to how much we know about that system.

For example, let’s suppose W is zero, G is an identity matrix and F is made up of regressors X₁, X₂ and X₃. The state vector is then static and the set of equations reduces to a simple linear regression, with the coefficients fixed because W is zero. If F is instead made up of y at previous time steps, yₜ₋₁, yₜ₋₂ and yₜ₋₃, we get an autoregression of order 3.
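
As a quick illustration of that reduction, here is roughly what those choices look like in NumPy (the regressor values and coefficients are made-up placeholders):

```python
import numpy as np

# Made-up regressor values at time t and fixed coefficients (placeholders)
x1_t, x2_t, x3_t = 0.4, 1.2, -0.7
theta = np.array([0.5, -0.1, 2.0])

G = np.eye(3)                       # identity transition: the state is carried over unchanged
W = np.zeros((3, 3))                # zero state noise: the coefficients cannot drift
F = np.array([[x1_t, x2_t, x3_t]])  # output row built from the regressors at time t

y_hat = (F @ theta).item()          # the one-step mean is just a linear regression prediction
# For an AR(3) instead, F would hold the three previous observations y(t-1), y(t-2), y(t-3).
```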

We can assign values to the variances if we know what they should be; otherwise, we use the Bayesian method to estimate them. Let’s say the observational variance v is unknown but constant (homoscedastic). We choose a prior distribution and use Bayes’ Theorem to get the posterior distribution for v. A common choice of prior is the inverse gamma distribution, because it forms a conjugate pair with the normal likelihood. This means we get a closed form solution for the posterior and don’t need to sample or approximate. If v is not constant (heteroscedastic), we can parametrize it and give a prior distribution to each parameter. Again, depending on the choice of these distributions, we might get a closed form solution or we might have to sample or approximate. Sampling is typically done with Markov Chain Monte Carlo (MCMC) based algorithms, and approximation with Variational Inference (VI) techniques.
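
To make the conjugacy point concrete, here is a minimal sketch of the closed-form update for an unknown, constant observation variance with an inverse gamma prior. These are standard textbook formulas, not code from any particular library.

```python
import numpy as np

def update_inverse_gamma(alpha, beta, residuals):
    """Conjugate update for the observation variance v.
    Prior: v ~ InverseGamma(alpha, beta); likelihood: residuals ~ Normal(0, v).
    Returns the posterior hyperparameters in closed form, no sampling required."""
    residuals = np.asarray(residuals)
    alpha_post = alpha + len(residuals) / 2.0
    beta_post = beta + 0.5 * np.sum(residuals ** 2)
    return alpha_post, beta_post

# Example: posterior after observing three forecast errors
alpha_post, beta_post = update_inverse_gamma(2.0, 1.0, [0.3, -0.5, 0.1])
```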

Wrap Up

That was the introduction to dynamic linear models. The three key takeaways are:

  • Parameters can be dynamic.
  • You get the probability distribution of the parameters.
  • You can select your prior and likelihood distributions.

Use Cases

Let’s talk about use cases. I would like to present three. This is certainly not an exhaustive list, and I would be happy to hear about other use cases that could benefit from the Bayesian framework in the context of time series analysis. Please leave your comments below!

Dynamical Time Series

The first use case is when the underlying process that generates the time series changes over time: a time series whose trend, seasonality, effects from the covariates, or even autoregressive structure changes. The dynamic linear model may then be a good option, as it can adapt to those changes.

In this notebook, I synthesized a time series with a change point at t=150. At this point, the autoregressive structure changes from order 2 to order 1 and the coefficient on the order 1 term changes as well. The linear trend is inverted from negative to positive. The seasonality stays constant throughout.

I use the python library PyMC to perform the inference. The details of the prior and likelihood specifications can be found in this notebook.
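
The notebook builds the full model with trend, seasonal and autoregressive components. The snippet below is only a minimal sketch of the general idea in PyMC, using a Gaussian random walk as a time-varying level; the data, variable names and priors here are illustrative stand-ins, not the notebook’s specification.

```python
import numpy as np
import pymc as pm

y = np.random.default_rng(0).normal(size=300).cumsum()  # stand-in for the synthetic series

with pm.Model() as dlm:
    # State equation: the level follows a Gaussian random walk, so it can drift over time
    sigma_state = pm.HalfNormal("sigma_state", 0.5)
    level = pm.GaussianRandomWalk("level", sigma=sigma_state, shape=len(y))

    # Observation equation: y_t = level_t + noise
    sigma_obs = pm.HalfNormal("sigma_obs", 1.0)
    pm.Normal("obs", mu=level, sigma=sigma_obs, observed=y)

    idata = pm.sample()  # NUTS (a Hamiltonian Monte Carlo variant) by default
```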

We can see that the forecast extrapolates the latest structure of the autoregressive and trend components. This would be different had we maximized the likelihood (taken the Frequentist approach), because that would try to find a single fixed set of coefficients that best describes the entire history. And it is not just the autoregressive and trend structures: the model can also adapt to changing seasonality, time-variant covariate effects and heteroscedasticity (volatility).

This way of modeling comes in handy if we have a time series whose structure changes irregularly. With the maximum likelihood approach, we would have to inspect the time series first and decide at which point to truncate it so that the model captures only the latest structure. With the Bayesian approach, there is no need to do this.

The flip side of this flexibility is that the model will only forecast accurately until the next change occurs. In other words, the model will not predict the changes themselves; it adapts to them quickly once they happen. For this reason, when dynamic linear models are used for forecasting, the horizon is usually kept short.

Real-Time Anomaly Detection of Time Series Data

The next use case is monitoring and tracking of time series data.

The Bayesian update rule gives us a probability distribution around a one-step-ahead forecast conditioned on the latest observation. Once a new observation is collected, we can compute the probability of observing that or a more extreme value using this distribution. If the probability is below a predefined threshold, we flag this point as a potential anomaly. To do this in real time, we use a conjugate prior for our likelihood, which gives us a closed form solution for our posterior predictive distribution and hence significantly reduces the time to generate a forecast.
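
As a small illustration of that flagging rule (not the notebook’s code), here is a sketch that assumes the one-step-ahead predictive distribution is normal with a known mean and variance:

```python
import numpy as np
from scipy import stats

def is_anomaly(y_obs, pred_mean, pred_var, threshold=0.001):
    """Flag the observation if the probability of a value at least this extreme
    under the (normal) one-step-ahead predictive distribution falls below the
    threshold (0.1% here, the value used in the example below)."""
    z = abs(y_obs - pred_mean) / np.sqrt(pred_var)
    p_tail = 2 * stats.norm.sf(z)  # two-sided tail probability
    return p_tail < threshold
```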

The time series below (solid blue line) shows the number of taxi passengers in NYC from July 2014 to January 2015. The raw data comes from the NYC Taxi and Limousine Commission and the processed data from the Numenta Anomaly Benchmark. The processed dataset consists of the total number of taxi passengers in NYC aggregated into 30-minute buckets. Notable anomalies during this period include the NYC Marathon (November 1), Thanksgiving (November 27), Christmas (December 25), New Year’s Day (January 1), and a snow storm (January 27). The time series has prominent daily and weekly seasonality.

We model this time series and identify potential anomalies in real time. I used the python library PyDLM to specify a normal dynamic linear model with daily and weekly seasonality and autoregressive components (see this notebook for the details). The observational variance is assumed to be constant but unknown, and for the system variances we use the variance / covariance discounting method.
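
The notebook builds this with PyDLM’s component builder. The snippet below is a rough sketch of that kind of specification, with illustrative discount values and a random stand-in series; argument names can vary slightly between PyDLM versions, so treat it as a starting point rather than a drop-in.

```python
import numpy as np
from pydlm import dlm, trend, seasonality, autoReg

y = np.random.default_rng(1).poisson(10000, size=2000).tolist()  # stand-in for 30-minute passenger counts

model = dlm(y)
model = model + trend(degree=0, discount=0.99, name="level")            # local level
model = model + seasonality(period=48, discount=0.995, name="daily")    # 48 half-hour slots per day
model = model + seasonality(period=336, discount=0.995, name="weekly")  # 336 per week
model = model + autoReg(degree=2, discount=0.99, name="ar2")            # short autoregressive memory

model.fit()  # model.predict() then gives the one-step-ahead mean and variance
```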

The red dots overlaid on the time series above are observations with a probability below 0.1% according to the one-step-ahead forecast distribution. Note that the distribution is calculated on the fly at each step, when a new observation is collected. We can see that some anomalies were accurately identified (the NYC Marathon, Christmas, New Year’s Day, the snow storm) while others were not (Thanksgiving), and some flagged points could be false positives. We also see many red dots at the beginning of the time series, when the model is still learning the parameters.

Well-Defined System Dynamics

The last use case is when we don’t have that many data points but we have a good model to describe the system dynamics.

One example of this is epidemic modeling. At an early stage of an epidemic, we usually don’t have enough data at hand to specify all the components needed to model its evolution. But epidemic spread is a phenomenon that has been studied extensively in fields like applied mathematics and physics, and there are many models that capture its dynamics.

These models are usually expressed as a set of coupled differential equations, which means you can derive the state of the system at the next time step from the state at the current time step, given the model parameters. In the case of the SIR model, the state vector has the following components: the size of the susceptible population (S), the size of the infected population (I) and the size of the recovered population (R). The model parameters are λ and μ, the rate of infection and the recovery rate, respectively, and f is the fraction of the population who recover from the disease.

We can choose reasonable assumptions on the prior distributions of the parameters using our domain knowledge or previous studies. Note that the parameters are time dependent, and we use the Bayesian framework to estimate their distributions at each time step. When discretized, this set of coupled differential equations gives us the transition matrix. The observation is the number of infected (I), and the error term that defines the likelihood function is chosen based on domain expertise. Read this blog post for more details on modeling epidemic spread with Bayesian inference and see this repo for the implementation.
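
To illustrate the discretization step, here is a minimal sketch of one Euler update of the textbook SIR equations. This is the common formulation; the exact model in the linked post, including how the recovery fraction f enters, may differ.

```python
def sir_step(S, I, R, lam, mu, dt=1.0):
    """One Euler-discretized step of a standard SIR model.
    lam: infection rate, mu: recovery rate, dt: time step."""
    N = S + I + R
    new_infections = lam * S * I / N * dt
    new_recoveries = mu * I * dt
    return S - new_infections, I + new_infections - new_recoveries, R + new_recoveries

# Example: propagate the state one day forward (made-up numbers)
S, I, R = sir_step(S=999_000, I=1_000, R=0, lam=0.3, mu=0.1)
```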

The same principle applies to object tracking algorithms. In that case, the state vector consists of the position, the velocity and the acceleration of the object, and the transition matrix can be derived from Newton’s equations of motion.

Wrap Up

Just to wrap up this section: there may well be other powerful use cases for dynamic linear models, but the ones I recommend are:

  • When the underlying data generation process changes and you want to capture that change.
  • When you want to monitor, track and detect anomalies in a time series in real time.
  • When you have a good understanding of the system dynamics and the distribution of the priors and likelihood.

Implementation

Let’s talk about implementation.

There are many tools that implement dynamic linear models. In this blog post, we saw the python packages PyMC and PyDLM, but there are many others. When deciding which tool to use, one key point of comparison is the fitting algorithms they support. In most applications, your posterior distribution won’t have a closed form, so you will have to perform either MCMC-style sampling or VI-style optimization to estimate the posterior distributions, and it’s important to know what these algorithms are and which ones are supported by which tools.

MCMC is a sampling technique that you can use to estimate the posterior distribution of your parameters. Given enough samples it converges to the target distribution, but it is expensive and difficult to scale beyond a single node. VI approximates the target distribution with a member of a predefined family of well-known distributions and fits it with an iterative descent algorithm. It is generally faster and it does scale, but it is not guaranteed to converge to the target distribution.
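
In PyMC, for example, the two routes look roughly like this. The model is a deliberately tiny stand-in; the point is only the choice between pm.sample (MCMC) and pm.fit (VI).

```python
import numpy as np
import pymc as pm

y = np.random.default_rng(0).normal(size=200)  # stand-in data

with pm.Model():
    mu = pm.Normal("mu", 0, 10)
    sigma = pm.HalfNormal("sigma", 1)
    pm.Normal("obs", mu=mu, sigma=sigma, observed=y)

    idata_mcmc = pm.sample(1000)           # MCMC route: NUTS/HMC, asymptotically exact, slower
    approx = pm.fit(20000, method="advi")  # VI route: fast approximation of the posterior
    idata_vi = approx.sample(1000)         # draw from the fitted approximation
```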

There have been quite a few improvements to these two methods, and some popular libraries now implement the improved versions. The table below shows three python packages that are widely used for Bayesian time series analysis. PyMC’s default sampler is Hamiltonian Monte Carlo (HMC, a gradient-based variant of MCMC), which speeds up the sampling process significantly; parallel sampling using multiple CPUs or a single GPU is also supported. Pyro uses an optimized version of VI (stochastic variational inference) to speed up the fitting process. TFP supports both HMC and stochastic VI, and takes advantage of GPUs and TPUs for processing multiple time series.

Finally, a few words about deployment. For a time-sensitive use case like tracking or monitoring, you would want to specify a model with a conjugate prior and likelihood pair to avoid sampling or optimisation. You can then build a real-time or streaming pipeline that updates the closed-form posteriors as new data arrives. For non-time-sensitive use cases, you can choose any distributions for your prior and likelihood, and the fitting algorithm will give you the posteriors conditioned on the entire dataset.
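
As a minimal sketch of that streaming idea (made-up numbers, reusing the closed-form inverse gamma update shown earlier): keep the posterior hyperparameters in memory and update them as each observation and its one-step-ahead forecast arrive.

```python
alpha, beta = 2.0, 1.0                      # inverse gamma prior hyperparameters (illustrative)
incoming = [(102.0, 100.0), (98.5, 99.0)]   # (observation, forecast mean) pairs, stand-in for a stream
for y_new, forecast_mean in incoming:
    resid = y_new - forecast_mean
    alpha += 0.5                            # conjugate update, one observation at a time
    beta += 0.5 * resid ** 2
posterior_variance_mean = beta / (alpha - 1)  # posterior mean of the observation variance
```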

Big Wrap Up

Let me wrap up this blog post.

Dynamic linear models:

  • Are good for modelling time series whose underlying process changes.
  • Give you a distribution around your estimates.
  • Let you specify the prior and likelihood distributions based on your domain knowledge.

Use cases that are a good fit for this approach are:

  • Short term forecasting of dynamic time series.
  • Real-time tracking and monitoring.
  • Modeling time series whose system dynamics you understand well.

Some powerful implementations include PyMC, Pyro and TFP. Which library to choose depends on which fitting algorithms they support, and on how accurately and quickly each of them delivers the target distribution.
