Continuous-Time HMMs

Field Cady
7 min readMar 31, 2022


In previous posts I have discussed Hidden Markov Models, one of the simplest but most powerful tools for working with sequential data. I showed how you can use them to decode a stream of noisy data to extract the underlying signal, and also how they can be trained to data even if you don’t know the underlying signal. This blog post will discuss an important variant of HMMs called continuous-time hidden Markov models (CT-HMMs).

In an HMM the world proceeds in discrete time steps, like the words on a page or an update that happens at regular intervals. But in many real situations time flows continuously, and things can happen at any moment. Think of a medical context, where a patient can go between healthy and sick at any time, and we get observations wheneveer (and only when) they are examined by a medical professional.

Going from discrete timesteps to continuous time with irregularly-spaced events is conceptually a modest change. And indeed, the problems that people solve with CT-HMMs are basically the same as with HMMs. However the technical details for CT-HMMs become a bit more subtle in practice, and they require some care to work with.

Continuous-Time Markov Models

Before we get into the observations in a CT-HMM, let’s spend a minute on the key part of it that differs from HMMs: the world behaving as a continuous- versus a discrete-time (i.e. normal) Markov chain. In each case there is a (usually) finite collection of states the world can be in, and at every time t it is in state S(t). The only difference is in the time titself: in a normal Markov chain t is an integer ranging from 0 to infinity, but in a continuous-time chain t is a real number ≥ 0.

The lynchpin assumption in either Markov chain — the one that everything else follows from mathematically — is that it’s “memoryless”; if you know S(t) for certain then the states before t tell you nothing about what happens after t. In the discrete case this means that at time t we select a weighted dice associated with S(t), roll it, and the dice tells us what S(t+1) will be. In this way S(t+1) depends probabilistically on S(t), but it does not depend on anything that came before (except insofar as it influences S(t)).

Mathematically a Markov chain is characterized by the “transition matrix” T, where T[i,j] is the probability of going from state i →j in one step (i and j may be the same state). For our purposes it will be easier to talk about knowing, for every state s:

  1. how long on average the system stays in s, and
  2. when it changes, the probability of going to every other state s’!=s.

The continuous-time equivalent is a tad harder to visualize. Imagine that you get one cold per year on average (and ignore complications like long-term fluctuations in your lifestyle or the seasons). If you got over your last cold 3 months ago you are not “in the clear for a while”, and if it’s been three years you are not “due for a cold”: your chance of bumping into a sick person and contracting a virus in the near future is the same (When you do get sick the illness tends to follow a schedule, but for our purposes imagine that it too is memoryless). In a continuous-time Markov chain we know, for each state s,

  1. how long the system stays in s on average, and
  2. when it changes, the probability of going to every other state s’!=s.

This is the same as for a normal Markov chain! The only difference is that the duration in state s can be any real number, not just integers.

The time spent in state s for a normal Markov chain follows what’s called a geometric distribution. For continuous-time chains the amount of time you spend in a particular state is an “exponential distribution”; it’s the continuous-time analog of a geometric distribution. You can visualize the geometric distribution here:

The time spent in a particular state in a normal Markov chain is geometrically distributed

In each case the distribution is parameterized by a single number: the average amount of time in a given state.

Personally I like to think about continuous-time Markov chains as “fine grained” discrete Markov chains. If you tend to stay healthy for 1 year on average, imagine breaking time into 1-week chunks, and have a Markov chain that stays in the “healthy” state for 52 timesteps on average. Or break time into 1-day chunks, and use a Markov chain that stays healthy for 365 steps; in the limit of taking finer partitions this becomes a continuous-time Markov chain.

Let’s talk math for a minute. In a normal Markov chain we use the transition matrix T for nuts-and-bolts calculations. If x(0) is a vector giving the probability of being in each state at time 0, then the probabilities n steps later are given by

For a continuous-time model we speak of the so-called “Q-matrix”. Q[i,j] indicates the rate at which probability “flows” from state i →j. After an infinitesimal amount of time dT the probability vector x(0) goes to

For finite time t this becomes:

where we are using the “matrix exponential” that you might have seen in a linear algebra class (if not, don’t worry about it). In the limit of small t, the exponential just becomes the identity matrix I and x(t) becomes x(0).

T and Q are necessary if you want to do calculations. But for purposes of intuition I suggest the equivalent formulation of 1) how long the world stays in s on average, and 2) when it does change, how likely each other state is to be next.

Generalizing the Algorithms

Continuous-time models invite a lot of pathologies that can’t occur in normal Markov models, like having multiple state transitions between two observations. They also require sophisticated math like matrix exponentials. So it is perhaps surprising that generalizing the algorithms to the continuous case is almost trivial.

Take the Viterbi algorithm, for example, which computes the single highest-probability sequence of states for a sequence of observations. The core operation is to fill up an array Sigma, where Sigma[i,s] is the probability (technically the log-probability) of the best sequence for the first i observations that ends in state s. We compute Sigma[i,s] by looking at the best option for state i-1. In our algorithm this is:

For a continuous-time model this is identical, except that Pr[x →s] is no longer T[x,s]. Instead it is

In effect what we have is simply a HMM where the transition matrix varies from one step to another. This gives us the optimal state labels at the observation times, gracefully side-stepping messy issues like when exactly the state transitions occurred.


There is one important use case that specific to CT-HMMs. In a HMM the entire world proceeds in discrete timesteps; knowing the state of the system at those times tells you all there is to know. But in a CT-HMM there is time between observations, and we want to be able to guess what’s happening in there too.

Using the equation above for x(t) it is possible to interpolate from any guess arbitrarily far forward in time. Without getting into the matrix algebra details, it is also possible to interpolate backward.

Given a sequence of observations O¹, O²,… at times t¹, t², … the forward-backward algorithm gives us the likelihood of each state at t¹, t², .… If we want to guess the state at time t we can find the observations right before/after it, interpolate forward/backward from them, and take a weighted combination to arrive at a single best guess.

Parting Words

In my previous posts I argued that a major advantage of HMMs (in contrast to other tools you might try) is that their simplicity makes them easy to hack. I can think of no better example of this than CT-HMMs! The limitations of HMMs are clear when your observations are irregularly spaced, and the fix ends up being just an HMM with a non-uniform transition matrix.

As with normal HMMs, CT-HMMs paint a simple picture of the world. This means that you can understand why they behave the way they do, think critically about how true their underlying assumptions are, and patch any shortcoming in a way that makes mathematical sense (rather than throwing stuff at a wall and seeing what sticks — the standard approach if you’re using a neural net). They are straightforward to implement, easy to hack, and they deserve a place in any analytics toolbox.




Field Cady

Field Cady has written two books on data science, and consulted for companies of all sizes. His background is in mathematical physics, especially stochastics.