# Time series trend detection with Bayesian methods

In this post, we’ll create a do-it-yourself procedure to detect trend changes in time series data. We’ll take ideas from the well-known Prophet library, and reimplement them using PyMC, a general tool for Bayesian inference.

Time series analysis is a vast topic, and there are many powerful libraries one can use. However, relatively few people are aware of the strengths of these modern methods, or of their limitations. The primary goal of this post is to use a simple example and a simple model to develop the basic intuition.

# Motivation

Imagine that one Monday morning, you decide to change your life and exercise for 30 minutes each day. As you continue to stick to your resolution, your phone keeps score, and eventually reports a trend change in your exercise minutes, celebrating it with a chart.

According to the chart, there was a constant trend for 16 weeks, and then it suddenly changed to another constant trend for the most recent 10 weeks. This is a simple model, and these notifications only serve to keep you entertained, but a very similar story is likely in a business setting:

- There are metrics that tend to change in a stepwise fashion: they sometimes jump as you release new features or change business processes, and then stay at the new value until the next change.
- You want large trend changes to be detected and reported, in case they are unexpected and negatively impact the business
- You don’t want minor changes to be reported, as otherwise investigating them costs more than any benefit they bring

Let’s start with the following fictional data for four weeks:

It’s hard to reason from a blank slate, so we’ll first need to propose a data generation process that will limit the models we consider. Suppose that, from our knowledge of the metric, we can assume the following:

- Individual measurements are normally distributed around a trend
- The trend may change only on Monday
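These two assumptions can be sketched as a tiny data generator (the trend levels of 10 and 15 and the noise scale are invented here for illustration; the actual dataset is in the companion notebook):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical weekly trend levels; the trend is constant within
# a week and may change only on Monday
weekly_trends = [10, 15, 15, 15]

# Seven daily measurements per week, normally distributed
# around that week's trend
data = np.concatenate(
    [rng.normal(loc=trend, scale=10, size=7) for trend in weekly_trends]
)
print(data.shape)  # (28,)
```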

# Classical probability model

The first week of data already has clear trend, so let’s focus on the second week and apply classical statistics to see if there was a trend change. We start with a null hypothesis that the trend did not change, and use a statistical test to probe this hypothesis.

```python
import scipy.stats

# data is an array with 28 measurements
scipy.stats.ttest_1samp(a=data[7:14], popmean=10)
# Ttest_1sampResult(..., pvalue=0.0028212592801565937)
```

Note how easy this is to implement: we provide the trend from the first week and the data from the second week, and get a number evaluating our hypothesis. In this case, the p-value of about 0.003 is pretty low, and we should reject the null hypothesis. As for the new trend, the maximum likelihood estimate (MLE) is simply the mean of the second-week data.

Sounds like we can wrap up this post, having found a one-liner with no parameters that answers our questions. Sadly, the simplicity is misleading:

- Classical statistics is only applicable in specific cases, and considerable knowledge and care are needed to apply it correctly. Even above, I should have first applied another test to check that the data really is normally distributed, and if that test fails, I have nowhere to go.
- We have no parameters to control this model. Even if we have prior information about how likely a trend change is, we can’t encode it.
- If we have multiple related metrics, there is no way to encode their relation
- We can only make a binary decision: either keep the old trend, or completely forget the past and estimate the new trend from the new week’s data alone

# Bayesian model

With Bayesian inference, you can overcome the above limitations:

- It uses the computer to brute-force an answer: you specify which data is randomly generated and from what family of distributions, and the computer tells you what parameters of the distributions would make sense.
- If you have prior assumptions about the parameters of the distributions, you can specify them too
- Parameters, in turn, can be hierarchically connected to other random variables
- In the end, you get not just a single number, but a distribution of likely values. Or, if you still need a single number, it can be the most likely value, not necessarily one of two options.

Let’s create a simple model for our case, essentially writing down the data generation process we discussed earlier:

```python
import pymc as pm

with pm.Model() as model:
    # Week 0 has a clear trend
    trend_0 = 10

    # On each Monday, there is a normally-distributed trend change;
    # sigma controls how likely a trend change is
    changes = pm.Normal("changes", mu=0, sigma=10, shape=3)

    # Week 1 data is normally distributed around its trend;
    # sigma here controls how much variance we have inside a week
    trend_1 = trend_0 + changes[0]
    actual_1 = pm.Normal("actual 1", mu=trend_1, sigma=10,
                         observed=data[7:14])

    # Weeks 2 and 3 are similar
    trend_2 = trend_1 + changes[1]
    actual_2 = pm.Normal("actual 2", mu=trend_2, sigma=10,
                         observed=data[14:21])
    trend_3 = trend_2 + changes[2]
    actual_3 = pm.Normal("actual 3", mu=trend_3, sigma=10,
                         observed=data[21:28])

    # Compute most credible values for trend changes
    map_estimate = pm.find_MAP()

# {'changes': array([ 5.41521218,  1.32171058, -0.51981033])}
```

We’ve got back 3 numbers — the amount of change that happened on each Monday. Let’s plot them.
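In code, the plotted per-week trend levels are just cumulative sums of these changes on top of the initial trend of 10:

```python
import numpy as np

# MAP estimates of the Monday changes, copied from the output above
changes = np.array([5.41521218, 1.32171058, -0.51981033])

# Week 0 starts at trend 10; each later week adds its Monday change
trends = 10 + np.concatenate([[0.0], np.cumsum(changes)])
print(np.round(trends, 2))  # week trends: 10, 15.42, 16.74, 16.22
```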

This is about right. However, while the first trend change, estimated as 5.4, is clearly worth investigating, the change between the second and third week, estimated as 1.3, is probably not very significant, and we should not send a “there has been a trend change” email to management. Let’s see if we can “dampen” our model; after all, support for prior beliefs is one of the benefits of Bayesian inference.

Looking again at the code above, we see that the Monday changes are modelled as normally distributed random variables, with sigma controlling how likely a change is. Let’s reduce it from 10 to 3, in the hope that this will eliminate the unnecessary trend changes, and plot the new trends:

We made our model less willing to assume large trend changes, but the end result is even worse: we now have three trend changes of comparable magnitude. We need a smarter approach, and for that, we need to recall regularized linear regression.

# Regularized linear regression

Probably most readers are familiar with the linear regression, which comes in three main flavours — OLS, Ridge, and Lasso. Their formulations are shown below.
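With α denoting the regularization strength, the three objectives can be written as:

```math
\begin{aligned}
\text{OLS:}\quad   &\min_w \; \|Xw - y\|_2^2 \\
\text{Ridge:}\quad &\min_w \; \|Xw - y\|_2^2 + \alpha \|w\|_2^2 \\
\text{Lasso:}\quad &\min_w \; \|Xw - y\|_2^2 + \alpha \|w\|_1
\end{aligned}
```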

OLS (ordinary least squares) regression minimizes the mean squared error (MSE) between model predictions and the true target. It can be solved efficiently, but can overfit the training dataset. To fight that, we can use either L2 (Ridge) or L1 (Lasso) regularization; both add a cost for large weights and therefore push weights toward zero.

The primary difference is that L1 regularization is much more willing to pull some weights all the way to zero. To understand why, we’ll draw some isolines for the MSE term and the regularization term, starting with L2.

In red, we have isolines of the regularization term, which is minimized at zero. In green, we have one isoline of the MSE term, which is minimized at the OLS solution. To optimize them together, we need to find a point somewhere in the middle. Is it possible for this point to have w2 equal to zero? Not really — since all isolines are convex, we can easily find another point on the green line that has lower regularization cost. Therefore, while L2 regularization does push weights to zero, it’s very unlikely to push any weight to exactly zero. In fact, a weight will be zero only if the corresponding weight of the OLS solution is zero.

Now let’s look at L1 regularization.

The isolines of the regularization term are now squares, and that changes everything. It is now quite possible for the solution to lie on an axis, even if the OLS solution does not. Because the isolines of the regularization term are “less convex”, they can pierce through the MSE isolines.
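This geometric intuition is easy to check numerically. Here is a hypothetical scikit-learn sketch (the data and the α values are invented for illustration), where the second feature is pure noise:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
# The target depends only on the first feature
y = 3 * X[:, 0] + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

# L2 shrinks the noise weight but leaves it non-zero;
# L1 pulls it to exactly zero
print(ridge.coef_)
print(lasso.coef_)
```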

This observation lets us understand what went wrong in our model. We used normal variables both to model the change from week to week and to model the variation within each week. Since the two distributions have the same shape, any spike inside a week also pushes the estimate of that week’s trend change. We need a “less convex” probability distribution for the week-to-week change.

# Using the Laplace distribution

One candidate is the Laplace distribution. The picture below shows the PDFs of the normal and Laplace distributions with the same mean and variance.
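The difference in peak heights is easy to verify numerically; a quick scipy sketch, matching the variance of a normal distribution with sigma of 10:

```python
import math
from scipy.stats import laplace, norm

sigma = 10
# A Laplace distribution with scale b = sigma / sqrt(2) has
# variance 2 * b**2 = sigma**2, i.e. the same as the normal
b = sigma / math.sqrt(2)

print(round(norm(0, sigma).pdf(0), 4))  # 0.0399
print(round(laplace(0, b).pdf(0), 4))   # 0.0707, a much taller peak
```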

We can see that the Laplace distribution is much more concentrated around zero, and is a tad “concave”, so we can hope our Bayesian inference will be less eager to explain the metric with trend changes. To apply this idea, we need to modify just one line of code:

```python
with pm.Model() as model:
    ...
    # A b parameter of 10/math.sqrt(2) results in the same variance
    # as the normal distribution with sigma of 10 that we tried
    # initially; only the shape is different
    changes = pm.Laplace("changes", mu=0,
                         b=10 / math.sqrt(2), shape=3)
    ...

# {'changes': array([5.69020714e+00, 1.87704650e-04, 7.52844191e-09])}
```

Our model now believes that the changes on the third and fourth weeks are essentially zero. We can show it in plot form:

This sounds like exactly what we wanted:

- There’s a large trend change on the second week that we can report
- The changes on the third and fourth weeks are infinitesimal, so we can reasonably ignore them

# Conclusion

We used a toy dataset, built a very simple PyMC model, and used it to obtain trends that appear to make sense to the naked eye. Of course, there’s much more we’d do to model real data:

- Instead of intuitions about distribution shapes, we’d look at our data and systematically build a data generation process
- We’d model seasonality, assume the trend itself is linear in time, and employ covariates.
- Finally, we’d keep track of detected trend changes, whether they are confirmed by users or not, and adjust our model sensitivity.

However, you have hopefully seen the key properties of the Bayesian approach to time series modelling. We can naturally describe the data generation process, encode our assumptions (in particular, by picking suitable probability distributions), and receive estimates of the parameters.

In case you are interested in going deeper:

- The complete code can be found in a companion notebook.
- Prophet ideas can be found in the original paper
- Using PyMC to find the location of a single changepoint in time series data is discussed in many posts, for example by Chad Scherrer
- Using PyMC to model seasonality is discussed in PyMC Examples.

If you found this post useful, please follow me here, or on Twitter.