Stationarity Isn’t The Special Case, Nonstationarity Is

5 Facts You Need to Know Before Using Nonstationary Kernels and Time Series

--

This post is a sequel to the post on stationarity and memory in financial markets.

Despite being well understood as mathematical properties, stationarity and nonstationarity are among the most misused tools by machine learning practitioners and seasoned researchers alike. The aim of this post is to address widespread misconceptions about stationary and nonstationary time series and nonstationary positive definite kernels.

I share 5 important points any practitioner or researcher dabbling with time series analysis, kernel methods, or Gaussian Process methods should know.

Fact I. You can’t test whether a time series is stationary or not from a single path.

As argued in-depth in this post, it is impossible to use a statistical test to conclude that a time series is stationary (or nonstationary for that matter) from a single path, no matter how long.

In essence, every statistical test of stationarity makes an additional assumption about the family of diffusions the underlying process belongs to. Thus, a rejection of the null hypothesis can mean either that the diffusion assumption is incorrect, or that the diffusion assumption is correct but the null hypothesis (e.g. the presence of a unit root) is false.

The statistical test by itself is inconclusive about which scenario holds.
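As a toy illustration of why a single path is inconclusive, the sketch below (all numbers are arbitrary choices for illustration) simulates a stationary AR(1) with coefficient 0.999 and a genuine random walk. On a single path, the least-squares estimate of the autoregressive coefficient is close to 1 in both cases, so the estimate alone cannot tell the two families apart.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Stationary AR(1): x_t = 0.999 x_{t-1} + eps_t (no unit root, but close)
ar = np.zeros(n)
eps = rng.standard_normal(n)
for t in range(1, n):
    ar[t] = 0.999 * ar[t - 1] + eps[t]

# Random walk: y_t = y_{t-1} + eta_t (unit root present)
walk = np.cumsum(rng.standard_normal(n))

def ols_ar1_coef(x):
    """Least-squares estimate of phi in x_t = phi x_{t-1} + noise."""
    return float(x[:-1] @ x[1:] / (x[:-1] @ x[:-1]))

phi_ar, phi_walk = ols_ar1_coef(ar), ols_ar1_coef(walk)
# Both estimates sit near 1: the single path cannot settle the question.
```

Any test that separates the two scenarios does so only by assuming more about the generating family, which is precisely the point above.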

Fact II. You can’t learn nonstationarity.

Another common misconception is that one can ‘learn nonstationarity’ from the data in a Bayesian nonparametric fashion. The intuition here comes from the realization that by modulating the inputs or the output of a stationary stochastic process with a deterministic function, we obtain a nonstationary stochastic process.

This is certainly true, but the next step is where things usually go wrong. Not knowing what modulating function to use, Bayesian nonparametric researchers would typically place a functional prior on it.

For instance, [¹] considers functional priors of the form

h(x) = g(x) · exp(f(x)),

where g and f are two independent mean-zero stationary Gaussian processes, and proposes a method for learning the logarithm f of the modulating function exp(f), which the authors claim is tantamount to learning nonstationarity.

This is clearly not the case. Indeed, the exponential of a strongly stationary process is strongly stationary, the product of two independent strongly stationary processes is also strongly stationary, and consequently, h is also strongly stationary!

In fact, the mean of h is easily found to be zero, and its translation-invariant covariance function reads:

Cov[h(x), h(y)] = k_g(x-y) · exp(k_f(0) + k_f(x-y)),

where k_g and k_f denote the covariance functions of g and f. Note however that, had the modulating function exp(f) been known (or deterministic), the covariance function would have been

Cov[h(x), h(y)] = exp(f(x)) · exp(f(y)) · k_g(x-y),

which is indeed not (necessarily) translation-invariant.

The functional prior proposed by [¹] should really be compared to a stationary Gaussian process with mean 0 and the same covariance function, and even if it outperforms its GP equivalent, it would certainly not be because of nonstationarity as both processes are (second-order) stationary. Such experiments could only make the case for using other diffusions than Gaussian processes as functional priors.
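This can be checked numerically. The sketch below (a self-contained illustration with arbitrary hyper-parameters, not code from [¹]) draws many samples of h(x) = g(x)·exp(f(x)) on a grid and verifies that the empirical covariance of h matches the translation-invariant closed form k_g(x-y)·exp(k_f(0) + k_f(x-y)).

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 20)
n = len(x)

def se_kernel(a, b, var=1.0, ls=0.3):
    # Squared-Exponential covariance: var * exp(-(a - b)^2 / (2 ls^2))
    return var * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls**2)

Kg = se_kernel(x, x)             # covariance of the base process g
Kf = se_kernel(x, x, var=0.25)   # covariance of the log-modulator f
Lg = np.linalg.cholesky(Kg + 1e-8 * np.eye(n))
Lf = np.linalg.cholesky(Kf + 1e-8 * np.eye(n))

n_draws = 100_000
g = rng.standard_normal((n_draws, n)) @ Lg.T
f = rng.standard_normal((n_draws, n)) @ Lf.T
h = g * np.exp(f)                # h(x) = g(x) exp(f(x)), mean zero

emp_cov = h.T @ h / n_draws
# Closed form: Cov[h(x), h(y)] = k_g(x - y) exp(k_f(0) + k_f(x - y))
theory = Kg * np.exp(Kf[0, 0] + Kf)
max_err = float(np.abs(emp_cov - theory).max())
```

The agreement between the Monte Carlo estimate and the closed form confirms that the covariance depends on x and y only through the lag x - y.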

Similarly, [²] suggests modulating the inputs by using a functional prior of the form

h(x) = g(f(x)),

where g is a mean-zero stationary Gaussian process, and f is an independent Gaussian process with the identity mean function m(x) = x and a translation-invariant covariance function.

This mean function aims at using ‘no input modulation’ as the baseline, while the translation-invariance of the covariance function of f avoids providing any domain-specific prior knowledge about the modulating function.

Clearly, for a known (deterministic) modulating function f, the covariance function, which works out to be

Cov[h(x), h(y)] = k_g(f(x) - f(y)),

is not (necessarily) translation-invariant, as intended. However, under the functional prior suggested by the authors, h is still mean-zero but now has translation-invariant covariance function

Cov[h(x), h(y)] = E[k_g(f(x) - f(y))],

where f(x) - f(y) is Gaussian with mean x - y and variance 2(k_f(0) - k_f(x-y)), both functions of the lag x - y only,
which makes the overall functional prior (second-order) stationary!
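The same kind of numerical check applies here. The sketch below (illustrative only, with arbitrary hyper-parameters, not code from [²]) samples h(x) = g(f(x)) by first drawing warped inputs f(x) around the identity mean, then drawing g at those warped inputs; the empirical covariance of h ends up depending only on the lag, so entries along each diagonal nearly coincide.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 8)
n, n_draws = len(x), 20_000

def se(a, b, ls):
    # Unit-variance Squared-Exponential covariance with lengthscale ls
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls**2)

# Warp f: identity mean m(x) = x, small stationary SE covariance
Lf = np.linalg.cholesky(0.05 * se(x, x, 0.5) + 1e-8 * np.eye(n))

samples = np.empty((n_draws, n))
for i in range(n_draws):
    fx = x + Lf @ rng.standard_normal(n)        # warped inputs f(x)
    Kg = se(fx, fx, 0.2)                        # g's covariance at warped inputs
    Lg = np.linalg.cholesky(Kg + 1e-8 * np.eye(n))
    samples[i] = Lg @ rng.standard_normal(n)    # h(x) = g(f(x))

emp_cov = samples.T @ samples / n_draws
lag1 = np.diag(emp_cov, 1)                      # entries sharing the same lag
spread = float(lag1.max() - lag1.min())         # ~0 under translation invariance
```

The near-constant diagonals are exactly what second-order stationarity predicts.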

The fundamental issue at play here is that, to have a nonstationary functional prior, the overall setup needs to provide a sense in which certain inputs or parts of the domain are different from others: this is what a deterministic f does. If you are placing a functional prior on f to avoid expressing such domain-specific prior knowledge, which is what you need to do to ‘learn nonstationarity’, then you’ll end up with a stationary functional prior overall.

In short, there is no way you can assume a functional prior or time series is nonstationary without expressing a specific sense in which some inputs or parts of the input domain differ from others. If you do not have such prior knowledge, then you cannot, and should not, assume nonstationarity.

As we elaborate in the following point, nonstationarity cannot be the object of your learning; it is the effect of your learning, or someone else’s: a reflection of the state of your knowledge.

Fact III. Stationarity/Nonstationarity is not about the phenomenon you are modeling, it is about your understanding of the phenomenon you are modeling!

This brings us to the most common misconception: the idea that data can be nonstationary, or can warrant the use of nonstationary kernels.

It is not uncommon to see articles in top-tier machine learning conference proceedings and journals adhere to the notion of ‘nonstationary functions’. See for instance [³].

Like randomness, stationarity and nonstationarity in econometrics and Bayesian nonparametrics have absolutely nothing to do with the data or the phenomenon that you are modeling; they have everything to do with your knowledge, understanding, or uncertainty about the phenomenon of interest!

Let’s illustrate this point with a concrete example. Let us go back in time to April 2019 and consider modeling the outcome of the 2020 Democratic Presidential Nomination.

The one thing we know for sure from the outset is that there will be a nominee. At any given point in time prior to the nomination, we may model the outcome of the nomination using a categorical distribution whose possible outcomes are all declared candidates plus an ‘other’ category to account for late candidacies.

The categorical distribution assigns a probability to each candidate not because the nomination process is random, but because we did not know what would happen. The rules were clear, the game was on, but we just did not know who would cross the finish line!

The categorical distribution, or any random variable for that matter, is simply a mathematical framework to encode our knowledge or lack thereof about a phenomenon whose outcome we do not know with certainty. The same phenomenon can be associated with multiple random variables without any one of them being wrong.

In our election example, for instance, there are multiple ways one could come up with outcome probabilities. One approach is to use national polls. They answer the question: if all primaries were to take place today, what percentage of the votes would each candidate get? These percentages could then be used as our outcome probabilities.

National polls can be regarded as weighted averages of state polls, weighted by the total number of voters in each state. Could we have used another weighting scheme? Certainly! All states do not vote at the same time, and the winners in states that vote first (e.g. Iowa and New Hampshire) could get a significant national boost. Thus, when weighting state polls, it might make sense to boost the weights of Iowa and New Hampshire. Then there are critical endorsements. It is conceivable that someone who knew that Rep. Jim Clyburn would endorse Joe Biden prior to the announcement would have slightly increased his probability of winning.

The broader point here is that up until the primary process is over, there is no such thing as the ‘true probability’ that any candidate will win. The random variable we use to model the final outcome of the primary process reflects the information we have access to, and our perspective. Two people can use two different distributions to model the same phenomenon, and neither will be wrong, even when one has more information than the other!

A prediction made can turn out to be wrong, but no random variable that can generate any possible outcome is wrong. The same goes for stochastic processes. So long as a diffusion can generate any path over a bounded input space, it is a valid time series model or functional prior!

As it turns out, the space of paths of the simple stationary Gaussian process with mean 0 and covariance function k⁰(x, y) = exp(-||x-y||²) is dense in the space of piecewise continuous functions on a compact set. This means that if you observe an indexed collection of phenomena (e.g. your output indexed by time or explanatory variables), no matter how weird your data look, they could have been generated by a Gaussian process with mean 0 and covariance function k⁰!

Note that k⁰ is not just translation-invariant, it is the Squared-Exponential kernel with a fixed set of hyper-parameters. I could have used any non-degenerate hyper-parameters and the argument above would still hold.

I hope I’ve convinced you by now that nothing about your data requires nonstationarity. Now, let us consider a practical example to illustrate how knowledge can be encoded as nonstationarity.

We want to learn a function f defined on [-1, 1] in a Bayesian nonparametric fashion. The only thing we know about the function is that f(0)=1. We could ignore this piece of information and model f as a draw from a stationary Gaussian process with Squared Exponential covariance function. The figure below illustrates three random draws from this functional prior and the 2 standard deviation bounds.

Possible draws of a Gaussian process with Squared Exponential covariance function.

Or we could encode the piece of knowledge f(0)=1 in our functional prior so as to ensure that all paths of the process will take value 1 at 0. Note that the resulting functional prior cannot possibly be translation-invariant. Examples of such functional priors are the Gaussian processes with mean function

m(x) = exp(-x²/(2l²))

and covariance function

k(x, y) = a · [exp(-(x-y)²/(2l²)) - exp(-(x²+y²)/(2l²))]

for any positive parameters a and l; this is precisely the Squared Exponential Gaussian process conditioned on passing through (0, 1). The figure below illustrates three random draws from such a process and the 2 standard deviation bounds.
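A minimal sketch of this construction (hyper-parameter values and grid are arbitrary): sampling from the Gaussian process with this mean and covariance, every draw passes through the point (0, 1), because the covariance vanishes at x = 0.

```python
import numpy as np

rng = np.random.default_rng(3)
a, l = 1.0, 0.4
x = np.linspace(-1.0, 1.0, 101)
i0 = int(np.argmin(np.abs(x)))     # index of the grid point at x = 0

def se(u, v):
    return np.exp(-0.5 * (u[:, None] - v[None, :]) ** 2 / l**2)

# Conditioned GP: m(x) = exp(-x^2 / (2 l^2)),
# k(x, y) = a [exp(-(x-y)^2 / (2 l^2)) - exp(-(x^2 + y^2) / (2 l^2))]
zero = np.zeros(1)
mean = np.exp(-0.5 * x**2 / l**2)
K = a * (se(x, x) - se(x, zero) @ se(zero, x))
L = np.linalg.cholesky(K + 1e-9 * np.eye(len(x)))

paths = mean + rng.standard_normal((3, len(x))) @ L.T
at_zero = paths[:, i0]             # every path should equal 1 at x = 0
```

The vanishing variance at x = 0 is exactly how the prior knowledge f(0)=1 shows up as nonstationarity.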

We could adopt a similar approach to encode time-dependent knowledge such as seasonality and time-to-maturity when modeling time series.

Fact IV. If you *really* need a default assumption, it should be stationarity, not nonstationarity!

As previously discussed, by choice or by accident, whenever you use a nonstationary time series or functional prior, you are expressing some domain knowledge. So, if you have no domain knowledge, you should stick to stationary kernels, assuming you absolutely want to choose one or the other.

Let me provide another explanation. At the end of the day, stationarity and nonstationarity are model properties, and intuition dictates that, when possible, we should always stick to the most parsimonious model. This is also known as Occam’s razor, an application of which is the maximum entropy principle. As it turns out, nonstationary time series tend to have a much lower entropy rate than stationary time series. Thus, stationary time series should be the preferred choice whenever possible.

Fact V. You don’t have to choose between stationarity and nonstationarity, and you probably shouldn’t!

I have painted a pretty bleak picture so far. You can’t test or learn nonstationarity, nonstationarity is about what you know not what you are modeling, and you should avoid assuming nonstationarity unless you have some domain knowledge. Let’s end with a silver lining.

It might seem as though nonstationary kernels/covariance functions can only be useful when we have some domain knowledge, but this is only the case when they are used exclusively. That is, when you would only consider one or multiple kernels that are all nonstationary.

When you do not have domain knowledge, it is not necessary to choose between stationary and nonstationary kernels from the outset. You should work with a family of kernels that includes both stationary and nonstationary kernels. The only question that should be guiding your choice of a family of kernels is whether or not there is a kernel out there, stationary or not, that performs significantly better than any kernel in your chosen family on the task at hand. If the answer is no, then the family of kernels you are working with is flexible enough, and you should be focusing on hyper-parameter learning.
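A toy sketch of this workflow (the kernels, hyper-parameters, and data below are arbitrary illustrations): score a small family containing both stationary and nonstationary kernels by log marginal likelihood and keep the best performer, instead of committing to either class up front.

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(-1.0, 1.0, 30)
y = x + 0.1 * rng.standard_normal(len(x))   # toy data with a linear trend

def log_marginal_likelihood(K, y, noise=0.01):
    """Log evidence log p(y) of a zero-mean GP with covariance matrix K."""
    n = len(y)
    L = np.linalg.cholesky(K + noise * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return float(-0.5 * y @ alpha - np.log(np.diag(L)).sum()
                 - 0.5 * n * np.log(2 * np.pi))

D2 = (x[:, None] - x[None, :]) ** 2
candidates = {
    "SE (stationary)": np.exp(-0.5 * D2 / 0.3**2),
    "linear (nonstationary)": np.outer(x, x),
    "SE + linear": np.exp(-0.5 * D2 / 0.3**2) + np.outer(x, x),
}
scores = {name: log_marginal_likelihood(K, y) for name, K in candidates.items()}
best = max(scores, key=scores.get)   # family member best supported by the data
```

The data, not an a priori commitment to stationarity or nonstationarity, decides which member of the family is retained.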

I formalized this notion in my PhD thesis [⁴] by introducing the notion of general-purposeness. Essentially, a parametric family of positive semi-definite kernels is general-purpose when it is (pointwise) dense in the family of all continuous and bounded kernels. It can be shown (see [⁴]) that this guarantees that you are not missing out by not considering kernels outside of a general-purpose family: for any performance a kernel outside of the family may achieve, there is at least one kernel in the general-purpose family that can achieve a performance arbitrarily close. Generalized spectral kernels are examples of general-purpose families of kernels. [⁴][⁵]

--


Yves-Laurent Kom Samo, PhD
The Principled Machine Learning Researcher

Founder & CEO @kxytechnologies | Prev: PhD Fellow in ML @GoogleAI | @ycombinator Alumn | PhD in ML @UniofOxford | Quant @GS. New Blog: https://blog.kxy.ai