Introduction To Copulas: A Machine Learning Perspective

If you have ever built a machine learning model, chances are that you have done some data pre-processing. Some basic but commonly used techniques include data ‘standardization’ (i.e. making each variable mean zero and standard deviation one), rescaling transformations (e.g. mapping each variable to [0, 1], linearly or through sigmoid-like functions), or logarithm transformations. Others entail getting input and/or output variables closer to the underlying assumptions of the mathematical model you might be interested in using (e.g. removing outliers using empirical quantiles so as to make the data generating distribution closer to a Gaussian distribution).

The hope is often to iron-out peculiarities in input and/or output variables so that the marginal data generating distributions will achieve specific properties such as having a given support (e.g. [0, 1]), having given mean and standard deviation (e.g. 0 and 1 respectively), belonging to a distribution family (e.g. Gaussian distributions), to name but a few.
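For concreteness, here is a minimal sketch (in Python with NumPy; the variable is simulated purely for illustration) of the kinds of normalizing transformations mentioned above:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # a skewed, positive variable

# 'Standardization': zero mean, unit standard deviation.
x_std = (x - x.mean()) / x.std()

# Linear rescaling to [0, 1].
x_minmax = (x - x.min()) / (x.max() - x.min())

# Sigmoid-like squashing to (0, 1).
x_sigmoid = 1.0 / (1.0 + np.exp(-x_std))

# Logarithm transformation (for positive variables).
x_log = np.log(x)
```

Each of these changes the marginal distribution of x, but, as discussed below, none of them changes the dependence structure between variables.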

The Probability Integral Transform

A normalizing transformation playing a pivotal role in statistical and machine learning theory is the probability integral transform. It is based on the mathematical property that, if x is a scalar random variable with continuous cumulative distribution function (or CDF) F, then F(x) is a uniform random variable on [0, 1]. In other words, each marginal distribution of a random vector

X = (x₁, …, x_d),

which contains the idiosyncrasies of the corresponding coordinate variable, can always be flattened out into the uniform distribution on [0, 1]. This is true regardless of whether the marginal distribution is multimodal or unimodal, fat-tailed or light-tailed, etc. The resulting transformed random vector,

U = (u₁, …, u_d), with uᵢ = Fᵢ(xᵢ) and Fᵢ the CDF of the i-th coordinate xᵢ,

is no longer informative about the peculiarities of each original coordinate random variable, but it remains informative about the dependence structure between the original coordinate random variables. We refer to the random vector U above as the copula-uniform dual representation of X.
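In practice one rarely knows the marginal CDFs Fᵢ; a simple empirical stand-in is the normalized rank of each observation. A minimal sketch, assuming NumPy and SciPy are available:

```python
import numpy as np
from scipy.stats import rankdata

def empirical_copula_dual(X):
    """Map each column of X (n samples x d variables) into (0, 1) using
    normalized ranks, i.e. an empirical probability integral transform."""
    n = X.shape[0]
    # rankdata averages ties; dividing by n + 1 keeps values strictly inside (0, 1).
    return rankdata(X, axis=0) / (n + 1.0)

# Heavy-tailed marginals in the primal space, near-uniform marginals in the dual space.
rng = np.random.default_rng(0)
X = rng.standard_t(df=2, size=(1000, 2))
U = empirical_copula_dual(X)
```

Whatever the shape of each marginal, the columns of U are (approximately) uniform on [0, 1]; only the dependence between columns survives the transformation.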

Problems Are in the Primal Space, Solutions Are in the Dual Space

The copula-uniform dual space allows us to study the structure in our problem in a way that is robust to any increasing transformation that we might have applied to our variables. This effectively makes any other increasing normalizing transformation redundant, while allowing us to focus on studying the structure, associations or patterns in our data without, as we will see below, making any assumption on the data generating marginal distributions.

Additionally, we typically do not have control over the data generating distribution. Its marginal distributions can be affected by both the peculiarities of the underlying phenomenon of interest, and how the sample we are working with was prepared (e.g. log prices or prices, additional pre-processing, etc.). Working with such distributions can pose analytical and numerical problems beyond our control.

Example problems include the non-existence of mean and/or higher moments (e.g. when true marginals are Cauchy, inverse gamma or Student’s t distributed), the need to resort to cumbersome convergence theorems on a case-by-case basis to ensure the validity of intuitive operations such as swapping limits and integrals or sums, the need for elaborate variance reduction schemes in Monte Carlo integration, to name but a few.

Of course, we could simply pick marginal distributions based on how friendly they are to work with, rather than seeking the true marginal distributions. This is what most researchers do, and it could explain the widespread misuse of Gaussian distributions, with consequences as damaging as the 2008 credit crunch.

While we do not have control over numerical or analytical hardships encountered in the primal space, numerical or analytical hardships in the dual space are always an indication that we are working with a model that posits that our problem is excessively structured. This will almost certainly come from model misspecification, as we will see in the examples below. If the machine learning problem we are solving is indeed that structured, then we are very lucky, and a simple model can have a tremendous impact.

Crucially, improving decision making in finance has more to do with learning patterns, structures or associations from data, and less to do with learning a model of financial markets or user behavior. In probabilistic terms, we care about learning stochastic dependence; we do not care so much about marginal distributions. If we do not care about the marginals of the data generating distribution, would it not be a shame if we had to make arbitrary assumptions about them as a prerequisite to learning what we actually care about? It sure sounds odd. As it turns out, most structure learning problems can be formulated in the copula-uniform dual space independently from, and without reference to, the marginals of the data generating distribution. I provide an example below and more in subsequent posts.

Marginal Is The Name of Your Data Vendor, Copula Is The Name of Your Pattern

A link between problems in the primal space and their dual equivalents is provided by Sklar’s theorem. When the marginal CDFs are continuous, the joint CDF factors as

F(x₁, …, x_d) = C(F₁(x₁), …, F_d(x_d)),

where C is the CDF of the copula-uniform dual representation U. Sklar’s theorem formalizes this result and states that the distribution of any random vector X is uniquely and fully specified by its marginals and by a distribution with uniform marginals (the distribution of its copula-uniform dual representation U), whose CDF is referred to as the copula of X.

More generally, the term ‘copula’ denotes any mathematical function that is the CDF of a multivariate distribution with uniform marginals on [0,1].

In a machine learning setting, copulas fully capture the dependence structure between input variables, while marginals are solely informative about how the data were collected or how the underlying phenomenon was observed. For instance, whether you are working with prices, log-prices, or some other increasing mapping used by your data provider for licensing reasons, it will be fully reflected in your marginals, but the copula will remain the same!
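This invariance is easy to check numerically: a rank-based statistic such as Spearman’s rho, which depends only on the copula, is unchanged when prices are replaced with log-prices, whereas Pearson correlation, which also depends on the marginals, is not. A minimal sketch with simulated ‘price-like’ data (purely illustrative):

```python
import numpy as np
from scipy.stats import spearmanr, pearsonr

rng = np.random.default_rng(0)
# Two dependent, positive, 'price-like' series (simulated for illustration only).
z = rng.multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, 0.8], [0.8, 1.0]], size=5000)
prices = np.exp(z)
log_prices = np.log(prices)  # an increasing transformation of each variable

# Spearman's rho only depends on ranks, hence on the copula: identical before/after.
rho_raw, _ = spearmanr(prices[:, 0], prices[:, 1])
rho_log, _ = spearmanr(log_prices[:, 0], log_prices[:, 1])

# Pearson correlation also depends on the marginals: it changes under the transform.
r_raw, _ = pearsonr(prices[:, 0], prices[:, 1])
r_log, _ = pearsonr(log_prices[:, 0], log_prices[:, 1])

print(rho_raw, rho_log)  # the same value
print(r_raw, r_log)      # different values
```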

Copulas and Total Correlation

An intuitive question every data scientist should ask before fitting any model is whether the variables under consideration are informative about one another, or related at all.

Related variables are necessary for the success of models involving some form of compression (e.g. VAEs, GANs and other encoder-based generative models). Using inputs that are informative about labels is key to the success of supervised learning models. However, input variables that are highly related to one another in classification and regression problems indicate poor variable selection and can cause numerical instabilities (e.g. ill-conditioning).

Thus, quantifying information redundancy can go a long way toward estimating the extent to which decision making can be improved, prior to running any model. We recall that the entropy of a random variable quantifies the amount of information (defined as uncertainty) it contains.

h(p) = −∫ p log(p) dμ

The entropy of a probability distribution with density function p with respect to a base measure dμ. The entropy of a random variable is that of its probability distribution.

It can be shown (*) that the entropy of the random vector X can be broken down as

h(X) = h(X*) + h(U),

where X* is a random vector with marginal distributions identical to those of X, but with independent coordinates, and U is the copula-uniform dual representation of X.

As it happens, the term h(X*) − h(X) is known as the total correlation between the coordinates of X, and it quantifies the information redundancy between them. The equation above establishes that the total correlation of a random vector is simply equal to the negative of the entropy of its copula, which obviously does not depend on the primal marginals! Note that the entropy of a copula is always non-positive. (**) In fact, the only continuous copula with 0 entropy is the independence copula

C(u₁, …, u_d) = u₁ × … × u_d;

all other continuous copulas have negative entropies! Said differently, the amount of redundant information in a random vector is always non-negative, and it is 0 if and only if its coordinates are statistically independent, which makes intuitive sense.
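For intuition, in the special case where X has a Gaussian copula with correlation matrix R, the copula entropy has the closed form ½ log det(R) (never positive, since det(R) ≤ 1), so the total correlation is −½ log det(R), regardless of the marginals. A minimal sketch:

```python
import numpy as np

def gaussian_copula_total_correlation(R):
    """Total correlation (in nats) implied by a Gaussian copula with correlation
    matrix R: minus the copula entropy, i.e. -0.5 * log det(R). It does not
    depend on the marginal distributions attached to that copula."""
    sign, logdet = np.linalg.slogdet(np.asarray(R, dtype=float))
    assert sign > 0, "R must be a valid (positive definite) correlation matrix"
    return -0.5 * logdet

# A 3-dimensional example with all pairwise correlations equal to 0.5.
R = np.array([[1.0, 0.5, 0.5],
              [0.5, 1.0, 0.5],
              [0.5, 0.5, 1.0]])
print(gaussian_copula_total_correlation(R))      # ~0.35 nats of redundant information
print(gaussian_copula_total_correlation(np.eye(3)))  # 0.0: independence copula
```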

In subsequent posts, I will establish that most structure learning problems can, in fact, be formulated in the dual space solely as a function of the copula-uniform dual representation, and independently from primal marginals.

For now, if there is one takeaway from this post, it is that marginal distributions have a lot to do with how the data were collected by your data vendor or data team, while copulas have everything to do with the structure between the underlying phenomena and nothing to do with how the data were collected. Improving decision making will come from studying the structure in your data, not from studying your data vendor or data team.

The Copula Trick

Now let’s address the elephant in the room. Clearly, the marginal CDFs will typically not be known and, as previously discussed, they aren’t what we care about. So, how can we work in the dual space without explicitly applying the probability integral transform to the data? We proceed as follows:

Step 1: we use as empirical evidence one or multiple decent (ideally unbiased and consistent) estimators of known functionals of the (unknown) copula. The sample versions of a few well-studied concordance measures (e.g. Spearman’s Rho, Gini’s, Blest’s, etc.) fall in this category.

ρ = 12 ∫∫ u·v·c(u, v) du dv − 3,    ρ̂ = 1 − 6 Σᵢ (rg(xᵢ) − rg(yᵢ))² / (n(n² − 1))

Expression of the Spearman concordance measure between two scalar random variables x and y as a functional of their copula density c(u, v), and a sample estimator thereof in the primal space using n i.i.d. samples (xᵢ, yᵢ). rg(xᵢ) (resp. rg(yᵢ)) represents the rank of xᵢ (resp. yᵢ) in the set (x₁, …, xₙ) (resp. (y₁, …, yₙ)).

Like copulas, these estimators are typically invariant under any increasing transformation of the data and often depend on the data solely through ranks.
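A minimal sketch of the rank-based sample estimator above (assuming no ties), checked against scipy.stats.spearmanr:

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

def spearman_rho(x, y):
    """Sample Spearman rho computed solely from ranks (assumes no ties)."""
    n = len(x)
    d = rankdata(x) - rankdata(y)
    return 1.0 - 6.0 * np.sum(d ** 2) / (n * (n ** 2 - 1))

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = -0.7 * x + rng.normal(scale=0.7, size=1000)

rho_hat, _ = spearmanr(x, y)
print(spearman_rho(x, y), rho_hat)  # the two estimates coincide
```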

Step 2: We learn the copula density in a model-free fashion by solving a maximum entropy problem in the dual space under the constraints that the foregoing functionals are equal to the values estimated in the primal space (using rank data). In so doing, we choose to use as copula the least structured among all copulas that match our empirical observations — we do not restrict ourselves to any parametric family of copulas!

By using this ‘copula trick’, we never actually have to generate the copula-uniform dual representation of our dataset, and we never have to make an arbitrary model assumption. I’ll write a separate post detailing how to undertake maximum entropy inference of copulas, and how it can be used to solve concrete problems.
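The detailed treatment is deferred to that post; purely as a rough illustration of the flavor of the optimization (and not as the author’s actual method), here is a toy sketch that discretizes [0, 1]² into an m × m grid and maximizes the entropy of the cell masses subject to uniform marginals and a target Spearman rho, using scipy.optimize:

```python
import numpy as np
from scipy.optimize import minimize

def toy_max_entropy_copula(rho_target, m=10):
    """Toy, discretized stand-in for maximum entropy copula inference: maximize
    the entropy of m x m cell masses subject to uniform marginals and to the
    (discretized) Spearman rho constraint rho = 12 E[UV] - 3."""
    u = (np.arange(m) + 0.5) / m            # cell mid-points in (0, 1)
    uv = np.outer(u, u)                     # u_i * v_j on the grid

    def neg_entropy(p):
        p = p.reshape(m, m)
        return float(np.sum(p * np.log(p + 1e-12)))

    constraints = [
        # Uniform marginals: every row and every column of cells carries mass 1/m.
        {"type": "eq", "fun": lambda p: p.reshape(m, m).sum(axis=1) - 1.0 / m},
        {"type": "eq", "fun": lambda p: p.reshape(m, m).sum(axis=0) - 1.0 / m},
        # Spearman rho constraint, estimated in the primal space from rank data.
        {"type": "eq",
         "fun": lambda p: 12.0 * np.sum(p.reshape(m, m) * uv) - 3.0 - rho_target},
    ]
    p0 = np.full(m * m, 1.0 / (m * m))      # start from the independence copula
    res = minimize(neg_entropy, p0, method="SLSQP",
                   bounds=[(0.0, 1.0)] * (m * m), constraints=constraints)
    return res.x.reshape(m, m) * (m * m)    # rescale cell masses to a density

density = toy_max_entropy_copula(rho_target=-0.7)  # a coarse analogue of Fig. 2
```

Note that nothing in this sketch refers to the marginals of the data generating distribution: the only empirical input is the rank-based Spearman rho estimate.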

Examples

In the meantime, let’s consider a couple of problems and their solutions.

Problem 1: Among all 3-dimensional distributions in the primal space that have a given average pairwise Spearman correlation, which distribution is the least structured or has the least amount of information redundancy, and what is its total correlation?

Note that this optimal distribution is also the one whose copula has the highest entropy, or said differently, whose copula is the most uncertain about everything but the observed average pairwise Spearman correlation (regarded as a functional of the copula density). While there is an infinite number of solutions (***), they all have the same copula. The plot below illustrates the minimum total correlation as a function of the constraint value.

Fig. 1: The minimum amount of information redundancy there is in a 3-dimensional random vector as a function of its pairwise average Spearman Rho.

Problem 2: What is the density of the copula with the highest entropy, among all copulas of 2-dimensional random vectors with a given Spearman rank correlation?

The figure below illustrates the solution for a Spearman correlation of -0.7.

Fig. 2: Copula density of the copula with the highest entropy among all copulas of 2-dimensional random vectors with Spearman correlation -0.7.

As discussed earlier, the most uninformative of all copulas (i.e. the one with the highest entropy) is the independence copula

C(u, v) = u·v,

whose density is simply the flat surface

c(u, v) = 1 over [0, 1]².

When we add the Spearman correlation constraint to the maximum entropy problem, the flat surface solution to the unconstrained problem should bend just enough to meet the new constraint. The more the flat surface bends, the more the entropy is reduced, and the more structure the copula encodes.

We know the maximum entropy copula with Spearman correlation -0.7 is not Gaussian. But how far away is it from the copula of the bivariate Gaussian with Spearman correlation -0.7 (or equivalently a Pearson correlation of -0.72)? How much more structure does the Gaussian copula encode? A whole lot, as suggested by the plot below!

Fig. 3: Copula density of the 2-dimensional Gaussian copula with Spearman correlation -0.70 (Pearson correlation -0.72).

At a glance, the Gaussian copula with the same Spearman correlation is indeed a lot more curved and steep around the corners (0, 1) and (1, 0).
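For readers who want to reproduce a plot like Fig. 3, the bivariate Gaussian copula density has a simple closed form, and for a bivariate Gaussian the Spearman and Pearson correlations are related by ρ_Pearson = 2·sin(π·ρ_Spearman/6), which maps -0.7 to roughly -0.72. A minimal sketch with SciPy:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def gaussian_copula_density(u, v, rho):
    """Density of the bivariate Gaussian copula with Pearson correlation rho,
    evaluated at points (u, v) in (0, 1)^2."""
    x, y = norm.ppf(u), norm.ppf(v)
    joint = multivariate_normal(mean=[0.0, 0.0],
                                cov=[[1.0, rho], [rho, 1.0]]).pdf(np.dstack([x, y]))
    return joint / (norm.pdf(x) * norm.pdf(y))

# Spearman rho of -0.7 corresponds, for a Gaussian, to a Pearson rho of about -0.72.
rho_pearson = 2.0 * np.sin(np.pi * (-0.7) / 6.0)

# Evaluate the copula density on a grid of (u, v) values, e.g. to draw a surface like Fig. 3.
u, v = np.meshgrid(np.linspace(0.01, 0.99, 99), np.linspace(0.01, 0.99, 99))
c = gaussian_copula_density(u, v, rho_pearson)
```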

This might seem counterintuitive to the alert reader who knows that Gaussian distributions are maximum entropy under covariance constraints in the primal space. This is easily explained. Do you remember the following identity?

h(X) = Σᵢ h(xᵢ) + h(U) = h(X*) + h(U)

It shows that maximizing the entropy or, equivalently, model uncertainty in the primal space seeks a tradeoff between maximizing the sum of the entropies of the marginals and maximizing the entropy of the copula. While the entropy of the copula reflects the amount of structure in our problem, and we do not want to posit that our problem is more structured than the empirical evidence suggests, there is absolutely no reason whatsoever to maximize the entropies of the marginals! Why would the representation of the phenomenon of interest that you are working with, which is provided by your data vendor or data team, be Gaussian, or even high entropy? You don’t know, and you probably shouldn’t care anyway.

Don’t let Gaussians fool you. They might be high entropy, but their copulas, which are all you should care about, are unusually low entropy and excessively structured. The same holds for any multivariate distribution whose copula density explodes when a coordinate goes to 0 or 1, including the Student’s t distribution!

If you are still not convinced that those distributions are excessively structured, let’s reason in the primal space. The copula density of X can be written as

c(u₁, …, u_d) = f(x₁, …, x_d) / [f₁(x₁) × … × f_d(x_d)],  with uⱼ = Fⱼ(xⱼ),

where f (resp. fᵢ) is the joint (resp. marginal) PDF, and p is the conditional PDF of xᵢ given all other coordinates, so that f(x₁, …, x_d) = p(xᵢ | {xⱼ, j ≠ i}) × (the joint PDF of the other coordinates). Let’s fix all coordinates but xᵢ and take the limit of the copula density c as xᵢ goes to infinity (+∞ or -∞) or, equivalently, as uᵢ goes to 1 or 0. Both fᵢ(xᵢ) and f(…, xᵢ, …) will go to 0, as f and fᵢ should integrate to 1. However, if c explodes, then necessarily

p(xᵢ | {xⱼ, j ≠ i}) / fᵢ(xᵢ) → +∞,

meaning that the joint distribution posits that knowing the values of all xⱼ (j ≠ i), no matter what those values actually are, is always so informative about xᵢ that it drastically fattens its tails, to the point that the conditional PDF converges to 0 at a speed that is negligible compared to the speed at which the marginal of xᵢ converges to 0.
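This explosion is easy to observe numerically for the Gaussian copula: with a negative correlation, its density blows up along the diagonal approaching the corner (0, 1). A small self-contained check, using the closed-form bivariate Gaussian copula density:

```python
import numpy as np
from scipy.stats import norm

rho = -0.72  # Pearson correlation of the Gaussian copula shown in Fig. 3

for eps in [1e-2, 1e-4, 1e-6, 1e-8]:
    # Points (u, v) = (eps, 1 - eps) drifting towards the corner (0, 1).
    x, y = norm.ppf(eps), norm.ppf(1.0 - eps)
    # Log of the bivariate Gaussian copula density at (u, v).
    log_c = (-0.5 * np.log(1.0 - rho ** 2)
             - (rho ** 2 * (x ** 2 + y ** 2) - 2.0 * rho * x * y)
             / (2.0 * (1.0 - rho ** 2)))
    print(eps, np.exp(log_c))
# The density grows without bound as eps -> 0: the Gaussian copula encodes a lot
# of structure in that corner, which the maximum entropy copula of Fig. 2 does not posit.
```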

To visualize this, we plot the contours of two multivariate distributions with standard Gaussian marginals, one with Gaussian copula with Spearman correlation -0.7, and the other with the previously illustrated maximum-entropy copula with the same Spearman correlation.

Fig. 4: PDFs of two bivariate distributions with Spearman correlation -0.7 and standard Gaussian marginals. The distribution on the left has as its copula the maximum entropy copula under the Spearman correlation constraint (see Fig. 2), while the distribution on the right has the Gaussian copula (see Fig. 3).

Copula is the name of your patterns, and Marginal is the name of your data vendor. Study your pattern, thank your vendor for the data, and practice some social distancing away from Gaussians!

(*) Hint: When X admits a probability density function (PDF), so does U, and it reads

c(u₁, …, u_d) = f(F₁⁻¹(u₁), …, F_d⁻¹(u_d)) / [f₁(F₁⁻¹(u₁)) × … × f_d(F_d⁻¹(u_d))].

(**) Hint: Recall that the entropy of a joint distribution is always smaller than the sum of the entropies of its marginals and that the entropy of the uniform distribution on [0, 1] is 0.

(***) Remember, Spearman correlation does not depend on marginals. A copula combined with any set of marginals will result in a multivariate distribution in the primal space with the same Spearman correlation.

Yves-Laurent Kom Samo, PhD
The Principled Machine Learning Researcher

Founder & CEO @kxytechnologies | Prev: PhD Fellow in ML @GoogleAI | @ycombinator Alumn | PhD in ML @UniofOxford | Quant @GS. New Blog: https://blog.kxy.ai