Characteristic Stability Index (CSI) as a Statistic

Vassili Savinov
7 min read · Oct 29, 2022


Can CSI be used to distinguish temperature distributions for different geographic locations (in the UK)?

Characteristic Stability Index (CSI) and the closely related Population Stability Index (PSI) are popular metrics for measuring the distribution stability of numeric features. CSI is often presented as a universal metric with rule-of-thumb thresholds that are understood to have wide applicability. In this post I will instead treat it as a statistic and demonstrate that the sample size and the underlying randomness of the data source can drastically alter what one should count as large or small for CSI. I will also present a JAX-based Python library to estimate appropriate CSI thresholds.

Introduction

We shall focus on a real-valued feature, such as temperature X. Let there be a reference set of observations of X, recognized as ‘normal’ (call it the ‘left’ set), and a new set of observations that needs to be compared against normal (call it the ‘right’ set):

Left set, containing ‘N’ observations, and right set containing ‘M’ observations.

Thus CSI is a means of comparing ‘left’ to ‘right’. To carry out the comparison, one bins the observations into, typically, 10 bins, such that each bin contains 10% of the observations from the ‘left’ set. The observations from the ‘right’ set need not split in the same proportions (this is precisely what CSI is trying to detect):

To compute CSI, the observations have to be binned. Typically there are 10 bins, such that the observations in the ‘left’ set split equally between the bins (10% of ‘left’ observations in every bin).

Note: Throughout the text I shall refer to the arrays of occupancy (the fourth and fifth columns in the table above) as histograms, to simplify the text.

Given left and right histograms one can proceed to compute CSI:
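In formula form, with p_i and q_i denoting the per-bin proportions of the ‘left’ and ‘right’ sets, CSI = Σ_i (p_i − q_i) · ln(p_i / q_i). A minimal Python sketch of this calculation (the small eps floor that guards against empty bins is a common convention, not necessarily the exact choice used here):

```python
import numpy as np

def csi(left_counts, right_counts, eps=1e-4):
    """CSI between two histograms: sum over bins of (p - q) * ln(p / q)."""
    p = np.asarray(left_counts, dtype=float)
    q = np.asarray(right_counts, dtype=float)
    # convert counts to proportions, flooring empty bins at eps
    p = np.clip(p / p.sum(), eps, None)
    q = np.clip(q / q.sum(), eps, None)
    return float(np.sum((p - q) * np.log(p / q)))

# identical histograms give a CSI of exactly 0
print(csi([10, 20, 30], [10, 20, 30]))
```

Each term of the sum is non-negative (p − q and ln(p/q) always share a sign), so CSI is zero only when the two histograms have identical proportions.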

For example, consider the temperature observations (12 months per year, for over a century) for Oxford (UK) and Durham (UK), available from the UK Met Office (using the tmax degC column). Breaking the temperature into bins and collecting it into histograms:

Example of binning temperature observations for Oxford and Durham into 10 bins, appropriate for CSI calculation.

One can quickly estimate that the CSI between these two distributions is around 0.23. A similar approach has been described in:

With some simple analysis one can relate CSI to the Kullback-Leibler divergence. More precisely, CSI is a symmetrized version of the KL-divergence; see, e.g., the Stack Exchange question what-is-the-intuition-behind-the-population-stability-index
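The relation follows from splitting the logarithm term; with p_i and q_i the left and right bin proportions:

```latex
\mathrm{CSI}
  = \sum_i (p_i - q_i)\,\ln\frac{p_i}{q_i}
  = \sum_i p_i \ln\frac{p_i}{q_i} + \sum_i q_i \ln\frac{q_i}{p_i}
  = D_{\mathrm{KL}}(p\,\|\,q) + D_{\mathrm{KL}}(q\,\|\,p)
```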

Treating CSI as a statistic

One way to get a better handle on CSI is to note that it is based solely on the histograms of the left and right sets of observations, i.e. the actual observations themselves do not matter. One can therefore reduce the problem to considering the expected CSI between samples drawn from multinomial distributions, with a ‘left’ and a ‘right’ set of category probabilities:

One can then draw samples of different sizes from the ‘left’ and ‘right’ multinomial distributions, using the probabilities above, and compute the corresponding CSIs. For example, let:

for 10 categories. One can obtain these using, for example:
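As an illustration (the specific probability vectors below are my assumption, chosen to make the point: a uniform ‘left’ distribution against a ‘right’ distribution with all its mass in one category, with zero-probability categories floored at eps):

```python
import numpy as np

def csi_from_probs(p, q, eps=1e-4):
    # floor tiny probabilities at eps so the logarithm stays finite
    p = np.clip(np.asarray(p, dtype=float), eps, None)
    q = np.clip(np.asarray(q, dtype=float), eps, None)
    return float(np.sum((p - q) * np.log(p / q)))

p_left = np.full(10, 0.1)   # uniform over 10 categories
p_right = np.zeros(10)
p_right[0] = 1.0            # all mass in one category

for eps in (1e-2, 1e-3, 1e-4):
    print(eps, round(csi_from_probs(p_left, p_right, eps), 2))
```

With these vectors the nine empty categories each contribute roughly 0.1 · ln(0.1/eps), so the result grows as eps shrinks, spanning roughly the 4…8 range quoted below.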

This ends up giving an extremely large value, e.g. 4…8 depending on eps. The actual value is not important; what matters is that:

The expected behaviour of CSI, i.e. the null hypothesis, can be expressed in terms of the multinomial distributions that give rise to the histograms from which CSI is computed.

Null hypothesis

A simple null hypothesis would be that the left and right histograms come from the same multinomial distribution, perhaps with equal probabilities for each category. This, however, misses the noise inherent in the system under investigation. For example, in the Met Office temperature measurements used in the preceding section, the temperatures in the source data are specified to one decimal place of a degree. This means that both 21.06 and 21.14 would be rounded to the same value of 21.1 degrees, so one has to expect the inherent noise in the temperature to be on the order of ±0.04 degrees (at least). If the temperature bins are about 2 degrees wide, the effective bin width can then vary between 1.92 and 2.08 degrees, an 8% range.

One way to capture this is to assume that the left and right histograms come from the same multinomial distribution, but that the probabilities of the categories are not fixed: they are themselves random numbers with a known distribution. A convenient parametrization is to express the probability of each category through a logit, and to model the logits as normally distributed random variables with a given mean and standard deviation:

Parametrization of the probabilities for multinomial distribution that will be used to specify Null Hypothesis for CSI.

Whilst this may seem somewhat complex, it is quite simple to implement.
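A minimal NumPy sketch of this sampling scheme (the post’s library is JAX-based; the parameter values and the sigmoid-then-normalize step below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)

def draw_null_csi(logit_mean=-2.197, logit_std=0.28,
                  sample_count=120, n_cats=10, eps=1e-4):
    """Draw one CSI value under the null hypothesis: both histograms come
    from multinomials whose category probabilities are sigmoids of noisy
    logits, normalized to sum to one."""
    def histogram():
        z = rng.normal(logit_mean, logit_std, n_cats)  # noisy logits
        p = 1.0 / (1.0 + np.exp(-z))                   # sigmoid of each logit
        p /= p.sum()                                   # normalize to a distribution
        return rng.multinomial(sample_count, p)
    left, right = histogram(), histogram()
    p = np.clip(left / sample_count, eps, None)
    q = np.clip(right / sample_count, eps, None)
    return float(np.sum((p - q) * np.log(p / q)))

csis = np.array([draw_null_csi() for _ in range(2000)])
print(csis.mean())
```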

By adjusting logit_std and sample_count one can easily see that a larger number of samples does decrease the CSI, but as the sample count grows the CSI hits a floor, dictated by logit_std.

As we will demonstrate in the next section, by repeatedly sampling CSI in this fashion one can build what is essentially a hypothesis test, with CSI treated as the test statistic.

Worked example: CSI for max monthly temperatures for different UK locations

As an example, we shall return to historical monthly temperatures provided by the Met-Office UK. Here are the locations we shall consider.

See the thumbnail picture for where these cities lie on the map. Southampton, as the southernmost city, will be the reference point, i.e. the histogram grid will be chosen such that Southampton’s monthly temperatures split roughly equally between 10 bins.

The full code for loading the data and splitting it into bins is provided in a separate notebook; here we shall only present an illustration of the histograms extracted for Southampton.

Illustration of (monthly) temperature observations from Southampton binned into 10 bins and aggregated over 10-year periods (so 120 observations in each histogram)

So, for example, in the 10-year span 1978–1987, corresponding to histogram #12, 12 maximum monthly temperatures were below 7.6 degrees (Centigrade), 13 observations were between 7.6 and 9.0 degrees, etc. Staying with histogram #12: 12 out of 120 observations were below 7.6 degrees, which corresponds to a proportion of 0.1 and a logit of ln(0.1/0.9) = -2.197. Converting all observations into logits in this manner, for all locations, and then computing the mean and standard deviation of the logits, one finds:
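The proportion-to-logit conversion, as a quick check:

```python
import numpy as np

p = 12 / 120                 # 12 of 120 observations fall in the bin
logit = np.log(p / (1 - p))  # log-odds of the proportion
print(round(logit, 3))       # -2.197
```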

Logit summaries (mean and standard deviation) for histograms of all considered locations

Unsurprisingly, Southampton’s logits are closest to -2.2 and have the lowest standard deviation, which corresponds to the observations splitting roughly equally between the 10 bins. This is a direct consequence of choosing the bin boundaries with Southampton as the reference point.

Next, one can take the 10-year aggregated histograms for the different locations and compare them to Southampton’s by computing CSI. Using the most recent 10-year histograms in all cases, one finds:

CSI for temperature distributions for all locations compared to Southampton

Which CSI is large enough to count as evidence of a substantially different distribution of monthly maximum temperatures? The common rule of thumb for CSI is to treat anything above 0.2 as a significant change; however, this rule fails to take into account the noise in the data.

Instead, one can adopt the null hypothesis that the left and right histograms come from a multinomial distribution where the probability of each category is the sigmoid of a logit, with the logit a normally distributed random variable with mean -2.197 and standard deviation 0.282 (both from the Southampton data). Drawing 300,000 pairs of such histograms (enough for the estimate to converge), with 120 samples in each (10 years of 12 months), one can find the distribution of the expected CSI (see a separate notebook):

Distribution of CSI under the null hypothesis. The shape is somewhat similar to a chi-squared or Gamma distribution. Also see the G-test.

Using this, one can estimate that 95% of all CSIs drawn under the null hypothesis fall below 0.453.

Therefore, treating this as a hypothesis test at the 95% confidence level, one would reject the null hypothesis only for Bradford and Stornoway (where the CSI is above 0.45). Observations for all other locations do not differ sufficiently from the Southampton temperature distribution (i.e. their CSI values are low enough to ascribe the differences to noise).

Specifics of implementation

Drawing a large sample of CSIs to establish thresholds, as done above, is simple in principle, but can be quite time-consuming if one loops in pure Python. A much better solution is to vectorize the process. Correspondingly, a JAX-based implementation for drawing CSI samples is offered as part of this post (draw_csi_with_logit_variance(…)). The usage is as follows (see a separate notebook)
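The library call itself is shown in the linked notebook; the vectorized idea behind it can be sketched in plain NumPy as follows (the function name, defaults, and sigmoid-then-normalize parametrization below are illustrative assumptions, not the library’s API):

```python
import numpy as np

rng = np.random.default_rng(7)

def draw_csi_samples(n_draws, n_cats=10, logit_mean=-2.197,
                     logit_std=0.282, sample_count=120, eps=1e-4):
    """Draw n_draws CSI values under the null hypothesis, fully vectorized."""
    # logits for all draws at once: axis 0 is left/right, axis 1 the draws
    z = rng.normal(logit_mean, logit_std, size=(2, n_draws, n_cats))
    p = 1.0 / (1.0 + np.exp(-z))               # sigmoid of each logit
    p /= p.sum(axis=-1, keepdims=True)         # normalize to distributions
    counts = rng.multinomial(sample_count, p)  # histograms, same shape as p
    freq = np.clip(counts / sample_count, eps, None)
    left, right = freq[0], freq[1]
    return np.sum((left - right) * np.log(left / right), axis=-1)

csi_samples = draw_csi_samples(300_000)
print(np.quantile(csi_samples, 0.95))          # the 95% threshold
```

Note that numpy.random.Generator.multinomial accepts a multidimensional pvals array (last axis holding the category probabilities), which is what makes the single vectorized call possible; the JAX implementation achieves the same effect with vmap/JIT compilation.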

The calculation takes 10–20 sec on an average laptop, and can effectively use available cores and RAM.

Conclusion

Characteristic Stability Index can be a useful way of estimating drift in the distribution of a random variable, given a test set and a reference set. To draw informative conclusions it is best to avoid rule-of-thumb thresholds, as these ignore:

  • Size of your dataset
  • Inherent noise in your dataset

Instead, one can estimate appropriate thresholds by building a sound null hypothesis and then estimating the expected distribution of CSI under it. Modern tools such as JAX make this a relatively inexpensive and scalable exercise.

Code: https://github.com/vasasav/csi_hypothesis
