<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Jeffrey Näf on Medium]]></title>
        <description><![CDATA[Stories by Jeffrey Näf on Medium]]></description>
        <link>https://medium.com/@jeffrey_85949?source=rss-ca780798011a------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*fTRnnMS3RSMCfs2m5zRcDQ.jpeg</url>
            <title>Stories by Jeffrey Näf on Medium</title>
            <link>https://medium.com/@jeffrey_85949?source=rss-ca780798011a------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sat, 16 May 2026 02:00:00 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@jeffrey_85949/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[What Is a Good Imputation for Missing Values?]]></title>
            <link>https://medium.com/data-science/what-is-a-good-imputation-for-missing-values-e9256d45851b?source=rss-ca780798011a------2</link>
            <guid isPermaLink="false">https://medium.com/p/e9256d45851b</guid>
            <category><![CDATA[deep-dives]]></category>
            <category><![CDATA[statistics]]></category>
            <category><![CDATA[data-imputation]]></category>
            <category><![CDATA[missing-data]]></category>
            <category><![CDATA[missing-values]]></category>
            <dc:creator><![CDATA[Jeffrey Näf]]></dc:creator>
            <pubDate>Sat, 08 Jun 2024 03:11:56 GMT</pubDate>
            <atom:updated>2024-06-08T08:05:19.853Z</atom:updated>
<content:encoded><![CDATA[<h4>My current take on what imputation should be</h4><p>This article is the first summarizing and discussing my most recent <a href="https://hal.science/hal-04521894">paper</a>. We study general-purpose imputation of tabular datasets. That is, the imputation should be done in a way that works for many different tasks in a second step (sometimes referred to as “broad imputation”).</p><p>In this article, I will share 3 lessons that I learned while working on this problem over the last few years. I am very excited about this paper in particular, but also cautious, as the problem of missing values has many aspects and it can be difficult to not miss something. So I invite you to judge for yourself whether my lessons make sense to you.</p><p>If you do not want to get into a detailed discussion about missing values, my recommendations are summarized at the end of the article.</p><p><strong><em>Disclaimer:</em></strong><em> The goal of this article is to use imputation to recreate the original data distribution. While I feel this is what most researchers and practitioners actually want, this is a difficult goal that might not be necessary in all applications. For instance, when performing (conditional mean) prediction, there are several recent papers showing that even simple imputation methods are sufficient for large sample sizes.</em></p><p>All images in this article were created by the author.</p><h4>Preliminaries</h4><p>Before continuing we need to discuss how I think about missing values in this article.</p><p>We assume there is an underlying distribution <em>P*</em> from which observations <em>X*</em> are drawn. In addition, there is a vector of 0/1s of the same dimension as <em>X*</em> that is drawn, let’s call this vector <em>M</em>. The actual observed data vector <em>X</em> is then <em>X*</em> masked by <em>M</em>. Thus, we observe <em>n</em> independently and identically distributed (i.i.d.) copies of the joint vector <em>(X,M)</em>. If we write this up in a data matrix, this might look like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*fCF9TEO2vf764syUGUXPZw.png" /><figcaption>The data generating process: X* and M are drawn, then we observe n i.i.d. copies of (X,M), where X is X* but masked by M.</figcaption></figure><p>As usual, lowercase letters <em>x, m</em> denote observed (realized) values, while uppercase letters refer to random quantities. The missingness mechanisms everyone talks about are then assumptions about the relationship or joint distribution of <em>(X*,M):</em></p><p><strong>Missing Completely at Random (MCAR): </strong>The probability of a value being missing is a coin flip, independent of any variable in the dataset. Here missing values are but a nuisance. You could ignore them and just focus on the fully observed part of your dataset and there would be no bias. In math, for all <em>m </em>and <em>x</em>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/535/1*QOb9DZ-N7zgtvFdcMOKA7Q.png" /></figure><p><strong>Missing at Random (MAR):</strong> The probability of missingness can now depend on the <em>observed</em> variables in your dataset. A typical example would be two variables, say income and age, whereby age is always observed, but income might be missing for certain values of age. This is the example we study below. This may sound reasonable, but here it can get complicated. 
In math, for all <em>m</em> and <em>x</em>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/688/1*-lr0TKZuekzdT7CbHUnvrA.png" /></figure><p><strong>Missing Not at Random (MNAR):</strong> Everything is possible here, and we cannot say anything about anything in general.</p><p>The key is that for imputation, we need to learn the conditional distribution of missing values given observed values in one pattern <em>m’</em> to impute in another pattern <em>m</em>.</p><p>A well-known method of achieving this is the Multiple Imputation by Chained Equations (<strong>MICE</strong>) method: Initially fill the values with a simple imputation, such as mean imputation. Then for each iteration <em>t</em>, for each variable <em>j</em> regress the observed <em>X_j </em>on all other variables (which are imputed). Then plug in the values of these variables into the learned imputer for all <em>X_j </em>that are not observed. This is explained in detail in <a href="https://medium.com/@ofirdi/mice-is-nice-but-why-should-you-care-e66698f245a3">this article</a>, with an amazing illustration that will make things immediately clear. In R this is conveniently implemented in the <a href="https://cran.r-project.org/web/packages/mice/index.html">mice R package</a>. As I will outline below, I am a huge fan of this method, based on the performance I have seen. In fact, the ability of certain instances of MICE, such as mice-cart, to recreate the underlying distribution is uncanny. In this article, we focus on a very simple example with only one variable missing, and so we can code by hand what MICE would usually do iteratively, to better illustrate what is happening.</p><blockquote>A first mini-lesson is that MICE is a host of methods; whatever method you choose to regress <em>X_j </em>on the other variables gives you a different imputation method. As such, there are countless variants in the mice R package, such as mice-cart, mice-rf, mice-pmm, mice-norm.nob, mice-norm.predict and so on. These methods will perform very differently, as we will see below. Despite this, at least some papers (in top conferences such as NeurIPS) confidently proclaim that they compare their methods to “MICE”, without any detail on what exactly they are using.</blockquote><h4>The Example</h4><p>We will look at a very simple but illustrative example: Consider a data set with two jointly normal variables, <em>X_1, X_2</em>. We assume both variables have a variance of 1 and a positive correlation of 0.7. To give some context, we can imagine <em>X_1 </em>to be (the logarithm of) income and <em>X_2 </em>to be age. (This is just for illustration, obviously no one is between -3 and 3 years old). Moreover, assume a missing mechanism for the income <em>X_1</em>, whereby <em>X_1 </em>tends to be missing whenever age is “high”. That is, we set:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/846/1*ouF4f8_2k93dVqL94Ce-9w.png" /></figure><p>So <em>X_1 </em>(income) is missing with probability 0.8 whenever <em>X_2 </em>(age) is “large” (i.e., larger than zero). As we assume <em>X_2 </em>is always observed, this is a textbook MAR example with two patterns, one where all variables are fully observed (<em>m1</em>) and a second (<em>m2</em>), wherein <em>X_1 </em>is missing. Despite the simplicity of this example, if we assume that higher age is related to higher income, there is a<em> clear shift in the distribution of income and age when moving from one pattern to the other</em>. 
In pattern <em>m2</em>, where income is missing, values of both the observed age and the (unobserved) income tend to be higher. Let’s look at this in code:</p><pre>library(MASS)<br>library(mice)<br><br><br><br>set.seed(10)<br>n&lt;-3000<br><br>Xstar &lt;- mvrnorm(n=n, mu=c(0,0), Sigma=matrix( c(1,0.7,0.7,1), nrow=2, byrow=T   ))<br><br>colnames(Xstar) &lt;- paste0(&quot;X&quot;,1:2)<br><br><br><br>## Introduce missing mechanisms<br>M&lt;-matrix(0, ncol=ncol(Xstar), nrow=nrow(Xstar))<br>M[Xstar[,2] &gt; 0, 1]&lt;- sample(c(0,1), size=sum(Xstar[,2] &gt; 0), replace=T, prob = c(1-0.8,0.8) )<br><br><br>## This gives rise to the observed dataset by masking X^* with M:<br>X&lt;-Xstar<br>X[M==1] &lt;- NA<br><br><br>## Plot the distribution shift<br>par(mfrow=c(2,1))<br>plot(Xstar[!is.na(X[,1]),1:2], xlab=&quot;&quot;, main=&quot;&quot;, ylab=&quot;&quot;, cex=0.8, col=&quot;darkblue&quot;, xlim=c(-4,4), ylim=c(-3,3))<br>plot(Xstar[is.na(X[,1]),1:2], xlab=&quot;&quot;, main=&quot;&quot;, ylab=&quot;&quot;, cex=0.8, col=&quot;darkblue&quot;, xlim=c(-4,4), ylim=c(-3,3))</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*R1I9okHOAFcHlhKqdGh4ig.png" /><figcaption>Top: Distribution of (X_1,X_2) in the pattern where X_1 is observed, Bottom: Distribution of (X_1,X_2) in the pattern where X_1 is missing.</figcaption></figure><h4>Lesson 1: Imputation is a distributional prediction problem</h4><p>In my view, the goal of (general purpose) imputation should be to replicate the underlying data distribution as well as possible. To illustrate this, consider again the first example with <em>p=0</em>, such that only <em>X_1</em> has missing values. We will now try to impute this example, using the famous <a href="https://medium.com/@ofirdi/mice-is-nice-but-why-should-you-care-e66698f245a3">MICE </a>approach. Since only <em>X_1</em> is missing, we can implement this by hand. We start with the <em>mean imputation</em>, which simply calculates the mean of <em>X_1 </em>in the pattern where it is observed, and plugs this mean in the place of NA. We also use the <em>regression imputation</em> which is a bit more sophisticated: We regress <em>X_1 </em>onto <em>X_2 </em>in the pattern where <em>X_1 </em>is observed and then for each missing observation of <em>X_1 </em>we plug in the prediction of the regression. Thus here we impute the conditional mean of <em>X_1 </em>given <em>X_2</em>. Finally, for the <em>Gaussian imputation</em>, we start with the same regression of <em>X_1 </em>onto <em>X_2</em>, but then impute each missing value of <em>X_1 </em>by drawing from a Gaussian distribution. <em>In other words, instead of imputing the conditional expectation (i.e. just the center of the conditional distribution), we draw from this distribution.</em> This leads to a random imputation, which may be a bit counterintuitive at first, but will actually lead to the best result:</p><pre>## (0) Mean Imputation: This would correspond to &quot;mean&quot; in the mice R package ##<br><br><br># 1. Estimate the mean<br>meanX&lt;-mean(X[!is.na(X[,1]),1])<br><br>## 2. Impute<br>meanimp&lt;-X<br>meanimp[is.na(X[,1]),1] &lt;-meanX<br><br>## (1) Regression Imputation: This would correspond to &quot;norm.predict&quot; in the mice R package ##<br><br># 1. Estimate Regression<br>lmodelX1X2&lt;-lm(X1~X2, data=as.data.frame(X[!is.na(X[,1]),])   )<br><br>## 2. 
Impute<br>impnormpredict&lt;-X<br>impnormpredict[is.na(X[,1]),1] &lt;-predict(lmodelX1X2, newdata= as.data.frame(X[is.na(X[,1]),])  )<br><br><br>## (2) Gaussian Imputation: This would correspond to &quot;norm.nob&quot; in the mice R package ##<br><br># 1. Estimate Regression<br>#lmodelX1X2&lt;-lm(X1~X2, X=as.data.frame(X[!is.na(X[,1]),])   )<br># (same as before)<br><br>## 2. Impute<br>impnorm&lt;-X<br>meanx&lt;-predict(lmodelX1X2, newdata= as.data.frame(X[is.na(X[,1]),])  )<br>var &lt;- var(lmodelX1X2$residuals)<br>impnorm[is.na(X[,1]),1] &lt;-rnorm(n=length(meanx), mean = meanx, sd=sqrt(var) )<br><br><br><br>## Plot the different imputations<br><br>par(mfrow=c(2,2))<br><br><br>plot(meanimp[!is.na(X[,1]),c(&quot;X2&quot;,&quot;X1&quot;)], main=paste(&quot;Mean Imputation&quot;), cex=0.8, col=&quot;darkblue&quot;, cex.main=1.5)<br>points(meanimp[is.na(X[,1]),c(&quot;X2&quot;,&quot;X1&quot;)], col=&quot;darkred&quot;, cex=0.8 )<br><br>plot(impnormpredict[!is.na(X[,1]),c(&quot;X2&quot;,&quot;X1&quot;)], main=paste(&quot;Regression Imputation&quot;), cex=0.8, col=&quot;darkblue&quot;, cex.main=1.5)<br>points(impnormpredict[is.na(X[,1]),c(&quot;X2&quot;,&quot;X1&quot;)], col=&quot;darkred&quot;, cex=0.8 )<br><br>plot(impnorm[!is.na(X[,1]),c(&quot;X2&quot;,&quot;X1&quot;)], main=paste(&quot;Gaussian Imputation&quot;), col=&quot;darkblue&quot;, cex.main=1.5)<br>points(impnorm[is.na(X[,1]),c(&quot;X2&quot;,&quot;X1&quot;)], col=&quot;darkred&quot;, cex=0.8 )<br><br>#plot(Xstar[,c(&quot;X2&quot;,&quot;X1&quot;)], main=&quot;Truth&quot;, col=&quot;darkblue&quot;, cex.main=1.5)<br>plot(Xstar[!is.na(X[,1]),c(&quot;X2&quot;,&quot;X1&quot;)], main=&quot;Truth&quot;, col=&quot;darkblue&quot;, cex.main=1.5)<br>points(Xstar[is.na(X[,1]),c(&quot;X2&quot;,&quot;X1&quot;)], col=&quot;darkgreen&quot;, cex=0.8 )<br><br></pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*K_D_vnLDSDPojplK7R-8Ww.png" /><figcaption>The distribution of (X_1, X_2) plotted for different imputation methods. Different Imputation methods (red are the imputed points).</figcaption></figure><p>Studying this plot immediately reveals that the mean and regression imputations might not be ideal, as they completely fail at recreating the original data distribution. In contrast, the Gaussian imputation looks pretty good, in fact, I’d argue it would be hard to differentiate it from the truth. This might just seem like a technical notion, but this has consequences. Imagine you were given any of those imputed data sets and now you would like to find the regression coefficient when regressing <em>X_2 </em>onto <em>X_1 </em>(the opposite of what we did for imputation). The truth in this case is given by <em>beta=cov(X_1,X_2)/var(X_1)=0.7</em>.</p><pre>## Regressing X_2 onto X_1<br><br>## mean imputation estimate<br>lm(X2~X1, data=data.frame(meanimp))$coefficients[&quot;X1&quot;]<br>## beta= 0.61<br><br>## regression imputation estimate<br>round(lm(X2~X1, data=data.frame(impnormpredict))$coefficients[&quot;X1&quot;],2)<br>## beta= 0.90<br><br>## Gaussian imputation estimate<br>round(lm(X2~X1, data=data.frame(impnorm))$coefficients[&quot;X1&quot;],2)<br>## beta= 0.71<br><br>## Truth imputation estimate<br>round(lm(X2~X1, data=data.frame(Xstar))$coefficients[&quot;X1&quot;],2)<br>## beta= 0.71</pre><p>The Gaussian imputation is pretty close to 0.7 (0.71), and importantly, it is very close to the estimate using the full (unobserved) data! 
On the other hand, the mean imputation underestimates <em>beta</em>, while<em> the regression imputation overestimates beta</em>. The latter is natural, as the conditional mean imputation artificially inflates the relationship between variables. This effect is particularly important, as this will result in effects that are overestimated in science and (data science) practice!!</p><p>The regression imputation might seem overly simplistic. However, the key is that very commonly used imputation methods in machine learning and other fields work exactly like this. For instance, knn imputation and random forest imputation (i.e., <a href="https://academic.oup.com/bioinformatics/article/28/1/112/219101">missForest</a>). Especially the latter has been praised and recommended in several benchmarking papers and appears very widely used. However, missForest fits a Random Forest on the observed data and then simply imputes by the conditional mean. So, using it in this example the result would look very similar to the regression imputation, thus resulting in an artificial strengthening of relations between variable and biased estimates!</p><blockquote>A lot of commonly used imputation methods, such as mean imputation, knn imputation, and missForest fail at replicating the distribution. What they estimate and approximate is the (conditional) mean, and so the imputation will look like that of the regression imputation (or even worse for the mean imputation). Instead, we should try to impute by drawing from estimated (conditional) distributions.</blockquote><h4>Lesson 2: Imputation should be evaluated as a distributional prediction problem</h4><p>There is a dual problem connected to the discussion of the first lesson. How should imputation methods be evaluated?</p><p>Imagine we developed a new imputation method and now want to benchmark this against methods that exist already such as missForest, MICE, or <a href="https://arxiv.org/abs/1806.02920">GAIN</a>. In this setting, we artificially induce the missing values and so we have the actual data set just as above. We now want to compare this true dataset to our imputations. For the sake of the example, let us assume the regression imputation above is our new method, and we would like to compare it to mean and Gaussian imputation.</p><p>Even in the most prestigious conferences, this is done by calculating the root mean squared error (RMSE):</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/635/1*oT9oRh8jZIgqGCxKbX9WYw.png" /></figure><p>This is implemented here:</p><pre>## Function to calculate the RMSE:<br># impX is the imputed data set<br># Xstar is the fully observed data set<br><br>RMSEcalc&lt;-function(impX, Xstar){<br>  <br>  round(mean(apply(Xstar - impX,1,function(x) norm(as.matrix(x), type=&quot;F&quot;  ) )),2)<br>  <br>}</pre><p>This discussion is related to the discussion on how to correctly score predictions. <a href="https://medium.com/towards-data-science/how-to-evaluate-your-predictions-cef80d8f6a69">In this article</a>, I discussed that (R)MSE is the right score to evaluate (conditional) mean predictions. It turns out the exact same logic applies here; using RMSE like this to evaluate our imputation, will favor methods that impute the conditional mean, such as the regression imputation, knn imputation, and missForest.</p><p>Instead, imputation should be <em>evaluated</em> as a distributional prediction problem. 
I suggest using the <em>energy distance between the distribution of the fully observed data and the imputation “distribution”</em>. Details can be found in the paper, but in R it is easily coded using the nice “energy” R package:</p><pre>library(energy)<br><br>## Function to calculate the energy distance:<br># impX is the imputed data set<br># Xstar is the fully observed data set<br><br>## Calculating the energy distance using the eqdist.e function of the energy package<br>energycalc &lt;- function(impX, Xstar){<br>  <br>  # Note: eqdist.e calculates the energy statistics for a test, which is actually<br>  # = n^2/(2n)*energydistance(impX,Xstar), but we we are only interested in relative values<br>  round(eqdist.e( rbind(Xstar,impX), c(nrow(Xstar), nrow(impX))  ),2)<br>  <br>}</pre><p>We now apply the two scores to our imaginary research project and try to figure out whether our regression imputation is better than the other two:</p><pre>par(mfrow=c(2,2))<br><br><br>## Same plots as before, but now with RMSE and energy distance <br>## added<br><br>plot(meanimp[!is.na(X[,1]),c(&quot;X2&quot;,&quot;X1&quot;)], main=paste(&quot;Mean Imputation&quot;, &quot;\nRMSE&quot;, RMSEcalc(meanimp, Xstar), &quot;\nEnergy&quot;, energycalc(meanimp, Xstar)), cex=0.8, col=&quot;darkblue&quot;, cex.main=1.5)<br>points(meanimp[is.na(X[,1]),c(&quot;X2&quot;,&quot;X1&quot;)], col=&quot;darkred&quot;, cex=0.8 )<br><br>plot(impnormpredict[!is.na(X[,1]),c(&quot;X2&quot;,&quot;X1&quot;)], main=paste(&quot;Regression Imputation&quot;,&quot;\nRMSE&quot;, RMSEcalc(impnormpredict, Xstar), &quot;\nEnergy&quot;, energycalc(impnormpredict, Xstar)), cex=0.8, col=&quot;darkblue&quot;, cex.main=1.5)<br>points(impnormpredict[is.na(X[,1]),c(&quot;X2&quot;,&quot;X1&quot;)], col=&quot;darkred&quot;, cex=0.8 )<br><br>plot(impnorm[!is.na(X[,1]),c(&quot;X2&quot;,&quot;X1&quot;)], main=paste(&quot;Gaussian Imputation&quot;,&quot;\nRMSE&quot;, RMSEcalc(impnorm, Xstar), &quot;\nEnergy&quot;, energycalc(impnorm, Xstar)), col=&quot;darkblue&quot;, cex.main=1.5)<br>points(impnorm[is.na(X[,1]),c(&quot;X2&quot;,&quot;X1&quot;)], col=&quot;darkred&quot;, cex=0.8 )<br><br><br>plot(Xstar[!is.na(X[,1]),c(&quot;X2&quot;,&quot;X1&quot;)], main=&quot;Truth&quot;, col=&quot;darkblue&quot;, cex.main=1.5)<br>points(Xstar[is.na(X[,1]),c(&quot;X2&quot;,&quot;X1&quot;)], col=&quot;darkgreen&quot;, cex=0.8 )<br><br><br></pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ais6yaMMEz35kyvJKFs6Jg.png" /></figure><p>If we look at RMSE, then our regression imputation appears great! It beats both mean and Gaussian imputation. However this clashes with the analysis from above, and choosing the regression imputation can and likely will lead to highly biased results. On the other hand, the (scaled) energy distance correctly identifies that the Gaussian imputation is the best method, agreeing with both visual intuition and better parameter estimates.</p><blockquote>When evaluating imputation methods (when the true data are available) measures such as RMSE and MAE should be avoided. Instead, the problem should be treated and evaluated as a distributional prediction problem, and distributional metrics such as the energy distance should be used. The overuse of RMSE as an evaluation tool has some serious implications for research in this area.</blockquote><p>Again this is not surprising, identifying the best mean prediction is what RMSE does. What is surprising, is how consistently it is used in research to evaluate imputation methods. 
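</p><p>As a small aside, the energy distance used here is not a black box: for two samples, it can be computed directly from pairwise distances. The following minimal sketch (with two made-up univariate samples, purely for illustration) spells out what the <em>eqdist.e</em> call above is based on; up to the test-statistic scaling n1*n2/(n1+n2), the two computations should agree:</p><pre>## Minimal sketch: energy distance between two univariate samples (made-up numbers)
x &lt;- c(0.1, 0.5, 1.2, 2.0)
y &lt;- c(1.0, 1.4, 2.2, 3.1)

between  &lt;- mean(abs(outer(x, y, &quot;-&quot;)))  # average distance between the samples
within_x &lt;- mean(abs(outer(x, x, &quot;-&quot;)))  # average distance within x
within_y &lt;- mean(abs(outer(y, y, &quot;-&quot;)))  # average distance within y

edist &lt;- 2 * between - within_x - within_y
edist

## The test statistic returned by the energy package should match after rescaling:
# energy::eqdist.e(matrix(c(x, y)), c(length(x), length(y)))
# length(x) * length(y) / (length(x) + length(y)) * edist</pre><p>Coming back to the widespread use of RMSE: 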
In my view, this throws into question at least some recommendations of recent papers, about what imputation methods to use. Moreover, as new imputation methods get developed they are compared to other methods in terms of RMSE and are thus likely not replicating the distribution correctly. One thus has to question the usefulness of at least some of the myriad of imputation methods developed in recent years.</p><p>The question of evaluation gets much harder, <strong>when the underlying observations are not available. </strong>In the paper we develope a score that allows to rank imputation methods, even in this case! (a refinement of the idea presented in <a href="https://towardsdatascience.com/i-scores-how-to-choose-the-best-method-to-fill-in-nas-in-your-data-set-43f3f0df971f">this article</a>). The details are reserved for another medium post, but we can try it for this example. The “Iscore.R” function can be found on <a href="https://github.com/JeffNaef/MARimputation/tree/c1f5a1e48e8a60db95c727876086db5b7305f614/Useable">Github </a>or at the end of this article.</p><pre><br>library(mice)<br>source(&quot;Iscore.R&quot;)<br><br><br>methods&lt;-c(&quot;mean&quot;,       #mice-mean<br>           &quot;norm.predict&quot;,   #mice-sample<br>           &quot;norm.nob&quot;) # Gaussian Imputation<br><br>## We first define functions that allow for imputation of the three methods:<br><br>imputationfuncs&lt;-list()<br><br>imputationfuncs[[&quot;mean&quot;]] &lt;- function(X,m){ <br># 1. Estimate the mean<br>  meanX&lt;-mean(X[!is.na(X[,1]),1])<br>## 2. Impute<br>  meanimp&lt;-X<br>  meanimp[is.na(X[,1]),1] &lt;-meanX<br>  <br>  res&lt;-list()<br>  <br>  for (l in 1:m){<br>    res[[l]] &lt;- meanimp<br>  }<br>  <br>  return(res)<br>  <br>}<br><br>imputationfuncs[[&quot;norm.predict&quot;]] &lt;- function(X,m){ <br> # 1. Estimate Regression<br>  lmodelX1X2&lt;-lm(X1~., data=as.data.frame(X[!is.na(X[,1]),])   )<br> ## 2. Impute<br>  impnormpredict&lt;-X<br>  impnormpredict[is.na(X[,1]),1] &lt;-predict(lmodelX1X2, newdata= as.data.frame(X[is.na(X[,1]),])  )<br>  <br>res&lt;-list()<br><br>for (l in 1:m){<br>  res[[l]] &lt;- impnormpredict<br>}<br><br>return(res)<br>  <br>  }<br><br><br>imputationfuncs[[&quot;norm.nob&quot;]] &lt;- function(X,m){ <br> # 1. Estimate Regression<br>  lmodelX1X2&lt;-lm(X1~., data=as.data.frame(X[!is.na(X[,1]),])   )<br> ## 2. Impute<br>  impnorm&lt;-X<br>  meanx&lt;-predict(lmodelX1X2, newdata= as.data.frame(X[is.na(X[,1]),])  )<br>  var &lt;- var(lmodelX1X2$residuals)<br>  <br>  res&lt;-list()<br>  <br>  for (l in 1:m){<br>    impnorm[is.na(X[,1]),1] &lt;-rnorm(n=length(meanx), mean = meanx, sd=sqrt(var) )<br>    res[[l]] &lt;- impnorm<br>  }<br><br>  <br>  return(res)<br>  <br>}<br><br><br>scoreslist &lt;- Iscores_new(X,imputations=NULL, imputationfuncs=imputationfuncs, N=30)  <br><br>scores&lt;-do.call(cbind,lapply(scoreslist, function(x) x$score ))<br>names(scores)&lt;-methods<br>scores[order(scores)]<br><br>#    mean       norm.predict     norm.nob <br>#  -0.7455304   -0.5702136   -0.4220387 <br></pre><p>Thus <em>without every seeing the values of the missing data</em>, our score is able to identify that norm.nob is the best method! This comes in handy, especially when the data has more than two dimensions. 
I will give more details on how to use the score and how it works in a future article.</p><h4>Lesson 3: MAR is weirder than you think</h4><p>When reading the literature on missing value imputation, it is easy to get a sense that MAR is a solved case, and all the problems arise from whether it can be assumed or not. While this might be true under standard procedures such as maximum likelihood, if one wants to find a good (nonparametric) imputation, this is not the case.</p><p>Our paper discusses how complex distribution shifts are possible under MAR when changing from, say, the fully observed pattern to a pattern one wants to impute. We will focus here on the shift in distribution that can occur in the observed variables. For this, we turn to the example above, where we took <em>X_1 </em>to be income and <em>X_2 </em>to be age. As we have seen in the first figure, the distribution looks quite different across the two patterns. However, the conditional distribution of <em>X_1 | X_2</em> remains the same! This makes it possible, in principle, to identify the right imputation distribution.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*19oJoQciz20EH9omhh5_BA.png" /><figcaption>Top: Distribution of X_2 in the pattern where X_1 is observed. Bottom: Distribution of (X_1,X_2) in the pattern where X_1 is missing.</figcaption></figure><p>The problem is that even if we can nonparametrically estimate the conditional distribution of <em>X_1</em> given <em>X_2</em> in the pattern where <em>X_1 </em>is observed, we need to extrapolate it to the values of <em>X_2 </em>in the pattern where <em>X_1 </em>is missing. To illustrate this, I will now introduce two very important nonparametric mice methods. One old (<em>mice-cart</em>) and one new (<em>mice-DRF</em>). The former uses one tree to regress <em>X_j</em> on all the other variables and then imputes by drawing samples from that tree. Thus instead of using the conditional expectation prediction of a tree/forest, as missForest does, it draws from the leaves to approximate sampling from the conditional distribution. In contrast, mice-DRF uses the <a href="https://medium.com/towards-data-science/drf-a-random-forest-for-almost-everything-625fa5c3bcb8">Distributional Random Forest</a>, a forest method designed to estimate conditional distributions, and then samples from those estimated distributions. 
Both work exceedingly well, as I will lay out below!</p><pre>library(drf)<br><br><br>## mice-DRF ##<br>par(mfrow=c(2,2))<br><br>#Fit DRF<br>DRF &lt;- drf(X=X[!is.na(X[,1]),2, drop=F], Y=X[!is.na(X[,1]),1, drop=F], num.trees=100)<br>impDRF&lt;-X<br># Predict weights for unobserved points<br>wx&lt;-predict(DRF, newdata= X[is.na(X[,1]),2, drop=F]  )$weights<br>impDRF[is.na(X[,1]),1] &lt;-apply(wx,1,function(wxi) sample(X[!is.na(X[,1]),1, drop=F], size=1, replace=T, prob=wxi))<br><br><br>plot(impDRF[!is.na(X[,1]),c(&quot;X2&quot;,&quot;X1&quot;)], main=paste(&quot;DRF Imputation&quot;, &quot;\nRMSE&quot;, RMSEcalc(impDRF, Xstar), &quot;\nEnergy&quot;, energycalc(impDRF, Xstar)), cex=0.8, col=&quot;darkblue&quot;, cex.main=1.5)<br>points(impDRF[is.na(X[,1]),c(&quot;X2&quot;,&quot;X1&quot;)], col=&quot;darkred&quot;, cex=0.8 )<br><br><br>## mice-cart##<br>impcart&lt;-X<br>impcart[is.na(X[,1]),1] &lt;-mice.impute.cart(X[,1], ry=!is.na(X[,1]), X[,2, drop=F], wy = NULL)<br><br>plot(impcart[!is.na(X[,1]),c(&quot;X2&quot;,&quot;X1&quot;)], main=paste(&quot;cart Imputation&quot;, &quot;\nRMSE&quot;, RMSEcalc(impcart, Xstar), &quot;\nEnergy&quot;, energycalc(impcart, Xstar)), cex=0.8, col=&quot;darkblue&quot;, cex.main=1.5)<br>points(impcart[is.na(X[,1]),c(&quot;X2&quot;,&quot;X1&quot;)], col=&quot;darkred&quot;, cex=0.8 )<br><br>plot(impnorm[!is.na(X[,1]),c(&quot;X2&quot;,&quot;X1&quot;)], main=paste(&quot;Gaussian Imputation&quot;,&quot;\nRMSE&quot;, RMSEcalc(impnorm, Xstar), &quot;\nEnergy&quot;, energycalc(impnorm, Xstar)), col=&quot;darkblue&quot;, cex.main=1.5)<br>points(impnorm[is.na(X[,1]),c(&quot;X2&quot;,&quot;X1&quot;)], col=&quot;darkred&quot;, cex=0.8 )</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ljMLmWmuyFS9R0zoX2Aqew.png" /></figure><p>Though both mice-cart and mice-DRF do a good job, they are still not quite as good as the Gaussian imputation. This is not surprising per se, as the Gaussian imputation is the ideal imputation in this case (because <em>(X_1, X_2)</em> are indeed Gaussian). Nonetheless, the distribution shift in <em>X_2</em> likely plays a role in the difficulty mice-cart and mice-DRF have in recovering the distribution, even with 3000 observations (these methods are usually really really good). Note that this kind of extrapolation is not a problem for the Gaussian imputation.</p><p>The paper also discusses a similar, but more extreme example with two variables <em>(X_1, X_2)</em>. In this example, the distribution shift is much more pronounced, and the forest-based methods struggle accordingly:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*JLrjrzUMQB2H5nunyAr0nA.png" /><figcaption>More extreme example of distribution shift in the paper. While the Gaussian imputation is near perfect, mice-RF and mice-DRF are not able to extrapolate correctly.</figcaption></figure><p>The problem is that these kinds of extreme distribution shifts are possible under MAR, and forest-based methods have a hard time extrapolating outside of the data set (so do neural nets btw). Indeed, can you think of a method that can (1) learn a distribution nonparametrically and (2) extrapolate from <em>X_2</em> coming from the upper distribution to <em>X_2 </em>drawn from the lower distribution reliably? For now, I cannot.</p><blockquote>Imputation is hard, even if MAR can be assumed, and the search for reliable imputation methods is not over.</blockquote><h4>Conclusion: My current recommendations</h4><p>Missing values are a hairy problem. 
Indeed, the best way to deal with missing values is to not have them. Accordingly, Lesson 3 shows that the search for imputation methods is not yet concluded, even if one only considers MAR. We still lack a method that can do (1) nonparametric distributional prediction and (2) adapt to distribution shifts that are possible under MAR. That said, I also sometimes feel people make the problem more complicated than it is; some MICE methods perform extremely well and might be enough for many missing value problems already.</p><p>I first want to mention that there are very fancy machine learning methods, like <a href="https://arxiv.org/abs/1806.02920">GAIN</a> and its variants, that try to impute data using neural nets. I like these methods because they follow the right idea: Impute the conditional distributions of missing given observed. However, after using them a bit, I am somewhat disappointed by their performance, especially compared to MICE.</p><p>Thus, if I had a missing value problem, the first thing I’d try is <em>mice-cart</em> (implemented in the mice R package) or the new <em>mice-DRF</em> (code on <a href="https://github.com/JeffNaef/MARimputation/tree/c1f5a1e48e8a60db95c727876086db5b7305f614/Useable">Github</a>) we developed in the paper. I have tried those two on quite a few examples and their ability to recreate the data is uncanny. However, note that these observations of mine are not based on a large, systematic benchmarking study and should be taken with a grain of salt. Moreover, this requires at least an intermediate sample size of, say, above 200 or 300. Imputation is not easy, and completely nonparametric methods will suffer if the sample size is too low. In the case of fewer than 200 observations, I would go with simpler methods such as Gaussian imputation (<em>mice-norm.nob</em> in the R package). If you would then like to find the best out of these methods, I recommend trying our score developed in the paper, as done in Lesson 2 (though the implementation might not be the best).</p><p>Finally, note that none of these methods are able to effectively deal with <strong>imputation uncertainty</strong>! In a sense, we only discussed single imputation in this article. (Proper) multiple imputation would require that the uncertainty of the imputation method itself is taken into account, which is usually done using Bayesian methods, as in the standard multiple-imputation workflow of the mice package sketched below. 
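</p><p>A minimal sketch of that workflow, purely for illustration (the choices of m = 5, the method &quot;norm&quot;, i.e. Bayesian linear regression imputation, and the analysis model are arbitrary); the variation across the m completed datasets is what carries the imputation uncertainty into the pooled estimates:</p><pre>library(mice)

## Impute the incomplete data m times (here with Bayesian linear regression imputation)
imp &lt;- mice(as.data.frame(X), m = 5, method = &quot;norm&quot;, printFlag = FALSE)

## Fit the analysis model on each of the m completed datasets
fits &lt;- with(imp, lm(X2 ~ X1))

## Pool the m estimates with Rubin&#39;s rules
summary(pool(fits))</pre><p>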
For frequentist method like we looked at here, this appears to be an open problem.</p><h4>Appendix 1: m-I-Score</h4><p>The File “Iscore.R”, which can also be found on <a href="https://github.com/JeffNaef/MARimputation/tree/c1f5a1e48e8a60db95c727876086db5b7305f614/Useable">Github</a>.</p><pre>Iscores_new&lt;-function(X, N=50,  imputationfuncs=NULL, imputations=NULL, maxlength=NULL,...){<br>  <br>  ## X: Data with NAs<br>  ## N: Number of samples from imputation distribution H<br>  ## imputationfuncs: A list of functions, whereby each imputationfuncs[[method]] is a function that takes the arguments<br>  ## X,m and imputes X m times using method: imputations= imputationfuncs[[method]](X,m).<br>  ## imputations: Either NULL or a list of imputations for the methods considered, each imputed X saved as <br>  ##              imputations[[method]], whereby method is a string<br>  ## maxlength: Maximum number of variables X_j to consider, can speed up the code<br>  <br>  <br>  require(Matrix)<br>  require(scoringRules)<br>  <br>  <br>  numberofmissingbyj&lt;-sapply(1:ncol(X), function(j)  sum(is.na(X[,j]))  )<br>  print(&quot;Number of missing values per dimension:&quot;)<br>  print(paste0(numberofmissingbyj, collapse=&quot;,&quot;)  )<br><br>  methods&lt;-names(imputationfuncs)<br><br>  score_all&lt;-list()<br>  <br>  for (method in methods) {<br>    print(paste0(&quot;Evaluating method &quot;, method))<br>    <br>    <br>    # }<br>    if (is.null(imputations)){<br>      # If there is no prior imputation<br>      tmp&lt;-Iscores_new_perimp(X, Ximp=NULL, N=N, imputationfunc=imputationfuncs[[method]], maxlength=maxlength,...)<br>      score_all[[method]] &lt;- tmp  <br>      <br>      <br>    }else{<br>      <br>      tmp&lt;-Iscores_new_perimp(X, Ximp=imputations[[method]][[1]], N=N, imputationfunc=imputationfuncs[[method]], maxlength=maxlength, ...)<br>      score_all[[method]] &lt;- tmp  <br>      <br>    }<br>    <br>    <br>    <br>  }<br>  <br>  return(score_all)<br>  <br>}<br><br><br>Iscores_new_perimp &lt;- function(X, Ximp, N=50, imputationfunc, maxlength=NULL,...){<br>  <br>  if (is.null(Ximp)){<br>    # Impute, maxit should not be 1 here!<br>    Ximp&lt;-imputationfunc(X=X  , m=1)[[1]]<br>  }<br>  <br>  <br>  colnames(X) &lt;- colnames(Ximp) &lt;- paste0(&quot;X&quot;, 1:ncol(X))<br>  <br>  args&lt;-list(...)<br>  <br>  X&lt;-as.matrix(X)<br>  Ximp&lt;-as.matrix(Ximp)<br>  <br>  n&lt;-nrow(X)<br>  p&lt;-ncol(X)<br>  <br>  ##Step 1: Reoder the data according to the number of missing values<br>  ## (least missing first)<br>  numberofmissingbyj&lt;-sapply(1:p, function(j)  sum(is.na(X[,j]))  )<br><br>  ## Done in the function<br>  M&lt;-1*is.na(X)<br>  colnames(M) &lt;- colnames(X)<br>  <br>  indexfull&lt;-colnames(X)<br>  <br>  <br>  # Order first according to most missing values<br>  <br>  # Get dimensions with missing values (all other are not important)<br>  dimwithNA&lt;-(colSums(M) &gt; 0)<br>  dimwithNA &lt;- dimwithNA[order(numberofmissingbyj, decreasing=T)]<br>  dimwithNA&lt;-dimwithNA[dimwithNA==TRUE]<br>  <br>  if (is.null(maxlength)){maxlength&lt;-sum(dimwithNA) }<br>  <br>  if (sum(dimwithNA) &lt; maxlength){<br>    warning(&quot;maxlength was set smaller than sum(dimwithNA)&quot;)<br>    maxlength&lt;-sum(dimwithNA)<br>  }<br>  <br>  <br>  index&lt;-1:ncol(X)<br>  scorej&lt;-matrix(NA, nrow= min(sum(dimwithNA), maxlength), ncol=1)<br>  weight&lt;-matrix(NA, nrow= min(sum(dimwithNA), maxlength), ncol=1)<br>  i&lt;-0<br>  <br>  for (j in names(dimwithNA)[1:maxlength]){<br>    
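# Loop over the variables that contain missing values (most NAs first); each is scored via crps_sample below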
<br>    i&lt;-i+1<br><br>    <br>    print( paste0(&quot;Dimension &quot;, i, &quot; out of &quot;, maxlength )   ) <br>    <br>  <br>    <br>    # H for all missing values of X_j<br>    Ximp1&lt;-Ximp[M[,j]==1, ]<br>    <br>    # H for all observed values of X_j<br>    Ximp0&lt;-Ximp[M[,j]==0, ]<br>    <br>    X0 &lt;-X[M[,j]==0, ]<br>    <br>    n1&lt;-nrow(Ximp1)<br>    n0&lt;-nrow(Ximp0)<br>    <br>    <br>    if (n1 &lt; 10){<br>      scorej[i]&lt;-NA<br>      <br>      warning(&#39;Sample size of missing and nonmissing too small for nonparametric distributional regression, setting to NA&#39;)<br>      <br>    }else{<br>      <br>      <br>      # Evaluate on observed data<br>      Xtest &lt;- Ximp0[,!(colnames(Ximp0) %in% j) &amp;  (colnames(Ximp0) %in% indexfull), drop=F]<br>      Oj&lt;-apply(X0[,!(colnames(Ximp0) %in% j) &amp;  (colnames(Ximp0) %in% indexfull), drop=F],2,function(x) !any(is.na(x)) )<br>      # Only take those that are fully observed<br>      Xtest&lt;-Xtest[,Oj, drop=F]<br>      <br>      Ytest &lt;-Ximp0[,j, drop=F]<br>      <br>      if (is.null(Xtest)){<br>        scorej[i]&lt;-NA<br>        #weighted<br>        weight[i]&lt;-(n1/n)*(n0/n)<br>        warning(&quot;Oj was empty&quot;)<br>        next<br>      }<br>      <br>      ###Test 1:<br>      # Train DRF on imputed data<br>      Xtrain&lt;-Ximp1[,!(colnames(Ximp1) %in% j) &amp; (colnames(Ximp1) %in% indexfull), drop=F]<br>      # Only take those that are fully observed<br>      Xtrain&lt;-Xtrain[,Oj, drop=F]<br>      <br>      Ytrain&lt;-Ximp1[,j, drop=F]<br><br>      <br>      Xartificial&lt;-cbind(c(rep(NA,nrow(Ytest)),c(Ytrain)),rbind(Xtest, Xtrain)   )<br>      colnames(Xartificial)&lt;-c(colnames(Ytrain), colnames(Xtrain))<br>      <br>      Imputationlist&lt;-imputationfunc(X=Xartificial  , m=N)<br>      <br>      Ymatrix&lt;-do.call(cbind, lapply(Imputationlist, function(x)  x[1:nrow(Ytest),1]  ))<br>      <br>      scorej[i] &lt;- -mean(sapply(1:nrow(Ytest), function(l)  { crps_sample(y = Ytest[l,], dat = Ymatrix[l,]) }))<br>      <br>    }<br>    <br>    <br>    <br>    #weighted<br>    weight[i]&lt;-(n1/n)*(n0/n)<br>    <br>  }<br>  <br>  scorelist&lt;-c(scorej)<br>  names(scorelist) &lt;- names(dimwithNA)[1:maxlength]<br>  weightlist&lt;-c(weight)<br>  names(weightlist) &lt;- names(dimwithNA)[1:maxlength]<br>  <br>  weightedscore&lt;-scorej*weight/(sum(weight, na.rm=T))<br>  <br>  ## Weight the score according to n0/n * n1/n!!<br>  return( list(score= sum(weightedscore, na.rm=T), scorelist=scorelist, weightlist=weightlist)  )<br>}</pre><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=e9256d45851b" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/what-is-a-good-imputation-for-missing-values-e9256d45851b">What Is a Good Imputation for Missing Values?</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to Evaluate Your Predictions]]></title>
            <link>https://medium.com/data-science/how-to-evaluate-your-predictions-cef80d8f6a69?source=rss-ca780798011a------2</link>
            <guid isPermaLink="false">https://medium.com/p/cef80d8f6a69</guid>
            <category><![CDATA[calibration]]></category>
            <category><![CDATA[predictions]]></category>
            <category><![CDATA[model-evaluation]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[thoughts-and-theory]]></category>
            <dc:creator><![CDATA[Jeffrey Näf]]></dc:creator>
            <pubDate>Fri, 17 May 2024 06:59:57 GMT</pubDate>
            <atom:updated>2024-06-06T15:25:42.409Z</atom:updated>
<content:encoded><![CDATA[<h4>Be mindful of the measure you choose</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*VVF7D8zvpTOgZY2O" /><figcaption>Photo by <a href="https://unsplash.com/@isaacmsmith?utm_source=medium&amp;utm_medium=referral">Isaac Smith</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>Testing and benchmarking machine learning models by comparing their predictions on a test set, even after deployment, is of fundamental importance. To do this, one needs to think of a measure or <em>score </em>that takes a prediction and a test point and assigns a value measuring how successful the prediction is with respect to the test point. However, one should think carefully about which scoring measure is appropriate. In particular, when choosing a method to evaluate a prediction, we should adhere to the idea of <em>proper scoring rules</em>. I only give a loose definition of this idea here, but basically, we want a score that is minimized at the thing we want to measure!</p><blockquote>As a general rule: One can use MSE to evaluate mean predictions, MAE to evaluate median predictions, the quantile score to evaluate more general quantile predictions, and the energy or MMD score to evaluate distributional predictions.</blockquote><p>Consider a variable you want to predict, say a random variable <em>Y</em>, from a vector of covariates <strong><em>X</em></strong>. In the example below, <em>Y</em> will be income and <strong><em>X</em></strong> will be certain characteristics, such as <em>age </em>and <em>education</em>. We learned a predictor <em>f</em> on some training data and now we predict <em>Y</em> as <em>f(</em><strong><em>x</em></strong><em>)</em>. Usually, when we want to predict a variable <em>Y</em> as well as possible, we predict the expectation of <em>Y</em> given <strong>x</strong>, i.e. <em>f(</em><strong><em>x</em></strong><em>)</em> should approximate <em>E[Y | </em><strong><em>X</em></strong><em>=</em><strong><em>x</em></strong><em>]</em>. But more generally, <em>f(</em><strong><em>x</em></strong><em>)</em> could be an estimator of the median, other quantiles, or even the full conditional distribution <em>P(Y | </em><strong><em>X</em></strong><em>=</em><strong><em>x</em></strong><em>)</em>.</p><p>Now, for a new test point <em>y</em>, we want to score the prediction; that is, we want a function <em>S(y,f(</em><strong><em>x</em></strong><em>))</em> that is <em>minimized </em>(in expectation)<em> </em>when <em>f(</em><strong><em>x</em></strong><em>)</em> is the best thing we can do. For instance, if we want to predict<em> E[Y | </em><strong><em>X</em></strong><em>=</em><strong><em>x</em></strong><em>]</em>, this score is given by the MSE: <em>S(y, f(</em><strong><em>x</em></strong><em>))= (y-f(</em><strong><em>x</em></strong><em>))²</em>.</p><p>Here we study the principle of scoring the predictor <em>f</em> over a test set of <em>(y_i,</em><strong><em>x</em></strong><em>_i), i=1,…,ntest</em> in more detail. In all examples, we will compare the ideal estimation method to another that is clearly wrong, or naive, and show that our scores do what they are supposed to. The full code used here can also be found on <a href="https://github.com/JeffNaef/Medium-Articles/blob/main/HowtoScoreprediction.R">Github</a>.</p><h4>The Example</h4><p>To illustrate things, I will simulate a simple dataset that should mimic income data. 
We will use this simple example throughout this article to illustrate the concepts.</p><pre>library(dplyr)<br><br><br>#Create some variables:<br># Simulate data for 100 individuals<br>n &lt;- 5000<br><br># Generate age between 20 and 60<br>age &lt;- round(runif(n, min = 20, max = 60))<br><br># Define education levels<br>education_levels &lt;- c(&quot;High School&quot;, &quot;Bachelor&#39;s&quot;, &quot;Master&#39;s&quot;)<br><br># Simulate education level probabilities<br>education_probs &lt;- c(0.4, 0.4, 0.2)<br><br># Sample education level based on probabilities<br>education &lt;- sample(education_levels, n, replace = TRUE, prob = education_probs)<br><br># Simulate experience correlated with age with some random error<br>experience &lt;- age - 20 + round(rnorm(n, mean = 0, sd = 3)) <br><br># Define a non-linear function for wage<br>wage &lt;- exp((age * 0.1) + (case_when(education == &quot;High School&quot; ~ 1,<br>                                 education == &quot;Bachelor&#39;s&quot; ~ 1.5,<br>                                 TRUE ~ 2)) + (experience * 0.05) + rnorm(n, mean = 0, sd = 0.5))<br><br>hist(wage)</pre><p>Although this simulation may be oversimplified, it reflects certain well-known characteristics of such data: older age, advanced education, and greater experience are all linked to higher wages. The use of the “exp” operator results in a highly skewed wage distribution, which is a consistent observation in such datasets.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*1aWAHQVZwUbumKf_OjSMKw.png" /><figcaption>Wage distribution over the whole simulated population. Source: Author</figcaption></figure><p>Crucially, this skewness is also present when we fix age, education and experience to certain values. Let’s imagine we look at a specific person, Dave, who is 30 years old, has a Bachelor’s in Economics and 10 years of experience and let’s look at his actual income distribution according to our data generating process:</p><pre>ageDave&lt;-30<br>educationDave&lt;-&quot;Bachelor&#39;s&quot;<br>experienceDave &lt;- 10<br><br><br>wageDave &lt;- exp((ageDave * 0.1) + (case_when(educationDave == &quot;High School&quot; ~ 1,<br>                                     educationDave == &quot;Bachelor&#39;s&quot; ~ 1.5,<br>                                     TRUE ~ 2)) + (experienceDave * 0.05) + rnorm(n, mean = 0, sd = 0.5))<br><br>hist(wageDave, main=&quot;Wage Distribution for Dave&quot;, xlab=&quot;Wage&quot;)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*M4Hrt419D9695CFoSY9CtA.png" /><figcaption>Wage distrbution for Dave. 
Source: Author</figcaption></figure><p>Thus the distribution of possible wages of Dave, given the information we have about him, is still highly skewed.</p><p>We also generate a test set of several people:</p><pre><br>## Generate test set<br>ntest&lt;-1000<br><br># Generate age between 20 and 60<br>agetest &lt;- round(runif(ntest, min = 20, max = 60))<br><br><br># Sample education level based on probabilities<br>educationtest &lt;- sample(education_levels, ntest, replace = TRUE, prob = education_probs)<br><br># Simulate experience correlated with age with some random error<br>experiencetest &lt;- agetest - 20 + round(rnorm(ntest, mean = 0, sd = 3))<br><br><br>## Generate ytest that we try to predict:<br><br>wagetest &lt;- exp((agetest * 0.1) + (case_when(educationtest == &quot;High School&quot; ~ 1,<br>                                             educationtest == &quot;Bachelor&#39;s&quot; ~ 1.5,<br>                                             TRUE ~ 2)) + (experiencetest * 0.05) + rnorm(ntest, mean = 0, sd = 0.5))</pre><p>We now start simple and first look at the scores for mean and median prediction.</p><h4>The scores for mean and median prediction</h4><p>In data science and machine learning, interest often centers on a single number that signifies the “center” or “middle” of the distribution we aim to predict, namely the (conditional) mean or median. To do this we have the mean squared error (MSE):</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/381/1*OP_7Wi50Pb0CKAxRu_A6zw.png" /></figure><p>and the mean absolute error (MAE):</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/352/1*u1NgcuBR_rGwPjesApBZug.png" /></figure><p>An important takeaway is that the MSE is the appropriate metric for predicting the conditional mean, while the MAE is the measure to use for the conditional median. Mean and median are not the same thing for skewed distributions like the one we study here.</p><p>Let us illustrate this for the above example with very simple estimators (that we would not have access to in real life), just for illustration:</p><pre>conditionalmeanest &lt;-<br>  function(age, education, experience, N = 1000) {<br>    mean(exp((age * 0.1) + (<br>      case_when(<br>        education == &quot;High School&quot; ~ 1,<br>        education == &quot;Bachelor&#39;s&quot; ~ 1.5,<br>        TRUE ~ 2<br>      )<br>    ) + (experience * 0.05) + rnorm(N, mean = 0, sd = 0.5)<br>    ))<br>  }<br><br><br>conditionalmedianest &lt;-<br>  function(age, education, experience, N = 1000) {<br>    median(exp((age * 0.1) + (<br>      case_when(<br>        education == &quot;High School&quot; ~ 1,<br>        education == &quot;Bachelor&#39;s&quot; ~ 1.5,<br>        TRUE ~ 2<br>      )<br>    ) + (experience * 0.05) + rnorm(N, mean = 0, sd = 0.5)<br>    ))<br>  }</pre><p>That is we estimate mean and median, by simply simulating from the model for fixed values of age, education, and experience (this would be a simulation from the correct conditional distribution) and then we simply take the mean/median of that. 
Let’s test this on Dave:</p><pre><br>hist(wageDave, main=&quot;Wage Distribution for Dave&quot;, xlab=&quot;Wage&quot;)<br>abline(v=conditionalmeanest(ageDave, educationDave, experienceDave), col=&quot;darkred&quot;, cex=1.2)<br>abline(v=conditionalmedianest(ageDave, educationDave, experienceDave), col=&quot;darkblue&quot;, cex=1.2)<br><br></pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*d0FBBXFK6sbyW6eMfWulHg.png" /><figcaption>Blue: estimated conditional median of Dave, Red: estimated conditional mean of Dave. Source: Author</figcaption></figure><p>Clearly, the mean and median are different, as one would expect from such a distribution. In fact, as is typical for income distributions, the mean is higher (more influenced by high values) than the median.</p><p>Now let’s use these estimators on the test set:</p><pre>Xtest&lt;-data.frame(age=agetest, education=educationtest, experience=experiencetest)<br><br>meanest&lt;-sapply(1:nrow(Xtest), function(j)  conditionalmeanest(Xtest$age[j], Xtest$education[j], Xtest$experience[j])  )<br>median&lt;-sapply(1:nrow(Xtest), function(j)  conditionalmedianest(Xtest$age[j], Xtest$education[j], Xtest$experience[j])  )<br></pre><p>This gives a diverse range of conditional mean/median values. Now we calculate MSE and MAE:</p><pre>(MSE1&lt;-mean((meanest-wagetest)^2))<br>(MSE2&lt;-mean((median-wagetest)^2))<br><br>MSE1 &lt; MSE2<br>### Method 1 (the true mean estimator) is better than method 2!<br><br># but the MAE of method 1 is actually worse!<br>(MAE1&lt;-mean(abs(meanest-wagetest)) )<br>(MAE2&lt;-mean( abs(median-wagetest)))<br><br>MAE1 &lt; MAE2<br>### Method 2 (the true median estimator) is better than method 1!</pre><p>This shows what is known theoretically: MSE is minimized for the (conditional) expectation <em>E[Y | </em><strong><em>X</em></strong><em>=</em><strong><em>x</em></strong><em>]</em>, while MAE is minimized at the conditional median. <em>In general, it does not make sense to use the MAE when you try to evaluate your mean prediction. </em>In a lot of applied research and data science, people use the MAE or both to evaluate mean predictions (I know because I did it myself). While this may be warranted in certain applications, this can have serious consequences for distributions that are not symmetric, as we saw in this example: When looking at the MAE, method 1 looks worse than method 2, even though the former estimates the mean correctly. In fact, because mean and median are far apart in this highly skewed example, method 2 has a markedly lower MAE than method 1, despite method 1 being the correct mean prediction.</p><blockquote>To score conditional mean predictions, use the mean squared error (MSE) and not the mean absolute error (MAE). The MAE is minimized for the conditional median.</blockquote><h4>Scores for quantile and interval prediction</h4><p>Assume we want to score an estimate <em>f(</em><strong><em>x</em></strong><em>)</em> of the quantile <em>q_</em><strong><em>x</em></strong> such that</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/410/1*vlr6vLIH4yICPioIitU25w.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xE6hqSj7OOL1MQZngjwtrA.png" /><figcaption>Simple quantile illustration. 
Source: Author</figcaption></figure><p>In this case, we can consider the quantile score:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Hx2gl1j94UK1hL7FKfVCZw.png" /></figure><p>whereby</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/563/1*vTQp7nGHv3sk1ajnbi9Byg.png" /></figure><p>To unpack this formula, we can consider two cases:</p><p>(1) <em>y</em> is smaller than <em>f(</em><strong><em>x):</em></strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/498/1*RZfxK2xu6v_3UzuQ8-Mmpw.png" /></figure><p>i.e. we incur a penalty which gets bigger the further away <em>y</em> is from <em>f(</em><strong><em>x).</em></strong></p><p>(2) <em>y</em> is larger than <em>f(</em><strong><em>x):</em></strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/408/1*5gHjeAsBZwLtDKG_4CwnoQ.png" /></figure><p>i.e. a penalty which gets bigger the further away <em>y</em> is from <em>f(</em><strong><em>x).</em></strong></p><p>Notice that the weight is such that for a high <em>alpha</em>, having the estimated quantile <em>f(</em><strong><em>x</em></strong><em>)</em> smaller than <em>y</em> gets penalized more. This is by design and ensures that the right quantile is indeed the minimizer of the expected value of <em>S(y,f(</em><strong><em>x</em></strong><em>))</em> over y. This score is in fact the <em>quantile loss </em>(up to a factor 2), see e.g. this <a href="https://towardsdatascience.com/quantile-loss-and-quantile-regression-b0689c13f54d">nice article</a>. It is implemented in the <em>quantile_score </em>function of the package <em>scoringutils</em> in R. Finally, note that for <em>alpha=0.5:</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*KAU-FXaNpvtSH19oh2EQUQ.png" /></figure><p>simply the MAE! This makes sense, as the 0.5 quantile is the median.</p><p>With the power to predict quantiles, we can also build prediction intervals. Consider (<em>l_</em><strong><em>x</em></strong><em>, u_</em><strong><em>x)</em></strong>, where <em>l_</em><strong><em>x</em></strong> ≤ <em>u_</em><strong><em>x</em></strong> are quantiles such that</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/604/1*ZY_F-1xKiDt3XxQwzA_aOw.png" /></figure><p>In fact, this is met if <em>l_</em><strong><em>x</em></strong><em> </em>is<em> </em>the <em>alpha/2</em> quantile, and <em>u_</em><strong><em>x</em></strong> is the <em>1-alpha/2</em> quantile. Thus we now estimate and score these two quantiles. Consider <em>f(</em><strong><em>x</em></strong><em>)=(f_1(</em><strong><em>x</em></strong><em>), f_2(</em><strong><em>x</em></strong><em>))</em>, whereby <em>f_1(</em><strong><em>x</em></strong><em>)</em> to be an estimate of <em>l_</em><strong><em>x</em></strong><em> </em>and <em>f_2(</em><strong><em>x</em></strong><em>)</em> an estimate of <em>u_</em><strong><em>x. 
</em></strong>We provide two estimators, the “ideal” one that simulates again from the true process to then estimate the required quantiles and a “naive” one, which has the right coverage but is too big:</p><pre>library(scoringutils)<br><br>## Define conditional quantile estimation<br>conditionalquantileest &lt;-<br>  function(probs, age, education, experience, N = 1000) {<br>    quantile(exp((age * 0.1) + (<br>      case_when(<br>        education == &quot;High School&quot; ~ 1,<br>        education == &quot;Bachelor&#39;s&quot; ~ 1.5,<br>        TRUE ~ 2<br>      )<br>    ) + (experience * 0.05) + rnorm(N, mean = 0, sd = 0.5)<br>    )<br>    , probs =<br>      probs)<br>  }<br><br>## Define a very naive estimator that will still have the required coverage<br>lowernaive &lt;- 0<br>uppernaive &lt;- max(wage)<br><br># Define the quantile of interest<br>alpha &lt;- 0.05<br><br>lower &lt;-<br>  sapply(1:nrow(Xtest), function(j)<br>    conditionalquantileest(alpha / 2, Xtest$age[j], Xtest$education[j], Xtest$experience[j]))<br>upper &lt;-<br>  sapply(1:nrow(Xtest), function(j)<br>    conditionalquantileest(1 - alpha / 2, Xtest$age[j], Xtest$education[j], Xtest$experience[j]))<br><br><br><br>## Calculate the scores for both estimators<br><br># 1. Score the alpha/2 quantile estimate<br>qs_lower &lt;- mean(quantile_score(wagetest,<br>                           predictions = lower,<br>                           quantiles = alpha / 2))<br># 2. Score the alpha/2 quantile estimate<br>qs_upper &lt;- mean(quantile_score(wagetest,<br>                           predictions = upper,<br>                           quantiles = 1 - alpha / 2))<br><br># 1. Score the alpha/2 quantile estimate<br>qs_lowernaive &lt;- mean(quantile_score(wagetest,<br>                                predictions = rep(lowernaive, ntest),<br>                                quantiles = alpha / 2))<br># 2. Score the alpha/2 quantile estimate<br>qs_uppernaive &lt;- mean(quantile_score(wagetest,<br>                                predictions = rep(uppernaive, ntest),<br>                                quantiles = 1 - alpha / 2))<br><br># Construct the interval score by taking the average<br>(interval_score &lt;- (qs_lower + qs_upper) / 2)<br># Score of the ideal estimator: 187.8337<br><br># Construct the interval score by taking the average<br>(interval_scorenaive &lt;- (qs_lowernaive + qs_uppernaive) / 2)<br># Score of the naive estimator: 1451.464</pre><p>Again we can clearly see that, on average, the correct estimator has a much lower score than the naive one!</p><p>Thus with the quantile score, we have a reliable way of scoring individual quantile predictions. However, the way of averaging the score of the upper and lower quantiles for the prediction interval might seem ad hoc. Luckily it turns out that this leads to the so-called <em>interval score</em>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*OYOG3EFagAWDm02ZSefZ7Q.png" /></figure><p>Thus through some algebraic magic, we can score a prediction interval by averaging the scores for the <em>alpha/2</em> and the <em>1-alpha/2</em> quantiles as we did. Interestingly, the resulting interval score rewards narrow prediction intervals, and induces a penalty, the size of which depends on <em>alpha</em>, if the observation misses the interval. 
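</p><p>To see this behavior concretely, here is a minimal sketch of the (unweighted) interval score for a single observation, with made-up numbers; the scoringutils function used below applies an additional weighting when weigh = TRUE, which is why it matches the average of the two quantile scores rather than the raw formula:</p><pre>## Minimal sketch: interval score for one observation y and a central (1-alpha) interval [l, u]
interval_score_single &lt;- function(y, l, u, alpha) {
  (u - l) +
    (2 / alpha) * (l - y) * (y &lt; l) +   # penalty if y falls below the interval
    (2 / alpha) * (y - u) * (y &gt; u)     # penalty if y falls above the interval
}

interval_score_single(y = 5,  l = 4, u = 10, alpha = 0.05)  # inside the interval: just the width, 6
interval_score_single(y = 12, l = 4, u = 10, alpha = 0.05)  # outside: 6 + (2/0.05)*2 = 86</pre><p>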
Instead of using the average of quantile scores, we can also directly calculate this score with the package <em>scoringutils</em>:</p><pre>alpha &lt;- 0.05<br>mean(interval_score(<br>  wagetest,<br>  lower=lower,<br>  upper=upper,<br>  interval_range=(1-alpha)*100,<br>  weigh = T,<br>  separate_results = FALSE<br>))<br># Score of the ideal estimator: 187.8337</pre><p>This is the exact same number we got above when averaging the scores of the two quantiles.</p><blockquote>The quantile score implemented in R in the package scoringutils can be used to score quantile predictions. If one wants to score a prediction interval directly, the interval_score function can be used.</blockquote><h4>Scores for distributional prediction</h4><p>More and more fields have to deal with <em>distributional prediction</em>. Luckily, there are even scores for this problem. In particular, here I focus on what is called the <em>energy score:</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ipPWmgZwD83RMCDTgiAF7Q.png" /></figure><p>for <em>f(</em><strong><em>x</em></strong><em>)</em> being an estimate of the distribution <em>P(Y | </em><strong><em>X</em></strong><em>=</em><strong><em>x</em></strong><em>). </em>The second term takes the expectation of the Euclidean distance between two independent samples from <em>f(</em><strong><em>x</em></strong><em>). </em>This acts as a normalizing term, establishing the value obtained if the same distribution were compared to itself. The first term then compares the sample point <em>y</em> to a draw <em>X</em> from <em>f(</em><strong><em>x</em></strong><em>). </em>In expectation (over <em>Y </em>drawn from<em> P(Y | </em><strong><em>X</em></strong><em>=</em><strong><em>x</em></strong><em>))</em> this will be minimized if <em>f(</em><strong><em>x</em></strong><em>)=P(Y | </em><strong><em>X</em></strong><em>=</em><strong><em>x</em></strong><em>).</em></p><p>Thus, instead of just predicting the mean or the quantiles, we now try to predict the whole distribution of wage at each test point. Essentially, we try to predict and evaluate the conditional distribution we plotted for Dave above. This is a bit more complicated; how exactly do we represent a learned distribution? In practice, this is resolved by assuming we can obtain a sample from the predicted distribution. Thus we compare a sample of <em>N</em> draws from the predicted distribution to a single test point.
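</p><p>To see what this score does mechanically, here is a minimal sketch of the sample-based energy score for a single scalar observation, following the formula above (the first term minus half the second); the packaged estimator used next may treat the pairwise term slightly differently:</p><pre>## Sketch: hand-rolled sample-based energy score for one scalar observation y<br>## and a sample yhat drawn from the predicted distribution<br>energy_score_manual &lt;- function(y, yhat) {<br>  term1 &lt;- mean(abs(yhat - y))                 # estimates E|X - y|, X drawn from f(x)<br>  term2 &lt;- mean(abs(outer(yhat, yhat, &quot;-&quot;)))   # estimates E|X - X&#39;|, X, X&#39; drawn from f(x)<br>  term1 - 0.5 * term2<br>}</pre><p>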
This can be done in R using <em>es_sample </em>from the <em>scoringRules </em>package:</p><pre>library(scoringRules)<br><br>## Ideal &quot;estimate&quot;: Simply sample from the true conditional distribution<br>## P(Y | X=x) for each test point x<br>distributionestimate &lt;-<br>  function(age, education, experience, N = 100) {<br>    exp((age * 0.1) + (<br>      case_when(<br>        education == &quot;High School&quot; ~ 1,<br>        education == &quot;Bachelor&#39;s&quot; ~ 1.5,<br>        TRUE ~ 2<br>      )<br>    ) + (experience * 0.05) + rnorm(N, mean = 0, sd = 0.5))<br>  }<br><br>## Naive Estimate: Only sample from the error distribution, without including the<br>## information of each person.<br>distributionestimatenaive &lt;-<br>  function(age, education, experience, N = 100) {<br>    exp(rnorm(N, mean = 0, sd = 0.5))<br>  }<br><br>scoretrue &lt;- mean(sapply(1:nrow(Xtest), function(j)  {<br>  wageest &lt;-<br>    distributionestimate(Xtest$age[j], Xtest$education[j], Xtest$experience[j])<br>  return(scoringRules::es_sample(y = wagetest[j], dat = matrix(wageest, nrow=1)))<br>}))<br><br>scorenaive &lt;- mean(sapply(1:nrow(Xtest), function(j)  {<br>  wageest &lt;-<br>    distributionestimatenaive(Xtest$age[j], Xtest$education[j], Xtest$experience[j])<br>  return(scoringRules::es_sample(y = wagetest[j], dat = matrix(wageest, nrow=1)))<br>}))<br><br>## scoretrue: 761.026<br>## scorenaive: 2624.713</pre><p>In the above code, we again compare the “perfect” estimate (i.e. sampling from the true distribution <em>P(Y | </em><strong><em>X</em></strong><em>=</em><strong><em>x</em></strong><em>)) </em>to a very naive one, namely one that does not consider any information on age, education, or experience. Again, the score reliably identifies the better of the two methods.</p><blockquote>The energy score, implemented in the R package scoringRules, can be used to score distributional predictions if a sample from the predicted distribution is available.</blockquote><h4>Conclusion</h4><p>We have looked at different ways of scoring predictions. Thinking about the right measure to test predictions is important, as the wrong measure might make us choose and keep the wrong model for our prediction task.</p><p>It should be noted that, especially for distributional prediction, scoring is a difficult task and the score might not have much power in practice. That is, even a method that leads to a large improvement might only have a slightly smaller score.
However, this is not a problem per se, as long as the score is able to reliably identify the better of the two methods.</p><h4>References</h4><p>[1] Tilmann Gneiting &amp; Adrian E Raftery (2007) Strictly Proper Scoring Rules, Prediction, and Estimation, Journal of the American Statistical Association, 102:477, 359–378, DOI: <a href="https://doi.org/10.1198/016214506000001437">10.1198/016214506000001437</a></p><h4>Appendix: All the code in one place</h4><p>This file can also be found on <a href="https://github.com/JeffNaef/Medium-Articles/blob/main/HowtoScoreprediction.R">Github</a>.</p><pre>library(dplyr)<br><br>#Create some variables:<br># Simulate data for 100 individuals<br>n &lt;- 5000<br><br># Generate age between 20 and 60<br>age &lt;- round(runif(n, min = 20, max = 60))<br><br># Define education levels<br>education_levels &lt;- c(&quot;High School&quot;, &quot;Bachelor&#39;s&quot;, &quot;Master&#39;s&quot;)<br><br># Simulate education level probabilities<br>education_probs &lt;- c(0.4, 0.4, 0.2)<br><br># Sample education level based on probabilities<br>education &lt;- sample(education_levels, n, replace = TRUE, prob = education_probs)<br><br># Simulate experience correlated with age with some random error<br>experience &lt;- age - 20 + round(rnorm(n, mean = 0, sd = 3)) <br><br># Define a non-linear function for wage<br>wage &lt;- exp((age * 0.1) + (case_when(education == &quot;High School&quot; ~ 1,<br>                                     education == &quot;Bachelor&#39;s&quot; ~ 1.5,<br>                                     TRUE ~ 2)) + (experience * 0.05) + rnorm(n, mean = 0, sd = 0.5))<br><br>hist(wage)<br><br><br><br>ageDave&lt;-30<br>educationDave&lt;-&quot;Bachelor&#39;s&quot;<br>experienceDave &lt;- 10<br><br>wageDave &lt;- exp((ageDave * 0.1) + (case_when(educationDave == &quot;High School&quot; ~ 1,<br>                                             educationDave == &quot;Bachelor&#39;s&quot; ~ 1.5,<br>                                             TRUE ~ 2)) + (experienceDave * 0.05) + rnorm(n, mean = 0, sd = 0.5))<br><br>hist(wageDave, main=&quot;Wage Distribution for Dave&quot;, xlab=&quot;Wage&quot;)<br><br><br><br>## Generate test set<br>ntest&lt;-1000<br><br># Generate age between 20 and 60<br>agetest &lt;- round(runif(ntest, min = 20, max = 60))<br><br># Sample education level based on probabilities<br>educationtest &lt;- sample(education_levels, ntest, replace = TRUE, prob = education_probs)<br><br># Simulate experience correlated with age with some random error<br>experiencetest &lt;- agetest - 20 + round(rnorm(ntest, mean = 0, sd = 3))<br><br>## Generate ytest that we try to predict:<br><br>wagetest &lt;- exp((agetest * 0.1) + (case_when(educationtest == &quot;High School&quot; ~ 1,<br>                                             educationtest == &quot;Bachelor&#39;s&quot; ~ 1.5,<br>                                             TRUE ~ 2)) + (experiencetest * 0.05) + rnorm(ntest, mean = 0, sd = 0.5))<br><br><br><br><br><br>conditionalmeanest &lt;-<br>  function(age, education, experience, N = 1000) {<br>    mean(exp((age * 0.1) + (<br>      case_when(<br>        education == &quot;High School&quot; ~ 1,<br>        education == &quot;Bachelor&#39;s&quot; ~ 1.5,<br>        TRUE ~ 2<br>      )<br>    ) + (experience * 0.05) + rnorm(N, mean = 0, sd = 0.5)<br>    ))<br>  }<br><br>conditionalmedianest &lt;-<br>  function(age, education, experience, N = 1000) {<br>    median(exp((age * 0.1) + (<br>      case_when(<br>        education == &quot;High School&quot; ~ 1,<br>        
education == &quot;Bachelor&#39;s&quot; ~ 1.5,<br>        TRUE ~ 2<br>      )<br>    ) + (experience * 0.05) + rnorm(N, mean = 0, sd = 0.5)<br>    ))<br>  }<br><br><br>hist(wageDave, main=&quot;Wage Distribution for Dave&quot;, xlab=&quot;Wage&quot;)<br>abline(v=conditionalmeanest(ageDave, educationDave, experienceDave), col=&quot;darkred&quot;, cex=1.2)<br>abline(v=conditionalmedianest(ageDave, educationDave, experienceDave), col=&quot;darkblue&quot;, cex=1.2)<br><br><br><br>Xtest&lt;-data.frame(age=agetest, education=educationtest, experience=experiencetest)<br><br>meanest&lt;-sapply(1:nrow(Xtest), function(j)  conditionalmeanest(Xtest$age[j], Xtest$education[j], Xtest$experience[j])  )<br>median&lt;-sapply(1:nrow(Xtest), function(j)  conditionalmedianest(Xtest$age[j], Xtest$education[j], Xtest$experience[j])  )<br><br><br><br>(MSE1&lt;-mean((meanest-wagetest)^2))<br>(MSE2&lt;-mean((median-wagetest)^2))<br><br>MSE1 &lt; MSE2<br>### Method 1 (the true mean estimator) is better than method 2!<br><br># but the MAE is actually worse of method 1!<br>(MAE1&lt;-mean(abs(meanest-wagetest)) )<br>(MAE2&lt;-mean( abs(median-wagetest)))<br><br>MAE1 &lt; MAE2<br>### Method 2 (the true median estimator) is better than method 1!<br><br><br><br><br><br><br><br><br>library(scoringutils)<br><br>## Define conditional quantile estimation<br>conditionalquantileest &lt;-<br>  function(probs, age, education, experience, N = 1000) {<br>    quantile(exp((age * 0.1) + (<br>      case_when(<br>        education == &quot;High School&quot; ~ 1,<br>        education == &quot;Bachelor&#39;s&quot; ~ 1.5,<br>        TRUE ~ 2<br>      )<br>    ) + (experience * 0.05) + rnorm(N, mean = 0, sd = 0.5)<br>    )<br>    , probs =<br>      probs)<br>  }<br><br>## Define a very naive estimator that will still have the required coverage<br>lowernaive &lt;- 0<br>uppernaive &lt;- max(wage)<br><br># Define the quantile of interest<br>alpha &lt;- 0.05<br><br>lower &lt;-<br>  sapply(1:nrow(Xtest), function(j)<br>    conditionalquantileest(alpha / 2, Xtest$age[j], Xtest$education[j], Xtest$experience[j]))<br>upper &lt;-<br>  sapply(1:nrow(Xtest), function(j)<br>    conditionalquantileest(1 - alpha / 2, Xtest$age[j], Xtest$education[j], Xtest$experience[j]))<br><br>## Calculate the scores for both estimators<br><br># 1. Score the alpha/2 quantile estimate<br>qs_lower &lt;- mean(quantile_score(wagetest,<br>                                predictions = lower,<br>                                quantiles = alpha / 2))<br># 2. Score the alpha/2 quantile estimate<br>qs_upper &lt;- mean(quantile_score(wagetest,<br>                                predictions = upper,<br>                                quantiles = 1 - alpha / 2))<br><br># 1. Score the alpha/2 quantile estimate<br>qs_lowernaive &lt;- mean(quantile_score(wagetest,<br>                                     predictions = rep(lowernaive, ntest),<br>                                     quantiles = alpha / 2))<br># 2. 
Score the alpha/2 quantile estimate<br>qs_uppernaive &lt;- mean(quantile_score(wagetest,<br>                                     predictions = rep(uppernaive, ntest),<br>                                     quantiles = 1 - alpha / 2))<br><br># Construct the interval score by taking the average<br>(interval_score &lt;- (qs_lower + qs_upper) / 2)<br># Score of the ideal estimator: 187.8337<br><br># Construct the interval score by taking the average<br>(interval_scorenaive &lt;- (qs_lowernaive + qs_uppernaive) / 2)<br># Score of the naive estimator: 1451.464<br><br><br>library(scoringRules)<br><br>## Ideal &quot;estimate&quot;: Simply sample from the true conditional distribution <br>## P(Y | X=x) for each sample point x<br>distributionestimate &lt;-<br>  function(age, education, experience, N = 100) {<br>    exp((age * 0.1) + (<br>      case_when(<br>        education == &quot;High School&quot; ~ 1,<br>        education == &quot;Bachelor&#39;s&quot; ~ 1.5,<br>        TRUE ~ 2<br>      )<br>    ) + (experience * 0.05) + rnorm(N, mean = 0, sd = 0.5))<br>  }<br><br>## Naive Estimate: Only sample from the error distribution, without including the <br>## information of each person.<br>distributionestimatenaive &lt;-<br>  function(age, education, experience, N = 100) {<br>    exp(rnorm(N, mean = 0, sd = 0.5))<br>  }<br><br>scoretrue &lt;- mean(sapply(1:nrow(Xtest), function(j)  {<br>  wageest &lt;-<br>    distributionestimate(Xtest$age[j], Xtest$education[j], Xtest$experience[j])<br>  return(scoringRules::es_sample(y = wagetest[j], dat = matrix(wageest, nrow=1)))<br>}))<br><br>scorenaive &lt;- mean(sapply(1:nrow(Xtest), function(j)  {<br>  wageest &lt;-<br>    distributionestimatenaive(Xtest$age[j], Xtest$education[j], Xtest$experience[j])<br>  return(scoringRules::es_sample(y = wagetest[j], dat = matrix(wageest, nrow=1)))<br>}))<br><br>## scoretrue: 761.026<br>## scorenaive: 2624.713<br><br></pre><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=cef80d8f6a69" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/how-to-evaluate-your-predictions-cef80d8f6a69">How to Evaluate Your Predictions</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Random Forests in 2023: Modern Extensions of a Powerful Method]]></title>
            <link>https://medium.com/data-science/random-forests-in-2023-modern-extensions-of-a-powerful-method-b62debaf1d62?source=rss-ca780798011a------2</link>
            <guid isPermaLink="false">https://medium.com/p/b62debaf1d62</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[thoughts-and-theory]]></category>
            <category><![CDATA[random-forest]]></category>
            <category><![CDATA[math]]></category>
            <category><![CDATA[algorithms]]></category>
            <dc:creator><![CDATA[Jeffrey Näf]]></dc:creator>
            <pubDate>Tue, 07 Nov 2023 19:04:16 GMT</pubDate>
            <atom:updated>2026-01-21T22:37:26.120Z</atom:updated>
            <content:encoded><![CDATA[<h4>Random Forests came a long way</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*FBb3Z6GdjDWpqxMSlTHkpA.png" /><figcaption>Features of modern Random Forest methods. Source: Author.</figcaption></figure><p>In terms of Machine Learning timelines, Random Forests (RFs), introduced in the seminal paper of Breimann ([1]), are ancient. Despite their age, they keep impressing with their performance and are a topic of active research. The goal of this article is to highlight what a versatile toolbox Random Forest methods have become, focussing on <strong>Generalized Random Forest (GRF)</strong> and <strong>Distributional Random Forest (DRF)</strong>.</p><p>In short, the main idea underlying both methods is that the weights implicitly produced by RF can be used to estimate targets other than the conditional expectation. The idea of GRF is to use a Random Forest with a splitting criterion that is adapted to the target one has in mind (e.g., conditional mean, conditional quantiles, or the conditional treatment effect). The idea of DRF is to adapt the splitting criterion such that the whole conditional distribution can be estimated. From this object, many different targets can then be derived in a second step. In fact, I mostly talk about DRF in this article, as I am more familiar with this method and it is somewhat more streamlined (only one forest has to be fitted for a wide range of targets). However, all the advantages, indicated in the figure above, also apply to GRF and in fact, the DRF package in R is built upon the <a href="https://grf-labs.github.io/grf/">professional implementation of GRF</a>. Moreover, the fact that the splitting criterion of GRF forests is adapted to the target means it can have better performance than DRF. This is particularly true for binary <em>Y</em>, where probability_forests() should be used. So, even though I talk mostly about DRF, GRF should be kept in mind throughout this article.</p><p>The goal of this article is to provide an overview with links to deeper reading in the corresponding sections. We will go through each of the points in the above figure clock-wise, reference the corresponding articles, and highlight it with a little example. I first quickly summarize the most important links to further reading below:</p><p>Versatility/Performance: <a href="https://towardsdatascience.com/drf-a-random-forest-for-almost-everything-625fa5c3bcb8">Medium Article</a> and Original Papers (<a href="https://jmlr.org/papers/v23/21-0585.html">DRF</a>/<a href="https://projecteuclid.org/journals/annals-of-statistics/volume-47/issue-2/Generalized-random-forests/10.1214/18-AOS1709.full">GRF</a>)</p><p>Missing Values Incorporated: <a href="https://towardsdatascience.com/random-forests-and-missing-values-3daaea103db0">Medium Article</a></p><p>Uncertainty Measures: <a href="https://towardsdatascience.com/inference-for-distributional-random-forests-64610bbb3927">Medium Article</a></p><p>Variable Importance: <a href="https://medium.com/towards-data-science/variable-importance-in-random-forests-20c6690e44e0">Medium Article</a></p><blockquote>The full code for this article can be found on <a href="https://github.com/JeffNaef/Medium-Articles/blob/main/RF_2023.R">Github</a>.</blockquote><h4>The Example</h4><p>We take <em>X_1, X_2, X_4, …, X_10</em> independently uniform between (-1,1) and create dependence between <em>X_1</em> and <em>X_3</em> by taking <em>X_3=X_1 + uniform error</em>. 
Then we simulate <em>Y</em> as</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/800/1*fVFn4lfuSjRoVZRsgdLRLA.png" /></figure><pre>## Load packages and functions needed<br>library(drf)<br>library(mice)<br><br>## Set parameters<br>set.seed(10)<br>n&lt;-1000<br><br>##Simulate Data that experiences both a mean as well as sd shift<br># Simulate from X<br>x1 &lt;- runif(n,-1,1)<br>x2 &lt;- runif(n,-1,1)<br>x3 &lt;- x1+ runif(n,-1,1)<br>X0 &lt;- matrix(runif(7*n,-1,1), nrow=n, ncol=7)<br>Xfull &lt;- cbind(x1,x2, x3, X0)<br>colnames(Xfull)&lt;-paste0(&quot;X&quot;, 1:10)<br><br># Simulate dependent variable Y<br>Y &lt;- as.matrix(rnorm(n,mean = 0.8*(x1 &gt; 0), sd = 1 + 1*(x2 &gt; 0)))<br><br>##Also add MAR missing values using ampute from the mice package<br>X&lt;-ampute(Xfull)$amp<br><br>head(cbind(Y,X))<br><br>              Y          X1          X2          X3          X4         X5<br>[1,]  2.8230938  0.01495641  0.61541265  0.72124459  0.22751381 -0.2418149<br>[2,] -0.5506897 -0.38646299  0.42035842 -1.02849810 -0.78525849 -0.8512734<br>[3,]  0.1359363 -0.14618467          NA  0.07900811  0.03722511  0.5516002<br>[4,]  2.7453314  0.38620416  0.01956594  1.17097982  0.76213426  0.9813759<br>[5,]  0.3702340 -0.82972806  0.13751720 -0.63962047 -0.23973421 -0.4339398<br>[6,] -0.3955091 -0.54912677 -0.33686360 -0.76832645 -0.55324807 -0.4658600<br>             X6         X7         X8         X9        X10<br>[1,] -0.4815087  0.3548534  0.2010427 -0.8215880  0.4515748<br>[2,] -0.4561582 -0.8561308  0.3944132  0.5845884 -0.1409658<br>[3,] -0.2494117  0.1858109 -0.7487566  0.8355041  0.6216888<br>[4,]  0.6688906 -0.4936409 -0.2557489         NA -0.8048351<br>[5,]  0.5194512  0.4799429 -0.8374205  0.3405841 -0.9950487<br>[6,]  0.7471006  0.8961717 -0.0435499  0.4200485  0.6398618</pre><p>Notice that with the function ampute from the <a href="https://cran.r-project.org/web/packages/mice/index.html">mice package</a>, we put <strong>Missing at Random (MAR)</strong> missing values on <strong><em>X</em></strong> to highlight the ability of GRF/DRF to deal with missing values. Moreover, in the above process only <em>X_1</em> and <em>X_2</em> are relevant for predicting <em>Y</em>; all other variables are “noise” variables. Such a “sparse” setting might actually be common in real-life datasets.</p><p>We now choose a test point for this example that we will use throughout:</p><pre>x&lt;-matrix(c(0.2, 0.4, runif(8,-1,1)), nrow=1, ncol=10)<br>print(x)<br><br>     [,1] [,2]      [,3]      [,4]      [,5]      [,6]    [,7]      [,8]<br>[1,]  0.2  0.4 0.7061058 0.8364877 0.2284314 0.7971179 0.78581 0.5310279<br>           [,9]     [,10]<br>[1,] -0.5067102 0.6918785</pre><h4>Versatility</h4><p>DRF estimates the conditional distribution <em>P_{Y|</em><strong><em>X</em></strong><em>=</em><strong><em>x</em></strong><em>}</em><strong> </strong>in the form of simple weights:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/533/1*IoIqGJmp72jyZFDKeppCbQ.png" /></figure><p>From these weights, a wide range of targets can be calculated, or they can be used to simulate from the conditional distribution.
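</p><p>As a minimal sketch of what this means in practice (assuming a fitted drf object <em>DRF</em> and the one-row test point <em>x</em>, both defined in the surrounding code), any weighted functional of the training responses becomes an estimate of the corresponding conditional target:</p><pre>## Sketch: targets as weighted functionals of the training responses Y,<br>## using the weights returned by predict() for the test point x<br>w &lt;- predict(DRF, newdata=x)$weights[1,]<br>sum(w * Y)           # estimate of the conditional mean E[Y | X=x]<br>sum(w * (Y &lt;= 0))    # estimate of the conditional probability P(Y &lt;= 0 | X=x)</pre><p>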
A good reference for its versatility is the original <a href="https://jmlr.org/papers/v23/21-0585.html">research article</a>, where a lot of examples were used, as well as the corresponding <a href="https://towardsdatascience.com/drf-a-random-forest-for-almost-everything-625fa5c3bcb8">medium article</a>.</p><p>In the example, we first simulate from this distribution:</p><pre>DRF&lt;-drf(X=X, Y=Y, ci.group.size=2000/50,num.trees=1000, min.node.size = 5)<br>DRFpred&lt;-predict(DRF, newdata=x, estimate.uncertainty=TRUE)<br><br>## Sample from P_{Y| X=x}<br>Yxs&lt;-Y[sample(1:n, size=n, replace = T, DRFpred$weights[1,])]<br>hist(Yxs, prob=T)<br>z&lt;-seq(-6,7,by=0.01)<br>d&lt;-dnorm(z, mean=0.8 * (x[1] &gt; 0), sd=(1+(x[2] &gt; 0)))<br>lines(z,d, col=&quot;darkred&quot;  )</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*iPArBRb5TKPHk1iShEcvWg.png" /><figcaption>Histogram of the simulated conditional distribution overlaid with the true density (in red). Source: Author.</figcaption></figure><p>The plot shows the approximate draws from the conditional distribution overlaid with the true density in red. We now use this to estimate the <strong>conditional expectation</strong> and the <strong>conditional (0.05, 0.95) quantiles</strong> at <strong><em>x:</em></strong></p><pre># Calculate quantile prediction as weighted quantiles from Y<br>qx &lt;- quantile(Yxs, probs = c(0.05,0.95))<br><br># Calculate conditional mean prediction<br>mux &lt;- mean(Yxs)<br><br># True quantiles<br>q1&lt;-qnorm(0.05, mean=0.8 * (x[1] &gt; 0), sd=(1+(x[2] &gt; 0)))<br>q2&lt;-qnorm(0.95, mean=0.8 * (x[1] &gt; 0), sd=(1+(x[2] &gt; 0)))<br>mu&lt;-0.8 * (x[1] &gt; 0)<br><br>hist(Yxs, prob=T)<br>z&lt;-seq(-6,7,by=0.01)<br>d&lt;-dnorm(z, mean=0.8 * (x[1] &gt; 0), sd=(1+(x[2] &gt; 0)))<br>lines(z,d, col=&quot;darkred&quot;  )<br>abline(v=q1,col=&quot;darkred&quot; )<br>abline(v=q2, col=&quot;darkred&quot; )<br>abline(v=qx[1], col=&quot;darkblue&quot;)<br>abline(v=qx[2], col=&quot;darkblue&quot;)<br>abline(v=mu, col=&quot;darkred&quot;)<br>abline(v=mux, col=&quot;darkblue&quot;)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*er3AqUPXIkth0Ifrp9OF5w.png" /><figcaption>Histogram of the simulated conditional distribution overlaid with the true density (in red). Additionally, the estimated conditional expectation and the conditional (0.05, 0.95) quantiles are in blue, with true values in red. Source: Author.</figcaption></figure><p>Likewise, many targets can be calculated with GRF, only in this case for each of those two targets a different forest would need to be fit. In particular, <em>regression_forest()</em> for the conditional expectation and <em>quantile_forest()</em> for the quantiles.</p><p>What GRF cannot do is deal with multivariate targets, which is possible with DRF as well.</p><h4>Performance</h4><p>Despite all the work on powerful new (nonparametric) methods, such as neural nets, tree-based methods are consistently able to beat competitors on tabular data. 
See e.g., this <a href="https://arxiv.org/abs/2207.08815">fascinating paper</a>, or this older paper on the <a href="https://dl.acm.org/doi/pdf/10.5555/2627435.2697065">strength of RF in classification</a>.</p><p>To be fair, with parameter tuning, <a href="https://towardsdatascience.com/an-overview-of-boosting-methods-catboost-xgboost-adaboost-lightboost-histogram-based-gradient-407447633ac1">boosted tree methods</a>, such as XGboost, often take the lead, at least when it comes to classical prediction (which corresponds to conditional expectation estimation). Nonetheless, the robust performance RF methods tend to have without any tuning is remarkable. Moreover, there has also been work on improving the performance of Random Forests, for example, the <a href="https://arxiv.org/pdf/2308.15384.pdf">hedged Random Forest approach</a>.</p><h4>Missing Values Incorporated</h4><p>“Missing incorporated in attributes criterion” (MIA) from <a href="https://www.sciencedirect.com/science/article/abs/pii/S0167865508000305">this paper</a> is a very simple but very powerful idea that allows tree-based methods to handle missing data. This was implemented in the GRF R package and so it is also available in DRF. The details are also explained in this <a href="https://towardsdatascience.com/random-forests-and-missing-values-3daaea103db0">medium article</a>. As simple as the concept is, it works remarkably well in practice: In the above example, DRF had no trouble handling substantial MAR missingness in the training data <strong><em>X</em></strong> (!)</p><h4><strong>Uncertainty Measures</strong></h4><p>As a statistician I don’t just want point estimates (even of a distribution), but also a measure of <strong>estimation uncertainty</strong> of my parameters (even if the “parameter” is my whole distribution). Turns out that a simple additional subsampling baked into DRF/GRF allows for a principled uncertainty quantification for large sample sizes. The theory behind this in the case of DRF is derived in this <a href="https://arxiv.org/abs/2302.05761">research article</a>, but I also explain it in this <a href="https://towardsdatascience.com/inference-for-distributional-random-forests-64610bbb3927">medium article</a>. 
GRF has all the theory in the <a href="https://projecteuclid.org/journals/annals-of-statistics/volume-47/issue-2/Generalized-random-forests/10.1214/18-AOS1709.full">original paper</a>.</p><p>We adapt this for the above example:</p><pre># Calculate uncertainty<br>alpha&lt;-0.05<br>B&lt;-nrow(DRFpred$weights.uncertainty[[1]])<br>qxb&lt;-matrix(NaN, nrow=B, ncol=2)<br>muxb&lt;-matrix(NaN, nrow=B, ncol=1)<br>for (b in 1:B){<br>  Yxsb&lt;-Y[sample(1:n, size=n, replace = T, DRFpred$weights.uncertainty[[1]][b,])]<br>  qxb[b,] &lt;- quantile(Yxsb, probs = c(0.05,0.95))<br>  muxb[b] &lt;- mean(Yxsb)<br>}<br><br>CI.lower.q1 &lt;- qx[1] - qnorm(1-alpha/2)*sqrt(var(qxb[,1]))<br>CI.upper.q1 &lt;- qx[1] + qnorm(1-alpha/2)*sqrt(var(qxb[,1]))<br><br>CI.lower.q2 &lt;- qx[2] - qnorm(1-alpha/2)*sqrt(var(qxb[,2]))<br>CI.upper.q2 &lt;- qx[2] + qnorm(1-alpha/2)*sqrt(var(qxb[,2]))<br><br>CI.lower.mu &lt;- mux - qnorm(1-alpha/2)*sqrt(var(muxb))<br>CI.upper.mu &lt;- mux + qnorm(1-alpha/2)*sqrt(var(muxb))<br><br>hist(Yxs, prob=T)<br>z&lt;-seq(-6,7,by=0.01)<br>d&lt;-dnorm(z, mean=0.8 * (x[1] &gt; 0), sd=(1+(x[2] &gt; 0)))<br>lines(z,d, col=&quot;darkred&quot;  )<br>abline(v=q1,col=&quot;darkred&quot; )<br>abline(v=q2, col=&quot;darkred&quot; )<br>abline(v=qx[1], col=&quot;darkblue&quot;)<br>abline(v=qx[2], col=&quot;darkblue&quot;)<br>abline(v=mu, col=&quot;darkred&quot;)<br>abline(v=mux, col=&quot;darkblue&quot;)<br>abline(v=CI.lower.q1, col=&quot;darkblue&quot;, lty=2)<br>abline(v=CI.upper.q1, col=&quot;darkblue&quot;, lty=2)<br>abline(v=CI.lower.q2, col=&quot;darkblue&quot;, lty=2)<br>abline(v=CI.upper.q2, col=&quot;darkblue&quot;, lty=2)<br>abline(v=CI.lower.mu, col=&quot;darkblue&quot;, lty=2)<br>abline(v=CI.upper.mu, col=&quot;darkblue&quot;, lty=2)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*51q42CSUNthXU1yF2bFsWw.png" /><figcaption>Histogram of the simulated conditional distribution overlaid with the true density (in red). Additionally, the estimated conditional expectation and the conditional (0.05, 0.95) quantiles are in blue, with true values in red. Moreover, the dashed red lines are the confidence intervals for the estimates as calculated by DRF. Source: Author.</figcaption></figure><p>As can be seen from the above code, we essentially have <em>B</em> subtrees that can be used to calculate the measure each time. From these <em>B</em> samples of mean and quantiles, we can then calculate variances and use a normal approximation to obtain (asymptotic) confidence intervals seen in the dashed line in the Figure. <em>Again, all of this can be done despite the missing values in </em><strong><em>X</em></strong>(!)</p><p><strong>Variable Importance</strong></p><p>A final important aspect of Random Forests is efficiently calculated variable importance measures. While traditional measures are somewhat ad hoc, for the traditional RF and for DRF, there are now principled measures available, as explained in this <a href="https://medium.com/towards-data-science/variable-importance-in-random-forests-20c6690e44e0">medium article</a>. For RF, the Sobol-MDA method reliably identifies the variables most important for conditional expectation estimation, whereas for DRF, the MMD-MDA identifies the variables most important for the estimation of the distribution overall. As discussed in the article, using the idea of <em>projected Random Forests</em>, these measures can be very efficiently implemented. 
We demonstrate this in the example with a less efficient implementation of the MMD variable importance measure:</p><pre>## Variable importance for conditional Quantile Estimation<br><br><br>## For the conditional quantiles we use a measure that considers the whole distribution,<br>## i.e. the MMD based measure of DRF.<br>MMDVimp &lt;- compute_drf_vimp(X=X,Y=Y, print=F)<br>sort(MMDVimp, decreasing = T)<br><br>         X2          X1          X8          X9          X3         X10          X6 <br>0.812070057 0.149333294 0.015815104 0.007078924 0.006750151 0.005492872 0.001984300 <br>         X4          X5          X7 <br>0.000000000 0.000000000 0.000000000 </pre><p>Here both <em>X_1</em> and <em>X_2</em> are correctly identified as being the most relevant variable when trying to estimate the distribution. Remarkably, despite the dependence of <em>X_3</em> and <em>X_1</em>, the measure correctly quantifies that <em>X_3</em> is not important for the prediction of the distribution of <em>Y</em>. This is something that the original MDA measure of Random Forests tends to do wrong, as demonstrated in the medium article. Moreover, notice again that the missing values in <strong><em>X</em></strong> are no problem here.</p><h4>Conclusion</h4><p>GRF/DRF and also the traditional Random Forest should not be missing in the toolbox of any data scientist. While methods like XGboost can have a better performance in traditional prediction, the many strengths of modern RF-based approaches render them an incredibly versatile tool.</p><p>Of course, one should keep in mind that these methods are still fully nonparametric, and a lot of data points are needed for the fit to make sense. This is in particularly true for the uncertainty quantification, which is only valid asymptotically, i.e. for “large” samples.</p><h4>Literature</h4><p>[1] Breiman, L. (2001). Random forests. Machine learning, 45(1):5–32.</p><h4>Appendix: Variable Importance Code</h4><pre>require(drf)<br>require(Matrix)<br>require(kernlab)<br><br><br>#&#39; Variable importance for Distributional Random Forests<br>#&#39;<br>#&#39; @param X Matrix with input training data.<br>#&#39; @param Y Matrix with output training data.<br>#&#39; @param X_test Matrix with input testing data. If NULL, out-of-bag estimates are used.<br>#&#39; @param num.trees Number of trees to fit DRF. Default value is 500 trees.<br>#&#39; @param silent If FALSE, print variable iteration number, otherwise nothing is print. 
Default is FALSE.<br>#&#39;<br>#&#39; @return The list of importance values for all input variables.<br>#&#39; @export<br>#&#39;<br>#&#39; @examples<br>compute_drf_vimp &lt;- function(X, Y, X_test = NULL, num.trees = 500, silent = FALSE){<br>  <br>  # fit initial DRF<br>  bandwidth_Y &lt;- drf:::medianHeuristic(Y)<br>  k_Y &lt;- rbfdot(sigma = bandwidth_Y)<br>  K &lt;- kernelMatrix(k_Y, Y, Y)<br>  DRF &lt;- drf(X, Y, num.trees = num.trees)<br>  wall &lt;- predict(DRF, X_test)$weights<br>  <br>  # compute normalization constant<br>  wbar &lt;- colMeans(wall)<br>  wall_wbar &lt;- sweep(wall, 2, wbar, &quot;-&quot;)<br>  I0 &lt;- as.numeric(sum(diag(wall_wbar %*% K %*% t(wall_wbar))))<br>  <br>  # compute drf importance dropping variables one by one<br>  I &lt;- sapply(1:ncol(X), function(j) {<br>    if (!silent){print(paste0(&#39;Running importance for variable X&#39;, j, &#39;...&#39;))}<br>    DRFj &lt;- drf(X = X[, -j, drop=F], Y = Y, num.trees = num.trees) <br>    DRFpredj &lt;- predict(DRFj, X_test[, -j])<br>    wj &lt;- DRFpredj$weights<br>    Ij &lt;- sum(diag((wj - wall) %*% K %*% t(wj - wall)))/I0<br>    return(Ij)<br>  })<br>  <br>  # compute retraining bias<br>  DRF0 &lt;- drf(X = X, Y = Y, num.trees = num.trees)<br>  DRFpred0 = predict(DRF0, X_test)<br>  w0 &lt;- DRFpred0$weights<br>  vimp0 &lt;- sum(diag((w0 - wall) %*% K %*% t(w0 - wall)))/I0<br>  <br>  # compute final importance (remove bias &amp; truncate negative values)<br>  vimp &lt;- sapply(I - vimp0, function(x){max(0,x)})<br>  <br>  names(vimp)&lt;-colnames(X)<br>  <br>  return(vimp)<br>  <br>}</pre><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b62debaf1d62" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/random-forests-in-2023-modern-extensions-of-a-powerful-method-b62debaf1d62">Random Forests in 2023: Modern Extensions of a Powerful Method</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Variable Importance in Random Forests]]></title>
            <link>https://medium.com/data-science/variable-importance-in-random-forests-20c6690e44e0?source=rss-ca780798011a------2</link>
            <guid isPermaLink="false">https://medium.com/p/20c6690e44e0</guid>
            <category><![CDATA[deep-dives]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[interpretable-ml]]></category>
            <category><![CDATA[math]]></category>
            <category><![CDATA[random-forest]]></category>
            <dc:creator><![CDATA[Jeffrey Näf]]></dc:creator>
            <pubDate>Fri, 03 Nov 2023 06:06:47 GMT</pubDate>
            <atom:updated>2023-11-03T06:06:47.913Z</atom:updated>
            <content:encoded><![CDATA[<h4>Traditional Methods and New Developments</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*aUfHbsy_RKEozl6YDrt2rw.png" /><figcaption>Features of (Distributional) Random Forests. In this article: The ability to produce variable importance. Source: Author.</figcaption></figure><p>Random Forest and generalizations (in particular, Generalized Random Forests (GRF) and Distributional Random Forests (DRF) ) are powerful and easy-to-use machine learning methods that should not be absent in the toolbox of any data scientist. They not only show robust performance over a large range of datasets without the need for tuning, but can also easily handle <a href="https://medium.com/towards-data-science/random-forests-and-missing-values-3daaea103db0">missing values</a>, and even provide <a href="https://medium.com/towards-data-science/inference-for-distributional-random-forests-64610bbb3927">confidence intervals</a>. In this article, we focus on another feature they are able to provide: notions of feature importance. In particular, we focus on:</p><ol><li>Traditional Random Forest (RF), which is used to predict the conditional expectation of a variable <em>Y</em> given p predictors <strong><em>X</em></strong>.</li><li>The <a href="https://medium.com/towards-data-science/drf-a-random-forest-for-almost-everything-625fa5c3bcb8">Distributional Random Forest</a>, which is used to predict the whole conditional distribution of a d-variate <strong><em>Y</em></strong> given <em>p</em> predictors <strong><em>X</em></strong>.</li></ol><p>Unfortunately, like many modern machine learning methods, both forests lack interpretability. That is, there are so many operations involved, it seems impossible to determine what the functional relationship between the predictors and <em>Y</em> actually is. A common way to tackle this problem is to define Variable Importance measures (VIMP), that at least help decide which predictors are important. Generally, this has two different objectives:</p><p>(1) finding a small number of variables with maximal accuracy,</p><p>(2) detecting and ranking all influential variables to focus on for further exploration.</p><p>The difference between (1) and (2) matters as soon as there is dependence between the elements in <strong><em>X</em></strong> (so pretty much always). For example, if two variables are highly correlated together and with <em>Y</em>, one of the two inputs can be removed without hurting accuracy for objective (1), since both variables convey the same information. However, both should be included for objective (2), since these two variables may have different meanings in practice for domain experts.</p><p>Today we focus on (1) and try to find a smaller number of predictors that display more or less the same predictive accuracy. For instance, in the wage example below, we are able to reduce the number of predictors from 79 to about 20, with only a small reduction in accuracy. These most important predictors contain variables such as age and education which are well-known to influence wages. There are also many great articles on medium about (2), using Shapley values such as <a href="https://towardsdatascience.com/from-shapley-to-shap-understanding-the-math-e7155414213b">this one</a> or <a href="https://towardsdatascience.com/understand-the-working-of-shap-based-on-shapley-values-used-in-xai-in-the-most-simple-way-d61e4947aa4e">this one</a>. 
There is also very recent and exciting <a href="https://hal.science/hal-03232621/document">academic literature</a> on how to efficiently calculate Shapley values with Random Forest. But this is material for a second article.</p><p>The two measures we look at today are actually more general variable importance measures that can be used for any method, based on the drop-and-relearn principle which we will look at below. We focus exclusively on tree-based methods here, however. Moreover, we don’t go into great detail explaining the methods, but rather try to focus on their applications and why newer versions are preferable to the more traditional ones.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/794/1*f9VzF3CQwtrf4hqCxBPI1A.png" /><figcaption>Overview of Variable Importance Measures for Random Forests. Mean Decrease Impurity (MDI) and Mean Decrease Accuracy (MDA) were both postulated by Breiman. Due to their empirical nature, however, several problems remained, which were recently addressed by Sobol-MDA. Source: Author</figcaption></figure><h4>The Beginnings</h4><p>Variable importance measures for RFs are in fact as old as RF itself. The first accuracy the <strong>Mean Decrease Accuracy (MDA)</strong> was proposed by Breiman in his seminal Random Forest paper [1]. The idea is simple: For every dimension <em>j=1,…,p</em>, one compares the accuracy of the full prediction with the accuracy of the prediction when <em>X_j</em> is randomly permuted. The idea of this is to break the relationship between <em>X_j</em> and <em>Y</em> and compare the accuracy when <em>X_j</em> is not helping to predict <em>Y</em> by design, to the case when it is potentially of use.</p><p>There are various different versions of MDA implemented in R and Python:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*M0o2LHQj9yNCDHb_ghcYzQ.png" /><figcaption>Different Versions of MDA, implemented in different packages. Source: Table 1 in [3]</figcaption></figure><p>Unfortunately, permuting variable <em>X_j</em> in this way not only breaks its relationship to <em>Y</em>, but also to the other variables in <strong><em>X</em></strong>. This is not a problem if <em>X_j</em> is independent from all other variables, but it becomes a problem once there is dependence. Consequently, [3] is able to show that as soon as there is dependence in <strong><em>X</em></strong>, the MDA converges to something nonsensical. In particular, MDA can give high importance to a variable <em>X_j</em> that is not important to predict <em>Y</em>, but is highly correlated with another variable, say <em>X_l</em>, that is actually important for predicting <em>Y</em> (as demonstrated in the example below). At the same time, it can fail to detect variables that are actually relevant, as demonstrated by a long list of papers in [3, Section 2.1]. Intuitively, what we would want to measure is the performance of the model if <em>X_j</em> is not included, and instead, we measure the performance of a model with a permuted <em>X_j</em> variable.</p><p>The second traditional accuracy measure is <strong>Mean Decrease Impurity (MDI)</strong>, which sums the weighted decreases of impurity over all nodes that split on a given covariate, averaged over all trees in the forest. Unfortunately, MDI is ill-defined from the start (it&#39;s not clear what it should measure) and several papers highlight the practical problem of this approach (e.g. 
[5]) As such, we will not go into detail about MDI, as MDA is often the preferred choice.</p><h4>Modern Developments I: Sobol-MDA</h4><p>For the longest time, I thought these somewhat informal measures were the best we could do. One paper that changed that, came out only very recently. In this paper, the authors demonstrate theoretically that the popular measures above are actually quite flawed and do not measure what we want to measure. So the first question might be: What do we actually want to measure? One potential answer: The Sobol-index (originally proposed in the computer science literature):</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/554/1*XoCfjmKeOOa3HwIsQgkTjw.png" /></figure><p>Let’s unpack this. First, <em>tau(</em><strong><em>X</em></strong><em>)=E[ Y | </em><strong><em>X</em></strong><em>] </em>is the conditional expectation function we would like to estimate. This is a random variable because it is a function of the random <strong><em>X</em></strong>. Now <strong><em>X</em></strong><em>^{(-j)}</em> is the <em>p-1</em> vector with covariate <em>j</em> removed. Thus <em>ST^{(j)}</em> is the reduction in output explained variance if the <em>j</em>th output variable is removed.</p><p>The above is the more traditional way of writing the measure. However, for me writing:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/808/1*wwKfDohoxS4N19RONHB_og.png" /></figure><p>is much more intuitive. Here <em>d</em> is a distance between two random vectors and for the <em>ST^{(j)}</em> above, this distance is simply the usual Euclidean distance. Thus the upper part of <em>ST^{(j)}</em> is simply measuring the average squared distance between what we want (<em>tau(</em><strong><em>X</em></strong><em>)</em>) and what we get without variable <em>j</em>. The latter is</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*I3KvLW0oeT_o6e7ZYGryXg.png" /></figure><p>The question becomes how to estimate this efficiently. It turns out that the intuitive drop-and-relearn principle would be enough: Simply estimating <em>tau(</em><strong><em>X</em></strong><em>) </em>using RF and then dropping <em>X_j</em> and refitting the RF to obtain an estimate of <em>tau(</em><strong><em>X^{(-j)}</em></strong><em>), </em>one obtains the consistent estimator<em>:</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/821/1*Ppzj0U4VlYE8rlyLlRB2gQ.png" /></figure><p>where <em>tau_n(</em><strong><em>X</em></strong><em>_i)</em> is the RF estimate for a test point <strong><em>X</em></strong><em>_i</em> using all <em>p</em> predictors and similarly <em>tau_n(</em><strong><em>X</em></strong><em>_i^{(-j)})</em> is the refitted forest using only <em>p-1</em> predictors.</p><p>However, this means the forest needs to be refitted <em>p</em> times, not very efficient when <em>p</em> is large! As such the authors in [3] develop what they call the <strong>Sobol-MDA</strong>. Instead of refitting the forest each time, the forest is only fitted once. Then test points are dropped down the same forest and the resulting prediction is “projected” to form the measure in (1). That is, splits on <em>X_j</em> are simply ignored (remember the goal is to obtain an estimate without <em>X_j</em>). The authors are able to show that calculating (1) above with this projected approach also results in a consistent estimator! 
This is a beautiful idea indeed and renders the algorithm applicable even in high dimensions.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-IdKh0Bud0QfI-w472q0-Q.png" /><figcaption>Illustration of the projection approach. On the left the division of the two-dimensional space by RF. On the right the projection approach ignores splits in X^(2), thereby removing it when making predictions. As can be seen the point X gets projected onto X^{(-j)} on the right using this principle. Source: Figure 1 in [3]</figcaption></figure><p>The method is implemented in R in the <a href="https://gitlab.com/drti/sobolmda">soboldMDA </a>package, based on the very fast <a href="https://cran.r-project.org/web/packages/ranger/ranger.pdf">ranger</a> package.</p><h4>Modern Developments II: MMD-based sensitivity index</h4><p>Looking at the formulation using the distance <em>d</em>, a natural question is to ask whether different distances could be used to get variable importance measures for more difficult problems. One such recent example is to use the MMD distance as <em>d:</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/793/1*5nQ4BxhuBf9zTS-YI8yTIA.png" /></figure><p>The MMD distance is a wonderful tool, that allows to quite easily build a distance between distributions using a kernel <em>k</em> (such as the Gaussian kernel):</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/451/1*7YzaWgwrB2GfAyS0HPpE8Q.png" /></figure><p>For the moment I leave the details to further articles. The most important takeaway is simply that <em>I^{(j)}</em> considers a more general target than the conditional expectation. It recognizes a variable <em>X_j </em>as important, as soon as it influences the distribution of <em>Y</em> in any way. It might be that <em>X_j</em> only changes the variance or the quantiles and leaves the conditional mean of <em>Y</em> untouched (see example below). In this case, the Sobol-MDA would not recognize <em>X_j</em> as important, but the MMD method would. This doesn’t necessarily make it better, it is simply a different tool: If one is interested in predicting the conditional expectation, <em>ST^{(j)}</em> is the right measure. However, if one is interested in predicting other aspects of the distribution, especially quantiles, <em>I^{(j)}</em> would be better suited. Again <em>I^{(j)}</em> can be consistently estimated using the drop-and-relearn principle (refitting DRF for <em>j=1,…,p</em> eacht time with variable $j$ removed), or the same projection approach as for Sobol-MDA can be used. A drop-and-relearn-based implementation is attached at the end of this article. We refer to this method here as <strong>MMD-MDA</strong>.</p><h4>Simulated Data</h4><p>We now illustrate these two modern measures on a simple simulated example: We first download and install the Sobol-MDA package from <a href="https://gitlab.com/drti/sobolmda">Gitlab</a> and then load all the packages necessary for this example:</p><pre>library(kernlab)<br>library(drf)<br>library(Matrix)<br>library(DescTools)<br>library(mice)<br>library(sobolMDA)<br>source(&quot;compute_drf_vimp.R&quot;) ##Contents of this file can be found below<br>source(&quot;evaluation.R&quot;) ##Contents of this file can be found below</pre><p>Then we simulate from this simple example: We take <em>X_1, X_2, X_4, …, X_10</em> independently uniform between (-1,1) and create dependence between <em>X_1</em> and <em>X_3</em> by taking <em>X_3=X_1 + uniform error</em>. 
Then we simulate <em>Y</em> as</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/800/1*fVFn4lfuSjRoVZRsgdLRLA.png" /></figure><pre>##Simulate Data that experiences both a mean as well as sd shift<br><br># Simulate from X<br>x1 &lt;- runif(n,-1,1)<br>x2 &lt;- runif(n,-1,1)<br>X0 &lt;- matrix(runif(7*n,-1,1), nrow=n, ncol=7)<br>x3 &lt;- x1+ runif(n,-1,1)<br>X &lt;- cbind(x1,x2, x3, X0)<br><br># Simulate dependent variable Y<br>Y &lt;- as.matrix(rnorm(n,mean = 0.8*(x1 &gt; 0), sd = 1 + 1*(x2 &gt; 0)))<br>colnames(X)&lt;-paste0(&quot;X&quot;, 1:10)<br><br><br>head(cbind(Y,X))<br></pre><p>We then analyze the Sobol-MDA approach to estimate the conditional expectation of <em>Y</em> given <strong><em>X</em></strong>:</p><pre>## Variable importance for conditional Expectation Estimation<br><br>XY &lt;- as.data.frame(cbind(Xfull, Y))<br>colnames(XY) &lt;- c(paste(&#39;X&#39;, 1:(ncol(XY)-1), sep=&#39;&#39;), &#39;Y&#39;)<br>num.trees &lt;- 500<br>forest &lt;- sobolMDA::ranger(Y ~., data = XY, num.trees = num.trees, importance = &#39;sobolMDA&#39;)<br>sobolMDA &lt;- forest$variable.importance<br>names(sobolMDA) &lt;- colnames(X)<br><br>sort(sobolMDA, decreasing = T)<br><br>          X1           X8           X7           X6           X5           X9 <br> 0.062220958  0.021946135  0.016818860  0.016777223 -0.001290326 -0.001540919 <br>          X3          X10           X4           X2 <br>-0.001578540 -0.007400854 -0.008299478 -0.020334150 <br></pre><p>As can be seen, it correctly identifies that <em>X_1</em> is the most important variable, while the others are ranked equally (un)important. This makes sense because the conditional expectation of <em>Y</em> is only changed by <em>X_1</em>. Crucially, the measure manages to do this despite the dependence between <em>X_1 </em>and <em>X_3</em>. Thus we successfully pursued goal (1), as explained above, in this example. On the other hand, we can also have a look at the traditional MDA:</p><pre>forest &lt;- sobolMDA::ranger(Y ~., data = XY, num.trees = num.trees, importance = &#39;permutation&#39;)<br>MDA &lt;- forest$variable.importance<br>names(MDA) &lt;- colnames(X)<br><br>sort(MDA, decreasing = T)<br><br>          X1           X3           X6           X7           X8           X2 <br> 0.464516976  0.118147061  0.063969310  0.032741521  0.029004312 -0.004494380 <br>          X4           X9          X10           X5 <br>-0.009977733 -0.011030996 -0.014281844 -0.018062544 </pre><p>In this case, while it correctly identifies <em>X_1 </em>as the most important variable, it also places <em>X_3 </em>in second place, with a value that seems quite a bit higher than the remaining variables. This despite the fact, that <em>X_3 </em>is just as unimportant as <em>X_2, X_4,…, X_10</em>!</p><p>But what if we are interested in predicting the distribution of <em>Y</em> more generally, say for estimating quantiles? In this case, we need a measure that is able to recognize the influence of <em>X_2</em> on the conditional variance of <em>Y</em>. 
Here the MMD variable importance measure comes into play:</p><pre>MMDVimp &lt;- compute_drf_vimp(X=X,Y=Y)<br>sort(MMDVimp, decreasing = T)<br><br>         X2          X1         X10          X6          X8          X3 <br>0.683315006 0.318517259 0.014066410 0.009904518 0.006859128 0.005529749 <br>         X7          X9          X4          X5 <br>0.003476256 0.003290550 0.002417677 0.002036174 </pre><p>Again the measure is able to correctly identify what matters: <em>X_1</em> and <em>X_2</em> are the two most important variables. And again, it does this despite the dependence between <em>X_1</em> and <em>X_3</em>. Interestingly, it also gives the variance shift from <em>X_2</em> a higher importance than the expectation shift from <em>X_1</em>.</p><h4>Real Data</h4><p>Finally, I present a real data application to demonstrate the variable importance measure. Note that with DRF, we could look even at multivariate <strong><em>Y</em></strong> but to keep things more simple, we focus on a univariate setting and consider the US wage data from the 2018 American Community Survey by the US Census Bureau. In the first <a href="https://www.jmlr.org/papers/v23/21-0585.html">DRF paper</a>, we obtained data on approximately 1 million full-time employees from the 2018 American Community Survey by the US Census Bureau from which we extracted the salary information and all covariates that might be relevant for salaries. This wealth of data is ideal to experiment with a method like DRF (in fact we will only use a tiny subset for this analysis). The data we load can be found <a href="https://github.com/lorismichel/drf/blob/master/applications/wage_data/data/datasets/wage_benchmark.Rdata">here</a>.</p><pre># Load data (https://github.com/lorismichel/drf/blob/master/applications/wage_data/data/datasets/wage_benchmark.Rdata)<br>load(&quot;wage_benchmark.Rdata&quot;)<br><br>##Define the training data<br><br>n&lt;-1000<br><br>Xtrain&lt;-X[1:n,] <br>Ytrain&lt;-Y[1:n,]<br>Xtrain&lt;-cbind(Xtrain,Ytrain[,&quot;male&quot;])<br>colnames(Xtrain)[ncol(Xtrain)]&lt;-&quot;male&quot;<br>Ytrain&lt;-Ytrain[,1, drop=F]<br><br><br>##Define the test data<br>ntest&lt;-2000<br>Xtest&lt;-X[(n+1):(n+ntest),]  <br>Ytest&lt;-Y[(n+1):(n+ntest),]<br>Xtest&lt;-cbind(Xtest,Ytest[,&quot;male&quot;])<br>colnames(Xtest)[ncol(Xtest)]&lt;-&quot;male&quot;<br>Ytest&lt;-Ytest[,1, drop=F]</pre><p>We now calculate both variable importance measures (this will take a while as only the drop-and-relearn method is implemented for DRF):</p><pre># Calculate variable importance for both measures<br># 1. Sobol-MDA<br>XY &lt;- as.data.frame(cbind(Xtrain, Ytrain))<br>colnames(XY) &lt;- c(paste(&#39;X&#39;, 1:(ncol(XY)-1), sep=&#39;&#39;), &#39;Y&#39;)<br>num.trees &lt;- 500<br>forest &lt;- sobolMDA::ranger(Y ~., data = XY, num.trees = num.trees, importance = &#39;sobolMDA&#39;)<br>SobolMDA &lt;- forest$variable.importance<br>names(SobolMDA) &lt;- colnames(Xtrain)<br><br># 2. 
MMD-MDA<br>MMDVimp &lt;- compute_drf_vimp(X=Xtrain,Y=Ytrain,silent=T)<br><br><br><br>print(&quot;Top 10 most important variables for conditional Expectation estimation&quot;)<br>sort(SobolMDA, decreasing = T)[1:10]<br>print(&quot;Top 5 most important variables for conditional Distribution estimation&quot;)<br>sort(MMDVimp, decreasing = T)[1:10]</pre><pre>Sobol-MDA:<br><br>education_level                   age                  male <br>          0.073506769           0.027079349           0.013722756 <br>        occupation_11         occupation_43           industry_54 <br>          0.013550320           0.010025332           0.007744589 <br>          industry_44         occupation_23         occupation_15 <br>          0.006657918           0.005772662           0.004610835 <br>marital_never married <br>          0.004545964<br></pre><pre>MMD-MDA:<br><br>education_level                   age                  male <br>          0.420316085           0.109212519           0.027356393 <br>        occupation_43         occupation_11 marital_never married <br>          0.016861954           0.014122583           0.003449910 <br>        occupation_29       marital_married           industry_81 <br>          0.002272629           0.002085207           0.001152210 <br>          industry_72 <br>          0.000984725</pre><p>In this case, the two variable importance measures agree quite a bit on which variables are important. While this is not a causal analysis, it is also nice that variables that are known to be important to predict wages, specifically “age”, “education_level” and “gender”, are indeed seen as very important by the two measures.</p><p>To obtain a small set of predictive variables, one could now for <em>j=1,…p-1,</em></p><p>(I) Remove the least important variable</p><p>(II) Calculate the loss (e.g. mean squared error) on a test set</p><p>(III) Recalculate the variable importance for the remaining variable</p><p>(IV) Repeat until a certain stopping criterion is met</p><p>One could stop, for instance, if the loss increased by more than 5%. To make my life easier in this article, I just use the same variable importance values saved in “SobolMDA” and “MMDVimp” above. That is, I ignore step (III) and only consider (I), (II) and (IV). When the goal of estimation is the full conditional distribution, step (II) is also not entirely clear. We use what we refer to as MMD loss, described in more detail in our paper ([4]). This loss considers the error we are making in the prediction of the distribution. For the conditional mean, we simply use the mean-squared error. 
This is done in the function “evalall” found below:</p><pre># Remove variables one-by-one accoring to the importance values saved in SobolMDA<br># and MMDVimp.<br>evallistSobol&lt;-evalall(SobolMDA, X=Xtrain ,Y=Ytrain ,Xtest, Ytest, metrics=c(&quot;MSE&quot;), num.trees )<br>evallistMMD&lt;-evalall(MMDVimp, X=Xtrain ,Y=Ytrain ,Xtest, Ytest, metrics=c(&quot;MMD&quot;), num.trees )<br><br><br>plot(evallistSobol$evalMSE, type=&quot;l&quot;, lwd=2, cex=0.8, col=&quot;darkgreen&quot;, main=&quot;MSE loss&quot; , xlab=&quot;Number of Variables removed&quot;, ylab=&quot;Values&quot;)<br>plot(evallistMMD$evalMMD, type=&quot;l&quot;, lwd=2, cex=0.8, col=&quot;darkgreen&quot;, main=&quot;MMD loss&quot; , xlab=&quot;Number of Variables removed&quot;, ylab=&quot;Values&quot;)</pre><p>This results in the following two pictures:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*OeG0_-q8q3_Ie4RS9T3RcQ.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*rX7w3fOhWWPZYkOQc6QiHQ.png" /></figure><p>Notice that both have somewhat wiggly lines, which is first due to the fact that I did not recalculate the importance measure, e.g., left out step (III), and second due to the randomness of the forests. Aside from this, the graphs nicely show how the errors successively increase with each variable that is removed. This increase is first slow for the least important variables and then gets quicker for the most important ones, exactly as one would expect. In particular, the loss in both cases remains virtually unchanged if one removes the 50 least important variables! In fact, one could remove about 70 variables in both cases without increasing the loss by more than 6%. One has to note though that many predictors are part of one-hot encoded categorical variables and thus one needs to be somewhat careful when removing predictors, as they correspond to levels of one categorical variable. However, in an actual application, this might still be desirable.</p><h4>Conclusion</h4><p>In this article, we looked at modern approaches to variable importance in Random Forests, with the goal of obtaining a small set of predictors or covariates, both with respect to the conditional expectation and for the conditional distribution more generally. We have seen in the wage data example, that this can lead to a substantial reduction in predictors with virtually the same accuracy.</p><p>As noted above the measures presented are not strictly constrained to Random Forest, but can be used more generally in principle. However, forests allow for the elegant projection approach that allows for the calculation of the importance measure for all variables <em>j</em>, without having to refit the forest each time (!) This is described in both [3] and [4].</p><h4>Literature</h4><p>[1] Breiman, L. (2001). Random forests. Machine learning, 45(1):5–32.</p><p>[2] Breiman, L. (2003a). Setting up, using, and understanding random forests v3.1. Technical report, UC Berkeley, Department of Statistics</p><p>[3] Bénard, C., Da Veiga, S., and Scornet, E. (2022). Mean decrease accuracy for random forests: inconsistency, and a practical solution via the Sobol-MDA. Biometrika, 109(4):881–900.</p><p>[4] Clément Bénard, Jeffrey Näf, and Julie Josse. MMD-based variable importance for distributional random forest, 2023.</p><p>[5] Strobl, C., Boulesteix, A.-L., Zeileis, A., and Hothorn, T. (2007). Bias in random forest variable importance measures: illustrations, sources and a solution. 
BMC Bioinformatics, 8:25.</p><h4>Appendix : Code</h4><pre><br>#### Contents of compute_drf_vimp.R ######<br><br>#&#39; Variable importance for Distributional Random Forests<br>#&#39;<br>#&#39; @param X Matrix with input training data.<br>#&#39; @param Y Matrix with output training data.<br>#&#39; @param X_test Matrix with input testing data. If NULL, out-of-bag estimates are used.<br>#&#39; @param num.trees Number of trees to fit DRF. Default value is 500 trees.<br>#&#39; @param silent If FALSE, print variable iteration number, otherwise nothing is print. Default is FALSE.<br>#&#39;<br>#&#39; @return The list of importance values for all input variables.<br>#&#39; @export<br>#&#39;<br>#&#39; @examples<br>compute_drf_vimp &lt;- function(X, Y, X_test = NULL, num.trees = 500, silent = FALSE){<br>  <br>  # fit initial DRF<br>  bandwidth_Y &lt;- drf:::medianHeuristic(Y)<br>  k_Y &lt;- rbfdot(sigma = bandwidth_Y)<br>  K &lt;- kernelMatrix(k_Y, Y, Y)<br>  DRF &lt;- drf(X, Y, num.trees = num.trees)<br>  wall &lt;- predict(DRF, X_test)$weights<br>  <br>  # compute normalization constant<br>  wbar &lt;- colMeans(wall)<br>  wall_wbar &lt;- sweep(wall, 2, wbar, &quot;-&quot;)<br>  I0 &lt;- as.numeric(sum(diag(wall_wbar %*% K %*% t(wall_wbar))))<br>  <br>  # compute drf importance dropping variables one by one<br>  I &lt;- sapply(1:ncol(X), function(j) {<br>    if (!silent){print(paste0(&#39;Running importance for variable X&#39;, j, &#39;...&#39;))}<br>    DRFj &lt;- drf(X = X[, -j, drop=F], Y = Y, num.trees = num.trees) <br>    DRFpredj &lt;- predict(DRFj, X_test[, -j])<br>    wj &lt;- DRFpredj$weights<br>    Ij &lt;- sum(diag((wj - wall) %*% K %*% t(wj - wall)))/I0<br>    return(Ij)<br>  })<br>  <br>  # compute retraining bias<br>  DRF0 &lt;- drf(X = X, Y = Y, num.trees = num.trees)<br>  DRFpred0 = predict(DRF0, X_test)<br>  w0 &lt;- DRFpred0$weights<br>  vimp0 &lt;- sum(diag((w0 - wall) %*% K %*% t(w0 - wall)))/I0<br>  <br>  # compute final importance (remove bias &amp; truncate negative values)<br>  vimp &lt;- sapply(I - vimp0, function(x){max(0,x)})<br>  <br>  names(vimp)&lt;-colnames(X)<br>  <br>  return(vimp)<br>  <br>}<br><br><br><br></pre><pre>#### Contents of evaluation.R ######<br><br><br>compute_mmd_loss &lt;- function(Y_train, Y_test, weights){<br>  # Y_train &lt;- scale(Y_train)<br>  # Y_test &lt;- scale(Y_test)<br>  bandwidth_Y &lt;- (1/drf:::medianHeuristic(Y_train))^2<br>  k_Y &lt;- rbfdot(sigma = bandwidth_Y)<br>  K_train &lt;- matrix(kernelMatrix(k_Y, Y_train, Y_train), ncol = nrow(Y_train))<br>  K_cross &lt;- matrix(kernelMatrix(k_Y, Y_test, Y_train), ncol = nrow(Y_train))<br>  weights &lt;- matrix(weights, ncol = ncol(weights))<br>  t1 &lt;- diag(weights%*%K_train%*%t(weights))<br>  t2 &lt;- diag(K_cross%*%t(weights))<br>  mmd_loss &lt;- mean(t1) - 2*mean(t2)<br>  mmd_loss<br>}<br><br>evalall &lt;- function(Vimp, X ,Y ,Xtest, Ytest, metrics=c(&quot;MMD&quot;,&quot;MSE&quot;), num.trees ){<br>  <br>  if (ncol(Ytest) &gt; 1 &amp; &quot;MSE&quot; %in% metrics){<br>    metrics &lt;- metrics[!( metrics %in% &quot;MSE&quot;) ]<br>  }<br>  <br>  # Sort for increasing importance, such that the least important variables are removed first<br>  Vimp&lt;-sort(Vimp)<br>  <br>  if ( is.null(names(Vimp)) ){<br>    stop(&quot;Need names for later&quot;)  <br>  }<br>  <br>  <br>  evalMMD&lt;-matrix(0, nrow=ncol(X))<br>  evalMSE&lt;-matrix(0, nrow=ncol(X))<br>  <br>  ###Idea: Create a function that takes a variable importance measure and does this loop!!<br>  <br>  for (j in 1:ncol(X)){<br>    <br> 
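    # In round j, the (j-1) least important variables according to Vimp are removed<br>    # (none in the first round), the forest is refit on the remaining predictors and<br>    # the MMD and/or MSE loss on the test set is recorded.<br>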
  <br>    <br>    if (j==1){<br>      <br>      if (&quot;MMD&quot; %in% metrics){<br>        <br>        DRFred&lt;- drf(X=X,Y=Y)<br>        weights&lt;- predict(DRFred, newdata=Xtest)$weights<br>        evalMMD[j]&lt;-compute_mmd_loss(Y_train=Y, Y_test=Ytest, weights)<br>    <br>      }<br>      <br>      if (&quot;MSE&quot; %in% metrics){<br>        <br>        XY &lt;- as.data.frame(cbind(X, Y))<br>        colnames(XY) &lt;- c(paste(&#39;X&#39;, 1:(ncol(XY)-1), sep=&#39;&#39;), &#39;Y&#39;)<br>        RFfull &lt;- sobolMDA::ranger(Y ~., data = XY, num.trees = num.trees)<br>        XtestRF&lt;-Xtest<br>        colnames(XtestRF) &lt;- paste(&#39;X&#39;, 1:ncol(XtestRF), sep=&#39;&#39;)<br>        predRF&lt;-predict(RFfull, data=XtestRF)<br>        evalMSE[j] &lt;- mean((Ytest - predRF$predictions)^2)<br>      <br>      }<br><br>    }else{<br>      <br>      <br>      if (&quot;MMD&quot; %in% metrics){<br>        <br>        DRFred&lt;- drf(X=X[,!(colnames(X) %in% names(Vimp[1:(j-1)])), drop=F],Y=Y)<br>        weights&lt;- predict(DRFred, newdata=Xtest[,!(colnames(Xtest) %in% names(Vimp[1:(j-1)])), drop=F])$weights<br>        evalMMD[j]&lt;-compute_mmd_loss(Y_train=Y, Y_test=Ytest, weights)<br>        <br>      }<br>      <br>      <br>      <br>      if (&quot;MSE&quot; %in% metrics){<br>        <br>        XY &lt;- as.data.frame(cbind(X[,!(colnames(X) %in% names(Vimp[1:(j-1)])), drop=F], Y))<br>        colnames(XY) &lt;- c(paste(&#39;X&#39;, 1:(ncol(XY)-1), sep=&#39;&#39;), &#39;Y&#39;)<br>        RFfull &lt;- sobolMDA::ranger(Y ~., data = XY, num.trees = num.trees)<br>        XtestRF&lt;-Xtest[,!(colnames(Xtest) %in% names(Vimp[1:(j-1)])), drop=F]<br>        colnames(XtestRF) &lt;- paste(&#39;X&#39;, 1:ncol(XtestRF), sep=&#39;&#39;)<br>        predRF&lt;-predict(RFfull, data=XtestRF)<br>        evalMSE[j] &lt;- mean((Ytest - predRF$predictions)^2)<br>        <br>        # DRFall &lt;- drf(X=X[,!(colnames(X) %in% names(Vimp[1:(j-1)])), drop=F], Y=Y, num.trees=num.trees)<br>        # quantpredictall&lt;-predict(DRFall, newdata=Xtest[,!(colnames(Xtest) %in% names(Vimp[1:(j-1)])), drop=F], functional=&quot;quantile&quot;,quantiles=c(0.5))<br>        # evalMAD[j] &lt;- mean(sapply(1:nrow(Xtest), function(j)  abs(Ytest[j] - quantpredictall$quantile[,,&quot;q=0.5&quot;][j]) ))<br>      }<br>      <br>    }<br>    <br>  }<br>  <br>  return(list(Vimp=Vimp, evalMMD=evalMMD, evalMSE=evalMSE ))<br>  <br>}<br><br></pre><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=20c6690e44e0" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/variable-importance-in-random-forests-20c6690e44e0">Variable Importance in Random Forests</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[CLVTools Version 0.10.0]]></title>
            <link>https://medium.com/@jeffrey_85949/clvtools-version-0-10-0-8a943a856743?source=rss-ca780798011a------2</link>
            <guid isPermaLink="false">https://medium.com/p/8a943a856743</guid>
            <category><![CDATA[ltv]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[business-strategy]]></category>
            <category><![CDATA[r]]></category>
            <dc:creator><![CDATA[Jeffrey Näf]]></dc:creator>
            <pubDate>Thu, 26 Oct 2023 07:58:32 GMT</pubDate>
            <atom:updated>2023-10-26T07:58:32.309Z</atom:updated>
            <content:encoded><![CDATA[<h4>Model your customers like never before</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*q-hQ6AnWFqAg6unp" /><figcaption>Photo by <a href="https://unsplash.com/@hostreviews?utm_source=medium&amp;utm_medium=referral">Stephen Phillips - Hostreviews.co.uk</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>In <a href="https://towardsdatascience.com/clvtools-a-powerful-r-package-to-evaluate-your-customers-4fd1781811d">this article</a> I briefly discussed the CLVTools package for customer lifetime (CLV) modeling. The package just got a major upgrade and is better than ever.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*88QtD_p18ZkUPu5y.jpeg" /><figcaption>Overview of the models included in CLVTools. Source: Walkthrough <a href="https://www.clvtools.com/articles/CLVTools.html">here</a>, with permission of the authors.</figcaption></figure><p>One of the big advantages of the package was the inclusion of a well-designed implementation of the extended Pareto/NBD model, which allowed for time-varying covariates. Thus it allows for instance to model known seasonal patterns, such as holidays or the firms own marketing campaigns. Including well-designed time-varying covariates in this fashion can lead to a heavy performance boost compared to the already strong (original) Pareto/NBD model, as demonstrated for instance in the <a href="https://pubsonline.informs.org/doi/abs/10.1287/mksc.2020.1254">original paper</a>. Since the Pareto/NBD model has shown consistent performance over a wide range of marketing datasets over the last 30 years, the extended Pareto/NBD model can be seen as a flagship model of modern marketing research. Crucially, in this update the implementation of the extended Pareto/NBD model got a makeover and is now done in Rcpp. That means it is an order of magnitude faster than before, which can make a huge difference for end-users. It’s not an understatement to say that this lifts the extended Pareto/NBD model from an academic curiosity to a model that can actually be used in practice.</p><p>This is by far the biggest change, but there are also some further nice additions, such as new formula interfaces and new plotting capabilities. In total there is:</p><ul><li>MUCH faster fitting for the Pareto/NBD with time-varying covariates because the LL is now implemented in Rcpp</li><li>Added interface to specify models using a formula notation (<a href="https://www.clvtools.com/reference/latentAttrition.html">latentAttrition()</a> and <a href="https://www.clvtools.com/reference/spending.html">spending()</a>)</li><li>New method to plot customer’s transaction timings (plot.clv.data(which=&#39;timings&#39;))</li><li>Ability to draw diagnostic plots of multiple models in single plot (plot(other.models=list(), label=c()))</li></ul><p>I quickly demonstrate some of the new features here, with the same example as in my <a href="https://towardsdatascience.com/clvtools-a-powerful-r-package-to-evaluate-your-customers-4fd1781811d">first article</a>. 
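</p><p>As a small, hedged sketch of the new multi-model plotting listed above (using the other.models and label arguments named there; it assumes the clv.apparel object constructed in the walkthrough just below and two standard models fitted on it):</p><pre># Sketch only: compare two fitted models in a single diagnostic plot<br>est.pnbd  &lt;- pnbd(clv.apparel)<br>est.bgnbd &lt;- bgnbd(clv.apparel)<br>plot(est.pnbd, other.models = list(est.bgnbd), label = c(&quot;Pareto/NBD&quot;, &quot;BG/NBD&quot;))</pre><p>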
More details about the package including a walkthrough can be found <a href="https://www.clvtools.com/">here</a>.</p><pre>library(CLVTools)<br>data(&quot;apparelTrans&quot;)<br>data(&quot;apparelDynCov&quot;)<br><br>clv.apparel &lt;- clvdata(apparelTrans,<br>                       date.format = &quot;ymd&quot;,<br>                       time.unit= &quot;week&quot;,<br>                       estimation.split=40,<br>                       name.id=&quot;Id&quot;,<br>                       name.date=&quot;Date&quot;,<br>                       name.price=&quot;Price&quot;)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*WSNw6FrXw5n0wjm9hDkYPg.png" /></figure><p>This data set also comes with covariates:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*mrO1kgQqsdfw9DAxtpEL-Q.png" /></figure><p>“Gender” and “Channel” are static covariates that change over customers, but not over time. Marketing, on the other hand, is both customer and time specific and thus dynamic.</p><p>Now let’s look at two of the changes:</p><h4>An Rcpp implementation of the extended Pareto/NBD model</h4><p>As in the first article, we first set the covariates</p><pre>clv.dyn &lt;- SetDynamicCovariates(clv.data=clv.apparel,<br>                                data.cov.life = apparelDynCov,<br>                                data.cov.trans = apparelDynCov,<br>                                names.cov.life = c(&quot;Marketing&quot;, &quot;Gender&quot;, &quot;Channel&quot;),<br>                                names.cov.trans = c(&quot;Marketing&quot;, &quot;Gender&quot;, &quot;Channel&quot;),<br>                                name.id = &quot;Id&quot;,<br>                                name.date = &quot;Cov.Date&quot;)</pre><p>and then optimized:</p><pre># Estimate the PNBD with Covariates (This takes a while (!))<br>est.pnbd.dyn &lt;- pnbd(clv.dyn)</pre><p>The latter optimization took around 9 minutes on my <a href="https://www.digitec.ch/en/s1/product/microsoft-surface-studio-2-intel-core-i7-7820hq-16-gb-1000-gb-ssd-pc-10247238">Surface Studio 2</a>, despite this being a relatively small dataset. Now it takes only <strong>24 seconds</strong>!! This is an impressive 96% decrease in computation time.</p><h4>New Plotting Capabilities</h4><p>Models are nice, especially when patterns cannot be easily spotted. However, sometimes a picture is worth a thousand models. Thus there are some further plotting capabilities in the new version. In particular, one can now plot the transaction behavior of the dataset beforehand:</p><pre>plot(clv.dyn, which=&#39;timings&#39;)</pre><p>This results in the following nice overview of the transactions of each customer (each line is a customer over time and each dot represents a transaction).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*1AVRax2zBH_4xdcu_NBnUw.png" /></figure><p>In particular, this provides a nice overview of what the (extended) Pareto/NBD model tries to model: the purchase behavior of each customer and how likely it is that the customer has actually left the firm (i.e. attrition).</p><h4>Conclusion</h4><p>This article briefly discussed the new features of CLVTools 0.10.0. The most important change is the Rcpp implementation of the flagship model (the extended Pareto/NBD model). 
More detailed information on new features and the package itself can be found <a href="https://www.clvtools.com/">here</a>.</p><p>The package is already widely used, but we hope that this improved implementation, new formula interface and further plotting capabilities will lead to an even more widespread adoption of the package.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=8a943a856743" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Random Forests and Missing Values]]></title>
            <link>https://medium.com/data-science/random-forests-and-missing-values-3daaea103db0?source=rss-ca780798011a------2</link>
            <guid isPermaLink="false">https://medium.com/p/3daaea103db0</guid>
            <category><![CDATA[missing-data]]></category>
            <category><![CDATA[random-forest]]></category>
            <category><![CDATA[exploratory-data-analysis]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Jeffrey Näf]]></dc:creator>
            <pubDate>Wed, 21 Jun 2023 13:51:30 GMT</pubDate>
            <atom:updated>2023-10-31T09:02:48.928Z</atom:updated>
<content:encoded><![CDATA[<h4>There is a very Intriguing Practical Fix</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*C4w-xgklPtkzG8Gs48c9fQ.png" /><figcaption>Features of (Distributional) Random Forests. In this article: The ability to deal with missing values. Source: Author.</figcaption></figure><p>Outside of some excessively cleaned data sets that one finds online, missing values are everywhere. In fact, the more complex and large the dataset, the more likely it is that missing values are present. Missing values are a fascinating field of statistical research, but in practice they are often a nuisance.</p><p>If you deal with a prediction problem where you want to predict a variable <em>Y</em> from <em>p-</em>dimensional covariates <strong><em>X</em></strong><em>=(X_1,…,X_p)</em> and you face missing values in <strong><em>X</em></strong>, there is an interesting solution for tree-based methods. This method is actually rather old but appears to work remarkably well in a wide range of data sets. I am talking about the “missing incorporated in attributes” criterion (MIA; [1]). While there are many good articles about missing values (such as <a href="https://medium.com/@vinitasilaparasetty/guide-to-handling-missing-values-in-data-science-37d62edbfdc1">this one</a>), this powerful approach seems somewhat underused. In particular, one does not need to impute, delete or predict the missing values in any way, but can instead just run the prediction as if the data were fully observed.</p><p>I will quickly explain how the method itself works, and then present an example with the distributional random forest (DRF) explained <a href="https://medium.com/towards-data-science/drf-a-random-forest-for-almost-everything-625fa5c3bcb8">here</a>. I chose DRF because it is a very general version of Random Forest (in particular, it can also be used to predict a random vector <strong><em>Y</em></strong>) and because I am somewhat biased here. MIA is actually implemented for the generalized random forest (<a href="https://grf-labs.github.io/grf/">GRF</a>), which covers a wide range of forest implementations. In particular, since the implementation of DRF on <a href="https://cran.r-project.org/web/packages/drf/index.html">CRAN</a> is based on GRF, after a slight modification, it can use the MIA method as well.</p><p>Of course, be aware that this is a quick fix that (as far as I know) has no theoretical guarantees. Depending on the missingness mechanism, it might heavily bias the analysis. On the other hand, most commonly used methods for dealing with missing values don’t have any theoretical guarantees either, or are outright known to bias the analysis, and, at least empirically, MIA appears to work well.</p><h3>How it works</h3><p>Recall that in a RF, splits are built of the form <em>X_j &lt; S </em>or <em>X_j ≥ S</em>, for a dimension <em>j=1,…,p</em>. To find this split value <em>S</em>, the algorithm optimizes some kind of criterion on the <em>Y</em>’s, for example the CART criterion. Thus the observations are successively divided through decision rules that depend on <strong><em>X</em></strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*mS1TtqR8KCg8FUPfnd_AXg.png" /><figcaption>Illustration of the splitting done in a RF. 
Image by author.</figcaption></figure><p>The original paper explains it a bit confusingly, but as far as I understand MIA works as follows: Let us consider a sample (<em>Y_1</em>, <strong><em>X</em></strong><em>_1),…, (Y_n, </em><strong><em>X</em></strong><em>_n), with</em></p><p><strong><em>X</em></strong><em>_i=(X_i1,…,X_ip)’.</em></p><p>Splitting without missing values is just looking for the value <em>S</em> as above and then throwing all <em>Y_i</em> with <em>X_ij &lt; S</em> in Node 1 and all <em>Y_i</em> with <em>X_ij ≥ S</em> in Node 2. Calculating the target criterion such as CART for each value <em>S</em>, we can choose the best one. With missing values there are instead 3 options for every candidate split value <em>S</em> to consider:</p><ul><li>Use the usual rule for all observations <em>i</em> such that <em>X_ij</em> is observed and send <em>i</em> to Node 1 if <em>X_ij </em>is missing.</li><li>Use the usual rule for all observations i such that <em>X_ij</em> is observed and send <em>i</em> to Node 2 if <em>X_ij</em> is missing.</li><li>Ignore the usual rule and just send i to Node 1 if <em>X_ij</em> is missing and to Node 2 if it is observed.</li></ul><p>Which of these rules to follow is again decided according to the criterion on <em>Y_i</em> we use.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*cgJYF-qaAWARr4jITM_LZA.png" /><figcaption>Illustration of how I understand the MIA procedure. Given observations in the parent node, we are looking for the best split value S. For each split value we consider the 3 options and try until we find the minimum. The sets {} on the left indicate the observations i that get sent to the left or the right. Image by author.</figcaption></figure><h3>A Small Example</h3><p>It needs to be mentioned at this point that the drf package on <a href="https://cran.r-project.org/web/packages/drf/index.html">CRAN </a>is not yet updated with the newest methodology. There will be a point in the future where all of this is implemented in one package on CRAN(!) However, for the moment, there are two versions:</p><p>If you want to use the <strong>fast drf implementation</strong> with missing values (<strong>without</strong> confidence intervals), you can use the “drfown” function attached at the end of this article. This code is adapted from</p><p><a href="https://github.com/lorismichel/drf">lorismichel/drf: Distributional Random Forests (Cevid et al., 2020) (github.com)</a></p><p>If on the other hand, you want <strong>confidence intervals </strong>with your parameters, use this (slower) code</p><p><a href="https://github.com/JeffNaef/drfinference/blob/main/drf-foo.R">drfinference/drf-foo.R at main · JeffNaef/drfinference (github.com)</a></p><p>In particular, drf-foo.R contains all you need in the latter case.</p><p>We will focus on the slower code with confidence intervals, as explained in <a href="https://medium.com/towards-data-science/inference-for-distributional-random-forests-64610bbb3927">this article</a> and also consider the same example as in said article:</p><pre>set.seed(2)<br><br>n&lt;-2000<br>beta1&lt;-1<br>beta2&lt;--1.8<br><br><br># Model Simulation<br>X&lt;-mvrnorm(n = n, mu=c(0,0), Sigma=matrix(c(1,0.7,0.7,1), nrow=2,ncol=2))<br>u&lt;-rnorm(n=n, sd = sqrt(exp(X[,1])))<br>Y&lt;- matrix(beta1*X[,1] + beta2*X[,2] + u, ncol=1)</pre><p>Note that this is a heteroskedastic linear model with <em>p=2 </em>and with the variance of the error term depending on the <em>X_1</em> values. 
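</p><p>As a small, purely illustrative check (not part of the original code): this model implies a true conditional mean of beta1*x_1 + beta2*x_2 and a true conditional variance of exp(x_1), which is what the estimates below can be compared against. Note also that the simulation snippet above assumes the MASS package is loaded for mvrnorm.</p><pre># Ground truth implied by the simulation, at the test point x=(1,1) used further below<br>x_true &lt;- c(1, 1)<br>true_mean &lt;- beta1*x_true[1] + beta2*x_true[2]  # = -0.8<br>true_var  &lt;- exp(x_true[1])                     # = exp(1), roughly 2.72, since Var(u | X) = exp(X_1)<br>c(true_mean, true_var)</pre><p>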
Now we also add missing values to <em>X_1</em> in a Missing at Random (MAR) fashion:</p><pre>prob_na &lt;- 0.3<br>X[, 1] &lt;- ifelse(X[, 2] &lt;= -0.2 &amp; runif(n) &lt; prob_na, NA, X[, 1]) </pre><p>This means that <em>X_1</em> is missing with a probability of 0.3, whenever <em>X_2</em> has a value smaller than -0.2. Thus the probability of <em>X_1</em> being missing depends on <em>X_2</em>, which is what is referred to as “Missing at Random”. This is already a complex situation and there is information to be gained by looking at the pattern of missing values. That is, the missingness is not “Missing Completely at Random (MCAR)”, because the missingness of <em>X_1 </em>depends on the value of <em>X_2. </em>This in turn means that the distribution of <em>X_2</em> we draw from is different, conditional on whether <em>X_1</em> is missing or not. This in particular means that deleting the rows with missing values might severely bias the analysis.</p><p>We now fix<em> </em><strong><em>x</em></strong> and estimate the conditional expectation and variance given <strong><em>X</em></strong><em>=</em><strong><em>x, </em></strong>exactly as in the <a href="https://medium.com/towards-data-science/inference-for-distributional-random-forests-64610bbb3927">previous article</a>.</p><pre># Choose an x that is not too far out<br>x&lt;-matrix(c(1,1),ncol=2)<br><br># Choose alpha for CIs<br>alpha&lt;-0.05</pre><p>We then also fit DRF and predict the weights for the test point <strong><em>x</em> </strong>(which corresponds to predicting the conditional distribution of <em>Y|</em><strong><em>X</em></strong><em>=</em><strong><em>x</em></strong>):</p><pre>## Fit the new DRF framework<br>drf_fit &lt;- drfCI(X=X, Y=Y, min.node.size = 5, splitting.rule=&#39;FourierMMD&#39;, num.features=10, B=100)<br><br>## predict weights<br>DRF = predictdrf(drf_fit, x=x)<br>weights &lt;- DRF$weights[1,]</pre><h4>Example 1: Conditional Expectation</h4><p>We first estimate the conditional expectation of <em>Y|</em><strong><em>X</em></strong><em>=</em><strong><em>x.</em></strong></p><pre># Estimate the conditional expectation at x:<br>condexpest&lt;- sum(weights*Y)<br><br># Use the distribution of weights, see below<br>distofcondexpest&lt;-unlist(lapply(DRF$weightsb, function(wb)  sum(wb[1,]*Y)  ))<br><br># Can either use the above directly to build confidence interval, or can use the normal approximation.<br># We will use the latter<br>varest&lt;-var(distofcondexpest-condexpest)<br><br># build 95%-CI<br>lower&lt;-condexpest - qnorm(1-alpha/2)*sqrt(varest)<br>upper&lt;-condexpest + qnorm(1-alpha/2)*sqrt(varest)<br>round(c(lower, condexpest, upper),2)<br><br># without NAs: (-1.00, -0.69 -0.37)<br># with NAs: (-1.15, -0.67, -0.19)</pre><p>Remarkably, the values obtained with NAs are very close to the ones from the first analysis without NAs in the <a href="https://medium.com/towards-data-science/inference-for-distributional-random-forests-64610bbb3927">previous article</a>! This really is quite astounding to me, as this missing mechanism is not easy to deal with. 
Interestingly, the estimated variance of the estimator also doubles, from around 0.025 without missing values to roughly 0.06 with missing values.</p><p>The truth is given as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*6lLvKMR3odPPrYXvzvDjOQ.png" /></figure><p>so we have a slight error, but the confidence intervals contain the truth, as they should.</p><p>The result looks similar for a more complex target, like the conditional variance:</p><pre># Estimate the conditional expectation at x:<br>condvarest&lt;- sum(weights*Y^2) - condexpest^2<br><br>distofcondvarest&lt;-unlist(lapply(DRF$weightsb, function(wb)  {<br>  sum(wb[1,]*Y^2) - sum(wb[1,]*Y)^2<br>} ))<br><br># Can either use the above directly to build confidence interval, or can use the normal approximation.<br># We will use the latter<br>varest&lt;-var(distofcondvarest-condvarest)<br><br># build 95%-CI<br>lower&lt;-condvarest - qnorm(1-alpha/2)*sqrt(varest)<br>upper&lt;-condvarest + qnorm(1-alpha/2)*sqrt(varest)<br><br>c(lower, condvarest, upper)<br><br># without NAs: (1.89, 2.65, 3.42)<br># with NAs: (1.79, 2.74, 3.69)</pre><p>Here the difference in the estimated values is a bit larger. As the truth is given as</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*fkRz_h_mF6POPYmtxkF0ug.png" /></figure><p>the estimate with NAs is even slightly more accurate (though of course this is likely just randomness). Again the variance estimate of the (variance) estimator increases with missing values, from 0.15 (no missing values) to 0.23.</p><h3>Conclusion</h3><p>In this article, we discussed MIA, which is an adaptation of the splitting method in Random Forest to deal with missing values. Since it is implemented in GRF and DRF, it can be used broadly and the small example we looked at indicates that it works remarkably well.</p><p>However, I’d like to note again that there is no theoretical guarantee for consistency or for the confidence intervals to make sense, even for a very large number of datapoints. The reason for missing values are numerous and one has to be very careful to not bias one’s analysis through a careless handling of this issue. The MIA method is by no means a well-understood fix for this problem. However, it seems to be a reasonable quick fix for the moment, that appears to be able to make some use of the pattern of missingness in the data. 
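</p><p>As a purely illustrative skeleton (none of this is from the original analysis), such a comparison could start out as follows, pitting MIA (handing the NAs directly to the forest via the drfown function below) against simply deleting incomplete rows:</p><pre>library(MASS)   # for mvrnorm<br>set.seed(1)<br><br>nrep &lt;- 10   # keep small, this is slow<br>res &lt;- t(sapply(1:nrep, function(r){<br>  X &lt;- mvrnorm(n = 2000, mu = c(0,0), Sigma = matrix(c(1,0.7,0.7,1), 2, 2))<br>  Y &lt;- matrix(X[,1] - 1.8*X[,2] + rnorm(2000, sd = sqrt(exp(X[,1]))), ncol = 1)<br>  X[,1] &lt;- ifelse(X[,2] &lt;= -0.2 &amp; runif(2000) &lt; 0.3, NA, X[,1])<br>  x &lt;- matrix(c(1,1), ncol = 2)<br><br>  # (a) MIA: fit directly on the data containing NAs<br>  fit_mia &lt;- drfown(X = X, Y = Y)<br>  est_mia &lt;- sum(predict(fit_mia, newdata = x)$weights[1,]*Y)<br><br>  # (b) complete-case analysis: drop all rows where X_1 is missing<br>  cc &lt;- !is.na(X[,1])<br>  fit_cc &lt;- drfown(X = X[cc, , drop = FALSE], Y = Y[cc, , drop = FALSE])<br>  est_cc &lt;- sum(predict(fit_cc, newdata = x)$weights[1,]*Y[cc, ])<br><br>  c(MIA = est_mia, complete.case = est_cc)<br>}))<br>colMeans(res)   # both estimate E[Y | X = (1,1)], whose true value is -0.8</pre><p>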
If somebody does/has a more extensive simulation analysis I would be curious about the results.</p><h3>Code</h3><pre>require(drf)            <br>             <br>drfown &lt;-               function(X, Y,<br>                              num.trees = 500,<br>                              splitting.rule = &quot;FourierMMD&quot;,<br>                              num.features = 10,<br>                              bandwidth = NULL,<br>                              response.scaling = TRUE,<br>                              node.scaling = FALSE,<br>                              sample.weights = NULL,<br>                              sample.fraction = 0.5,<br>                              mtry = min(ceiling(sqrt(ncol(X)) + 20), ncol(X)),<br>                              min.node.size = 15,<br>                              honesty = TRUE,<br>                              honesty.fraction = 0.5,<br>                              honesty.prune.leaves = TRUE,<br>                              alpha = 0.05,<br>                              imbalance.penalty = 0,<br>                              compute.oob.predictions = TRUE,<br>                              num.threads = NULL,<br>                              seed = stats::runif(1, 0, .Machine$integer.max),<br>                              compute.variable.importance = FALSE) {<br>  <br>  # initial checks for X and Y<br>  if (is.data.frame(X)) {<br>    <br>    if (is.null(names(X))) {<br>      stop(&quot;the regressor should be named if provided under data.frame format.&quot;)<br>    }<br>    <br>    if (any(apply(X, 2, class) %in% c(&quot;factor&quot;, &quot;character&quot;))) {<br>      any.factor.or.character &lt;- TRUE<br>      X.mat &lt;- as.matrix(fastDummies::dummy_cols(X, remove_selected_columns = TRUE))<br>    } else {<br>      any.factor.or.character &lt;- FALSE<br>      X.mat &lt;- as.matrix(X)<br>    }<br>    <br>    mat.col.names.df &lt;- names(X)<br>    mat.col.names &lt;- colnames(X.mat)<br>  } else {<br>    X.mat &lt;- X<br>    mat.col.names &lt;- NULL<br>    mat.col.names.df &lt;- NULL<br>    any.factor.or.character &lt;- FALSE<br>  }<br>  <br>  if (is.data.frame(Y)) {<br>    <br>    if (any(apply(Y, 2, class) %in% c(&quot;factor&quot;, &quot;character&quot;))) {<br>      stop(&quot;Y should only contain numeric variables.&quot;)<br>    }<br>    Y &lt;- as.matrix(Y)<br>  }<br>  <br>  if (is.vector(Y)) {<br>    Y &lt;- matrix(Y,ncol=1)<br>  }<br>  <br>  <br>  #validate_X(X.mat)<br>  <br>  if (inherits(X, &quot;Matrix&quot;) &amp;&amp; !(inherits(X, &quot;dgCMatrix&quot;))) {<br>        stop(&quot;Currently only sparse data of class &#39;dgCMatrix&#39; is supported.&quot;)<br>    }<br>  <br>  drf:::validate_sample_weights(sample.weights, X.mat)<br>  #Y &lt;- validate_observations(Y, X)<br>  <br>  # set legacy GRF parameters<br>  clusters &lt;- vector(mode = &quot;numeric&quot;, length = 0)<br>  samples.per.cluster &lt;- 0<br>  equalize.cluster.weights &lt;- FALSE<br>  ci.group.size &lt;- 1<br>  <br>  num.threads &lt;- drf:::validate_num_threads(num.threads)<br>  <br>  all.tunable.params &lt;- c(&quot;sample.fraction&quot;, &quot;mtry&quot;, &quot;min.node.size&quot;, &quot;honesty.fraction&quot;,<br>                          &quot;honesty.prune.leaves&quot;, &quot;alpha&quot;, &quot;imbalance.penalty&quot;)<br>  <br>  # should we scale or not the data<br>  if (response.scaling) {<br>    Y.transformed &lt;- scale(Y)<br>  } else {<br>    Y.transformed &lt;- Y<br>  }<br>  <br>  data &lt;- drf:::create_data_matrices(X.mat, outcome = 
Y.transformed, sample.weights = sample.weights)<br>  <br>  # bandwidth using median heuristic by default<br>  if (is.null(bandwidth)) {<br>    bandwidth &lt;- drf:::medianHeuristic(Y.transformed)<br>  }<br>  <br>  <br>  args &lt;- list(num.trees = num.trees,<br>               clusters = clusters,<br>               samples.per.cluster = samples.per.cluster,<br>               sample.fraction = sample.fraction,<br>               mtry = mtry,<br>               min.node.size = min.node.size,<br>               honesty = honesty,<br>               honesty.fraction = honesty.fraction,<br>               honesty.prune.leaves = honesty.prune.leaves,<br>               alpha = alpha,<br>               imbalance.penalty = imbalance.penalty,<br>               ci.group.size = ci.group.size,<br>               compute.oob.predictions = compute.oob.predictions,<br>               num.threads = num.threads,<br>               seed = seed,<br>               num_features = num.features,<br>               bandwidth = bandwidth,<br>               node_scaling = ifelse(node.scaling, 1, 0))<br>  <br>  if (splitting.rule == &quot;CART&quot;) {<br>    ##forest &lt;- do.call(gini_train, c(data, args))<br>    forest &lt;- drf:::do.call.rcpp(drf:::gini_train, c(data, args))<br>    ##forest &lt;- do.call(gini_train, c(data, args))<br>  } else if (splitting.rule == &quot;FourierMMD&quot;) {<br>    forest &lt;- drf:::do.call.rcpp(drf:::fourier_train, c(data, args))<br>  } else {<br>    stop(&quot;splitting rule not available.&quot;)<br>  }<br>  <br>  class(forest) &lt;- c(&quot;drf&quot;)<br>  forest[[&quot;ci.group.size&quot;]] &lt;- ci.group.size<br>  forest[[&quot;X.orig&quot;]] &lt;- X.mat<br>  forest[[&quot;is.df.X&quot;]] &lt;- is.data.frame(X)<br>  forest[[&quot;Y.orig&quot;]] &lt;- Y<br>  forest[[&quot;sample.weights&quot;]] &lt;- sample.weights<br>  forest[[&quot;clusters&quot;]] &lt;- clusters<br>  forest[[&quot;equalize.cluster.weights&quot;]] &lt;- equalize.cluster.weights<br>  forest[[&quot;tunable.params&quot;]] &lt;- args[all.tunable.params]<br>  forest[[&quot;mat.col.names&quot;]] &lt;- mat.col.names<br>  forest[[&quot;mat.col.names.df&quot;]] &lt;- mat.col.names.df<br>  forest[[&quot;any.factor.or.character&quot;]] &lt;- any.factor.or.character<br>  <br>  if (compute.variable.importance) {<br>    forest[[&#39;variable.importance&#39;]] &lt;- variableImportance(forest, h = bandwidth)<br>  }<br>  <br>  forest<br>}</pre><h4><strong><em>Citations</em></strong></h4><p>[1] Twala, B. E. T. H., M. C. Jones, and David J. Hand. Good methods for coping with missing data in decision trees. <em>Pattern Recognition Letters 29</em>,2008.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=3daaea103db0" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/random-forests-and-missing-values-3daaea103db0">Random Forests and Missing Values</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Studying the Gender Wage Gap in the US Using Distributional Random Forests]]></title>
            <link>https://medium.com/data-science/studying-the-gender-wage-gap-in-the-us-using-distributional-random-forests-ec4c2a69abf0?source=rss-ca780798011a------2</link>
            <guid isPermaLink="false">https://medium.com/p/ec4c2a69abf0</guid>
            <category><![CDATA[wage-gap]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[random-forest]]></category>
            <category><![CDATA[data-for-change]]></category>
            <dc:creator><![CDATA[Jeffrey Näf]]></dc:creator>
            <pubDate>Sat, 18 Feb 2023 00:35:04 GMT</pubDate>
            <atom:updated>2024-10-22T14:40:34.133Z</atom:updated>
            <content:encoded><![CDATA[<h4>An example of a real data analysis with DRF</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*bOIuAfgCFLGVyzBB" /><figcaption>Photo by <a href="https://unsplash.com/@mettyunuabona?utm_source=medium&amp;utm_medium=referral">Ehimetalor Akhere Unuabona</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>In two previous articles, I explained <a href="https://towardsdatascience.com/drf-a-random-forest-for-almost-everything-625fa5c3bcb8">Distributional Random Forests (DRFs)</a>, a Random Forest that is able to estimate conditional distributions, as well as an <a href="https://medium.com/@jeffrey_85949/inference-for-distributional-random-forests-64610bbb3927">extension of the method</a> that allows for uncertainty quantification, like Confidence Intervals, etc. Here I present an example of a real-world application to wage data from the 2018 American Community Survey by the US Census Bureau. In the first <a href="https://www.jmlr.org/papers/v23/21-0585.html">DRF paper</a>, we obtained data on approximately 1 million full-time employees from the 2018 American Community Survey by the US Census Bureau from which we extracted the salary information and all covariates that might be relevant for salaries. This wealth of data is ideal to experiment with a method like DRF (in fact we will only use a tiny subset for this analysis).</p><p>When one studies the raw data on hourly wages, there is a consistent gap between the two genders, in that men tend to earn more. An interesting question is whether the observed gap in hourly wages (<em>W</em>) of men (<em>G=1</em>) and women (<em>G=0</em>) is due to gender alone or whether it can be explained by some other confounding variables <strong><em>X</em></strong>, which are influenced by gender and in turn influence wage. That is, we want to study the effect size corresponding to the bold arrow in the following causal graph:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/704/1*bsxOHQIkDIpC2e516MkYRw.png" /><figcaption>Assumed Causal Graph, G=Gender, W=Wage and <strong>X</strong> are confounders</figcaption></figure><p>For example, let’s assume that <strong><em>X</em></strong> only includes occupation and that women have a tendency to choose occupations that do not entail a high monetary reward, such as doctors, nurses, or teachers, while men tend to have professional gambling jobs with obscene hourly wages. If this alone were to drive the difference in hourly wages between genders, we would still see a wage gap when looking directly at the hourly wage data. However if we then fix the occupation (<strong><em>X</em></strong>) only to doctors and compare the two wage distributions there, any statistically significant difference can only come from gender alone.</p><p>We focus on a two-stage analysis:</p><ul><li>We fix <strong><em>X</em></strong> to a particular value and compare the distribution of wages in the two groups for the covariates fixed to <strong><em>X</em></strong><em>=</em><strong><em>x</em></strong>. This is interesting from two points of view: First, if <strong><em>X</em></strong> really includes all other factors that influence wages and are related to gender, then fixing <strong><em>X</em></strong><em>=</em><strong><em>x</em></strong> and looking at wages for both genders means we really observe the effect of Gender on wages. 
Second, it allows for the prediction of the whole wage distribution for an individual with given characteristics <strong><em>x</em></strong>.</li><li>We use the assumed causal graph above and the rules of causality to estimate a counterfactual distribution with DRF: The distribution of women’s wages had they been treated as men for setting the wage. If <strong><em>X</em> </strong>contains all the relevant covariates and there is no gender pay gap, this distribution should be the same as the wage distribution for men (neglecting statistical randomness).</li></ul><p>This article is the culmination of the work of several people: The code and the dataset were obtained from the original <a href="https://github.com/lorismichel/drf">DRF repository</a> and then combined with the methods developed in our new paper on <a href="https://arxiv.org/abs/2302.05761">arXiv</a>, written together with <a href="http://www.linkedin.com/in/corinne-rahel-emmenegger">Corinne Emenegger</a>.</p><p>Before going on, I want to point out that this is <em>only an example to illustrate the use of DRF</em>. I do not want to make any serious (causal) claims here, simply because the analysis is surely flawed in some regard and the causal graph we assume below is surely wrong. Moreover, we only use a tiny subset of the available data.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/931/1*eotwYHcwrPx5xTvd0a7Isg.png" /></figure><p>Also, note that the code is quite slow to run. This is because, while DRF itself is coded in C, the repeated fitting needed for Confidence Intervals is implemented in R so far.</p><p>That said, let’s dive in. In the following, all images, unless otherwise noted, are by the author.</p><h4>Data</h4><p>The PUMS (Public Use Microdata Area) data from the 2018 1-Year American Community Survey is obtained from the <a href="https://www.census.gov/content/dam/Census/data/developers/api-user-guide/api-guide.pdf">US Census Bureau API</a>. The survey is sent to ≈ 3.5 million people annually and aims to give more up-to-date data than the official census that is carried out every decade. The 2018 data set has about 3 million anonymized data points for the 51 states and the District of Columbia. For the DRF paper linked above, we retrieved only the subset of variables that might be relevant for the salaries, such as a person’s gender, age, race, person’s marital status, education level, and level of English knowledge.</p><p>The preprocessed data can be found <a href="https://github.com/lorismichel/drf/tree/master/applications/wage_data/data/datasets">here</a>. 
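</p><p>Before diving into the code, a short note on setup: the snippets below assume that the preprocessed data from the repository just linked has been loaded into a data.frame called wage, and that roughly the following packages are attached (the file name here is only a placeholder):</p><pre># Assumed setup, not shown explicitly in the article<br>library(drf)          # Distributional Random Forests<br>library(fastDummies)  # dummy_cols()<br>library(ggplot2)      # density plots<br>library(Hmisc)        # wtd.quantile()<br>library(kernlab)      # kernelMatrix(), rbfdot() used in the witness-function code<br><br>load(&quot;wage_data.Rdata&quot;)  # placeholder name; see the linked repository for the actual files</pre><p>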
We first do some further cleaning:</p><pre>##Further data cleaning ##<br><br>which = rep(TRUE, nrow(wage))<br>which = which &amp; (wage$age &gt;= 17)<br>which = which &amp; (wage$weeks_worked &gt; 48)<br>which = which &amp; (wage$hours_worked &gt; 16)<br>which = which &amp; (wage$employment_status == &#39;employed&#39;)<br>which = which &amp; (wage$employer != &#39;self-employed&#39;)<br>which[is.na(which)] = FALSE<br><br>data = wage[which, ]<br>sum(is.na(data))<br>colSums(is.na(data))<br>rownames(data) = 1:nrow(data)<br>#data = na.omit(data)<br><br>data$log_wage = log(data$salary / (data$weeks_worked * data$hours_worked))<br><br><br>## Prepare data and fit drf<br>## Define X and Y<br>X = data[,c(<br>  &#39;age&#39;,<br>  &#39;race&#39;,<br>  &#39;hispanic_origin&#39;,<br>  &#39;citizenship&#39;,<br>  &#39;nativity&#39;, <br>  &#39;marital&#39;,<br>  &#39;family_size&#39;,<br>  &#39;children&#39;,<br>  &#39;education_level&#39;,<br>  &#39;english_level&#39;,<br>  &#39;economic_region&#39;<br>)]<br>X$occupation = unlist(lapply(as.character(data$occupation), function(s){return(substr(s, 1, 2))}))<br>X$occupation = as.factor(X$occupation)<br>X$industry = unlist(lapply(as.character(data$industry), function(s){return(substr(s, 1, 2))}))<br>X$industry[X$industry %in% c(&#39;32&#39;, &#39;33&#39;, &#39;3M&#39;)] = &#39;31&#39;<br>X$industry[X$industry %in% c(&#39;42&#39;)] = &#39;41&#39;<br>X$industry[X$industry %in% c(&#39;45&#39;, &#39;4M&#39;)] = &#39;44&#39;<br>X$industry[X$industry %in% c(&#39;49&#39;)] = &#39;48&#39;<br>X$industry[X$industry %in% c(&#39;92&#39;)] = &#39;91&#39;<br>X$industry = as.factor(X$industry)<br>X=dummy_cols(X, remove_selected_columns = TRUE)<br>X = as.matrix(X)<br><br>Y = data[,c(&#39;sex&#39;, &#39;log_wage&#39;)]<br>Y$sex = (Y$sex == &#39;male&#39;)<br>Y = as.matrix(Y)</pre><p>These are actually way more observations than we need, and we instead subsample 4&#39;000 training data points at random for the analysis here.</p><pre>train_idx = sample(1:nrow(data), 4000, replace = FALSE)<br><br>## Focus on training data<br>Ytrain=Y[train_idx,]<br>Xtrain=X[train_idx,]</pre><p>Again, this is because it is just an illustration— in reality, you would want to take as many data points as you can get. 
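</p><p>One detail that is easy to miss: the counterfactual analysis further below also uses a held-out test sample (Xtest, Ytest), which is not defined in the snippets shown here. A simple choice (an assumption on my part; any held-out subset will do) is to take it from rows not used for training:</p><pre>test_idx &lt;- sample(setdiff(1:nrow(data), train_idx), 2000, replace = FALSE)<br><br>Xtest &lt;- X[test_idx,]<br>Ytest &lt;- Y[test_idx,]</pre><p>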
The estimated wage densities of the two sexes for these 4&#39;000 data points are plotted in Figure 1, with this code:</p><pre>## Plot the test data without adjustment<br>plotdfunadj = data[train_idx, ]<br>plotdfunadj$weight=1<br>plotdfunadj$plotweight[plotdfunadj$sex==&#39;female&#39;] = plotdfunadj$weight[plotdfunadj$sex==&#39;female&#39;]/sum(plotdfunadj$weight[plotdfunadj$sex==&#39;female&#39;])<br>plotdfunadj$plotweight[plotdfunadj$sex==&#39;male&#39;] = plotdfunadj$weight[plotdfunadj$sex==&#39;male&#39;]/sum(plotdfunadj$weight[plotdfunadj$sex==&#39;male&#39;])<br><br>#pooled data<br>ggplot(plotdfunadj, aes(log_wage)) +<br>  geom_density(adjust=2.5, alpha = 0.3, show.legend = TRUE,  aes(fill=sex, weight=plotweight)) +<br>  theme_light()+<br>  scale_fill_discrete(name = &quot;gender&quot;, labels = c(&#39;female&#39;, &quot;male&quot;))+<br>  theme(legend.position = c(0.83, 0.66),<br>        legend.text=element_text(size=18),<br>        legend.title=element_text(size=20),<br>        legend.background = element_rect(fill=alpha(&#39;white&#39;, 0.5)),<br>        axis.text.x = element_text(size=14),<br>        axis.text.y = element_text(size=14),<br>        axis.title.x = element_text(size=19),<br>        axis.title.y = element_text(size=19))+<br>  labs(x=&#39;log(hourly_wage)&#39;)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*MlvPk7mzKJYmgg7xQhbTpQ.png" /><figcaption>Estimated densities for the (unconditional) raw log of the hourly wages</figcaption></figure><p>Calculating the percentage median difference between the two wages, that is</p><p>(median wage men- median wage women)/(median wage women) *100,</p><p>we obtain around 18 percent. That is, the median salary of men is 18 percent higher than that of women in the unadjusted data (!)</p><pre>## Median Difference before adjustment!<br>quantile_maleunadj = wtd.quantile(x=plotdfunadj$log_wage, weights=plotdfunadj$plotweight*(plotdfunadj$sex==&#39;male&#39;), normwt=TRUE, probs=0.5)<br>quantile_femaleunadj = wtd.quantile(x=plotdfunadj$log_wage, weights=plotdfunadj$plotweight*(plotdfunadj$sex==&#39;female&#39;), normwt=TRUE, probs=0.5)<br>(1-exp(quantile_femaleunadj)/exp(quantile_maleunadj))</pre><h4>Analysis</h4><p>The question now becomes whether this is truly “unfair”. That is, we assume the causal graph above, where Gender (<em>G</em>) influences both Wage (<em>W</em>), as well as covariates <strong><em>X</em></strong> that in turn influence <em>W</em>. What we would like to know is whether Gender directly influences Wage (the bold arrow). That is, if a woman and a man with the exact same characteristics <strong><em>X</em></strong><em>=</em><strong><em>x</em></strong> get the same wage, or whether she gets less, simply because of her gender.</p><p>We will study this in two settings. The first one is to truly hold <strong><em>X</em></strong><em>=</em><strong><em>x</em></strong> fixed and to use the machinery explained in the <a href="https://medium.com/@jeffrey_85949/inference-for-distributional-random-forests-64610bbb3927">earlier article</a>. 
Intuitively, if we fix all other covariates that could influence wage besides gender, and now compare the two wage distributions, then any observed difference must be due to gender alone.</p><p>The second one tries to quantify this difference over all possible values of <strong>X</strong>. This is done by calculating the counterfactual distribution of</p><p>W(male, <strong>X</strong>(female)).</p><p>This quantity is the counterfactual wage a man gets if he has exactly the characteristics of a woman. That is, we ask for the wage a woman gets if treated like a man.</p><p>Note that this assumes that the causal graph above is correct. In particular, it assumes that <strong><em>X</em> </strong>captures all relevant factors besides gender that would determine the wage. It could very well be that this is not the case, thus the disclaimer at the beginning of this article.</p><h4>Studying the conditional distributional differences</h4><p>In the following, we fix <strong><em>x</em></strong> to an arbitrary point:</p><pre>i&lt;-47<br><br># Important: Test point needs to be a matrix<br>test_point&lt;-X[i,, drop=F]</pre><p>The following picture shows some of the values contained in this test point <strong><em>x</em> </strong>— we are looking at childcare workers with high school diplomas who are married and have 1 child. With DRF we can estimate and plot the densities for the two groups conditional on <strong><em>X</em></strong><em>=</em><strong><em>x</em></strong>:</p><pre># Load all relevant functions (the CIdrf.R file can be found at the end of this <br># article)<br>source(&#39;CIdrf.R&#39;)<br><br><br>## Fit the new DRF framework (I forgot to include this in an earlier<br>## version of the article, apologies)<br>drf_fit &lt;- drfCI(X=Xtrain, Y=Ytrain, min.node.size = 20, splitting.rule=&#39;FourierMMD&#39;, num.features=10, B=100)<br><br># predict with the new framework<br>DRF = predictdrf(drf_fit, x=test_point)<br>weights &lt;- DRF$weights<br><br><br>## Conditional Density Plotting<br>plotdfx = data[train_idx, ]<br><br>propensity = sum(weights[plotdfx$sex==&#39;female&#39;])<br>plotdfx$plotweight = 0<br>plotdfx$plotweight[plotdfx$sex==&#39;female&#39;] = weights[plotdfx$sex==&#39;female&#39;]/propensity<br>plotdfx$plotweight[plotdfx$sex==&#39;male&#39;] = weights[plotdfx$sex==&#39;male&#39;]/(1-propensity)<br><br>gg = ggplot(plotdfx, aes(log_wage)) +<br>  geom_density(adjust=5, alpha = 0.3, show.legend=TRUE,  aes(fill=sex, weight=plotweight)) +<br>  labs(x=&#39;log(hourly wage)&#39;)+<br>  theme_light()+<br>  scale_fill_discrete(name = &quot;gender&quot;, labels = c(sprintf(&quot;F: %g%%&quot;, round(100*propensity, 1)), sprintf(&quot;M: %g%%&quot;, round(100*(1-propensity), 1))))+<br>  theme(legend.position = c(0.9, 0.65),<br>        legend.text=element_text(size=18),<br>        legend.title=element_text(size=20),<br>        legend.background = element_rect(fill=alpha(&#39;white&#39;, 0)),<br>        axis.text.x = element_text(size=14),<br>        axis.text.y = element_text(size=14),<br>        axis.title.x = element_text(size=19),<br>        axis.title.y = element_text(size=19))+<br>  annotate(&quot;text&quot;, x=-1, y=Inf, hjust=0, vjust=1, size=5, label = point_description(data[i,]))<br>plot(gg)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*2qdUc7JqIyKddSSste0eQg.png" /><figcaption>Estimated density of the log(hourly wage) for the two genders given <strong>X</strong>=<strong>x</strong>. 
The code for this plot can be found at the end of the article.</figcaption></figure><p>In this plot, there appears to be a clear difference in wages, even in the case of this fixed <strong><em>x</em></strong> (remember, all the assumed confounders are fixed in this case, so we really just compare wages directly). With DRF we now estimate and test the <em>median difference</em></p><pre>## Getting the respective weights<br>weightsmale&lt;-weights*(Ytrain[, &quot;sex&quot;]==1)/sum(weights*(Ytrain[, &quot;sex&quot;]==1))<br>weightsfemale&lt;-weights*(Ytrain[, &quot;sex&quot;]==0)/sum(weights*(Ytrain[, &quot;sex&quot;]==0))<br><br><br>## Choosing alpha:<br>alpha&lt;-0.05<br><br><br># Step 1: Doing Median comparison for fixed x<br><br>quantile_male = wtd.quantile(x=data$log_wage[train_idx], weights=matrix(weightsmale), normwt=TRUE, probs=0.5)<br>quantile_female = wtd.quantile(x=data$log_wage[train_idx], weights=matrix(weightsfemale), normwt=TRUE, probs=0.5)<br><br>(medianx&lt;-unname(1-exp(quantile_female)/exp(quantile_male)))<br><br><br>mediandist &lt;- sapply(DRF$weightsb, function(wb) {<br>  <br>  wbmale&lt;-wb*(Ytrain[, &quot;sex&quot;]==1)/sum(wb*(Ytrain[, &quot;sex&quot;]==1))<br>  wbfemale&lt;-wb*(Ytrain[, &quot;sex&quot;]==0)/sum(wb*(Ytrain[, &quot;sex&quot;]==0))<br>  <br>  <br>  quantile_maleb = wtd.quantile(x=data$log_wage[train_idx], weights=matrix(wbmale), normwt=TRUE, probs=0.5)<br>  quantile_femaleb = wtd.quantile(x=data$log_wage[train_idx], weights=matrix(wbfemale), normwt=TRUE, probs=0.5)<br>  <br>  <br>  return( unname(1-exp(quantile_femaleb)/exp(quantile_maleb)) ) <br>})<br><br>varx&lt;-var(mediandist)<br><br>## Use Gaussian CI:<br>(upper&lt;-medianx + qnorm(1-alpha/2)*sqrt(varx))<br>(lower&lt;-medianx - qnorm(1-alpha/2)*sqrt(varx))<br><br></pre><p>This gives a confidence interval of the median difference of</p><p>(0.06, 0.40) or (6%, 40%)</p><p>This interval very clearly does not contain zero and thus the median difference is indeed significant.</p><p>Using the Witobj function, we can make this difference better visible</p><pre><br>Witobj&lt;-Witdrf(drf_fit, x=test_point, groupingvar=&quot;sex&quot;, alpha=0.05)<br><br><br><br>hatmun&lt;-function(y,Witobj){<br>  <br>  c&lt;-Witobj$c<br>  k_Y&lt;-Witobj$k_Y<br>  Y&lt;-Witobj$Y<br>  weightsall1&lt;-Witobj$weightsall1<br>  weightsall0&lt;-Witobj$weightsall0<br>  Ky=t(kernelMatrix(k_Y, Y , y = y))<br>  <br>  out&lt;-list()<br>  out$val &lt;- tcrossprod(Ky, weightsall1  ) - tcrossprod(Ky, weightsall0  )<br>  out$upper&lt;-  out$val+sqrt(c)<br>  out$lower&lt;-  out$val-sqrt(c)<br>  <br>  return( out )<br>  <br>  <br>  <br>}<br><br>all&lt;-hatmun(sort(Witobj$Y),Witobj)<br><br><br>plot(sort(Witobj$Y),all$val , type=&quot;l&quot;, col=&quot;darkblue&quot;, lwd=2, ylim=c(min(all$lower), max(all$upper)),<br>     xlab=&quot;log(wage)&quot;, ylab=&quot;witness function&quot;)<br>lines(sort(Witobj$Y),all$upper , type=&quot;l&quot;, col=&quot;darkgreen&quot;, lwd=2 )<br>lines(sort(Witobj$Y),all$lower , type=&quot;l&quot;, col=&quot;darkgreen&quot;, lwd=2 )<br>abline(h=0)</pre><p>which leads to the Figure:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*yUemNKO16nTXsyNbkeVw4A.png" /><figcaption>Estimate of the conditional witness function for wage men minus wage women.</figcaption></figure><p>We refer to the <a href="https://medium.com/@jeffrey_85949/inference-for-distributional-random-forests-64610bbb3927">companion article</a> for a more detailed explanation of this concept. 
Essentially it can be thought of as</p><p><em>conditional density of log(wage) of men given </em><strong><em>x</em></strong><em> — conditional density of log(wage) of women given </em><strong><em>x</em></strong></p><p>That is, the conditional witness function shows where the density of one group is larger than the other, without having to actually estimate the density. In this example, negative values mean the density of women’s wages is higher than that of men conditional on <strong><em>x</em></strong>, and positive values mean the density of women’s wages is lower. Since we already estimated the conditional densities above, the conditional witness function alone does not add much more. But it’s good for illustration purposes. Indeed, we see that it is negative at the start, for values where the conditional density of women’s wages is higher than the conditional density of men’s wages. Vice-versa it turns positive for larger values, for which the conditional density of men’s wages is higher than for women. Thus the relevant information about the two densities is summarized in the witness function plot: We see that the density for women’s wages is higher for lower values of wages and lower for higher values, indicating that the density is shifted to the left and women earn less! Moreover, we can also provide 95% confidence bands in green that include the true function with 95%, <em>uniformly over all y</em>. (Though one really needs a lot of data to make this valid) Since this uniform confidence band does not contain the zero line between around 2 and 2.5, we see again that the difference between the two distributions is statistically significant.</p><p>Conditioning on a particular <strong><em>x</em></strong>, allows us to study individual effects in great detail and with a notion of uncertainty. However, it can also be interesting to study the overall effect. We do this by estimating the counterfactual distribution, in the next section.</p><h4>Estimating the counterfactual distribution</h4><p>Using the calculation laws of causality on our assumed causal graph, it can be derived that:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*mQ0qES9kD2dFr2jYoL-yTg.png" /></figure><p>I.e. 
the distribution of the counterfactual we are looking for is obtained by averaging the conditional distribution of <em>W | G=male</em>, <strong><em>X</em></strong><em>=</em><strong><em>x</em></strong>, over the <strong><em>x</em></strong> of the gender female.</p><p>As the distributions are given as simple weights, this is easily done with DRF as follows:</p><pre>## Add code<br><br>## Male is 1, Female is 0<br><br># obtain all X from the female test population<br>Xtestf&lt;-Xtest[Ytest[,&quot;sex&quot;]==0,]<br><br><br># Obtain the conditional distribution of W | G=male, X=x, for x in the female<br># population.<br><br># These weights correspond to P(W, G=male | X=x  )<br>weightsf&lt;-predictdrf(drf_fit, x=Xtestf)$weights*(Ytrain[, &quot;sex&quot;]==1)<br>weightsf&lt;-weightsf/rowSums(weightsf)<br><br># The counterfactual distribution is the average over those weights/distributions <br>counterfactualw&lt;-colMeans(weightsf)</pre><p>which leads to the following counterfactual density estimate:</p><pre>plotdfc&lt;-rbind(plotdfc, plotdfunadj[plotdfunadj$sex==&#39;female&#39;,])<br>plotdfc$sex2&lt;-c(rep(1, length(train_idx)), rep(0,nrow(plotdfunadj[plotdfunadj$sex==&#39;female&#39;,])))<br><br>plotdfc$sex2&lt;-factor(plotdfc$sex2)<br><br><br>#interventional distribution<br>ggplot(plotdfc, aes(log_wage)) +<br>  geom_density(adjust=2.5, alpha = 0.3, show.legend=TRUE,  aes(fill=sex2, weight=plotweight)) +<br>  theme_light()+<br>  scale_fill_discrete(name = &quot;&quot;, labels = c(&quot;observed women&#39;s wages&quot;, &quot;wages if treated as men&quot;))+<br>  theme(legend.position = c(0.2, 0.98),<br>        legend.text=element_text(size=16),<br>        legend.title=element_text(size=20),<br>        legend.background = element_rect(fill=alpha(&#39;white&#39;, 0)),<br>        axis.text.x = element_text(size=14),<br>        axis.text.y = element_text(size=14),<br>        axis.title.x = element_text(size=19),<br>        axis.title.y = element_text(size=19))+<br>  labs(x=&#39;log(hourly wage)&#39;)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*DvBKJOc-JSOVDpRHMO_5Tg.png" /></figure><p>These two densities are now the density of women’s wages in red, and the density of women had they been treated as men for setting the wage in green-bluish. Clearly, the densities are now closer together than they were before — adjusting for the confounders made the difference in gender pay smaller. However, the median difference is still</p><pre>quantile_male = wtd.quantile(x=plotdfc$log_wage[plotdfc$sex2==1], weights=counterfactualw, normwt=TRUE, probs=0.5)<br>quantile_female = wtd.quantile(x=plotdfunadj$log_wage, weights=plotdfunadj$plotweight*(plotdfunadj$sex==&#39;female&#39;), normwt=TRUE, probs=0.5)<br>(1-exp(quantile_female)/exp(quantile_male))<br><br></pre><p>0.11 or 11 percent!</p><p>Thus if our analysis is correct, 11% of the difference in pay can still be attributed to gender alone. In other words, while we managed to reduce the difference from 18% in median income in the unadjusted data to 11%, there is still a substantial difference left over, indicating an “unfair” wage gap between the genders (at least if <strong><em>X</em></strong> really captures the relevant confounders).</p><h4>Conclusion</h4><p>In this article, we studied an example of how DRF could be used for a real data analysis. 
We studied both the case of a fixed <strong><em>x</em></strong>, for which the methods discussed in this article allow us to build uncertainty measures, and the distribution of a counterfactual quantity. In both cases, we saw that there was still a substantial (and, in the case of a fixed <strong><em>x</em></strong>, significant) difference when adjusting for the available confounders.</p><p>Though I did not check, it might be interesting to see how the results from this little experiment stack up against a more serious analysis. In any case, I hope this article showcased how DRF could be used in a real-world data analysis.</p><h4>Additional Code</h4><p>The full code can also be found on <a href="https://github.com/JeffNaef/Medium-Articles/blob/main/wageapplication">Github</a>.</p><pre>## Functions in CIdrf.R that is loaded above ##<br><br>drfCI &lt;- function(X, Y, B, sampling = &quot;binomial&quot;,...) {<br><br>### Function that uses DRF with subsampling to obtain confidence regions<br>### as described in https://arxiv.org/pdf/2302.05761.pdf<br>### X: Matrix of predictors<br>### Y: Matrix of variables of interest<br>### B: Number of half-samples/mini-forests<br><br><br>  n &lt;- dim(X)[1]<br>  <br>  # compute point estimator and DRF per halfsample S<br>  # weightsb: B times n matrix of weights<br>  DRFlist &lt;- lapply(seq_len(B), function(b) {<br>    <br>    # half-sample index<br>    indexb &lt;- if (sampling == &quot;binomial&quot;) {<br>      seq_len(n)[as.logical(rbinom(n, size = 1, prob = 0.5))]<br>    } else {<br>      sample(seq_len(n), floor(n / 2), replace = FALSE)<br>    }<br>    <br>    ## Refit DRF on the half-sample S<br>    DRFb &lt;- <br>      drf(X = X[indexb, , drop = F], Y = Y[indexb, , drop = F],<br>          ci.group.size = 1, ...)<br>    <br>    <br>    return(list(DRF = DRFb, indices = indexb))<br>  })<br>  <br>  return(list(DRFlist = DRFlist, X = X, Y = Y) )<br>}<br><br><br>predictdrf&lt;- function(DRF, x, ...)
{<br><br>### Function to predict from DRF with Confidence Bands<br>### DRF: DRF object<br>### x: Testpoint<br>  <br>  ntest &lt;- nrow(x)<br>  n &lt;- nrow(DRF$Y)<br>  <br>  ## extract the weights w^S(x)<br>  weightsb &lt;- lapply(DRF$DRFlist, function(l) {<br>    <br>    weightsbfinal &lt;- Matrix(0, nrow = ntest, ncol = n , sparse = TRUE)<br>    <br>    weightsbfinal[, l$indices] &lt;- predict(l$DRF, x)$weights <br>    <br>    return(weightsbfinal)<br>  })<br>  <br>  <br>  ## obtain the overall weights w<br>  weights&lt;- Reduce(&quot;+&quot;, weightsb) / length(weightsb)<br>  <br>    <br>return(list(weights = weights, weightsb = weightsb ))<br>}<br><br><br><br>Witdrf&lt;- function(DRF, x, groupingvar, alpha=0.05, ...){<br>  <br>### Function to calculate the conditional witness function with<br>### confidence bands from DRF<br>### DRF: DRF object<br>### x: Testpoint<br>  <br>  if (is.null(dim(x)) ){<br>    <br>  stop(&quot;x needs to have dim(x) &gt; 0&quot;)<br>  }<br>  <br>  ntest &lt;- nrow(x)<br>  n &lt;- nrow(DRF$Y)<br>  coln&lt;-colnames(DRF$Y)<br>  <br>  <br>  ## Collect w^S<br>  weightsb &lt;- lapply(DRF$DRFlist, function(l) {<br>    <br>    weightsbfinal &lt;- Matrix(0, nrow = ntest, ncol = n , sparse = TRUE)<br>    <br>    weightsbfinal[, l$indices] &lt;- predict(l$DRF, x)$weights <br>    <br>    return(weightsbfinal)<br>  })<br>  <br>  ## Obtain w<br>  weightsall &lt;- Reduce(&quot;+&quot;, weightsb) / length(weightsb)<br>  <br>  #weightsall0&lt;-weightsall[, DRF$Y[, groupingvar]==0, drop=F]<br>  #weightsall1&lt;-weightsall[,DRF$Y[, groupingvar]==1, drop=F]<br>  <br>  <br>  # Get the weights of the respective classes (need to standardize by propensity!)<br>  weightsall0&lt;-weightsall*(DRF$Y[, groupingvar]==0)/sum(weightsall*(DRF$Y[, groupingvar]==0))<br>  weightsall1&lt;-weightsall*(DRF$Y[, groupingvar]==1)/sum(weightsall*(DRF$Y[, groupingvar]==1))<br>  <br>  <br>  bandwidth_Y &lt;- drf:::medianHeuristic(DRF$Y)<br>  k_Y &lt;- rbfdot(sigma = bandwidth_Y)<br><br>  K&lt;-kernelMatrix(k_Y, DRF$Y[,coln[coln!=groupingvar]], y = DRF$Y[,coln[coln!=groupingvar]])<br><br>  <br>  nulldist &lt;- sapply(weightsb, function(wb){<br>    # iterate over class 1<br><br>    wb0&lt;-wb*(DRF$Y[, groupingvar]==0)/sum(wb*(DRF$Y[, groupingvar]==0))<br>    wb1&lt;-wb*(DRF$Y[, groupingvar]==1)/sum(wb*(DRF$Y[, groupingvar]==1))<br>    <br>    <br>    diag( ( wb0-weightsall0 - (wb1-weightsall1) )%*%K%*%t( wb0-weightsall0 - (wb1-weightsall1) )  )<br>    <br>    <br>  })<br>  <br>  # Choose the right quantile<br>  c&lt;-quantile(nulldist, 1-alpha)<br><br>  <br>  return(list(c=c, k_Y=k_Y, Y=DRF$Y[,coln[coln!=groupingvar]], nulldist=nulldist, weightsall0=weightsall0, weightsall1=weightsall1))<br>  <br>  <br>  <br>}</pre><pre>### Code to generate plots<br><br>## Step 0: Choosing x<br><br>point_description = function(test_point){<br>  out = &#39;&#39;<br>  <br>  out = paste(out, &#39;job: &#39;, test_point$occupation_description[1], sep=&#39;&#39;)<br>  out = paste(out, &#39;\nindustry: &#39;, test_point$industry_description[1], sep=&#39;&#39;)<br>  <br>  out = paste(out, &#39;\neducation: &#39;, test_point$education[1], sep=&#39;&#39;)<br>  out = paste(out, &#39;\nemployer: &#39;, test_point$employer[1], sep=&#39;&#39;)<br>  out = paste(out, &#39;\nregion: &#39;, test_point$economic_region[1], sep=&#39;&#39;)<br>  <br>  out = paste(out, &#39;\nmarital: &#39;, test_point$marital[1], sep=&#39;&#39;)<br>  out = paste(out, &#39;\nfamily_size: &#39;, test_point$family_size[1], sep=&#39;&#39;)<br>  out = paste(out, 
&#39;\nchildren: &#39;, test_point$children[1], sep=&#39;&#39;)<br>  <br>  out = paste(out, &#39;\nnativity: &#39;, test_point$nativity[1], sep=&#39;&#39;)<br>  out = paste(out, &#39;\nhispanic: &#39;, test_point$hispanic_origin[1], sep=&#39;&#39;)<br>  out = paste(out, &#39;\nrace: &#39;, test_point$race[1], sep=&#39;&#39;)<br>  out = paste(out, &#39;\nage: &#39;, test_point$age[1], sep=&#39;&#39;)<br>  <br>  return(out)<br>}</pre><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=ec4c2a69abf0" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/studying-the-gender-wage-gap-in-the-us-using-distributional-random-forests-ec4c2a69abf0">Studying the Gender Wage Gap in the US Using Distributional Random Forests</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Inference for Distributional Random Forests]]></title>
            <link>https://medium.com/data-science/inference-for-distributional-random-forests-64610bbb3927?source=rss-ca780798011a------2</link>
            <guid isPermaLink="false">https://medium.com/p/64610bbb3927</guid>
            <category><![CDATA[thoughts-and-theory]]></category>
            <category><![CDATA[random-forest]]></category>
            <category><![CDATA[probability-distributions]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[uncertainty]]></category>
            <dc:creator><![CDATA[Jeffrey Näf]]></dc:creator>
            <pubDate>Fri, 17 Feb 2023 08:23:31 GMT</pubDate>
            <atom:updated>2026-01-21T22:20:55.754Z</atom:updated>
            <content:encoded><![CDATA[<h4>Confidence intervals for a powerful nonparametric method</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*AaFl-wOsV_6dLA7frv2z9w.png" /><figcaption>Features of (Distributional) Random Forests. In this article: The ability to provide uncertainty measures. Source: Author.</figcaption></figure><p>In a previous <a href="https://towardsdatascience.com/drf-a-random-forest-for-almost-everything-625fa5c3bcb8">article</a>, I extensively discussed the Distributional Random Forest method, a Random Forest-type algorithm that can nonparametrically estimate multivariate conditional distributions. This means that we are able to learn the whole distribution of a multivariate response <strong><em>Y</em></strong> given some covariates <strong><em>X</em></strong> nonparametrically, instead of “just” learning an aspect such as its conditional expectation. DRF does this by learning weights <em>w_i(</em><strong><em>x</em></strong><em>) </em>for the <em>i=1,…,n</em> training points that define the distribution and can be used to estimate a wide range of targets.</p><p>So far this method only produced a “point estimate” of the distribution (i.e. a point estimate for the <em>n</em> weights <em>w_i(</em><strong><em>x</em></strong><em>)</em>). While this is enough to predict the whole distribution of a response, it doesn’t give a way to make inference that considers the randomness of the data-generating mechanism. That is, even though this point estimate gets increasingly close to the truth for large sample sizes (under a list of assumptions), there is still uncertainty in its estimate for finite sample sizes. Luckily there is now a (provable) method to quantify this uncertainty as I lay out in this article. This is based on our new paper on <a href="https://arxiv.org/pdf/2302.05761.pdf">arXiv</a>.</p><p>The goal of this article is twofold: First, I want to discuss how to add uncertainty estimates to the DRF, based on our paper. The paper is quite theoretical, so I start with a few examples. The subsequent sections take a quick glance at these theoretical results, for those interested. I then explain how this can be used to get a (sampling-based) uncertainty measure for a wide range of targets. Second, I discuss the CoDiTE of [1] and a particularly interesting example of this concept, the conditional witness function. This function is a complicated object, yet, as we will see below, we can estimate it easily with DRF and can even provide asymptotic confidence bands, based on the concepts introduced in this article. An extensive real-data example of how this could be applied is given in <a href="https://medium.com/@jeffrey_85949/studying-the-gender-wage-gap-in-the-us-using-distributional-random-forests-ec4c2a69abf0">this article</a>.</p><p>Throughout we assume to have a <em>d</em>-variate i.i.d. sample <strong><em>Y</em></strong><em>_1, …, </em><strong><em>Y</em></strong><em>_n</em> of variables of interest and a <em>p</em>-variate i.i.d. sample <strong><em>X</em></strong><em>_1,…,</em><strong><em>X</em></strong><em>_n</em> of covariates. 
The goal is to estimate the conditional distribution of <strong><em>Y</em></strong><em>|</em><strong><em>X=x</em></strong>.</p><p>We will need the following packages and functions for this:</p><pre>library(kernlab)<br>library(drf)<br>library(Matrix)<br>library(Hmisc)<br>library(MASS)<br>library(ggplot2)</pre><p>In the following, all images, unless otherwise noted, are by the author.</p><blockquote>The full code for this article can be found on <a href="https://github.com/JeffNaef/Medium-Articles/blob/main/DRF_Inference.R">Github</a>.</blockquote><h3>Examples</h3><p>We simulate from a simple example with <em>d=1</em> and <em>p=2</em>:</p><pre> <br>set.seed(2)<br><br>n&lt;-2000<br>beta1&lt;-1<br>beta2&lt;--1.8<br><br><br># Model Simulation<br>X&lt;-mvrnorm(n = n, mu=c(0,0), Sigma=matrix(c(1,0.7,0.7,1), nrow=2,ncol=2))<br>u&lt;-rnorm(n=n, sd = sqrt(exp(X[,1])))<br>Y&lt;- matrix(beta1*X[,1] + beta2*X[,2] + u, ncol=1)</pre><p>Note that this is simply a heteroskedastic linear model, with the variance of the error term depending on the <em>X_1</em> values. Of course, knowing the effect of <strong><em>X</em></strong> on <em>Y</em> is just linear, you would not use DRF, or any Random Forest for that matter, but directly go with linear regression. But for this purpose, it is convenient to know the truth. Since DRF’s job is to estimate a conditional distribution given <strong><em>X</em></strong><em>=</em><strong><em>x</em></strong>, we now fix<em> </em><strong><em>x</em></strong> and estimate the conditional expectation and variance given <strong><em>X</em></strong><em>=</em><strong><em>x.</em></strong></p><p>We choose a point that is right in the center of the <strong><em>X</em></strong> distribution, with lots of observations surrounding it. In general, one should be careful when using any Random Forest method for points on the border of the <strong><em>X</em></strong> observations.</p><pre># Choose an x that is not too far out<br>x&lt;-matrix(c(1,1),ncol=2)<br><br># Choose alpha for CIs<br>alpha&lt;-0.05</pre><p>Finally, we fit our DRF and obtain the weights <em>w_i(</em><strong><em>x</em></strong><em>)</em>:</p><pre>## Fit the new DRF framework<br>drf_fit &lt;- drf(X=X, Y=Y, min.node.size = 5, num.trees=2000, num.features=10, ci.group.size=2000/50)<br><br>## predict weights<br>DRF = predict(drf_fit, newdata=x, estimate.uncertainty = TRUE)<br>weights &lt;- DRF$weights[1,]<br><br># Bxn matrix of uncertainty weights<br>weightsb&lt;-DRF$weights.uncertainty[[1]]</pre><p>Note the option <strong>estimate.uncertainty = TRUE</strong> in the predict function. As explained below, the DRF object we built here not only contains the weights <em>w_i(</em><strong><em>x</em></strong><em>)</em>, but also a sample of <em>B</em> weights that correspond to draws from the distribution of <em>w_i(</em><strong><em>x</em></strong><em>)</em>. We can use these <em>B</em> draws to approximate the distribution of anything we want to estimate, as I illustrate now in two examples.</p><h4><strong>Example 1: Conditional Expectation</strong></h4><p>First, we simply do what most prediction methods do: We estimate the conditional expectation. 
With our new method, we also build a confidence interval around it.</p><pre># Estimate the conditional expectation at x:<br>condexpest&lt;- sum(weights*Y)<br><br># Use the distribution of weights, see below<br>distofcondexpest&lt;-apply(weightsb,1, function(wb)  sum(wb*Y)  )<br><br># Can either use the above directly to build confidence interval, or can use the normal approximation.<br># We will use the latter<br>varest&lt;-var(distofcondexpest-condexpest)<br><br># build 95%-CI<br>lower&lt;-condexpest - qnorm(1-alpha/2)*sqrt(varest)<br>upper&lt;-condexpest + qnorm(1-alpha/2)*sqrt(varest)<br><br>c(round(lower,2), round(condexpest,2), round(upper,2))<br>(-1.16, -0.59, -0.03)</pre><p>Importantly, though the estimated value is a bit off, this CI contains the truth, which is given as</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/607/1*aM0Tjfh-BGHUY-G-DE4e9A.png" /></figure><h4><strong>Example 2: Conditional Variance</strong></h4><p>Assume now we would like to find the variance Var(Y|<strong>X</strong>=<strong>x</strong>) instead of the conditional mean. This is quite a challenging example for a nonparametric method that cannot make use of the linearity. The truth is given as</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/858/1*9LS3kfTcJiPiThwXoPL4Xg.png" /></figure><p>Using DRF, we can estimate this as follows:</p><pre># Estimate the conditional expectation at x:<br>condvarest&lt;- sum(weights*Y^2) - condexpest^2<br><br>distofcondvarest&lt;-apply(weightsb,1,  function(wb)  {<br>  sum(wb*Y^2) - sum(wb*Y)^2<br>}  )<br><br># Can either use the above directly to build confidence interval, or can use the normal approximation.<br># We will use the latter<br>varest&lt;-var(distofcondvarest-condvarest)<br><br># build 95%-CI<br>lower&lt;-condvarest - qnorm(1-alpha/2)*sqrt(varest)<br>upper&lt;-condvarest + qnorm(1-alpha/2)*sqrt(varest)<br><br>c(round(lower,2), round(condvarest,2), round(upper,2))<br><br>(1.29, 2.24, 3.18)<br></pre><p>Thus the true parameter is contained in the CI, as we would hope, and in fact, we are quite close to the truth with our estimate!</p><p>We now study the theory underlying these examples, before we come to a third example in Causal Analysis.</p><h3>Asymptotic Normality in the RKHS</h3><p>In this and the next section, we briefly focus on the theoretical results derived in the paper. As explained above and in the article, DRF presents a distributional prediction at a test point <strong>x</strong>. That is, we obtain an estimate</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/392/1*ujFA6RdbxlXeSPLNpDsF0w.png" /></figure><p>of the conditional distribution of <strong><em>Y</em></strong> given <strong><em>X</em></strong><em>=</em><strong><em>x</em></strong>. This is just a typical way of writing an empirical measure, the magic lies in the weights <em>w_i(</em><strong><em>x</em></strong><em>)</em> — they can be used to easily obtain estimators of quantities of interest, or even to sample directly from the distribution.</p><p>To obtain this estimate, DRF actually estimates the conditional mean, but in a reproducing kernel Hilbert space (RKHS). An RKHS is defined through a kernel function <em>k(</em><strong><em>y</em></strong><em>_1, </em><strong><em>y</em></strong><em>_2)</em>. With this kernel, we can map each observation <strong><em>Y</em></strong><em>_i</em> into the Hilbert space, as <em>k(</em><strong><em>Y</em></strong><em>_i, .)</em>. 
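</p><p>As a small illustration (this snippet is not part of the original analysis; it simply uses the kernlab package loaded above, with an arbitrary bandwidth sigma = 1), we can look at the function <em>k(</em><strong><em>Y</em></strong><em>_1, .)</em> that the first training observation is mapped to:</p><pre># Purely illustrative sketch: the function k(Y_1, .) the first observation is mapped to<br># (sigma = 1 is an arbitrary bandwidth, chosen only for this illustration)<br>k_illus &lt;- rbfdot(sigma = 1)<br><br>ygrid &lt;- seq(min(Y), max(Y), length.out = 200)<br>phi_Y1 &lt;- sapply(ygrid, function(y) k_illus(Y[1, ], y))<br><br>plot(ygrid, phi_Y1, type = &quot;l&quot;, xlab = &quot;y&quot;, ylab = &quot;k(Y_1, y)&quot;)</pre><p>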
There is a myriad of methods using this extremely powerful tool, such as kernel ridge regression. The key point is that under some conditions, any distribution can be expressed as an element of this RKHS. It turns out that the true conditional distribution can be represented in the RKHS as the following expectation:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/585/1*1TY__-5mr8YtXeR63sFSNg.png" /></figure><p>So this is just another way of expressing the conditional distribution of <strong><em>Y</em></strong> given <strong><em>X</em></strong><em>=</em><strong><em>x</em></strong>. We then try to estimate this element with DRF like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/588/1*JeqPMonr6YbJBCH1yYsOtA.png" /></figure><p>Again we are using the weights obtained from DRF, but now form a weighted sum with <em>k(</em><strong><em>Y</em></strong><em>_i,.)</em> instead of the Dirac measures above. We can map back and forth between the two estimates by writing either of the two. The reason this matters is that we can write the conditional distribution estimate as a weighted mean in the RKHS! Just as the original Random Forest estimates a mean in the real numbers (the conditional expectation of <em>Y</em> given <strong><em>X</em></strong><em>=</em><strong><em>x</em></strong>), DRF estimates a mean in the RKHS. Only with the latter, it turns out we also obtain an estimate of the conditional distribution.</p><p>The reason this is important for our story is that this weighted mean in the RKHS behaves quite similarly in some regards to a (weighted) mean in <em>d</em> dimensions. That is, we can study its consistency and asymptotic normality using the myriad of tools that are available for averages. This is quite remarkable, as all interesting RKHS will be infinite-dimensional. The <a href="https://www.jmlr.org/papers/v23/21-0585.html">first DRF paper </a>already establishes consistency of the estimator in (1) in the RKHS. Our new paper now proves that, in addition,</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/511/1*1o64R7bA84bjvIHl9ymWzg.png" /></figure><p>where sigma_n is a standard deviation that goes to zero and <strong><em>Sigma</em></strong><em>_</em><strong><em>x</em></strong> is an operator that takes the place of a covariance matrix (again it all works quite similarly as in <em>d</em>-dimensional Euclidean space).</p><h3>Obtaining the sampling distribution</h3><p>Ok so, we have an asymptotic normality result in an infinite-dimensional space, what exactly does that mean? Well first, it means estimators derived from the DRF estimate that are “smooth’’ enough will also tend to be asymptotically normal. But this alone is still not useful, as we also need to have a variance estimate. Here a further result in our paper comes into play.</p><p>We leave away a lot of details here, but essentially we can use the following subsample scheme: Instead of just fitting say <em>N</em> trees to build our forest, we build <em>B</em> groups of <em>L</em> trees (such that <em>N=B*L</em>). Now for each group of trees or mini forests, we subsample at random about half of the data points and then fit the forest using only this subsample. Let’s call this subset of samples chosen <em>S</em>. For each drawn <em>S</em> we then get another DRF estimator in the Hilbert space denoted</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/596/1*wWErFvsrzmtI3fOa_LeVrw.png" /></figure><p>only using the samples in <em>S</em>. 
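</p><p>To make the scheme concrete, here is a rough sketch of such a subsampling loop (this is only meant to illustrate the idea, under the simplifying assumption that each group is just a DRF refitted on a random half-sample; in the code above, the corresponding draws are obtained directly through <strong>estimate.uncertainty = TRUE</strong>):</p><pre># Rough illustrative sketch of the half-sampling scheme described in the text<br>B &lt;- 50<br><br>halfsample_fits &lt;- lapply(seq_len(B), function(b) {<br>  # draw a random half-sample S of the data<br>  S &lt;- sample(seq_len(nrow(X)), floor(nrow(X)/2), replace = FALSE)<br>  # refit DRF only on S<br>  fit_b &lt;- drf(X = X[S, , drop = FALSE], Y = Y[S, , drop = FALSE], num.trees = 50)<br>  list(fit = fit_b, indices = S)<br>})<br><br># each element now yields its own weights w^S(x) via predict(fit_b, newdata = x)$weights</pre><p>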
Note that, as in bootstrapping, we now have two sources of randomness, even disregarding the randomness of the forest (in theory, we assume <em>B</em> to be so large that the randomness of the forest(s) is negligible). One source comes from the data themselves, and the other is an artificial source of randomness that we introduce when choosing <em>S</em> at random. Crucially, the randomness from <em>S</em>, given the data, is under our control — we can draw as many subsets <em>S</em> as we want. So the question is: what happens to our estimator in (2) if we only consider the randomness of <em>S</em> and fix the data? Remarkably, we can show that</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/549/1*xTcFqBFixMp0oE_KZL_B8A.png" /></figure><p>This just means that if we fix the randomness of the data and only consider the randomness from <em>S</em>, the estimator (2) minus the estimator in (1) will converge in distribution to the same limit as the original estimator minus the truth! This is actually how bootstrap theory works: We have shown that something we can sample from, namely</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/284/1*vDdPW4PJkc4XjpGa6tDqlw.png" /></figure><p>converges to the same limit as what we cannot access, namely</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/257/1*8lK6EYO57fEeSVVv_vrYpg.png" /></figure><p>So to make inference about the latter, we can use the former! This is actually the standard argument people make in bootstrap theory to justify why the bootstrap can be used to approximate the sampling distribution! That’s right: even the bootstrap, a technique that people often use in small samples, only really makes sense (theoretically) in a large-sample regime.</p><p>Let’s use this now.</p><h3>What does this actually mean?</h3><p>We now show what this means in practice.
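</p><p>For instance, the same <em>B</em> draws give a confidence interval for essentially any other target, say the conditional median. The following sketch (not part of the original code) mirrors Examples 1 and 2 above and uses <strong>wtd.quantile</strong> from the Hmisc package loaded earlier:</p><pre># Sketch: conditional median at x with a CI from the B half-sample draws<br># (analogous to Examples 1 and 2 above)<br>condmedest &lt;- wtd.quantile(x = as.numeric(Y), weights = weights, normwt = TRUE, probs = 0.5)<br><br>distofcondmedest &lt;- apply(weightsb, 1, function(wb) {<br>  wtd.quantile(x = as.numeric(Y), weights = wb, normwt = TRUE, probs = 0.5)<br>})<br><br>varest &lt;- var(distofcondmedest - condmedest)<br><br># normal-approximation CI, as before<br>c(condmedest - qnorm(1 - alpha/2)*sqrt(varest),<br>  condmedest,<br>  condmedest + qnorm(1 - alpha/2)*sqrt(varest))</pre><p>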
In the following, we define a new function derived from the drf function of the CRAN package <a href="https://cran.r-project.org/web/packages/drf/index.html">drf</a>.</p><pre><br><br>Witdrf&lt;- function(DRF, x, groupingvar, alpha=0.05, ...){<br>  <br>  ### Function to calculate the conditional witness function with<br>  ### confidence bands from DRF<br>  ### DRF: DRF object<br>  ### x: Testpoint<br>  <br>  if (is.null(dim(x)) ){<br>    <br>    stop(&quot;x needs to have dim(x) &gt; 0&quot;)<br>  }<br>  <br>  ntest &lt;- nrow(x)<br>  n &lt;- nrow(DRF$Y)<br>  coln&lt;-colnames(DRF$Y.orig)<br>  <br>  drfpred&lt;- predict(DRF, newdata=x, estimate.uncertainty = TRUE)<br>  weightsall&lt;-drfpred$weights[1,]<br>  weightsb&lt;-drfpred$weights.uncertainty[[1]]<br>  <br>  #weightsall0&lt;-weightsall[, DRF$Y[, groupingvar]==0, drop=F]<br>  #weightsall1&lt;-weightsall[,DRF$Y[, groupingvar]==1, drop=F]<br>  <br>  <br>  # Get the weights of the respective classes (need to standardize by propensity!)<br>  weightsall0&lt;-weightsall*(DRF$Y.orig[, groupingvar]==0)/sum(weightsall*(DRF$Y.orig[, groupingvar]==0))<br>  weightsall1&lt;-weightsall*(DRF$Y.orig[, groupingvar]==1)/sum(weightsall*(DRF$Y.orig[, groupingvar]==1))<br>  <br>  <br>  bandwidth_Y &lt;- drf:::medianHeuristic(DRF$Y.orig)<br>  k_Y &lt;- rbfdot(sigma = bandwidth_Y)<br>  <br>  K&lt;-kernelMatrix(k_Y, DRF$Y.orig[,coln[coln!=groupingvar]], y = DRF$Y.orig[,coln[coln!=groupingvar]])<br>  <br>  <br>  nulldist &lt;- apply(weightsb,1, function(wb){<br>    # iterate over class 1<br>    <br>    wb0&lt;-wb*(DRF$Y[, groupingvar]==0)/sum(wb*(DRF$Y[, groupingvar]==0))<br>    wb1&lt;-wb*(DRF$Y[, groupingvar]==1)/sum(wb*(DRF$Y[, groupingvar]==1))<br>    <br>    <br>    diag( ( wb0-weightsall0 - (wb1-weightsall1) )%*%K%*%( wb0-weightsall0 - (wb1-weightsall1) )  )<br>    <br>    <br>  })<br>  <br>  # Choose the right quantile<br>  c&lt;-quantile(nulldist, 1-alpha)<br>  <br>  <br>  return(list(c=c, k_Y=k_Y, Y=DRF$Y[,coln[coln!=groupingvar]], nulldist=nulldist, weightsall0=weightsall0, weightsall1=weightsall1))<br>  <br>  <br>  <br>}<br></pre><p>So from our method, we not only get the point estimate in form of weights <em>w_i(</em><strong><em>x</em></strong><em>)</em>, but a sample of <em>B</em> weights, each representing an independent draw from the distribution of the estimator of the conditional distribution (that sounds more confusing than it should be, please keep the examples in mind). This just means we are not only having an estimator, but also an approximation to its distribution!</p><p>I now turn to a more interesting example of something we can only do with DRF (as far as I know).</p><h3><strong>Causal Analysis Example: Witness Function</strong></h3><p>Let’s assume we have two sets of observations, say group <em>W=1</em> and group <em>W=0</em> and we want to find the causal relationship between the group belonging and a variable <em>Y</em>. In the example of <a href="https://medium.com/@jeffrey_85949/studying-the-gender-wage-gap-in-the-us-using-distributional-random-forests-ec4c2a69abf0"><em>this article</em></a>, the two groups would be male and female and <em>Y</em> would be the hourly wage. In addition, we have confounders <strong><em>X</em></strong>, which we assume affect both <em>W</em> and <em>Y</em>. We assume here that <strong><em>X</em></strong> really includes all relevant confounders. This is a BIG assumption. 
Formally, we assume unconfoundedness:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/311/1*4ezqKLMyI9Ei1RlSiZS6Qw.png" /></figure><p>and overlap:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/381/1*wATpSK4y17FQUXH6Xahh-w.png" /></figure><p>Often people then compare the conditional expectation between the two groups:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/582/1*YY2AhchoTzyyG2cb9qgFbA.png" /></figure><p>This is the Conditional Average Treatment Effect (CATE) at <strong><em>x</em></strong>. This is a natural first starting point, but in a recent paper ([1]), the CoDiTE was introduced as a generalization of this idea. Instead of just looking at the difference in expected values the CoDiTE proposes to look at differences in other quantities as well. A particularly interesting example of this idea is the <em>conditional witness function: </em>For both groups, we take as above</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/533/1*sLpy9geMLq3P7siw2zdmLA.png" /></figure><p>So we consider the representation of the two conditional distributions in the RKHS. In addition to being representations of the conditional distributions, these quantities are also real-valued functions: For <em>j=0,1</em>,</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/993/1*K3wmpzqZ6AnQym1Dsu32Fw.png" /></figure><p>The function that gives the difference between those two quantities,</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/475/1*oMeQgh4IVrpb9rdX9UXrpQ.png" /></figure><p>is called the <em>conditional witness function</em>.</p><p>Why is this function interesting? It turns out that this function shows how the two densities behave in relation to each other: For values of <em>y</em> for which the function is negative, the conditional density of class 1 at <em>y</em> is smaller than the conditional density of 0. Similarly, if the function is positive at <em>y</em>, it means the density of 1 is higher at <em>y</em> than the conditional density of 0 (whereby “conditional” always refers to conditioning on <strong><em>X</em></strong><em>=</em><strong><em>x</em></strong>). Crucially, this can be done <em>without having to estimate the densities</em>, which is hard, especially for multivariate <strong><em>Y</em></strong>.</p><p>Finally, we can provide <em>uniform confidence bands </em>for our estimated conditional witness functions, by using the <em>B</em> samples from above. I do not go into details here, but these are essentially the analog to the confidence intervals for the conditional mean we used above. Crucially, these bands should be valid uniformly over the function values <em>y</em>, for one specific <strong><em>x</em></strong>.</p><blockquote>We note that we mostly showcase the witness function here as an example. If interest centers on this causal application and the witness function, a more powerful approach is available that uses only one forest in the <a href="https://arxiv.org/abs/2411.08778">CausalDRF</a>.</blockquote><p>Let’s illustrate this with an example: We simulate the following data-generating process:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*oO5cB4ALRCsZaxJbUC5Peg.png" /></figure><p>That is, <em>X_1, X_2</em> are independently uniformly distributed on (0,1), <em>W</em> is either 0 or 1, with a probability depending on <em>X_2</em> and <em>Y</em> is a function of <em>W</em> and <em>X_1</em>. 
This is a really hard problem; not only does <strong><em>X</em></strong> influence the probability of belonging to class 1 (i.e. the propensity), it also changes the treatment effect of <em>W</em> on <em>Y</em>. In fact, a small calculation shows that the CATE is given as:</p><p>(1 - 0.2)*X_1 - (0 - 0.2)*X_1 = X_1.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/741/1*_sv7wfzGPBnl7_TYXp1niQ.png" /><figcaption>Graph corresponding to the data-generating process</figcaption></figure><pre>set.seed(2)<br><br>n&lt;-5000<br>p&lt;-2<br><br>X&lt;-matrix(runif(n*p), ncol=p)<br>W&lt;-rbinom(n,size=1, prob= exp(-X[,2])/(1+exp(-X[,2])))<br><br>Y&lt;-(W-0.2)*X[,1] + rnorm(n)<br>Y&lt;-matrix(Y,ncol=1)<br><br></pre><p>We now randomly choose a test point <strong><em>x</em></strong> and use the following code to estimate the witness function plus confidence band:</p><pre><br>x&lt;-matrix(runif(1*p), ncol=2)<br>Yall&lt;-cbind(Y,W)<br>## For the current version of the Witdrf function, we need to give<br>## colnames to Yall<br>colnames(Yall) &lt;- c(&quot;Y&quot;, &quot;W&quot;)<br><br>## Fit the new DRF framework<br>drf_fit &lt;- drf(X=X, Y=Yall, min.node.size = 5, num.trees=4000, ci.group.size=4000/40)<br><br>Witobj&lt;-Witdrf(drf_fit, x=x, groupingvar=&quot;W&quot;, alpha=0.05)<br><br>hatmun&lt;-function(y,Witobj){<br>  <br>  c&lt;-Witobj$c<br>  k_Y&lt;-Witobj$k_Y<br>  Y&lt;-Witobj$Y<br>  weightsall1&lt;-Witobj$weightsall1<br>  weightsall0&lt;-Witobj$weightsall0<br>  Ky=t(kernelMatrix(k_Y, Y , y = y))<br>  <br>  out&lt;-list()<br>  out$val &lt;- (Ky%*%weightsall1 - Ky%*%weightsall0)<br>  out$upper&lt;-  out$val+sqrt(c)<br>  out$lower&lt;-  out$val-sqrt(c)<br>  <br>  return( out )<br>  <br>  <br>  <br>}<br><br>all&lt;-hatmun(sort(Witobj$Y),Witobj)<br><br>plot(sort(Witobj$Y),all$val , type=&quot;l&quot;, col=&quot;darkblue&quot;, lwd=2, ylim=c(min(all$lower), max(all$upper)),<br>     xlab=&quot;y&quot;, ylab=&quot;witness function&quot;)<br>lines(sort(Witobj$Y),all$upper , type=&quot;l&quot;, col=&quot;darkgreen&quot;, lwd=2 )<br>lines(sort(Witobj$Y),all$lower , type=&quot;l&quot;, col=&quot;darkgreen&quot;, lwd=2 )<br>abline(h=0)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*UEImFETznSp9xXuNgipCGQ.png" /></figure><p>We can read from this plot that:</p><p>(1) The conditional density of group 1 is <em>lower </em>than the density of group 0 for values of <em>y</em> between -3 and 0.3. Moreover, this difference gets larger the larger <em>y</em> is until about <em>y = -1</em>, after which point the difference in densities starts to decrease again until the two densities are the same at around 0.3.</p><p>(2) Symmetrically, the density of group 1 is higher than the density of group 0 for values of <em>y</em> between 0.3 and 3 and this difference gets larger until it reaches a maximum at about <em>y = 1.5</em>. After this point, the difference decreases until it is almost zero again at <em>y = 3</em>.</p><p>(3) The difference between the two densities is statistically significant at the 95% percent level, as can be seen from the fact that for <em>y</em> approximately between -1.5 and -0.5 and between 1 and 2, the asymptotic confidence bands do not include the zero line.</p><p>Let’s check (1) and (2) for the simulated true conditional densities. 
That is, we simulate the truth a great number of times:</p><pre># Simulate truth for a large number of samples ntest<br>ntest&lt;-10000<br>Xtest&lt;-matrix(runif(ntest*p), ncol=2)<br><br>Y1&lt;-(1-0.2)*Xtest[,1] + rnorm(ntest)<br>Y0&lt;-(0-0.2)*Xtest[,1] + rnorm(ntest)<br><br><br>## Plot the test data without adjustment<br>plotdf = data.frame(Y=c(Y1,Y0), W=c(rep(1,ntest),rep(0,ntest) ))<br>plotdf$weight=1<br>plotdf$plotweight[plotdf$W==0] = plotdf$weight[plotdf$W==0]/sum(plotdf$weight[plotdf$W==0])<br>plotdf$plotweight[plotdf$W==1] = plotdf$weight[plotdf$W==1]/sum(plotdf$weight[plotdf$W==1])<br><br>plotdf$W &lt;- factor(plotdf$W)<br><br>#plot pooled data<br>ggplot(plotdf, aes(Y)) +<br>  geom_density(adjust=2.5, alpha = 0.3, show.legend=TRUE,  aes(fill=W, weight=plotweight)) +<br>  theme_light()+<br>  scale_fill_discrete(name = &quot;Group&quot;, labels = c(&#39;0&#39;, &quot;1&quot;))+<br>  theme(legend.position = c(0.83, 0.66),<br>        legend.text=element_text(size=18),<br>        legend.title=element_text(size=20),<br>        legend.background = element_rect(fill=alpha(&#39;white&#39;, 0.5)),<br>        axis.text.x = element_text(size=14),<br>        axis.text.y = element_text(size=14),<br>        axis.title.x = element_text(size=19),<br>        axis.title.y = element_text(size=19))+<br>  labs(x=&#39;y&#39;)<br><br></pre><p>This leads to:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*mcbBMV_xMwImZgx04-MbOA.png" /></figure><p>It is a bit hard to compare visually, but we see that the two densities behave quite close to what the witness function above predicted. In particular, we see that the densities are about the same around 0.3 and the difference in densities appears to be maximal approximately around -1 and 1.5. Thus both points (1) and (2) can be seen in the actual densities!</p><p>Moreover, to get (3) into context, a repeated simulation in the paper shows how the estimated witness function tends to look when no effect is visible:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/977/1*lqBfz_ngTZ-Ut6ZzhFWxkw.png" /><figcaption>Simulation of a 1000 witness functions in a similar setting as described here. In blue are the 1000 estimated witness functions, while in grey one can see the corresponding confidence bands. Taken from our paper on arXiv. There is no effect in this example, and 99% of CIs do not contain the zero line.</figcaption></figure><p>A real data example in Causal Inference is given in <a href="https://medium.com/p/ec4c2a69abf0/edit">this article</a>. The CausalDRF that implements all of this in one forest can be explored <a href="https://arxiv.org/abs/2411.08778">here</a>.</p><h4>Conclusion</h4><p>In this article, I discussed the new inferential tools available for Distributional Random Forests. I also looked at an important application of these new capabilities; estimating the conditional witness function with uniform confidence bands.</p><p>However, I also want to offer a few words of warning:</p><ol><li>The results are only valid for a given test point <strong><em>x</em></strong></li><li>The results are only valid asymptotically</li></ol><p>The first point is actually not so bad, in simulations, the asymptotic normality often also holds over a range of <strong>x</strong>.<em> Just be careful with test points that are close to the boundary of your sample! 
</em>Intuitively, DRF (and all other nearest neighborhood methods) need many sample points around the test point <strong><em>x</em></strong> to estimate the response for <strong><em>x</em></strong>. So if the covariates <em>X</em> in your training set are standard normal, with most points between -2 and 2, then predicting an <em>x</em> in [-1,1] should be no problem. But if your <em>x</em> reaches -2 or 2, performance starts to deteriorate fast.</p><blockquote>Random Forests (and nearest neighbourhood methods in general) are not good at predicting for points that only have a few neighbours in the training set, such as points at the boundary of the support of <strong><em>X</em></strong>.</blockquote><p>The second point is also quite important. Asymptotic results have fallen somewhat out of fashion in contemporary research, in favor of finite sample results that in turn require assumptions such as “sub-Gaussianity”. Still, asymptotic results provide extremely powerful approximations in complicated settings like these. And in fact, this approximation is pretty accurate for many targets for more than 1000 or 2000 data points (maybe you have 92% coverage instead of 95% for your conditional mean/quantile). However, the witness function we introduced is a complicated object, and thus the more data points you have to estimate the uncertainty bands around it, the better!</p><h4>Citations</h4><p>[1] Junhyung Park, Uri Shalit, Bernhard Schölkopf, and Krikamol Muandet. “Conditional distributional treatment effect with kernel conditional mean embeddings and U-statistic regression.” In Proceedings of 38th International Conference on Machine Learning (ICML) , volume 139, pages 8401–8412. PMLR, July 2021.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=64610bbb3927" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/inference-for-distributional-random-forests-64610bbb3927">Inference for Distributional Random Forests</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Deep Dive into HPLBs for A/B testing using Random Forest]]></title>
            <link>https://medium.com/data-science/deep-dive-into-hplbs-for-a-b-testing-using-random-forest-11f0fdd73044?source=rss-ca780798011a------2</link>
            <guid isPermaLink="false">https://medium.com/p/11f0fdd73044</guid>
            <category><![CDATA[a-b-testing]]></category>
            <category><![CDATA[p-value]]></category>
            <category><![CDATA[random-forest]]></category>
            <category><![CDATA[deep-dives]]></category>
            <dc:creator><![CDATA[Jeffrey Näf]]></dc:creator>
            <pubDate>Mon, 12 Dec 2022 15:12:05 GMT</pubDate>
            <atom:updated>2022-12-12T15:12:05.709Z</atom:updated>
            <content:encoded><![CDATA[<h3>Deep Dive into HPLBs for A/B Testing using Random Forest</h3><h4>An Alternative to <em>p</em>-values in Testing</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*wZ0PS82oT_ct4FQV" /><figcaption>Photo by <a href="https://unsplash.com/@vonshnauzer?utm_source=medium&amp;utm_medium=referral">Egor Myznik</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>In a <a href="https://medium.com/towards-data-science/an-alternative-to-p-values-in-a-b-testing-44f1406d3f91">recent article</a>, we introduced the concept of a high probability lower bound (HPLB) of the TV distance, based on our article on arXiv:</p><p><a href="https://arxiv.org/abs/2005.06006">https://arxiv.org/abs/2005.06006</a>,</p><p>joint work with <a href="https://medium.com/u/f562dceaeb63">Loris Michel</a>.</p><p>In the current article, we dive into a detailed treatment of this topic and on the way touch upon some very useful (and beautiful) statistical concepts. In particular, we will need to draw balls without replacement from an urn with <em>m </em>balls and <em>n</em> squares. Many thanks to <a href="https://medium.com/u/dba4029558b7">Maybritt Schillinger</a> for a wealth of constructive comments!</p><h4><strong>Outline</strong></h4><p>For some time now it has been known that powerful classifiers (such as the Random Forest classifier) can be used in two-sample or A/B testing, as explained <a href="https://medium.com/towards-data-science/an-alternative-to-p-values-in-a-b-testing-44f1406d3f91">here</a>: We observe two independent groups of samples, one coming from a distribution <em>P</em> (e.g., blood pressure before treatment) and one coming from a distribution <em>Q</em> (e.g., blood pressure of an independent group of people, after treatment) and we want to test, <em>H_0: P=Q</em>. Given these two sets of data, we give one a label of 0, and the other a label of 1, train a classifier and then evaluate this classifier on some independent data. Then it seems intuitive that, the better the classifier can differentiate the two groups, the more evidence against the Null there is. This can be made formal, leading to a valid p-value and to a rejection decision when the p-value is smaller than a prespecified <em>alpha.</em></p><p>This is nice because today classifiers are powerful and thus this approach leads to powerful two-sample tests that can potentially detect any difference between <em>P</em> and <em>Q</em>. On the other hand, we all heard about the problems with p-values and classical testing. In particular, a significant p-value does not tell one <em>how different P and Q are </em>( this is related to <em>effect size</em> in medicine). The picture below illustrates an example where <em>P</em> and <em>Q</em> get progressively more different. In each case, even a strong two-sample test would simply only give a binary rejection decision.</p><p>So it would be more interesting if we could somehow meaningfully calculate <em>how different P is from Q</em>, ideally still using a powerful classifier in the process. Here we construct a meaningful method based on an estimate of the TV distance between <em>P</em> and <em>Q</em>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/proxy/1*lWQpN3DBqyk_o8zyy0rdOg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*h_tUwRru7K-qLLpD7Kb15A.png" /><figcaption>P (in red) and Q(in blue), get progressively more different. 
A test, no matter how powerful, just rejects without giving additional information. Source: author.</figcaption></figure><p>In the following, we assume we observe an i.i.d. sample <em>X_1, …, X_m</em> from <em>P</em> and an independent i.i.d. sample <em>Y_1, …, Y_n</em> from <em>Q</em>. We then use the probability estimate of a classifier (e.g., probability of belonging to class 1) as a “projection” that takes the vectors of data and maps them onto the real line as probability estimates. Building the univariate order statistic with these values and finding a connection to <em>TV(P,Q)</em>, we will then be able to construct our lower bound. In the following, we will also sometimes just write <em>lambda </em>for <em>TV(P,Q)</em>.</p><h4>Using a (powerful) Classifier to get a Univariate Problem</h4><blockquote>Concepts in this section: Drawing circles without replacement from an urn, with m circles and n squares: using the hypergeometric distribution for two-sample testing.</blockquote><p>In general, the samples from <em>P</em>, <em>Q</em> are <em>d</em>-dimensional random vectors. Here the classifier already comes into play: most classifiers can take these <em>m+n</em> sample points and transform them into a sequence of real numbers, namely the predicted probability that observation <em>i </em>has label 1. Thus, we can just focus on the real numbers between 0 and 1 to construct our estimator. Of course, what we then really lower bound is the TV distance of the probability estimates. It is thus important that the classifier is strong, to make sure we do not lose too much information.</p><p>So, let’s assume we have a sample of <em>N=m+n</em> real numbers, and we know for each of those whether the original observation comes from <em>P</em> or <em>Q</em>. Then we can build this magical thing called the <em>order statistic. </em>That is, we take the <em>N</em> numbers and order them from smallest to largest. To illustrate this, let’s represent samples from <em>P</em> as circles and samples from <em>Q</em> as squares. Then the order statistics may look like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*R8FGTm2zyav6P2mUUXRo0g.png" /></figure><p>Now here is the important point: <em>The classifier tries to estimate the probability of an observation being in class 1 (that is, coming from Q, being a square) as accurately as possible. </em>Thus if the classifier is good, we should expect to see more squares on the right, because the estimated probability should be larger for squares than for circles! Thus, the order statistic is like a <em>centrifuge</em>: If there is a discernible difference between <em>P</em> and <em>Q</em>, and the classifier is able to detect it, the order statistics of the probabilities push the circles to the left and the squares to the right. Since there is still randomness and estimation error, this will not look perfect in general. However, we want it to be “sufficiently different from randomness”.</p><p>One very elegant way to quantify this is the statistic we call <em>V_z, </em>the number of circles below <em>z</em>. This statistic has been used for (univariate) testing for a long time. That is, at any point <em>z=1,…, N</em>, we simply count how many circles we have below z:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*OUnXbfQuPmJHYE5ypH7JKg.png" /></figure><p>What should we expect if <em>P</em> and <em>Q</em> are the same? In this case, we simply have <em>N</em> i.i.d. draws from a single distribution.
Thus there should be just a random arrangement of circles and squares in the order statistic, with no pattern. In this case, <em>V_z</em> is actually drawing, with uniform probability, circles without replacement from an urn with m circles and n squares. Thus this connects back to the very basics of probability. Mathematically:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/708/1*QOz37A4V6p6bHuR1YPe2gA.png" /></figure><p>see e.g., <a href="https://towardsdatascience.com/hypergeometric-distribution-explained-with-python-2c80bc613bf4">this nice article</a>.</p><blockquote>Under H_0: P=Q, <em>V_z </em>is hypergeometric: It is the number of times you draw a circle if you draw <em>z</em> times <em>without replacement </em>from an urn with m circles and n squares.</blockquote><p>What is cool, is that we are now even able to find a function in <em>z</em> , <em>q(z, alpha), </em>for any<em> alpha, </em>such that when <em>P=Q</em>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/706/1*-YH12HzlYmV8-wb4w1Ibyw.png" /></figure><p>Finding this <em>q(z, alpha) </em>can be done by using asymptotic theory (see e.g. <a href="https://arxiv.org/abs/2005.06006">our paper</a> and the references therein) or simply by simulation. The main point is that we know the distribution of <em>V_z</em> and it is always the same, no matter what <em>P</em> and <em>Q</em> exactly are. So even if we don’t have a closed-form distribution for the maximum, we can still approximate <em>q(z,alpha)</em> quite readily. This can be directly used for a (univariate) two-sample test! If <em>max_z V_z-q(z,alpha)</em> overshoots zero, we can reject that <em>P=Q</em>.</p><p>Ok so this is quite nice, but the whole point of this article is that we want to get away from simple rejection decisions, and instead get a lower bound for the Total Variation Distance. Unfortunately, under a general alternative where <em>P</em> and <em>Q</em> are different (i.e., <em>TV(P,Q) &gt; 0</em>), the distribution of <em>V_z</em> is no longer known! The goal will now be to find another process <em>B_z</em> that is easier to analyze and such that for all <em>z=1,…,N,</em> <em>B_z ≥ V_z</em>. If we can bound this process correctly, then the bound also holds for <em>V_z</em>.</p><h4>Playing around with TV(P,Q)</h4><blockquote>Concepts in this section: Using the sampling interpretation of TV to introduce the concept of Distributional Witnesses and using this to identify an area where P=Q holds, even if P is not equal to Q in general.</blockquote><p>By finding <em>q(z,alpha)</em> in the last section, we essentially found a first step of the construction of a lower bound for the case when <em>TV(P,Q)=0</em>, i.e. if there is no difference between <em>P</em> and <em>Q</em>. We now extend this by connecting with our first article and playing around with the definition of the TV distance between <em>P</em> and <em>Q</em>. Generally, this is given as</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/692/1*lWQpN3DBqyk_o8zyy0rdOg.png" /></figure><p>Thus we look for the set <em>A</em>, out of all possible sets, such that the difference between <em>P(A)</em> and <em>Q(A)</em> is largest. Let’s make this more specific: Let <em>p</em> and <em>q</em> be the densities of <em>P</em> and <em>Q</em> (as a technicality, the data does not need to be continuous in the usual sense, we can <em>always </em>do that in this case). 
Then the maximal <em>A</em> is given as</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/490/1*JozqYWCIdG9hSV0jWNHxOg.png" /></figure><p>so that</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*02PumxRLvDsWTyWSLsDDrg.png" /></figure><p>Let in the following <em>X</em> have distribution <em>P</em> and <em>Y</em> distribution <em>Q</em>. Now here comes the crucial part: We can use this to define a new density</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/841/1*STYrq8kC6gT0fQhbX83nmw.png" /></figure><p>Then this is a valid density (integrates to 1) and we can define similarly a density <em>q_+</em>. What this means is best seen graphically:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/956/1*WO6qHgK_lkMlLfDLG-k0dw.png" /><figcaption>Illustration of the TV concept. Left: the two original densities p and q, with p_+ in red, h in blue and q_+ in green all unstandardized. Right: Densities p_+ in red, h in blue and q_+ in green. Source:author</figcaption></figure><p>The picture shows that the densities <em>p</em>, <em>q</em> can be split up into the densities <em>p_+</em>, <em>q_+,</em> and some middle part, that corresponds to the minimum value of both densities and integrates exactly to 1 if we standardize it with <em>1-TV(P,Q)</em>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/838/1*htIQqNXghY-PRzwRcEa0sw.png" /></figure><p>Instead of seeing <em>X</em> as a draw from <em>P</em> and <em>Y</em> as a draw from <em>Q</em> directly, we can now see <em>X</em> as drawn from the mixture</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/749/1*0VpQ7Bo4EYRz94o-7z_1Vg.png" /></figure><p>What this means is the following; before we draw <em>X</em>, we flip a coin wherewith probability <em>TV(P,Q)</em>, we draw <em>X</em> from the red density <em>p_+</em> and with probability (<em>1-TV(P,Q)</em>) we instead draw it from <em>h</em>. For the distribution of <em>X</em> it doesn’t matter how we look at it, in the end, <em>X</em> will have distribution <em>P.</em> But clearly, it appears interesting whether <em>X </em>actually came from <em>p_+</em> or from <em>h</em>, because the former corresponds to the ‘’unique’’ part of p. Indeed looking at either the graphic or the densities themselves, we see that <em>p_+ </em>and<em> q_+ </em>are <em>disjoint. </em>So <em>X</em> either comes from <em>p_+</em> or from <em>h</em> and similarly, <em>Y</em> is either drawn from <em>q_+</em> or from <em>h</em>. Crucially, if both <em>Y</em> and <em>X</em> are drawn from the density <em>h</em>, they obviously come from the same distribution and there is no way to differentiate them, <em>it is as if we were under the Null.</em></p><p>So, for the i.i.d. observations <em>X_1, …, X_m</em> and <em>Y_1,…,Y_n,</em> each observation is either drawn from the specific part (<em>p_+</em> or <em>q_+</em>) or from the joint part <em>h</em>. We call observations that are drawn from the specific part <em>p_+(q_+)</em> <em>witnesses for P (Q).</em></p><blockquote>Observations drawn from the specific part p_+ are called witnesses (for P). 
Observations drawn from h cannot be differentiated, so this corresponds to the part where P and Q are the same.</blockquote><p>OK, so if we go back to the order statistics, we can now think of it like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*T9GS0pQhPaKr_0yWDacbgA.png" /></figure><p>Circles with blue crosses correspond to witnesses from <em>P</em>, while squares with blue crosses correspond to witnesses from <em>Q</em>. Basically, from the crossed-out observations we can learn something about the difference of <em>P, Q</em>, while the observations without crosses are basically from the Null. In a sense all of this is just a thought experiment — we have no way of knowing whether <em>X_i</em> is drawn from <em>p_+</em> or <em>h</em>. So we don’t know which points are witnesses; in fact, we don’t even know how many there are. Nonetheless, this thought experiment will be helpful to construct our estimator.</p><p>In the following we will propose a <em>candidate </em>for <em>TV(P,Q), </em>say <em>lambda_c</em>, and then check whether this candidate fits a condition. If it does, we choose a new candidate <em>lambda_c</em> that is higher than the old one and check the condition again. We do that until <em>lambda_c</em> violates the condition.</p><h4>Let’s do some cleaning</h4><blockquote>Concepts in this section: Bounding the number of witnesses with high probability and using an intuitive “cleaning operation” to get a better-behaved process B_z that is always larger than or equal to V_z.</blockquote><p>We now want to use this idea that some points are witnesses and others come from the part where <em>P=Q</em> for the statistic <em>V_z. </em>As mentioned above, we don’t really know which points are witnesses! That would be OK, since what we need is actually just the number of witnesses; unfortunately, we don’t even know that either. However, we can find an <em>upper bound</em> for this number.</p><p>Recall that we assume <em>X_1,…, X_m </em>were sampled<em> </em>by drawing each with probability TV(P,Q) from <em>p_+</em> and with probability <em>1-TV(P,Q)</em> from <em>h</em>. So the number of witnesses in <em>m</em> observations, denoted <em>w_p</em>, actually follows a Binomial distribution with success probability <em>TV(P,Q)</em>. So if we have a candidate <em>lambda_c</em>, which we suspect should be the true TV distance, the number of witnesses should follow the distribution</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/493/1*AYO4_cUCQjEK6h1rg1Jqww.png" /></figure><p>This is still random, and thus we don’t know the exact outcome for a given sample. But since we know the distribution, we can find a high quantile <em>W_p</em>, such that <em>w_p</em> overshoots this quantile with probability less than <em>alpha/3</em>. For instance, for <em>m=10</em> and <em>lambda_c=0.1</em>, this can be found as</p><pre>lambda_c&lt;-0.1<br>m&lt;-10<br>alpha&lt;-0.05<br><br>W_p&lt;-qbinom(p=alpha/3, size=m, prob=lambda_c, lower.tail=F)<br><br><br># test: simulate the number of witnesses and compare to the bound W_p<br>w_p&lt;-rbinom(n=3000, size=m, prob=lambda_c)<br>hist(w_p, main=&quot;&quot;)<br>abline(v=W_p, col=&quot;darkred&quot;)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/730/1*P8-j_RQqqygTD_UATCv0jQ.png" /></figure><p>We can do the same for the witnesses <em>W_q</em>.</p><p>So we have a candidate <em>lambda_c</em> and, based on this candidate, two values <em>W_p</em> and <em>W_q </em>that bound<em> w_p </em>and<em> w_q </em>with high probability.
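</p><p>Since these bounds change with the candidate, it can be convenient to wrap them into small helper functions of <em>lambda_c</em> (just a convenience sketch, not part of the original code):</p><pre># Convenience sketch: the witness bounds as functions of the candidate lambda_c<br>W_p_fun &lt;- function(lambda_c, m, alpha) qbinom(p = alpha/3, size = m, prob = lambda_c, lower.tail = FALSE)<br>W_q_fun &lt;- function(lambda_c, n, alpha) qbinom(p = alpha/3, size = n, prob = lambda_c, lower.tail = FALSE)<br><br>W_p_fun(lambda_c = 0.1, m = 10, alpha = 0.05)  # reproduces W_p from the example above</pre><p>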
In particular, <em>W_p</em> and <em>W_q </em>depend directly on<em> lambda_c, </em>so it would actually be better to<em> </em>write <em>W_p(lambda_c) </em>and <em>W_q(lambda_c),</em> but that would blow up the notation too much.</p><p>To obtain the new process <em>B_z</em>, we first make up new witnesses in the order statistics above:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*cvPDpt9koMlC4wjtGWiFCA.png" /></figure><p>The red squares now denote random points that we designated to be witnesses. This is done so that the number of witnesses matches the upper bounds <em>W_p</em> and <em>W_q</em>. This is ok in our context, as it actually does not change <em>V_z</em>.</p><p>Now we perform our <em>cleaning operation</em>. This will get us from <em>V_z</em> to <em>B_z</em> in a way that guarantees <em>B_z</em> is at least as large as <em>V_z</em>. We go through the order statistics from left to right and from right to left. First, from left to right, every time we see a circle without a cross, we randomly choose a witness from <em>P</em> (circle with a cross) on the right and put it before the empty circle. We do so without changing the order of the squares and circles without crosses, like so:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*i7bvIN9-Zl8za6d-DqCBtw.png" /></figure><p>The first circle was already a witness, so we left it as is. The second circle was a nonwitness, so we randomly moved a witness circle from further up to where it was before. The next thing was a square which was no a witness, so we moved a circle witness from the right before it and so on. The whole idea is simply to move all witnesses from <em>P</em> to the left and all witnesses of <em>Q</em> to the right, without changing the order of the non-witnesses within themselves:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*aOgtjYftfvGCKX9haQu03Q.png" /></figure><p>Now <em>B_z</em> is just counting the number of observations below <em>z=1,…,N</em> that belong to <em>P</em> in this new ordering! Note that for the first set of observations <em>B_z</em> just increases linearly by 1. Then there is a middle part in which B_z behaves like a hypergeometric process:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*1pr-xqYXFFl2W0wpCqLW_A.png" /></figure><p>Finally, the last few observations are only squares, so the value of <em>B_z </em>just reaches<em> m </em>and stays there.</p><pre>## Using the function defineB_z below<br><br>## Define n + m and confidence alpha<br>n&lt;-50<br>m&lt;-100<br>alpha&lt;-0.05<br><br># Define the candidate<br>lambda_c &lt;- 0.4<br><br><br>plot(1:(m+n),defineB_z(m,n,alpha,lambda_c), type=&quot;l&quot;, cex=0.5, col=&quot;darkblue&quot;)<br><br>for (b in (1:100)){<br>  <br>  lines(defineB_z(m,n,alpha,lambda_c), col=&quot;darkblue&quot;)<br>  <br>}<br></pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*it7RCKJGl1nV2iN-kF-mTw.png" /></figure><p>The key is that we moved all the circles more to the left than they were before! This is why <em>B_z </em>is always larger (or the same) than<em> V_z. 
</em>In particular, for <em>lambda=0</em>, we expect <em>W_p=0</em> and thus <em>B_z=V_z.</em></p><pre>library(stats)<br><br>defineB_z &lt;- function(m,n,alpha,lambda_c){<br><br>## Upper bounds on the witness counts for the given m, n, alpha and lambda_c<br><br>W_p&lt;-qbinom(p=alpha/3, size=m, prob=lambda_c, lower.tail=F)<br>W_q&lt;-qbinom(p=alpha/3, size=n, prob=lambda_c, lower.tail=F)<br><br>B_z&lt;-matrix(0, nrow=n+m)<br><br># First part: B_z=z (skipped if there are no witnesses from P, i.e. W_p=0)<br>if (W_p &gt; 0){<br>  B_z[1:W_p,] &lt;- 1:W_p<br>}<br><br># Last part: B_z=m<br>B_z[(m+n-W_q):(m+n),] &lt;- m<br><br># Middle part: Hypergeometric<br>for (z in (W_p+1):(m+n-W_q-1) ){<br>  <br>  B_z[z,]&lt;-rhyper(1, m-W_p, n-W_q, z-W_p)+W_p<br>  <br>}<br><br>return(B_z)<br>}</pre><h4>Putting it all together</h4><blockquote>Concepts in this section: Using the above to get a bound on B_z, leading to a bound on V_z and a subsequent HPLB lambdahat we can use. It is defined through an infimum, and to find it, we need to cycle through several candidates.</blockquote><p>Next, given the <strong>true <em>lambda=TV(P,Q)</em></strong>, we want to find a <em>Q(z,alpha, lambda)</em> that has</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/743/1*8p2wLBo7SioBBgHuBJ3FyA.png" /></figure><p>This will be used to define the final estimator in a second. Crucially, <em>Q(z,alpha, lambda_c)</em> needs to be defined for any <em>lambda_c</em>, but the probability statement only needs to be true at the true candidate <em>lambda_c=lambda</em>. So this is the “candidate” I focus on for the moment.</p><p>From above we know that for the first <em>W_p</em> values, <em>B_z</em> is just linearly increasing. So <em>B_z=z</em> and we can also set <em>Q(z,alpha, lambda)=z</em>, for <em>z=1,…,W_p</em>. Similarly, on the other side, when all m circles are counted, we know <em>B_z=m</em> and thus we can set <em>Q(z,alpha, lambda)=m</em> for all <em>z=m+n-W_q, …, m+n</em>. (Remember that in each case <em>lambda=TV(P,Q)</em> enters through <em>W_p</em> and <em>W_q</em>.)</p><p>What remains is the part in the middle that behaves as if under the Null. This is true for <em>z=W_p+1,…, m+n-W_q</em>. But since here <em>B_z</em> is again hypergeometric, we can use the same <em>q</em> function as above to get</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/989/1*zLJY85_TzHGG4RJoC6avaw.png" /></figure><p>The <em>alpha/3</em> is needed because we can also make mistakes with <em>W_p</em> and <em>W_q</em>: each of these bounds fails to overestimate the true witness count with probability at most <em>alpha/3</em>, so all three error probabilities together stay below <em>alpha</em>.</p><p>So all cases are covered! For <em>z=1,…,W_p</em>, <em>B_z-Q(z,alpha,lambda)=0</em>, and the same holds true for the last <em>W_q</em> values of <em>z</em>. The middle part, finally, is covered by the above equation. But since <em>B_z</em> is larger than <em>V_z</em>, we also have</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/778/1*Alk9VcuZn6FcaWhq6I250A.png" /></figure><p>Using this <em>Q</em> function, we can define our final estimator as</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*NCxccHOji97phAT6HIoH9A.png" /></figure><p>This looks horrible, but all it means is that starting from <em>lambda_c=0</em>, you (1) calculate <em>W_p(lambda_c), W_q(lambda_c)</em>, and thus <em>Q(z,alpha,lambda_c)</em>, and (2) check whether</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/632/1*vi95ZTPnPPlrWlL9s_aU1g.png" /></figure><p>is true. If it is, you can increase <em>lambda_c</em> a little bit and repeat steps (1) and (2). If it is not, you stop and set the estimator to this <em>lambda_c</em>.</p>
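<p>To make this recipe concrete, below is a rough sketch of the candidate search in R. This is <em>not</em> the code of the HPLB package (which implements the exact construction of the paper and is what you should use in practice); in particular, I assume here that the mid-range bound <em>Q(z,alpha,lambda_c)</em> is simply the <em>1-alpha/3</em> hypergeometric quantile, in line with the <em>q</em> function above, and the function and argument names are just illustrative. The vector <em>t</em> contains 0/1 labels (0 for observations from <em>P</em>, 1 for observations from <em>Q</em>) and <em>rho</em> is the ranking variable, e.g. the Random Forest probability estimates used below.</p><pre>## Rough sketch of the candidate search (illustration only, not the HPLB package code)<br>hplb_sketch &lt;- function(t, rho, alpha=0.05, step=0.01){<br>  <br>  ord &lt;- order(rho)<br>  m &lt;- sum(t==0)<br>  n &lt;- sum(t==1)<br>  N &lt;- m+n<br>  # V_z: number of observations from P among the z smallest values of rho<br>  V_z &lt;- cumsum(t[ord]==0)<br>  <br>  for (lambda_c in seq(0, 1, by=step)){<br>    <br>    W_p &lt;- qbinom(p=alpha/3, size=m, prob=lambda_c, lower.tail=F)<br>    W_q &lt;- qbinom(p=alpha/3, size=n, prob=lambda_c, lower.tail=F)<br>    <br>    # Build Q(z, alpha, lambda_c) piece by piece<br>    Q_z &lt;- numeric(N)<br>    for (z in 1:N){<br>      if (z &lt;= W_p){<br>        Q_z[z] &lt;- z<br>      } else if (z &gt;= N-W_q){<br>        Q_z[z] &lt;- m<br>      } else {<br>        Q_z[z] &lt;- W_p + qhyper(1-alpha/3, m-W_p, n-W_q, z-W_p)<br>      }<br>    }<br>    <br>    # Return the smallest candidate for which V_z &lt;= Q_z holds for all z<br>    if (max(V_z - Q_z) &lt;= 0){ return(lambda_c) }<br>  }<br>  return(1)<br>}</pre>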
<p>Mathematically, why does this inf definition of the estimator work? It just means that <em>lambdahat</em> is the smallest <em>lambda_c</em> such that</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/638/1*0jk13__4Zag1Q_l9a61ujw.png" /></figure><p>is true. So if the true <em>lambda (=TV(P,Q))</em> is smaller than that smallest value (our estimator), this condition cannot hold, and instead, the &gt; 0 condition above is true. But we have just seen above that this &gt;0 condition has a probability of occurring ≤ alpha, so we are fine.</p><p>All of this is implemented in the HPLB package on CRAN. The next section presents two examples.</p><p><strong>Some Examples</strong></p><p>Here, we use the estimator derived in the last section on two examples. In the first example, we use a Random Forest-induced estimate of the probability of belonging to class 1, as discussed above. In the second example, we actually use a regression function, showing that one can generalize the concepts discussed here.</p><p>In the first article, we already studied the following example:</p><pre>library(mvtnorm)<br>library(HPLB)<br><br>set.seed(1)<br>n&lt;-2000<br>p&lt;-2<br><br>#Larger delta -&gt; more difference between P and Q<br>#Smaller delta -&gt; Less difference between P and Q<br>delta&lt;-0<br><br># Simulate X~P and Y~Q for given delta<br>U&lt;-runif(n)<br>X&lt;-rmvnorm(n=n, sigma=diag(p))<br>Y&lt;- (U &lt;=delta)*rmvnorm(n=n, mean=rep(2,p), sigma=diag(p))+ (1-(U &lt;=delta))*rmvnorm(n=n, sigma=diag(p))<br><br>plot(Y, cex=0.8, col=&quot;darkblue&quot;)<br>points(X, cex=0.8, col=&quot;red&quot;)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*h_tUwRru7K-qLLpD7Kb15A.png" /></figure><p>In the above simulation, the <em>delta</em> parameter determines how different <em>P</em> and <em>Q</em> are, from <em>delta=0</em>, whereby <em>P=Q</em>, to <em>delta=1</em>, whereby <em>P</em> is a bivariate normal with mean (0,0) and <em>Q</em> is a bivariate normal with mean (2,2). Even a strong two-sample test would simply reject in all of these cases with <em>delta &gt; 0</em>, without telling us how strong the difference actually is. With our method, using Random Forest probability estimates, we get</p><pre>#Estimate HPLB for each case (vary delta and rerun the code)<br>t.train&lt;- c(rep(0,n/2), rep(1,n/2) )<br>xy.train &lt;-rbind(X[1:(n/2),], Y[1:(n/2),])<br>t.test&lt;- c(rep(0,n/2), rep(1,n/2) )<br>xy.test &lt;-rbind(X[(n/2+1):n,], Y[(n/2+1):n,])<br>rf &lt;- ranger::ranger(t~., data.frame(t=t.train,x=xy.train))<br>rho &lt;- predict(rf, data.frame(t=t.test,x=xy.test))$predictions<br>tvhat &lt;- HPLB(t = t.test, rho = rho, estimator.type = &quot;adapt&quot;)<br>tvhat</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/326/1*A0sjCxyIRd5wYF5n5-NOdw.png" /></figure><p>as we would have hoped: The lower bound is zero when the distributions are the same (i.e., the implicit test cannot reject) and progressively increases as <em>P</em> and <em>Q</em> get more different (i.e., as <em>delta</em> increases).</p><p>We can also look at a more general example. Suppose we observe a (more or less) independent sample that, however, has a mean shift in the middle:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ppvDY9FyJHPWkLmLmZEarQ.png" /></figure><p>Let us consider the index of the observations as time t in this example, so that from left to right we move from time <em>t=1</em> to <em>t=1000</em>. The code follows below.
We can now check at each point of interest (for instance at each sample point) how large the TV distance is between <em>P</em>=points on the left and <em>Q</em>=points on the right. This can be done by again using a probability estimate for each <em>t</em>, but to speed things up, we instead use a regression of time <em>t</em> on the observations <em>z_t. </em>That is we check whether the observation value gives us an indication of whether it lies more on the left or right.</p><p>The following picture shows the true TV in red and our HPLB in black:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*QmNIna8_zLDb0zmkECMl6w.png" /></figure><p>It can be seen that the method nicely detects the increase in TV distance when we move from left to right in the graph. It then peaks at the point where the change in distribution happens, indicating that the main change happens there. Of course, as can be seen in the code below, I cheated a bit here; I generated the whole process two times, once for training once for testing. In general in this example, one has to be a bit more careful about how to choose training and test set.</p><p>Importantly, at each point where we calculate the TV distance, we implicitly calculate a two-sample test.</p><pre>library(HPLB)<br>library(ranger)<br>library(distrEx)<br><br><br>n &lt;- 500<br>mean.shift &lt;- 2<br>t.train &lt;- runif(n, 0 ,1)<br>x.train &lt;- ifelse(t.train&gt;0.5, stats::rnorm(n, mean.shift), stats::rnorm(n))<br>rf &lt;- ranger::ranger(t~x, data.frame(t=t.train,x=x.train))<br><br>n &lt;- 500<br>t.test &lt;- runif(n, 0 ,1)<br>x.test &lt;- ifelse(t.test&gt;0.5, stats::rnorm(n, mean.shift), stats::rnorm(n))<br>rho &lt;- predict(rf, data.frame(t=t.test,x=x.test))$predictions<br><br>## out-of-sample<br>tv.oos &lt;- HPLB(t = t.test, rho = rho, s = seq(0.1,0.9,0.1), estimator.type = &quot;adapt&quot;)<br><br><br>## total variation values<br>tv &lt;- c()<br>for (s in seq(0.1,0.9,0.1)) {<br>  <br>  if (s&lt;=0.5) {<br>    <br>    D.left &lt;- Norm(0,1)<br>  } else {<br>    <br>    D.left &lt;- UnivarMixingDistribution(Dlist = list(Norm(0,1),Norm(mean.shift,1)),<br>                                       mixCoeff = c(ifelse(s&lt;=0.5, 1, 0.5/s), ifelse(s&lt;=0.5, 0, (s-0.5)/s)))<br>  }<br>  if (s &lt; 0.5) {<br>    <br>    D.right &lt;- UnivarMixingDistribution(Dlist = list(Norm(0,1),Norm(mean.shift,1)),<br>                                        mixCoeff = c(ifelse(s&lt;=0.5, (0.5-s)/(1-s), 0), ifelse(s&lt;=0.5, (0.5/(1-s)), 1)))<br>  } else {<br>    <br>    D.right &lt;- Norm(mean.shift,1)<br>  }<br>  tv &lt;- c(tv, TotalVarDist(e1 = D.left, e2 = D.right))<br>}<br><br>## plot<br>oldpar &lt;- par(no.readonly =TRUE)<br>par(mfrow=c(2,1))<br>plot(t.test,x.test,pch=19,xlab=&quot;t&quot;,ylab=&quot;x&quot;)<br>plot(seq(0.1,0.9,0.1), tv.oos$tvhat,type=&quot;l&quot;,ylim=c(0,1),xlab=&quot;t&quot;, ylab=&quot;TV&quot;)<br>lines(seq(0.1,0.9,0.1), tv, col=&quot;red&quot;,type=&quot;l&quot;)<br>par(oldpar)<br><br></pre><h4>Conclusion</h4><p>In this article, we took a deep dive into the construction of an HPLB for the TV distance. Of course, what we really lower-bounded was the TV distance on the probability estimates. In fact, the challenge is to find a “projection” or classifier that is powerful enough to still find a signal. 
Algorithms like Random Forest, as we used here, are examples of such powerful methods, that moreover don’t really require any tuning.</p><p>We hope that with the code provided here and on CRAN in the HPLB package, these basic probability considerations we did here might actually be used on some real-world problems.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=11f0fdd73044" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/deep-dive-into-hplbs-for-a-b-testing-using-random-forest-11f0fdd73044">Deep Dive into HPLBs for A/B testing using Random Forest</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[R-NL: Robust Nonlinear Shrinkage]]></title>
            <link>https://medium.com/data-science/high-dimensional-covariance-estimation-when-tails-are-heavy-da61723ce9b3?source=rss-ca780798011a------2</link>
            <guid isPermaLink="false">https://medium.com/p/da61723ce9b3</guid>
            <category><![CDATA[statistics]]></category>
            <category><![CDATA[covariance-matrix]]></category>
            <category><![CDATA[thoughts-and-theory]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[high-dimensional-data]]></category>
            <dc:creator><![CDATA[Jeffrey Näf]]></dc:creator>
            <pubDate>Wed, 02 Nov 2022 17:39:01 GMT</pubDate>
            <atom:updated>2023-08-03T10:18:00.411Z</atom:updated>
            <content:encoded><![CDATA[<h4>High-dimensional covariance estimation when tails are heavy</h4><p>In this article, I discuss a new covariance estimation method from our recent <a href="https://ieeexplore.ieee.org/abstract/document/10109124">paper </a>“R-NL: Covariance Matrix Estimation for Elliptical Distributions based on Nonlinear Shrinkage’’. I introduce the problem we are solving, try to give some intuition on how we are solving it, and briefly present the simple code we developed. On the way, I touch upon some interesting concepts, like the (robust) “Tyler’s estimator”, which I feel are somewhat underused in modern data science, probably because no author apparently ever provided code with their paper (or maybe it is because most of the papers regarding this topic appear to appear in the signal processing community, which is another potential reason they get overlooked by data scientists).</p><p><strong>Nonlinear Shrinkage</strong></p><p><a href="https://towardsdatascience.com/nonlinear-shrinkage-an-introduction-825316dda5b8">Nonlinear shrinkage</a> is a powerful tool in high-dimensional covariance estimation. In the new paper, I and my coauthors introduce an adapted version that tends to produce even better results in heavy-tailed models while keeping the strong results in other cases. To showcase this new approach and the problem it solves, let’s start with an R example. We first load the necessary functions and define the dimensions <em>p</em> and numbers of examples <em>n, </em>as well as the true covariance matrix we want to study.</p><pre># For simulating the data:<br>library(mvtnorm)</pre><pre># The NL/QIS method available on <a href="https://github.com/MikeWolf007/covShrinkage/blob/main/qis.R">https://github.com/MikeWolf007/covShrinkage/blob/main/qis.R</a><br>source(&quot;qis.R&quot;)</pre><pre># Set the seed, n and p<br>set.seed(1)</pre><pre># p quite high relativ to n<br>n&lt;-300<br>p&lt;-200</pre><pre>#Construct the dispersion matrix<br>Sig&lt;-sapply(1:p, function(i) {sapply(1:p, function(j) 0.7^{abs(i-j)} )} )</pre><p>The covariance matrix we defined here corresponds to an AR process. That is, while the observations are independent, the dimensions <em>X_i</em> and <em>X_j</em> are less related the bigger the difference between <em>i</em> and <em>j</em> are in absolute values. This means correlations get exponentially smaller, the farther away one is from the diagonal and there is actually some structure to be learned here.</p><p>First, let’s simulate an i.i.d. sample of random vectors from a Gaussian distribution with the correlation structure defined before:</p><pre>### Multivariate Gaussian case<br>X&lt;-rmvnorm(n = n, sigma = Sig)</pre><p>In the <a href="https://towardsdatascience.com/nonlinear-shrinkage-an-introduction-825316dda5b8">nonlinear shrinkage article<em>,</em></a> I explained what the optimal estimator is if one is only allowed to modify the <em>eigenvalues </em>of the sample covariance matrix, but has to leave the <em>eigenvectors </em>intact (which is what a lot of shrinkage methods do): Let thus <strong><em>u</em></strong><em>_j</em>, <em>j=1,..p</em>, be the eigenvectors of the sample covariance matrix and</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/414/1*g2g8SuAioFIpIuQjDVKdsw.png" /></figure><p>the matrix of eigenvectors. 
Then the optimal values are given by</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/766/1*Gm7X0gdnFwwU3UTsUG67uQ.png" /></figure><p>resulting in the optimal estimator</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/220/1*Oarbjaw495ygtULYQXN_og.png" /></figure><p>Note that the optimal values in (1) are not exactly equal to the true eigenvalues in general, but instead linear combinations of the true eigenvalues, according to the (fixed) sample eigenvectors.</p><p>So let’s calculate the sample covariance matrix and the nonlinear shrinkage matrix and check how close we are to this ideal case:</p><pre>## Sample Covariance Matrix<br>samplespectral&lt;-eigen(cov(X))</pre><pre>## Nonlinear Shrinkage<br>Cov_NL&lt;-qis(X)<br>NLvals&lt;-sort( diag( t(samplespectral$vectors)%*%Cov_NL%*%samplespectral$vectors  ), decreasing=T)</pre><pre>## Optimal: u_j&#39;*Sig*u_j for all j=1,...,p<br>optimalvals&lt;-sort(diag( t(samplespectral$vectors)%*%Sig%*%samplespectral$vectors  ), decreasing=T)</pre><pre>plot(sort(samplespectral$values, decreasing=T), type=&quot;l&quot;, cex=1.5, lwd=2, lty=2, ylab=&quot;Eigenvalues&quot;,)<br>lines(optimalvals, type=&quot;l&quot;, col=&quot;red&quot;, cex=1.5, lwd=2, lty=1)<br>lines(NLvals, type=&quot;l&quot;, col=&quot;green&quot;, cex=1.5, lwd=2, lty=3)<br>title(main=&quot;Multivariate Gaussian&quot;)</pre><pre>legend(200, 8, legend=c(&quot;Sample Eigenvalues&quot;, &quot;Attainable Truth&quot;, &quot;NL&quot;),col=c(&quot;black&quot;, &quot;red&quot;, &quot;green&quot;), lwd=2, lty=c(2,1,3), cex=1.5)</pre><p>which gives the plot:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/850/1*JDBAChvG0VaWIDgMZ1N-Sg.png" /><figcaption>Source: Author</figcaption></figure><p>The plot shows the optimal value on the diagonal of (1), as well as the sample and nonlinear shrinkage estimates. It looks like one would hope, the sample eigenvalues show excess dispersion (too large for large values, too small for small ones), while nonlinear shrinkage is pretty close to the ideal values. This is what we would expect simply because the dimension <em>p=200</em> is quite high compared to the sample size <em>n=300</em>. 
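</p><p>If you prefer numbers to pictures, a quick check (just an illustration, reusing the objects <em>Sig</em>, <em>X</em> and <em>Cov_NL</em> defined above; the exact values depend on the seed) is to compare the estimation errors in Frobenius norm directly:</p><pre># Frobenius-norm errors of the two estimators (illustration only)<br>norm(cov(X) - Sig, type=&quot;F&quot;)    # sample covariance matrix<br>norm(Cov_NL - Sig, type=&quot;F&quot;)    # nonlinear shrinkage, should typically be smaller</pre><p>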
The larger <em>p</em> is chosen relative to <em>n </em>the worse this would look for the sample covariance matrix.</p><p>Now we do the same, but simulating from a multivariate t-distribution with 4 degrees of freedom, a (very) heavy-tailed distribution:</p><pre>### Multivariate t case<br>X&lt;-rmvt(n=n, sigma=Sig, df=4)</pre><pre>## Truth<br>Sig &lt;-4/(4-2)*Sig  ## Need to rescale with a t distribution</pre><pre>## Sample Covariance Matrix<br>samplespectral&lt;-eigen(cov(X))</pre><pre>## Nonlinear Shrinkage<br>Cov_NL&lt;-QIS(X)$Sig<br>NLvals&lt;-sort( diag( t(samplespectral$vectors)%*%Cov_NL%*%samplespectral$vectors  ), decreasing=T)</pre><pre>## Optimal: u_j&#39;*Sig*u_j for all j=1,...,p<br>optimalvals&lt;-sort(diag( t(samplespectral$vectors)%*%Sig%*%samplespectral$vectors  ), decreasing=T)</pre><pre>plot(sort(samplespectral$values, decreasing=T), type=&quot;l&quot;, cex=15, lwd=2, lty=2, ylab=&quot;Eigenvalues&quot;,)<br>lines(optimalvals, type=&quot;l&quot;, col=&quot;red&quot;, cex=1.5, lwd=2, lty=1)<br>lines(NLvals, type=&quot;l&quot;, col=&quot;green&quot;, cex=1.5, lwd=2, lty=3)<br>title(main=&quot;Multivariate t&quot;)</pre><pre>legend(200, 40, legend=c(&quot;Sample Eigenvalues&quot;, &quot;Attainable Truth&quot;, &quot;NL&quot;),col=c(&quot;black&quot;, &quot;red&quot;, &quot;green&quot;), lty=c(2,1,3), cex=1.5, lwd=2)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/850/1*7GBDW1lH4VcgoW0qKLMFdQ.png" /><figcaption>Source:author</figcaption></figure><p>Now, this doesn’t look as good anymore! The nonlinear shrinkage values in green also show some excess dispersion: Large values get way too large, while small values are a bit too small. Apparently, in finite samples, heavy tails can distort nonlinear shrinkage. It would be nice to have a method that displays the same amazing results in the Gaussian case but also keeps good results in heavy-tailed models. This is the motivation behind our new method. I now go into some details, beginning with the key to the approach: elliptical distributions.</p><p><strong>Elliptical distributions</strong></p><p>The class of elliptical distributions includes a reasonable range of different distributions such as multivariate Gaussians, multivariate t, multivariate generalized hyperbolic, and so on. If a random vector <strong>X</strong> follows an elliptical distribution, it can be written as</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/420/1*h1Kkc__FPO3F9UJN3nWMuQ.png" /></figure><p>where <strong>S </strong>is uniformly distributed on the unit sphere in <em>p</em> dimensions and <em>R</em> is some nonnegative random variable independent of <strong>S</strong>. This may sound complicated, but it just means that an elliptical distribution can be reduced to a uniform distribution on a circle (in two dimensions) or on a sphere (in general). Thus these kinds of distributions have a very specific structure. In particular, we need to mention an important point: In the equation above, <strong>H</strong> is called the <em>dispersion matrix</em>, as opposed to the <em>covariance matrix</em>. In this article, we want to estimate the covariance matrix, which, if it exists, is given as</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/456/1*whurbjSlrgh_4f4QopfgzQ.png" /></figure><p>This is already interesting: In elliptical models, <strong>H</strong> exists by assumption but the covariance matrix might not if the expected value of <em>R</em> is not finite! 
For instance, the multivariate t distribution with degrees of freedom smaller than 2 has no covariance matrix. Nonetheless, we could still estimate <strong>H </strong>in this case. So the dispersion matrix is in a sense a more general concept. In this article, however, we will assume the covariance matrix exists, and in this case, we see from the above that dispersion and covariance matrix are the same up to a constant.</p><p>Interestingly, one can then look at <strong>Z</strong>=<strong>X/||X||, </strong>that is the random vector<strong> X </strong>divided by its Euclidean norm<strong> </strong>and this thing <em>will always have the same distribution</em>! In fact, this is just the uniform distribution on the <em>p-dimensional</em> sphere. We can see this in an example for <em>p=2</em>.</p><pre>X&lt;-rmvnorm(n = n, sigma = diag(2))<br>Y&lt;-rmvt(n = n, sigma = diag(2), df=4)</pre><pre># standardize by norm<br>ZGaussian&lt;-t(apply(X,1, function(x) x/sqrt(sum(x^2)) ))<br>Zt &lt;- t(apply(Y,1, function(x) x/sqrt(sum(x^2)) ))</pre><pre>par(mfrow=c(1,2))<br>plot(ZGaussian)<br>plot(Zt)</pre><p>which gives</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*u77c_5rCCHEjo9PqjdYImQ.png" /><figcaption>Source:author</figcaption></figure><h4>(Linear-Shrinkage) Tyler’s Estimator</h4><p>Tyler’s estimator of the dispersion matrix <strong>H</strong>, derived in [1], uses the fact that <strong>Z</strong>=<strong>X/||X|| </strong>always has the same distribution. Using the likelihood of that distribution, one can derive a maximum likelihood estimator (basically just taking derivatives and setting to zero) that looks like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/634/1*ExjMfKfKcYxaAlW1VhoiYg.png" /></figure><p>Note that this only defines <strong>H</strong> implicitly (it is both on the left and on the right), so the natural way to try to get to <strong>H</strong> is iterative:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/700/1*zt3Uj90djvh755KANBSXDQ.png" /></figure><p>where the second step is just a renormalization that is needed for technical reasons. One can show that indeed, this simple iterative scheme will converge to the solution in (2). This estimator of <strong>H </strong>is what is referred to as ‘’Tyler’s estimator’’.</p><blockquote>Tyler’s estimator is an iterative estimate of the dispersion matrix of an i.i.d. sample from an elliptical distribution. It is derived using the fact that an elliptical random vector standardized by its Eucledian norm always has the same distribution.</blockquote><p>Ok, so this is a way of robustifying the covariance or dispersion estimator against heavy tails. But the above Tyler’s estimator only works for <em>p &lt; n, </em>and deteriorates when<em> p </em>gets closer to<em> n</em>, so we still need to robustify against the case when <em>p</em> is close or even larger than <em>n</em>. A lot of papers in the signal processing community simply do this by using linear shrinkage (which I also explained <a href="https://towardsdatascience.com/nonlinear-shrinkage-an-introduction-825316dda5b8">here</a>) in each iteration. This then looks like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/946/1*OXDfRAWjFtuRCp_p8g9DOg.png" /></figure><p>where different choices of <em>rho </em>have inspired different papers. For instance, one of the papers that started this line of research is [2]. 
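</p><p>Since the plain Tyler iteration above is so simple, here is a minimal sketch of it in R. This is my own illustration for the case <em>p &lt; n</em>, without any shrinkage, assuming the data are centered at zero; the trace normalization is just one common choice for the renormalization step:</p><pre># Minimal sketch of the plain Tyler iteration (p &lt; n, no shrinkage, data centered at 0)<br>tyler_sketch &lt;- function(X, max.iter=100, tol=1e-8){<br>  <br>  n &lt;- nrow(X)<br>  p &lt;- ncol(X)<br>  H &lt;- diag(p)                      # start from the identity<br>  <br>  for (k in 1:max.iter){<br>    <br>    Hinv &lt;- solve(H)<br>    # weights 1/(x_i&#39; H^{-1} x_i) for each observation<br>    w &lt;- 1/rowSums((X %*% Hinv) * X)<br>    # (p/n) * sum_i w_i x_i x_i&#39;<br>    H_new &lt;- (p/n) * crossprod(X * sqrt(w))<br>    # renormalize so that the trace equals p (one common convention)<br>    H_new &lt;- H_new / sum(diag(H_new)) * p<br>    <br>    if (max(abs(H_new - H)) &lt; tol){ H &lt;- H_new; break }<br>    H &lt;- H_new<br>  }<br>  H<br>}</pre><p>For centered elliptical samples like the ones simulated earlier, <em>tyler_sketch(X)</em> should recover the shape of the dispersion matrix up to a scaling constant, as long as <em>p</em> is comfortably smaller than <em>n</em>.</p><p>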
We now instead use <em>nonlinear </em>shrinkage together with Tyler’s method.</p><blockquote>One can robustify Tyler’s estimator also for high dimensions, by using linear shrinkage in each iteration. Different papers have found smart adaptive ways to calculate the free parameter rho.</blockquote><p><strong>Robust Nonlinear Shrinkage</strong></p><p>The goal would now be to use the same iterative scheme as above, but instead of using linear shrinkage, iterate with nonlinear shrinkage. Unfortunately, this is not so straightforward. I won’t go into detail in this article, since some trickery is needed to make it work, but instead, I refer to our implementation on <a href="https://github.com/hedigers/RNL_Code">Github</a> and the paper. The implementation we provide also may be quite handy, because none of the above-mentioned papers of the signal processing community appear to give out code or implement their method in a package.</p><p>To showcase the performance and the code, we repeat the multivariate t analysis at the beginning with this new method. We restate the whole procedure above for completeness:</p><pre>### Multivariate t case<br>X&lt;-rmvt(n=n, sigma=Sig, df=4)</pre><pre>## Truth<br>Sig &lt;-4/(4-2)*Sig  ## Need to rescale with a t distribution</pre><pre>## Sample Covariance Matrix<br>samplespectral&lt;-eigen(cov(X))</pre><pre>## Nonlinear Shrinkage<br>Cov_NL&lt;-QIS(X)$Sig<br>NLvals&lt;-sort( diag( t(samplespectral$vectors)%*%Cov_NL%*%samplespectral$vectors  ), decreasing=T)</pre><pre>## R-NL code from <a href="https://github.com/hedigers/RNL_Code">https://github.com/hedigers/RNL_Code</a><br>Cov_RNL&lt;-RNL(X)<br>RNLvals&lt;-sort( diag( t(samplespectral$vectors)%*%Cov_RNL%*%samplespectral$vectors  ), decreasing=T)</pre><pre>## Optimal: u_j&#39;*Sig*u_j for all j=1,...,p<br>optimalvals&lt;-sort(diag( t(samplespectral$vectors)%*%Sig%*%samplespectral$vectors  ), decreasing=T)</pre><pre>plot(sort(samplespectral$values, decreasing=T), type=&quot;l&quot;, cex=1.5, lwd=2, lty=2, ylab=&quot;Eigenvalues&quot;,)<br>lines(optimalvals, type=&quot;l&quot;, col=&quot;red&quot;, cex=1.5, lwd=2, lty=1)<br>lines(NLvals, type=&quot;l&quot;, col=&quot;green&quot;, cex=1.5, lwd=2, lty=3)<br>lines( RNLvals, type=&quot;l&quot;, col=&quot;darkblue&quot;, cex=1.5, lwd=2, lty=3)<br>title(main=&quot;Multivariate t&quot;)</pre><pre>legend(200, 40, legend=c(&quot;Sample Eigenvalues&quot;, &quot;Attainable Truth&quot;, &quot;NL&quot;, &quot;R-NL&quot;),<br>       col=c(&quot;black&quot;, &quot;red&quot;, &quot;green&quot;, &quot;darkblue&quot;), lty=c(2,1,3,4), cex=1.5, lwd=2)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/809/1*SRccWlmnj4pbXn48py8xeA.png" /><figcaption>Source:author</figcaption></figure><p>R-NL (the blue line) is almost perfectly on the red line and thus mirrors the good performance we saw for NL in the Gaussian case! This is exactly what we wanted. Also, though it might be obscure in the code above, using the function is super easy, <em>RNL(X)</em> gives the estimator of the covariance matrix (if you can assume it exists) and <em>RNL(X, cov=F)</em> gives an estimate of <strong>H</strong>.</p><p>Finally, if we were to use the RNL function in the beginning for the Gaussian example, the values would look almost perfectly the same as they do with nonlinear shrinkage. In fact, the figure below shows simulation results from the paper using our two methods R-NL and a slight adaptation, R-C-NL, and a range of competitors. 
The setting is almost exactly as in the code above, with the same dispersion matrix and <em>n=300</em>, <em>p=200. </em>The difference is that we now vary the tail-determining parameter of the multivariate t distribution, from 3 (extremely heavy-tailed) to infinity (Gaussian case) on a grid. We won&#39;t go into details about what the numbers on the y-axis exactly mean, just that larger is better and 100 is the maximal value. The competitors “R-LS” and “R-GMV-LS” are two linear shrinkage Tyler’s estimators, as mentioned above, while “NL” is nonlinear shrinkage. It can be seen that we are (much) better than the latter for heavy tails and then converge to the same values, once we approach the Gaussian tail behavior.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-up4AM0hLmPdi3lD9-JiKg.png" /><figcaption>Simulation results from the paper on arXiv.</figcaption></figure><h4>Conclusion</h4><p>This article discussed the paper “R-NL: Fast and Robust Covariance Estimation for Elliptical Distributions in High Dimensions’’. The paper on arXiv contains a wide range of simulation settings, indicating that the R-NL and R-C-NL estimators do exceedingly well in a wide range of situations.</p><p>Thus I hope that the estimator(s) can be successfully used in a lot of real applications as well!</p><p><strong>References</strong></p><p>[1] Tyler, D. E. (1987a). A distribution-free M-estimator of multivariate scatter. Annals of Statistics, 15(1):234–251.</p><p>[2] Chen, Y., Wiesel, A., and Hero, A. O. (2011). Robust shrinkage estimation of high dimensional covariance matrices. IEEE Transactions on Signal Processing, 59(9):4097– 4107</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=da61723ce9b3" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/high-dimensional-covariance-estimation-when-tails-are-heavy-da61723ce9b3">R-NL: Robust Nonlinear Shrinkage</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>