<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Jeffrey Näf on Medium]]></title>
        <description><![CDATA[Stories by Jeffrey Näf on Medium]]></description>
        <link>https://medium.com/@jeffrey_85949?source=rss-ca780798011a------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*fTRnnMS3RSMCfs2m5zRcDQ.jpeg</url>
            <title>Stories by Jeffrey Näf on Medium</title>
            <link>https://medium.com/@jeffrey_85949?source=rss-ca780798011a------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sat, 16 May 2026 02:00:00 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@jeffrey_85949/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[What Is a Good Imputation for Missing Values?]]></title>
            <link>https://medium.com/data-science/what-is-a-good-imputation-for-missing-values-e9256d45851b?source=rss-ca780798011a------2</link>
            <guid isPermaLink="false">https://medium.com/p/e9256d45851b</guid>
            <category><![CDATA[deep-dives]]></category>
            <category><![CDATA[statistics]]></category>
            <category><![CDATA[data-imputation]]></category>
            <category><![CDATA[missing-data]]></category>
            <category><![CDATA[missing-values]]></category>
            <dc:creator><![CDATA[Jeffrey Näf]]></dc:creator>
            <pubDate>Sat, 08 Jun 2024 03:11:56 GMT</pubDate>
            <atom:updated>2024-06-08T08:05:19.853Z</atom:updated>
<content:encoded><![CDATA[<h4>My current take on what imputation should be</h4><p>This article is the first summarizing and discussing my most recent <a href="https://hal.science/hal-04521894">paper</a>. We study general-purpose imputation of tabular datasets. That is, the imputation should be done in a way that works for many different tasks in a second step (sometimes referred to as “broad imputation”).</p><p>In this article, I will share 3 lessons that I learned while working on this problem over the last few years. I am very excited about this paper in particular, but also cautious, as the problem of missing values has many aspects and it can be difficult to not miss something. So I invite you to judge for yourself whether my lessons make sense to you.</p><p>If you do not want to get into a detailed discussion about missing values, my recommendations are summarized at the end of the article.</p><p><strong><em>Disclaimer:</em></strong><em> The goal of this article is to use imputation to recreate the original data distribution. While I feel this is what most researchers and practitioners actually want, this is a difficult goal that might not be necessary in all applications. For instance, when performing (conditional mean) prediction, there are several recent papers showing that even simple imputation methods are sufficient for large sample sizes.</em></p><p>All images in this article were created by the author.</p><h4>Preliminaries</h4><p>Before continuing we need to discuss how I think about missing values in this article.</p><p>We assume there is an underlying distribution <em>P*</em> from which observations <em>X*</em> are drawn. In addition, there is a vector of 0/1s of the same dimension as <em>X*</em> that is drawn, let’s call this vector <em>M</em>. The actual observed data vector <em>X</em> is then <em>X*</em> masked by <em>M</em>. Thus, we observe <em>n</em> independently and identically distributed (i.i.d.) copies of the joint vector <em>(X,M)</em>. If we write this up in a data matrix, this might look like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*fCF9TEO2vf764syUGUXPZw.png" /><figcaption>The data generating process: X* and M are drawn, then we observe n i.i.d. copies of (X,M), where X is X* but masked by M.</figcaption></figure><p>As usual, lowercase letters <em>x, m</em> denote observed (realized) values, while uppercase letters refer to random quantities. The missingness mechanisms everyone talks about are then assumptions about the relationship or joint distribution of <em>(X*,M):</em></p><p><strong>Missing Completely at Random (MCAR): </strong>The probability of a value being missing is a coin flip, independent of any variable in the dataset. Here missing values are but a nuisance. You could ignore them and just focus on the fully observed part of your dataset and there would be no bias. In math, for all <em>m </em>and <em>x</em>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/535/1*QOb9DZ-N7zgtvFdcMOKA7Q.png" /></figure><p><strong>Missing at Random (MAR):</strong> The probability of missingness can now depend on the <em>observed</em> variables in your dataset. A typical example would be two variables, say income and age, whereby age is always observed, but income might be missing for certain values of age. This is the example we study below. This may sound reasonable, but here it can get complicated. 
In math, for all <em>m</em> and <em>x</em>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/688/1*-lr0TKZuekzdT7CbHUnvrA.png" /></figure><p><strong>Missing Not at Random (MNAR):</strong> Everything is possible here, and we cannot say anything about anything in general.</p><p>The key is that for imputation, we need to learn the conditional distribution of missing values given observed values in one pattern <em>m’</em> to impute in another pattern <em>m</em>.</p><p>A well-known method of achieving this is the Multiple Imputation by Chained Equations (<strong>MICE</strong>) method: Initially fill the values with a simple imputation, such as mean imputation. Then for each iteration <em>t</em>, for each variable <em>j</em> regress the observed <em>X_j </em>on all other variables (which are imputed). Then plug in the values of these variables into the learned imputer for all <em>X_j </em>that are not observed. This is explained in detail in <a href="https://medium.com/@ofirdi/mice-is-nice-but-why-should-you-care-e66698f245a3">this article</a>, with an amazing illustration that will make things immediately clear. In R this is conveniently implemented in the <a href="https://cran.r-project.org/web/packages/mice/index.html">mice R package</a>. As I will outline below, I am a huge fan of this method, based on the performance I have seen. In fact, the ability of certain instances of MICE, such as mice-cart, to recreate the underlying distribution is uncanny. In this article, we focus on a very simple example with only one variable missing, and so we can code by hand what MICE would usually do iteratively, to better illustrate what is happening.</p><blockquote>A first mini-lesson is that MICE is a host of methods; whatever method you choose to regress <em>X_j </em>on the other variables gives you a different imputation method. As such, there are countless variants in the mice R package, such as mice-cart, mice-rf, mice-pmm, mice-norm.nob, mice-norm.predict and so on. These methods will perform very differently, as we will see below. Despite this, at least some papers (in top conferences such as NeurIPS) confidently proclaim that they compare their methods to “MICE”, without any detail on what exactly they are using.</blockquote><h4>The Example</h4><p>We will look at a very simple but illustrative example: Consider a data set with two jointly normal variables, <em>X_1, X_2</em>. We assume both variables have a variance of 1 and a positive correlation of 0.7. To give some context, we can imagine <em>X_1 </em>to be (the logarithm of) income and <em>X_2 </em>to be age. (This is just for illustration, obviously no one is between -3 and 3 years old). Moreover, assume a missing mechanism for the income <em>X_1</em>, whereby <em>X_1 </em>tends to be missing whenever age is “high”. That is, we set:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/846/1*ouF4f8_2k93dVqL94Ce-9w.png" /></figure><p>So <em>X_1 </em>(income) is missing with probability 0.8 whenever <em>X_2 </em>(age) is “large” (i.e., larger than zero). As we assume <em>X_2 </em>is always observed, this is a textbook MAR example with two patterns, one where all variables are fully observed (<em>m1</em>) and a second (<em>m2</em>), wherein <em>X_1 </em>is missing. Despite the simplicity of this example, if we assume that higher age is related to higher income, there is a<em> clear shift in the distribution of income and age when moving from one pattern to the other</em>. 
In pattern <em>m2</em>, where income is missing, values of both the observed age and the (unobserved) income tend to be higher. Let’s look at this in code:</p><pre>library(MASS)<br>library(mice)<br><br><br><br>set.seed(10)<br>n&lt;-3000<br><br>Xstar &lt;- mvrnorm(n=n, mu=c(0,0), Sigma=matrix( c(1,0.7,0.7,1), nrow=2, byrow=T   ))<br><br>colnames(Xstar) &lt;- paste0(&quot;X&quot;,1:2)<br><br><br><br>## Introduce missing mechanisms<br>M&lt;-matrix(0, ncol=ncol(Xstar), nrow=nrow(Xstar))<br>M[Xstar[,2] &gt; 0, 1]&lt;- sample(c(0,1), size=sum(Xstar[,2] &gt; 0), replace=T, prob = c(1-0.8,0.8) )<br><br><br>## This gives rise to the observed dataset by masking X^* with M:<br>X&lt;-Xstar<br>X[M==1] &lt;- NA<br><br><br>## Plot the distribution shift<br>par(mfrow=c(2,1))<br>plot(Xstar[!is.na(X[,1]),1:2], xlab=&quot;&quot;, main=&quot;&quot;, ylab=&quot;&quot;, cex=0.8, col=&quot;darkblue&quot;, xlim=c(-4,4), ylim=c(-3,3))<br>plot(Xstar[is.na(X[,1]),1:2], xlab=&quot;&quot;, main=&quot;&quot;, ylab=&quot;&quot;, cex=0.8, col=&quot;darkblue&quot;, xlim=c(-4,4), ylim=c(-3,3))</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*R1I9okHOAFcHlhKqdGh4ig.png" /><figcaption>Top: Distribution of (X_1,X_2) in the pattern where X_1 is observed, Bottom: Distribution of (X_1,X_2) in the pattern where X_1 is missing.</figcaption></figure><h4>Lesson 1: Imputation is a distributional prediction problem</h4><p>In my view, the goal of (general purpose) imputation should be to replicate the underlying data distribution as well as possible. To illustrate this, consider again the first example with <em>p=0</em>, such that only <em>X_1</em> has missing values. We will now try to impute this example, using the famous <a href="https://medium.com/@ofirdi/mice-is-nice-but-why-should-you-care-e66698f245a3">MICE </a>approach. Since only <em>X_1</em> is missing, we can implement this by hand. We start with the <em>mean imputation</em>, which simply calculates the mean of <em>X_1 </em>in the pattern where it is observed, and plugs this mean in the place of NA. We also use the <em>regression imputation</em> which is a bit more sophisticated: We regress <em>X_1 </em>onto <em>X_2 </em>in the pattern where <em>X_1 </em>is observed and then for each missing observation of <em>X_1 </em>we plug in the prediction of the regression. Thus here we impute the conditional mean of <em>X_1 </em>given <em>X_2</em>. Finally, for the <em>Gaussian imputation</em>, we start with the same regression of <em>X_1 </em>onto <em>X_2</em>, but then impute each missing value of <em>X_1 </em>by drawing from a Gaussian distribution. <em>In other words, instead of imputing the conditional expectation (i.e. just the center of the conditional distribution), we draw from this distribution.</em> This leads to a random imputation, which may be a bit counterintuitive at first, but will actually lead to the best result:</p><pre>## (0) Mean Imputation: This would correspond to &quot;mean&quot; in the mice R package ##<br><br><br># 1. Estimate the mean<br>meanX&lt;-mean(X[!is.na(X[,1]),1])<br><br>## 2. Impute<br>meanimp&lt;-X<br>meanimp[is.na(X[,1]),1] &lt;-meanX<br><br>## (1) Regression Imputation: This would correspond to &quot;norm.predict&quot; in the mice R package ##<br><br># 1. Estimate Regression<br>lmodelX1X2&lt;-lm(X1~X2, data=as.data.frame(X[!is.na(X[,1]),])   )<br><br>## 2. 
Impute<br>impnormpredict&lt;-X<br>impnormpredict[is.na(X[,1]),1] &lt;-predict(lmodelX1X2, newdata= as.data.frame(X[is.na(X[,1]),])  )<br><br><br>## (2) Gaussian Imputation: This would correspond to &quot;norm.nob&quot; in the mice R package ##<br><br># 1. Estimate Regression<br>#lmodelX1X2&lt;-lm(X1~X2, X=as.data.frame(X[!is.na(X[,1]),])   )<br># (same as before)<br><br>## 2. Impute<br>impnorm&lt;-X<br>meanx&lt;-predict(lmodelX1X2, newdata= as.data.frame(X[is.na(X[,1]),])  )<br>var &lt;- var(lmodelX1X2$residuals)<br>impnorm[is.na(X[,1]),1] &lt;-rnorm(n=length(meanx), mean = meanx, sd=sqrt(var) )<br><br><br><br>## Plot the different imputations<br><br>par(mfrow=c(2,2))<br><br><br>plot(meanimp[!is.na(X[,1]),c(&quot;X2&quot;,&quot;X1&quot;)], main=paste(&quot;Mean Imputation&quot;), cex=0.8, col=&quot;darkblue&quot;, cex.main=1.5)<br>points(meanimp[is.na(X[,1]),c(&quot;X2&quot;,&quot;X1&quot;)], col=&quot;darkred&quot;, cex=0.8 )<br><br>plot(impnormpredict[!is.na(X[,1]),c(&quot;X2&quot;,&quot;X1&quot;)], main=paste(&quot;Regression Imputation&quot;), cex=0.8, col=&quot;darkblue&quot;, cex.main=1.5)<br>points(impnormpredict[is.na(X[,1]),c(&quot;X2&quot;,&quot;X1&quot;)], col=&quot;darkred&quot;, cex=0.8 )<br><br>plot(impnorm[!is.na(X[,1]),c(&quot;X2&quot;,&quot;X1&quot;)], main=paste(&quot;Gaussian Imputation&quot;), col=&quot;darkblue&quot;, cex.main=1.5)<br>points(impnorm[is.na(X[,1]),c(&quot;X2&quot;,&quot;X1&quot;)], col=&quot;darkred&quot;, cex=0.8 )<br><br>#plot(Xstar[,c(&quot;X2&quot;,&quot;X1&quot;)], main=&quot;Truth&quot;, col=&quot;darkblue&quot;, cex.main=1.5)<br>plot(Xstar[!is.na(X[,1]),c(&quot;X2&quot;,&quot;X1&quot;)], main=&quot;Truth&quot;, col=&quot;darkblue&quot;, cex.main=1.5)<br>points(Xstar[is.na(X[,1]),c(&quot;X2&quot;,&quot;X1&quot;)], col=&quot;darkgreen&quot;, cex=0.8 )<br><br></pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*K_D_vnLDSDPojplK7R-8Ww.png" /><figcaption>The distribution of (X_1, X_2) plotted for different imputation methods. Different Imputation methods (red are the imputed points).</figcaption></figure><p>Studying this plot immediately reveals that the mean and regression imputations might not be ideal, as they completely fail at recreating the original data distribution. In contrast, the Gaussian imputation looks pretty good, in fact, I’d argue it would be hard to differentiate it from the truth. This might just seem like a technical notion, but this has consequences. Imagine you were given any of those imputed data sets and now you would like to find the regression coefficient when regressing <em>X_2 </em>onto <em>X_1 </em>(the opposite of what we did for imputation). The truth in this case is given by <em>beta=cov(X_1,X_2)/var(X_1)=0.7</em>.</p><pre>## Regressing X_2 onto X_1<br><br>## mean imputation estimate<br>lm(X2~X1, data=data.frame(meanimp))$coefficients[&quot;X1&quot;]<br>## beta= 0.61<br><br>## regression imputation estimate<br>round(lm(X2~X1, data=data.frame(impnormpredict))$coefficients[&quot;X1&quot;],2)<br>## beta= 0.90<br><br>## Gaussian imputation estimate<br>round(lm(X2~X1, data=data.frame(impnorm))$coefficients[&quot;X1&quot;],2)<br>## beta= 0.71<br><br>## Truth imputation estimate<br>round(lm(X2~X1, data=data.frame(Xstar))$coefficients[&quot;X1&quot;],2)<br>## beta= 0.71</pre><p>The Gaussian imputation is pretty close to 0.7 (0.71), and importantly, it is very close to the estimate using the full (unobserved) data! 
On the other hand, the mean imputation underestimates <em>beta</em>, while<em> the regression imputation overestimates beta</em>. The latter is natural, as the conditional mean imputation artificially inflates the relationship between variables. This effect is particularly important, as this will result in effects that are overestimated in science and (data science) practice!!</p><p>The regression imputation might seem overly simplistic. However, the key is that very commonly used imputation methods in machine learning and other fields work exactly like this. For instance, knn imputation and random forest imputation (i.e., <a href="https://academic.oup.com/bioinformatics/article/28/1/112/219101">missForest</a>). Especially the latter has been praised and recommended in several benchmarking papers and appears very widely used. However, missForest fits a Random Forest on the observed data and then simply imputes by the conditional mean. So, using it in this example the result would look very similar to the regression imputation, thus resulting in an artificial strengthening of relations between variable and biased estimates!</p><blockquote>A lot of commonly used imputation methods, such as mean imputation, knn imputation, and missForest fail at replicating the distribution. What they estimate and approximate is the (conditional) mean, and so the imputation will look like that of the regression imputation (or even worse for the mean imputation). Instead, we should try to impute by drawing from estimated (conditional) distributions.</blockquote><h4>Lesson 2: Imputation should be evaluated as a distributional prediction problem</h4><p>There is a dual problem connected to the discussion of the first lesson. How should imputation methods be evaluated?</p><p>Imagine we developed a new imputation method and now want to benchmark this against methods that exist already such as missForest, MICE, or <a href="https://arxiv.org/abs/1806.02920">GAIN</a>. In this setting, we artificially induce the missing values and so we have the actual data set just as above. We now want to compare this true dataset to our imputations. For the sake of the example, let us assume the regression imputation above is our new method, and we would like to compare it to mean and Gaussian imputation.</p><p>Even in the most prestigious conferences, this is done by calculating the root mean squared error (RMSE):</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/635/1*oT9oRh8jZIgqGCxKbX9WYw.png" /></figure><p>This is implemented here:</p><pre>## Function to calculate the RMSE:<br># impX is the imputed data set<br># Xstar is the fully observed data set<br><br>RMSEcalc&lt;-function(impX, Xstar){<br>  <br>  round(mean(apply(Xstar - impX,1,function(x) norm(as.matrix(x), type=&quot;F&quot;  ) )),2)<br>  <br>}</pre><p>This discussion is related to the discussion on how to correctly score predictions. <a href="https://medium.com/towards-data-science/how-to-evaluate-your-predictions-cef80d8f6a69">In this article</a>, I discussed that (R)MSE is the right score to evaluate (conditional) mean predictions. It turns out the exact same logic applies here; using RMSE like this to evaluate our imputation, will favor methods that impute the conditional mean, such as the regression imputation, knn imputation, and missForest.</p><p>Instead, imputation should be <em>evaluated</em> as a distributional prediction problem. 
I suggest using the <em>energy distance between the distribution of the fully observed data and the imputation “distribution”</em>. Details can be found in the paper, but in R it is easily coded using the nice “energy” R package:</p><pre>library(energy)<br><br>## Function to calculate the energy distance:<br># impX is the imputed data set<br># Xstar is the fully observed data set<br><br>## Calculating the energy distance using the eqdist.e function of the energy package<br>energycalc &lt;- function(impX, Xstar){<br>  <br>  # Note: eqdist.e calculates the energy statistics for a test, which is actually<br>  # = n^2/(2n)*energydistance(impX,Xstar), but we we are only interested in relative values<br>  round(eqdist.e( rbind(Xstar,impX), c(nrow(Xstar), nrow(impX))  ),2)<br>  <br>}</pre><p>We now apply the two scores to our imaginary research project and try to figure out whether our regression imputation is better than the other two:</p><pre>par(mfrow=c(2,2))<br><br><br>## Same plots as before, but now with RMSE and energy distance <br>## added<br><br>plot(meanimp[!is.na(X[,1]),c(&quot;X2&quot;,&quot;X1&quot;)], main=paste(&quot;Mean Imputation&quot;, &quot;\nRMSE&quot;, RMSEcalc(meanimp, Xstar), &quot;\nEnergy&quot;, energycalc(meanimp, Xstar)), cex=0.8, col=&quot;darkblue&quot;, cex.main=1.5)<br>points(meanimp[is.na(X[,1]),c(&quot;X2&quot;,&quot;X1&quot;)], col=&quot;darkred&quot;, cex=0.8 )<br><br>plot(impnormpredict[!is.na(X[,1]),c(&quot;X2&quot;,&quot;X1&quot;)], main=paste(&quot;Regression Imputation&quot;,&quot;\nRMSE&quot;, RMSEcalc(impnormpredict, Xstar), &quot;\nEnergy&quot;, energycalc(impnormpredict, Xstar)), cex=0.8, col=&quot;darkblue&quot;, cex.main=1.5)<br>points(impnormpredict[is.na(X[,1]),c(&quot;X2&quot;,&quot;X1&quot;)], col=&quot;darkred&quot;, cex=0.8 )<br><br>plot(impnorm[!is.na(X[,1]),c(&quot;X2&quot;,&quot;X1&quot;)], main=paste(&quot;Gaussian Imputation&quot;,&quot;\nRMSE&quot;, RMSEcalc(impnorm, Xstar), &quot;\nEnergy&quot;, energycalc(impnorm, Xstar)), col=&quot;darkblue&quot;, cex.main=1.5)<br>points(impnorm[is.na(X[,1]),c(&quot;X2&quot;,&quot;X1&quot;)], col=&quot;darkred&quot;, cex=0.8 )<br><br><br>plot(Xstar[!is.na(X[,1]),c(&quot;X2&quot;,&quot;X1&quot;)], main=&quot;Truth&quot;, col=&quot;darkblue&quot;, cex.main=1.5)<br>points(Xstar[is.na(X[,1]),c(&quot;X2&quot;,&quot;X1&quot;)], col=&quot;darkgreen&quot;, cex=0.8 )<br><br><br></pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ais6yaMMEz35kyvJKFs6Jg.png" /></figure><p>If we look at RMSE, then our regression imputation appears great! It beats both mean and Gaussian imputation. However this clashes with the analysis from above, and choosing the regression imputation can and likely will lead to highly biased results. On the other hand, the (scaled) energy distance correctly identifies that the Gaussian imputation is the best method, agreeing with both visual intuition and better parameter estimates.</p><blockquote>When evaluating imputation methods (when the true data are available) measures such as RMSE and MAE should be avoided. Instead, the problem should be treated and evaluated as a distributional prediction problem, and distributional metrics such as the energy distance should be used. The overuse of RMSE as an evaluation tool has some serious implications for research in this area.</blockquote><p>Again this is not surprising, identifying the best mean prediction is what RMSE does. What is surprising, is how consistently it is used in research to evaluate imputation methods. 
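</p><p>As a small aside, the energy distance used here is not a black box: for two samples, it can be computed directly from pairwise distances. The following minimal sketch (with two made-up univariate samples, purely for illustration) spells out what the <em>eqdist.e</em> call above is based on; up to the test-statistic scaling n1*n2/(n1+n2), the two computations should agree:</p><pre>## Minimal sketch: energy distance between two univariate samples (made-up numbers)
x &lt;- c(0.1, 0.5, 1.2, 2.0)
y &lt;- c(1.0, 1.4, 2.2, 3.1)

between  &lt;- mean(abs(outer(x, y, &quot;-&quot;)))  # average distance between the samples
within_x &lt;- mean(abs(outer(x, x, &quot;-&quot;)))  # average distance within x
within_y &lt;- mean(abs(outer(y, y, &quot;-&quot;)))  # average distance within y

edist &lt;- 2 * between - within_x - within_y
edist

## The test statistic returned by the energy package should match after rescaling:
# energy::eqdist.e(matrix(c(x, y)), c(length(x), length(y)))
# length(x) * length(y) / (length(x) + length(y)) * edist</pre><p>Coming back to the widespread use of RMSE: 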
In my view, this throws into question at least some recommendations of recent papers, about what imputation methods to use. Moreover, as new imputation methods get developed they are compared to other methods in terms of RMSE and are thus likely not replicating the distribution correctly. One thus has to question the usefulness of at least some of the myriad of imputation methods developed in recent years.</p><p>The question of evaluation gets much harder, <strong>when the underlying observations are not available. </strong>In the paper we develope a score that allows to rank imputation methods, even in this case! (a refinement of the idea presented in <a href="https://towardsdatascience.com/i-scores-how-to-choose-the-best-method-to-fill-in-nas-in-your-data-set-43f3f0df971f">this article</a>). The details are reserved for another medium post, but we can try it for this example. The “Iscore.R” function can be found on <a href="https://github.com/JeffNaef/MARimputation/tree/c1f5a1e48e8a60db95c727876086db5b7305f614/Useable">Github </a>or at the end of this article.</p><pre><br>library(mice)<br>source(&quot;Iscore.R&quot;)<br><br><br>methods&lt;-c(&quot;mean&quot;,       #mice-mean<br>           &quot;norm.predict&quot;,   #mice-sample<br>           &quot;norm.nob&quot;) # Gaussian Imputation<br><br>## We first define functions that allow for imputation of the three methods:<br><br>imputationfuncs&lt;-list()<br><br>imputationfuncs[[&quot;mean&quot;]] &lt;- function(X,m){ <br># 1. Estimate the mean<br>  meanX&lt;-mean(X[!is.na(X[,1]),1])<br>## 2. Impute<br>  meanimp&lt;-X<br>  meanimp[is.na(X[,1]),1] &lt;-meanX<br>  <br>  res&lt;-list()<br>  <br>  for (l in 1:m){<br>    res[[l]] &lt;- meanimp<br>  }<br>  <br>  return(res)<br>  <br>}<br><br>imputationfuncs[[&quot;norm.predict&quot;]] &lt;- function(X,m){ <br> # 1. Estimate Regression<br>  lmodelX1X2&lt;-lm(X1~., data=as.data.frame(X[!is.na(X[,1]),])   )<br> ## 2. Impute<br>  impnormpredict&lt;-X<br>  impnormpredict[is.na(X[,1]),1] &lt;-predict(lmodelX1X2, newdata= as.data.frame(X[is.na(X[,1]),])  )<br>  <br>res&lt;-list()<br><br>for (l in 1:m){<br>  res[[l]] &lt;- impnormpredict<br>}<br><br>return(res)<br>  <br>  }<br><br><br>imputationfuncs[[&quot;norm.nob&quot;]] &lt;- function(X,m){ <br> # 1. Estimate Regression<br>  lmodelX1X2&lt;-lm(X1~., data=as.data.frame(X[!is.na(X[,1]),])   )<br> ## 2. Impute<br>  impnorm&lt;-X<br>  meanx&lt;-predict(lmodelX1X2, newdata= as.data.frame(X[is.na(X[,1]),])  )<br>  var &lt;- var(lmodelX1X2$residuals)<br>  <br>  res&lt;-list()<br>  <br>  for (l in 1:m){<br>    impnorm[is.na(X[,1]),1] &lt;-rnorm(n=length(meanx), mean = meanx, sd=sqrt(var) )<br>    res[[l]] &lt;- impnorm<br>  }<br><br>  <br>  return(res)<br>  <br>}<br><br><br>scoreslist &lt;- Iscores_new(X,imputations=NULL, imputationfuncs=imputationfuncs, N=30)  <br><br>scores&lt;-do.call(cbind,lapply(scoreslist, function(x) x$score ))<br>names(scores)&lt;-methods<br>scores[order(scores)]<br><br>#    mean       norm.predict     norm.nob <br>#  -0.7455304   -0.5702136   -0.4220387 <br></pre><p>Thus <em>without every seeing the values of the missing data</em>, our score is able to identify that norm.nob is the best method! This comes in handy, especially when the data has more than two dimensions. 
I will give more details on how to use the score and how it works in a future article.</p><h4>Lesson 3: MAR is weirder than you think</h4><p>When reading the literature on missing value imputation, it is easy to get a sense that MAR is a solved case, and all the problems arise from whether it can be assumed or not. While this might be true under standard procedures such as maximum likelihood, if one wants to find a good (nonparametric) imputation, this is not the case.</p><p>Our paper discusses how complex distribution shifts are possible under MAR when changing from, say, the fully observed pattern to a pattern one wants to impute. We will focus here on the shift in distribution that can occur in the observed variables. For this, we turn to the example above, where we took <em>X_1 </em>to be income and <em>X_2 </em>to be age. As we have seen in the first figure, the distribution looks quite different across the two patterns. However, the conditional distribution of <em>X_1 | X_2</em> remains the same! This makes it possible, in principle, to identify the right imputation distribution.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*19oJoQciz20EH9omhh5_BA.png" /><figcaption>Top: Distribution of X_2 in the pattern where X_1 is observed. Bottom: Distribution of (X_1,X_2) in the pattern where X_1 is missing.</figcaption></figure><p>The problem is that even if we can nonparametrically estimate the conditional distribution of <em>X_1</em> given <em>X_2</em> in the pattern where <em>X_1 </em>is observed, we need to extrapolate it to the values of <em>X_2 </em>in the pattern where <em>X_1 </em>is missing. To illustrate this, I will now introduce two very important nonparametric mice methods. One old (<em>mice-cart</em>) and one new (<em>mice-DRF</em>). The former uses one tree to regress <em>X_j</em> on all the other variables and then imputes by drawing samples from that tree. Thus instead of using the conditional expectation prediction of a tree/forest, as missForest does, it draws from the leaves to approximate sampling from the conditional distribution. In contrast, mice-DRF uses the <a href="https://medium.com/towards-data-science/drf-a-random-forest-for-almost-everything-625fa5c3bcb8">Distributional Random Forest</a>, a forest method designed to estimate conditional distributions, and then samples from those estimated distributions. 
Both work exceedingly well, as I will lay out below!</p><pre>library(drf)<br><br><br>## mice-DRF ##<br>par(mfrow=c(2,2))<br><br>#Fit DRF<br>DRF &lt;- drf(X=X[!is.na(X[,1]),2, drop=F], Y=X[!is.na(X[,1]),1, drop=F], num.trees=100)<br>impDRF&lt;-X<br># Predict weights for unobserved points<br>wx&lt;-predict(DRF, newdata= X[is.na(X[,1]),2, drop=F]  )$weights<br>impDRF[is.na(X[,1]),1] &lt;-apply(wx,1,function(wxi) sample(X[!is.na(X[,1]),1, drop=F], size=1, replace=T, prob=wxi))<br><br><br>plot(impDRF[!is.na(X[,1]),c(&quot;X2&quot;,&quot;X1&quot;)], main=paste(&quot;DRF Imputation&quot;, &quot;\nRMSE&quot;, RMSEcalc(impDRF, Xstar), &quot;\nEnergy&quot;, energycalc(impDRF, Xstar)), cex=0.8, col=&quot;darkblue&quot;, cex.main=1.5)<br>points(impDRF[is.na(X[,1]),c(&quot;X2&quot;,&quot;X1&quot;)], col=&quot;darkred&quot;, cex=0.8 )<br><br><br>## mice-cart##<br>impcart&lt;-X<br>impcart[is.na(X[,1]),1] &lt;-mice.impute.cart(X[,1], ry=!is.na(X[,1]), X[,2, drop=F], wy = NULL)<br><br>plot(impcart[!is.na(X[,1]),c(&quot;X2&quot;,&quot;X1&quot;)], main=paste(&quot;cart Imputation&quot;, &quot;\nRMSE&quot;, RMSEcalc(impcart, Xstar), &quot;\nEnergy&quot;, energycalc(impcart, Xstar)), cex=0.8, col=&quot;darkblue&quot;, cex.main=1.5)<br>points(impcart[is.na(X[,1]),c(&quot;X2&quot;,&quot;X1&quot;)], col=&quot;darkred&quot;, cex=0.8 )<br><br>plot(impnorm[!is.na(X[,1]),c(&quot;X2&quot;,&quot;X1&quot;)], main=paste(&quot;Gaussian Imputation&quot;,&quot;\nRMSE&quot;, RMSEcalc(impnorm, Xstar), &quot;\nEnergy&quot;, energycalc(impnorm, Xstar)), col=&quot;darkblue&quot;, cex.main=1.5)<br>points(impnorm[is.na(X[,1]),c(&quot;X2&quot;,&quot;X1&quot;)], col=&quot;darkred&quot;, cex=0.8 )</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ljMLmWmuyFS9R0zoX2Aqew.png" /></figure><p>Though both mice-cart and mice-DRF do a good job, they are still not quite as good as the Gaussian imputation. This is not surprising per se, as the Gaussian imputation is the ideal imputation in this case (because <em>(X_1, X_2)</em> are indeed Gaussian). Nonetheless, the distribution shift in <em>X_2</em> likely plays a role in the difficulty mice-cart and mice-DRF have in recovering the distribution, even with 3000 observations (these methods are usually really really good). Note that this kind of extrapolation is not a problem for the Gaussian imputation.</p><p>The paper also discusses a similar, but more extreme example with two variables <em>(X_1, X_2)</em>. In this example, the distribution shift is much more pronounced, and the forest-based methods struggle accordingly:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*JLrjrzUMQB2H5nunyAr0nA.png" /><figcaption>More extreme example of distribution shift in the paper. While the Gaussian imputation is near perfect, mice-RF and mice-DRF are not able to extrapolate correctly.</figcaption></figure><p>The problem is that these kinds of extreme distribution shifts are possible under MAR, and forest-based methods have a hard time extrapolating outside of the data set (so do neural nets btw). Indeed, can you think of a method that can (1) learn a distribution nonparametrically and (2) extrapolate from <em>X_2</em> coming from the upper distribution to <em>X_2 </em>drawn from the lower distribution reliably? For now, I cannot.</p><blockquote>Imputation is hard, even if MAR can be assumed, and the search for reliable imputation methods is not over.</blockquote><h4>Conclusion: My current recommendations</h4><p>Missing values are a hairy problem. 
Indeed, the best way to deal with missing values is to not have them. Accordingly, Lesson 3 shows that the search for imputation methods is not yet concluded, even if one only considers MAR. We still lack a method that can do (1) nonparametric distributional prediction and (2) adapt to distribution shifts that are possible under MAR. That said, I also sometimes feel people make the problem more complicated than it is; some MICE methods perform extremely well and might be enough for many missing value problems already.</p><p>I first want to mention that there are very fancy machine learning methods, like <a href="https://arxiv.org/abs/1806.02920">GAIN</a> and its variants, that try to impute data using neural nets. I like these methods because they follow the right idea: Impute the conditional distributions of missing given observed. However, after using them a bit, I am somewhat disappointed by their performance, especially compared to MICE.</p><p>Thus, if I had a missing value problem, the first thing I’d try is <em>mice-cart</em> (implemented in the mice R package) or the new <em>mice-DRF</em> (code on <a href="https://github.com/JeffNaef/MARimputation/tree/c1f5a1e48e8a60db95c727876086db5b7305f614/Useable">Github</a>) we developed in the paper. I have tried those two on quite a few examples and their ability to recreate the data is uncanny. However, note that these observations of mine are not based on a large, systematic benchmarking study and should be taken with a grain of salt. Moreover, this requires at least an intermediate sample size of, say, above 200 or 300. Imputation is not easy, and completely nonparametric methods will suffer if the sample size is too low. In the case of fewer than 200 observations, I would go with simpler methods such as Gaussian imputation (<em>mice-norm.nob</em> in the R package). If you would then like to find the best out of these methods, I recommend trying our score developed in the paper, as done in Lesson 2 (though the implementation might not be the best).</p><p>Finally, note that none of these methods are able to effectively deal with <strong>imputation uncertainty</strong>! In a sense, we only discussed single imputation in this article. (Proper) multiple imputation would require that the uncertainty of the imputation method itself is taken into account, which is usually done using Bayesian methods, as in the standard multiple-imputation workflow of the mice package sketched below. 
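</p><p>A minimal sketch of that workflow, purely for illustration (the choices of m = 5, the method &quot;norm&quot;, i.e. Bayesian linear regression imputation, and the analysis model are arbitrary); the variation across the m completed datasets is what carries the imputation uncertainty into the pooled estimates:</p><pre>library(mice)

## Impute the incomplete data m times (here with Bayesian linear regression imputation)
imp &lt;- mice(as.data.frame(X), m = 5, method = &quot;norm&quot;, printFlag = FALSE)

## Fit the analysis model on each of the m completed datasets
fits &lt;- with(imp, lm(X2 ~ X1))

## Pool the m estimates with Rubin&#39;s rules
summary(pool(fits))</pre><p>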
For frequentist method like we looked at here, this appears to be an open problem.</p><h4>Appendix 1: m-I-Score</h4><p>The File “Iscore.R”, which can also be found on <a href="https://github.com/JeffNaef/MARimputation/tree/c1f5a1e48e8a60db95c727876086db5b7305f614/Useable">Github</a>.</p><pre>Iscores_new&lt;-function(X, N=50,  imputationfuncs=NULL, imputations=NULL, maxlength=NULL,...){<br>  <br>  ## X: Data with NAs<br>  ## N: Number of samples from imputation distribution H<br>  ## imputationfuncs: A list of functions, whereby each imputationfuncs[[method]] is a function that takes the arguments<br>  ## X,m and imputes X m times using method: imputations= imputationfuncs[[method]](X,m).<br>  ## imputations: Either NULL or a list of imputations for the methods considered, each imputed X saved as <br>  ##              imputations[[method]], whereby method is a string<br>  ## maxlength: Maximum number of variables X_j to consider, can speed up the code<br>  <br>  <br>  require(Matrix)<br>  require(scoringRules)<br>  <br>  <br>  numberofmissingbyj&lt;-sapply(1:ncol(X), function(j)  sum(is.na(X[,j]))  )<br>  print(&quot;Number of missing values per dimension:&quot;)<br>  print(paste0(numberofmissingbyj, collapse=&quot;,&quot;)  )<br><br>  methods&lt;-names(imputationfuncs)<br><br>  score_all&lt;-list()<br>  <br>  for (method in methods) {<br>    print(paste0(&quot;Evaluating method &quot;, method))<br>    <br>    <br>    # }<br>    if (is.null(imputations)){<br>      # If there is no prior imputation<br>      tmp&lt;-Iscores_new_perimp(X, Ximp=NULL, N=N, imputationfunc=imputationfuncs[[method]], maxlength=maxlength,...)<br>      score_all[[method]] &lt;- tmp  <br>      <br>      <br>    }else{<br>      <br>      tmp&lt;-Iscores_new_perimp(X, Ximp=imputations[[method]][[1]], N=N, imputationfunc=imputationfuncs[[method]], maxlength=maxlength, ...)<br>      score_all[[method]] &lt;- tmp  <br>      <br>    }<br>    <br>    <br>    <br>  }<br>  <br>  return(score_all)<br>  <br>}<br><br><br>Iscores_new_perimp &lt;- function(X, Ximp, N=50, imputationfunc, maxlength=NULL,...){<br>  <br>  if (is.null(Ximp)){<br>    # Impute, maxit should not be 1 here!<br>    Ximp&lt;-imputationfunc(X=X  , m=1)[[1]]<br>  }<br>  <br>  <br>  colnames(X) &lt;- colnames(Ximp) &lt;- paste0(&quot;X&quot;, 1:ncol(X))<br>  <br>  args&lt;-list(...)<br>  <br>  X&lt;-as.matrix(X)<br>  Ximp&lt;-as.matrix(Ximp)<br>  <br>  n&lt;-nrow(X)<br>  p&lt;-ncol(X)<br>  <br>  ##Step 1: Reoder the data according to the number of missing values<br>  ## (least missing first)<br>  numberofmissingbyj&lt;-sapply(1:p, function(j)  sum(is.na(X[,j]))  )<br><br>  ## Done in the function<br>  M&lt;-1*is.na(X)<br>  colnames(M) &lt;- colnames(X)<br>  <br>  indexfull&lt;-colnames(X)<br>  <br>  <br>  # Order first according to most missing values<br>  <br>  # Get dimensions with missing values (all other are not important)<br>  dimwithNA&lt;-(colSums(M) &gt; 0)<br>  dimwithNA &lt;- dimwithNA[order(numberofmissingbyj, decreasing=T)]<br>  dimwithNA&lt;-dimwithNA[dimwithNA==TRUE]<br>  <br>  if (is.null(maxlength)){maxlength&lt;-sum(dimwithNA) }<br>  <br>  if (sum(dimwithNA) &lt; maxlength){<br>    warning(&quot;maxlength was set smaller than sum(dimwithNA)&quot;)<br>    maxlength&lt;-sum(dimwithNA)<br>  }<br>  <br>  <br>  index&lt;-1:ncol(X)<br>  scorej&lt;-matrix(NA, nrow= min(sum(dimwithNA), maxlength), ncol=1)<br>  weight&lt;-matrix(NA, nrow= min(sum(dimwithNA), maxlength), ncol=1)<br>  i&lt;-0<br>  <br>  for (j in names(dimwithNA)[1:maxlength]){<br>    
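# Loop over the variables that contain missing values (most NAs first); each is scored via crps_sample below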
<br>    i&lt;-i+1<br><br>    <br>    print( paste0(&quot;Dimension &quot;, i, &quot; out of &quot;, maxlength )   ) <br>    <br>  <br>    <br>    # H for all missing values of X_j<br>    Ximp1&lt;-Ximp[M[,j]==1, ]<br>    <br>    # H for all observed values of X_j<br>    Ximp0&lt;-Ximp[M[,j]==0, ]<br>    <br>    X0 &lt;-X[M[,j]==0, ]<br>    <br>    n1&lt;-nrow(Ximp1)<br>    n0&lt;-nrow(Ximp0)<br>    <br>    <br>    if (n1 &lt; 10){<br>      scorej[i]&lt;-NA<br>      <br>      warning(&#39;Sample size of missing and nonmissing too small for nonparametric distributional regression, setting to NA&#39;)<br>      <br>    }else{<br>      <br>      <br>      # Evaluate on observed data<br>      Xtest &lt;- Ximp0[,!(colnames(Ximp0) %in% j) &amp;  (colnames(Ximp0) %in% indexfull), drop=F]<br>      Oj&lt;-apply(X0[,!(colnames(Ximp0) %in% j) &amp;  (colnames(Ximp0) %in% indexfull), drop=F],2,function(x) !any(is.na(x)) )<br>      # Only take those that are fully observed<br>      Xtest&lt;-Xtest[,Oj, drop=F]<br>      <br>      Ytest &lt;-Ximp0[,j, drop=F]<br>      <br>      if (is.null(Xtest)){<br>        scorej[i]&lt;-NA<br>        #weighted<br>        weight[i]&lt;-(n1/n)*(n0/n)<br>        warning(&quot;Oj was empty&quot;)<br>        next<br>      }<br>      <br>      ###Test 1:<br>      # Train DRF on imputed data<br>      Xtrain&lt;-Ximp1[,!(colnames(Ximp1) %in% j) &amp; (colnames(Ximp1) %in% indexfull), drop=F]<br>      # Only take those that are fully observed<br>      Xtrain&lt;-Xtrain[,Oj, drop=F]<br>      <br>      Ytrain&lt;-Ximp1[,j, drop=F]<br><br>      <br>      Xartificial&lt;-cbind(c(rep(NA,nrow(Ytest)),c(Ytrain)),rbind(Xtest, Xtrain)   )<br>      colnames(Xartificial)&lt;-c(colnames(Ytrain), colnames(Xtrain))<br>      <br>      Imputationlist&lt;-imputationfunc(X=Xartificial  , m=N)<br>      <br>      Ymatrix&lt;-do.call(cbind, lapply(Imputationlist, function(x)  x[1:nrow(Ytest),1]  ))<br>      <br>      scorej[i] &lt;- -mean(sapply(1:nrow(Ytest), function(l)  { crps_sample(y = Ytest[l,], dat = Ymatrix[l,]) }))<br>      <br>    }<br>    <br>    <br>    <br>    #weighted<br>    weight[i]&lt;-(n1/n)*(n0/n)<br>    <br>  }<br>  <br>  scorelist&lt;-c(scorej)<br>  names(scorelist) &lt;- names(dimwithNA)[1:maxlength]<br>  weightlist&lt;-c(weight)<br>  names(weightlist) &lt;- names(dimwithNA)[1:maxlength]<br>  <br>  weightedscore&lt;-scorej*weight/(sum(weight, na.rm=T))<br>  <br>  ## Weight the score according to n0/n * n1/n!!<br>  return( list(score= sum(weightedscore, na.rm=T), scorelist=scorelist, weightlist=weightlist)  )<br>}</pre><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=e9256d45851b" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/what-is-a-good-imputation-for-missing-values-e9256d45851b">What Is a Good Imputation for Missing Values?</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to Evaluate Your Predictions]]></title>
            <link>https://medium.com/data-science/how-to-evaluate-your-predictions-cef80d8f6a69?source=rss-ca780798011a------2</link>
            <guid isPermaLink="false">https://medium.com/p/cef80d8f6a69</guid>
            <category><![CDATA[calibration]]></category>
            <category><![CDATA[predictions]]></category>
            <category><![CDATA[model-evaluation]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[thoughts-and-theory]]></category>
            <dc:creator><![CDATA[Jeffrey Näf]]></dc:creator>
            <pubDate>Fri, 17 May 2024 06:59:57 GMT</pubDate>
            <atom:updated>2024-06-06T15:25:42.409Z</atom:updated>
<content:encoded><![CDATA[<h4>Be mindful of the measure you choose</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*VVF7D8zvpTOgZY2O" /><figcaption>Photo by <a href="https://unsplash.com/@isaacmsmith?utm_source=medium&amp;utm_medium=referral">Isaac Smith</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>Testing and benchmarking machine learning models by comparing their predictions on a test set, even after deployment, is of fundamental importance. To do this, one needs to think of a measure or <em>score </em>that takes a prediction and a test point and assigns a value measuring how successful the prediction is with respect to the test point. However, one should think carefully about which scoring measure is appropriate. In particular, when choosing a method to evaluate a prediction, we should adhere to the idea of <em>proper scoring rules</em>. I only give a loose definition of this idea here, but basically, we want a score that is minimized at the thing we want to measure!</p><blockquote>As a general rule: One can use MSE to evaluate mean predictions, MAE to evaluate median predictions, the quantile score to evaluate more general quantile predictions, and the energy or MMD score to evaluate distributional predictions.</blockquote><p>Consider a variable you want to predict, say a random variable <em>Y</em>, from a vector of covariates <strong><em>X</em></strong>. In the example below, <em>Y</em> will be income and <strong><em>X</em></strong> will be certain characteristics, such as <em>age </em>and <em>education</em>. We learned a predictor <em>f</em> on some training data and now we predict <em>Y</em> as <em>f(</em><strong><em>x</em></strong><em>)</em>. Usually, when we want to predict a variable <em>Y</em> as well as possible, we predict the expectation of <em>Y</em> given <strong>x</strong>, i.e. <em>f(</em><strong><em>x</em></strong><em>)</em> should approximate <em>E[Y | </em><strong><em>X</em></strong><em>=</em><strong><em>x</em></strong><em>]</em>. But more generally, <em>f(</em><strong><em>x</em></strong><em>)</em> could be an estimator of the median, other quantiles, or even the full conditional distribution <em>P(Y | </em><strong><em>X</em></strong><em>=</em><strong><em>x</em></strong><em>)</em>.</p><p>Now, for a new test point <em>y</em>, we want to score the prediction; that is, we want a function <em>S(y,f(</em><strong><em>x</em></strong><em>))</em> that is <em>minimized </em>(in expectation)<em> </em>when <em>f(</em><strong><em>x</em></strong><em>)</em> is the best thing we can do. For instance, if we want to predict<em> E[Y | </em><strong><em>X</em></strong><em>=</em><strong><em>x</em></strong><em>]</em>, this score is given by the MSE: <em>S(y, f(</em><strong><em>x</em></strong><em>))= (y-f(</em><strong><em>x</em></strong><em>))²</em>.</p><p>Here we study the principle of scoring the predictor <em>f</em> over a test set of <em>(y_i,</em><strong><em>x</em></strong><em>_i), i=1,…,ntest</em> in more detail. In all examples, we will compare the ideal estimation method to another that is clearly wrong, or naive, and show that our scores do what they are supposed to. The full code used here can also be found on <a href="https://github.com/JeffNaef/Medium-Articles/blob/main/HowtoScoreprediction.R">Github</a>.</p><h4>The Example</h4><p>To illustrate things, I will simulate a simple dataset that should mimic income data. 
We will use this simple example throughout this article to illustrate the concepts.</p><pre>library(dplyr)<br><br><br>#Create some variables:<br># Simulate data for 100 individuals<br>n &lt;- 5000<br><br># Generate age between 20 and 60<br>age &lt;- round(runif(n, min = 20, max = 60))<br><br># Define education levels<br>education_levels &lt;- c(&quot;High School&quot;, &quot;Bachelor&#39;s&quot;, &quot;Master&#39;s&quot;)<br><br># Simulate education level probabilities<br>education_probs &lt;- c(0.4, 0.4, 0.2)<br><br># Sample education level based on probabilities<br>education &lt;- sample(education_levels, n, replace = TRUE, prob = education_probs)<br><br># Simulate experience correlated with age with some random error<br>experience &lt;- age - 20 + round(rnorm(n, mean = 0, sd = 3)) <br><br># Define a non-linear function for wage<br>wage &lt;- exp((age * 0.1) + (case_when(education == &quot;High School&quot; ~ 1,<br>                                 education == &quot;Bachelor&#39;s&quot; ~ 1.5,<br>                                 TRUE ~ 2)) + (experience * 0.05) + rnorm(n, mean = 0, sd = 0.5))<br><br>hist(wage)</pre><p>Although this simulation may be oversimplified, it reflects certain well-known characteristics of such data: older age, advanced education, and greater experience are all linked to higher wages. The use of the “exp” operator results in a highly skewed wage distribution, which is a consistent observation in such datasets.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*1aWAHQVZwUbumKf_OjSMKw.png" /><figcaption>Wage distribution over the whole simulated population. Source: Author</figcaption></figure><p>Crucially, this skewness is also present when we fix age, education and experience to certain values. Let’s imagine we look at a specific person, Dave, who is 30 years old, has a Bachelor’s in Economics and 10 years of experience and let’s look at his actual income distribution according to our data generating process:</p><pre>ageDave&lt;-30<br>educationDave&lt;-&quot;Bachelor&#39;s&quot;<br>experienceDave &lt;- 10<br><br><br>wageDave &lt;- exp((ageDave * 0.1) + (case_when(educationDave == &quot;High School&quot; ~ 1,<br>                                     educationDave == &quot;Bachelor&#39;s&quot; ~ 1.5,<br>                                     TRUE ~ 2)) + (experienceDave * 0.05) + rnorm(n, mean = 0, sd = 0.5))<br><br>hist(wageDave, main=&quot;Wage Distribution for Dave&quot;, xlab=&quot;Wage&quot;)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*M4Hrt419D9695CFoSY9CtA.png" /><figcaption>Wage distrbution for Dave. 
Source: Author</figcaption></figure><p>Thus the distribution of possible wages of Dave, given the information we have about him, is still highly skewed.</p><p>We also generate a test set of several people:</p><pre><br>## Generate test set<br>ntest&lt;-1000<br><br># Generate age between 20 and 60<br>agetest &lt;- round(runif(ntest, min = 20, max = 60))<br><br><br># Sample education level based on probabilities<br>educationtest &lt;- sample(education_levels, ntest, replace = TRUE, prob = education_probs)<br><br># Simulate experience correlated with age with some random error<br>experiencetest &lt;- agetest - 20 + round(rnorm(ntest, mean = 0, sd = 3))<br><br><br>## Generate ytest that we try to predict:<br><br>wagetest &lt;- exp((agetest * 0.1) + (case_when(educationtest == &quot;High School&quot; ~ 1,<br>                                             educationtest == &quot;Bachelor&#39;s&quot; ~ 1.5,<br>                                             TRUE ~ 2)) + (experiencetest * 0.05) + rnorm(ntest, mean = 0, sd = 0.5))</pre><p>We now start simple and first look at the scores for mean and median prediction.</p><h4>The scores for mean and median prediction</h4><p>In data science and machine learning, interest often centers on a single number that signifies the “center” or “middle” of the distribution we aim to predict, namely the (conditional) mean or median. To do this we have the mean squared error (MSE):</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/381/1*OP_7Wi50Pb0CKAxRu_A6zw.png" /></figure><p>and the mean absolute error (MAE):</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/352/1*u1NgcuBR_rGwPjesApBZug.png" /></figure><p>An important takeaway is that the MSE is the appropriate metric for predicting the conditional mean, while the MAE is the measure to use for the conditional median. Mean and median are not the same thing for skewed distributions like the one we study here.</p><p>Let us illustrate this for the above example with very simple estimators (that we would not have access to in real life), just for illustration:</p><pre>conditionalmeanest &lt;-<br>  function(age, education, experience, N = 1000) {<br>    mean(exp((age * 0.1) + (<br>      case_when(<br>        education == &quot;High School&quot; ~ 1,<br>        education == &quot;Bachelor&#39;s&quot; ~ 1.5,<br>        TRUE ~ 2<br>      )<br>    ) + (experience * 0.05) + rnorm(N, mean = 0, sd = 0.5)<br>    ))<br>  }<br><br><br>conditionalmedianest &lt;-<br>  function(age, education, experience, N = 1000) {<br>    median(exp((age * 0.1) + (<br>      case_when(<br>        education == &quot;High School&quot; ~ 1,<br>        education == &quot;Bachelor&#39;s&quot; ~ 1.5,<br>        TRUE ~ 2<br>      )<br>    ) + (experience * 0.05) + rnorm(N, mean = 0, sd = 0.5)<br>    ))<br>  }</pre><p>That is we estimate mean and median, by simply simulating from the model for fixed values of age, education, and experience (this would be a simulation from the correct conditional distribution) and then we simply take the mean/median of that. 
Let’s test this on Dave:</p><pre><br>hist(wageDave, main=&quot;Wage Distribution for Dave&quot;, xlab=&quot;Wage&quot;)<br>abline(v=conditionalmeanest(ageDave, educationDave, experienceDave), col=&quot;darkred&quot;, cex=1.2)<br>abline(v=conditionalmedianest(ageDave, educationDave, experienceDave), col=&quot;darkblue&quot;, cex=1.2)<br><br></pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*d0FBBXFK6sbyW6eMfWulHg.png" /><figcaption>Blue: estimated conditional median of Dave, Red: estimated conditional mean of Dave. Source: Author</figcaption></figure><p>Clearly, the mean and median are different, as one would expect from such a distribution. In fact, as is typical for income distributions, the mean is higher (more influenced by high values) than the median.</p><p>Now let’s use these estimators on the test set:</p><pre>Xtest&lt;-data.frame(age=agetest, education=educationtest, experience=experiencetest)<br><br>meanest&lt;-sapply(1:nrow(Xtest), function(j)  conditionalmeanest(Xtest$age[j], Xtest$education[j], Xtest$experience[j])  )<br>median&lt;-sapply(1:nrow(Xtest), function(j)  conditionalmedianest(Xtest$age[j], Xtest$education[j], Xtest$experience[j])  )<br></pre><p>This gives a diverse range of conditional mean/median values. Now we calculate MSE and MAE:</p><pre>(MSE1&lt;-mean((meanest-wagetest)^2))<br>(MSE2&lt;-mean((median-wagetest)^2))<br><br>MSE1 &lt; MSE2<br>### Method 1 (the true mean estimator) is better than method 2!<br><br># but the MAE of method 1 is actually worse!<br>(MAE1&lt;-mean(abs(meanest-wagetest)) )<br>(MAE2&lt;-mean( abs(median-wagetest)))<br><br>MAE1 &lt; MAE2<br>### Method 2 (the true median estimator) is better than method 1!</pre><p>This shows what is known theoretically: MSE is minimized for the (conditional) expectation <em>E[Y | </em><strong><em>X</em></strong><em>=</em><strong><em>x</em></strong><em>]</em>, while MAE is minimized at the conditional median. <em>In general, it does not make sense to use the MAE when you try to evaluate your mean prediction. </em>In a lot of applied research and data science, people use the MAE or both to evaluate mean predictions (I know because I did it myself). While this may be warranted in certain applications, this can have serious consequences for distributions that are not symmetric, as we saw in this example: When looking at the MAE, method 1 looks worse than method 2, even though the former estimates the mean correctly. In fact, because mean and median are far apart in this highly skewed example, method 2 has a markedly lower MAE than method 1, despite method 1 being the correct mean prediction.</p><blockquote>To score conditional mean predictions, use the mean squared error (MSE) and not the mean absolute error (MAE). The MAE is minimized for the conditional median.</blockquote><h4>Scores for quantile and interval prediction</h4><p>Assume we want to score an estimate <em>f(</em><strong><em>x</em></strong><em>)</em> of the quantile <em>q_</em><strong><em>x</em></strong> such that</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/410/1*vlr6vLIH4yICPioIitU25w.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xE6hqSj7OOL1MQZngjwtrA.png" /><figcaption>Simple quantile illustration. 
Source: Author</figcaption></figure><p>In this case, we can consider the quantile score:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Hx2gl1j94UK1hL7FKfVCZw.png" /></figure><p>whereby</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/563/1*vTQp7nGHv3sk1ajnbi9Byg.png" /></figure><p>To unpack this formula, we can consider two cases:</p><p>(1) <em>y</em> is smaller than <em>f(</em><strong><em>x):</em></strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/498/1*RZfxK2xu6v_3UzuQ8-Mmpw.png" /></figure><p>i.e. we incur a penalty which gets bigger the further away <em>y</em> is from <em>f(</em><strong><em>x).</em></strong></p><p>(2) <em>y</em> is larger than <em>f(</em><strong><em>x):</em></strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/408/1*5gHjeAsBZwLtDKG_4CwnoQ.png" /></figure><p>i.e. a penalty which gets bigger the further away <em>y</em> is from <em>f(</em><strong><em>x).</em></strong></p><p>Notice that the weight is such that for a high <em>alpha</em>, having the estimated quantile <em>f(</em><strong><em>x</em></strong><em>)</em> smaller than <em>y</em> gets penalized more. This is by design and ensures that the right quantile is indeed the minimizer of the expected value of <em>S(y,f(</em><strong><em>x</em></strong><em>))</em> over y. This score is in fact the <em>quantile loss </em>(up to a factor 2), see e.g. this <a href="https://towardsdatascience.com/quantile-loss-and-quantile-regression-b0689c13f54d">nice article</a>. It is implemented in the <em>quantile_score </em>function of the package <em>scoringutils</em> in R. Finally, note that for <em>alpha=0.5:</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*KAU-FXaNpvtSH19oh2EQUQ.png" /></figure><p>simply the MAE! This makes sense, as the 0.5 quantile is the median.</p><p>With the power to predict quantiles, we can also build prediction intervals. Consider (<em>l_</em><strong><em>x</em></strong><em>, u_</em><strong><em>x)</em></strong>, where <em>l_</em><strong><em>x</em></strong> ≤ <em>u_</em><strong><em>x</em></strong> are quantiles such that</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/604/1*ZY_F-1xKiDt3XxQwzA_aOw.png" /></figure><p>In fact, this is met if <em>l_</em><strong><em>x</em></strong><em> </em>is<em> </em>the <em>alpha/2</em> quantile, and <em>u_</em><strong><em>x</em></strong> is the <em>1-alpha/2</em> quantile. Thus we now estimate and score these two quantiles. Consider <em>f(</em><strong><em>x</em></strong><em>)=(f_1(</em><strong><em>x</em></strong><em>), f_2(</em><strong><em>x</em></strong><em>))</em>, whereby <em>f_1(</em><strong><em>x</em></strong><em>)</em> to be an estimate of <em>l_</em><strong><em>x</em></strong><em> </em>and <em>f_2(</em><strong><em>x</em></strong><em>)</em> an estimate of <em>u_</em><strong><em>x. 
</em></strong>We provide two estimators, the “ideal” one that simulates again from the true process to then estimate the required quantiles and a “naive” one, which has the right coverage but is too big:</p><pre>library(scoringutils)<br><br>## Define conditional quantile estimation<br>conditionalquantileest &lt;-<br>  function(probs, age, education, experience, N = 1000) {<br>    quantile(exp((age * 0.1) + (<br>      case_when(<br>        education == &quot;High School&quot; ~ 1,<br>        education == &quot;Bachelor&#39;s&quot; ~ 1.5,<br>        TRUE ~ 2<br>      )<br>    ) + (experience * 0.05) + rnorm(N, mean = 0, sd = 0.5)<br>    )<br>    , probs =<br>      probs)<br>  }<br><br>## Define a very naive estimator that will still have the required coverage<br>lowernaive &lt;- 0<br>uppernaive &lt;- max(wage)<br><br># Define the quantile of interest<br>alpha &lt;- 0.05<br><br>lower &lt;-<br>  sapply(1:nrow(Xtest), function(j)<br>    conditionalquantileest(alpha / 2, Xtest$age[j], Xtest$education[j], Xtest$experience[j]))<br>upper &lt;-<br>  sapply(1:nrow(Xtest), function(j)<br>    conditionalquantileest(1 - alpha / 2, Xtest$age[j], Xtest$education[j], Xtest$experience[j]))<br><br><br><br>## Calculate the scores for both estimators<br><br># 1. Score the alpha/2 quantile estimate<br>qs_lower &lt;- mean(quantile_score(wagetest,<br>                           predictions = lower,<br>                           quantiles = alpha / 2))<br># 2. Score the alpha/2 quantile estimate<br>qs_upper &lt;- mean(quantile_score(wagetest,<br>                           predictions = upper,<br>                           quantiles = 1 - alpha / 2))<br><br># 1. Score the alpha/2 quantile estimate<br>qs_lowernaive &lt;- mean(quantile_score(wagetest,<br>                                predictions = rep(lowernaive, ntest),<br>                                quantiles = alpha / 2))<br># 2. Score the alpha/2 quantile estimate<br>qs_uppernaive &lt;- mean(quantile_score(wagetest,<br>                                predictions = rep(uppernaive, ntest),<br>                                quantiles = 1 - alpha / 2))<br><br># Construct the interval score by taking the average<br>(interval_score &lt;- (qs_lower + qs_upper) / 2)<br># Score of the ideal estimator: 187.8337<br><br># Construct the interval score by taking the average<br>(interval_scorenaive &lt;- (qs_lowernaive + qs_uppernaive) / 2)<br># Score of the naive estimator: 1451.464</pre><p>Again we can clearly see that, on average, the correct estimator has a much lower score than the naive one!</p><p>Thus with the quantile score, we have a reliable way of scoring individual quantile predictions. However, the way of averaging the score of the upper and lower quantiles for the prediction interval might seem ad hoc. Luckily it turns out that this leads to the so-called <em>interval score</em>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*OYOG3EFagAWDm02ZSefZ7Q.png" /></figure><p>Thus through some algebraic magic, we can score a prediction interval by averaging the scores for the <em>alpha/2</em> and the <em>1-alpha/2</em> quantiles as we did. Interestingly, the resulting interval score rewards narrow prediction intervals, and induces a penalty, the size of which depends on <em>alpha</em>, if the observation misses the interval. 
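</p><p>To see this behavior concretely, here is a minimal sketch of the (unweighted) interval score for a single observation, with made-up numbers; the scoringutils function used below applies an additional weighting when weigh = TRUE, which is why it matches the average of the two quantile scores rather than the raw formula:</p><pre>## Minimal sketch: interval score for one observation y and a central (1-alpha) interval [l, u]
interval_score_single &lt;- function(y, l, u, alpha) {
  (u - l) +
    (2 / alpha) * (l - y) * (y &lt; l) +   # penalty if y falls below the interval
    (2 / alpha) * (y - u) * (y &gt; u)     # penalty if y falls above the interval
}

interval_score_single(y = 5,  l = 4, u = 10, alpha = 0.05)  # inside the interval: just the width, 6
interval_score_single(y = 12, l = 4, u = 10, alpha = 0.05)  # outside: 6 + (2/0.05)*2 = 86</pre><p>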
Instead of using the average of quantile scores, we can also directly calculate this score with the package <em>scoringutils</em>:</p><pre>alpha &lt;- 0.05<br>mean(interval_score(<br>  wagetest,<br>  lower=lower,<br>  upper=upper,<br>  interval_range=(1-alpha)*100,<br>  weigh = T,<br>  separate_results = FALSE<br>))<br># Score of the ideal estimator: 187.8337</pre><p>This is the exact same number we got above when averaging the scores of the two quantiles.</p><blockquote>The quantile score implemented in R in the package scoringutils can be used to score quantile predictions. If one wants to score a prediction interval directly, the interval_score function can be used.</blockquote><h4>Scores for distributional prediction</h4><p>More and more fields have to deal with <em>distributional prediction</em>. Luckily, there are even scores for this problem. In particular, here I focus on what is called the <em>energy score:</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ipPWmgZwD83RMCDTgiAF7Q.png" /></figure><p>for <em>f(</em><strong><em>x</em></strong><em>)</em> being an estimate of the distribution <em>P(Y | </em><strong><em>X</em></strong><em>=</em><strong><em>x</em></strong><em>). </em>The second term takes the expectation of the Euclidean distance between two independent samples from <em>f(</em><strong><em>x</em></strong><em>). </em>This acts as a normalizing term, establishing the value obtained if the same distribution were compared to itself. The first term then compares the sample point <em>y</em> to a draw <em>X</em> from <em>f(</em><strong><em>x</em></strong><em>). </em>In expectation (over <em>Y </em>drawn from<em> P(Y | </em><strong><em>X</em></strong><em>=</em><strong><em>x</em></strong><em>))</em> this will be minimized if <em>f(</em><strong><em>x</em></strong><em>)=P(Y | </em><strong><em>X</em></strong><em>=</em><strong><em>x</em></strong><em>).</em></p><p>Thus, instead of just predicting the mean or the quantiles, we now try to predict the whole distribution of wage at each test point. Essentially, we try to predict and evaluate the conditional distribution we plotted for Dave above. This is a bit more complicated; how exactly do we represent a learned distribution? In practice, this is resolved by assuming we can obtain a sample from the predicted distribution. Thus we compare a sample of <em>N</em> draws from the predicted distribution to a single test point.
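</p><p>To see what this score does mechanically, here is a minimal sketch of the sample-based energy score for a single scalar observation, following the formula above (the first term minus half the second); the packaged estimator used next may treat the pairwise term slightly differently:</p><pre>## Sketch: hand-rolled sample-based energy score for one scalar observation y<br>## and a sample yhat drawn from the predicted distribution<br>energy_score_manual &lt;- function(y, yhat) {<br>  term1 &lt;- mean(abs(yhat - y))                 # estimates E|X - y|, X drawn from f(x)<br>  term2 &lt;- mean(abs(outer(yhat, yhat, &quot;-&quot;)))   # estimates E|X - X&#39;|, X, X&#39; drawn from f(x)<br>  term1 - 0.5 * term2<br>}</pre><p>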
This can be done in R using <em>es_sample </em>from the <em>scoringRules </em>package:</p><pre>library(scoringRules)<br><br>## Ideal &quot;estimate&quot;: Simply sample from the true conditional distribution<br>## P(Y | X=x) for each test point x<br>distributionestimate &lt;-<br>  function(age, education, experience, N = 100) {<br>    exp((age * 0.1) + (<br>      case_when(<br>        education == &quot;High School&quot; ~ 1,<br>        education == &quot;Bachelor&#39;s&quot; ~ 1.5,<br>        TRUE ~ 2<br>      )<br>    ) + (experience * 0.05) + rnorm(N, mean = 0, sd = 0.5))<br>  }<br><br>## Naive Estimate: Only sample from the error distribution, without including the<br>## information of each person.<br>distributionestimatenaive &lt;-<br>  function(age, education, experience, N = 100) {<br>    exp(rnorm(N, mean = 0, sd = 0.5))<br>  }<br><br>scoretrue &lt;- mean(sapply(1:nrow(Xtest), function(j)  {<br>  wageest &lt;-<br>    distributionestimate(Xtest$age[j], Xtest$education[j], Xtest$experience[j])<br>  return(scoringRules::es_sample(y = wagetest[j], dat = matrix(wageest, nrow=1)))<br>}))<br><br>scorenaive &lt;- mean(sapply(1:nrow(Xtest), function(j)  {<br>  wageest &lt;-<br>    distributionestimatenaive(Xtest$age[j], Xtest$education[j], Xtest$experience[j])<br>  return(scoringRules::es_sample(y = wagetest[j], dat = matrix(wageest, nrow=1)))<br>}))<br><br>## scoretrue: 761.026<br>## scorenaive: 2624.713</pre><p>In the above code, we again compare the “perfect” estimate (i.e. sampling from the true distribution <em>P(Y | </em><strong><em>X</em></strong><em>=</em><strong><em>x</em></strong><em>)) </em>to a very naive one, namely one that does not consider any information on age, education, or experience. Again, the score reliably identifies the better of the two methods.</p><blockquote>The energy score, implemented in the R package scoringRules, can be used to score distributional predictions if a sample from the predicted distribution is available.</blockquote><h4>Conclusion</h4><p>We have looked at different ways of scoring predictions. Thinking about the right measure to test predictions is important, as the wrong measure might make us choose and keep the wrong model for our prediction task.</p><p>It should be noted that, especially for distributional prediction, scoring is a difficult task and the score might not have much power in practice. That is, even a method that leads to a large improvement might only have a slightly smaller score.
However, this is not a problem per se, as long as the score is able to reliably identify the better of the two methods.</p><h4>References</h4><p>[1] Tilmann Gneiting &amp; Adrian E Raftery (2007) Strictly Proper Scoring Rules, Prediction, and Estimation, Journal of the American Statistical Association, 102:477, 359–378, DOI: <a href="https://doi.org/10.1198/016214506000001437">10.1198/016214506000001437</a></p><h4>Appendix: All the code in one place</h4><p>This file can also be found on <a href="https://github.com/JeffNaef/Medium-Articles/blob/main/HowtoScoreprediction.R">Github</a>.</p><pre>library(dplyr)<br><br>#Create some variables:<br># Simulate data for 100 individuals<br>n &lt;- 5000<br><br># Generate age between 20 and 60<br>age &lt;- round(runif(n, min = 20, max = 60))<br><br># Define education levels<br>education_levels &lt;- c(&quot;High School&quot;, &quot;Bachelor&#39;s&quot;, &quot;Master&#39;s&quot;)<br><br># Simulate education level probabilities<br>education_probs &lt;- c(0.4, 0.4, 0.2)<br><br># Sample education level based on probabilities<br>education &lt;- sample(education_levels, n, replace = TRUE, prob = education_probs)<br><br># Simulate experience correlated with age with some random error<br>experience &lt;- age - 20 + round(rnorm(n, mean = 0, sd = 3)) <br><br># Define a non-linear function for wage<br>wage &lt;- exp((age * 0.1) + (case_when(education == &quot;High School&quot; ~ 1,<br>                                     education == &quot;Bachelor&#39;s&quot; ~ 1.5,<br>                                     TRUE ~ 2)) + (experience * 0.05) + rnorm(n, mean = 0, sd = 0.5))<br><br>hist(wage)<br><br><br><br>ageDave&lt;-30<br>educationDave&lt;-&quot;Bachelor&#39;s&quot;<br>experienceDave &lt;- 10<br><br>wageDave &lt;- exp((ageDave * 0.1) + (case_when(educationDave == &quot;High School&quot; ~ 1,<br>                                             educationDave == &quot;Bachelor&#39;s&quot; ~ 1.5,<br>                                             TRUE ~ 2)) + (experienceDave * 0.05) + rnorm(n, mean = 0, sd = 0.5))<br><br>hist(wageDave, main=&quot;Wage Distribution for Dave&quot;, xlab=&quot;Wage&quot;)<br><br><br><br>## Generate test set<br>ntest&lt;-1000<br><br># Generate age between 20 and 60<br>agetest &lt;- round(runif(ntest, min = 20, max = 60))<br><br># Sample education level based on probabilities<br>educationtest &lt;- sample(education_levels, ntest, replace = TRUE, prob = education_probs)<br><br># Simulate experience correlated with age with some random error<br>experiencetest &lt;- agetest - 20 + round(rnorm(ntest, mean = 0, sd = 3))<br><br>## Generate ytest that we try to predict:<br><br>wagetest &lt;- exp((agetest * 0.1) + (case_when(educationtest == &quot;High School&quot; ~ 1,<br>                                             educationtest == &quot;Bachelor&#39;s&quot; ~ 1.5,<br>                                             TRUE ~ 2)) + (experiencetest * 0.05) + rnorm(ntest, mean = 0, sd = 0.5))<br><br><br><br><br><br>conditionalmeanest &lt;-<br>  function(age, education, experience, N = 1000) {<br>    mean(exp((age * 0.1) + (<br>      case_when(<br>        education == &quot;High School&quot; ~ 1,<br>        education == &quot;Bachelor&#39;s&quot; ~ 1.5,<br>        TRUE ~ 2<br>      )<br>    ) + (experience * 0.05) + rnorm(N, mean = 0, sd = 0.5)<br>    ))<br>  }<br><br>conditionalmedianest &lt;-<br>  function(age, education, experience, N = 1000) {<br>    median(exp((age * 0.1) + (<br>      case_when(<br>        education == &quot;High School&quot; ~ 1,<br>        
education == &quot;Bachelor&#39;s&quot; ~ 1.5,<br>        TRUE ~ 2<br>      )<br>    ) + (experience * 0.05) + rnorm(N, mean = 0, sd = 0.5)<br>    ))<br>  }<br><br><br>hist(wageDave, main=&quot;Wage Distribution for Dave&quot;, xlab=&quot;Wage&quot;)<br>abline(v=conditionalmeanest(ageDave, educationDave, experienceDave), col=&quot;darkred&quot;, cex=1.2)<br>abline(v=conditionalmedianest(ageDave, educationDave, experienceDave), col=&quot;darkblue&quot;, cex=1.2)<br><br><br><br>Xtest&lt;-data.frame(age=agetest, education=educationtest, experience=experiencetest)<br><br>meanest&lt;-sapply(1:nrow(Xtest), function(j)  conditionalmeanest(Xtest$age[j], Xtest$education[j], Xtest$experience[j])  )<br>median&lt;-sapply(1:nrow(Xtest), function(j)  conditionalmedianest(Xtest$age[j], Xtest$education[j], Xtest$experience[j])  )<br><br><br><br>(MSE1&lt;-mean((meanest-wagetest)^2))<br>(MSE2&lt;-mean((median-wagetest)^2))<br><br>MSE1 &lt; MSE2<br>### Method 1 (the true mean estimator) is better than method 2!<br><br># but the MAE is actually worse of method 1!<br>(MAE1&lt;-mean(abs(meanest-wagetest)) )<br>(MAE2&lt;-mean( abs(median-wagetest)))<br><br>MAE1 &lt; MAE2<br>### Method 2 (the true median estimator) is better than method 1!<br><br><br><br><br><br><br><br><br>library(scoringutils)<br><br>## Define conditional quantile estimation<br>conditionalquantileest &lt;-<br>  function(probs, age, education, experience, N = 1000) {<br>    quantile(exp((age * 0.1) + (<br>      case_when(<br>        education == &quot;High School&quot; ~ 1,<br>        education == &quot;Bachelor&#39;s&quot; ~ 1.5,<br>        TRUE ~ 2<br>      )<br>    ) + (experience * 0.05) + rnorm(N, mean = 0, sd = 0.5)<br>    )<br>    , probs =<br>      probs)<br>  }<br><br>## Define a very naive estimator that will still have the required coverage<br>lowernaive &lt;- 0<br>uppernaive &lt;- max(wage)<br><br># Define the quantile of interest<br>alpha &lt;- 0.05<br><br>lower &lt;-<br>  sapply(1:nrow(Xtest), function(j)<br>    conditionalquantileest(alpha / 2, Xtest$age[j], Xtest$education[j], Xtest$experience[j]))<br>upper &lt;-<br>  sapply(1:nrow(Xtest), function(j)<br>    conditionalquantileest(1 - alpha / 2, Xtest$age[j], Xtest$education[j], Xtest$experience[j]))<br><br>## Calculate the scores for both estimators<br><br># 1. Score the alpha/2 quantile estimate<br>qs_lower &lt;- mean(quantile_score(wagetest,<br>                                predictions = lower,<br>                                quantiles = alpha / 2))<br># 2. Score the alpha/2 quantile estimate<br>qs_upper &lt;- mean(quantile_score(wagetest,<br>                                predictions = upper,<br>                                quantiles = 1 - alpha / 2))<br><br># 1. Score the alpha/2 quantile estimate<br>qs_lowernaive &lt;- mean(quantile_score(wagetest,<br>                                     predictions = rep(lowernaive, ntest),<br>                                     quantiles = alpha / 2))<br># 2. 
Score the alpha/2 quantile estimate<br>qs_uppernaive &lt;- mean(quantile_score(wagetest,<br>                                     predictions = rep(uppernaive, ntest),<br>                                     quantiles = 1 - alpha / 2))<br><br># Construct the interval score by taking the average<br>(interval_score &lt;- (qs_lower + qs_upper) / 2)<br># Score of the ideal estimator: 187.8337<br><br># Construct the interval score by taking the average<br>(interval_scorenaive &lt;- (qs_lowernaive + qs_uppernaive) / 2)<br># Score of the naive estimator: 1451.464<br><br><br>library(scoringRules)<br><br>## Ideal &quot;estimate&quot;: Simply sample from the true conditional distribution <br>## P(Y | X=x) for each sample point x<br>distributionestimate &lt;-<br>  function(age, education, experience, N = 100) {<br>    exp((age * 0.1) + (<br>      case_when(<br>        education == &quot;High School&quot; ~ 1,<br>        education == &quot;Bachelor&#39;s&quot; ~ 1.5,<br>        TRUE ~ 2<br>      )<br>    ) + (experience * 0.05) + rnorm(N, mean = 0, sd = 0.5))<br>  }<br><br>## Naive Estimate: Only sample from the error distribution, without including the <br>## information of each person.<br>distributionestimatenaive &lt;-<br>  function(age, education, experience, N = 100) {<br>    exp(rnorm(N, mean = 0, sd = 0.5))<br>  }<br><br>scoretrue &lt;- mean(sapply(1:nrow(Xtest), function(j)  {<br>  wageest &lt;-<br>    distributionestimate(Xtest$age[j], Xtest$education[j], Xtest$experience[j])<br>  return(scoringRules::es_sample(y = wagetest[j], dat = matrix(wageest, nrow=1)))<br>}))<br><br>scorenaive &lt;- mean(sapply(1:nrow(Xtest), function(j)  {<br>  wageest &lt;-<br>    distributionestimatenaive(Xtest$age[j], Xtest$education[j], Xtest$experience[j])<br>  return(scoringRules::es_sample(y = wagetest[j], dat = matrix(wageest, nrow=1)))<br>}))<br><br>## scoretrue: 761.026<br>## scorenaive: 2624.713<br><br></pre><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=cef80d8f6a69" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/how-to-evaluate-your-predictions-cef80d8f6a69">How to Evaluate Your Predictions</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Random Forests in 2023: Modern Extensions of a Powerful Method]]></title>
            <link>https://medium.com/data-science/random-forests-in-2023-modern-extensions-of-a-powerful-method-b62debaf1d62?source=rss-ca780798011a------2</link>
            <guid isPermaLink="false">https://medium.com/p/b62debaf1d62</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[thoughts-and-theory]]></category>
            <category><![CDATA[random-forest]]></category>
            <category><![CDATA[math]]></category>
            <category><![CDATA[algorithms]]></category>
            <dc:creator><![CDATA[Jeffrey Näf]]></dc:creator>
            <pubDate>Tue, 07 Nov 2023 19:04:16 GMT</pubDate>
            <atom:updated>2026-01-21T22:37:26.120Z</atom:updated>
            <content:encoded><![CDATA[<h4>Random Forests came a long way</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*FBb3Z6GdjDWpqxMSlTHkpA.png" /><figcaption>Features of modern Random Forest methods. Source: Author.</figcaption></figure><p>In terms of Machine Learning timelines, Random Forests (RFs), introduced in the seminal paper of Breimann ([1]), are ancient. Despite their age, they keep impressing with their performance and are a topic of active research. The goal of this article is to highlight what a versatile toolbox Random Forest methods have become, focussing on <strong>Generalized Random Forest (GRF)</strong> and <strong>Distributional Random Forest (DRF)</strong>.</p><p>In short, the main idea underlying both methods is that the weights implicitly produced by RF can be used to estimate targets other than the conditional expectation. The idea of GRF is to use a Random Forest with a splitting criterion that is adapted to the target one has in mind (e.g., conditional mean, conditional quantiles, or the conditional treatment effect). The idea of DRF is to adapt the splitting criterion such that the whole conditional distribution can be estimated. From this object, many different targets can then be derived in a second step. In fact, I mostly talk about DRF in this article, as I am more familiar with this method and it is somewhat more streamlined (only one forest has to be fitted for a wide range of targets). However, all the advantages, indicated in the figure above, also apply to GRF and in fact, the DRF package in R is built upon the <a href="https://grf-labs.github.io/grf/">professional implementation of GRF</a>. Moreover, the fact that the splitting criterion of GRF forests is adapted to the target means it can have better performance than DRF. This is particularly true for binary <em>Y</em>, where probability_forests() should be used. So, even though I talk mostly about DRF, GRF should be kept in mind throughout this article.</p><p>The goal of this article is to provide an overview with links to deeper reading in the corresponding sections. We will go through each of the points in the above figure clock-wise, reference the corresponding articles, and highlight it with a little example. I first quickly summarize the most important links to further reading below:</p><p>Versatility/Performance: <a href="https://towardsdatascience.com/drf-a-random-forest-for-almost-everything-625fa5c3bcb8">Medium Article</a> and Original Papers (<a href="https://jmlr.org/papers/v23/21-0585.html">DRF</a>/<a href="https://projecteuclid.org/journals/annals-of-statistics/volume-47/issue-2/Generalized-random-forests/10.1214/18-AOS1709.full">GRF</a>)</p><p>Missing Values Incorporated: <a href="https://towardsdatascience.com/random-forests-and-missing-values-3daaea103db0">Medium Article</a></p><p>Uncertainty Measures: <a href="https://towardsdatascience.com/inference-for-distributional-random-forests-64610bbb3927">Medium Article</a></p><p>Variable Importance: <a href="https://medium.com/towards-data-science/variable-importance-in-random-forests-20c6690e44e0">Medium Article</a></p><blockquote>The full code for this article can be found on <a href="https://github.com/JeffNaef/Medium-Articles/blob/main/RF_2023.R">Github</a>.</blockquote><h4>The Example</h4><p>We take <em>X_1, X_2, X_4, …, X_10</em> independently uniform between (-1,1) and create dependence between <em>X_1</em> and <em>X_3</em> by taking <em>X_3=X_1 + uniform error</em>. 
Then we simulate <em>Y</em> as</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/800/1*fVFn4lfuSjRoVZRsgdLRLA.png" /></figure><pre>## Load packages and functions needed<br>library(drf)<br>library(mice)<br><br>## Set parameters<br>set.seed(10)<br>n&lt;-1000<br><br>##Simulate Data that experiences both a mean as well as sd shift<br># Simulate from X<br>x1 &lt;- runif(n,-1,1)<br>x2 &lt;- runif(n,-1,1)<br>x3 &lt;- x1+ runif(n,-1,1)<br>X0 &lt;- matrix(runif(7*n,-1,1), nrow=n, ncol=7)<br>Xfull &lt;- cbind(x1,x2, x3, X0)<br>colnames(Xfull)&lt;-paste0(&quot;X&quot;, 1:10)<br><br># Simulate dependent variable Y<br>Y &lt;- as.matrix(rnorm(n,mean = 0.8*(x1 &gt; 0), sd = 1 + 1*(x2 &gt; 0)))<br><br>##Also add MAR missing values using ampute from the mice package<br>X&lt;-ampute(Xfull)$amp<br><br>head(cbind(Y,X))<br><br>              Y          X1          X2          X3          X4         X5<br>[1,]  2.8230938  0.01495641  0.61541265  0.72124459  0.22751381 -0.2418149<br>[2,] -0.5506897 -0.38646299  0.42035842 -1.02849810 -0.78525849 -0.8512734<br>[3,]  0.1359363 -0.14618467          NA  0.07900811  0.03722511  0.5516002<br>[4,]  2.7453314  0.38620416  0.01956594  1.17097982  0.76213426  0.9813759<br>[5,]  0.3702340 -0.82972806  0.13751720 -0.63962047 -0.23973421 -0.4339398<br>[6,] -0.3955091 -0.54912677 -0.33686360 -0.76832645 -0.55324807 -0.4658600<br>             X6         X7         X8         X9        X10<br>[1,] -0.4815087  0.3548534  0.2010427 -0.8215880  0.4515748<br>[2,] -0.4561582 -0.8561308  0.3944132  0.5845884 -0.1409658<br>[3,] -0.2494117  0.1858109 -0.7487566  0.8355041  0.6216888<br>[4,]  0.6688906 -0.4936409 -0.2557489         NA -0.8048351<br>[5,]  0.5194512  0.4799429 -0.8374205  0.3405841 -0.9950487<br>[6,]  0.7471006  0.8961717 -0.0435499  0.4200485  0.6398618</pre><p>Notice that with the function ampute from the <a href="https://cran.r-project.org/web/packages/mice/index.html">mice package</a>, we put <strong>Missing at Random (MAR)</strong> missing values on <strong><em>X</em></strong> to highlight the ability of GRF/DRF to deal with missing values. Moreover, in the above process only <em>X_1</em> and <em>X_2</em> are relevant for predicting <em>Y</em>; all other variables are “noise” variables. Such a “sparse” setting might actually be common in real-life datasets.</p><p>We now choose a test point for this example that we will use throughout:</p><pre>x&lt;-matrix(c(0.2, 0.4, runif(8,-1,1)), nrow=1, ncol=10)<br>print(x)<br><br>     [,1] [,2]      [,3]      [,4]      [,5]      [,6]    [,7]      [,8]<br>[1,]  0.2  0.4 0.7061058 0.8364877 0.2284314 0.7971179 0.78581 0.5310279<br>           [,9]     [,10]<br>[1,] -0.5067102 0.6918785</pre><h4>Versatility</h4><p>DRF estimates the conditional distribution <em>P_{Y|</em><strong><em>X</em></strong><em>=</em><strong><em>x</em></strong><em>}</em><strong> </strong>in the form of simple weights:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/533/1*IoIqGJmp72jyZFDKeppCbQ.png" /></figure><p>From these weights, a wide range of targets can be calculated, or they can be used to simulate from the conditional distribution.
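</p><p>As a minimal sketch of what this means in practice (assuming a fitted drf object <em>DRF</em> and the one-row test point <em>x</em>, both defined in the surrounding code), any weighted functional of the training responses becomes an estimate of the corresponding conditional target:</p><pre>## Sketch: targets as weighted functionals of the training responses Y,<br>## using the weights returned by predict() for the test point x<br>w &lt;- predict(DRF, newdata=x)$weights[1,]<br>sum(w * Y)           # estimate of the conditional mean E[Y | X=x]<br>sum(w * (Y &lt;= 0))    # estimate of the conditional probability P(Y &lt;= 0 | X=x)</pre><p>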
A good reference for its versatility is the original <a href="https://jmlr.org/papers/v23/21-0585.html">research article</a>, where a lot of examples were used, as well as the corresponding <a href="https://towardsdatascience.com/drf-a-random-forest-for-almost-everything-625fa5c3bcb8">medium article</a>.</p><p>In the example, we first simulate from this distribution:</p><pre>DRF&lt;-drf(X=X, Y=Y, ci.group.size=2000/50,num.trees=1000, min.node.size = 5)<br>DRFpred&lt;-predict(DRF, newdata=x, estimate.uncertainty=TRUE)<br><br>## Sample from P_{Y| X=x}<br>Yxs&lt;-Y[sample(1:n, size=n, replace = T, DRFpred$weights[1,])]<br>hist(Yxs, prob=T)<br>z&lt;-seq(-6,7,by=0.01)<br>d&lt;-dnorm(z, mean=0.8 * (x[1] &gt; 0), sd=(1+(x[2] &gt; 0)))<br>lines(z,d, col=&quot;darkred&quot;  )</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*iPArBRb5TKPHk1iShEcvWg.png" /><figcaption>Histogram of the simulated conditional distribution overlaid with the true density (in red). Source: Author.</figcaption></figure><p>The plot shows the approximate draws from the conditional distribution overlaid with the true density in red. We now use this to estimate the <strong>conditional expectation</strong> and the <strong>conditional (0.05, 0.95) quantiles</strong> at <strong><em>x:</em></strong></p><pre># Calculate quantile prediction as weighted quantiles from Y<br>qx &lt;- quantile(Yxs, probs = c(0.05,0.95))<br><br># Calculate conditional mean prediction<br>mux &lt;- mean(Yxs)<br><br># True quantiles<br>q1&lt;-qnorm(0.05, mean=0.8 * (x[1] &gt; 0), sd=(1+(x[2] &gt; 0)))<br>q2&lt;-qnorm(0.95, mean=0.8 * (x[1] &gt; 0), sd=(1+(x[2] &gt; 0)))<br>mu&lt;-0.8 * (x[1] &gt; 0)<br><br>hist(Yxs, prob=T)<br>z&lt;-seq(-6,7,by=0.01)<br>d&lt;-dnorm(z, mean=0.8 * (x[1] &gt; 0), sd=(1+(x[2] &gt; 0)))<br>lines(z,d, col=&quot;darkred&quot;  )<br>abline(v=q1,col=&quot;darkred&quot; )<br>abline(v=q2, col=&quot;darkred&quot; )<br>abline(v=qx[1], col=&quot;darkblue&quot;)<br>abline(v=qx[2], col=&quot;darkblue&quot;)<br>abline(v=mu, col=&quot;darkred&quot;)<br>abline(v=mux, col=&quot;darkblue&quot;)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*er3AqUPXIkth0Ifrp9OF5w.png" /><figcaption>Histogram of the simulated conditional distribution overlaid with the true density (in red). Additionally, the estimated conditional expectation and the conditional (0.05, 0.95) quantiles are in blue, with true values in red. Source: Author.</figcaption></figure><p>Likewise, many targets can be calculated with GRF, only in this case for each of those two targets a different forest would need to be fit. In particular, <em>regression_forest()</em> for the conditional expectation and <em>quantile_forest()</em> for the quantiles.</p><p>What GRF cannot do is deal with multivariate targets, which is possible with DRF as well.</p><h4>Performance</h4><p>Despite all the work on powerful new (nonparametric) methods, such as neural nets, tree-based methods are consistently able to beat competitors on tabular data. 
See e.g., this <a href="https://arxiv.org/abs/2207.08815">fascinating paper</a>, or this older paper on the <a href="https://dl.acm.org/doi/pdf/10.5555/2627435.2697065">strength of RF in classification</a>.</p><p>To be fair, with parameter tuning, <a href="https://towardsdatascience.com/an-overview-of-boosting-methods-catboost-xgboost-adaboost-lightboost-histogram-based-gradient-407447633ac1">boosted tree methods</a>, such as XGboost, often take the lead, at least when it comes to classical prediction (which corresponds to conditional expectation estimation). Nonetheless, the robust performance RF methods tend to have without any tuning is remarkable. Moreover, there has also been work on improving the performance of Random Forests, for example, the <a href="https://arxiv.org/pdf/2308.15384.pdf">hedged Random Forest approach</a>.</p><h4>Missing Values Incorporated</h4><p>“Missing incorporated in attributes criterion” (MIA) from <a href="https://www.sciencedirect.com/science/article/abs/pii/S0167865508000305">this paper</a> is a very simple but very powerful idea that allows tree-based methods to handle missing data. This was implemented in the GRF R package and so it is also available in DRF. The details are also explained in this <a href="https://towardsdatascience.com/random-forests-and-missing-values-3daaea103db0">medium article</a>. As simple as the concept is, it works remarkably well in practice: In the above example, DRF had no trouble handling substantial MAR missingness in the training data <strong><em>X</em></strong> (!)</p><h4><strong>Uncertainty Measures</strong></h4><p>As a statistician I don’t just want point estimates (even of a distribution), but also a measure of <strong>estimation uncertainty</strong> of my parameters (even if the “parameter” is my whole distribution). Turns out that a simple additional subsampling baked into DRF/GRF allows for a principled uncertainty quantification for large sample sizes. The theory behind this in the case of DRF is derived in this <a href="https://arxiv.org/abs/2302.05761">research article</a>, but I also explain it in this <a href="https://towardsdatascience.com/inference-for-distributional-random-forests-64610bbb3927">medium article</a>. 
GRF has all the theory in the <a href="https://projecteuclid.org/journals/annals-of-statistics/volume-47/issue-2/Generalized-random-forests/10.1214/18-AOS1709.full">original paper</a>.</p><p>We adapt this for the above example:</p><pre># Calculate uncertainty<br>alpha&lt;-0.05<br>B&lt;-nrow(DRFpred$weights.uncertainty[[1]])<br>qxb&lt;-matrix(NaN, nrow=B, ncol=2)<br>muxb&lt;-matrix(NaN, nrow=B, ncol=1)<br>for (b in 1:B){<br>  Yxsb&lt;-Y[sample(1:n, size=n, replace = T, DRFpred$weights.uncertainty[[1]][b,])]<br>  qxb[b,] &lt;- quantile(Yxsb, probs = c(0.05,0.95))<br>  muxb[b] &lt;- mean(Yxsb)<br>}<br><br>CI.lower.q1 &lt;- qx[1] - qnorm(1-alpha/2)*sqrt(var(qxb[,1]))<br>CI.upper.q1 &lt;- qx[1] + qnorm(1-alpha/2)*sqrt(var(qxb[,1]))<br><br>CI.lower.q2 &lt;- qx[2] - qnorm(1-alpha/2)*sqrt(var(qxb[,2]))<br>CI.upper.q2 &lt;- qx[2] + qnorm(1-alpha/2)*sqrt(var(qxb[,2]))<br><br>CI.lower.mu &lt;- mux - qnorm(1-alpha/2)*sqrt(var(muxb))<br>CI.upper.mu &lt;- mux + qnorm(1-alpha/2)*sqrt(var(muxb))<br><br>hist(Yxs, prob=T)<br>z&lt;-seq(-6,7,by=0.01)<br>d&lt;-dnorm(z, mean=0.8 * (x[1] &gt; 0), sd=(1+(x[2] &gt; 0)))<br>lines(z,d, col=&quot;darkred&quot;  )<br>abline(v=q1,col=&quot;darkred&quot; )<br>abline(v=q2, col=&quot;darkred&quot; )<br>abline(v=qx[1], col=&quot;darkblue&quot;)<br>abline(v=qx[2], col=&quot;darkblue&quot;)<br>abline(v=mu, col=&quot;darkred&quot;)<br>abline(v=mux, col=&quot;darkblue&quot;)<br>abline(v=CI.lower.q1, col=&quot;darkblue&quot;, lty=2)<br>abline(v=CI.upper.q1, col=&quot;darkblue&quot;, lty=2)<br>abline(v=CI.lower.q2, col=&quot;darkblue&quot;, lty=2)<br>abline(v=CI.upper.q2, col=&quot;darkblue&quot;, lty=2)<br>abline(v=CI.lower.mu, col=&quot;darkblue&quot;, lty=2)<br>abline(v=CI.upper.mu, col=&quot;darkblue&quot;, lty=2)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*51q42CSUNthXU1yF2bFsWw.png" /><figcaption>Histogram of the simulated conditional distribution overlaid with the true density (in red). Additionally, the estimated conditional expectation and the conditional (0.05, 0.95) quantiles are in blue, with true values in red. Moreover, the dashed red lines are the confidence intervals for the estimates as calculated by DRF. Source: Author.</figcaption></figure><p>As can be seen from the above code, we essentially have <em>B</em> subtrees that can be used to calculate the measure each time. From these <em>B</em> samples of mean and quantiles, we can then calculate variances and use a normal approximation to obtain (asymptotic) confidence intervals seen in the dashed line in the Figure. <em>Again, all of this can be done despite the missing values in </em><strong><em>X</em></strong>(!)</p><p><strong>Variable Importance</strong></p><p>A final important aspect of Random Forests is efficiently calculated variable importance measures. While traditional measures are somewhat ad hoc, for the traditional RF and for DRF, there are now principled measures available, as explained in this <a href="https://medium.com/towards-data-science/variable-importance-in-random-forests-20c6690e44e0">medium article</a>. For RF, the Sobol-MDA method reliably identifies the variables most important for conditional expectation estimation, whereas for DRF, the MMD-MDA identifies the variables most important for the estimation of the distribution overall. As discussed in the article, using the idea of <em>projected Random Forests</em>, these measures can be very efficiently implemented. 
We demonstrate this in the example with a less efficient implementation of the MMD variable importance measure:</p><pre>## Variable importance for conditional Quantile Estimation<br><br><br>## For the conditional quantiles we use a measure that considers the whole distribution,<br>## i.e. the MMD based measure of DRF.<br>MMDVimp &lt;- compute_drf_vimp(X=X,Y=Y, print=F)<br>sort(MMDVimp, decreasing = T)<br><br>         X2          X1          X8          X9          X3         X10          X6 <br>0.812070057 0.149333294 0.015815104 0.007078924 0.006750151 0.005492872 0.001984300 <br>         X4          X5          X7 <br>0.000000000 0.000000000 0.000000000 </pre><p>Here both <em>X_1</em> and <em>X_2</em> are correctly identified as being the most relevant variable when trying to estimate the distribution. Remarkably, despite the dependence of <em>X_3</em> and <em>X_1</em>, the measure correctly quantifies that <em>X_3</em> is not important for the prediction of the distribution of <em>Y</em>. This is something that the original MDA measure of Random Forests tends to do wrong, as demonstrated in the medium article. Moreover, notice again that the missing values in <strong><em>X</em></strong> are no problem here.</p><h4>Conclusion</h4><p>GRF/DRF and also the traditional Random Forest should not be missing in the toolbox of any data scientist. While methods like XGboost can have a better performance in traditional prediction, the many strengths of modern RF-based approaches render them an incredibly versatile tool.</p><p>Of course, one should keep in mind that these methods are still fully nonparametric, and a lot of data points are needed for the fit to make sense. This is in particularly true for the uncertainty quantification, which is only valid asymptotically, i.e. for “large” samples.</p><h4>Literature</h4><p>[1] Breiman, L. (2001). Random forests. Machine learning, 45(1):5–32.</p><h4>Appendix: Variable Importance Code</h4><pre>require(drf)<br>require(Matrix)<br>require(kernlab)<br><br><br>#&#39; Variable importance for Distributional Random Forests<br>#&#39;<br>#&#39; @param X Matrix with input training data.<br>#&#39; @param Y Matrix with output training data.<br>#&#39; @param X_test Matrix with input testing data. If NULL, out-of-bag estimates are used.<br>#&#39; @param num.trees Number of trees to fit DRF. Default value is 500 trees.<br>#&#39; @param silent If FALSE, print variable iteration number, otherwise nothing is print. 
Default is FALSE.<br>#&#39;<br>#&#39; @return The list of importance values for all input variables.<br>#&#39; @export<br>#&#39;<br>#&#39; @examples<br>compute_drf_vimp &lt;- function(X, Y, X_test = NULL, num.trees = 500, silent = FALSE){<br>  <br>  # fit initial DRF<br>  bandwidth_Y &lt;- drf:::medianHeuristic(Y)<br>  k_Y &lt;- rbfdot(sigma = bandwidth_Y)<br>  K &lt;- kernelMatrix(k_Y, Y, Y)<br>  DRF &lt;- drf(X, Y, num.trees = num.trees)<br>  wall &lt;- predict(DRF, X_test)$weights<br>  <br>  # compute normalization constant<br>  wbar &lt;- colMeans(wall)<br>  wall_wbar &lt;- sweep(wall, 2, wbar, &quot;-&quot;)<br>  I0 &lt;- as.numeric(sum(diag(wall_wbar %*% K %*% t(wall_wbar))))<br>  <br>  # compute drf importance dropping variables one by one<br>  I &lt;- sapply(1:ncol(X), function(j) {<br>    if (!silent){print(paste0(&#39;Running importance for variable X&#39;, j, &#39;...&#39;))}<br>    DRFj &lt;- drf(X = X[, -j, drop=F], Y = Y, num.trees = num.trees) <br>    DRFpredj &lt;- predict(DRFj, X_test[, -j])<br>    wj &lt;- DRFpredj$weights<br>    Ij &lt;- sum(diag((wj - wall) %*% K %*% t(wj - wall)))/I0<br>    return(Ij)<br>  })<br>  <br>  # compute retraining bias<br>  DRF0 &lt;- drf(X = X, Y = Y, num.trees = num.trees)<br>  DRFpred0 = predict(DRF0, X_test)<br>  w0 &lt;- DRFpred0$weights<br>  vimp0 &lt;- sum(diag((w0 - wall) %*% K %*% t(w0 - wall)))/I0<br>  <br>  # compute final importance (remove bias &amp; truncate negative values)<br>  vimp &lt;- sapply(I - vimp0, function(x){max(0,x)})<br>  <br>  names(vimp)&lt;-colnames(X)<br>  <br>  return(vimp)<br>  <br>}</pre><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b62debaf1d62" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/random-forests-in-2023-modern-extensions-of-a-powerful-method-b62debaf1d62">Random Forests in 2023: Modern Extensions of a Powerful Method</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Variable Importance in Random Forests]]></title>
            <link>https://medium.com/data-science/variable-importance-in-random-forests-20c6690e44e0?source=rss-ca780798011a------2</link>
            <guid isPermaLink="false">https://medium.com/p/20c6690e44e0</guid>
            <category><![CDATA[deep-dives]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[interpretable-ml]]></category>
            <category><![CDATA[math]]></category>
            <category><![CDATA[random-forest]]></category>
            <dc:creator><![CDATA[Jeffrey Näf]]></dc:creator>
            <pubDate>Fri, 03 Nov 2023 06:06:47 GMT</pubDate>
            <atom:updated>2023-11-03T06:06:47.913Z</atom:updated>
            <content:encoded><![CDATA[<h4>Traditional Methods and New Developments</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*aUfHbsy_RKEozl6YDrt2rw.png" /><figcaption>Features of (Distributional) Random Forests. In this article: The ability to produce variable importance. Source: Author.</figcaption></figure><p>Random Forest and generalizations (in particular, Generalized Random Forests (GRF) and Distributional Random Forests (DRF) ) are powerful and easy-to-use machine learning methods that should not be absent in the toolbox of any data scientist. They not only show robust performance over a large range of datasets without the need for tuning, but can also easily handle <a href="https://medium.com/towards-data-science/random-forests-and-missing-values-3daaea103db0">missing values</a>, and even provide <a href="https://medium.com/towards-data-science/inference-for-distributional-random-forests-64610bbb3927">confidence intervals</a>. In this article, we focus on another feature they are able to provide: notions of feature importance. In particular, we focus on:</p><ol><li>Traditional Random Forest (RF), which is used to predict the conditional expectation of a variable <em>Y</em> given p predictors <strong><em>X</em></strong>.</li><li>The <a href="https://medium.com/towards-data-science/drf-a-random-forest-for-almost-everything-625fa5c3bcb8">Distributional Random Forest</a>, which is used to predict the whole conditional distribution of a d-variate <strong><em>Y</em></strong> given <em>p</em> predictors <strong><em>X</em></strong>.</li></ol><p>Unfortunately, like many modern machine learning methods, both forests lack interpretability. That is, there are so many operations involved, it seems impossible to determine what the functional relationship between the predictors and <em>Y</em> actually is. A common way to tackle this problem is to define Variable Importance measures (VIMP), that at least help decide which predictors are important. Generally, this has two different objectives:</p><p>(1) finding a small number of variables with maximal accuracy,</p><p>(2) detecting and ranking all influential variables to focus on for further exploration.</p><p>The difference between (1) and (2) matters as soon as there is dependence between the elements in <strong><em>X</em></strong> (so pretty much always). For example, if two variables are highly correlated together and with <em>Y</em>, one of the two inputs can be removed without hurting accuracy for objective (1), since both variables convey the same information. However, both should be included for objective (2), since these two variables may have different meanings in practice for domain experts.</p><p>Today we focus on (1) and try to find a smaller number of predictors that display more or less the same predictive accuracy. For instance, in the wage example below, we are able to reduce the number of predictors from 79 to about 20, with only a small reduction in accuracy. These most important predictors contain variables such as age and education which are well-known to influence wages. There are also many great articles on medium about (2), using Shapley values such as <a href="https://towardsdatascience.com/from-shapley-to-shap-understanding-the-math-e7155414213b">this one</a> or <a href="https://towardsdatascience.com/understand-the-working-of-shap-based-on-shapley-values-used-in-xai-in-the-most-simple-way-d61e4947aa4e">this one</a>. 
There is also very recent and exciting <a href="https://hal.science/hal-03232621/document">academic literature</a> on how to efficiently calculate Shapley values with Random Forest. But this is material for a second article.</p><p>The two measures we look at today are actually more general variable importance measures that can be used for any method, based on the drop-and-relearn principle which we will look at below. We focus exclusively on tree-based methods here, however. Moreover, we don’t go into great detail explaining the methods, but rather try to focus on their applications and why newer versions are preferable to the more traditional ones.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/794/1*f9VzF3CQwtrf4hqCxBPI1A.png" /><figcaption>Overview of Variable Importance Measures for Random Forests. Mean Decrease Impurity (MDI) and Mean Decrease Accuracy (MDA) were both postulated by Breiman. Due to their empirical nature, however, several problems remained, which were recently addressed by Sobol-MDA. Source: Author</figcaption></figure><h4>The Beginnings</h4><p>Variable importance measures for RFs are in fact as old as RF itself. The first accuracy the <strong>Mean Decrease Accuracy (MDA)</strong> was proposed by Breiman in his seminal Random Forest paper [1]. The idea is simple: For every dimension <em>j=1,…,p</em>, one compares the accuracy of the full prediction with the accuracy of the prediction when <em>X_j</em> is randomly permuted. The idea of this is to break the relationship between <em>X_j</em> and <em>Y</em> and compare the accuracy when <em>X_j</em> is not helping to predict <em>Y</em> by design, to the case when it is potentially of use.</p><p>There are various different versions of MDA implemented in R and Python:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*M0o2LHQj9yNCDHb_ghcYzQ.png" /><figcaption>Different Versions of MDA, implemented in different packages. Source: Table 1 in [3]</figcaption></figure><p>Unfortunately, permuting variable <em>X_j</em> in this way not only breaks its relationship to <em>Y</em>, but also to the other variables in <strong><em>X</em></strong>. This is not a problem if <em>X_j</em> is independent from all other variables, but it becomes a problem once there is dependence. Consequently, [3] is able to show that as soon as there is dependence in <strong><em>X</em></strong>, the MDA converges to something nonsensical. In particular, MDA can give high importance to a variable <em>X_j</em> that is not important to predict <em>Y</em>, but is highly correlated with another variable, say <em>X_l</em>, that is actually important for predicting <em>Y</em> (as demonstrated in the example below). At the same time, it can fail to detect variables that are actually relevant, as demonstrated by a long list of papers in [3, Section 2.1]. Intuitively, what we would want to measure is the performance of the model if <em>X_j</em> is not included, and instead, we measure the performance of a model with a permuted <em>X_j</em> variable.</p><p>The second traditional accuracy measure is <strong>Mean Decrease Impurity (MDI)</strong>, which sums the weighted decreases of impurity over all nodes that split on a given covariate, averaged over all trees in the forest. Unfortunately, MDI is ill-defined from the start (it&#39;s not clear what it should measure) and several papers highlight the practical problem of this approach (e.g. 
[5]) As such, we will not go into detail about MDI, as MDA is often the preferred choice.</p><h4>Modern Developments I: Sobol-MDA</h4><p>For the longest time, I thought these somewhat informal measures were the best we could do. One paper that changed that, came out only very recently. In this paper, the authors demonstrate theoretically that the popular measures above are actually quite flawed and do not measure what we want to measure. So the first question might be: What do we actually want to measure? One potential answer: The Sobol-index (originally proposed in the computer science literature):</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/554/1*XoCfjmKeOOa3HwIsQgkTjw.png" /></figure><p>Let’s unpack this. First, <em>tau(</em><strong><em>X</em></strong><em>)=E[ Y | </em><strong><em>X</em></strong><em>] </em>is the conditional expectation function we would like to estimate. This is a random variable because it is a function of the random <strong><em>X</em></strong>. Now <strong><em>X</em></strong><em>^{(-j)}</em> is the <em>p-1</em> vector with covariate <em>j</em> removed. Thus <em>ST^{(j)}</em> is the reduction in output explained variance if the <em>j</em>th output variable is removed.</p><p>The above is the more traditional way of writing the measure. However, for me writing:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/808/1*wwKfDohoxS4N19RONHB_og.png" /></figure><p>is much more intuitive. Here <em>d</em> is a distance between two random vectors and for the <em>ST^{(j)}</em> above, this distance is simply the usual Euclidean distance. Thus the upper part of <em>ST^{(j)}</em> is simply measuring the average squared distance between what we want (<em>tau(</em><strong><em>X</em></strong><em>)</em>) and what we get without variable <em>j</em>. The latter is</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*I3KvLW0oeT_o6e7ZYGryXg.png" /></figure><p>The question becomes how to estimate this efficiently. It turns out that the intuitive drop-and-relearn principle would be enough: Simply estimating <em>tau(</em><strong><em>X</em></strong><em>) </em>using RF and then dropping <em>X_j</em> and refitting the RF to obtain an estimate of <em>tau(</em><strong><em>X^{(-j)}</em></strong><em>), </em>one obtains the consistent estimator<em>:</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/821/1*Ppzj0U4VlYE8rlyLlRB2gQ.png" /></figure><p>where <em>tau_n(</em><strong><em>X</em></strong><em>_i)</em> is the RF estimate for a test point <strong><em>X</em></strong><em>_i</em> using all <em>p</em> predictors and similarly <em>tau_n(</em><strong><em>X</em></strong><em>_i^{(-j)})</em> is the refitted forest using only <em>p-1</em> predictors.</p><p>However, this means the forest needs to be refitted <em>p</em> times, not very efficient when <em>p</em> is large! As such the authors in [3] develop what they call the <strong>Sobol-MDA</strong>. Instead of refitting the forest each time, the forest is only fitted once. Then test points are dropped down the same forest and the resulting prediction is “projected” to form the measure in (1). That is, splits on <em>X_j</em> are simply ignored (remember the goal is to obtain an estimate without <em>X_j</em>). The authors are able to show that calculating (1) above with this projected approach also results in a consistent estimator! 
This is a beautiful idea indeed and renders the algorithm applicable even in high dimensions.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-IdKh0Bud0QfI-w472q0-Q.png" /><figcaption>Illustration of the projection approach. On the left the division of the two-dimensional space by RF. On the right the projection approach ignores splits in X^(2), thereby removing it when making predictions. As can be seen the point X gets projected onto X^{(-j)} on the right using this principle. Source: Figure 1 in [3]</figcaption></figure><p>The method is implemented in R in the <a href="https://gitlab.com/drti/sobolmda">soboldMDA </a>package, based on the very fast <a href="https://cran.r-project.org/web/packages/ranger/ranger.pdf">ranger</a> package.</p><h4>Modern Developments II: MMD-based sensitivity index</h4><p>Looking at the formulation using the distance <em>d</em>, a natural question is to ask whether different distances could be used to get variable importance measures for more difficult problems. One such recent example is to use the MMD distance as <em>d:</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/793/1*5nQ4BxhuBf9zTS-YI8yTIA.png" /></figure><p>The MMD distance is a wonderful tool, that allows to quite easily build a distance between distributions using a kernel <em>k</em> (such as the Gaussian kernel):</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/451/1*7YzaWgwrB2GfAyS0HPpE8Q.png" /></figure><p>For the moment I leave the details to further articles. The most important takeaway is simply that <em>I^{(j)}</em> considers a more general target than the conditional expectation. It recognizes a variable <em>X_j </em>as important, as soon as it influences the distribution of <em>Y</em> in any way. It might be that <em>X_j</em> only changes the variance or the quantiles and leaves the conditional mean of <em>Y</em> untouched (see example below). In this case, the Sobol-MDA would not recognize <em>X_j</em> as important, but the MMD method would. This doesn’t necessarily make it better, it is simply a different tool: If one is interested in predicting the conditional expectation, <em>ST^{(j)}</em> is the right measure. However, if one is interested in predicting other aspects of the distribution, especially quantiles, <em>I^{(j)}</em> would be better suited. Again <em>I^{(j)}</em> can be consistently estimated using the drop-and-relearn principle (refitting DRF for <em>j=1,…,p</em> eacht time with variable $j$ removed), or the same projection approach as for Sobol-MDA can be used. A drop-and-relearn-based implementation is attached at the end of this article. We refer to this method here as <strong>MMD-MDA</strong>.</p><h4>Simulated Data</h4><p>We now illustrate these two modern measures on a simple simulated example: We first download and install the Sobol-MDA package from <a href="https://gitlab.com/drti/sobolmda">Gitlab</a> and then load all the packages necessary for this example:</p><pre>library(kernlab)<br>library(drf)<br>library(Matrix)<br>library(DescTools)<br>library(mice)<br>library(sobolMDA)<br>source(&quot;compute_drf_vimp.R&quot;) ##Contents of this file can be found below<br>source(&quot;evaluation.R&quot;) ##Contents of this file can be found below</pre><p>Then we simulate from this simple example: We take <em>X_1, X_2, X_4, …, X_10</em> independently uniform between (-1,1) and create dependence between <em>X_1</em> and <em>X_3</em> by taking <em>X_3=X_1 + uniform error</em>. 
Then we simulate <em>Y</em> as</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/800/1*fVFn4lfuSjRoVZRsgdLRLA.png" /></figure><pre>##Simulate Data that experiences both a mean as well as sd shift<br><br># Simulate from X<br>x1 &lt;- runif(n,-1,1)<br>x2 &lt;- runif(n,-1,1)<br>X0 &lt;- matrix(runif(7*n,-1,1), nrow=n, ncol=7)<br>x3 &lt;- x1+ runif(n,-1,1)<br>X &lt;- cbind(x1,x2, x3, X0)<br><br># Simulate dependent variable Y<br>Y &lt;- as.matrix(rnorm(n,mean = 0.8*(x1 &gt; 0), sd = 1 + 1*(x2 &gt; 0)))<br>colnames(X)&lt;-paste0(&quot;X&quot;, 1:10)<br><br><br>head(cbind(Y,X))<br></pre><p>We then analyze the Sobol-MDA approach to estimate the conditional expectation of <em>Y</em> given <strong><em>X</em></strong>:</p><pre>## Variable importance for conditional Expectation Estimation<br><br>XY &lt;- as.data.frame(cbind(Xfull, Y))<br>colnames(XY) &lt;- c(paste(&#39;X&#39;, 1:(ncol(XY)-1), sep=&#39;&#39;), &#39;Y&#39;)<br>num.trees &lt;- 500<br>forest &lt;- sobolMDA::ranger(Y ~., data = XY, num.trees = num.trees, importance = &#39;sobolMDA&#39;)<br>sobolMDA &lt;- forest$variable.importance<br>names(sobolMDA) &lt;- colnames(X)<br><br>sort(sobolMDA, decreasing = T)<br><br>          X1           X8           X7           X6           X5           X9 <br> 0.062220958  0.021946135  0.016818860  0.016777223 -0.001290326 -0.001540919 <br>          X3          X10           X4           X2 <br>-0.001578540 -0.007400854 -0.008299478 -0.020334150 <br></pre><p>As can be seen, it correctly identifies that <em>X_1</em> is the most important variable, while the others are ranked equally (un)important. This makes sense because the conditional expectation of <em>Y</em> is only changed by <em>X_1</em>. Crucially, the measure manages to do this despite the dependence between <em>X_1 </em>and <em>X_3</em>. Thus we successfully pursued goal (1), as explained above, in this example. On the other hand, we can also have a look at the traditional MDA:</p><pre>forest &lt;- sobolMDA::ranger(Y ~., data = XY, num.trees = num.trees, importance = &#39;permutation&#39;)<br>MDA &lt;- forest$variable.importance<br>names(MDA) &lt;- colnames(X)<br><br>sort(MDA, decreasing = T)<br><br>          X1           X3           X6           X7           X8           X2 <br> 0.464516976  0.118147061  0.063969310  0.032741521  0.029004312 -0.004494380 <br>          X4           X9          X10           X5 <br>-0.009977733 -0.011030996 -0.014281844 -0.018062544 </pre><p>In this case, while it correctly identifies <em>X_1 </em>as the most important variable, it also places <em>X_3 </em>in second place, with a value that seems quite a bit higher than the remaining variables. This despite the fact, that <em>X_3 </em>is just as unimportant as <em>X_2, X_4,…, X_10</em>!</p><p>But what if we are interested in predicting the distribution of <em>Y</em> more generally, say for estimating quantiles? In this case, we need a measure that is able to recognize the influence of <em>X_2</em> on the conditional variance of <em>Y</em>. 
Here the MMD variable importance measure comes into play:</p><pre>MMDVimp &lt;- compute_drf_vimp(X=X,Y=Y)<br>sort(MMDVimp, decreasing = T)<br><br>         X2          X1         X10          X6          X8          X3 <br>0.683315006 0.318517259 0.014066410 0.009904518 0.006859128 0.005529749 <br>         X7          X9          X4          X5 <br>0.003476256 0.003290550 0.002417677 0.002036174 </pre><p>Again the measure is able to correctly identify what matters: <em>X_1</em> and <em>X_2</em> are the two most important variables. And again, it does this despite the dependence between <em>X_1</em> and <em>X_3</em>. Interestingly, it also gives the variance shift from <em>X_2</em> a higher importance than the expectation shift from <em>X_1</em>.</p><h4>Real Data</h4><p>Finally, I present a real data application to demonstrate the variable importance measure. Note that with DRF, we could look even at multivariate <strong><em>Y</em></strong> but to keep things more simple, we focus on a univariate setting and consider the US wage data from the 2018 American Community Survey by the US Census Bureau. In the first <a href="https://www.jmlr.org/papers/v23/21-0585.html">DRF paper</a>, we obtained data on approximately 1 million full-time employees from the 2018 American Community Survey by the US Census Bureau from which we extracted the salary information and all covariates that might be relevant for salaries. This wealth of data is ideal to experiment with a method like DRF (in fact we will only use a tiny subset for this analysis). The data we load can be found <a href="https://github.com/lorismichel/drf/blob/master/applications/wage_data/data/datasets/wage_benchmark.Rdata">here</a>.</p><pre># Load data (https://github.com/lorismichel/drf/blob/master/applications/wage_data/data/datasets/wage_benchmark.Rdata)<br>load(&quot;wage_benchmark.Rdata&quot;)<br><br>##Define the training data<br><br>n&lt;-1000<br><br>Xtrain&lt;-X[1:n,] <br>Ytrain&lt;-Y[1:n,]<br>Xtrain&lt;-cbind(Xtrain,Ytrain[,&quot;male&quot;])<br>colnames(Xtrain)[ncol(Xtrain)]&lt;-&quot;male&quot;<br>Ytrain&lt;-Ytrain[,1, drop=F]<br><br><br>##Define the test data<br>ntest&lt;-2000<br>Xtest&lt;-X[(n+1):(n+ntest),]  <br>Ytest&lt;-Y[(n+1):(n+ntest),]<br>Xtest&lt;-cbind(Xtest,Ytest[,&quot;male&quot;])<br>colnames(Xtest)[ncol(Xtest)]&lt;-&quot;male&quot;<br>Ytest&lt;-Ytest[,1, drop=F]</pre><p>We now calculate both variable importance measures (this will take a while as only the drop-and-relearn method is implemented for DRF):</p><pre># Calculate variable importance for both measures<br># 1. Sobol-MDA<br>XY &lt;- as.data.frame(cbind(Xtrain, Ytrain))<br>colnames(XY) &lt;- c(paste(&#39;X&#39;, 1:(ncol(XY)-1), sep=&#39;&#39;), &#39;Y&#39;)<br>num.trees &lt;- 500<br>forest &lt;- sobolMDA::ranger(Y ~., data = XY, num.trees = num.trees, importance = &#39;sobolMDA&#39;)<br>SobolMDA &lt;- forest$variable.importance<br>names(SobolMDA) &lt;- colnames(Xtrain)<br><br># 2. 
MMD-MDA<br>MMDVimp &lt;- compute_drf_vimp(X=Xtrain,Y=Ytrain,silent=T)<br><br><br><br>print(&quot;Top 10 most important variables for conditional Expectation estimation&quot;)<br>sort(SobolMDA, decreasing = T)[1:10]<br>print(&quot;Top 5 most important variables for conditional Distribution estimation&quot;)<br>sort(MMDVimp, decreasing = T)[1:10]</pre><pre>Sobol-MDA:<br><br>education_level                   age                  male <br>          0.073506769           0.027079349           0.013722756 <br>        occupation_11         occupation_43           industry_54 <br>          0.013550320           0.010025332           0.007744589 <br>          industry_44         occupation_23         occupation_15 <br>          0.006657918           0.005772662           0.004610835 <br>marital_never married <br>          0.004545964<br></pre><pre>MMD-MDA:<br><br>education_level                   age                  male <br>          0.420316085           0.109212519           0.027356393 <br>        occupation_43         occupation_11 marital_never married <br>          0.016861954           0.014122583           0.003449910 <br>        occupation_29       marital_married           industry_81 <br>          0.002272629           0.002085207           0.001152210 <br>          industry_72 <br>          0.000984725</pre><p>In this case, the two variable importance measures agree quite a bit on which variables are important. While this is not a causal analysis, it is also nice that variables that are known to be important to predict wages, specifically “age”, “education_level” and “gender”, are indeed seen as very important by the two measures.</p><p>To obtain a small set of predictive variables, one could now for <em>j=1,…p-1,</em></p><p>(I) Remove the least important variable</p><p>(II) Calculate the loss (e.g. mean squared error) on a test set</p><p>(III) Recalculate the variable importance for the remaining variable</p><p>(IV) Repeat until a certain stopping criterion is met</p><p>One could stop, for instance, if the loss increased by more than 5%. To make my life easier in this article, I just use the same variable importance values saved in “SobolMDA” and “MMDVimp” above. That is, I ignore step (III) and only consider (I), (II) and (IV). When the goal of estimation is the full conditional distribution, step (II) is also not entirely clear. We use what we refer to as MMD loss, described in more detail in our paper ([4]). This loss considers the error we are making in the prediction of the distribution. For the conditional mean, we simply use the mean-squared error. 
This is done in the function “evalall” found below:</p><pre># Remove variables one-by-one accoring to the importance values saved in SobolMDA<br># and MMDVimp.<br>evallistSobol&lt;-evalall(SobolMDA, X=Xtrain ,Y=Ytrain ,Xtest, Ytest, metrics=c(&quot;MSE&quot;), num.trees )<br>evallistMMD&lt;-evalall(MMDVimp, X=Xtrain ,Y=Ytrain ,Xtest, Ytest, metrics=c(&quot;MMD&quot;), num.trees )<br><br><br>plot(evallistSobol$evalMSE, type=&quot;l&quot;, lwd=2, cex=0.8, col=&quot;darkgreen&quot;, main=&quot;MSE loss&quot; , xlab=&quot;Number of Variables removed&quot;, ylab=&quot;Values&quot;)<br>plot(evallistMMD$evalMMD, type=&quot;l&quot;, lwd=2, cex=0.8, col=&quot;darkgreen&quot;, main=&quot;MMD loss&quot; , xlab=&quot;Number of Variables removed&quot;, ylab=&quot;Values&quot;)</pre><p>This results in the following two pictures:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*OeG0_-q8q3_Ie4RS9T3RcQ.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*rX7w3fOhWWPZYkOQc6QiHQ.png" /></figure><p>Notice that both have somewhat wiggly lines, which is first due to the fact that I did not recalculate the importance measure, e.g., left out step (III), and second due to the randomness of the forests. Aside from this, the graphs nicely show how the errors successively increase with each variable that is removed. This increase is first slow for the least important variables and then gets quicker for the most important ones, exactly as one would expect. In particular, the loss in both cases remains virtually unchanged if one removes the 50 least important variables! In fact, one could remove about 70 variables in both cases without increasing the loss by more than 6%. One has to note though that many predictors are part of one-hot encoded categorical variables and thus one needs to be somewhat careful when removing predictors, as they correspond to levels of one categorical variable. However, in an actual application, this might still be desirable.</p><h4>Conclusion</h4><p>In this article, we looked at modern approaches to variable importance in Random Forests, with the goal of obtaining a small set of predictors or covariates, both with respect to the conditional expectation and for the conditional distribution more generally. We have seen in the wage data example, that this can lead to a substantial reduction in predictors with virtually the same accuracy.</p><p>As noted above the measures presented are not strictly constrained to Random Forest, but can be used more generally in principle. However, forests allow for the elegant projection approach that allows for the calculation of the importance measure for all variables <em>j</em>, without having to refit the forest each time (!) This is described in both [3] and [4].</p><h4>Literature</h4><p>[1] Breiman, L. (2001). Random forests. Machine learning, 45(1):5–32.</p><p>[2] Breiman, L. (2003a). Setting up, using, and understanding random forests v3.1. Technical report, UC Berkeley, Department of Statistics</p><p>[3] Bénard, C., Da Veiga, S., and Scornet, E. (2022). Mean decrease accuracy for random forests: inconsistency, and a practical solution via the Sobol-MDA. Biometrika, 109(4):881–900.</p><p>[4] Clément Bénard, Jeffrey Näf, and Julie Josse. MMD-based variable importance for distributional random forest, 2023.</p><p>[5] Strobl, C., Boulesteix, A.-L., Zeileis, A., and Hothorn, T. (2007). Bias in random forest variable importance measures: illustrations, sources and a solution. 
BMC Bioinformatics, 8:25.</p><h4>Appendix : Code</h4><pre><br>#### Contents of compute_drf_vimp.R ######<br><br>#&#39; Variable importance for Distributional Random Forests<br>#&#39;<br>#&#39; @param X Matrix with input training data.<br>#&#39; @param Y Matrix with output training data.<br>#&#39; @param X_test Matrix with input testing data. If NULL, out-of-bag estimates are used.<br>#&#39; @param num.trees Number of trees to fit DRF. Default value is 500 trees.<br>#&#39; @param silent If FALSE, print variable iteration number, otherwise nothing is print. Default is FALSE.<br>#&#39;<br>#&#39; @return The list of importance values for all input variables.<br>#&#39; @export<br>#&#39;<br>#&#39; @examples<br>compute_drf_vimp &lt;- function(X, Y, X_test = NULL, num.trees = 500, silent = FALSE){<br>  <br>  # fit initial DRF<br>  bandwidth_Y &lt;- drf:::medianHeuristic(Y)<br>  k_Y &lt;- rbfdot(sigma = bandwidth_Y)<br>  K &lt;- kernelMatrix(k_Y, Y, Y)<br>  DRF &lt;- drf(X, Y, num.trees = num.trees)<br>  wall &lt;- predict(DRF, X_test)$weights<br>  <br>  # compute normalization constant<br>  wbar &lt;- colMeans(wall)<br>  wall_wbar &lt;- sweep(wall, 2, wbar, &quot;-&quot;)<br>  I0 &lt;- as.numeric(sum(diag(wall_wbar %*% K %*% t(wall_wbar))))<br>  <br>  # compute drf importance dropping variables one by one<br>  I &lt;- sapply(1:ncol(X), function(j) {<br>    if (!silent){print(paste0(&#39;Running importance for variable X&#39;, j, &#39;...&#39;))}<br>    DRFj &lt;- drf(X = X[, -j, drop=F], Y = Y, num.trees = num.trees) <br>    DRFpredj &lt;- predict(DRFj, X_test[, -j])<br>    wj &lt;- DRFpredj$weights<br>    Ij &lt;- sum(diag((wj - wall) %*% K %*% t(wj - wall)))/I0<br>    return(Ij)<br>  })<br>  <br>  # compute retraining bias<br>  DRF0 &lt;- drf(X = X, Y = Y, num.trees = num.trees)<br>  DRFpred0 = predict(DRF0, X_test)<br>  w0 &lt;- DRFpred0$weights<br>  vimp0 &lt;- sum(diag((w0 - wall) %*% K %*% t(w0 - wall)))/I0<br>  <br>  # compute final importance (remove bias &amp; truncate negative values)<br>  vimp &lt;- sapply(I - vimp0, function(x){max(0,x)})<br>  <br>  names(vimp)&lt;-colnames(X)<br>  <br>  return(vimp)<br>  <br>}<br><br><br><br></pre><pre>#### Contents of evaluation.R ######<br><br><br>compute_mmd_loss &lt;- function(Y_train, Y_test, weights){<br>  # Y_train &lt;- scale(Y_train)<br>  # Y_test &lt;- scale(Y_test)<br>  bandwidth_Y &lt;- (1/drf:::medianHeuristic(Y_train))^2<br>  k_Y &lt;- rbfdot(sigma = bandwidth_Y)<br>  K_train &lt;- matrix(kernelMatrix(k_Y, Y_train, Y_train), ncol = nrow(Y_train))<br>  K_cross &lt;- matrix(kernelMatrix(k_Y, Y_test, Y_train), ncol = nrow(Y_train))<br>  weights &lt;- matrix(weights, ncol = ncol(weights))<br>  t1 &lt;- diag(weights%*%K_train%*%t(weights))<br>  t2 &lt;- diag(K_cross%*%t(weights))<br>  mmd_loss &lt;- mean(t1) - 2*mean(t2)<br>  mmd_loss<br>}<br><br>evalall &lt;- function(Vimp, X ,Y ,Xtest, Ytest, metrics=c(&quot;MMD&quot;,&quot;MSE&quot;), num.trees ){<br>  <br>  if (ncol(Ytest) &gt; 1 &amp; &quot;MSE&quot; %in% metrics){<br>    metrics &lt;- metrics[!( metrics %in% &quot;MSE&quot;) ]<br>  }<br>  <br>  # Sort for increasing importance, such that the least important variables are removed first<br>  Vimp&lt;-sort(Vimp)<br>  <br>  if ( is.null(names(Vimp)) ){<br>    stop(&quot;Need names for later&quot;)  <br>  }<br>  <br>  <br>  evalMMD&lt;-matrix(0, nrow=ncol(X))<br>  evalMSE&lt;-matrix(0, nrow=ncol(X))<br>  <br>  ###Idea: Create a function that takes a variable importance measure and does this loop!!<br>  <br>  for (j in 1:ncol(X)){<br>    <br> 
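    # In round j, the (j-1) least important variables according to Vimp are removed<br>    # (none in the first round), the forest is refit on the remaining predictors and<br>    # the MMD and/or MSE loss on the test set is recorded.<br>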
  <br>    <br>    if (j==1){<br>      <br>      if (&quot;MMD&quot; %in% metrics){<br>        <br>        DRFred&lt;- drf(X=X,Y=Y)<br>        weights&lt;- predict(DRFred, newdata=Xtest)$weights<br>        evalMMD[j]&lt;-compute_mmd_loss(Y_train=Y, Y_test=Ytest, weights)<br>    <br>      }<br>      <br>      if (&quot;MSE&quot; %in% metrics){<br>        <br>        XY &lt;- as.data.frame(cbind(X, Y))<br>        colnames(XY) &lt;- c(paste(&#39;X&#39;, 1:(ncol(XY)-1), sep=&#39;&#39;), &#39;Y&#39;)<br>        RFfull &lt;- sobolMDA::ranger(Y ~., data = XY, num.trees = num.trees)<br>        XtestRF&lt;-Xtest<br>        colnames(XtestRF) &lt;- paste(&#39;X&#39;, 1:ncol(XtestRF), sep=&#39;&#39;)<br>        predRF&lt;-predict(RFfull, data=XtestRF)<br>        evalMSE[j] &lt;- mean((Ytest - predRF$predictions)^2)<br>      <br>      }<br><br>    }else{<br>      <br>      <br>      if (&quot;MMD&quot; %in% metrics){<br>        <br>        DRFred&lt;- drf(X=X[,!(colnames(X) %in% names(Vimp[1:(j-1)])), drop=F],Y=Y)<br>        weights&lt;- predict(DRFred, newdata=Xtest[,!(colnames(Xtest) %in% names(Vimp[1:(j-1)])), drop=F])$weights<br>        evalMMD[j]&lt;-compute_mmd_loss(Y_train=Y, Y_test=Ytest, weights)<br>        <br>      }<br>      <br>      <br>      <br>      if (&quot;MSE&quot; %in% metrics){<br>        <br>        XY &lt;- as.data.frame(cbind(X[,!(colnames(X) %in% names(Vimp[1:(j-1)])), drop=F], Y))<br>        colnames(XY) &lt;- c(paste(&#39;X&#39;, 1:(ncol(XY)-1), sep=&#39;&#39;), &#39;Y&#39;)<br>        RFfull &lt;- sobolMDA::ranger(Y ~., data = XY, num.trees = num.trees)<br>        XtestRF&lt;-Xtest[,!(colnames(Xtest) %in% names(Vimp[1:(j-1)])), drop=F]<br>        colnames(XtestRF) &lt;- paste(&#39;X&#39;, 1:ncol(XtestRF), sep=&#39;&#39;)<br>        predRF&lt;-predict(RFfull, data=XtestRF)<br>        evalMSE[j] &lt;- mean((Ytest - predRF$predictions)^2)<br>        <br>        # DRFall &lt;- drf(X=X[,!(colnames(X) %in% names(Vimp[1:(j-1)])), drop=F], Y=Y, num.trees=num.trees)<br>        # quantpredictall&lt;-predict(DRFall, newdata=Xtest[,!(colnames(Xtest) %in% names(Vimp[1:(j-1)])), drop=F], functional=&quot;quantile&quot;,quantiles=c(0.5))<br>        # evalMAD[j] &lt;- mean(sapply(1:nrow(Xtest), function(j)  abs(Ytest[j] - quantpredictall$quantile[,,&quot;q=0.5&quot;][j]) ))<br>      }<br>      <br>    }<br>    <br>  }<br>  <br>  return(list(Vimp=Vimp, evalMMD=evalMMD, evalMSE=evalMSE ))<br>  <br>}<br><br></pre><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=20c6690e44e0" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/variable-importance-in-random-forests-20c6690e44e0">Variable Importance in Random Forests</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[CLVTools Version 0.10.0]]></title>
            <link>https://medium.com/@jeffrey_85949/clvtools-version-0-10-0-8a943a856743?source=rss-ca780798011a------2</link>
            <guid isPermaLink="false">https://medium.com/p/8a943a856743</guid>
            <category><![CDATA[ltv]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[business-strategy]]></category>
            <category><![CDATA[r]]></category>
            <dc:creator><![CDATA[Jeffrey Näf]]></dc:creator>
            <pubDate>Thu, 26 Oct 2023 07:58:32 GMT</pubDate>
            <atom:updated>2023-10-26T07:58:32.309Z</atom:updated>
            <content:encoded><![CDATA[<h4>Model your customers like never before</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*q-hQ6AnWFqAg6unp" /><figcaption>Photo by <a href="https://unsplash.com/@hostreviews?utm_source=medium&amp;utm_medium=referral">Stephen Phillips - Hostreviews.co.uk</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>In <a href="https://towardsdatascience.com/clvtools-a-powerful-r-package-to-evaluate-your-customers-4fd1781811d">this article</a> I briefly discussed the CLVTools package for customer lifetime (CLV) modeling. The package just got a major upgrade and is better than ever.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*88QtD_p18ZkUPu5y.jpeg" /><figcaption>Overview of the models included in CLVTools. Source: Walkthrough <a href="https://www.clvtools.com/articles/CLVTools.html">here</a>, with permission of the authors.</figcaption></figure><p>One of the big advantages of the package was the inclusion of a well-designed implementation of the extended Pareto/NBD model, which allowed for time-varying covariates. Thus it allows for instance to model known seasonal patterns, such as holidays or the firms own marketing campaigns. Including well-designed time-varying covariates in this fashion can lead to a heavy performance boost compared to the already strong (original) Pareto/NBD model, as demonstrated for instance in the <a href="https://pubsonline.informs.org/doi/abs/10.1287/mksc.2020.1254">original paper</a>. Since the Pareto/NBD model has shown consistent performance over a wide range of marketing datasets over the last 30 years, the extended Pareto/NBD model can be seen as a flagship model of modern marketing research. Crucially, in this update the implementation of the extended Pareto/NBD model got a makeover and is now done in Rcpp. That means it is an order of magnitude faster than before, which can make a huge difference for end-users. It’s not an understatement to say that this lifts the extended Pareto/NBD model from an academic curiosity to a model that can actually be used in practice.</p><p>This is by far the biggest change, but there are also some further nice additions, such as new formula interfaces and new plotting capabilities. In total there is:</p><ul><li>MUCH faster fitting for the Pareto/NBD with time-varying covariates because the LL is now implemented in Rcpp</li><li>Added interface to specify models using a formula notation (<a href="https://www.clvtools.com/reference/latentAttrition.html">latentAttrition()</a> and <a href="https://www.clvtools.com/reference/spending.html">spending()</a>)</li><li>New method to plot customer’s transaction timings (plot.clv.data(which=&#39;timings&#39;))</li><li>Ability to draw diagnostic plots of multiple models in single plot (plot(other.models=list(), label=c()))</li></ul><p>I quickly demonstrate some of the new features here, with the same example as in my <a href="https://towardsdatascience.com/clvtools-a-powerful-r-package-to-evaluate-your-customers-4fd1781811d">first article</a>. 
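</p><p>As a small, hedged sketch of the new multi-model plotting listed above (using the other.models and label arguments named there; it assumes the clv.apparel object constructed in the walkthrough just below and two standard models fitted on it):</p><pre># Sketch only: compare two fitted models in a single diagnostic plot<br>est.pnbd  &lt;- pnbd(clv.apparel)<br>est.bgnbd &lt;- bgnbd(clv.apparel)<br>plot(est.pnbd, other.models = list(est.bgnbd), label = c(&quot;Pareto/NBD&quot;, &quot;BG/NBD&quot;))</pre><p>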
More details about the package including a walkthrough can be found <a href="https://www.clvtools.com/">here</a>.</p><pre>library(CLVTools)<br>data(&quot;apparelTrans&quot;)<br>data(&quot;apparelDynCov&quot;)<br><br>clv.apparel &lt;- clvdata(apparelTrans,<br>                       date.format = &quot;ymd&quot;,<br>                       time.unit= &quot;week&quot;,<br>                       estimation.split=40,<br>                       name.id=&quot;Id&quot;,<br>                       name.date=&quot;Date&quot;,<br>                       name.price=&quot;Price&quot;)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*WSNw6FrXw5n0wjm9hDkYPg.png" /></figure><p>This data set also comes with covariates:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*mrO1kgQqsdfw9DAxtpEL-Q.png" /></figure><p>“Gender” and “Channel” are static covariates that change over customers, but not over time. Marketing, on the other hand, is both customer and time specific and thus dynamic.</p><p>Now let’s look at two of the changes:</p><h4>An Rcpp implementation of the extended Pareto/NBD model</h4><p>As in the first article, we first set the covariates</p><pre>clv.dyn &lt;- SetDynamicCovariates(clv.data=clv.apparel,<br>                                data.cov.life = apparelDynCov,<br>                                data.cov.trans = apparelDynCov,<br>                                names.cov.life = c(&quot;Marketing&quot;, &quot;Gender&quot;, &quot;Channel&quot;),<br>                                names.cov.trans = c(&quot;Marketing&quot;, &quot;Gender&quot;, &quot;Channel&quot;),<br>                                name.id = &quot;Id&quot;,<br>                                name.date = &quot;Cov.Date&quot;)</pre><p>and then optimized:</p><pre># Estimate the PNBD with Covariates (This takes a while (!))<br>est.pnbd.dyn &lt;- pnbd(clv.dyn)</pre><p>The latter optimization took around 9 minutes on my <a href="https://www.digitec.ch/en/s1/product/microsoft-surface-studio-2-intel-core-i7-7820hq-16-gb-1000-gb-ssd-pc-10247238">Surface Studio 2</a>, despite this being a relatively small dataset. Now it takes only <strong>24 seconds</strong>!! This is an impressive 96% decrease in computation time.</p><h4>New Plotting Capabilities</h4><p>Models are nice, especially when patterns cannot be easily spotted. However, sometimes a picture is worth a thousand models. Thus there are some further plotting capabilities in the new version. In particular, one can now plot the transaction behavior of the dataset beforehand:</p><pre>plot(clv.dyn, which=&#39;timings&#39;)</pre><p>This results in the following nice overview of the transactions of each customer (each line is a customer over time and each dot represents a transaction).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*1AVRax2zBH_4xdcu_NBnUw.png" /></figure><p>In particular, this provides a nice overview of what the (extended) Pareto/NBD model tries to model: the purchase behavior of each customer and how likely it is that the customer has actually left the firm (i.e. attrition).</p><h4>Conclusion</h4><p>This article briefly discussed the new features of CLVTools 0.10.0. The most important change is the Rcpp implementation of the flagship model (the extended Pareto/NBD model). 
More detailed information on new features and the package itself can be found <a href="https://www.clvtools.com/">here</a>.</p><p>The package is already widely used, but we hope that this improved implementation, new formula interface and further plotting capabilities will lead to an even more widespread adoption of the package.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=8a943a856743" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Random Forests and Missing Values]]></title>
            <link>https://medium.com/data-science/random-forests-and-missing-values-3daaea103db0?source=rss-ca780798011a------2</link>
            <guid isPermaLink="false">https://medium.com/p/3daaea103db0</guid>
            <category><![CDATA[missing-data]]></category>
            <category><![CDATA[random-forest]]></category>
            <category><![CDATA[exploratory-data-analysis]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Jeffrey Näf]]></dc:creator>
            <pubDate>Wed, 21 Jun 2023 13:51:30 GMT</pubDate>
            <atom:updated>2023-10-31T09:02:48.928Z</atom:updated>
<content:encoded><![CDATA[<h4>There is a very Intriguing Practical Fix</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*C4w-xgklPtkzG8Gs48c9fQ.png" /><figcaption>Features of (Distributional) Random Forests. In this article: The ability to deal with missing values. Source: Author.</figcaption></figure><p>Outside of some excessively cleaned data sets that one finds online, missing values are everywhere. In fact, the more complex and large the dataset, the more likely it is that missing values are present. Missing values are a fascinating field of statistical research, but in practice they are often a nuisance.</p><p>If you deal with a prediction problem where you want to predict a variable <em>Y</em> from <em>p-</em>dimensional covariates <strong><em>X</em></strong><em>=(X_1,…,X_p)</em> and you face missing values in <strong><em>X</em></strong>, there is an interesting solution for tree-based methods. This method is actually rather old but appears to work remarkably well in a wide range of data sets. I am talking about the “missing incorporated in attributes” criterion (MIA; [1]). While there are many good articles about missing values (such as <a href="https://medium.com/@vinitasilaparasetty/guide-to-handling-missing-values-in-data-science-37d62edbfdc1">this one</a>), this powerful approach seems somewhat underused. In particular, one does not need to impute, delete or predict the missing values in any way, but can instead just run the prediction as if the data were fully observed.</p><p>I will quickly explain how the method itself works, and then present an example with the distributional random forest (DRF) explained <a href="https://medium.com/towards-data-science/drf-a-random-forest-for-almost-everything-625fa5c3bcb8">here</a>. I chose DRF because it is a very general version of Random Forest (in particular, it can also be used to predict a random vector <strong><em>Y</em></strong>) and because I am somewhat biased here. MIA is actually implemented for the generalized random forest (<a href="https://grf-labs.github.io/grf/">GRF</a>), which covers a wide range of forest implementations. In particular, since the implementation of DRF on <a href="https://cran.r-project.org/web/packages/drf/index.html">CRAN</a> is based on GRF, after a slight modification, it can use the MIA method as well.</p><p>Of course, be aware that this is a quick fix that (as far as I know) has no theoretical guarantees. Depending on the missingness mechanism, it might heavily bias the analysis. On the other hand, most commonly used methods for dealing with missing values don’t have any theoretical guarantees either, or are outright known to bias the analysis, and, at least empirically, MIA appears to work well.</p><h3>How it works</h3><p>Recall that in a RF, splits are built of the form <em>X_j &lt; S </em>or <em>X_j ≥ S</em>, for a dimension <em>j=1,…,p</em>. To find this split value <em>S</em>, the algorithm optimizes some kind of criterion on the <em>Y</em>’s, for example the CART criterion. Thus the observations are successively divided through decision rules that depend on <strong><em>X</em></strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*mS1TtqR8KCg8FUPfnd_AXg.png" /><figcaption>Illustration of the splitting done in a RF. 
Image by author.</figcaption></figure><p>The original paper explains it a bit confusingly, but as far as I understand MIA works as follows: Let us consider a sample (<em>Y_1</em>, <strong><em>X</em></strong><em>_1),…, (Y_n, </em><strong><em>X</em></strong><em>_n), with</em></p><p><strong><em>X</em></strong><em>_i=(X_i1,…,X_ip)’.</em></p><p>Splitting without missing values is just looking for the value <em>S</em> as above and then throwing all <em>Y_i</em> with <em>X_ij &lt; S</em> in Node 1 and all <em>Y_i</em> with <em>X_ij ≥ S</em> in Node 2. Calculating the target criterion such as CART for each value <em>S</em>, we can choose the best one. With missing values there are instead 3 options for every candidate split value <em>S</em> to consider:</p><ul><li>Use the usual rule for all observations <em>i</em> such that <em>X_ij</em> is observed and send <em>i</em> to Node 1 if <em>X_ij </em>is missing.</li><li>Use the usual rule for all observations i such that <em>X_ij</em> is observed and send <em>i</em> to Node 2 if <em>X_ij</em> is missing.</li><li>Ignore the usual rule and just send i to Node 1 if <em>X_ij</em> is missing and to Node 2 if it is observed.</li></ul><p>Which of these rules to follow is again decided according to the criterion on <em>Y_i</em> we use.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*cgJYF-qaAWARr4jITM_LZA.png" /><figcaption>Illustration of how I understand the MIA procedure. Given observations in the parent node, we are looking for the best split value S. For each split value we consider the 3 options and try until we find the minimum. The sets {} on the left indicate the observations i that get sent to the left or the right. Image by author.</figcaption></figure><h3>A Small Example</h3><p>It needs to be mentioned at this point that the drf package on <a href="https://cran.r-project.org/web/packages/drf/index.html">CRAN </a>is not yet updated with the newest methodology. There will be a point in the future where all of this is implemented in one package on CRAN(!) However, for the moment, there are two versions:</p><p>If you want to use the <strong>fast drf implementation</strong> with missing values (<strong>without</strong> confidence intervals), you can use the “drfown” function attached at the end of this article. This code is adapted from</p><p><a href="https://github.com/lorismichel/drf">lorismichel/drf: Distributional Random Forests (Cevid et al., 2020) (github.com)</a></p><p>If on the other hand, you want <strong>confidence intervals </strong>with your parameters, use this (slower) code</p><p><a href="https://github.com/JeffNaef/drfinference/blob/main/drf-foo.R">drfinference/drf-foo.R at main · JeffNaef/drfinference (github.com)</a></p><p>In particular, drf-foo.R contains all you need in the latter case.</p><p>We will focus on the slower code with confidence intervals, as explained in <a href="https://medium.com/towards-data-science/inference-for-distributional-random-forests-64610bbb3927">this article</a> and also consider the same example as in said article:</p><pre>set.seed(2)<br><br>n&lt;-2000<br>beta1&lt;-1<br>beta2&lt;--1.8<br><br><br># Model Simulation<br>X&lt;-mvrnorm(n = n, mu=c(0,0), Sigma=matrix(c(1,0.7,0.7,1), nrow=2,ncol=2))<br>u&lt;-rnorm(n=n, sd = sqrt(exp(X[,1])))<br>Y&lt;- matrix(beta1*X[,1] + beta2*X[,2] + u, ncol=1)</pre><p>Note that this is a heteroskedastic linear model with <em>p=2 </em>and with the variance of the error term depending on the <em>X_1</em> values. 
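</p><p>As a small, purely illustrative check (not part of the original code): this model implies a true conditional mean of beta1*x_1 + beta2*x_2 and a true conditional variance of exp(x_1), which is what the estimates below can be compared against. Note also that the simulation snippet above assumes the MASS package is loaded for mvrnorm.</p><pre># Ground truth implied by the simulation, at the test point x=(1,1) used further below<br>x_true &lt;- c(1, 1)<br>true_mean &lt;- beta1*x_true[1] + beta2*x_true[2]  # = -0.8<br>true_var  &lt;- exp(x_true[1])                     # = exp(1), roughly 2.72, since Var(u | X) = exp(X_1)<br>c(true_mean, true_var)</pre><p>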
Now we also add missing values to <em>X_1</em> in a Missing at Random (MAR) fashion:</p><pre>prob_na &lt;- 0.3<br>X[, 1] &lt;- ifelse(X[, 2] &lt;= -0.2 &amp; runif(n) &lt; prob_na, NA, X[, 1]) </pre><p>This means that <em>X_1</em> is missing with a probability of 0.3, whenever <em>X_2</em> has a value smaller than -0.2. Thus the probability of <em>X_1</em> being missing depends on <em>X_2</em>, which is what is referred to as “Missing at Random”. This is already a complex situation and there is information to be gained by looking at the pattern of missing values. That is, the missingness is not “Missing Completely at Random (MCAR)”, because the missingness of <em>X_1 </em>depends on the value of <em>X_2. </em>This in turn means that the distribution of <em>X_2</em> we draw from is different, conditional on whether <em>X_1</em> is missing or not. This in particular means that deleting the rows with missing values might severely bias the analysis.</p><p>We now fix<em> </em><strong><em>x</em></strong> and estimate the conditional expectation and variance given <strong><em>X</em></strong><em>=</em><strong><em>x, </em></strong>exactly as in the <a href="https://medium.com/towards-data-science/inference-for-distributional-random-forests-64610bbb3927">previous article</a>.</p><pre># Choose an x that is not too far out<br>x&lt;-matrix(c(1,1),ncol=2)<br><br># Choose alpha for CIs<br>alpha&lt;-0.05</pre><p>We then also fit DRF and predict the weights for the test point <strong><em>x</em> </strong>(which corresponds to predicting the conditional distribution of <em>Y|</em><strong><em>X</em></strong><em>=</em><strong><em>x</em></strong>):</p><pre>## Fit the new DRF framework<br>drf_fit &lt;- drfCI(X=X, Y=Y, min.node.size = 5, splitting.rule=&#39;FourierMMD&#39;, num.features=10, B=100)<br><br>## predict weights<br>DRF = predictdrf(drf_fit, x=x)<br>weights &lt;- DRF$weights[1,]</pre><h4>Example 1: Conditional Expectation</h4><p>We first estimate the conditional expectation of <em>Y|</em><strong><em>X</em></strong><em>=</em><strong><em>x.</em></strong></p><pre># Estimate the conditional expectation at x:<br>condexpest&lt;- sum(weights*Y)<br><br># Use the distribution of weights, see below<br>distofcondexpest&lt;-unlist(lapply(DRF$weightsb, function(wb)  sum(wb[1,]*Y)  ))<br><br># Can either use the above directly to build confidence interval, or can use the normal approximation.<br># We will use the latter<br>varest&lt;-var(distofcondexpest-condexpest)<br><br># build 95%-CI<br>lower&lt;-condexpest - qnorm(1-alpha/2)*sqrt(varest)<br>upper&lt;-condexpest + qnorm(1-alpha/2)*sqrt(varest)<br>round(c(lower, condexpest, upper),2)<br><br># without NAs: (-1.00, -0.69 -0.37)<br># with NAs: (-1.15, -0.67, -0.19)</pre><p>Remarkably, the values obtained with NAs are very close to the ones from the first analysis without NAs in the <a href="https://medium.com/towards-data-science/inference-for-distributional-random-forests-64610bbb3927">previous article</a>! This really is quite astounding to me, as this missing mechanism is not easy to deal with. 
Interestingly, the estimated variance of the estimator also doubles, from around 0.025 without missing values to roughly 0.06 with missing values.</p><p>The truth is given as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*6lLvKMR3odPPrYXvzvDjOQ.png" /></figure><p>so we have a slight error, but the confidence intervals contain the truth, as they should.</p><p>The result looks similar for a more complex target, like the conditional variance:</p><pre># Estimate the conditional expectation at x:<br>condvarest&lt;- sum(weights*Y^2) - condexpest^2<br><br>distofcondvarest&lt;-unlist(lapply(DRF$weightsb, function(wb)  {<br>  sum(wb[1,]*Y^2) - sum(wb[1,]*Y)^2<br>} ))<br><br># Can either use the above directly to build confidence interval, or can use the normal approximation.<br># We will use the latter<br>varest&lt;-var(distofcondvarest-condvarest)<br><br># build 95%-CI<br>lower&lt;-condvarest - qnorm(1-alpha/2)*sqrt(varest)<br>upper&lt;-condvarest + qnorm(1-alpha/2)*sqrt(varest)<br><br>c(lower, condvarest, upper)<br><br># without NAs: (1.89, 2.65, 3.42)<br># with NAs: (1.79, 2.74, 3.69)</pre><p>Here the difference in the estimated values is a bit larger. As the truth is given as</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*fkRz_h_mF6POPYmtxkF0ug.png" /></figure><p>the estimate with NAs is even slightly more accurate (though of course this is likely just randomness). Again the variance estimate of the (variance) estimator increases with missing values, from 0.15 (no missing values) to 0.23.</p><h3>Conclusion</h3><p>In this article, we discussed MIA, which is an adaptation of the splitting method in Random Forest to deal with missing values. Since it is implemented in GRF and DRF, it can be used broadly and the small example we looked at indicates that it works remarkably well.</p><p>However, I’d like to note again that there is no theoretical guarantee for consistency or for the confidence intervals to make sense, even for a very large number of datapoints. The reason for missing values are numerous and one has to be very careful to not bias one’s analysis through a careless handling of this issue. The MIA method is by no means a well-understood fix for this problem. However, it seems to be a reasonable quick fix for the moment, that appears to be able to make some use of the pattern of missingness in the data. 
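</p><p>As a purely illustrative skeleton (none of this is from the original analysis), such a comparison could start out as follows, pitting MIA (handing the NAs directly to the forest via the drfown function below) against simply deleting incomplete rows:</p><pre>library(MASS)   # for mvrnorm<br>set.seed(1)<br><br>nrep &lt;- 10   # keep small, this is slow<br>res &lt;- t(sapply(1:nrep, function(r){<br>  X &lt;- mvrnorm(n = 2000, mu = c(0,0), Sigma = matrix(c(1,0.7,0.7,1), 2, 2))<br>  Y &lt;- matrix(X[,1] - 1.8*X[,2] + rnorm(2000, sd = sqrt(exp(X[,1]))), ncol = 1)<br>  X[,1] &lt;- ifelse(X[,2] &lt;= -0.2 &amp; runif(2000) &lt; 0.3, NA, X[,1])<br>  x &lt;- matrix(c(1,1), ncol = 2)<br><br>  # (a) MIA: fit directly on the data containing NAs<br>  fit_mia &lt;- drfown(X = X, Y = Y)<br>  est_mia &lt;- sum(predict(fit_mia, newdata = x)$weights[1,]*Y)<br><br>  # (b) complete-case analysis: drop all rows where X_1 is missing<br>  cc &lt;- !is.na(X[,1])<br>  fit_cc &lt;- drfown(X = X[cc, , drop = FALSE], Y = Y[cc, , drop = FALSE])<br>  est_cc &lt;- sum(predict(fit_cc, newdata = x)$weights[1,]*Y[cc, ])<br><br>  c(MIA = est_mia, complete.case = est_cc)<br>}))<br>colMeans(res)   # both estimate E[Y | X = (1,1)], whose true value is -0.8</pre><p>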
If somebody does/has a more extensive simulation analysis I would be curious about the results.</p><h3>Code</h3><pre>require(drf)            <br>             <br>drfown &lt;-               function(X, Y,<br>                              num.trees = 500,<br>                              splitting.rule = &quot;FourierMMD&quot;,<br>                              num.features = 10,<br>                              bandwidth = NULL,<br>                              response.scaling = TRUE,<br>                              node.scaling = FALSE,<br>                              sample.weights = NULL,<br>                              sample.fraction = 0.5,<br>                              mtry = min(ceiling(sqrt(ncol(X)) + 20), ncol(X)),<br>                              min.node.size = 15,<br>                              honesty = TRUE,<br>                              honesty.fraction = 0.5,<br>                              honesty.prune.leaves = TRUE,<br>                              alpha = 0.05,<br>                              imbalance.penalty = 0,<br>                              compute.oob.predictions = TRUE,<br>                              num.threads = NULL,<br>                              seed = stats::runif(1, 0, .Machine$integer.max),<br>                              compute.variable.importance = FALSE) {<br>  <br>  # initial checks for X and Y<br>  if (is.data.frame(X)) {<br>    <br>    if (is.null(names(X))) {<br>      stop(&quot;the regressor should be named if provided under data.frame format.&quot;)<br>    }<br>    <br>    if (any(apply(X, 2, class) %in% c(&quot;factor&quot;, &quot;character&quot;))) {<br>      any.factor.or.character &lt;- TRUE<br>      X.mat &lt;- as.matrix(fastDummies::dummy_cols(X, remove_selected_columns = TRUE))<br>    } else {<br>      any.factor.or.character &lt;- FALSE<br>      X.mat &lt;- as.matrix(X)<br>    }<br>    <br>    mat.col.names.df &lt;- names(X)<br>    mat.col.names &lt;- colnames(X.mat)<br>  } else {<br>    X.mat &lt;- X<br>    mat.col.names &lt;- NULL<br>    mat.col.names.df &lt;- NULL<br>    any.factor.or.character &lt;- FALSE<br>  }<br>  <br>  if (is.data.frame(Y)) {<br>    <br>    if (any(apply(Y, 2, class) %in% c(&quot;factor&quot;, &quot;character&quot;))) {<br>      stop(&quot;Y should only contain numeric variables.&quot;)<br>    }<br>    Y &lt;- as.matrix(Y)<br>  }<br>  <br>  if (is.vector(Y)) {<br>    Y &lt;- matrix(Y,ncol=1)<br>  }<br>  <br>  <br>  #validate_X(X.mat)<br>  <br>  if (inherits(X, &quot;Matrix&quot;) &amp;&amp; !(inherits(X, &quot;dgCMatrix&quot;))) {<br>        stop(&quot;Currently only sparse data of class &#39;dgCMatrix&#39; is supported.&quot;)<br>    }<br>  <br>  drf:::validate_sample_weights(sample.weights, X.mat)<br>  #Y &lt;- validate_observations(Y, X)<br>  <br>  # set legacy GRF parameters<br>  clusters &lt;- vector(mode = &quot;numeric&quot;, length = 0)<br>  samples.per.cluster &lt;- 0<br>  equalize.cluster.weights &lt;- FALSE<br>  ci.group.size &lt;- 1<br>  <br>  num.threads &lt;- drf:::validate_num_threads(num.threads)<br>  <br>  all.tunable.params &lt;- c(&quot;sample.fraction&quot;, &quot;mtry&quot;, &quot;min.node.size&quot;, &quot;honesty.fraction&quot;,<br>                          &quot;honesty.prune.leaves&quot;, &quot;alpha&quot;, &quot;imbalance.penalty&quot;)<br>  <br>  # should we scale or not the data<br>  if (response.scaling) {<br>    Y.transformed &lt;- scale(Y)<br>  } else {<br>    Y.transformed &lt;- Y<br>  }<br>  <br>  data &lt;- drf:::create_data_matrices(X.mat, outcome = 
Y.transformed, sample.weights = sample.weights)<br>  <br>  # bandwidth using median heuristic by default<br>  if (is.null(bandwidth)) {<br>    bandwidth &lt;- drf:::medianHeuristic(Y.transformed)<br>  }<br>  <br>  <br>  args &lt;- list(num.trees = num.trees,<br>               clusters = clusters,<br>               samples.per.cluster = samples.per.cluster,<br>               sample.fraction = sample.fraction,<br>               mtry = mtry,<br>               min.node.size = min.node.size,<br>               honesty = honesty,<br>               honesty.fraction = honesty.fraction,<br>               honesty.prune.leaves = honesty.prune.leaves,<br>               alpha = alpha,<br>               imbalance.penalty = imbalance.penalty,<br>               ci.group.size = ci.group.size,<br>               compute.oob.predictions = compute.oob.predictions,<br>               num.threads = num.threads,<br>               seed = seed,<br>               num_features = num.features,<br>               bandwidth = bandwidth,<br>               node_scaling = ifelse(node.scaling, 1, 0))<br>  <br>  if (splitting.rule == &quot;CART&quot;) {<br>    ##forest &lt;- do.call(gini_train, c(data, args))<br>    forest &lt;- drf:::do.call.rcpp(drf:::gini_train, c(data, args))<br>    ##forest &lt;- do.call(gini_train, c(data, args))<br>  } else if (splitting.rule == &quot;FourierMMD&quot;) {<br>    forest &lt;- drf:::do.call.rcpp(drf:::fourier_train, c(data, args))<br>  } else {<br>    stop(&quot;splitting rule not available.&quot;)<br>  }<br>  <br>  class(forest) &lt;- c(&quot;drf&quot;)<br>  forest[[&quot;ci.group.size&quot;]] &lt;- ci.group.size<br>  forest[[&quot;X.orig&quot;]] &lt;- X.mat<br>  forest[[&quot;is.df.X&quot;]] &lt;- is.data.frame(X)<br>  forest[[&quot;Y.orig&quot;]] &lt;- Y<br>  forest[[&quot;sample.weights&quot;]] &lt;- sample.weights<br>  forest[[&quot;clusters&quot;]] &lt;- clusters<br>  forest[[&quot;equalize.cluster.weights&quot;]] &lt;- equalize.cluster.weights<br>  forest[[&quot;tunable.params&quot;]] &lt;- args[all.tunable.params]<br>  forest[[&quot;mat.col.names&quot;]] &lt;- mat.col.names<br>  forest[[&quot;mat.col.names.df&quot;]] &lt;- mat.col.names.df<br>  forest[[&quot;any.factor.or.character&quot;]] &lt;- any.factor.or.character<br>  <br>  if (compute.variable.importance) {<br>    forest[[&#39;variable.importance&#39;]] &lt;- variableImportance(forest, h = bandwidth)<br>  }<br>  <br>  forest<br>}</pre><h4><strong><em>Citations</em></strong></h4><p>[1] Twala, B. E. T. H., M. C. Jones, and David J. Hand. Good methods for coping with missing data in decision trees. <em>Pattern Recognition Letters 29</em>,2008.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=3daaea103db0" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/random-forests-and-missing-values-3daaea103db0">Random Forests and Missing Values</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Studying the Gender Wage Gap in the US Using Distributional Random Forests]]></title>
            <link>https://medium.com/data-science/studying-the-gender-wage-gap-in-the-us-using-distributional-random-forests-ec4c2a69abf0?source=rss-ca780798011a------2</link>
            <guid isPermaLink="false">https://medium.com/p/ec4c2a69abf0</guid>
            <category><![CDATA[wage-gap]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[random-forest]]></category>
            <category><![CDATA[data-for-change]]></category>
            <dc:creator><![CDATA[Jeffrey Näf]]></dc:creator>
            <pubDate>Sat, 18 Feb 2023 00:35:04 GMT</pubDate>
            <atom:updated>2024-10-22T14:40:34.133Z</atom:updated>
            <content:encoded><![CDATA[<h4>An example of a real data analysis with DRF</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*bOIuAfgCFLGVyzBB" /><figcaption>Photo by <a href="https://unsplash.com/@mettyunuabona?utm_source=medium&amp;utm_medium=referral">Ehimetalor Akhere Unuabona</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>In two previous articles, I explained <a href="https://towardsdatascience.com/drf-a-random-forest-for-almost-everything-625fa5c3bcb8">Distributional Random Forests (DRFs)</a>, a Random Forest that is able to estimate conditional distributions, as well as an <a href="https://medium.com/@jeffrey_85949/inference-for-distributional-random-forests-64610bbb3927">extension of the method</a> that allows for uncertainty quantification, like Confidence Intervals, etc. Here I present an example of a real-world application to wage data from the 2018 American Community Survey by the US Census Bureau. In the first <a href="https://www.jmlr.org/papers/v23/21-0585.html">DRF paper</a>, we obtained data on approximately 1 million full-time employees from the 2018 American Community Survey by the US Census Bureau from which we extracted the salary information and all covariates that might be relevant for salaries. This wealth of data is ideal to experiment with a method like DRF (in fact we will only use a tiny subset for this analysis).</p><p>When one studies the raw data on hourly wages, there is a consistent gap between the two genders, in that men tend to earn more. An interesting question is whether the observed gap in hourly wages (<em>W</em>) of men (<em>G=1</em>) and women (<em>G=0</em>) is due to gender alone or whether it can be explained by some other confounding variables <strong><em>X</em></strong>, which are influenced by gender and in turn influence wage. That is, we want to study the effect size corresponding to the bold arrow in the following causal graph:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/704/1*bsxOHQIkDIpC2e516MkYRw.png" /><figcaption>Assumed Causal Graph, G=Gender, W=Wage and <strong>X</strong> are confounders</figcaption></figure><p>For example, let’s assume that <strong><em>X</em></strong> only includes occupation and that women have a tendency to choose occupations that do not entail a high monetary reward, such as doctors, nurses, or teachers, while men tend to have professional gambling jobs with obscene hourly wages. If this alone were to drive the difference in hourly wages between genders, we would still see a wage gap when looking directly at the hourly wage data. However if we then fix the occupation (<strong><em>X</em></strong>) only to doctors and compare the two wage distributions there, any statistically significant difference can only come from gender alone.</p><p>We focus on a two-stage analysis:</p><ul><li>We fix <strong><em>X</em></strong> to a particular value and compare the distribution of wages in the two groups for the covariates fixed to <strong><em>X</em></strong><em>=</em><strong><em>x</em></strong>. This is interesting from two points of view: First, if <strong><em>X</em></strong> really includes all other factors that influence wages and are related to gender, then fixing <strong><em>X</em></strong><em>=</em><strong><em>x</em></strong> and looking at wages for both genders means we really observe the effect of Gender on wages. 
Second, it allows for the prediction of the whole wage distribution for an individual with given characteristics <strong><em>x</em></strong>.</li><li>We use the assumed causal graph above and the rules of causality to estimate a counterfactual distribution with DRF: The distribution of women’s wages had they been treated as men for setting the wage. If <strong><em>X</em> </strong>contains all the relevant covariates and there is no gender pay gap, this distribution should be the same as the wage distribution for men (neglecting statistical randomness).</li></ul><p>This article is the culmination of the work of several people: The code and the dataset were obtained from the original <a href="https://github.com/lorismichel/drf">DRF repository</a> and then combined with the methods developed in our new paper on <a href="https://arxiv.org/abs/2302.05761">arXiv</a>, written together with <a href="http://www.linkedin.com/in/corinne-rahel-emmenegger">Corinne Emenegger</a>.</p><p>Before going on, I want to point out that this is <em>only an example to illustrate the use of DRF</em>. I do not want to make any serious (causal) claims here, simply because the analysis is surely flawed in some regard and the causal graph we assume below is surely wrong. Moreover, we only use a tiny subset of the available data.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/931/1*eotwYHcwrPx5xTvd0a7Isg.png" /></figure><p>Also, note that the code is quite slow to run. This is because, while DRF itself is coded in C, the repeated fitting needed for Confidence Intervals is implemented in R so far.</p><p>That said, let’s dive in. In the following, all images, unless otherwise noted, are by the author.</p><h4>Data</h4><p>The PUMS (Public Use Microdata Area) data from the 2018 1-Year American Community Survey is obtained from the <a href="https://www.census.gov/content/dam/Census/data/developers/api-user-guide/api-guide.pdf">US Census Bureau API</a>. The survey is sent to ≈ 3.5 million people annually and aims to give more up-to-date data than the official census that is carried out every decade. The 2018 data set has about 3 million anonymized data points for the 51 states and the District of Columbia. For the DRF paper linked above, we retrieved only the subset of variables that might be relevant for the salaries, such as a person’s gender, age, race, person’s marital status, education level, and level of English knowledge.</p><p>The preprocessed data can be found <a href="https://github.com/lorismichel/drf/tree/master/applications/wage_data/data/datasets">here</a>. 
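</p><p>Before diving into the code, a short note on setup: the snippets below assume that the preprocessed data from the repository just linked has been loaded into a data.frame called wage, and that roughly the following packages are attached (the file name here is only a placeholder):</p><pre># Assumed setup, not shown explicitly in the article<br>library(drf)          # Distributional Random Forests<br>library(fastDummies)  # dummy_cols()<br>library(ggplot2)      # density plots<br>library(Hmisc)        # wtd.quantile()<br>library(kernlab)      # kernelMatrix(), rbfdot() used in the witness-function code<br><br>load(&quot;wage_data.Rdata&quot;)  # placeholder name; see the linked repository for the actual files</pre><p>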
We first do some further cleaning:</p><pre>##Further data cleaning ##<br><br>which = rep(TRUE, nrow(wage))<br>which = which &amp; (wage$age &gt;= 17)<br>which = which &amp; (wage$weeks_worked &gt; 48)<br>which = which &amp; (wage$hours_worked &gt; 16)<br>which = which &amp; (wage$employment_status == &#39;employed&#39;)<br>which = which &amp; (wage$employer != &#39;self-employed&#39;)<br>which[is.na(which)] = FALSE<br><br>data = wage[which, ]<br>sum(is.na(data))<br>colSums(is.na(data))<br>rownames(data) = 1:nrow(data)<br>#data = na.omit(data)<br><br>data$log_wage = log(data$salary / (data$weeks_worked * data$hours_worked))<br><br><br>## Prepare data and fit drf<br>## Define X and Y<br>X = data[,c(<br>  &#39;age&#39;,<br>  &#39;race&#39;,<br>  &#39;hispanic_origin&#39;,<br>  &#39;citizenship&#39;,<br>  &#39;nativity&#39;, <br>  &#39;marital&#39;,<br>  &#39;family_size&#39;,<br>  &#39;children&#39;,<br>  &#39;education_level&#39;,<br>  &#39;english_level&#39;,<br>  &#39;economic_region&#39;<br>)]<br>X$occupation = unlist(lapply(as.character(data$occupation), function(s){return(substr(s, 1, 2))}))<br>X$occupation = as.factor(X$occupation)<br>X$industry = unlist(lapply(as.character(data$industry), function(s){return(substr(s, 1, 2))}))<br>X$industry[X$industry %in% c(&#39;32&#39;, &#39;33&#39;, &#39;3M&#39;)] = &#39;31&#39;<br>X$industry[X$industry %in% c(&#39;42&#39;)] = &#39;41&#39;<br>X$industry[X$industry %in% c(&#39;45&#39;, &#39;4M&#39;)] = &#39;44&#39;<br>X$industry[X$industry %in% c(&#39;49&#39;)] = &#39;48&#39;<br>X$industry[X$industry %in% c(&#39;92&#39;)] = &#39;91&#39;<br>X$industry = as.factor(X$industry)<br>X=dummy_cols(X, remove_selected_columns = TRUE)<br>X = as.matrix(X)<br><br>Y = data[,c(&#39;sex&#39;, &#39;log_wage&#39;)]<br>Y$sex = (Y$sex == &#39;male&#39;)<br>Y = as.matrix(Y)</pre><p>These are actually way more observations than we need, and we instead subsample 4&#39;000 training data points at random for the analysis here.</p><pre>train_idx = sample(1:nrow(data), 4000, replace = FALSE)<br><br>## Focus on training data<br>Ytrain=Y[train_idx,]<br>Xtrain=X[train_idx,]</pre><p>Again, this is because it is just an illustration— in reality, you would want to take as many data points as you can get. 
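</p><p>One detail that is easy to miss: the counterfactual analysis further below also uses a held-out test sample (Xtest, Ytest), which is not defined in the snippets shown here. A simple choice (an assumption on my part; any held-out subset will do) is to take it from rows not used for training:</p><pre>test_idx &lt;- sample(setdiff(1:nrow(data), train_idx), 2000, replace = FALSE)<br><br>Xtest &lt;- X[test_idx,]<br>Ytest &lt;- Y[test_idx,]</pre><p>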
The estimated wage densities of the two sexes for these 4&#39;000 data points are plotted in Figure 1, with this code:</p><pre>## Plot the test data without adjustment<br>plotdfunadj = data[train_idx, ]<br>plotdfunadj$weight=1<br>plotdfunadj$plotweight[plotdfunadj$sex==&#39;female&#39;] = plotdfunadj$weight[plotdfunadj$sex==&#39;female&#39;]/sum(plotdfunadj$weight[plotdfunadj$sex==&#39;female&#39;])<br>plotdfunadj$plotweight[plotdfunadj$sex==&#39;male&#39;] = plotdfunadj$weight[plotdfunadj$sex==&#39;male&#39;]/sum(plotdfunadj$weight[plotdfunadj$sex==&#39;male&#39;])<br><br>#pooled data<br>ggplot(plotdfunadj, aes(log_wage)) +<br>  geom_density(adjust=2.5, alpha = 0.3, show.legend = TRUE,  aes(fill=sex, weight=plotweight)) +<br>  theme_light()+<br>  scale_fill_discrete(name = &quot;gender&quot;, labels = c(&#39;female&#39;, &quot;male&quot;))+<br>  theme(legend.position = c(0.83, 0.66),<br>        legend.text=element_text(size=18),<br>        legend.title=element_text(size=20),<br>        legend.background = element_rect(fill=alpha(&#39;white&#39;, 0.5)),<br>        axis.text.x = element_text(size=14),<br>        axis.text.y = element_text(size=14),<br>        axis.title.x = element_text(size=19),<br>        axis.title.y = element_text(size=19))+<br>  labs(x=&#39;log(hourly_wage)&#39;)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*MlvPk7mzKJYmgg7xQhbTpQ.png" /><figcaption>Estimated densities for the (unconditional) raw log of the hourly wages</figcaption></figure><p>Calculating the percentage median difference between the two wages, that is</p><p>(median wage men- median wage women)/(median wage women) *100,</p><p>we obtain around 18 percent. That is, the median salary of men is 18 percent higher than that of women in the unadjusted data (!)</p><pre>## Median Difference before adjustment!<br>quantile_maleunadj = wtd.quantile(x=plotdfunadj$log_wage, weights=plotdfunadj$plotweight*(plotdfunadj$sex==&#39;male&#39;), normwt=TRUE, probs=0.5)<br>quantile_femaleunadj = wtd.quantile(x=plotdfunadj$log_wage, weights=plotdfunadj$plotweight*(plotdfunadj$sex==&#39;female&#39;), normwt=TRUE, probs=0.5)<br>(1-exp(quantile_femaleunadj)/exp(quantile_maleunadj))</pre><h4>Analysis</h4><p>The question now becomes whether this is truly “unfair”. That is, we assume the causal graph above, where Gender (<em>G</em>) influences both Wage (<em>W</em>), as well as covariates <strong><em>X</em></strong> that in turn influence <em>W</em>. What we would like to know is whether Gender directly influences Wage (the bold arrow). That is, if a woman and a man with the exact same characteristics <strong><em>X</em></strong><em>=</em><strong><em>x</em></strong> get the same wage, or whether she gets less, simply because of her gender.</p><p>We will study this in two settings. The first one is to truly hold <strong><em>X</em></strong><em>=</em><strong><em>x</em></strong> fixed and to use the machinery explained in the <a href="https://medium.com/@jeffrey_85949/inference-for-distributional-random-forests-64610bbb3927">earlier article</a>. 
Intuitively, if we fix all other covariates that could influence wage besides gender, and now compare the two wage distributions, then any observed difference must be due to gender alone.</p><p>The second one tries to quantify this difference over all possible values of <strong>X</strong>. This is done by calculating the counterfactual distribution of</p><p>W(male, <strong>X</strong>(female)).</p><p>This quantity is the counterfactual wage a man gets if he has exactly the characteristics of a woman. That is, we ask for the wage a woman gets if treated like a man.</p><p>Note that this assumes that the causal graph above is correct. In particular, it assumes that <strong><em>X</em> </strong>captures all relevant factors besides gender that would determine the wage. It could very well be that this is not the case, thus the disclaimer at the beginning of this article.</p><h4>Studying the conditional distributional differences</h4><p>In the following, we fix <strong><em>x</em></strong> to an arbitrary point:</p><pre>i&lt;-47<br><br># Important: Test point needs to be a matrix<br>test_point&lt;-X[i,, drop=F]</pre><p>The following picture shows some of the values contained in this test point <strong><em>x</em> </strong>— we are looking at childcare workers with high school diplomas who are married and have 1 child. With DRF we can estimate and plot the densities for the two groups conditional on <strong><em>X</em></strong><em>=</em><strong><em>x</em></strong>:</p><pre># Load all relevant functions (the CIdrf.R file can be found at the end of this <br># article)<br>source(&#39;CIdrf.R&#39;)<br><br><br>## Fit the new DRF framework (I forgot to include this in an earlier<br>## version of the article, apologies)<br>drf_fit &lt;- drfCI(X=Xtrain, Y=Ytrain, min.node.size = 20, splitting.rule=&#39;FourierMMD&#39;, num.features=10, B=100)<br><br># predict with the new framework<br>DRF = predictdrf(drf_fit, x=test_point)<br>weights &lt;- DRF$weights<br><br><br>## Conditional Density Plotting<br>plotdfx = data[train_idx, ]<br><br>propensity = sum(weights[plotdfx$sex==&#39;female&#39;])<br>plotdfx$plotweight = 0<br>plotdfx$plotweight[plotdfx$sex==&#39;female&#39;] = weights[plotdfx$sex==&#39;female&#39;]/propensity<br>plotdfx$plotweight[plotdfx$sex==&#39;male&#39;] = weights[plotdfx$sex==&#39;male&#39;]/(1-propensity)<br><br>gg = ggplot(plotdfx, aes(log_wage)) +<br>  geom_density(adjust=5, alpha = 0.3, show.legend=TRUE,  aes(fill=sex, weight=plotweight)) +<br>  labs(x=&#39;log(hourly wage)&#39;)+<br>  theme_light()+<br>  scale_fill_discrete(name = &quot;gender&quot;, labels = c(sprintf(&quot;F: %g%%&quot;, round(100*propensity, 1)), sprintf(&quot;M: %g%%&quot;, round(100*(1-propensity), 1))))+<br>  theme(legend.position = c(0.9, 0.65),<br>        legend.text=element_text(size=18),<br>        legend.title=element_text(size=20),<br>        legend.background = element_rect(fill=alpha(&#39;white&#39;, 0)),<br>        axis.text.x = element_text(size=14),<br>        axis.text.y = element_text(size=14),<br>        axis.title.x = element_text(size=19),<br>        axis.title.y = element_text(size=19))+<br>  annotate(&quot;text&quot;, x=-1, y=Inf, hjust=0, vjust=1, size=5, label = point_description(data[i,]))<br>plot(gg)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*2qdUc7JqIyKddSSste0eQg.png" /><figcaption>Estimated density of the log(hourly wage) for the two genders given <strong>X</strong>=<strong>x</strong>. 
The code for this plot can be found at the end of the article.</figcaption></figure><p>In this plot, there appears to be a clear difference in wages, even in the case of this fixed <strong><em>x</em></strong> (remember, all the assumed confounders are fixed in this case, so we really just compare wages directly). With DRF we now estimate and test the <em>median difference</em></p><pre>## Getting the respective weights<br>weightsmale&lt;-weights*(Ytrain[, &quot;sex&quot;]==1)/sum(weights*(Ytrain[, &quot;sex&quot;]==1))<br>weightsfemale&lt;-weights*(Ytrain[, &quot;sex&quot;]==0)/sum(weights*(Ytrain[, &quot;sex&quot;]==0))<br><br><br>## Choosing alpha:<br>alpha&lt;-0.05<br><br><br># Step 1: Doing Median comparison for fixed x<br><br>quantile_male = wtd.quantile(x=data$log_wage[train_idx], weights=matrix(weightsmale), normwt=TRUE, probs=0.5)<br>quantile_female = wtd.quantile(x=data$log_wage[train_idx], weights=matrix(weightsfemale), normwt=TRUE, probs=0.5)<br><br>(medianx&lt;-unname(1-exp(quantile_female)/exp(quantile_male)))<br><br><br>mediandist &lt;- sapply(DRF$weightsb, function(wb) {<br>  <br>  wbmale&lt;-wb*(Ytrain[, &quot;sex&quot;]==1)/sum(wb*(Ytrain[, &quot;sex&quot;]==1))<br>  wbfemale&lt;-wb*(Ytrain[, &quot;sex&quot;]==0)/sum(wb*(Ytrain[, &quot;sex&quot;]==0))<br>  <br>  <br>  quantile_maleb = wtd.quantile(x=data$log_wage[train_idx], weights=matrix(wbmale), normwt=TRUE, probs=0.5)<br>  quantile_femaleb = wtd.quantile(x=data$log_wage[train_idx], weights=matrix(wbfemale), normwt=TRUE, probs=0.5)<br>  <br>  <br>  return( unname(1-exp(quantile_femaleb)/exp(quantile_maleb)) ) <br>})<br><br>varx&lt;-var(mediandist)<br><br>## Use Gaussian CI:<br>(upper&lt;-medianx + qnorm(1-alpha/2)*sqrt(varx))<br>(lower&lt;-medianx - qnorm(1-alpha/2)*sqrt(varx))<br><br></pre><p>This gives a confidence interval of the median difference of</p><p>(0.06, 0.40) or (6%, 40%)</p><p>This interval very clearly does not contain zero and thus the median difference is indeed significant.</p><p>Using the Witobj function, we can make this difference better visible</p><pre><br>Witobj&lt;-Witdrf(drf_fit, x=test_point, groupingvar=&quot;sex&quot;, alpha=0.05)<br><br><br><br>hatmun&lt;-function(y,Witobj){<br>  <br>  c&lt;-Witobj$c<br>  k_Y&lt;-Witobj$k_Y<br>  Y&lt;-Witobj$Y<br>  weightsall1&lt;-Witobj$weightsall1<br>  weightsall0&lt;-Witobj$weightsall0<br>  Ky=t(kernelMatrix(k_Y, Y , y = y))<br>  <br>  out&lt;-list()<br>  out$val &lt;- tcrossprod(Ky, weightsall1  ) - tcrossprod(Ky, weightsall0  )<br>  out$upper&lt;-  out$val+sqrt(c)<br>  out$lower&lt;-  out$val-sqrt(c)<br>  <br>  return( out )<br>  <br>  <br>  <br>}<br><br>all&lt;-hatmun(sort(Witobj$Y),Witobj)<br><br><br>plot(sort(Witobj$Y),all$val , type=&quot;l&quot;, col=&quot;darkblue&quot;, lwd=2, ylim=c(min(all$lower), max(all$upper)),<br>     xlab=&quot;log(wage)&quot;, ylab=&quot;witness function&quot;)<br>lines(sort(Witobj$Y),all$upper , type=&quot;l&quot;, col=&quot;darkgreen&quot;, lwd=2 )<br>lines(sort(Witobj$Y),all$lower , type=&quot;l&quot;, col=&quot;darkgreen&quot;, lwd=2 )<br>abline(h=0)</pre><p>which leads to the Figure:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*yUemNKO16nTXsyNbkeVw4A.png" /><figcaption>Estimate of the conditional witness function for wage men minus wage women.</figcaption></figure><p>We refer to the <a href="https://medium.com/@jeffrey_85949/inference-for-distributional-random-forests-64610bbb3927">companion article</a> for a more detailed explanation of this concept. 
Essentially it can be thought of as</p><p><em>conditional density of log(wage) of men given </em><strong><em>x</em></strong><em> — conditional density of log(wage) of women given </em><strong><em>x</em></strong></p><p>That is, the conditional witness function shows where the density of one group is larger than the other, without having to actually estimate the density. In this example, negative values mean the density of women’s wages is higher than that of men conditional on <strong><em>x</em></strong>, and positive values mean the density of women’s wages is lower. Since we already estimated the conditional densities above, the conditional witness function alone does not add much more. But it’s good for illustration purposes. Indeed, we see that it is negative at the start, for values where the conditional density of women’s wages is higher than the conditional density of men’s wages. Vice-versa it turns positive for larger values, for which the conditional density of men’s wages is higher than for women. Thus the relevant information about the two densities is summarized in the witness function plot: We see that the density for women’s wages is higher for lower values of wages and lower for higher values, indicating that the density is shifted to the left and women earn less! Moreover, we can also provide 95% confidence bands in green that include the true function with 95%, <em>uniformly over all y</em>. (Though one really needs a lot of data to make this valid) Since this uniform confidence band does not contain the zero line between around 2 and 2.5, we see again that the difference between the two distributions is statistically significant.</p><p>Conditioning on a particular <strong><em>x</em></strong>, allows us to study individual effects in great detail and with a notion of uncertainty. However, it can also be interesting to study the overall effect. We do this by estimating the counterfactual distribution, in the next section.</p><h4>Estimating the counterfactual distribution</h4><p>Using the calculation laws of causality on our assumed causal graph, it can be derived that:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*mQ0qES9kD2dFr2jYoL-yTg.png" /></figure><p>I.e. 
the distribution of the counterfactual we are looking for is obtained by averaging the conditional distribution of <em>W | G=male</em>, <strong><em>X</em></strong><em>=</em><strong><em>x</em></strong>, over the <strong><em>x</em></strong> of the gender female.</p><p>As the distributions are given as simple weights, this is easily done with DRF as follows:</p><pre>## Add code<br><br>## Male is 1, Female is 0<br><br># obtain all X from the female test population<br>Xtestf&lt;-Xtest[Ytest[,&quot;sex&quot;]==0,]<br><br><br># Obtain the conditional distribution of W | G=male, X=x, for x in the female<br># population.<br><br># These weights correspond to P(W, G=male | X=x  )<br>weightsf&lt;-predictdrf(drf_fit, x=Xtestf)$weights*(Ytrain[, &quot;sex&quot;]==1)<br>weightsf&lt;-weightsf/rowSums(weightsf)<br><br># The counterfactual distribution is the average over those weights/distributions <br>counterfactualw&lt;-colMeans(weightsf)</pre><p>which leads to the following counterfactual density estimate:</p><pre>plotdfc&lt;-rbind(plotdfc, plotdfunadj[plotdfunadj$sex==&#39;female&#39;,])<br>plotdfc$sex2&lt;-c(rep(1, length(train_idx)), rep(0,nrow(plotdfunadj[plotdfunadj$sex==&#39;female&#39;,])))<br><br>plotdfc$sex2&lt;-factor(plotdfc$sex2)<br><br><br>#interventional distribution<br>ggplot(plotdfc, aes(log_wage)) +<br>  geom_density(adjust=2.5, alpha = 0.3, show.legend=TRUE,  aes(fill=sex2, weight=plotweight)) +<br>  theme_light()+<br>  scale_fill_discrete(name = &quot;&quot;, labels = c(&quot;observed women&#39;s wages&quot;, &quot;wages if treated as men&quot;))+<br>  theme(legend.position = c(0.2, 0.98),<br>        legend.text=element_text(size=16),<br>        legend.title=element_text(size=20),<br>        legend.background = element_rect(fill=alpha(&#39;white&#39;, 0)),<br>        axis.text.x = element_text(size=14),<br>        axis.text.y = element_text(size=14),<br>        axis.title.x = element_text(size=19),<br>        axis.title.y = element_text(size=19))+<br>  labs(x=&#39;log(hourly wage)&#39;)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*DvBKJOc-JSOVDpRHMO_5Tg.png" /></figure><p>These two densities are now the density of women’s wages in red, and the density of women had they been treated as men for setting the wage in green-bluish. Clearly, the densities are now closer together than they were before — adjusting for the confounders made the difference in gender pay smaller. However, the median difference is still</p><pre>quantile_male = wtd.quantile(x=plotdfc$log_wage[plotdfc$sex2==1], weights=counterfactualw, normwt=TRUE, probs=0.5)<br>quantile_female = wtd.quantile(x=plotdfunadj$log_wage, weights=plotdfunadj$plotweight*(plotdfunadj$sex==&#39;female&#39;), normwt=TRUE, probs=0.5)<br>(1-exp(quantile_female)/exp(quantile_male))<br><br></pre><p>0.11 or 11 percent!</p><p>Thus if our analysis is correct, 11% of the difference in pay can still be attributed to gender alone. In other words, while we managed to reduce the difference from 18% in median income in the unadjusted data to 11%, there is still a substantial difference left over, indicating an “unfair” wage gap between the genders (at least if <strong><em>X</em></strong> really captures the relevant confounders).</p><h4>Conclusion</h4><p>In this article, we studied an example of how DRF could be used for a real data analysis. 
We studied both the case of a fixed <strong><em>x</em></strong>, for which the methods discussed in this article allow us to build uncertainty measures, and the distribution of a counterfactual quantity. In both cases, we saw that there was still a substantial (and, in the case of a fixed <strong><em>x</em></strong>, significant) difference when adjusting for the available confounders.</p><p>Though I did not check, it might be interesting to see how the results from this little experiment stack up against a more serious analysis. In any case, I hope this article showcased how DRF could be used in a real-world data analysis.</p><h4>Additional Code</h4><p>The full code can also be found on <a href="https://github.com/JeffNaef/Medium-Articles/blob/main/wageapplication">Github</a>.</p><pre>## Functions in CIdrf.R that is loaded above ##<br><br>drfCI &lt;- function(X, Y, B, sampling = &quot;binomial&quot;,...) {<br><br>### Function that uses DRF with subsampling to obtain confidence regions<br>### as described in https://arxiv.org/pdf/2302.05761.pdf<br>### X: Matrix of predictors<br>### Y: Matrix of variables of interest<br>### B: Number of half-samples/mini-forests<br><br><br>  n &lt;- dim(X)[1]<br>  <br>  # compute point estimator and DRF per halfsample S<br>  # weightsb: B times n matrix of weights<br>  DRFlist &lt;- lapply(seq_len(B), function(b) {<br>    <br>    # half-sample index<br>    indexb &lt;- if (sampling == &quot;binomial&quot;) {<br>      seq_len(n)[as.logical(rbinom(n, size = 1, prob = 0.5))]<br>    } else {<br>      sample(seq_len(n), floor(n / 2), replace = FALSE)<br>    }<br>    <br>    ## Refit DRF on the half-sample S<br>    DRFb &lt;- <br>      drf(X = X[indexb, , drop = F], Y = Y[indexb, , drop = F],<br>          ci.group.size = 1, ...)<br>    <br>    <br>    return(list(DRF = DRFb, indices = indexb))<br>  })<br>  <br>  return(list(DRFlist = DRFlist, X = X, Y = Y) )<br>}<br><br><br>predictdrf&lt;- function(DRF, x, ...)
{<br><br>### Function to predict from DRF with Confidence Bands<br>### DRF: DRF object<br>### x: Testpoint<br>  <br>  ntest &lt;- nrow(x)<br>  n &lt;- nrow(DRF$Y)<br>  <br>  ## extract the weights w^S(x)<br>  weightsb &lt;- lapply(DRF$DRFlist, function(l) {<br>    <br>    weightsbfinal &lt;- Matrix(0, nrow = ntest, ncol = n , sparse = TRUE)<br>    <br>    weightsbfinal[, l$indices] &lt;- predict(l$DRF, x)$weights <br>    <br>    return(weightsbfinal)<br>  })<br>  <br>  <br>  ## obtain the overall weights w<br>  weights&lt;- Reduce(&quot;+&quot;, weightsb) / length(weightsb)<br>  <br>    <br>return(list(weights = weights, weightsb = weightsb ))<br>}<br><br><br><br>Witdrf&lt;- function(DRF, x, groupingvar, alpha=0.05, ...){<br>  <br>### Function to calculate the conditional witness function with<br>### confidence bands from DRF<br>### DRF: DRF object<br>### x: Testpoint<br>  <br>  if (is.null(dim(x)) ){<br>    <br>  stop(&quot;x needs to have dim(x) &gt; 0&quot;)<br>  }<br>  <br>  ntest &lt;- nrow(x)<br>  n &lt;- nrow(DRF$Y)<br>  coln&lt;-colnames(DRF$Y)<br>  <br>  <br>  ## Collect w^S<br>  weightsb &lt;- lapply(DRF$DRFlist, function(l) {<br>    <br>    weightsbfinal &lt;- Matrix(0, nrow = ntest, ncol = n , sparse = TRUE)<br>    <br>    weightsbfinal[, l$indices] &lt;- predict(l$DRF, x)$weights <br>    <br>    return(weightsbfinal)<br>  })<br>  <br>  ## Obtain w<br>  weightsall &lt;- Reduce(&quot;+&quot;, weightsb) / length(weightsb)<br>  <br>  #weightsall0&lt;-weightsall[, DRF$Y[, groupingvar]==0, drop=F]<br>  #weightsall1&lt;-weightsall[,DRF$Y[, groupingvar]==1, drop=F]<br>  <br>  <br>  # Get the weights of the respective classes (need to standardize by propensity!)<br>  weightsall0&lt;-weightsall*(DRF$Y[, groupingvar]==0)/sum(weightsall*(DRF$Y[, groupingvar]==0))<br>  weightsall1&lt;-weightsall*(DRF$Y[, groupingvar]==1)/sum(weightsall*(DRF$Y[, groupingvar]==1))<br>  <br>  <br>  bandwidth_Y &lt;- drf:::medianHeuristic(DRF$Y)<br>  k_Y &lt;- rbfdot(sigma = bandwidth_Y)<br><br>  K&lt;-kernelMatrix(k_Y, DRF$Y[,coln[coln!=groupingvar]], y = DRF$Y[,coln[coln!=groupingvar]])<br><br>  <br>  nulldist &lt;- sapply(weightsb, function(wb){<br>    # iterate over class 1<br><br>    wb0&lt;-wb*(DRF$Y[, groupingvar]==0)/sum(wb*(DRF$Y[, groupingvar]==0))<br>    wb1&lt;-wb*(DRF$Y[, groupingvar]==1)/sum(wb*(DRF$Y[, groupingvar]==1))<br>    <br>    <br>    diag( ( wb0-weightsall0 - (wb1-weightsall1) )%*%K%*%t( wb0-weightsall0 - (wb1-weightsall1) )  )<br>    <br>    <br>  })<br>  <br>  # Choose the right quantile<br>  c&lt;-quantile(nulldist, 1-alpha)<br><br>  <br>  return(list(c=c, k_Y=k_Y, Y=DRF$Y[,coln[coln!=groupingvar]], nulldist=nulldist, weightsall0=weightsall0, weightsall1=weightsall1))<br>  <br>  <br>  <br>}</pre><pre>### Code to generate plots<br><br>## Step 0: Choosing x<br><br>point_description = function(test_point){<br>  out = &#39;&#39;<br>  <br>  out = paste(out, &#39;job: &#39;, test_point$occupation_description[1], sep=&#39;&#39;)<br>  out = paste(out, &#39;\nindustry: &#39;, test_point$industry_description[1], sep=&#39;&#39;)<br>  <br>  out = paste(out, &#39;\neducation: &#39;, test_point$education[1], sep=&#39;&#39;)<br>  out = paste(out, &#39;\nemployer: &#39;, test_point$employer[1], sep=&#39;&#39;)<br>  out = paste(out, &#39;\nregion: &#39;, test_point$economic_region[1], sep=&#39;&#39;)<br>  <br>  out = paste(out, &#39;\nmarital: &#39;, test_point$marital[1], sep=&#39;&#39;)<br>  out = paste(out, &#39;\nfamily_size: &#39;, test_point$family_size[1], sep=&#39;&#39;)<br>  out = paste(out, 
&#39;\nchildren: &#39;, test_point$children[1], sep=&#39;&#39;)<br>  <br>  out = paste(out, &#39;\nnativity: &#39;, test_point$nativity[1], sep=&#39;&#39;)<br>  out = paste(out, &#39;\nhispanic: &#39;, test_point$hispanic_origin[1], sep=&#39;&#39;)<br>  out = paste(out, &#39;\nrace: &#39;, test_point$race[1], sep=&#39;&#39;)<br>  out = paste(out, &#39;\nage: &#39;, test_point$age[1], sep=&#39;&#39;)<br>  <br>  return(out)<br>}</pre><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=ec4c2a69abf0" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/studying-the-gender-wage-gap-in-the-us-using-distributional-random-forests-ec4c2a69abf0">Studying the Gender Wage Gap in the US Using Distributional Random Forests</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Inference for Distributional Random Forests]]></title>
            <link>https://medium.com/data-science/inference-for-distributional-random-forests-64610bbb3927?source=rss-ca780798011a------2</link>
            <guid isPermaLink="false">https://medium.com/p/64610bbb3927</guid>
            <category><![CDATA[thoughts-and-theory]]></category>
            <category><![CDATA[random-forest]]></category>
            <category><![CDATA[probability-distributions]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[uncertainty]]></category>
            <dc:creator><![CDATA[Jeffrey Näf]]></dc:creator>
            <pubDate>Fri, 17 Feb 2023 08:23:31 GMT</pubDate>
            <atom:updated>2026-01-21T22:20:55.754Z</atom:updated>
            <content:encoded><![CDATA[<h4>Confidence intervals for a powerful nonparametric method</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*AaFl-wOsV_6dLA7frv2z9w.png" /><figcaption>Features of (Distributional) Random Forests. In this article: The ability to provide uncertainty measures. Source: Author.</figcaption></figure><p>In a previous <a href="https://towardsdatascience.com/drf-a-random-forest-for-almost-everything-625fa5c3bcb8">article</a>, I extensively discussed the Distributional Random Forest method, a Random Forest-type algorithm that can nonparametrically estimate multivariate conditional distributions. This means that we are able to learn the whole distribution of a multivariate response <strong><em>Y</em></strong> given some covariates <strong><em>X</em></strong> nonparametrically, instead of “just” learning an aspect such as its conditional expectation. DRF does this by learning weights <em>w_i(</em><strong><em>x</em></strong><em>) </em>for the <em>i=1,…,n</em> training points that define the distribution and can be used to estimate a wide range of targets.</p><p>So far this method only produced a “point estimate” of the distribution (i.e. a point estimate for the <em>n</em> weights <em>w_i(</em><strong><em>x</em></strong><em>)</em>). While this is enough to predict the whole distribution of a response, it doesn’t give a way to make inference that considers the randomness of the data-generating mechanism. That is, even though this point estimate gets increasingly close to the truth for large sample sizes (under a list of assumptions), there is still uncertainty in its estimate for finite sample sizes. Luckily there is now a (provable) method to quantify this uncertainty as I lay out in this article. This is based on our new paper on <a href="https://arxiv.org/pdf/2302.05761.pdf">arXiv</a>.</p><p>The goal of this article is twofold: First, I want to discuss how to add uncertainty estimates to the DRF, based on our paper. The paper is quite theoretical, so I start with a few examples. The subsequent sections take a quick glance at these theoretical results, for those interested. I then explain how this can be used to get a (sampling-based) uncertainty measure for a wide range of targets. Second, I discuss the CoDiTE of [1] and a particularly interesting example of this concept, the conditional witness function. This function is a complicated object, yet, as we will see below, we can estimate it easily with DRF and can even provide asymptotic confidence bands, based on the concepts introduced in this article. An extensive real-data example of how this could be applied is given in <a href="https://medium.com/@jeffrey_85949/studying-the-gender-wage-gap-in-the-us-using-distributional-random-forests-ec4c2a69abf0">this article</a>.</p><p>Throughout we assume to have a <em>d</em>-variate i.i.d. sample <strong><em>Y</em></strong><em>_1, …, </em><strong><em>Y</em></strong><em>_n</em> of variables of interest and a <em>p</em>-variate i.i.d. sample <strong><em>X</em></strong><em>_1,…,</em><strong><em>X</em></strong><em>_n</em> of covariates. 
The goal is to estimate the conditional distribution of <strong><em>Y</em></strong><em>|</em><strong><em>X=x</em></strong>.</p><p>We will need the following packages and functions for this:</p><pre>library(kernlab)<br>library(drf)<br>library(Matrix)<br>library(Hmisc)<br>library(MASS)<br>library(ggplot2)</pre><p>In the following, all images, unless otherwise noted, are by the author.</p><blockquote>The full code for this article can be found on <a href="https://github.com/JeffNaef/Medium-Articles/blob/main/DRF_Inference.R">Github</a>.</blockquote><h3>Examples</h3><p>We simulate from a simple example with <em>d=1</em> and <em>p=2</em>:</p><pre> <br>set.seed(2)<br><br>n&lt;-2000<br>beta1&lt;-1<br>beta2&lt;--1.8<br><br><br># Model Simulation<br>X&lt;-mvrnorm(n = n, mu=c(0,0), Sigma=matrix(c(1,0.7,0.7,1), nrow=2,ncol=2))<br>u&lt;-rnorm(n=n, sd = sqrt(exp(X[,1])))<br>Y&lt;- matrix(beta1*X[,1] + beta2*X[,2] + u, ncol=1)</pre><p>Note that this is simply a heteroskedastic linear model, with the variance of the error term depending on the <em>X_1</em> values. Of course, knowing the effect of <strong><em>X</em></strong> on <em>Y</em> is just linear, you would not use DRF, or any Random Forest for that matter, but directly go with linear regression. But for this purpose, it is convenient to know the truth. Since DRF’s job is to estimate a conditional distribution given <strong><em>X</em></strong><em>=</em><strong><em>x</em></strong>, we now fix<em> </em><strong><em>x</em></strong> and estimate the conditional expectation and variance given <strong><em>X</em></strong><em>=</em><strong><em>x.</em></strong></p><p>We choose a point that is right in the center of the <strong><em>X</em></strong> distribution, with lots of observations surrounding it. In general, one should be careful when using any Random Forest method for points on the border of the <strong><em>X</em></strong> observations.</p><pre># Choose an x that is not too far out<br>x&lt;-matrix(c(1,1),ncol=2)<br><br># Choose alpha for CIs<br>alpha&lt;-0.05</pre><p>Finally, we fit our DRF and obtain the weights <em>w_i(</em><strong><em>x</em></strong><em>)</em>:</p><pre>## Fit the new DRF framework<br>drf_fit &lt;- drf(X=X, Y=Y, min.node.size = 5, num.trees=2000, num.features=10, ci.group.size=2000/50)<br><br>## predict weights<br>DRF = predict(drf_fit, newdata=x, estimate.uncertainty = TRUE)<br>weights &lt;- DRF$weights[1,]<br><br># Bxn matrix of uncertainty weights<br>weightsb&lt;-DRF$weights.uncertainty[[1]]</pre><p>Note the option <strong>estimate.uncertainty = TRUE</strong> in the predict function. As explained below, the DRF object we built here not only contains the weights <em>w_i(</em><strong><em>x</em></strong><em>)</em>, but also a sample of <em>B</em> weights that correspond to draws from the distribution of <em>w_i(</em><strong><em>x</em></strong><em>)</em>. We can use these <em>B</em> draws to approximate the distribution of anything we want to estimate, as I illustrate now in two examples.</p><h4><strong>Example 1: Conditional Expectation</strong></h4><p>First, we simply do what most prediction methods do: We estimate the conditional expectation. 
With our new method, we also build a confidence interval around it.</p><pre># Estimate the conditional expectation at x:<br>condexpest&lt;- sum(weights*Y)<br><br># Use the distribution of weights, see below<br>distofcondexpest&lt;-apply(weightsb,1, function(wb)  sum(wb*Y)  )<br><br># Can either use the above directly to build confidence interval, or can use the normal approximation.<br># We will use the latter<br>varest&lt;-var(distofcondexpest-condexpest)<br><br># build 95%-CI<br>lower&lt;-condexpest - qnorm(1-alpha/2)*sqrt(varest)<br>upper&lt;-condexpest + qnorm(1-alpha/2)*sqrt(varest)<br><br>c(round(lower,2), round(condexpest,2), round(upper,2))<br>(-1.16, -0.59, -0.03)</pre><p>Importantly, though the estimated value is a bit off, this CI contains the truth, which is given as</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/607/1*aM0Tjfh-BGHUY-G-DE4e9A.png" /></figure><h4><strong>Example 2: Conditional Variance</strong></h4><p>Assume now we would like to find the variance Var(Y|<strong>X</strong>=<strong>x</strong>) instead of the conditional mean. This is quite a challenging example for a nonparametric method that cannot make use of the linearity. The truth is given as</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/858/1*9LS3kfTcJiPiThwXoPL4Xg.png" /></figure><p>Using DRF, we can estimate this as follows:</p><pre># Estimate the conditional expectation at x:<br>condvarest&lt;- sum(weights*Y^2) - condexpest^2<br><br>distofcondvarest&lt;-apply(weightsb,1,  function(wb)  {<br>  sum(wb*Y^2) - sum(wb*Y)^2<br>}  )<br><br># Can either use the above directly to build confidence interval, or can use the normal approximation.<br># We will use the latter<br>varest&lt;-var(distofcondvarest-condvarest)<br><br># build 95%-CI<br>lower&lt;-condvarest - qnorm(1-alpha/2)*sqrt(varest)<br>upper&lt;-condvarest + qnorm(1-alpha/2)*sqrt(varest)<br><br>c(round(lower,2), round(condvarest,2), round(upper,2))<br><br>(1.29, 2.24, 3.18)<br></pre><p>Thus the true parameter is contained in the CI, as we would hope, and in fact, we are quite close to the truth with our estimate!</p><p>We now study the theory underlying these examples, before we come to a third example in Causal Analysis.</p><h3>Asymptotic Normality in the RKHS</h3><p>In this and the next section, we briefly focus on the theoretical results derived in the paper. As explained above and in the article, DRF presents a distributional prediction at a test point <strong>x</strong>. That is, we obtain an estimate</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/392/1*ujFA6RdbxlXeSPLNpDsF0w.png" /></figure><p>of the conditional distribution of <strong><em>Y</em></strong> given <strong><em>X</em></strong><em>=</em><strong><em>x</em></strong>. This is just a typical way of writing an empirical measure, the magic lies in the weights <em>w_i(</em><strong><em>x</em></strong><em>)</em> — they can be used to easily obtain estimators of quantities of interest, or even to sample directly from the distribution.</p><p>To obtain this estimate, DRF actually estimates the conditional mean, but in a reproducing kernel Hilbert space (RKHS). An RKHS is defined through a kernel function <em>k(</em><strong><em>y</em></strong><em>_1, </em><strong><em>y</em></strong><em>_2)</em>. With this kernel, we can map each observation <strong><em>Y</em></strong><em>_i</em> into the Hilbert space, as <em>k(</em><strong><em>Y</em></strong><em>_i, .)</em>. 
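</p><p>As a small illustration (this snippet is not part of the original analysis; it simply uses the kernlab package loaded above, with an arbitrary bandwidth sigma = 1), we can look at the function <em>k(</em><strong><em>Y</em></strong><em>_1, .)</em> that the first training observation is mapped to:</p><pre># Purely illustrative sketch: the function k(Y_1, .) the first observation is mapped to<br># (sigma = 1 is an arbitrary bandwidth, chosen only for this illustration)<br>k_illus &lt;- rbfdot(sigma = 1)<br><br>ygrid &lt;- seq(min(Y), max(Y), length.out = 200)<br>phi_Y1 &lt;- sapply(ygrid, function(y) k_illus(Y[1, ], y))<br><br>plot(ygrid, phi_Y1, type = &quot;l&quot;, xlab = &quot;y&quot;, ylab = &quot;k(Y_1, y)&quot;)</pre><p>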
There is a myriad of methods using this extremely powerful tool, such as kernel ridge regression. The key point is that under some conditions, any distribution can be expressed as an element of this RKHS. It turns out that the true conditional distribution can be represented in the RKHS as the following expectation:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/585/1*1TY__-5mr8YtXeR63sFSNg.png" /></figure><p>So this is just another way of expressing the conditional distribution of <strong><em>Y</em></strong> given <strong><em>X</em></strong><em>=</em><strong><em>x</em></strong>. We then try to estimate this element with DRF like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/588/1*JeqPMonr6YbJBCH1yYsOtA.png" /></figure><p>Again we are using the weights obtained from DRF, but now form a weighted sum with <em>k(</em><strong><em>Y</em></strong><em>_i,.)</em> instead of the Dirac measures above. We can map back and forth between the two estimates by writing either of the two. The reason this matters is that we can write the conditional distribution estimate as a weighted mean in the RKHS! Just as the original Random Forest estimates a mean in the real numbers (the conditional expectation of <em>Y</em> given <strong><em>X</em></strong><em>=</em><strong><em>x</em></strong>), DRF estimates a mean in the RKHS. Only with the latter, it turns out we also obtain an estimate of the conditional distribution.</p><p>The reason this is important for our story is that this weighted mean in the RKHS behaves quite similarly in some regards to a (weighted) mean in <em>d</em> dimensions. That is, we can study its consistency and asymptotic normality using the myriad of tools that are available for averages. This is quite remarkable, as all interesting RKHS will be infinite-dimensional. The <a href="https://www.jmlr.org/papers/v23/21-0585.html">first DRF paper </a>already establishes consistency of the estimator in (1) in the RKHS. Our new paper now proves that, in addition,</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/511/1*1o64R7bA84bjvIHl9ymWzg.png" /></figure><p>where sigma_n is a standard deviation that goes to zero and <strong><em>Sigma</em></strong><em>_</em><strong><em>x</em></strong> is an operator that takes the place of a covariance matrix (again it all works quite similarly as in <em>d</em>-dimensional Euclidean space).</p><h3>Obtaining the sampling distribution</h3><p>Ok so, we have an asymptotic normality result in an infinite-dimensional space, what exactly does that mean? Well first, it means estimators derived from the DRF estimate that are “smooth’’ enough will also tend to be asymptotically normal. But this alone is still not useful, as we also need to have a variance estimate. Here a further result in our paper comes into play.</p><p>We leave away a lot of details here, but essentially we can use the following subsample scheme: Instead of just fitting say <em>N</em> trees to build our forest, we build <em>B</em> groups of <em>L</em> trees (such that <em>N=B*L</em>). Now for each group of trees or mini forests, we subsample at random about half of the data points and then fit the forest using only this subsample. Let’s call this subset of samples chosen <em>S</em>. For each drawn <em>S</em> we then get another DRF estimator in the Hilbert space denoted</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/596/1*wWErFvsrzmtI3fOa_LeVrw.png" /></figure><p>only using the samples in <em>S</em>. 
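</p><p>To make the scheme concrete, here is a rough sketch of such a subsampling loop (this is only meant to illustrate the idea, under the simplifying assumption that each group is just a DRF refitted on a random half-sample; in the code above, the corresponding draws are obtained directly through <strong>estimate.uncertainty = TRUE</strong>):</p><pre># Rough illustrative sketch of the half-sampling scheme described in the text<br>B &lt;- 50<br><br>halfsample_fits &lt;- lapply(seq_len(B), function(b) {<br>  # draw a random half-sample S of the data<br>  S &lt;- sample(seq_len(nrow(X)), floor(nrow(X)/2), replace = FALSE)<br>  # refit DRF only on S<br>  fit_b &lt;- drf(X = X[S, , drop = FALSE], Y = Y[S, , drop = FALSE], num.trees = 50)<br>  list(fit = fit_b, indices = S)<br>})<br><br># each element now yields its own weights w^S(x) via predict(fit_b, newdata = x)$weights</pre><p>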
Note that, as in bootstrapping, we now have two sources of randomness, even disregarding the randomness of the forest (in theory, we assume <em>B</em> to be so large that the randomness of the forest(s) is negligible). One source comes from the data themselves, and the other is an artificial source of randomness that we introduce when choosing <em>S</em> at random. Crucially, the randomness from <em>S</em>, given the data, is under our control — we can draw as many subsets <em>S</em> as we want. So the question is: what happens to our estimator in (2) if we only consider the randomness of <em>S</em> and fix the data? Remarkably, we can show that</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/549/1*xTcFqBFixMp0oE_KZL_B8A.png" /></figure><p>This just means that if we fix the randomness of the data and only consider the randomness from <em>S</em>, the estimator (2) minus the estimator in (1) will converge in distribution to the same limit as the original estimator minus the truth! This is actually how bootstrap theory works: We have shown that something we can sample from, namely</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/284/1*vDdPW4PJkc4XjpGa6tDqlw.png" /></figure><p>converges to the same limit as what we cannot access, namely</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/257/1*8lK6EYO57fEeSVVv_vrYpg.png" /></figure><p>So to make inference about the latter, we can use the former! This is actually the standard argument people make in bootstrap theory to justify why the bootstrap can be used to approximate the sampling distribution! That’s right: even the bootstrap, a technique that people often use in small samples, only really makes sense (theoretically) in a large-sample regime.</p><p>Let’s use this now.</p><h3>What does this actually mean?</h3><p>We now show what this means in practice.
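</p><p>For instance, the same <em>B</em> draws give a confidence interval for essentially any other target, say the conditional median. The following sketch (not part of the original code) mirrors Examples 1 and 2 above and uses <strong>wtd.quantile</strong> from the Hmisc package loaded earlier:</p><pre># Sketch: conditional median at x with a CI from the B half-sample draws<br># (analogous to Examples 1 and 2 above)<br>condmedest &lt;- wtd.quantile(x = as.numeric(Y), weights = weights, normwt = TRUE, probs = 0.5)<br><br>distofcondmedest &lt;- apply(weightsb, 1, function(wb) {<br>  wtd.quantile(x = as.numeric(Y), weights = wb, normwt = TRUE, probs = 0.5)<br>})<br><br>varest &lt;- var(distofcondmedest - condmedest)<br><br># normal-approximation CI, as before<br>c(condmedest - qnorm(1 - alpha/2)*sqrt(varest),<br>  condmedest,<br>  condmedest + qnorm(1 - alpha/2)*sqrt(varest))</pre><p>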
In the following, we define a new function derived from the drf function of the CRAN package <a href="https://cran.r-project.org/web/packages/drf/index.html">drf</a>.</p><pre><br><br>Witdrf&lt;- function(DRF, x, groupingvar, alpha=0.05, ...){<br>  <br>  ### Function to calculate the conditional witness function with<br>  ### confidence bands from DRF<br>  ### DRF: DRF object<br>  ### x: Testpoint<br>  <br>  if (is.null(dim(x)) ){<br>    <br>    stop(&quot;x needs to have dim(x) &gt; 0&quot;)<br>  }<br>  <br>  ntest &lt;- nrow(x)<br>  n &lt;- nrow(DRF$Y)<br>  coln&lt;-colnames(DRF$Y.orig)<br>  <br>  drfpred&lt;- predict(DRF, newdata=x, estimate.uncertainty = TRUE)<br>  weightsall&lt;-drfpred$weights[1,]<br>  weightsb&lt;-drfpred$weights.uncertainty[[1]]<br>  <br>  #weightsall0&lt;-weightsall[, DRF$Y[, groupingvar]==0, drop=F]<br>  #weightsall1&lt;-weightsall[,DRF$Y[, groupingvar]==1, drop=F]<br>  <br>  <br>  # Get the weights of the respective classes (need to standardize by propensity!)<br>  weightsall0&lt;-weightsall*(DRF$Y.orig[, groupingvar]==0)/sum(weightsall*(DRF$Y.orig[, groupingvar]==0))<br>  weightsall1&lt;-weightsall*(DRF$Y.orig[, groupingvar]==1)/sum(weightsall*(DRF$Y.orig[, groupingvar]==1))<br>  <br>  <br>  bandwidth_Y &lt;- drf:::medianHeuristic(DRF$Y.orig)<br>  k_Y &lt;- rbfdot(sigma = bandwidth_Y)<br>  <br>  K&lt;-kernelMatrix(k_Y, DRF$Y.orig[,coln[coln!=groupingvar]], y = DRF$Y.orig[,coln[coln!=groupingvar]])<br>  <br>  <br>  nulldist &lt;- apply(weightsb,1, function(wb){<br>    # iterate over class 1<br>    <br>    wb0&lt;-wb*(DRF$Y[, groupingvar]==0)/sum(wb*(DRF$Y[, groupingvar]==0))<br>    wb1&lt;-wb*(DRF$Y[, groupingvar]==1)/sum(wb*(DRF$Y[, groupingvar]==1))<br>    <br>    <br>    diag( ( wb0-weightsall0 - (wb1-weightsall1) )%*%K%*%( wb0-weightsall0 - (wb1-weightsall1) )  )<br>    <br>    <br>  })<br>  <br>  # Choose the right quantile<br>  c&lt;-quantile(nulldist, 1-alpha)<br>  <br>  <br>  return(list(c=c, k_Y=k_Y, Y=DRF$Y[,coln[coln!=groupingvar]], nulldist=nulldist, weightsall0=weightsall0, weightsall1=weightsall1))<br>  <br>  <br>  <br>}<br></pre><p>So from our method, we not only get the point estimate in form of weights <em>w_i(</em><strong><em>x</em></strong><em>)</em>, but a sample of <em>B</em> weights, each representing an independent draw from the distribution of the estimator of the conditional distribution (that sounds more confusing than it should be, please keep the examples in mind). This just means we are not only having an estimator, but also an approximation to its distribution!</p><p>I now turn to a more interesting example of something we can only do with DRF (as far as I know).</p><h3><strong>Causal Analysis Example: Witness Function</strong></h3><p>Let’s assume we have two sets of observations, say group <em>W=1</em> and group <em>W=0</em> and we want to find the causal relationship between the group belonging and a variable <em>Y</em>. In the example of <a href="https://medium.com/@jeffrey_85949/studying-the-gender-wage-gap-in-the-us-using-distributional-random-forests-ec4c2a69abf0"><em>this article</em></a>, the two groups would be male and female and <em>Y</em> would be the hourly wage. In addition, we have confounders <strong><em>X</em></strong>, which we assume affect both <em>W</em> and <em>Y</em>. We assume here that <strong><em>X</em></strong> really includes all relevant confounders. This is a BIG assumption. 
Formally, we assume unconfoundedness:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/311/1*4ezqKLMyI9Ei1RlSiZS6Qw.png" /></figure><p>and overlap:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/381/1*wATpSK4y17FQUXH6Xahh-w.png" /></figure><p>Often people then compare the conditional expectation between the two groups:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/582/1*YY2AhchoTzyyG2cb9qgFbA.png" /></figure><p>This is the Conditional Average Treatment Effect (CATE) at <strong><em>x</em></strong>. This is a natural first starting point, but in a recent paper ([1]), the CoDiTE was introduced as a generalization of this idea. Instead of just looking at the difference in expected values the CoDiTE proposes to look at differences in other quantities as well. A particularly interesting example of this idea is the <em>conditional witness function: </em>For both groups, we take as above</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/533/1*sLpy9geMLq3P7siw2zdmLA.png" /></figure><p>So we consider the representation of the two conditional distributions in the RKHS. In addition to being representations of the conditional distributions, these quantities are also real-valued functions: For <em>j=0,1</em>,</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/993/1*K3wmpzqZ6AnQym1Dsu32Fw.png" /></figure><p>The function that gives the difference between those two quantities,</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/475/1*oMeQgh4IVrpb9rdX9UXrpQ.png" /></figure><p>is called the <em>conditional witness function</em>.</p><p>Why is this function interesting? It turns out that this function shows how the two densities behave in relation to each other: For values of <em>y</em> for which the function is negative, the conditional density of class 1 at <em>y</em> is smaller than the conditional density of 0. Similarly, if the function is positive at <em>y</em>, it means the density of 1 is higher at <em>y</em> than the conditional density of 0 (whereby “conditional” always refers to conditioning on <strong><em>X</em></strong><em>=</em><strong><em>x</em></strong>). Crucially, this can be done <em>without having to estimate the densities</em>, which is hard, especially for multivariate <strong><em>Y</em></strong>.</p><p>Finally, we can provide <em>uniform confidence bands </em>for our estimated conditional witness functions, by using the <em>B</em> samples from above. I do not go into details here, but these are essentially the analog to the confidence intervals for the conditional mean we used above. Crucially, these bands should be valid uniformly over the function values <em>y</em>, for one specific <strong><em>x</em></strong>.</p><blockquote>We note that we mostly showcase the witness function here as an example. If interest centers on this causal application and the witness function, a more powerful approach is available that uses only one forest in the <a href="https://arxiv.org/abs/2411.08778">CausalDRF</a>.</blockquote><p>Let’s illustrate this with an example: We simulate the following data-generating process:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*oO5cB4ALRCsZaxJbUC5Peg.png" /></figure><p>That is, <em>X_1, X_2</em> are independently uniformly distributed on (0,1), <em>W</em> is either 0 or 1, with a probability depending on <em>X_2</em> and <em>Y</em> is a function of <em>W</em> and <em>X_1</em>. 
This is a really hard problem; not only does <strong><em>X</em></strong> influence the probability of belonging to class 1 (i.e. the propensity), it also changes the treatment effect of <em>W</em> on <em>Y</em>. In fact, a small calculation shows that the CATE is given as:</p><p>(1 - 0.2)*X_1 - (0 - 0.2)*X_1 = X_1.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/741/1*_sv7wfzGPBnl7_TYXp1niQ.png" /><figcaption>Graph corresponding to the data-generating process</figcaption></figure><pre>set.seed(2)<br><br>n&lt;-5000<br>p&lt;-2<br><br>X&lt;-matrix(runif(n*p), ncol=p)<br>W&lt;-rbinom(n,size=1, prob= exp(-X[,2])/(1+exp(-X[,2])))<br><br>Y&lt;-(W-0.2)*X[,1] + rnorm(n)<br>Y&lt;-matrix(Y,ncol=1)<br><br></pre><p>We now randomly choose a test point <strong><em>x</em></strong> and use the following code to estimate the witness function plus confidence band:</p><pre><br>x&lt;-matrix(runif(1*p), ncol=2)<br>Yall&lt;-cbind(Y,W)<br>## For the current version of the Witdrf function, we need to give<br>## colnames to Yall<br>colnames(Yall) &lt;- c(&quot;Y&quot;, &quot;W&quot;)<br><br>## Fit the new DRF framework<br>drf_fit &lt;- drf(X=X, Y=Yall, min.node.size = 5, num.trees=4000, ci.group.size=4000/40)<br><br>Witobj&lt;-Witdrf(drf_fit, x=x, groupingvar=&quot;W&quot;, alpha=0.05)<br><br>hatmun&lt;-function(y,Witobj){<br>  <br>  c&lt;-Witobj$c<br>  k_Y&lt;-Witobj$k_Y<br>  Y&lt;-Witobj$Y<br>  weightsall1&lt;-Witobj$weightsall1<br>  weightsall0&lt;-Witobj$weightsall0<br>  Ky=t(kernelMatrix(k_Y, Y , y = y))<br>  <br>  out&lt;-list()<br>  out$val &lt;- (Ky%*%weightsall1 - Ky%*%weightsall0)<br>  out$upper&lt;-  out$val+sqrt(c)<br>  out$lower&lt;-  out$val-sqrt(c)<br>  <br>  return( out )<br>  <br>  <br>  <br>}<br><br>all&lt;-hatmun(sort(Witobj$Y),Witobj)<br><br>plot(sort(Witobj$Y),all$val , type=&quot;l&quot;, col=&quot;darkblue&quot;, lwd=2, ylim=c(min(all$lower), max(all$upper)),<br>     xlab=&quot;y&quot;, ylab=&quot;witness function&quot;)<br>lines(sort(Witobj$Y),all$upper , type=&quot;l&quot;, col=&quot;darkgreen&quot;, lwd=2 )<br>lines(sort(Witobj$Y),all$lower , type=&quot;l&quot;, col=&quot;darkgreen&quot;, lwd=2 )<br>abline(h=0)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*UEImFETznSp9xXuNgipCGQ.png" /></figure><p>We can read from this plot that:</p><p>(1) The conditional density of group 1 is <em>lower </em>than the density of group 0 for values of <em>y</em> between -3 and 0.3. Moreover, this difference gets larger the larger <em>y</em> is until about <em>y = -1</em>, after which point the difference in densities starts to decrease again until the two densities are the same at around 0.3.</p><p>(2) Symmetrically, the density of group 1 is higher than the density of group 0 for values of <em>y</em> between 0.3 and 3 and this difference gets larger until it reaches a maximum at about <em>y = 1.5</em>. After this point, the difference decreases until it is almost zero again at <em>y = 3</em>.</p><p>(3) The difference between the two densities is statistically significant at the 95% percent level, as can be seen from the fact that for <em>y</em> approximately between -1.5 and -0.5 and between 1 and 2, the asymptotic confidence bands do not include the zero line.</p><p>Let’s check (1) and (2) for the simulated true conditional densities. 
That is, we simulate the truth a great number of times:</p><pre># Simulate truth for a large number of samples ntest<br>ntest&lt;-10000<br>Xtest&lt;-matrix(runif(ntest*p), ncol=2)<br><br>Y1&lt;-(1-0.2)*Xtest[,1] + rnorm(ntest)<br>Y0&lt;-(0-0.2)*Xtest[,1] + rnorm(ntest)<br><br><br>## Plot the test data without adjustment<br>plotdf = data.frame(Y=c(Y1,Y0), W=c(rep(1,ntest),rep(0,ntest) ))<br>plotdf$weight=1<br>plotdf$plotweight[plotdf$W==0] = plotdf$weight[plotdf$W==0]/sum(plotdf$weight[plotdf$W==0])<br>plotdf$plotweight[plotdf$W==1] = plotdf$weight[plotdf$W==1]/sum(plotdf$weight[plotdf$W==1])<br><br>plotdf$W &lt;- factor(plotdf$W)<br><br>#plot pooled data<br>ggplot(plotdf, aes(Y)) +<br>  geom_density(adjust=2.5, alpha = 0.3, show.legend=TRUE,  aes(fill=W, weight=plotweight)) +<br>  theme_light()+<br>  scale_fill_discrete(name = &quot;Group&quot;, labels = c(&#39;0&#39;, &quot;1&quot;))+<br>  theme(legend.position = c(0.83, 0.66),<br>        legend.text=element_text(size=18),<br>        legend.title=element_text(size=20),<br>        legend.background = element_rect(fill=alpha(&#39;white&#39;, 0.5)),<br>        axis.text.x = element_text(size=14),<br>        axis.text.y = element_text(size=14),<br>        axis.title.x = element_text(size=19),<br>        axis.title.y = element_text(size=19))+<br>  labs(x=&#39;y&#39;)<br><br></pre><p>This leads to:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*mcbBMV_xMwImZgx04-MbOA.png" /></figure><p>It is a bit hard to compare visually, but we see that the two densities behave quite close to what the witness function above predicted. In particular, we see that the densities are about the same around 0.3 and the difference in densities appears to be maximal approximately around -1 and 1.5. Thus both points (1) and (2) can be seen in the actual densities!</p><p>Moreover, to get (3) into context, a repeated simulation in the paper shows how the estimated witness function tends to look when no effect is visible:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/977/1*lqBfz_ngTZ-Ut6ZzhFWxkw.png" /><figcaption>Simulation of a 1000 witness functions in a similar setting as described here. In blue are the 1000 estimated witness functions, while in grey one can see the corresponding confidence bands. Taken from our paper on arXiv. There is no effect in this example, and 99% of CIs do not contain the zero line.</figcaption></figure><p>A real data example in Causal Inference is given in <a href="https://medium.com/p/ec4c2a69abf0/edit">this article</a>. The CausalDRF that implements all of this in one forest can be explored <a href="https://arxiv.org/abs/2411.08778">here</a>.</p><h4>Conclusion</h4><p>In this article, I discussed the new inferential tools available for Distributional Random Forests. I also looked at an important application of these new capabilities; estimating the conditional witness function with uniform confidence bands.</p><p>However, I also want to offer a few words of warning:</p><ol><li>The results are only valid for a given test point <strong><em>x</em></strong></li><li>The results are only valid asymptotically</li></ol><p>The first point is actually not so bad, in simulations, the asymptotic normality often also holds over a range of <strong>x</strong>.<em> Just be careful with test points that are close to the boundary of your sample! 
</em>Intuitively, DRF (and all other nearest neighborhood methods) need many sample points around the test point <strong><em>x</em></strong> to estimate the response for <strong><em>x</em></strong>. So if the covariates <em>X</em> in your training set are standard normal, with most points between -2 and 2, then predicting an <em>x</em> in [-1,1] should be no problem. But if your <em>x</em> reaches -2 or 2, performance starts to deteriorate fast.</p><blockquote>Random Forests (and nearest neighbourhood methods in general) are not good at predicting for points that only have a few neighbours in the training set, such as points at the boundary of the support of <strong><em>X</em></strong>.</blockquote><p>The second point is also quite important. Asymptotic results have fallen somewhat out of fashion in contemporary research, in favor of finite sample results that in turn require assumptions such as “sub-Gaussianity”. Still, asymptotic results provide extremely powerful approximations in complicated settings like these. And in fact, this approximation is pretty accurate for many targets for more than 1000 or 2000 data points (maybe you have 92% coverage instead of 95% for your conditional mean/quantile). However, the witness function we introduced is a complicated object, and thus the more data points you have to estimate the uncertainty bands around it, the better!</p><h4>Citations</h4><p>[1] Junhyung Park, Uri Shalit, Bernhard Schölkopf, and Krikamol Muandet. “Conditional distributional treatment effect with kernel conditional mean embeddings and U-statistic regression.” In Proceedings of 38th International Conference on Machine Learning (ICML) , volume 139, pages 8401–8412. PMLR, July 2021.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=64610bbb3927" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/inference-for-distributional-random-forests-64610bbb3927">Inference for Distributional Random Forests</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Deep Dive into HPLBs for A/B testing using Random Forest]]></title>
            <link>https://medium.com/data-science/deep-dive-into-hplbs-for-a-b-testing-using-random-forest-11f0fdd73044?source=rss-ca780798011a------2</link>
            <guid isPermaLink="false">https://medium.com/p/11f0fdd73044</guid>
            <category><![CDATA[a-b-testing]]></category>
            <category><![CDATA[p-value]]></category>
            <category><![CDATA[random-forest]]></category>
            <category><![CDATA[deep-dives]]></category>
            <dc:creator><![CDATA[Jeffrey Näf]]></dc:creator>
            <pubDate>Mon, 12 Dec 2022 15:12:05 GMT</pubDate>
            <atom:updated>2022-12-12T15:12:05.709Z</atom:updated>
            <content:encoded><![CDATA[<h3>Deep Dive into HPLBs for A/B Testing using Random Forest</h3><h4>An Alternative to <em>p</em>-values in Testing</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*wZ0PS82oT_ct4FQV" /><figcaption>Photo by <a href="https://unsplash.com/@vonshnauzer?utm_source=medium&amp;utm_medium=referral">Egor Myznik</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>In a <a href="https://medium.com/towards-data-science/an-alternative-to-p-values-in-a-b-testing-44f1406d3f91">recent article</a>, we introduced the concept of a high probability lower bound (HPLB) of the TV distance, based on our article on arXiv:</p><p><a href="https://arxiv.org/abs/2005.06006">https://arxiv.org/abs/2005.06006</a>,</p><p>joint work with <a href="https://medium.com/u/f562dceaeb63">Loris Michel</a>.</p><p>In the current article, we dive into a detailed treatment of this topic and on the way touch upon some very useful (and beautiful) statistical concepts. In particular, we will need to draw balls without replacement from an urn with <em>m </em>balls and <em>n</em> squares. Many thanks to <a href="https://medium.com/u/dba4029558b7">Maybritt Schillinger</a> for a wealth of constructive comments!</p><h4><strong>Outline</strong></h4><p>For some time now it has been known that powerful classifiers (such as the Random Forest classifier) can be used in two-sample or A/B testing, as explained <a href="https://medium.com/towards-data-science/an-alternative-to-p-values-in-a-b-testing-44f1406d3f91">here</a>: We observe two independent groups of samples, one coming from a distribution <em>P</em> (e.g., blood pressure before treatment) and one coming from a distribution <em>Q</em> (e.g., blood pressure of an independent group of people, after treatment) and we want to test, <em>H_0: P=Q</em>. Given these two sets of data, we give one a label of 0, and the other a label of 1, train a classifier and then evaluate this classifier on some independent data. Then it seems intuitive that, the better the classifier can differentiate the two groups, the more evidence against the Null there is. This can be made formal, leading to a valid p-value and to a rejection decision when the p-value is smaller than a prespecified <em>alpha.</em></p><p>This is nice because today classifiers are powerful and thus this approach leads to powerful two-sample tests that can potentially detect any difference between <em>P</em> and <em>Q</em>. On the other hand, we all heard about the problems with p-values and classical testing. In particular, a significant p-value does not tell one <em>how different P and Q are </em>( this is related to <em>effect size</em> in medicine). The picture below illustrates an example where <em>P</em> and <em>Q</em> get progressively more different. In each case, even a strong two-sample test would simply only give a binary rejection decision.</p><p>So it would be more interesting if we could somehow meaningfully calculate <em>how different P is from Q</em>, ideally still using a powerful classifier in the process. Here we construct a meaningful method based on an estimate of the TV distance between <em>P</em> and <em>Q</em>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/proxy/1*lWQpN3DBqyk_o8zyy0rdOg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*h_tUwRru7K-qLLpD7Kb15A.png" /><figcaption>P (in red) and Q(in blue), get progressively more different. 
A test, no matter how powerful, just rejects without giving additional information. Source: author.</figcaption></figure><p>In the following, we assume we observe an i.i.d. sample <em>X_1, …, X_m</em> from <em>P</em> and an independent i.i.d. sample <em>Y_1, …, Y_n</em> from <em>Q</em>. We then use the probability estimate of a classifier (e.g., probability of belonging to class 1) as a “projection” that takes the vectors of data and maps them onto the real line as probability estimates. Building the univariate order statistic with these values and finding a connection to <em>TV(P,Q)</em>, we will then be able to construct our lower bound. In the following, we will also sometimes just write <em>lambda </em>for <em>TV(P,Q)</em>.</p><h4>Using a (powerful) Classifier to get a Univariate Problem</h4><blockquote>Concepts in this section: Drawing circles without replacement from an urn, with m circles and n squares: using the hypergeometric distribution for two-sample testing.</blockquote><p>In general, the samples from <em>P</em>, <em>Q</em> are <em>d</em>-dimensional random vectors. Here the classifier already comes into play: most classifiers can take these <em>m+n</em> sample points and transform them into a sequence of real numbers, namely the predicted probability that observation <em>i </em>has label 1. Thus, we can just focus on the real numbers between 0 and 1 to construct our estimator. Of course, what we then really lower bound is the TV distance of the probability estimates. It is thus important that the classifier is strong, to make sure we do not lose too much information.</p><p>So, let’s assume we have a sample of <em>N=m+n</em> real numbers, and we know for each of those whether the original observation comes from <em>P</em> or <em>Q</em>. Then we can build this magical thing called the <em>order statistic. </em>That is, we take the <em>N</em> numbers and order them from smallest to largest. To illustrate this, let’s represent samples from <em>P</em> as circles and samples from <em>Q</em> as squares. Then the order statistics may look like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*R8FGTm2zyav6P2mUUXRo0g.png" /></figure><p>Now here is the important point: <em>The classifier tries to estimate the probability of an observation being in class 1 (that is, coming from Q, being a square) as accurately as possible. </em>Thus if the classifier is good, we should expect to see more squares on the right, because the estimated probability should be larger for squares than for circles! Thus, the order statistic is like a <em>centrifuge</em>: If there is a discernible difference between <em>P</em> and <em>Q</em>, and the classifier is able to detect it, the order statistics of the probabilities push the circles to the left and the squares to the right. Since there is still randomness and estimation error, this will not look perfect in general. However, we want it to be “sufficiently different from randomness”.</p><p>One very elegant way to quantify this is the statistic we call <em>V_z, </em>the number of circles below <em>z</em>. This statistic has been used for (univariate) testing for a long time. That is, at any point <em>z=1,…, N</em>, we simply count how many circles we have below z:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*OUnXbfQuPmJHYE5ypH7JKg.png" /></figure><p>What should we expect if <em>P</em> and <em>Q</em> are the same? In this case, we simply have <em>N</em> i.i.d. draws from a single distribution.
Thus there should be just a random arrangement of circles and squares in the order statistic, with no pattern. In this case, <em>V_z</em> is actually drawing, with uniform probability, circles without replacement from an urn with m circles and n squares. Thus this connects back to the very basics of probability. Mathematically:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/708/1*QOz37A4V6p6bHuR1YPe2gA.png" /></figure><p>see e.g., <a href="https://towardsdatascience.com/hypergeometric-distribution-explained-with-python-2c80bc613bf4">this nice article</a>.</p><blockquote>Under H_0: P=Q, <em>V_z </em>is hypergeometric: It is the number of times you draw a circle if you draw <em>z</em> times <em>without replacement </em>from an urn with m circles and n squares.</blockquote><p>What is cool, is that we are now even able to find a function in <em>z</em> , <em>q(z, alpha), </em>for any<em> alpha, </em>such that when <em>P=Q</em>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/706/1*-YH12HzlYmV8-wb4w1Ibyw.png" /></figure><p>Finding this <em>q(z, alpha) </em>can be done by using asymptotic theory (see e.g. <a href="https://arxiv.org/abs/2005.06006">our paper</a> and the references therein) or simply by simulation. The main point is that we know the distribution of <em>V_z</em> and it is always the same, no matter what <em>P</em> and <em>Q</em> exactly are. So even if we don’t have a closed-form distribution for the maximum, we can still approximate <em>q(z,alpha)</em> quite readily. This can be directly used for a (univariate) two-sample test! If <em>max_z V_z-q(z,alpha)</em> overshoots zero, we can reject that <em>P=Q</em>.</p><p>Ok so this is quite nice, but the whole point of this article is that we want to get away from simple rejection decisions, and instead get a lower bound for the Total Variation Distance. Unfortunately, under a general alternative where <em>P</em> and <em>Q</em> are different (i.e., <em>TV(P,Q) &gt; 0</em>), the distribution of <em>V_z</em> is no longer known! The goal will now be to find another process <em>B_z</em> that is easier to analyze and such that for all <em>z=1,…,N,</em> <em>B_z ≥ V_z</em>. If we can bound this process correctly, then the bound also holds for <em>V_z</em>.</p><h4>Playing around with TV(P,Q)</h4><blockquote>Concepts in this section: Using the sampling interpretation of TV to introduce the concept of Distributional Witnesses and using this to identify an area where P=Q holds, even if P is not equal to Q in general.</blockquote><p>By finding <em>q(z,alpha)</em> in the last section, we essentially found a first step of the construction of a lower bound for the case when <em>TV(P,Q)=0</em>, i.e. if there is no difference between <em>P</em> and <em>Q</em>. We now extend this by connecting with our first article and playing around with the definition of the TV distance between <em>P</em> and <em>Q</em>. Generally, this is given as</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/692/1*lWQpN3DBqyk_o8zyy0rdOg.png" /></figure><p>Thus we look for the set <em>A</em>, out of all possible sets, such that the difference between <em>P(A)</em> and <em>Q(A)</em> is largest. Let’s make this more specific: Let <em>p</em> and <em>q</em> be the densities of <em>P</em> and <em>Q</em> (as a technicality, the data does not need to be continuous in the usual sense, we can <em>always </em>do that in this case). 
Then the maximal <em>A</em> is given as</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/490/1*JozqYWCIdG9hSV0jWNHxOg.png" /></figure><p>so that</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*02PumxRLvDsWTyWSLsDDrg.png" /></figure><p>Let in the following <em>X</em> have distribution <em>P</em> and <em>Y</em> distribution <em>Q</em>. Now here comes the crucial part: We can use this to define a new density</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/841/1*STYrq8kC6gT0fQhbX83nmw.png" /></figure><p>Then this is a valid density (integrates to 1) and we can define similarly a density <em>q_+</em>. What this means is best seen graphically:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/956/1*WO6qHgK_lkMlLfDLG-k0dw.png" /><figcaption>Illustration of the TV concept. Left: the two original densities p and q, with p_+ in red, h in blue and q_+ in green all unstandardized. Right: Densities p_+ in red, h in blue and q_+ in green. Source:author</figcaption></figure><p>The picture shows that the densities <em>p</em>, <em>q</em> can be split up into the densities <em>p_+</em>, <em>q_+,</em> and some middle part, that corresponds to the minimum value of both densities and integrates exactly to 1 if we standardize it with <em>1-TV(P,Q)</em>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/838/1*htIQqNXghY-PRzwRcEa0sw.png" /></figure><p>Instead of seeing <em>X</em> as a draw from <em>P</em> and <em>Y</em> as a draw from <em>Q</em> directly, we can now see <em>X</em> as drawn from the mixture</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/749/1*0VpQ7Bo4EYRz94o-7z_1Vg.png" /></figure><p>What this means is the following; before we draw <em>X</em>, we flip a coin wherewith probability <em>TV(P,Q)</em>, we draw <em>X</em> from the red density <em>p_+</em> and with probability (<em>1-TV(P,Q)</em>) we instead draw it from <em>h</em>. For the distribution of <em>X</em> it doesn’t matter how we look at it, in the end, <em>X</em> will have distribution <em>P.</em> But clearly, it appears interesting whether <em>X </em>actually came from <em>p_+</em> or from <em>h</em>, because the former corresponds to the ‘’unique’’ part of p. Indeed looking at either the graphic or the densities themselves, we see that <em>p_+ </em>and<em> q_+ </em>are <em>disjoint. </em>So <em>X</em> either comes from <em>p_+</em> or from <em>h</em> and similarly, <em>Y</em> is either drawn from <em>q_+</em> or from <em>h</em>. Crucially, if both <em>Y</em> and <em>X</em> are drawn from the density <em>h</em>, they obviously come from the same distribution and there is no way to differentiate them, <em>it is as if we were under the Null.</em></p><p>So, for the i.i.d. observations <em>X_1, …, X_m</em> and <em>Y_1,…,Y_n,</em> each observation is either drawn from the specific part (<em>p_+</em> or <em>q_+</em>) or from the joint part <em>h</em>. We call observations that are drawn from the specific part <em>p_+(q_+)</em> <em>witnesses for P (Q).</em></p><blockquote>Observations drawn from the specific part p_+ are called witnesses (for P). 
Observations drawn from h cannot be differentiated, so this corresponds to the part where P and Q are the same.</blockquote><p>OK, so if we go back to the order statistics, we can now think of it like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*T9GS0pQhPaKr_0yWDacbgA.png" /></figure><p>Circles with blue crosses correspond to witnesses from <em>P</em>, while squares with blue crosses correspond to witnesses from <em>Q</em>. Basically, from the crossed-out observations we can learn something about the difference of <em>P, Q</em>, while the observations without crosses are basically from the Null. In a sense all of this is just a thought experiment — we have no way of knowing whether <em>X_i</em> is drawn from <em>p_+</em> or <em>h</em>. So we don’t know which points are witnesses; in fact, we don’t even know how many there are. Nonetheless, this thought experiment will be helpful to construct our estimator.</p><p>In the following we will propose a <em>candidate </em>for <em>TV(P,Q), </em>say <em>lambda_c</em>, and then check whether this candidate fits a condition. If it does, we choose a new candidate <em>lambda_c</em> that is higher than the old one and check the condition again. We do that until <em>lambda_c</em> violates the condition.</p><h4>Let’s do some cleaning</h4><blockquote>Concepts in this section: Bounding the number of witnesses with high probability and using an intuitive “cleaning operation” to get a better-behaved process B_z that is always larger than or equal to V_z.</blockquote><p>We now want to use this idea that some points are witnesses and others come from the part where <em>P=Q</em> for the statistic <em>V_z. </em>As mentioned above, we don’t really know which points are witnesses! That would be OK, since what we need is actually just the number of witnesses; unfortunately, we don’t even know that either. However, we can find an <em>upper bound</em> for this number.</p><p>Recall that we assume <em>X_1,…, X_m </em>were sampled<em> </em>by drawing each with probability TV(P,Q) from <em>p_+</em> and with probability <em>1-TV(P,Q)</em> from <em>h</em>. So the number of witnesses in <em>m</em> observations, denoted <em>w_p</em>, actually follows a Binomial distribution with success probability <em>TV(P,Q)</em>. So if we have a candidate <em>lambda_c</em>, which we suspect should be the true TV distance, the number of witnesses should follow the distribution</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/493/1*AYO4_cUCQjEK6h1rg1Jqww.png" /></figure><p>This is still random, and thus we don’t know the exact outcome for a given sample. But since we know the distribution, we can find a high quantile <em>W_p</em>, such that <em>w_p</em> overshoots this quantile with probability less than <em>alpha/3</em>. For instance, for <em>m=10</em> and <em>lambda_c=0.1</em>, this can be found as</p><pre>lambda_c&lt;-0.1<br>m&lt;-10<br>alpha&lt;-0.05<br><br>W_p&lt;-qbinom(p=alpha/3, size=m, prob=lambda_c, lower.tail=F)<br><br><br># test: simulate the number of witnesses and compare to the bound W_p<br>w_p&lt;-rbinom(n=3000, size=m, prob=lambda_c)<br>hist(w_p, main=&quot;&quot;)<br>abline(v=W_p, col=&quot;darkred&quot;)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/730/1*P8-j_RQqqygTD_UATCv0jQ.png" /></figure><p>We can do the same for the witnesses <em>W_q</em>.</p><p>So we have a candidate <em>lambda_c</em> and, based on this candidate, two values <em>W_p</em> and <em>W_q </em>that bound<em> w_p </em>and<em> w_q </em>with high probability.
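</p><p>Since these bounds change with the candidate, it can be convenient to wrap them into small helper functions of <em>lambda_c</em> (just a convenience sketch, not part of the original code):</p><pre># Convenience sketch: the witness bounds as functions of the candidate lambda_c<br>W_p_fun &lt;- function(lambda_c, m, alpha) qbinom(p = alpha/3, size = m, prob = lambda_c, lower.tail = FALSE)<br>W_q_fun &lt;- function(lambda_c, n, alpha) qbinom(p = alpha/3, size = n, prob = lambda_c, lower.tail = FALSE)<br><br>W_p_fun(lambda_c = 0.1, m = 10, alpha = 0.05)  # reproduces W_p from the example above</pre><p>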
In particular, <em>W_p</em> and <em>W_q </em>depend directly on<em> lambda_c, </em>so it would actually be better to<em> </em>write <em>W_p(lambda_c) </em>and <em>W_q(lambda_c),</em> but that would blow up the notation too much.</p><p>To obtain the new process <em>B_z</em>, we first make up new witnesses in the order statistics above:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*cvPDpt9koMlC4wjtGWiFCA.png" /></figure><p>The red squares now denote random points that we designated to be witnesses. This is done so that the number of witnesses matches the upper bounds <em>W_p</em> and <em>W_q</em>. This is ok in our context, as it actually does not change <em>V_z</em>.</p><p>Now we perform our <em>cleaning operation</em>. This will get us from <em>V_z</em> to <em>B_z</em> in a way that guarantees <em>B_z</em> is at least as large as <em>V_z</em>. We go through the order statistics from left to right and from right to left. First, from left to right, every time we see a circle without a cross, we randomly choose a witness from <em>P</em> (circle with a cross) on the right and put it before the empty circle. We do so without changing the order of the squares and circles without crosses, like so:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*i7bvIN9-Zl8za6d-DqCBtw.png" /></figure><p>The first circle was already a witness, so we left it as is. The second circle was a nonwitness, so we randomly moved a witness circle from further up to where it was before. The next thing was a square which was no a witness, so we moved a circle witness from the right before it and so on. The whole idea is simply to move all witnesses from <em>P</em> to the left and all witnesses of <em>Q</em> to the right, without changing the order of the non-witnesses within themselves:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*aOgtjYftfvGCKX9haQu03Q.png" /></figure><p>Now <em>B_z</em> is just counting the number of observations below <em>z=1,…,N</em> that belong to <em>P</em> in this new ordering! Note that for the first set of observations <em>B_z</em> just increases linearly by 1. Then there is a middle part in which B_z behaves like a hypergeometric process:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*1pr-xqYXFFl2W0wpCqLW_A.png" /></figure><p>Finally, the last few observations are only squares, so the value of <em>B_z </em>just reaches<em> m </em>and stays there.</p><pre>## Using the function defineB_z below<br><br>## Define n + m and confidence alpha<br>n&lt;-50<br>m&lt;-100<br>alpha&lt;-0.05<br><br># Define the candidate<br>lambda_c &lt;- 0.4<br><br><br>plot(1:(m+n),defineB_z(m,n,alpha,lambda_c), type=&quot;l&quot;, cex=0.5, col=&quot;darkblue&quot;)<br><br>for (b in (1:100)){<br>  <br>  lines(defineB_z(m,n,alpha,lambda_c), col=&quot;darkblue&quot;)<br>  <br>}<br></pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*it7RCKJGl1nV2iN-kF-mTw.png" /></figure><p>The key is that we moved all the circles more to the left than they were before! This is why <em>B_z </em>is always larger (or the same) than<em> V_z. 
</em>In particular, for <em>lambda=0</em>, we expect <em>W_p=0</em> and thus <em>B_z=V_z.</em></p><pre>library(stats)<br><br>defineB_z &lt;- function(m,n,alpha,lambda_c){<br><br>## Upper bounds on the witness counts for the given m, n, alpha and lambda_c<br><br>W_p&lt;-qbinom(p=alpha/3, size=m, prob=lambda_c, lower.tail=F)<br>W_q&lt;-qbinom(p=alpha/3, size=n, prob=lambda_c, lower.tail=F)<br><br>B_z&lt;-matrix(0, nrow=n+m)<br><br># First part: B_z=z (skipped if there are no witnesses from P, i.e. W_p=0)<br>if (W_p &gt; 0){<br>  B_z[1:W_p,] &lt;- 1:W_p<br>}<br><br># Last part: B_z=m<br>B_z[(m+n-W_q):(m+n),] &lt;- m<br><br># Middle part: Hypergeometric<br>for (z in (W_p+1):(m+n-W_q-1) ){<br>  <br>  B_z[z,]&lt;-rhyper(1, m-W_p, n-W_q, z-W_p)+W_p<br>  <br>}<br><br>return(B_z)<br>}</pre><h4>Putting it all together</h4><blockquote>Concepts in this section: Using the above to get a bound on B_z, leading to a bound on V_z and a subsequent HPLB lambdahat we can use. It is defined through an infimum, and to find it, we need to cycle through several candidates.</blockquote><p>Next, given the <strong>true <em>lambda=TV(P,Q)</em></strong>, we want to find a <em>Q(z,alpha, lambda)</em> that has</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/743/1*8p2wLBo7SioBBgHuBJ3FyA.png" /></figure><p>This will be used to define the final estimator in a second. Crucially, <em>Q(z,alpha, lambda_c)</em> needs to be defined for any <em>lambda_c</em>, but the probability statement only needs to be true at the true candidate <em>lambda_c=lambda</em>. So this is the “candidate” I focus on for the moment.</p><p>From above we know that for the first <em>W_p</em> values, <em>B_z</em> is just linearly increasing. So <em>B_z=z</em> and we can also set <em>Q(z,alpha, lambda)=z</em>, for <em>z=1,…,W_p</em>. Similarly, on the other side, when all m circles are counted, we know <em>B_z=m</em> and thus we can set <em>Q(z,alpha, lambda)=m</em> for all <em>z=m+n-W_q, …, m+n</em>. (Remember that in each case <em>lambda=TV(P,Q)</em> enters through <em>W_p</em> and <em>W_q</em>.)</p><p>What remains is the part in the middle that behaves as if under the Null. This is true for <em>z=W_p+1,…, m+n-W_q</em>. But since here <em>B_z</em> is again hypergeometric, we can use the same <em>q</em> function as above to get</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/989/1*zLJY85_TzHGG4RJoC6avaw.png" /></figure><p>The <em>alpha/3</em> is needed because we can also make mistakes with <em>W_p</em> and <em>W_q</em>: each of these bounds fails to overestimate the true witness count with probability at most <em>alpha/3</em>, so all three error probabilities together stay below <em>alpha</em>.</p><p>So all cases are covered! For <em>z=1,…,W_p</em>, <em>B_z-Q(z,alpha,lambda)=0</em>, and the same holds true for the last <em>W_q</em> values of <em>z</em>. The middle part, finally, is covered by the above equation. But since <em>B_z</em> is larger than <em>V_z</em>, we also have</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/778/1*Alk9VcuZn6FcaWhq6I250A.png" /></figure><p>Using this <em>Q</em> function, we can define our final estimator as</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*NCxccHOji97phAT6HIoH9A.png" /></figure><p>This looks horrible, but all it means is that starting from <em>lambda_c=0</em>, you (1) calculate <em>W_p(lambda_c), W_q(lambda_c)</em>, and thus <em>Q(z,alpha,lambda_c)</em>, and (2) check whether</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/632/1*vi95ZTPnPPlrWlL9s_aU1g.png" /></figure><p>is true. If it is, you can increase <em>lambda_c</em> a little bit and repeat steps (1) and (2). If it is not, you stop and set the estimator to this <em>lambda_c</em>.</p>
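<p>To make this recipe concrete, below is a rough sketch of the candidate search in R. This is <em>not</em> the code of the HPLB package (which implements the exact construction of the paper and is what you should use in practice); in particular, I assume here that the mid-range bound <em>Q(z,alpha,lambda_c)</em> is simply the <em>1-alpha/3</em> hypergeometric quantile, in line with the <em>q</em> function above, and the function and argument names are just illustrative. The vector <em>t</em> contains 0/1 labels (0 for observations from <em>P</em>, 1 for observations from <em>Q</em>) and <em>rho</em> is the ranking variable, e.g. the Random Forest probability estimates used below.</p><pre>## Rough sketch of the candidate search (illustration only, not the HPLB package code)<br>hplb_sketch &lt;- function(t, rho, alpha=0.05, step=0.01){<br>  <br>  ord &lt;- order(rho)<br>  m &lt;- sum(t==0)<br>  n &lt;- sum(t==1)<br>  N &lt;- m+n<br>  # V_z: number of observations from P among the z smallest values of rho<br>  V_z &lt;- cumsum(t[ord]==0)<br>  <br>  for (lambda_c in seq(0, 1, by=step)){<br>    <br>    W_p &lt;- qbinom(p=alpha/3, size=m, prob=lambda_c, lower.tail=F)<br>    W_q &lt;- qbinom(p=alpha/3, size=n, prob=lambda_c, lower.tail=F)<br>    <br>    # Build Q(z, alpha, lambda_c) piece by piece<br>    Q_z &lt;- numeric(N)<br>    for (z in 1:N){<br>      if (z &lt;= W_p){<br>        Q_z[z] &lt;- z<br>      } else if (z &gt;= N-W_q){<br>        Q_z[z] &lt;- m<br>      } else {<br>        Q_z[z] &lt;- W_p + qhyper(1-alpha/3, m-W_p, n-W_q, z-W_p)<br>      }<br>    }<br>    <br>    # Return the smallest candidate for which V_z &lt;= Q_z holds for all z<br>    if (max(V_z - Q_z) &lt;= 0){ return(lambda_c) }<br>  }<br>  return(1)<br>}</pre>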
<p>Mathematically, why does this inf definition of the estimator work? It just means that <em>lambdahat</em> is the smallest <em>lambda_c</em> such that</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/638/1*0jk13__4Zag1Q_l9a61ujw.png" /></figure><p>is true. So if the true <em>lambda (=TV(P,Q))</em> is smaller than that smallest value (our estimator), this condition cannot hold, and instead, the &gt; 0 condition above is true. But we have just seen above that this &gt;0 condition has a probability of occurring ≤ alpha, so we are fine.</p><p>All of this is implemented in the HPLB package on CRAN. The next section presents two examples.</p><p><strong>Some Examples</strong></p><p>Here, we use the estimator derived in the last section on two examples. In the first example, we use a Random Forest-induced estimate of the probability of belonging to class 1, as discussed above. In the second example, we actually use a regression function, showing that one can generalize the concepts discussed here.</p><p>In the first article, we already studied the following example:</p><pre>library(mvtnorm)<br>library(HPLB)<br><br>set.seed(1)<br>n&lt;-2000<br>p&lt;-2<br><br>#Larger delta -&gt; more difference between P and Q<br>#Smaller delta -&gt; Less difference between P and Q<br>delta&lt;-0<br><br># Simulate X~P and Y~Q for given delta<br>U&lt;-runif(n)<br>X&lt;-rmvnorm(n=n, sigma=diag(p))<br>Y&lt;- (U &lt;=delta)*rmvnorm(n=n, mean=rep(2,p), sigma=diag(p))+ (1-(U &lt;=delta))*rmvnorm(n=n, sigma=diag(p))<br><br>plot(Y, cex=0.8, col=&quot;darkblue&quot;)<br>points(X, cex=0.8, col=&quot;red&quot;)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*h_tUwRru7K-qLLpD7Kb15A.png" /></figure><p>In the above simulation, the <em>delta</em> parameter determines how different <em>P</em> and <em>Q</em> are, from <em>delta=0</em>, whereby <em>P=Q</em>, to <em>delta=1</em>, whereby <em>P</em> is a bivariate normal with mean (0,0) and <em>Q</em> is a bivariate normal with mean (2,2). Even a strong two-sample test would simply reject in all of these cases with <em>delta &gt; 0</em>, without telling us how strong the difference actually is. With our method, using Random Forest probability estimates, we get</p><pre>#Estimate HPLB for each case (vary delta and rerun the code)<br>t.train&lt;- c(rep(0,n/2), rep(1,n/2) )<br>xy.train &lt;-rbind(X[1:(n/2),], Y[1:(n/2),])<br>t.test&lt;- c(rep(0,n/2), rep(1,n/2) )<br>xy.test &lt;-rbind(X[(n/2+1):n,], Y[(n/2+1):n,])<br>rf &lt;- ranger::ranger(t~., data.frame(t=t.train,x=xy.train))<br>rho &lt;- predict(rf, data.frame(t=t.test,x=xy.test))$predictions<br>tvhat &lt;- HPLB(t = t.test, rho = rho, estimator.type = &quot;adapt&quot;)<br>tvhat</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/326/1*A0sjCxyIRd5wYF5n5-NOdw.png" /></figure><p>as we would have hoped: The lower bound is zero when the distributions are the same (i.e., the implicit test cannot reject) and progressively increases as <em>P</em> and <em>Q</em> get more different (i.e., as <em>delta</em> increases).</p><p>We can also look at a more general example. Suppose we observe a (more or less) independent sample that, however, has a mean shift in the middle:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ppvDY9FyJHPWkLmLmZEarQ.png" /></figure><p>Let us consider the index of the observations as time t in this example, so that from left to right we move from time <em>t=1</em> to <em>t=1000</em>. The code follows below.
We can now check at each point of interest (for instance at each sample point) how large the TV distance is between <em>P</em>=points on the left and <em>Q</em>=points on the right. This can be done by again using a probability estimate for each <em>t</em>, but to speed things up, we instead use a regression of time <em>t</em> on the observations <em>z_t. </em>That is we check whether the observation value gives us an indication of whether it lies more on the left or right.</p><p>The following picture shows the true TV in red and our HPLB in black:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*QmNIna8_zLDb0zmkECMl6w.png" /></figure><p>It can be seen that the method nicely detects the increase in TV distance when we move from left to right in the graph. It then peaks at the point where the change in distribution happens, indicating that the main change happens there. Of course, as can be seen in the code below, I cheated a bit here; I generated the whole process two times, once for training once for testing. In general in this example, one has to be a bit more careful about how to choose training and test set.</p><p>Importantly, at each point where we calculate the TV distance, we implicitly calculate a two-sample test.</p><pre>library(HPLB)<br>library(ranger)<br>library(distrEx)<br><br><br>n &lt;- 500<br>mean.shift &lt;- 2<br>t.train &lt;- runif(n, 0 ,1)<br>x.train &lt;- ifelse(t.train&gt;0.5, stats::rnorm(n, mean.shift), stats::rnorm(n))<br>rf &lt;- ranger::ranger(t~x, data.frame(t=t.train,x=x.train))<br><br>n &lt;- 500<br>t.test &lt;- runif(n, 0 ,1)<br>x.test &lt;- ifelse(t.test&gt;0.5, stats::rnorm(n, mean.shift), stats::rnorm(n))<br>rho &lt;- predict(rf, data.frame(t=t.test,x=x.test))$predictions<br><br>## out-of-sample<br>tv.oos &lt;- HPLB(t = t.test, rho = rho, s = seq(0.1,0.9,0.1), estimator.type = &quot;adapt&quot;)<br><br><br>## total variation values<br>tv &lt;- c()<br>for (s in seq(0.1,0.9,0.1)) {<br>  <br>  if (s&lt;=0.5) {<br>    <br>    D.left &lt;- Norm(0,1)<br>  } else {<br>    <br>    D.left &lt;- UnivarMixingDistribution(Dlist = list(Norm(0,1),Norm(mean.shift,1)),<br>                                       mixCoeff = c(ifelse(s&lt;=0.5, 1, 0.5/s), ifelse(s&lt;=0.5, 0, (s-0.5)/s)))<br>  }<br>  if (s &lt; 0.5) {<br>    <br>    D.right &lt;- UnivarMixingDistribution(Dlist = list(Norm(0,1),Norm(mean.shift,1)),<br>                                        mixCoeff = c(ifelse(s&lt;=0.5, (0.5-s)/(1-s), 0), ifelse(s&lt;=0.5, (0.5/(1-s)), 1)))<br>  } else {<br>    <br>    D.right &lt;- Norm(mean.shift,1)<br>  }<br>  tv &lt;- c(tv, TotalVarDist(e1 = D.left, e2 = D.right))<br>}<br><br>## plot<br>oldpar &lt;- par(no.readonly =TRUE)<br>par(mfrow=c(2,1))<br>plot(t.test,x.test,pch=19,xlab=&quot;t&quot;,ylab=&quot;x&quot;)<br>plot(seq(0.1,0.9,0.1), tv.oos$tvhat,type=&quot;l&quot;,ylim=c(0,1),xlab=&quot;t&quot;, ylab=&quot;TV&quot;)<br>lines(seq(0.1,0.9,0.1), tv, col=&quot;red&quot;,type=&quot;l&quot;)<br>par(oldpar)<br><br></pre><h4>Conclusion</h4><p>In this article, we took a deep dive into the construction of an HPLB for the TV distance. Of course, what we really lower-bounded was the TV distance on the probability estimates. In fact, the challenge is to find a “projection” or classifier that is powerful enough to still find a signal. 
Algorithms like Random Forest, as we used here, are examples of such powerful methods, that moreover don’t really require any tuning.</p><p>We hope that with the code provided here and on CRAN in the HPLB package, these basic probability considerations we did here might actually be used on some real-world problems.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=11f0fdd73044" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/deep-dive-into-hplbs-for-a-b-testing-using-random-forest-11f0fdd73044">Deep Dive into HPLBs for A/B testing using Random Forest</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[R-NL: Robust Nonlinear Shrinkage]]></title>
            <link>https://medium.com/data-science/high-dimensional-covariance-estimation-when-tails-are-heavy-da61723ce9b3?source=rss-ca780798011a------2</link>
            <guid isPermaLink="false">https://medium.com/p/da61723ce9b3</guid>
            <category><![CDATA[statistics]]></category>
            <category><![CDATA[covariance-matrix]]></category>
            <category><![CDATA[thoughts-and-theory]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[high-dimensional-data]]></category>
            <dc:creator><![CDATA[Jeffrey Näf]]></dc:creator>
            <pubDate>Wed, 02 Nov 2022 17:39:01 GMT</pubDate>
            <atom:updated>2023-08-03T10:18:00.411Z</atom:updated>
            <content:encoded><![CDATA[<h4>High-dimensional covariance estimation when tails are heavy</h4><p>In this article, I discuss a new covariance estimation method from our recent <a href="https://ieeexplore.ieee.org/abstract/document/10109124">paper </a>“R-NL: Covariance Matrix Estimation for Elliptical Distributions based on Nonlinear Shrinkage’’. I introduce the problem we are solving, try to give some intuition on how we are solving it, and briefly present the simple code we developed. On the way, I touch upon some interesting concepts, like the (robust) “Tyler’s estimator”, which I feel are somewhat underused in modern data science, probably because no author apparently ever provided code with their paper (or maybe it is because most of the papers regarding this topic appear to appear in the signal processing community, which is another potential reason they get overlooked by data scientists).</p><p><strong>Nonlinear Shrinkage</strong></p><p><a href="https://towardsdatascience.com/nonlinear-shrinkage-an-introduction-825316dda5b8">Nonlinear shrinkage</a> is a powerful tool in high-dimensional covariance estimation. In the new paper, I and my coauthors introduce an adapted version that tends to produce even better results in heavy-tailed models while keeping the strong results in other cases. To showcase this new approach and the problem it solves, let’s start with an R example. We first load the necessary functions and define the dimensions <em>p</em> and numbers of examples <em>n, </em>as well as the true covariance matrix we want to study.</p><pre># For simulating the data:<br>library(mvtnorm)</pre><pre># The NL/QIS method available on <a href="https://github.com/MikeWolf007/covShrinkage/blob/main/qis.R">https://github.com/MikeWolf007/covShrinkage/blob/main/qis.R</a><br>source(&quot;qis.R&quot;)</pre><pre># Set the seed, n and p<br>set.seed(1)</pre><pre># p quite high relativ to n<br>n&lt;-300<br>p&lt;-200</pre><pre>#Construct the dispersion matrix<br>Sig&lt;-sapply(1:p, function(i) {sapply(1:p, function(j) 0.7^{abs(i-j)} )} )</pre><p>The covariance matrix we defined here corresponds to an AR process. That is, while the observations are independent, the dimensions <em>X_i</em> and <em>X_j</em> are less related the bigger the difference between <em>i</em> and <em>j</em> are in absolute values. This means correlations get exponentially smaller, the farther away one is from the diagonal and there is actually some structure to be learned here.</p><p>First, let’s simulate an i.i.d. sample of random vectors from a Gaussian distribution with the correlation structure defined before:</p><pre>### Multivariate Gaussian case<br>X&lt;-rmvnorm(n = n, sigma = Sig)</pre><p>In the <a href="https://towardsdatascience.com/nonlinear-shrinkage-an-introduction-825316dda5b8">nonlinear shrinkage article<em>,</em></a> I explained what the optimal estimator is if one is only allowed to modify the <em>eigenvalues </em>of the sample covariance matrix, but has to leave the <em>eigenvectors </em>intact (which is what a lot of shrinkage methods do): Let thus <strong><em>u</em></strong><em>_j</em>, <em>j=1,..p</em>, be the eigenvectors of the sample covariance matrix and</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/414/1*g2g8SuAioFIpIuQjDVKdsw.png" /></figure><p>the matrix of eigenvectors. 
Then the optimal values are given by</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/766/1*Gm7X0gdnFwwU3UTsUG67uQ.png" /></figure><p>resulting in the optimal estimator</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/220/1*Oarbjaw495ygtULYQXN_og.png" /></figure><p>Note that the optimal values in (1) are not exactly equal to the true eigenvalues in general, but instead linear combinations of the true eigenvalues, according to the (fixed) sample eigenvectors.</p><p>So let’s calculate the sample covariance matrix and the nonlinear shrinkage matrix and check how close we are to this ideal case:</p><pre>## Sample Covariance Matrix<br>samplespectral&lt;-eigen(cov(X))</pre><pre>## Nonlinear Shrinkage<br>Cov_NL&lt;-qis(X)<br>NLvals&lt;-sort( diag( t(samplespectral$vectors)%*%Cov_NL%*%samplespectral$vectors  ), decreasing=T)</pre><pre>## Optimal: u_j&#39;*Sig*u_j for all j=1,...,p<br>optimalvals&lt;-sort(diag( t(samplespectral$vectors)%*%Sig%*%samplespectral$vectors  ), decreasing=T)</pre><pre>plot(sort(samplespectral$values, decreasing=T), type=&quot;l&quot;, cex=1.5, lwd=2, lty=2, ylab=&quot;Eigenvalues&quot;,)<br>lines(optimalvals, type=&quot;l&quot;, col=&quot;red&quot;, cex=1.5, lwd=2, lty=1)<br>lines(NLvals, type=&quot;l&quot;, col=&quot;green&quot;, cex=1.5, lwd=2, lty=3)<br>title(main=&quot;Multivariate Gaussian&quot;)</pre><pre>legend(200, 8, legend=c(&quot;Sample Eigenvalues&quot;, &quot;Attainable Truth&quot;, &quot;NL&quot;),col=c(&quot;black&quot;, &quot;red&quot;, &quot;green&quot;), lwd=2, lty=c(2,1,3), cex=1.5)</pre><p>which gives the plot:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/850/1*JDBAChvG0VaWIDgMZ1N-Sg.png" /><figcaption>Source: Author</figcaption></figure><p>The plot shows the optimal value on the diagonal of (1), as well as the sample and nonlinear shrinkage estimates. It looks like one would hope, the sample eigenvalues show excess dispersion (too large for large values, too small for small ones), while nonlinear shrinkage is pretty close to the ideal values. This is what we would expect simply because the dimension <em>p=200</em> is quite high compared to the sample size <em>n=300</em>. 
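</p><p>If you prefer numbers to pictures, a quick check (just an illustration, reusing the objects <em>Sig</em>, <em>X</em> and <em>Cov_NL</em> defined above; the exact values depend on the seed) is to compare the estimation errors in Frobenius norm directly:</p><pre># Frobenius-norm errors of the two estimators (illustration only)<br>norm(cov(X) - Sig, type=&quot;F&quot;)    # sample covariance matrix<br>norm(Cov_NL - Sig, type=&quot;F&quot;)    # nonlinear shrinkage, should typically be smaller</pre><p>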
The larger <em>p</em> is chosen relative to <em>n </em>the worse this would look for the sample covariance matrix.</p><p>Now we do the same, but simulating from a multivariate t-distribution with 4 degrees of freedom, a (very) heavy-tailed distribution:</p><pre>### Multivariate t case<br>X&lt;-rmvt(n=n, sigma=Sig, df=4)</pre><pre>## Truth<br>Sig &lt;-4/(4-2)*Sig  ## Need to rescale with a t distribution</pre><pre>## Sample Covariance Matrix<br>samplespectral&lt;-eigen(cov(X))</pre><pre>## Nonlinear Shrinkage<br>Cov_NL&lt;-QIS(X)$Sig<br>NLvals&lt;-sort( diag( t(samplespectral$vectors)%*%Cov_NL%*%samplespectral$vectors  ), decreasing=T)</pre><pre>## Optimal: u_j&#39;*Sig*u_j for all j=1,...,p<br>optimalvals&lt;-sort(diag( t(samplespectral$vectors)%*%Sig%*%samplespectral$vectors  ), decreasing=T)</pre><pre>plot(sort(samplespectral$values, decreasing=T), type=&quot;l&quot;, cex=15, lwd=2, lty=2, ylab=&quot;Eigenvalues&quot;,)<br>lines(optimalvals, type=&quot;l&quot;, col=&quot;red&quot;, cex=1.5, lwd=2, lty=1)<br>lines(NLvals, type=&quot;l&quot;, col=&quot;green&quot;, cex=1.5, lwd=2, lty=3)<br>title(main=&quot;Multivariate t&quot;)</pre><pre>legend(200, 40, legend=c(&quot;Sample Eigenvalues&quot;, &quot;Attainable Truth&quot;, &quot;NL&quot;),col=c(&quot;black&quot;, &quot;red&quot;, &quot;green&quot;), lty=c(2,1,3), cex=1.5, lwd=2)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/850/1*7GBDW1lH4VcgoW0qKLMFdQ.png" /><figcaption>Source:author</figcaption></figure><p>Now, this doesn’t look as good anymore! The nonlinear shrinkage values in green also show some excess dispersion: Large values get way too large, while small values are a bit too small. Apparently, in finite samples, heavy tails can distort nonlinear shrinkage. It would be nice to have a method that displays the same amazing results in the Gaussian case but also keeps good results in heavy-tailed models. This is the motivation behind our new method. I now go into some details, beginning with the key to the approach: elliptical distributions.</p><p><strong>Elliptical distributions</strong></p><p>The class of elliptical distributions includes a reasonable range of different distributions such as multivariate Gaussians, multivariate t, multivariate generalized hyperbolic, and so on. If a random vector <strong>X</strong> follows an elliptical distribution, it can be written as</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/420/1*h1Kkc__FPO3F9UJN3nWMuQ.png" /></figure><p>where <strong>S </strong>is uniformly distributed on the unit sphere in <em>p</em> dimensions and <em>R</em> is some nonnegative random variable independent of <strong>S</strong>. This may sound complicated, but it just means that an elliptical distribution can be reduced to a uniform distribution on a circle (in two dimensions) or on a sphere (in general). Thus these kinds of distributions have a very specific structure. In particular, we need to mention an important point: In the equation above, <strong>H</strong> is called the <em>dispersion matrix</em>, as opposed to the <em>covariance matrix</em>. In this article, we want to estimate the covariance matrix, which, if it exists, is given as</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/456/1*whurbjSlrgh_4f4QopfgzQ.png" /></figure><p>This is already interesting: In elliptical models, <strong>H</strong> exists by assumption but the covariance matrix might not if the expected value of <em>R</em> is not finite! 
For instance, the multivariate t distribution with degrees of freedom smaller than 2 has no covariance matrix. Nonetheless, we could still estimate <strong>H </strong>in this case. So the dispersion matrix is in a sense a more general concept. In this article, however, we will assume the covariance matrix exists, and in this case, we see from the above that dispersion and covariance matrix are the same up to a constant.</p><p>Interestingly, one can then look at <strong>Z</strong>=<strong>X/||X||, </strong>that is the random vector<strong> X </strong>divided by its Euclidean norm<strong> </strong>and this thing <em>will always have the same distribution</em>! In fact, this is just the uniform distribution on the <em>p-dimensional</em> sphere. We can see this in an example for <em>p=2</em>.</p><pre>X&lt;-rmvnorm(n = n, sigma = diag(2))<br>Y&lt;-rmvt(n = n, sigma = diag(2), df=4)</pre><pre># standardize by norm<br>ZGaussian&lt;-t(apply(X,1, function(x) x/sqrt(sum(x^2)) ))<br>Zt &lt;- t(apply(Y,1, function(x) x/sqrt(sum(x^2)) ))</pre><pre>par(mfrow=c(1,2))<br>plot(ZGaussian)<br>plot(Zt)</pre><p>which gives</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*u77c_5rCCHEjo9PqjdYImQ.png" /><figcaption>Source:author</figcaption></figure><h4>(Linear-Shrinkage) Tyler’s Estimator</h4><p>Tyler’s estimator of the dispersion matrix <strong>H</strong>, derived in [1], uses the fact that <strong>Z</strong>=<strong>X/||X|| </strong>always has the same distribution. Using the likelihood of that distribution, one can derive a maximum likelihood estimator (basically just taking derivatives and setting to zero) that looks like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/634/1*ExjMfKfKcYxaAlW1VhoiYg.png" /></figure><p>Note that this only defines <strong>H</strong> implicitly (it is both on the left and on the right), so the natural way to try to get to <strong>H</strong> is iterative:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/700/1*zt3Uj90djvh755KANBSXDQ.png" /></figure><p>where the second step is just a renormalization that is needed for technical reasons. One can show that indeed, this simple iterative scheme will converge to the solution in (2). This estimator of <strong>H </strong>is what is referred to as ‘’Tyler’s estimator’’.</p><blockquote>Tyler’s estimator is an iterative estimate of the dispersion matrix of an i.i.d. sample from an elliptical distribution. It is derived using the fact that an elliptical random vector standardized by its Eucledian norm always has the same distribution.</blockquote><p>Ok, so this is a way of robustifying the covariance or dispersion estimator against heavy tails. But the above Tyler’s estimator only works for <em>p &lt; n, </em>and deteriorates when<em> p </em>gets closer to<em> n</em>, so we still need to robustify against the case when <em>p</em> is close or even larger than <em>n</em>. A lot of papers in the signal processing community simply do this by using linear shrinkage (which I also explained <a href="https://towardsdatascience.com/nonlinear-shrinkage-an-introduction-825316dda5b8">here</a>) in each iteration. This then looks like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/946/1*OXDfRAWjFtuRCp_p8g9DOg.png" /></figure><p>where different choices of <em>rho </em>have inspired different papers. For instance, one of the papers that started this line of research is [2]. 
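</p><p>Since the plain Tyler iteration above is so simple, here is a minimal sketch of it in R. This is my own illustration for the case <em>p &lt; n</em>, without any shrinkage, assuming the data are centered at zero; the trace normalization is just one common choice for the renormalization step:</p><pre># Minimal sketch of the plain Tyler iteration (p &lt; n, no shrinkage, data centered at 0)<br>tyler_sketch &lt;- function(X, max.iter=100, tol=1e-8){<br>  <br>  n &lt;- nrow(X)<br>  p &lt;- ncol(X)<br>  H &lt;- diag(p)                      # start from the identity<br>  <br>  for (k in 1:max.iter){<br>    <br>    Hinv &lt;- solve(H)<br>    # weights 1/(x_i&#39; H^{-1} x_i) for each observation<br>    w &lt;- 1/rowSums((X %*% Hinv) * X)<br>    # (p/n) * sum_i w_i x_i x_i&#39;<br>    H_new &lt;- (p/n) * crossprod(X * sqrt(w))<br>    # renormalize so that the trace equals p (one common convention)<br>    H_new &lt;- H_new / sum(diag(H_new)) * p<br>    <br>    if (max(abs(H_new - H)) &lt; tol){ H &lt;- H_new; break }<br>    H &lt;- H_new<br>  }<br>  H<br>}</pre><p>For centered elliptical samples like the ones simulated earlier, <em>tyler_sketch(X)</em> should recover the shape of the dispersion matrix up to a scaling constant, as long as <em>p</em> is comfortably smaller than <em>n</em>.</p><p>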
We now instead use <em>nonlinear </em>shrinkage together with Tyler’s method.</p><blockquote>One can robustify Tyler’s estimator also for high dimensions, by using linear shrinkage in each iteration. Different papers have found smart adaptive ways to calculate the free parameter rho.</blockquote><p><strong>Robust Nonlinear Shrinkage</strong></p><p>The goal would now be to use the same iterative scheme as above, but instead of using linear shrinkage, iterate with nonlinear shrinkage. Unfortunately, this is not so straightforward. I won’t go into detail in this article, since some trickery is needed to make it work, but instead, I refer to our implementation on <a href="https://github.com/hedigers/RNL_Code">Github</a> and the paper. The implementation we provide also may be quite handy, because none of the above-mentioned papers of the signal processing community appear to give out code or implement their method in a package.</p><p>To showcase the performance and the code, we repeat the multivariate t analysis at the beginning with this new method. We restate the whole procedure above for completeness:</p><pre>### Multivariate t case<br>X&lt;-rmvt(n=n, sigma=Sig, df=4)</pre><pre>## Truth<br>Sig &lt;-4/(4-2)*Sig  ## Need to rescale with a t distribution</pre><pre>## Sample Covariance Matrix<br>samplespectral&lt;-eigen(cov(X))</pre><pre>## Nonlinear Shrinkage<br>Cov_NL&lt;-QIS(X)$Sig<br>NLvals&lt;-sort( diag( t(samplespectral$vectors)%*%Cov_NL%*%samplespectral$vectors  ), decreasing=T)</pre><pre>## R-NL code from <a href="https://github.com/hedigers/RNL_Code">https://github.com/hedigers/RNL_Code</a><br>Cov_RNL&lt;-RNL(X)<br>RNLvals&lt;-sort( diag( t(samplespectral$vectors)%*%Cov_RNL%*%samplespectral$vectors  ), decreasing=T)</pre><pre>## Optimal: u_j&#39;*Sig*u_j for all j=1,...,p<br>optimalvals&lt;-sort(diag( t(samplespectral$vectors)%*%Sig%*%samplespectral$vectors  ), decreasing=T)</pre><pre>plot(sort(samplespectral$values, decreasing=T), type=&quot;l&quot;, cex=1.5, lwd=2, lty=2, ylab=&quot;Eigenvalues&quot;,)<br>lines(optimalvals, type=&quot;l&quot;, col=&quot;red&quot;, cex=1.5, lwd=2, lty=1)<br>lines(NLvals, type=&quot;l&quot;, col=&quot;green&quot;, cex=1.5, lwd=2, lty=3)<br>lines( RNLvals, type=&quot;l&quot;, col=&quot;darkblue&quot;, cex=1.5, lwd=2, lty=3)<br>title(main=&quot;Multivariate t&quot;)</pre><pre>legend(200, 40, legend=c(&quot;Sample Eigenvalues&quot;, &quot;Attainable Truth&quot;, &quot;NL&quot;, &quot;R-NL&quot;),<br>       col=c(&quot;black&quot;, &quot;red&quot;, &quot;green&quot;, &quot;darkblue&quot;), lty=c(2,1,3,4), cex=1.5, lwd=2)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/809/1*SRccWlmnj4pbXn48py8xeA.png" /><figcaption>Source:author</figcaption></figure><p>R-NL (the blue line) is almost perfectly on the red line and thus mirrors the good performance we saw for NL in the Gaussian case! This is exactly what we wanted. Also, though it might be obscure in the code above, using the function is super easy, <em>RNL(X)</em> gives the estimator of the covariance matrix (if you can assume it exists) and <em>RNL(X, cov=F)</em> gives an estimate of <strong>H</strong>.</p><p>Finally, if we were to use the RNL function in the beginning for the Gaussian example, the values would look almost perfectly the same as they do with nonlinear shrinkage. In fact, the figure below shows simulation results from the paper using our two methods R-NL and a slight adaptation, R-C-NL, and a range of competitors. 
The setting is almost exactly as in the code above, with the same dispersion matrix and <em>n=300</em>, <em>p=200. </em>The difference is that we now vary the tail-determining parameter of the multivariate t distribution, from 3 (extremely heavy-tailed) to infinity (Gaussian case) on a grid. We won&#39;t go into details about what the numbers on the y-axis exactly mean, just that larger is better and 100 is the maximal value. The competitors “R-LS” and “R-GMV-LS” are two linear shrinkage Tyler’s estimators, as mentioned above, while “NL” is nonlinear shrinkage. It can be seen that we are (much) better than the latter for heavy tails and then converge to the same values, once we approach the Gaussian tail behavior.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-up4AM0hLmPdi3lD9-JiKg.png" /><figcaption>Simulation results from the paper on arXiv.</figcaption></figure><h4>Conclusion</h4><p>This article discussed the paper “R-NL: Fast and Robust Covariance Estimation for Elliptical Distributions in High Dimensions’’. The paper on arXiv contains a wide range of simulation settings, indicating that the R-NL and R-C-NL estimators do exceedingly well in a wide range of situations.</p><p>Thus I hope that the estimator(s) can be successfully used in a lot of real applications as well!</p><p><strong>References</strong></p><p>[1] Tyler, D. E. (1987a). A distribution-free M-estimator of multivariate scatter. Annals of Statistics, 15(1):234–251.</p><p>[2] Chen, Y., Wiesel, A., and Hero, A. O. (2011). Robust shrinkage estimation of high dimensional covariance matrices. IEEE Transactions on Signal Processing, 59(9):4097– 4107</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=da61723ce9b3" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/high-dimensional-covariance-estimation-when-tails-are-heavy-da61723ce9b3">R-NL: Robust Nonlinear Shrinkage</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>