Robust Measure of Location and Simulation Study

Burak Dilber
Data Science Earth
Jan 13, 2021

Are estimators such as the mean, median, and mode sufficient? These estimators do solve some practical problems. For example, when a data set has missing values and we want to fill them in, the value we assign is usually the mean or the median of that data set. How do we choose? We look at the distribution: if it is close to normal, the mean works well, and the further the data move away from normality, the more we prefer the median. The mode is generally used to fill in missing observations in data sets composed of qualitative data. But have you ever considered that there might be more durable estimators that could serve as alternatives to these? Let’s talk about these estimators now…
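As a small aside, here is a minimal sketch of that imputation step in R (the vector x is made up purely for illustration):

x <- c(3, 7, NA, 5, 100, NA, 6)
x[is.na(x)] <- median(x, na.rm = TRUE)   # fill missing values with the median
x
[1]   3   7   6   5 100   6   6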

Why are robust estimators called durable? Is it because they give better results than the mean or median? There are actually two answers to this question, yes and no. No, because the mean still gives reliable results for data sets that are normally or nearly normally distributed. Yes, because as a distribution moves away from normality, robust measures of location give better results, especially in the presence of outlying observations… Now let’s take a look at these robust measures of location and how they are calculated.

Trimmed Mean

The trimmed mean, as the name implies, estimates location by trimming the data set. A trimming percentage is chosen for this, usually 20%. First the data set is sorted from smallest to largest and 20% of the total number of observations is computed; that many observations are then trimmed from each end of the sorted data. Here is an example:

Let the data set be as follows:

“10,15,30,25,20,22,32,40,14,11”

Let’s sort this data set from smallest to largest and take 20% of the total number of observations:

“10,11,14,15,20,22,25,30,32,40” ==> 20% = 10*(20/100)=2

Twenty percent of 10 observations is 2. Now let’s trim the first 2 and the last 2 observations from the sorted data and calculate the mean of the remaining 6 observations:

“14,15,20,22,25,30”

Now, by taking the mean of these observations, the trimmed mean is obtained: (14+15+20+22+25+30)/6=21

The trimmed mean of this data set is 21. The R code to do this is shown below.

data <- c(10, 11, 14, 15, 20, 22, 25, 30, 32, 40)
mean(data, trim = 0.2)
[1] 21

In the “mean” function, the “trim” parameter sets the trimming percentage (the abbreviation “tr” also works, through R’s partial argument matching). Set it to “0.2” for 20% trimming.
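As a quick sanity check, the same result can be obtained by trimming by hand in base R:

data <- c(10, 11, 14, 15, 20, 22, 25, 30, 32, 40)
x <- sort(data)
mean(x[3:8])   # drop the 2 smallest and 2 largest of the 10 values
[1] 21

So we have learned how to calculate the trimmed mean in R. Now we can move on to the next location estimator.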

Winsorized Mean

The mean reacts very quickly to outliers, so another estimator that can stand up to them is the winsorized mean. As with the trimmed mean, a percentage of the number of observations in the data set is calculated. Unlike trimming, however, the extreme observations are not discarded: the chosen number of values at each end is replaced by the nearest remaining smallest and largest values. Let’s see this using the previous example.

Likewise, the data set is sorted from smallest to largest and 20% of the number of observations is calculated:

“10,11,14,15,20,22,25,30,32,40” ==> 20% = 10*(20/100)=2

Let’s replace the first two observations with the value “14” and the last two observations with the value “30”:

“14,14,14,15,20,22,25,30,30,30”

The winsorized mean is now the mean of these observations: (14+14+14+15+20+22+25+30+30+30)/10 = 21.4

The winsorized mean of this data set is 21.4. In R, the “winmean” function does this. It is included in the “WRS2” package:

install.packages("WRS2")
library("WRS2")
data <- c(10, 11, 14, 15, 20, 22, 25, 30, 32, 40)
winmean(data, tr = 0.2)
[1] 21.4

In the “winmean” function, the “tr” parameter sets the winsorizing percentage. Set it to “0.2” for 20% winsorizing. So we have learned how to calculate the winsorized mean in R.
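A quick way to convince yourself is to winsorize by hand in base R:

data <- c(10, 11, 14, 15, 20, 22, 25, 30, 32, 40)
x <- sort(data)
x[1:2] <- x[3]    # pull the two smallest values up to 14
x[9:10] <- x[8]   # pull the two largest values down to 30
mean(x)
[1] 21.4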

One-Step M-Estimator

This estimator was developed by Huber (1981). Again, the aim is to give more reliable results against outliers and in large samples. With the observations sorted as X(1) ≤ … ≤ X(n), the estimator is calculated as

$$\hat{\theta}_{os} = \frac{1.28\,\mathrm{MADN}\,(i_2 - i_1) + \sum_{i = i_1 + 1}^{n - i_2} X_{(i)}}{n - i_1 - i_2}$$

Here, MADN denotes the normalized median absolute deviation, calculated as MAD/0.6745, where MAD is the median absolute deviation, MAD = median{|X1 − M|, …, |Xn − M|}, and M is the sample median.

i1 is the number of observations Xi with (Xi − M)/MADN < −1.28, and i2 is the number of observations Xi with (Xi − M)/MADN > 1.28. The value 1.28 is the cutoff used to flag outliers.

The R code for calculating the one-step M-estimator is shown below. Again, the “WRS2” package must be installed and loaded.

data <- c(10, 11, 14, 15, 20, 22, 25, 30, 32, 40)
onestep(data, bend = 1.28)
[1] 21.57576

Applying the one-step M-estimator to our running example gives 21.57576.
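As a sanity check, the formula above can be applied step by step in base R (note that R’s mad() function already returns the normalized MAD, i.e. MAD/0.6745):

data <- c(10, 11, 14, 15, 20, 22, 25, 30, 32, 40)
x <- sort(data)
M <- median(x)
MADN <- mad(x)          # mad() rescales by 1.4826, which is 1/0.6745
z <- (x - M) / MADN
i1 <- sum(z < -1.28)    # flagged in the left tail (none here)
i2 <- sum(z > 1.28)     # flagged in the right tail (only the value 40)
(1.28 * MADN * (i2 - i1) + sum(x[(i1 + 1):(length(x) - i2)])) / (length(x) - i1 - i2)
[1] 21.57576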

Modified One-Step M-Estimator

Another method that is resistant to outliers is the modified one-step M-estimator (MOM). In this method, in order to obtain reasonably good efficiency under normality, the outlier-detection cutoff is changed: observations with (Xi − M)/MADN < −2.24 or (Xi − M)/MADN > 2.24 are flagged as outliers and removed, and the mean of the remaining observations is taken.

For the modified one-step M-estimator, the “mom” function in the “WRS2” package can be used:

data <- c(10, 11, 14, 15, 20, 22, 25, 30, 32, 40)
mom(data, bend = 2.24)
[1] 21.9

Thus, the modified one-step M-estimator for our earlier example is 21.9.
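A by-hand version makes the logic visible: flag observations with the 2.24 rule, drop them, and average what remains.

data <- c(10, 11, 14, 15, 20, 22, 25, 30, 32, 40)
z <- (data - median(data)) / mad(data)   # mad() returns MAD/0.6745
mean(data[abs(z) <= 2.24])
[1] 21.9

No observation is flagged in this small example, so the result coincides with the plain mean.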

Tau Measure of Location

Another estimator is the tau measure of location. The tau estimator weights the observations to compute the estimate. Let M be the sample median and MAD the median absolute deviation, and standardize each observation as Yi = (Xi − M)/MAD. The weights are

$$W_i = \left[1 - \left(\frac{Y_i}{c}\right)^2\right]^2 I(|Y_i| \le c)$$

where the indicator function I(|x| ≤ c) equals 1 when |x| ≤ c and 0 otherwise, and c is set to 4.5. The estimate is then the weighted mean

$$\hat{\mu}_\tau = \frac{\sum_{i=1}^{n} W_i X_i}{\sum_{i=1}^{n} W_i}$$

So that is how the tau estimator is calculated. In R, the “tauloc” function in the “WRS2” package can be used:

data <- c(10, 11, 14, 15, 20, 22, 25, 30, 32, 40)
tauloc(data, cval = 4.5)
[1] 21.11821

For the example shown earlier, the tau estimator gives 21.11821.
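As a by-hand check of the weighting scheme (a sketch mirroring what WRS2’s tauloc computes; note that it standardizes by the raw, unnormalized MAD, which is qnorm(0.75) * mad(x) in R):

data <- c(10, 11, 14, 15, 20, 22, 25, 30, 32, 40)
M <- median(data)
s <- qnorm(0.75) * mad(data)   # undoes mad()'s normalization: this is the raw MAD
y <- (data - M) / s
w <- (1 - (y / 4.5)^2)^2 * (abs(y) <= 4.5)   # zero weight outside |y| <= 4.5
sum(w * data) / sum(w)
[1] 21.11821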

So far we have covered location estimators that can serve as alternatives to the classical ones and are more resistant to skewed distributions and extreme values, and we have seen how to calculate each of them. Now let’s compare these methods in a simulation study:

SIMULATION STUDY

Random samples are generated for the simulation study. Naturally, the data we generate should include skewed distributions with extreme values. A convenient choice for this is the g-and-h distribution. Let’s describe it:

The g-and-h distribution, described by Hoaglin (1985), is obtained by transforming the standard normal distribution. To generate random data from it, values of g and h are chosen, typically between 0 and 1: g controls skewness and h controls the heaviness of the tails. Now let’s look at the shape of the distribution for particular values:
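Under the hood, generating g-and-h data is just a transformation of standard normal draws. A minimal sketch (this mirrors what WRS2’s ghdist function does):

# X = ((exp(g*Z) - 1)/g) * exp(h*Z^2/2); the g = 0 case is the limit Z*exp(h*Z^2/2)
rgh <- function(n, g = 0, h = 0) {
  z <- rnorm(n)
  if (g == 0) z * exp(h * z^2 / 2)
  else (exp(g * z) - 1) / g * exp(h * z^2 / 2)
}
hist(rgh(10000, g = 0.5, h = 0.5), breaks = 100)   # asymmetric and heavy-tailed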

When g = 0 and h = 0, the shape of the distribution is shown below:

Standard Normal Distribution (g=0, h=0)

As shown in the figure, when both g and h are set to 0, the generated data come from the standard normal distribution. Let’s examine the shape of the distribution for other values:

When g = 0 and h = 0.5, the shape of the distribution is shown below:

Symmetric and Heavy Tailed Distribution (g=0, h=0.5)

With these values, we can say that the distribution is symmetrical and heavy-tailed.

When g = 0.5 and h = 0, the shape of the distribution is shown below:

Asymmetric and Light Tailed Distribution (g=0.5, h=0)

Here the distribution is asymmetric and light-tailed.

When g = 0.5 and h = 0.5, the shape of the distribution is shown below:

Asymmetric and Heavy Tailed Distribution (g=0.5, h=0.5)

When g and h are both 0.5, the distribution is asymmetric and heavy-tailed.

As these plots show, the data generated from this distribution become increasingly skewed as g and h grow, and outliers appear. In R, the “ghdist” function inside the “WRS2” package generates g-and-h data:

install.packages("WRS2")
library(WRS2)
ghdist(n, g = 0, h = 0)

The parameter “n” in the “ghdist” function is the number of observations to generate; the “g” and “h” parameters specify the g and h values.
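For example, to draw 100 observations from the asymmetric, heavy-tailed case shown above:

x <- ghdist(100, g = 0.5, h = 0.5)
summary(x)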

Now let’s run 1000 simulation replications using this distribution at several sample sizes:

The distributions we will use ==> (g = 0, h = 0), (g = 0, h = 0.5), (g = 0.5, h = 0), (g = 0.5, h = 0.5)

Random sample sizes we will generate ==> 10, 25, 50, 100, 500, 1000

Simulation replications ==> 1000

The comparison criterion ==> mean squared error (MSE; see the formula right after this list)
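Concretely, each estimator’s population value θ is approximated from one very large sample, and the MSE is then estimated from the K = 1000 simulated estimates as variance plus squared bias:

$$\widehat{\mathrm{MSE}}(\hat{\theta}) = \widehat{\mathrm{Var}}(\hat{\theta}) + \left(\frac{1}{K}\sum_{k=1}^{K}\left(\hat{\theta}_k - \theta\right)\right)^{2}$$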

If you’re ready, let’s start!

First, let’s generate the data from the standard normal distribution and compare the estimators in terms of the MSE criterion:

# Simulation: compare the location estimators under the g-and-h distribution
# (start with g = 0, h = 0, i.e. the standard normal case).
library(WRS2)

K <- 1000                                 # simulation replications
sizes <- c(10, 25, 50, 100, 500, 1000)    # sample sizes
g <- 0; h <- 0                            # change these for the other scenarios

# The seven estimators under comparison
estimators <- list(
  "Mean"            = function(x) mean(x),
  "Median"          = function(x) median(x),
  "Trimmed Mean"    = function(x) mean(x, trim = 0.2),
  "Winsorized Mean" = function(x) winmean(x, tr = 0.2),
  "One Step"        = function(x) onestep(x, bend = 1.28),
  "Modified M"      = function(x) mom(x, bend = 2.24),
  "Tau"             = function(x) tauloc(x, cval = 4.5)
)

# Approximate each estimator's "true" value from one very large sample
big <- ghdist(1000000, g = g, h = h)
true_vals <- sapply(estimators, function(f) f(big))

# est[k, i, j]: estimate in replication k, for estimator i and sample size j
est <- array(NA, dim = c(K, length(estimators), length(sizes)))
for (k in 1:K) {
  for (j in seq_along(sizes)) {
    x <- ghdist(sizes[j], g = g, h = h)
    est[k, , j] <- sapply(estimators, function(f) f(x))
  }
}

# MSE = variance + squared bias, for every estimator/sample-size combination
mse <- matrix(NA, nrow = length(estimators), ncol = length(sizes),
              dimnames = list(names(estimators), paste0("n=", sizes)))
for (i in seq_along(estimators)) {
  for (j in seq_along(sizes)) {
    mse[i, j] <- var(est[, i, j]) + mean(est[, i, j] - true_vals[i])^2
  }
}
mse

Let’s look at the output and interpret the results:

Simulation Results (g=0, h=0)

Here the mean and the one-step M-estimator give good results, and as the sample size grows, the MSE values of all the estimators move closer together. In short, for a normally distributed data set the mean and the one-step M-estimator are the ones to prefer.

Let’s set g = 0 and h = 0.5 in the R code and look at the results:

Simulation Results (g=0, h=0.5)

In the symmetric, heavy-tailed distribution the mean performs poorly compared to the other estimators; the median and the trimmed mean give the best results. As the sample size increases, the MSE values converge. For such a distribution the median and the trimmed mean are the estimators to prefer.

Let’s set g = 0.5 and h = 0 in the R code and look at the results:

Simulation Results (g=0.5, h=0)

Here the estimators’ MSE values are very close to one another. The trimmed mean, the winsorized mean, and the one-step M-estimator may be preferred.

Now let’s set g = 0.5 and h = 0.5 in the R code and look at the results:

Simulation Results (g=0.5, h=0.5)

In the asymmetric, heavy-tailed distribution the mean gives poor results; at n = 50 in particular, its MSE is visibly higher than that of the other estimators, which all perform similarly. The median, the trimmed mean, the one-step M-estimator, and the modified one-step M-estimator may be preferred.

Finally, let’s generate data from the lognormal distribution, which is even more strongly skewed, and compare the results. The R code is shown below:

# Same simulation as before (K, sizes, and estimators are reused);
# only the data-generating step changes to rlnorm().
big <- rlnorm(1000000)
true_vals <- sapply(estimators, function(f) f(big))

est <- array(NA, dim = c(K, length(estimators), length(sizes)))
for (k in 1:K) {
  for (j in seq_along(sizes)) {
    x <- rlnorm(sizes[j])
    est[k, , j] <- sapply(estimators, function(f) f(x))
  }
}

mse <- matrix(NA, nrow = length(estimators), ncol = length(sizes),
              dimnames = list(names(estimators), paste0("n=", sizes)))
for (i in seq_along(estimators)) {
  for (j in seq_along(sizes)) {
    mse[i, j] <- var(est[, i, j]) + mean(est[, i, j] - true_vals[i])^2
  }
}
mse

Now let’s look at the output and compare the results:

Simulation Results (Lognormal Distribution)

Here again we see that the mean does not hold up on data sets with skewed distributions and extreme values. For such distributions, robust measures of location should be preferred.

CONCLUSION

We have discussed which estimators to use for data sets containing outliers, and by comparing their performance under skewed distributions we identified which ones to prefer. In other words, looking at a data set’s distribution and its outliers now tells us which estimator to use. See you in the next article…

REFERENCES

  • Rand Wilcox (2017), Introduction to Robust Estimation and Hypothesis Testing (4th edition), Elsevier, Los Angeles.
  • Rand Wilcox (2017), Modern Statistics for the Social and Behavioral Sciences (2nd edition), Taylor & Francis, Los Angeles.
