Statistical Methods for Identifying Outliers (Regression Analysis Approach) (Part II)

Chetan Borse
Published in Analytics Vidhya
4 min read · Jul 11, 2020

In my previous article, we looked at the univariate approach to outlier detection. Let’s look further, this time at techniques based on regression analysis.

Statistical Techniques and tools

2.1 Standardized Residuals
2.2 Studentized Residuals
2.3 Cook’s Distance
2.4 Leverage
2.5 DFBETAS
2.6 DFFITS

2.1 Standardized Residuals

Since the approximate average variance of a residual is estimated by MSRes (the residual mean square), a logical scaling is to divide each residual by the square root of MSRes, which gives the standardized residuals. The standardized residuals have mean zero and approximately unit variance.
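Written out, with e_i denoting the i-th ordinary residual (observed minus fitted value) and n the number of observations, the scaling described above takes the following form. The symbols e_i and n are my notation, chosen to match the usual textbook convention:

```latex
d_i = \frac{e_i}{\sqrt{MS_{Res}}}, \qquad i = 1, 2, \ldots, n
```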

Criteria:

A large standardized residual (|di| > 3) potentially indicates an outlier.
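As a minimal sketch of this criterion, the snippet below fits a two-predictor regression with NumPy and flags observations whose standardized residual exceeds 3 in absolute value. The data are hypothetical (they are not the delivery time data used later), one response is deliberately shifted at 0-based index 9, and all variable names are my own.

```python
import numpy as np

# Hypothetical data (NOT the article's delivery time table): 20 observations,
# two predictors, and one response deliberately shifted at 0-based index 9.
i = np.arange(20)
x1 = i + 1.0
x2 = (7 * i % 20) * 25.0
y = 2.0 + 1.5 * x1 + 0.01 * x2 + 0.5 * np.sin(1.7 * i)
y[9] += 15.0

X = np.column_stack([np.ones_like(x1), x1, x2])   # design matrix with intercept
n, p = X.shape                                    # p counts the intercept too

beta, *_ = np.linalg.lstsq(X, y, rcond=None)      # least-squares coefficients
e = y - X @ beta                                  # ordinary residuals
ms_res = e @ e / (n - p)                          # residual mean square, MSRes
d = e / np.sqrt(ms_res)                           # standardized residuals

flagged = np.flatnonzero(np.abs(d) > 3)           # indices with |d_i| > 3
print("flagged indices:", flagged)
print("their standardized residuals:", np.round(d[flagged], 3))
```

Note that the indices printed are 0-based, so index 9 is the tenth observation.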

2.2 Studentized Residuals

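A studentized residual (sometimes referred to as an “externally studentized residual” or a “deleted t residual”) is the residual divided by an estimate of its standard deviation, where that estimate is computed with the observation in question left out of the fit.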
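In the usual notation, with e_i the i-th residual, h_ii the i-th diagonal element of the hat matrix, p the number of regression parameters (including the intercept) and S(i)^2 the residual variance estimated without observation i, the standard definition can be written as below. These symbols are not defined elsewhere in this article, so treat the notation as my assumption:

```latex
t_i = \frac{e_i}{\sqrt{S_{(i)}^2 \,(1 - h_{ii})}},
\qquad
S_{(i)}^2 = \frac{(n - p)\, MS_{Res} - e_i^2 / (1 - h_{ii})}{n - p - 1}
```

The second expression lets us obtain S(i)^2 from the full fit, without refitting the model n times.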

Criteria :

Studentized residuals are generally more effective for detecting outlying observations than standardized residuals. If an observation has a studentized residual larger than 3 in absolute value, we can call it an outlier.
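A rough sketch of this check, again on hypothetical data rather than the delivery time data, could look as follows. It uses the standard shortcut for the leave-one-out variance so that no refitting is needed. NumPy and the variable names are my choices.

```python
import numpy as np

# Hypothetical data (NOT the article's delivery time table); the response at
# 0-based index 9 is deliberately shifted away from the trend.
i = np.arange(20)
x1 = i + 1.0
x2 = (7 * i % 20) * 25.0
y = 2.0 + 1.5 * x1 + 0.01 * x2 + 0.5 * np.sin(1.7 * i)
y[9] += 15.0

X = np.column_stack([np.ones_like(x1), x1, x2])
n, p = X.shape

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta                                  # ordinary residuals

Q, _ = np.linalg.qr(X)                            # reduced QR, so H = Q Q^T
h = np.sum(Q**2, axis=1)                          # hat diagonal h_ii

# Leave-one-out variance estimate S_(i)^2, computed from the full fit.
s2_del = (e @ e - e**2 / (1 - h)) / (n - p - 1)
t = e / np.sqrt(s2_del * (1 - h))                 # externally studentized residuals

flagged = np.flatnonzero(np.abs(t) > 3)
print("flagged indices:", flagged)
print("their studentized residuals:", np.round(t[flagged], 3))
```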

2.3 Cook’s Distance

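Cook’s distance Di measures how far the vector of estimated regression coefficients moves when the i-th observation is removed from the fit.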
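In the standard textbook form, with the coefficient vector estimated from all n observations, the same vector estimated without observation i, the design matrix X, the number of parameters p and the hat diagonal h_ii, Cook’s distance can be written as below. The notation is my assumption, since the symbols are not defined elsewhere in this article:

```latex
D_i = \frac{(\hat{\beta}_{(i)} - \hat{\beta})^{\top} X^{\top} X \,(\hat{\beta}_{(i)} - \hat{\beta})}{p \, MS_{Res}}
    = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}},
\qquad
r_i = \frac{e_i}{\sqrt{MS_{Res}\,(1 - h_{ii})}}
```

The second form shows that Di is large when an observation combines a large residual with high leverage.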

Criteria :

We usually consider points for which Di > 1 to be influential, and we can flag the i-th observation as a potential outlier.
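The Di > 1 rule can be sketched as below. The data are hypothetical: the point at 0-based index 9 is given both an unusually large predictor value and a shifted response, so that it is influential by construction. The closed form used for Di is the standard one in terms of residuals and hat values, and the names are mine.

```python
import numpy as np

# Hypothetical data (NOT the article's delivery time table). Index 9 has an
# unusually large x1 AND a shifted response, so it should be influential.
i = np.arange(20)
x1 = i + 1.0
x1[9] = 40.0
x2 = (7 * i % 20) * 25.0
y = 2.0 + 1.5 * x1 + 0.01 * x2 + 0.5 * np.sin(1.7 * i)
y[9] += 15.0

X = np.column_stack([np.ones_like(x1), x1, x2])
n, p = X.shape

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta
ms_res = e @ e / (n - p)

Q, _ = np.linalg.qr(X)                            # reduced QR, so H = Q Q^T
h = np.sum(Q**2, axis=1)                          # hat diagonal h_ii

# Cook's distance from the full fit: D_i = e_i^2 h_ii / (p MSRes (1 - h_ii)^2)
cooks_d = e**2 * h / (p * ms_res * (1 - h) ** 2)

flagged = np.flatnonzero(cooks_d > 1)
print("flagged indices:", flagged)
print("their Cook's distances:", np.round(cooks_d[flagged], 3))
```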

2.4 Leverage

Criteria :
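A commonly used rule of thumb, which I assume is the criterion intended here, is based on the hat matrix H = X (X'X)^-1 X'. Its diagonal elements h_ii are the leverages, they sum to p (the number of parameters, intercept included), and an observation is usually called a high-leverage point when h_ii > 2p/n. A minimal sketch on hypothetical predictor values (not the delivery time data) follows:

```python
import numpy as np

# Hypothetical predictors (NOT the article's delivery time table); the point
# at 0-based index 9 has an unusually large x1 value.
i = np.arange(20)
x1 = i + 1.0
x1[9] = 40.0
x2 = (7 * i % 20) * 25.0

X = np.column_stack([np.ones_like(x1), x1, x2])   # design matrix with intercept
n, p = X.shape

Q, _ = np.linalg.qr(X)                            # reduced QR, so H = Q Q^T
h = np.sum(Q**2, axis=1)                          # leverages h_ii

cutoff = 2 * p / n                                # rule-of-thumb threshold
flagged = np.flatnonzero(h > cutoff)
print(f"cutoff = {cutoff:.2f}")
print("high-leverage indices:", flagged)
print("their leverages:", np.round(h[flagged], 3))
```

Leverage depends only on the predictors, so the response y does not appear anywhere in this calculation.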

2.5 DFBETAS
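In its standard definition, DFBETAS measures how much the j-th regression coefficient changes, in units of its standard error, when the i-th observation is deleted: DFBETAS(j,i) = (b_j - b_j(i)) / sqrt(S(i)^2 C_jj), where C_jj is the j-th diagonal element of (X'X)^-1 and S(i)^2 is the residual variance estimated without observation i. A commonly suggested cutoff is |DFBETAS(j,i)| > 2/sqrt(n). The sketch below assumes that definition and cutoff, and runs on hypothetical data with my own variable names:

```python
import numpy as np

# Hypothetical data (NOT the article's delivery time table). Index 9 has an
# unusually large x1 and a shifted response.
i = np.arange(20)
x1 = i + 1.0
x1[9] = 40.0
x2 = (7 * i % 20) * 25.0
y = 2.0 + 1.5 * x1 + 0.01 * x2 + 0.5 * np.sin(1.7 * i)
y[9] += 15.0

X = np.column_stack([np.ones_like(x1), x1, x2])
n, p = X.shape

C = np.linalg.inv(X.T @ X)                        # (X'X)^-1
R = C @ X.T                                       # p x n matrix (X'X)^-1 X'
beta = R @ y
e = y - X @ beta
h = np.sum(X * R.T, axis=1)                       # hat diagonal x_i' C x_i

s2_del = (e @ e - e**2 / (1 - h)) / (n - p - 1)   # leave-one-out variance S_(i)^2
delta_beta = R * (e / (1 - h))                    # column i holds beta - beta_(i)
dfbetas = (delta_beta / np.sqrt(np.outer(np.diag(C), s2_del))).T   # shape (n, p)

cutoff = 2 / np.sqrt(n)
flagged = np.flatnonzero((np.abs(dfbetas) > cutoff).any(axis=1))
print(f"cutoff = {cutoff:.3f}")
print("indices with any |DFBETAS| above the cutoff:", flagged)
```

Each row of dfbetas belongs to one observation and each column to one coefficient (intercept, x1, x2).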

2.6 DFFITS
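In its standard definition, DFFITS measures how much the fitted value for observation i changes, in units of its standard error, when that observation is deleted. It can be computed from the externally studentized residual t_i and the leverage h_ii as DFFITS_i = t_i * sqrt(h_ii / (1 - h_ii)), and a commonly suggested cutoff is |DFFITS_i| > 2 * sqrt(p/n). The sketch below assumes that definition and cutoff, on hypothetical data:

```python
import numpy as np

# Hypothetical data (NOT the article's delivery time table). Index 9 has an
# unusually large x1 and a shifted response.
i = np.arange(20)
x1 = i + 1.0
x1[9] = 40.0
x2 = (7 * i % 20) * 25.0
y = 2.0 + 1.5 * x1 + 0.01 * x2 + 0.5 * np.sin(1.7 * i)
y[9] += 15.0

X = np.column_stack([np.ones_like(x1), x1, x2])
n, p = X.shape

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta

Q, _ = np.linalg.qr(X)                            # reduced QR, so H = Q Q^T
h = np.sum(Q**2, axis=1)                          # hat diagonal h_ii

s2_del = (e @ e - e**2 / (1 - h)) / (n - p - 1)   # leave-one-out variance S_(i)^2
t = e / np.sqrt(s2_del * (1 - h))                 # externally studentized residuals
dffits = t * np.sqrt(h / (1 - h))

cutoff = 2 * np.sqrt(p / n)
flagged = np.flatnonzero(np.abs(dffits) > cutoff)
print(f"cutoff = {cutoff:.3f}")
print("flagged indices:", flagged)
```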

Delivery Time Data Description:
A soft drink bottler is analyzing the vending machine service routes in his distribution system. He is interested in predicting the amount of time required by the route driver to service the vending machines in an outlet. This service activity includes stocking the machine with beverage products and minor maintenance or housekeeping. The industrial engineer responsible for the study has suggested that the two most important variables affecting the delivery time (y) are the number of cases of product stocked (x1) and the distance walked by the route driver (x2). The engineer has collected 25 observations on delivery time.

TABLE: Delivery Time Data

Output (observation number and its studentized residual):
9    4.31078

Fig: a plot of fitted values against standardized residuals

Conclusion:

Here, using studentized residuals, we observe that the 9th observation is an outlier.

Fig: plot of DFBETAS
Fig: plot of influential obs by Cook’s distance
Fig: plot of Leverage

Overall Conclusions for delivery time data:
We calculated the scaled residuals for the delivery time data. Table 2.1 contains the studentized residuals and standardized residuals. Both scaled residuals for point 9 are moderately large, suggesting that this observation may have a moderate influence on the model. To illustrate the effect of observations 9 and 22 on the model, three additional analyses were performed: one deleting observation 9, a second deleting observation 22, and a third deleting both 9 and 22. The results of these additional runs are shown in the following:
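As a minimal sketch of how deletion runs like these might be set up, the snippet below refits a model with selected observations removed and prints the coefficients and MSRes for each run. It uses hypothetical data, not the delivery time table, so the numbers it prints are not the article's results. The 0-based indices 9 and 0 are stand-ins I picked for the two points of interest, and fit_without is a hypothetical helper name.

```python
import numpy as np

# Hypothetical data (NOT the article's delivery time table). Index 9 has an
# unusually large x1 and a shifted response.
i = np.arange(20)
x1 = i + 1.0
x1[9] = 40.0
x2 = (7 * i % 20) * 25.0
y = 2.0 + 1.5 * x1 + 0.01 * x2 + 0.5 * np.sin(1.7 * i)
y[9] += 15.0
X = np.column_stack([np.ones_like(x1), x1, x2])


def fit_without(drop):
    """Return (coefficients, MSRes) for a fit with the rows in `drop` removed."""
    keep = [k for k in range(len(y)) if k not in drop]
    beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    resid = y[keep] - X[keep] @ beta
    return beta, resid @ resid / (len(keep) - X.shape[1])


runs = {"all points": [], "drop 9": [9], "drop 0": [0], "drop 9 and 0": [9, 0]}
results = {name: fit_without(drop) for name, drop in runs.items()}

for name, (beta, ms_res) in results.items():
    print(f"{name:>13}: beta = {np.round(beta, 3)}, MSRes = {ms_res:.3f}")
```

Comparing the rows shows how much each coefficient and the residual mean square move when a given point is left out, which is the comparison the deletion runs are meant to support.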

The table contains the values of Cook’s distance measure for the soft drink delivery time data. The largest value of the statistic is D9 = 3.41835, which exceeds the cutoff of 1 and indicates that observation 9 is influential.
