Statistical Methods for Identifying Outliers (Regression Analysis Approach) (Part II)
In my previous article, we looked at the univariate approach to outlier detection. Here we go further and look at techniques based on regression analysis.
Statistical Techniques and tools
2.1 Standardized Residuals
2.2 Studentized Residuals
2.3 Cook’s Distance
2.4 Leverage
2.5 DFBETAS
2.6 DFFITS
2.1 Standardized Residuals
Since the approximate average variance of a residual is estimated by the residual mean square MS_Res, a logical scaling is to divide each residual by the square root of MS_Res. The result is the standardized residual. The standardized residuals have mean zero and approximately unit variance.
Criteria:
A large standardized residual (|d_i| > 3) potentially indicates an outlier.
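Written out, the scaling is d_i = e_i / sqrt(MS_Res), where e_i is the i-th ordinary residual and MS_Res = SS_Res / (n - p), with p the number of fitted coefficients including the intercept. The following is a minimal NumPy sketch of that calculation. The function name `standardized_residuals` and the synthetic data are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

def standardized_residuals(X, y):
    """Return d_i = e_i / sqrt(MS_Res) for an OLS fit with an intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])      # add the intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares coefficients
    e = y - A @ beta                               # ordinary residuals
    n, p = A.shape
    ms_res = e @ e / (n - p)                       # residual mean square
    return e / np.sqrt(ms_res)

# Synthetic illustration (made-up data): a straight line with one shifted point.
x = np.arange(20.0)
y = 2.0 + 3.0 * x + np.sin(x)   # deterministic stand-in for noise
y[7] += 30.0                    # plant an unusual response at index 7
d = standardized_residuals(x, y)
flagged = np.flatnonzero(np.abs(d) > 3)   # 0-based indices with |d_i| > 3
```

Note that a gross outlier inflates MS_Res and therefore shrinks its own standardized residual, which is one reason the studentized version in the next section is usually preferred.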
2.2 Studentized Residuals
A studentized residual (sometimes referred to as an “externally studentized
residual” or a “deleted t residual”) is:
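In the usual regression notation (an assumption here, since the article does not define its symbols), with e_i the i-th ordinary residual, h_ii the i-th diagonal element of the hat matrix, n the number of observations and p the number of fitted coefficients, the externally studentized residual is commonly written as:

$$
t_i = \frac{e_i}{\sqrt{S_{(i)}^2\,(1 - h_{ii})}}, \qquad
S_{(i)}^2 = \frac{(n - p)\,MS_{Res} - e_i^2/(1 - h_{ii})}{n - p - 1}
$$

Here S_(i)^2 is the estimate of the residual variance obtained with observation i removed from the fit.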
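Criteria:
Studentized residuals are more effective than standardized residuals for detecting outlying observations, because the variance estimate used to scale the i-th residual is computed without the i-th observation, so a large residual cannot inflate its own scale. If an observation has a studentized residual larger than 3 in absolute value, we can call it an outlier.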
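As a rough sketch of how this criterion could be applied, the snippet below computes externally studentized residuals from the hat matrix without refitting the model n times. The function name `r_student` and the synthetic data are assumptions made for illustration. The delete-one variance uses the standard identity SS_Res(i) = SS_Res - e_i^2 / (1 - h_ii).

```python
import numpy as np

def r_student(X, y):
    """Externally studentized residuals t_i for an OLS fit with an intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])   # design matrix with intercept
    n, p = A.shape
    H = A @ np.linalg.pinv(A)                   # hat matrix A (A'A)^-1 A'
    h = np.diag(H)                              # leverages h_ii
    e = y - H @ y                               # ordinary residuals
    ss_res = e @ e
    # Residual variance with observation i deleted, via the delete-one identity.
    s2_del = (ss_res - e**2 / (1.0 - h)) / (n - p - 1)
    return e / np.sqrt(s2_del * (1.0 - h))

# Synthetic illustration (made-up data): a straight line with one shifted point.
x = np.arange(20.0)
y = 2.0 + 3.0 * x + np.sin(x)
y[7] += 30.0
t = r_student(x, y)
worst = int(np.argmax(np.abs(t)))   # 0-based index of the largest |t_i|
print(worst + 1, t[worst])          # 1-based observation number and its t_i
```

The final line prints the observation number and its studentized residual, the same two quantities reported in the output for the delivery time data later in the article.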
2.3 Cook’s Distance
Its formula is given as,
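In the standard notation (assumed here, as the symbols are not defined in the article), Cook’s distance measures how far the vector of coefficient estimates moves when observation i is deleted:

$$
D_i = \frac{(\hat{\beta}_{(i)} - \hat{\beta})'\, X'X\, (\hat{\beta}_{(i)} - \hat{\beta})}{p\, MS_{Res}}
    = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}},
\qquad r_i = \frac{e_i}{\sqrt{MS_{Res}\,(1 - h_{ii})}}
$$

The second form shows that D_i is large when the observation has a large (internally) studentized residual r_i, high leverage h_ii, or both.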
Criteria:
We usually consider points for which D_i > 1 to be influential. Such an observation should be examined as a possible outlier.
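A minimal sketch of this check, assuming the usual closed form of Cook’s distance in terms of the residuals and leverages, might look as follows. The function name `cooks_distance` and the data are hypothetical and chosen only to show the cutoff in action.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance D_i for an OLS fit with an intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])   # design matrix with intercept
    n, p = A.shape
    H = A @ np.linalg.pinv(A)                   # hat matrix
    h = np.diag(H)                              # leverages h_ii
    e = y - H @ y                               # ordinary residuals
    ms_res = e @ e / (n - p)                    # residual mean square
    r2 = e**2 / (ms_res * (1.0 - h))            # squared internally studentized residuals
    return r2 / p * h / (1.0 - h)

# Synthetic illustration (made-up data): the last point is remote in x
# and also falls well off the line followed by the other points.
x = np.append(np.arange(19.0), 60.0)
y = 2.0 + 3.0 * x + np.sin(x)
y[-1] -= 40.0
D = cooks_distance(x, y)
influential = np.flatnonzero(D > 1)   # 0-based indices with D_i > 1
```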
2.4 Leverage
Criteria :
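As a sketch under the usual conventions (assumed here, not taken from the article), leverage is read off the diagonal of the hat matrix, where x_i' denotes the i-th row of the design matrix X:

$$
H = X (X'X)^{-1} X', \qquad h_{ii} = x_i' (X'X)^{-1} x_i, \qquad \sum_{i=1}^{n} h_{ii} = p
$$

Because the h_ii sum to p, their average is p/n, and a common rule of thumb treats an observation with h_ii > 2p/n as a leverage point. For a model with an intercept and two regressors fitted to 25 observations, as in the delivery time example below, that cutoff is 2(3)/25 = 0.24.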
2.5 DFBETAS
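In the standard formulation (an assumption here, since no definition is given in the article), DFBETAS measures how much the j-th regression coefficient changes, in standard error units, when observation i is deleted:

$$
DFBETAS_{j,i} = \frac{\hat{\beta}_j - \hat{\beta}_{j(i)}}{\sqrt{S_{(i)}^2\, C_{jj}}}, \qquad C = (X'X)^{-1}
$$

Here C_jj is the j-th diagonal element of C and S_(i)^2 is the residual variance estimated without observation i. A commonly suggested cutoff is |DFBETAS_{j,i}| > 2/sqrt(n), which is 0.4 for n = 25.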
2.6 DFFITS
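DFFITS is usually defined as the change in the fitted value for observation i, in standard error units, when that observation is deleted. It can be computed from quantities already introduced as DFFITS_i = sqrt(h_ii / (1 - h_ii)) * t_i, where t_i is the externally studentized residual, and a commonly suggested cutoff is |DFFITS_i| > 2 * sqrt(p / n). The sketch below follows that definition. The function name `dffits`, the returned cutoff and the synthetic data are illustrative assumptions.

```python
import numpy as np

def dffits(X, y):
    """Return (DFFITS values, rule-of-thumb cutoff) for an OLS fit with an intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])   # design matrix with intercept
    n, p = A.shape
    H = A @ np.linalg.pinv(A)                   # hat matrix
    h = np.diag(H)                              # leverages h_ii
    e = y - H @ y                               # ordinary residuals
    s2_del = (e @ e - e**2 / (1.0 - h)) / (n - p - 1)   # delete-one variance
    t = e / np.sqrt(s2_del * (1.0 - h))         # externally studentized residuals
    values = np.sqrt(h / (1.0 - h)) * t
    cutoff = 2.0 * np.sqrt(p / n)
    return values, cutoff

# Synthetic illustration (made-up data): the last point is remote in x
# and also falls well off the line followed by the other points.
x = np.append(np.arange(19.0), 60.0)
y = 2.0 + 3.0 * x + np.sin(x)
y[-1] -= 40.0
values, cutoff = dffits(x, y)
flagged = np.flatnonzero(np.abs(values) > cutoff)   # 0-based indices over the cutoff
```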
Delivery Time Data Description:
A soft drink bottler is analyzing the vending machine service routes in his distribution system. He is interested in predicting the amount of time required by the route driver to service the vending machines in an outlet. This service activity includes stocking the machine with beverage products and minor maintenance or housekeeping. The industrial engineer responsible for the study has suggested that the two most important variables affecting the delivery time (y) are the number of cases of product stocked (x1) and the distance walked by the route driver (x2). The engineer has collected 25 observations on delivery time.
Output:
9
4.31078
Conclusion:
The studentized residual for observation 9 is 4.31078, which is larger than 3 in absolute value, so we identify observation 9 as an outlier.
Overall Conclusions for delivery time data:
We calculated the scaled residuals for the delivery time data. Table 2.1 contains the studentized residuals and standardized residuals. Both scaled residuals for observation 9 are moderately large, suggesting that this observation may have a moderate influence on the model. To illustrate the effect of observations 9 and 22 on the model, three additional analyses were performed: one deleting observation 9, a second deleting observation 22, and a third deleting both 9 and 22. The results of these additional runs are shown in the following:
The table contains the values of Cook’s distance measure for the soft drink delivery time data. The largest value of the statistic is D_9 = 3.41835, which is greater than 1 and indicates that observation 9 is influential.
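The deletion runs described above (dropping observation 9, observation 22, and both) can be sketched as follows. The helper `fit_without` is a hypothetical name, and the arrays `X` and `y` are random placeholders standing in for the 25 delivery time observations, which are not reproduced in this article, so the printed coefficients are not the results of the actual study.

```python
import numpy as np

def fit_without(X, y, drop=()):
    """OLS coefficients (intercept first) after dropping 1-based observation numbers."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = np.ones(len(y), dtype=bool)
    keep[np.asarray(drop, dtype=int) - 1] = False   # 1-based labels to 0-based rows
    A = np.column_stack([np.ones(keep.sum()), X[keep]])
    beta, *_ = np.linalg.lstsq(A, y[keep], rcond=None)
    return beta

# Placeholder data only (NOT the real delivery time data): 25 rows, two regressors.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(25, 2))
y = 1.0 + 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=25)

runs = {"all": (), "drop 9": (9,), "drop 22": (22,), "drop 9 and 22": (9, 22)}
coefs = {name: fit_without(X, y, drop) for name, drop in runs.items()}
for name, beta in coefs.items():
    print(name, np.round(beta, 4))   # compare coefficients across the four runs
```

Comparing the rows of this output shows how much each coefficient moves when a suspect observation is removed, which is the same question Cook’s distance summarizes in a single number.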