It's all about Outliers
An outlier is a data point in a data set that is distant from all other observations. A data point that lies outside the overall distribution of the dataset. Or in a layman term, we can say, an outlier is something that behaves differently from the combination/collection of the data.
Outliers can be very informative about the subject-area and data collection process. It’s essential to understand how outliers occur and whether they might happen again as a normal part of the process or study area. To understand outliers, we need to go through these points:
- what causes the outliers?
- Impact of the outlier
- Methods to Identify outliers
What causes the outliers?
Before dealing with the outliers, one should know what causes them. There are three causes for outliers — data entry/An experiment measurement errors, sampling problems, and natural variation.
- Data entry /An experimental measurement error
An error can occur while experimenting/entering data. During data entry, a typo can type the wrong value by mistake. Let us consider a dataset of age, where we found a person age is 356, which is impossible. So this is a Data entry error.