Outliers and Anomalies Explained!
We’ve all received datasets full of numbers that we crunch for metrics. We make sense of these datasets using statistical measures, and we assume that the central tendencies, skewness, kurtosis, and so on will tell us enough about them. It is true that these metrics summarize the Data and communicate information that raw numbers alone cannot. But have we ever wondered about the credibility of the data?
Have we ever wondered whether the information we receive in the form of data is legitimate? Is it precise and accurate? Does it include falsified entries? Nine times out of ten, there is reason to doubt the credibility of a dataset. There is a possibility of “Contamination”. What does contamination in a dataset really mean? It indicates the presence of ‘Outliers’ and/or ‘Anomalies’ in the data.
In this article, we’ll go over what outliers and anomalies are, how they can affect your dataset, and how to handle or rectify them.
What are Outliers?
There are times when we have datasets whose values are expected to be close together, distributed within a standard deviation or so of the Mean/Median. At other times the values are not centrally distributed but are spread evenly or unevenly across a range. But what if there are some ‘extremes’ present in your data? Extremes are essentially very large or very small values that don’t fit your expectation of the entire dataset. For example:
Consider the following observations: 1, 2, 3, 4, 5, 6, 7, 8, 9, 100.
At first look, they seem like evenly spaced numbers starting at 1 with an increment of +1. But then there is the last value of 100, which doesn’t fit the rest of the data and is an order of magnitude larger than every other entry. This is an Outlier.
What are Anomalies?
An anomaly is a term often used in place of an outlier, and at times the two are interchangeable. An anomaly is simply an observation that does not follow the pattern of its neighboring data entries.
For a simple example, consider a Movie that has received critical acclaim and is rated a very good watch by almost all of its viewers and critics. It is assumed to be a good movie that will appeal to everyone. If someone dislikes the movie and rates it among their worst watches, that review is an anomaly relative to everyone else’s.
In this article, we’ll use the terms interchangeably, as their impacts on a Dataset are not very different.
Coming back to how Outliers and Anomalies impact your analysis, let’s answer the following questions.
- Why do I have Outliers in my Data?
- How does it impact my Data?
- How does one identify Outliers?
- What does the presence of Outliers indicate?
- How do I eliminate them?
Why do I have Outliers in my data?
There are multiple possible reasons for the presence of Outliers in a given dataset. The most common is miscommunication or a mistake in the entries; for example, a value of ‘10’ entered as ‘100’. Such errors happen and mostly cannot be avoided. These entries, though mistakes, will still impact your dataset negatively.
Another possible reason is misunderstanding; for example, a person giving movie reviews might not know whether, on a rating scale of 1 to 5, 1 stands for the highest or the lowest rank. A third possibility is that the value is a true entry and not a mistake at all, in which case it is imperative to understand why the Outliers occur and to work on handling them.
How does it impact my Data?
There is no single measure of how much an outlier will impact your data, but sometimes one can change the entire outlook of your dataset. For example, take the earlier example with 10 observations:
1, 2, 3, 4, 5, 6, 7, 8, 9, 100
Here N = 10, Sum of the numbers = 145
The Mean, therefore, is 14.5, which is not the true mean of the dataset, considering that 100 is an outlier. If the entry followed the pattern and incremented by one after 9, the last number would be 10 and the Mean would be 5.5, the true mean of the data. It is possible to spot such mistakes when there are only a small number of entries.
This shows that outliers pull the Mean towards themselves, which gives us a wrong impression of the data.
The Median, on the other hand, stays largely the same irrespective of the presence of Outliers, since it depends only on the middle of the sorted data and is not pulled by extreme values. In most cases this robustness is the reason for using the Median and not the Mean.
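The figures above can be checked in a few lines of Python using the standard library’s `statistics` module:

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # the example dataset; 100 is the outlier

mean = statistics.mean(data)      # 145 / 10 = 14.5, dragged up by the outlier
median = statistics.median(data)  # middle of the sorted data: (5 + 6) / 2 = 5.5

print(mean, median)  # 14.5 5.5
```

One extreme value shifted the Mean from 5.5 to 14.5, while the Median didn’t move at all.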
How does one identify Outliers?
It is simple to identify outliers when there is a drastic difference between the Mean and the Median and the entries are few enough to spot the problem with the naked eye. On a large dataset, it is just as easy using Data Visualizations such as a Scatterplot, a Boxplot, or a Violin plot. These plots make outliers visible at a glance and can be used whenever you face credibility problems with a dataset.
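As a sketch of how a Boxplot flags outliers: it marks points that fall outside 1.5 times the interquartile range (IQR) beyond the quartiles. The same rule can be applied numerically, without drawing anything:

```python
import statistics

def iqr_outliers(data):
    """Flag points outside 1.5 * IQR of the quartiles (the boxplot whisker rule)."""
    q1, _, q3 = statistics.quantiles(data, n=4)  # first and third quartiles
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < lo or x > hi]

print(iqr_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))  # [100]
```

The 1.5 × IQR cutoff is a convention, not a law; wider or narrower fences can be appropriate depending on the domain.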
What does the presence of Outliers indicate?
Outliers indicate that the data is not clean and has problems that will eventually distort anyone’s understanding of what the data depicts. The metrics will be misleading, and acting on them as if they were true will have a negative impact. It is necessary to identify outliers and eliminate or correct them in order to get a true picture of the data and the information it communicates.
The presence of outliers may also mean you have to take samples of your data excluding the outliers in order to understand it. If they are mistakes, your sources will have to ensure the data is correct and that such mistakes are avoided going forward. If they are true outliers, the values deserve investigation, as they may indicate a real change in your data or in your understanding of it.
How do I eliminate them?
While this question asks about eliminating Outliers, the question to ask first is whether you really have to eliminate them. It may be that your data is unscaled and grows or shrinks exponentially, so that genuine values get viewed as outliers. A simple fix is to use a logarithmic scale or normalized values so that changes in the values are comparably scaled.
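As a minimal sketch of that fix, here is a hypothetical exponentially growing series (say, yearly user counts) put on a log scale, where its steady growth becomes evenly spaced instead of looking like a string of outliers:

```python
import math

# Hypothetical exponentially growing series (e.g. yearly user counts).
values = [10, 100, 1_000, 10_000, 100_000]

# On the raw scale the last value dwarfs the rest; on a log10 scale
# the same growth pattern becomes a set of evenly spaced points.
log_values = [math.log10(v) for v in values]
print(log_values)  # roughly [1.0, 2.0, 3.0, 4.0, 5.0]
```

The data and variable names here are illustrative, not from the article; the point is only that rescaling can dissolve apparent extremes.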
If your outliers are few relative to the size of the dataset, they can be eliminated. If there are quite a few of them, sampling is a better approach. Where possible, avoid the Mean and use the Median as the measure of central tendency, to nullify the impact of Outliers on your statistics.
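As one simple, hedged approach to elimination, the same 1.5 × IQR whisker rule used for detection can keep only the inliers; notice how the Mean of the cleaned data moves back towards the Median:

```python
import statistics

def drop_outliers(data):
    """Keep only points inside the 1.5 * IQR whisker range (one common rule)."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if lo <= x <= hi]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
clean = drop_outliers(data)
print(clean)                   # [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(statistics.mean(clean))  # 5 -- much closer to the median of 5.5
```

This is a sketch, not a universal recipe: as noted above, a domain expert should weigh in before true entries are dropped.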
It also comes down to the domain expertise on handling Outliers and their elimination. It is suggested that an expert is approached before taking any decision on the Outliers.
Outliers can change your entire analysis, so they need to be handled with care, guarding in particular against eliminating true entries as outliers. This will help keep your analysis clean and accurate. For any further clarifications, post them down in the comments section!
For more such articles, stay tuned with us as we chart out paths to understanding data and coding and demystify other concepts related to Data Science. Please leave a review down in the comments. It was a long article, so thank you very much for reading it all the way here! Great going!