Outliers in Machine Learning

Prerna Nichani · Published in Analytics Vidhya · 3 min read · Apr 22, 2020

Photo by Jessica Ruscello on Unsplash

What are Outliers?

Outliers are data points in a dataset that are abnormal observations among the normal ones. They can skew measurements and produce misleading accuracy scores, because the results no longer reflect the actual pattern in the data.

Formal Definition:

An outlier is an observation that lies far away from, and diverges from, the overall pattern in a sample. Outliers in the input data can skew and mislead the training process of machine learning algorithms, resulting in longer training times, less accurate models and, ultimately, poorer results.

Example: Suppose you have a sample of 1,000 people, and each of them has to choose one colour between Red and Blue.

If 999 choose Red and only one person chooses Blue, the person who chose Blue is an outlier for that sample.

Reasons for the Occurrence of Outliers:

· Data entry errors (human errors)

· Measurement errors (instrument errors)

· Experimental errors (data extraction or experiment planning/executing errors)

· Intentional (dummy outliers made to test detection methods)

· Data processing errors (data manipulation errors)

· Sampling errors (extracting or mixing data from wrong or various sources)

· Natural (not an error, novelties in data)

Detecting outliers:

Data Visualization:

Visualisation methods such as distribution curves, box plots, histograms and scatter plots can be used to detect outliers.
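The box plot mentioned above has a concrete numerical rule behind it: points falling beyond the whiskers, conventionally 1.5 × IQR (interquartile range) past the quartiles, are drawn as outliers. A minimal sketch of that rule, using NumPy (the function name and sample data are illustrative, not from the article):

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return the values beyond the box-plot whiskers (k * IQR past Q1/Q3)."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

data = [10, 12, 11, 13, 12, 11, 95]  # 95 sits far from the rest
print(iqr_outliers(data))            # [95]
```

The same fences are what a box plot drawn with Matplotlib or Seaborn would show visually.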

Z-Score or Extreme Value Analysis:

The z-score (or standard score) of an observation indicates how many standard deviations a data point lies from the sample mean, assuming a Gaussian distribution. Python libraries such as SciPy and scikit-learn can help compute it; the z-score of any data point can be calculated with the following expression: z = (x − μ) / σ, where μ is the sample mean and σ is the standard deviation.

When computing the z-score for each sample in the data set, a threshold must be specified to decide which points count as extreme.
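A short sketch of this using `scipy.stats.zscore` (the data and the threshold of 2 are illustrative choices; 3 is another common threshold):

```python
import numpy as np
from scipy import stats

data = np.array([10.0, 11.0, 12.0, 11.5, 10.5, 50.0])
z = np.abs(stats.zscore(data))   # |(x - mean) / std| for each point
threshold = 2                    # points with |z| above this are flagged
outliers = data[z > threshold]
print(outliers)                  # [50.]
```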

Clustering Methods:

Relationships between features, trends and populations in a data set can be represented graphically via clustering methods. Algorithms such as k-means and DBSCAN can be applied to detect outliers in parametric and non-parametric distributions in many dimensions.
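DBSCAN in particular lends itself to outlier detection, because it labels points that belong to no dense cluster as noise (label −1). A minimal sketch with scikit-learn (the data, `eps` and `min_samples` values here are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# One dense cluster plus a single distant point
X = np.array([[1.0], [1.2], [0.8], [1.1], [0.9], [10.0]])
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)

# DBSCAN assigns -1 to points that belong to no cluster
outliers = X[labels == -1]
print(outliers)   # [[10.]]
```

Unlike k-means, DBSCAN does not force every point into a cluster, which is why the noise label doubles as an outlier flag.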

Treating Outliers:

· Delete outlier values if they are due to a data entry error or a data processing error, or if the outlier observations are very few in number. Points lying beyond a chosen threshold, and therefore classified as outliers, can also be removed.

· If the number of outliers is small, use mean/median/random imputation to replace them.

· Use projection methods such as PCA, SOM or Sammon’s mapping to summarize your data in two dimensions.

· If there is a significant number of outliers, treat them separately in the statistical model. One approach is to split the data into two groups, build an individual model for each group, and then combine the outputs.
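The median-imputation option above can be sketched in a few lines of NumPy: flag points by z-score, then replace them with the median of the remaining inliers (the function name, data and threshold are illustrative assumptions, not from the article):

```python
import numpy as np

def impute_outliers_with_median(values, threshold=3.0):
    """Replace points more than `threshold` standard deviations from the
    mean with the median of the remaining (inlier) points."""
    values = np.asarray(values, dtype=float)
    z = np.abs((values - values.mean()) / values.std())
    inlier_median = np.median(values[z <= threshold])
    return np.where(z > threshold, inlier_median, values)

data = [10, 11, 12, 11, 10, 12, 11, 10, 12, 11, 100]
cleaned = impute_outliers_with_median(data)
print(cleaned[-1])   # 100 is replaced by the inlier median, 11.0
```

The median is preferred over the mean here because the mean itself is distorted by the very outliers being replaced.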

Why is Treating Outliers important?

Outliers are important because they affect the mean (and, to a much lesser extent, the median) of a data set, which in turn affects error measures such as the mean and absolute error. If outliers are left untreated, plotting the error can show large deviations, resulting in misleading accuracy.
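The effect on the mean versus the median is easy to see numerically; a small sketch with illustrative data:

```python
import numpy as np

clean = [10, 12, 11, 13, 12]
with_outlier = clean + [100]   # add a single extreme point

print(np.mean(clean), np.median(clean))                 # 11.6 12.0
print(np.mean(with_outlier), np.median(with_outlier))   # ~26.33 12.0
```

One extreme point more than doubles the mean, while the median barely moves, which is why mean-based error metrics are so sensitive to untreated outliers.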
