Identify Outliers With Pandas, Statsmodels, and Seaborn

The complete guide to clean data sets — Part 2

Published in The Startup · 10 min read · Jul 31, 2020

The success of a machine learning algorithm depends heavily on the quality of the data fed into the model. Real-world data is often dirty, containing outliers, missing values, wrong data types, irrelevant features, or non-standardized data. Any of these can prevent the machine learning model from learning properly. For this reason, transforming raw data into a useful format is an essential stage in the machine learning process.

Outliers are observations in the data set that exhibit some abnormality and deviate significantly from the rest of the data. In some cases, outliers provide useful information (e.g. in fraud detection). In other cases, they carry no helpful knowledge and can strongly affect the performance of the learning algorithm.

In this article, we will learn how to identify outliers in a data set using multiple techniques such as boxplots, scatterplots, and residuals.

Now, let’s get started 💚

Data Set

The data set used for this article contains the weight (kg) and height (cm) of 100 women. As the first step, we load the CSV file into a Pandas data frame using the pandas.read_csv function. Then, we visualize the…
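As a minimal sketch of the loading step, the snippet below reads CSV content into a data frame with pandas.read_csv. The article's actual file name and column names are not shown here, so the in-memory text, the `height` and `weight` column names, and the four sample rows are hypothetical placeholders, not the real data.

```python
import io

import pandas as pd

# Hypothetical stand-in for the article's CSV file; the column names and
# values below are made up for illustration only.
csv_text = """height,weight
160.0,55.2
165.5,61.0
158.2,52.4
171.3,68.9
"""

# pandas.read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)       # (rows, columns)
print(df.dtypes)      # column types as inferred by pandas
print(df.describe())  # summary statistics help spot implausible values
```

With a real file, the `io.StringIO(...)` argument would simply be replaced by the path to the CSV. Checking the shape, the inferred types, and the summary statistics right after loading is a cheap first pass before any plotting.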

Written by Amanda Iglesias Moreno