Imputing Missing Data with Simple and Advanced Techniques
A tutorial on mean, mode, time series, KNN, and MICE imputation
Missing data occurs when there is no data stored for a variable of interest in a dataset. Depending on its volume, missing data can harm the findings of any data analysis or the robustness of machine learning models.
When working with missing data in Python, the dropna() method from Pandas comes in handy. It removes rows or columns that contain null values. Its parameters include axis, which sets whether rows or columns are dropped; how, which sets whether a row or column is dropped when any or all of its values are missing; and subset, which restricts the check for missing values to a given list of labels along the other axis (for example, specific columns when dropping rows).
df.dropna(axis=0, how='any', subset=None, inplace=False)
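To make the effect of these parameters concrete, here is a minimal sketch on a toy DataFrame. The data and the column names (age, city, score) are made up for illustration and are not from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names and values are illustrative only.
df = pd.DataFrame({
    "age": [25, np.nan, 31, np.nan],
    "city": ["Oslo", "Lima", None, None],
    "score": [0.9, 0.4, 0.7, np.nan],
})

# Drop rows that contain at least one missing value.
rows_any = df.dropna(axis=0, how="any")

# Drop only rows in which every value is missing.
rows_all = df.dropna(axis=0, how="all")

# Drop columns that contain at least one missing value.
cols_any = df.dropna(axis=1, how="any")

# Look only at the "score" column when deciding which rows to drop.
rows_subset = df.dropna(subset=["score"])

print(rows_any.shape, rows_all.shape, cols_any.shape, rows_subset.shape)
```

In this toy example, how='any' keeps only the single fully observed row, and dropping columns with how='any' removes every column, since each one has at least one gap. That loss of data is the main cost of dropping.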
However, dropping data is not the only option, and there are probably better ways of dealing with missing values. In this article, we will see how to impute (replace) missing data with simple and advanced techniques. We will first cover simple univariate techniques such as mean and mode imputation. Then, we will look at forward and backward filling for time series data, and explore interpolation methods such as linear, quadratic, or higher-order polynomial for filling missing values. Later, we will explore advanced multivariate techniques and learn how to…