Data Preprocessing: Part III — Dealing with missing values

“Data science is a process of discovery, a journey of new perspective.”

Analogous to the working of a vacuum cleaner, data preprocessing filters the dirt out of real-world datasets and transforms them into a much cleaner version.

Missing value representations in real-world datasets

Because data is generated from multiple heterogeneous sources, each following its own standards, its representation is often inconsistent. For example, some sources may record a person’s height in centimetres and others in feet and inches. Some sources may also omit values in certain columns for some samples. Missing values can be quite painful, especially if the affected column/feature is important to the model. Given the importance of such features, the next set of articles will provide insights into various ML algorithms that can be employed to handle missing values.

This section provides an overview on various ways to deal with missing values.

1. Deletion

If a feature/column contains more than 60% missing values, it is usually better to discard it on the assumption that the feature is insignificant. Before deleting, it is important to ensure that the variable plays no role in the decision-making process. Because of this risk, imputation is generally preferred over dropping variables.
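The 60% rule above can be sketched with pandas: compute the fraction of missing entries per column and keep only the columns below the threshold. The toy dataset and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy dataset: "sensor" is 80% missing (hypothetical columns).
df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 29],
    "sensor": [np.nan, np.nan, 0.7, np.nan, np.nan],
})

# Drop any column whose fraction of missing values exceeds 60%.
threshold = 0.6
keep = df.columns[df.isna().mean() <= threshold]
df_clean = df[keep]
print(list(df_clean.columns))  # ['age']
```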

2. Imputation

Imputation fills the missing values using the values that are available for the feature. Techniques range from simple statistical methods to prediction algorithms and more general ML algorithms.

  1. Filling missing values with the overall mean, median or mode of the feature is the most common approach, especially when the variable is uncorrelated with other variables. The mean works well when the data is roughly symmetric, the median is more robust to outliers, and the mode is suited to categorical features.
  2. A more powerful way to handle missing values is to predict them with ML algorithms. Algorithms such as linear regression are widely used to predict commonly occurring features like age, weight and height from the other columns.
  3. Other popular ML techniques such as K-nearest neighbours (KNN), random forests and XGBoost can also predict or fill in the missing values for numerical as well as categorical variables. The simplest and most popular is KNN, where a missing value is filled in from the values of its nearest neighbours; the distance between samples is computed with an appropriate distance measure.
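Statistical imputation from point 1 can be sketched with scikit-learn's `SimpleImputer`. The arrays here are made-up examples; the mean of 20, 30 and 50 is 100/3 ≈ 33.33.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One numerical feature with a gap.
X = np.array([[20.0], [30.0], [np.nan], [50.0]])

# Mean imputation: the gap becomes the column mean (~33.33).
mean_imp = SimpleImputer(strategy="mean")
X_mean = mean_imp.fit_transform(X)

# Mode ("most_frequent") also works for categorical features.
cat = np.array([["red"], ["blue"], [np.nan], ["red"]], dtype=object)
mode_imp = SimpleImputer(strategy="most_frequent")
cat_mode = mode_imp.fit_transform(cat)
print(X_mean.ravel(), cat_mode.ravel())
```

Swapping `strategy` to `"median"` gives the outlier-robust variant with no other changes.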

A commonly used distance metric for categorical values is Hamming distance, while for numeric values the metrics used are Euclidean, Cosine and Manhattan distance. The entire code for handling missing values can be found here.
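KNN imputation as described above can be sketched with scikit-learn's `KNNImputer`, which measures (NaN-aware) Euclidean distance over the observed features. The data is a made-up example: the two rows closest to the incomplete row hold 2.0 and 2.1 in the gap's column, so the gap is filled with their mean.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Two features; the last row is missing its second value.
X = np.array([
    [1.0,  2.0],
    [1.1,  2.1],
    [9.0,  8.0],
    [1.05, np.nan],
])

# Fill the gap from the 2 nearest rows by Euclidean distance
# on the observed feature; rows 0 and 1 are the neighbours here.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled[3, 1])  # mean of 2.0 and 2.1
```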

Machine learning models can quickly adapt to filling in numerical missing values, but they face some difficulties when dealing with missing values for categorical features.

The next article in the series focuses on dealing with missing values for categorical data.

Get in touch!

Reach out to us at perspectivesondatascience@gmail.com with any questions and we will be happy to answer!


Insights on Modern Computation
Perspectives on data science

A communal initiative by Meghana Kshirsagar (BDS | Lero | UL, Ireland) and Gauri Vaidya (Intern | BDS). Each concept is accompanied by sample datasets and Python code.