Lesson 3 — Machine Learning: Handling Missing Data

2 min read · Mar 30, 2023

Missing data occurs when some values are not available or not recorded in your dataset. This can happen for various reasons, such as data entry errors, faulty sensors, or participants not providing information. Having missing data can lead to biased or incorrect results when training a machine learning model.

Here are some common techniques for handling missing data:

1. Remove Rows with Missing Data:

One simple approach is to remove any rows in your dataset that contain missing values. This can be an effective strategy if the number of missing values is small, and their removal doesn’t significantly impact the overall dataset. However, if a large portion of your data contains missing values, this method may result in losing too much valuable information.
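As a minimal sketch of row removal, assuming the data lives in a pandas DataFrame (the toy dataset and its column names are made up for illustration), you can count the missing values first and then check how much data dropping would cost:

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25, np.nan, 47, 31],
    "income": [50000, 62000, np.nan, 58000],
    "city": ["Paris", "Lima", "Oslo", None],
})

# Count missing values per column before deciding how to handle them.
missing_per_column = df.isna().sum()

# Drop every row that contains at least one missing value.
df_complete = df.dropna()

# Fraction of rows lost, to judge whether dropping is acceptable.
fraction_dropped = 1 - len(df_complete) / len(df)

print(missing_per_column)
print(f"Rows kept: {len(df_complete)}, fraction dropped: {fraction_dropped:.2f}")
```

In this toy example each column is missing only one value, yet three of the four rows are dropped, which is the information loss described above.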

2. Fill in Missing Values with a Constant or Mean/Median/Mode:

Another approach is to fill in missing values with a constant value or the mean, median, or mode of the available data for that feature. This method is easy to implement, but filling many entries with a single value shrinks the feature’s variance, and it can bias results, especially if the data is not missing at random.
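A short sketch of this idea, again assuming a pandas DataFrame with made-up columns: the numeric feature is filled with its median and the categorical feature with its mode.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; the column names are illustrative only.
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 30.0],
    "city": ["Paris", "Lima", None, "Paris"],
})

filled = df.copy()

# Numeric feature: replace missing values with the median of the observed values.
filled["age"] = df["age"].fillna(df["age"].median())

# Categorical feature: replace missing values with the most frequent category.
filled["city"] = df["city"].fillna(df["city"].mode()[0])

print(filled)
```

The median is often preferred over the mean when a feature has outliers, since a few extreme values do not move it much.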

3. Use an Imputation Method:

Imputation methods are more advanced techniques that estimate missing values based on the relationships between features in your dataset. One popular imputation method is k-Nearest Neighbors (KNN), which fills in a missing value with the average of that feature over the k most similar data points. Another option is regression imputation, which fits a regression model on the rows where a feature is observed and uses the other features to predict it where it is missing.
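Here is a minimal sketch of KNN imputation, assuming scikit-learn’s KNNImputer is an acceptable stand-in for the method described; the tiny array and the choice of k = 2 are only for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data: the second feature is roughly twice the first; one value is missing.
X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [8.0, 16.0],
])

# Each missing entry is replaced by the mean of that feature over the
# 2 nearest rows, where distance is computed on the features both rows have.
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)

print(X_imputed)
```

The row with the missing value is closest to the rows starting with 1.0 and 3.0, so the gap is filled with the average of their second features rather than with the column-wide mean.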

4. Use Machine Learning Models that Can Handle Missing Data:

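Some tree-based implementations, such as XGBoost and scikit-learn’s histogram-based gradient boosting, can handle missing data without requiring any preprocessing: during training they learn, at each split, which branch samples with a missing value should follow. Support depends on the library and its version rather than on the algorithm family, so check the documentation before passing missing values to a decision tree or Random Forest implementation.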

Now, let’s see how to apply these techniques in a Machine Learning workflow:

1. Identify the features with missing data in your dataset.
2. Decide on a strategy to handle missing data (e.g., removal, filling, or imputation).
3. Apply the chosen strategy to the training set. If using imputation, fit the imputer on the training set and save the imputation parameters.
4. Train your machine learning model using the processed training set.
5. Before making predictions with the testing set, apply the same strategy to handle missing data using the parameters from the training set.
6. Evaluate the model’s performance on the processed testing set.
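The steps above can be sketched end to end with scikit-learn. This is one possible arrangement, not the only one: the synthetic data, the median strategy, and the logistic regression model are all assumptions made for the example. Wrapping the imputer and the model in a pipeline ensures the medians are learned from the training set only and then reused on the testing set.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic data with about 10% of the entries missing (illustrative only).
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan

# Identify the features with missing data.
missing_counts = np.isnan(X).sum(axis=0)
print("Missing values per feature:", missing_counts)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Chosen strategy: median imputation. Fitting the pipeline fits the imputer
# on the training set only, then trains the model on the imputed data.
pipeline = make_pipeline(SimpleImputer(strategy="median"), LogisticRegression())
pipeline.fit(X_train, y_train)

# The imputation parameters saved from the training set.
learned_medians = pipeline.named_steps["simpleimputer"].statistics_

# Scoring applies the same training-set medians to the testing set
# before predicting, then evaluates the predictions.
test_accuracy = pipeline.score(X_test, y_test)
print("Training-set medians:", learned_medians)
print(f"Test accuracy: {test_accuracy:.2f}")
```

Fitting the imputer on the full dataset before splitting would let information from the testing set leak into training, which is why the split comes first.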

Handling missing data is an essential step in the preprocessing phase of a machine learning project. In the next lesson, we will explore how to work with categorical features.


Written by Machine Learning in Plain English