ML Series: Day 7 — Preprocessing (Part 2) — Addressing Missing Values, Outliers, and Errors for Robust Data Insights

Navigating Level 3 Data Cleaning

Ebrahim Mousavi
10 min read · Jan 22, 2024
Figure 1. Missing values and outliers

🔙 Previous: Preprocessing (Part 1)

🔜 Next: Preprocessing (Part 3)

Level 3 cleanup (missing values, outliers, and errors)

This is the most difficult level of data cleaning because it requires a deeper understanding of the analytical goals of data preprocessing. At level one, we cleaned the table regardless of the data structure or the recorded values. At level two, our attention was on shaping a data structure that supports our analytical goal, but we still paid little attention to the correctness or appropriateness of the recorded values. At the third level, we focus on the recorded values themselves and check for three issues: missing values, outliers, and errors. First, we make sure that the missing values in the data are identified, investigate why they occurred, and take appropriate actions to address them so that the recorded values are correct and error-free. Then we make sure that outliers are identified and that appropriate measures are taken to handle them.

The third level of data cleaning is similar to the second in that it depends on the goals and tools of data analysis. Only level one can be done separately, regardless of those goals and tools.

Missing values

In data analysis, missing values refer to the absence of a value in a particular observation or variable. Missing values may occur for various reasons, such as data entry errors, equipment failure, or even intentional deletion. They may be represented as empty cells or by a special marker such as “N/A”, “-9999”, “NaN”, or a null value.

Managing missing values is an important aspect of data cleaning and preprocessing because it can have a significant impact on the accuracy and validity of data analysis. There are various methods to deal with missing values, including imputation, deletion, and using algorithms that can handle missing data; the choice of method depends on the specific nature of the data and the objectives. In Figure 2, we have dummy data that contains some missing values, which can be identified with the “isna()” method.

Here’s the code to create this data and write it to a CSV file:

import csv

# Toy data: the empty strings will be read back as missing values
data = [
    ['Column_A', 'Column_B', 'Column_C'],
    [10, 20, 30],
    ['', 40, ''],
    [50, '', '']
]

with open('output.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(data)

And we can read and show it with pandas:

import pandas as pd

# pandas parses the empty CSV cells as NaN
df = pd.read_csv('output.csv')
df.head()

The code below identifies these missing values and calculates their number in each column.

print('Number of missing values:')
for col in df.columns:
    n_MV = sum(df[col].isna())  # count NaN entries in this column
    print(f'{col}: {n_MV}')

Output:

Number of missing values:
Column_A: 1
Column_B: 1
Column_C: 2
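
As a side note, pandas can produce the same per-column counts without an explicit loop; this one-liner is equivalent to the loop above:

# Vectorized equivalent of the loop above
print(df.isna().sum())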

Now that we know how to detect missing values, we need to turn our attention to understanding what is causing them. In our quest to deal with missing values, we first need to know why they occur. In the following, we will discuss the situations that cause missing values.

Reasons for missing values

There can be many reasons why missing values occur. Knowing why a value is missing is the most critical information that enables us to manage missing values effectively. The following list provides the most common reasons why values may be missing:

· Human error

· Respondents may refuse to answer a survey question.

· Survey participants may not understand the question.

· The value provided was an obvious error, so it has been removed.

· Not enough time to answer the questions.

· Deliberate omission or obstruction of data collection (possibly with fraudulent intent)

· Loss of records due to ineffective database management

· Third-party manipulation or blocking of data collection

· Faulty sensors

· Programming errors

When working with data, sometimes all we have is the data itself, with no one to ask about it. In that case, the important thing is to be curious about the data and consider what reasons could be behind the missing values. However, if we do have access to someone who knows the data, the best course of action is to ask that person about the causes. Regardless of what caused the missing values, from a data analytics perspective we can categorize all missing values into three types. Understanding these types will be important in deciding how to manage them.

Types of missing values

Figure 3. Types of missing values

Several types of missing values can occur in a data set:

1. Missing Completely at Random (MCAR)

Missing data are MCAR when the missingness is not related to any other variable in the data set; in other words, the data are missing completely at random, without any systematic pattern. When a missing value is classified as MCAR, the data object with the missing value could have been any of the data objects. For example, if an air quality sensor occasionally cannot reach its server to store records because of random fluctuations in the Internet connection, the resulting missing values are MCAR: the connection problem could have happened to any of the data objects and struck a particular one purely by chance.
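
To make this concrete, here is a minimal sketch (not from the original article) that simulates MCAR missingness in hypothetical air-quality readings; the column name pm25 and the 10% loss rate are illustrative assumptions:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
readings = pd.DataFrame({'pm25': rng.normal(35, 10, size=100)})

# MCAR: every reading has the same 10% chance of being lost,
# independent of every variable in the data set
lost = rng.random(len(readings)) < 0.10
readings.loc[lost, 'pm25'] = np.nan
print(readings['pm25'].isna().sum(), 'readings lost at random')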

2. Missing at Random (MAR)

This condition arises when the missingness is related to other variables in the data set, but not to the missing values themselves. This means that the probability of missingness depends on other observed variables in the data set; in other words, the missingness is not random but can be explained by other variables. For example, suppose we have a dataset that contains information on income and level of education, but some observations lack income values. In this case, the missing data may be related to education level, but not to income itself: in particular, people with higher education levels may be less likely to enter income values. This is an example of MAR because the probability of missing data depends on an observed variable (education level), but not on the missing variable itself (income). One way to deal with MAR is to use imputation methods that estimate the missing values based on the observed data. There are many imputation methods, including mean imputation, regression imputation, and multiple imputation. The choice of imputation method depends on the specific characteristics of the dataset and the research question of interest.
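
As a hedged illustration of the imputation methods just mentioned, the sketch below uses scikit-learn's SimpleImputer for mean imputation and IterativeImputer as a regression-style imputer; the education/income numbers are made up to mirror the example above:

import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer

df = pd.DataFrame({
    'education_years': [12, 16, 18, 16, 12, 20],
    'income': [40_000, 65_000, np.nan, 70_000, 38_000, np.nan],
})

# Mean imputation: replace each NaN with the column mean
mean_filled = SimpleImputer(strategy='mean').fit_transform(df)

# Regression-style (iterative) imputation: model each incomplete
# feature as a function of the other features
reg_filled = IterativeImputer(random_state=0).fit_transform(df)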

3. Missing Not at Random (MNAR)

This type of missing data is the most difficult to handle because it occurs when the probability of missingness depends on unobserved data and may be related to factors not captured by the dataset. In other words, the missingness is not random and cannot be explained by any observed or measured variable in the data set. For example, suppose we have a dataset that contains information on income, education level, and job satisfaction, but some observations lack income values. In this case, the missing data may be related to income itself: individuals with higher incomes may not have reported their income (for example, to evade taxes). This is an example of MNAR because the probability of missingness depends on the unobserved variable (income). Dealing with MNAR is challenging because the missing data are related to unobserved factors that may be difficult or impossible to measure or control. Imputation methods may not be suitable for MNAR because they assume that missingness is related only to observed variables. Instead, the data analyst may need to use more sophisticated statistical methods, such as maximum likelihood estimation or Bayesian modeling, that can model the missing-data mechanism and estimate the parameters of interest. It is important to note that MNAR can introduce bias and affect the validity of statistical conclusions.

Understanding the type of missing values in a data set is important for choosing appropriate methods to handle the data. In general, MCAR is the easiest type to handle, followed by MAR, while MNAR is the most difficult.

Methods for dealing with missing values

There are four different approaches to dealing with missing values:

1. First approach: keep the missing values as they are.

This approach preserves the missing values and carries them into the next stage of data preprocessing. It is the best way to deal with missing values in the following two situations:

  • We use this strategy in cases where we have to share the data with others and we are not the only person who is going to use the data for analysis. In this way, we also allow others to decide how to deal with missing values based on their own analytical needs.
  • If both the data analysis goals and the tools we use can handle missing values, the optimal approach is to retain them. For instance, a K-Nearest Neighbors (KNN) based imputer can address missing values without the need for data removal; see the sketch after this list.
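
As a minimal sketch (assuming scikit-learn is available), KNNImputer fills each missing entry using the nearest neighbors found on the observed features; the tiny matrix below is made up for illustration:

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 6.0],
              [4.0, 8.0]])

# Each NaN is replaced by the mean of that feature across
# the 2 nearest neighbors (distances use the observed features)
X_filled = KNNImputer(n_neighbors=2).fit_transform(X)
print(X_filled)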

2. Second approach: remove the row or rows of data that have missing values.

This approach should be chosen very carefully because it can introduce bias into the data set and discard valuable information. For example, when the missing values in a dataset are of type MNAR or MAR, we avoid deleting rows with missing values as much as possible, because doing so systematically excludes a specific portion of the samples from the data set.

Even if the missing values are of MCAR type, we should first try other ways to deal with them before resorting to deleting rows. This method should be treated as a last resort, used only when no other option works.
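
If row deletion is nevertheless warranted, pandas makes it a one-liner; a minimal sketch using the toy frame from earlier (the subset column is illustrative):

# Drop every row that contains at least one missing value
df_complete = df.dropna()

# Or drop rows only when a specific key column is missing
df_key_complete = df.dropna(subset=['Column_A'])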

3. Third approach: removing the features (attributes or columns) containing missing values

When most of the missing values in a dataset come from one or two features, we may consider feature deletion as a way to deal with missing values. Of course, if the feature is a key feature that we can’t continue the project without, this method won’t work. However, if the features are not essential to the project, removing those with many missing values may be a good approach. When the share of missing values in a feature is large enough (roughly more than 25% of the data), estimating and imputing them becomes pointless, and dropping the feature is better than estimating the missing values; a sketch of this rule follows.
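
Here is a hedged sketch of that 25% rule in pandas, applied to the toy frame from earlier; the threshold is the article's rough guideline, not a hard rule:

import math

threshold = 0.25  # maximum tolerable share of missing values

# Fraction of missing values per column
missing_share = df.isna().mean()
df_trimmed = df.loc[:, missing_share <= threshold]

# Equivalent via dropna: keep columns with enough non-missing rows
min_non_missing = math.ceil((1 - threshold) * len(df))
df_trimmed2 = df.dropna(axis=1, thresh=min_non_missing)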

4. Fourth approach: estimating a value and replacing the missing values

In this approach, we use knowledge and analytical tools to fill in the missing values. We replace the missing value with another value knowing that this can bias the analysis. If the missing values are of the MCAR or MAR type and the method we have chosen cannot process the data set with missing values, imputing the missing values may be the best approach.

There are four general methods for estimating replacements for missing values, described below:

A. Imputation with central tendency values such as mean, median, or mode: This method is more suitable for MCAR missing values.

B. Imputing central tendency values of a more relevant group of data to missing values: This method is more suitable for MAR missing values.

C. Regression analysis: This method is often not efficient, but if we have to work with a data set that has MNAR missing values, it is the most suitable choice.

D. Interpolation: This method is more suitable when the data set is a time series and the missing values are of MCAR type. A short sketch of methods A, B, and D follows.
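
As a minimal, hedged sketch of methods A, B, and D in pandas (method C, regression, was sketched earlier with IterativeImputer); the site/reading columns are made up for illustration:

import numpy as np
import pandas as pd

# Hypothetical sensor readings from two sites
data = pd.DataFrame({
    'site':    ['A', 'A', 'A', 'B', 'B', 'B'],
    'reading': [10.0, np.nan, 14.0, 30.0, np.nan, 34.0],
})

# A. Central tendency of the whole column (MCAR)
overall = data['reading'].fillna(data['reading'].median())

# B. Central tendency of a more relevant group (MAR)
by_site = data.groupby('site')['reading'].transform(
    lambda s: s.fillna(s.mean()))

# D. Linear interpolation for time-series data (MCAR)
ts = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
ts_filled = ts.interpolate(method='linear')  # fills in 2.0 and 4.0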

Note: A common mistake in the estimation and imputation process is aiming to impute missing values with the most accurate substitutes, which is not the right idea. When imputing, our goal is not to predict the missing values as accurately as possible, but to replace them with values that introduce the least bias into the analysis. For example, for clustering analysis, if a data set has MCAR missing values, choosing the central tendency of the whole population is the best way to go, because the central tendency value acts as a neutral vote in the process of grouping data objects.

Note: In this chapter, the techniques and methodologies discussed are based on the book ‘Hands-On Data Preprocessing in Python’ by Roy Jafari, which served as a valuable reference for understanding and implementing effective data preprocessing workflows.

Conclusion:

In part 7 of the machine learning series, we delved into Level 3 cleanup, which comprises three main components: 1. missing values, 2. outliers, and 3. errors. In this post, we focused on missing values and explored various aspects related to them, including the reasons behind their occurrence, the different types of missing values, and the methods available for handling them effectively.

For the upcoming post, Machine Learning Series: Day 8 — Preprocessing (Part 3), we will shift our attention to outliers. Outliers are data points that deviate significantly from the majority of the observations in a dataset. They can arise due to various reasons, such as measurement errors, data entry mistakes, or genuine extreme values in the data. Detecting and dealing with outliers is crucial because they can have a disproportionate impact on the analysis and modeling process.

If you like the article and would like to support me make sure to:

👏 Clap for the story (as much as you liked it) and follow me 👉
📰 View more content on my Medium profile
🔔 Follow Me: LinkedIn | Medium | GitHub | Twitter

References:

  1. Roy Jafari, Hands-On Data Preprocessing in Python, Packt Publishing. https://www.amazon.com/Hands-Data-Preprocessing-Python-effectively/dp/1801072132
