Machine Learning — Data Preprocessing #5

Ufuk Çolak · Published in Nerd For Tech · May 30, 2021

Hello everyone! In my previous article, we examined the bias and variance relationship of machine learning models. In this article, we will examine data preprocessing.

Data Preprocessing

It is one of the most crucial steps in building machine learning models. Data cleaning, transformation, and modeling make up a large part of our work. Data collected from multiple sources usually exists in an unorganized form, and this affects the prediction performance of models. Therefore, raw data must be prepared before training, evaluating, and using machine learning models.

In this article, we will be reviewing the following items.

  • Importance of Data Cleaning Processes
  • Messy Datasets
  • Outlier Analysis

When it comes to which matters more, data or the model, the vast majority of us answer: data.

While algorithms are well understood operationally, most do not have satisfying theories about why they work or how to map algorithms to problems. Therefore, predictive modeling projects are experimental rather than theoretical and require systematic testing of algorithms on data.

Considering that machine learning algorithms are automated today, the only thing that changes from project to project is the data used in modeling. So our answer will be data.

Data quality is one of the most important problems in data management since dirty data often leads to inaccurate data analytics results and incorrect business decisions.

Therefore, in projects, we will spend most of our time on data. Gathering data, verifying data, cleaning data, visualizing data, transforming data, and so on.

Let’s briefly examine the details of these methods.

Data Cleaning / Cleansing

Data cleaning is a critically important step in any machine learning project. This process includes fixing systematic problems or errors in messy data: mistyped values, corrupted records, duplicated rows, and so on. For example, suppose the data includes athletes’ long-jump distances. Since no one can jump 50 meters, a record with that value must be incorrect. As another example, a male record in the dataset with a pregnancy status of “yes” indicates that the data is incorrect.

There are many statistical analysis and data visualization techniques we can use to identify data cleaning needs. These steps are so fundamental that they are often overlooked, even by the most experienced machine learning practitioners. Yet they are so critical that, if skipped, they can degrade the performance of the models.

Once messy, noisy, or erroneous observations have been identified, they should be addressed.

This may mean removing a row or column, or replacing values with new observations. Common operations include the following (a short sketch in code follows the list):

  • Using statistics to define normal data and identify outliers.
  • Identifying columns that have the same value or no variance and removing them.
  • Identifying duplicate rows of data and removing them.
  • Marking empty values as missing.
  • Imputing missing values using statistics or a learned model.
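As a minimal sketch of what these operations might look like in practice (assuming a small hypothetical pandas DataFrame; the column names and the 9-meter cut-off are illustrative, not from the original article):

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: athletes' long-jump results with a few messy records.
df = pd.DataFrame({
    "athlete_id": [1, 1, 2, 3, 4, 5],
    "long_jump_m": [7.8, 7.8, 8.1, 50.0, None, 6.9],  # 50 m is physically impossible
    "country": ["TR", "TR", "DE", "US", "US", "FR"],
    "season": [2021, 2021, 2021, 2021, 2021, 2021],   # single value -> zero variance
})

# Identify duplicate rows and remove them.
df = df.drop_duplicates()

# Identify columns that have only one value (no variance) and remove them.
single_valued = [col for col in df.columns if df[col].nunique(dropna=False) == 1]
df = df.drop(columns=single_valued)

# Mark impossible values as missing (no one jumps 50 meters).
df.loc[df["long_jump_m"] > 9.0, "long_jump_m"] = np.nan

# Impute missing values using a simple statistic, here the median.
df["long_jump_m"] = df["long_jump_m"].fillna(df["long_jump_m"].median())

print(df)
```

In practice, each of these steps should be driven by what the statistics and visualizations reveal about the specific dataset.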

Data cleaning is an operation that is typically performed first, before other data preparation operations.

Messy Datasets

A dataset can contain many types of errors. Columns that carry little information and duplicate rows are at the top of the list.

Columns with a single observation or value probably won’t be useful for modeling. The variance of these columns or predictors (a measure of how far values spread around the mean) would be zero.

When a predictor contains a single value, we call it a zero-variance predictor, because it displays no real variance. Variables or columns that have a single value should probably be removed from the dataset.

Besides columns containing a single value, some columns may have very few unique values. For example, suppose a variable contains only the values 1, 3, and 5. This can be meaningful for ordinal or categorical variables, but in this case the dataset contains only numeric variables. Such a variable will have a nonzero variance, but one very close to zero. Removing it outright is not necessarily the right move, because it may still contribute to model performance, albeit slightly. At this point, the decision about what to do can be based on its effect on the model.
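As a small sketch of how single-value and near-zero-variance columns could be detected with pandas (the 0.01 variance threshold is an arbitrary illustration, not a fixed rule):

```python
import pandas as pd

def low_information_columns(df: pd.DataFrame, variance_threshold: float = 0.01):
    """Return columns with a single value and numeric columns with near-zero variance."""
    single_valued = [col for col in df.columns if df[col].nunique(dropna=False) <= 1]
    numeric = df.select_dtypes(include="number")
    near_zero_variance = [
        col for col in numeric.columns
        if col not in single_valued and numeric[col].var() < variance_threshold
    ]
    return single_valued, near_zero_variance

# Example: 'constant' has zero variance, 'almost_constant' is very close to zero.
df = pd.DataFrame({
    "constant": [5, 5, 5, 5],
    "almost_constant": [1.00, 1.01, 1.00, 1.00],
    "informative": [3.2, 7.5, 1.1, 9.8],
})
print(low_information_columns(df))
```

Whether a near-zero-variance column should actually be dropped is then decided by checking its effect on model performance, as noted above.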

Rows with identical data may also be useless in the modeling process; this is the case where a duplicate row matches another row in every column. Removing duplicate data is a crucial step in ensuring that the data can be used correctly, since such rows can have a misleading effect on model performance.

Outlier Analysis

Observations that fall outside the general trend in the data, or that are quite different from the other observations, are called outliers. They are one of the serious problems we encounter in the data processing steps. We must take great care not to hastily remove or change values, especially when the dataset is small.

Outliers can have many causes, such as:

  • Measurement or input error
  • Data corruption
  • True outlier observation

So what do outliers cause? They mislead the rule sets or functions built with generalizability in mind. They can introduce bias, that is, errors in model predictions.

A typical chart comparing a model fitted with and without an outlier observation shows that the model curve changes visibly once the outlier is removed.
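As a quick synthetic illustration of that effect (not the article’s original chart), fitting a simple least-squares line with and without a single extreme point shows how much the slope and intercept can shift:

```python
import numpy as np

# Synthetic data that roughly follows y = 2x + 1, plus one extreme outlier.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = 2 * x + 1 + np.random.default_rng(42).normal(0, 0.5, size=x.size)
x_out = np.append(x, 11.0)
y_out = np.append(y, 90.0)   # an outlier far above the trend

slope_with, intercept_with = np.polyfit(x_out, y_out, deg=1)
slope_without, intercept_without = np.polyfit(x, y, deg=1)

print(f"With outlier:    slope={slope_with:.2f}, intercept={intercept_with:.2f}")
print(f"Without outlier: slope={slope_without:.2f}, intercept={intercept_without:.2f}")
```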

The following methods can be used to decide whether a value is an outlier or not.

Industry Knowledge

Due to the unique characteristics of each dataset, there is generally no definitive way to identify outliers. At this point, we can decide whether an observation is an outlier using industry knowledge. For example, we should not model a record with a 17-digit credit card number (card numbers are usually 16 digits). While building a model, we aim to create unbiased models that are generalizable and capable of representing the structures in the dataset. If the model is meant to generalize, structures that are very rare and do not fit the general pattern should be excluded from the modeling process.

Standard Deviation Method

It is applied to detect values that fall outside the distribution, above or below the mean. The standard deviation of a variable is calculated and added to (or subtracted from) the mean of that variable. The threshold value is created by adding 1, 2, or 3 standard deviations to the mean; values above (or below) this threshold are defined as outliers. For example, consider house prices: the mean price is calculated, and the standard deviation is added to it.

Threshold Value = Mean + 1 * Standard Deviation

Threshold Value = Mean + 2 * Standard Deviation

Threshold Value = Mean + 3 * Standard Deviation
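A brief sketch of this method on a hypothetical house-price series, here using the 2-standard-deviation threshold:

```python
import pandas as pd

prices = pd.Series([220, 250, 265, 280, 310, 330, 345, 360, 2500], name="house_price")

mean, std = prices.mean(), prices.std()
upper_threshold = mean + 2 * std
lower_threshold = mean - 2 * std

outliers = prices[(prices > upper_threshold) | (prices < lower_threshold)]
print(f"lower={lower_threshold:.1f}, upper={upper_threshold:.1f}")
print(outliers)  # the 2500 entry exceeds the upper threshold
```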

Z-Score Method

It works similarly to the standard deviation method. The variable is transformed to the standard normal distribution, that is, it is standardized. Then a threshold is set at ±2.5 on either side of the distribution; values above or below this value are marked as outliers.
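A minimal sketch of the z-score approach with the ±2.5 cut-off, on made-up measurements:

```python
import numpy as np
import pandas as pd

values = pd.Series([10, 12, 11, 13, 12, 11, 10, 13, 12, 45], name="measurement")

# Standardize: subtract the mean and divide by the standard deviation.
z_scores = (values - values.mean()) / values.std()

outliers = values[np.abs(z_scores) > 2.5]
print(outliers)  # only the extreme value 45 is flagged
```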

Interquartile Range Method

It is one of the most commonly used methods. The values of the variable are ordered from smallest to largest, and thresholds are calculated from the quartiles Q1 (25th percentile) and Q3 (75th percentile); observations are flagged as outliers according to these thresholds.

IQR = Q3 - Q1

Lower Threshold = Q1 - 1.5 * IQR

Upper Threshold = Q3 + 1.5 * IQR
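The same calculation in code, on a hypothetical numeric series:

```python
import pandas as pd

values = pd.Series([4, 5, 6, 6, 7, 7, 8, 8, 9, 30], name="value")

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

lower_threshold = q1 - 1.5 * iqr
upper_threshold = q3 + 1.5 * iqr

outliers = values[(values < lower_threshold) | (values > upper_threshold)]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}, lower={lower_threshold}, upper={upper_threshold}")
print(outliers)  # the value 30 falls above the upper threshold
```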

In summary, rather than modeling a dataset without any preprocessing, we first need to examine its variables for inconsistencies, missing observations, and similar issues. One of the most important of these is outlier observations. Why? If they are not taken into account, they can even change the direction of the fitted line in the model. In practice, we usually address them using methods 1 and 4 (industry knowledge and the interquartile range).

Outlier detection methods according to the number of variables involved:

If we are working with a single variable, a boxplot is an appropriate way to identify outliers. For the outliers found, we can use one of the following three methods (a short sketch follows the list):

  • Delete Method (outlier values are removed from the data.)
  • Average Method (outlier values are set equal to the mean of the variable.)
  • Suppression Method (outlier values are set to the lower or upper threshold value.)
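A small sketch of these three options, reusing the IQR thresholds computed above (the series is made up; suppression is implemented with pandas’ clip, which caps values at the limits):

```python
import pandas as pd

values = pd.Series([4, 5, 6, 6, 7, 7, 8, 8, 9, 30], name="value")

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

# 1) Delete method: drop the outlier observations.
deleted = values[~is_outlier]

# 2) Average method: replace outliers with the mean of the variable
#    (one could also use the mean of the non-outlier values only).
averaged = values.mask(is_outlier, values.mean())

# 3) Suppression method: cap values at the lower/upper thresholds.
suppressed = values.clip(lower=lower, upper=upper)

print(deleted.tolist())
print(averaged.tolist())
print(suppressed.tolist())
```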

If we are working with more than one variable, the Local Outlier Factor (LOF) method is applied. In this method, we identify values that may be outliers by scoring the observations based on the density at their location: the local density of a point is compared with that of its neighbors. If it is significantly lower than its neighbors’ density, we interpret the point as lying in a sparser region than its neighbors. In other words, if the neighborhood of a value is not dense, we infer that the value may be an outlier. The deletion, mean, or suppression methods above can then be applied based on a threshold chosen from these scores.
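A minimal sketch with scikit-learn’s LocalOutlierFactor on two made-up numeric features (the number of neighbors and the contamination rate are illustrative assumptions):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Two numeric features; the last row sits far from the dense region.
X = np.array([
    [25, 50000], [27, 52000], [30, 58000], [31, 60000],
    [29, 55000], [26, 51000], [28, 54000], [90, 300000],
], dtype=float)

lof = LocalOutlierFactor(n_neighbors=5, contamination=0.1)
labels = lof.fit_predict(X)            # -1 marks potential outliers, 1 marks inliers
scores = lof.negative_outlier_factor_  # the more negative, the more outlying

print(labels)
print(np.round(scores, 2))
```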

In this article, we focused on the importance of data preprocessing before modeling, what we can do when cleaning messy data, and how to detect and handle outliers. In the next article, we will continue with missing data and data transformation.
