Machine Learning Step 2 (A): Data Preprocessing

Ngu Hui En
8 min read · Mar 18, 2024


In Part A, we will discuss duplicates, fixing structural errors, handling missing data, and handling outliers.

Data Cleaning.

In today’s data-driven world, ensuring the quality of data is paramount for informed decision-making, accurate analysis, and efficient operations across various domains. A significant portion of a data scientist’s time, according to studies and articles, is dedicated to preparing data for analysis. This data cleaning process can take up anywhere from 60% to 80% of their work hours.

However, assessing data quality isn’t a straightforward task; it requires a multi-dimensional approach that goes beyond simple correctness. A comprehensive measure of data quality encompasses several key aspects, including validity, accuracy, consistency, integrity, timeliness, and completeness.

Steps of Data Cleaning. Created by NguHE.

1. Duplicates

Definition: Duplicates in a dataset refer to records or observations that are identical or nearly identical to other records within the same dataset. These duplicates can occur due to various reasons such as data entry errors, data integration processes, or system malfunctions. Duplicates can exist across all or specific columns within a dataset. When dealing with datasets from multiple sources, integrate and consolidate data carefully to avoid introducing duplicates.

Treatments: Complete Removal, Partial Removal

  • Complete Removal: Removing all duplicates from the dataset, keeping only one instance of each unique record.
  • Partial Removal: Removing duplicates based on specific criteria or columns deemed relevant for uniqueness.
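
As a minimal sketch of both treatments in pandas (the columns and values below are hypothetical), complete removal maps onto `drop_duplicates()` over all columns, while partial removal passes a `subset` of key columns:

```python
import pandas as pd

# Hypothetical customer records: row 2 is an exact duplicate of row 1,
# and row 3 shares the same email but differs elsewhere.
df = pd.DataFrame({
    "customer_id": [101, 102, 102, 103],
    "email": ["a@x.com", "b@x.com", "b@x.com", "b@x.com"],
    "signup_date": ["2024-01-05", "2024-02-10", "2024-02-10", "2024-03-01"],
})

# Inspect before dropping anything.
print(df.duplicated().sum())                  # exact duplicates across all columns
print(df.duplicated(subset=["email"]).sum())  # duplicates on a chosen key column

# Complete removal: keep one instance of each fully identical record.
complete = df.drop_duplicates()

# Partial removal: treat rows as duplicates based on specific columns only.
partial = df.drop_duplicates(subset=["email"], keep="first")
```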

2. Fix Structural Errors

Definition: Structural errors in datasets refer to inconsistencies or discrepancies in the structure of the data that can hinder analysis and modeling. These errors can arise due to various reasons such as data collection methods, data entry errors, or data integration processes. Structural errors can manifest in different forms, including inconsistent data types, mismatched formats, or irregularities in data organization.

Identification: Perform data profiling to identify structural inconsistencies, including irregularities in data types, formats, and structures. Utilize descriptive statistics, visualization techniques, and data exploration tools to uncover structural errors.

Treatments: Standardization, Data Cleaning, Integration and Alignment

  • Standardization: Standardize data formats and structures to ensure consistency across the dataset. Convert inconsistent data types to a unified format (e.g., converting date strings to datetime objects and ensuring numerical variables are on the same scale).
  • Data Cleaning: Cleanse data by correcting errors, inconsistencies, or anomalies in the dataset. Address misspellings, typographical errors, or incorrect entries to rectify structural inconsistencies.
  • Integration and Alignment: Integrate data from multiple sources and align data structures to ensure seamless interoperability. Resolve structural discrepancies between datasets by harmonizing schemas and resolving mismatches.
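
To make the first two treatments concrete, here is a small pandas sketch (with hypothetical columns and values): dates stored as strings become datetime objects, prices stored as formatted text become numbers, and inconsistent category labels are harmonized:

```python
import pandas as pd

# Hypothetical raw extract with structural issues: dates stored as strings,
# prices as formatted text, and inconsistent spellings of the same country.
df = pd.DataFrame({
    "order_date": ["2024-03-01", "2024-03-02", "2024-03-03"],
    "price": ["1,200", "950", "1,050"],
    "country": ["USA", "U.S.A.", "usa"],
})

# Standardization: unify data types and formats.
df["order_date"] = pd.to_datetime(df["order_date"])
df["price"] = df["price"].str.replace(",", "", regex=False).astype(float)

# Data cleaning: harmonize category labels (case, punctuation, spelling).
df["country"] = (
    df["country"]
    .str.upper()
    .str.replace(".", "", regex=False)
    .replace({"USA": "United States"})
)

print(df.dtypes)
print(df)
```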

3. Missing Values

Definition: Missing values in datasets refer to the absence of data or information for specific observations or variables. These missing values can occur due to various reasons such as data entry errors, equipment malfunctions, survey non-response, or intentional omission. Missing values are represented differently depending on the data format and may include NaN (Not a Number), NULL, blank cells, or other placeholders.

Identification: Use descriptive statistics or summary functions to identify variables with missing values. Visualize missing value patterns using histograms or bar charts to understand the extent and distribution of missingness.
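
A quick way to do this in pandas (the dataset below is hypothetical; matplotlib is assumed for the bar chart) is to count nulls per column and plot the counts:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical dataset with gaps in several columns.
df = pd.DataFrame({
    "age": [25, np.nan, 37, 41, np.nan],
    "income": [50000, 62000, np.nan, 58000, 61000],
    "gender": ["F", "M", None, "F", "M"],
})

# Count and proportion of missing values per variable.
print(df.isnull().sum())
print(df.isnull().mean().round(2))

# Bar chart of missingness per column to see the extent of the problem.
df.isnull().sum().plot(kind="bar", title="Missing values per column")
plt.show()
```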

Treatments: Removal, Imputation

  • Removal: Exclude observations or variables with missing values entirely from the analysis. Removal of observations or variables with missing values is typically considered when the missingness is random (does not introduce bias) and the proportion of missing values is relatively small compared to the total dataset.
  • Mean Imputation: Replace missing numerical values with the mean of the variable. Mean imputation is suitable when the distribution of the variable is approximately symmetric and close to normal, since the mean then represents the central tendency of the data well. The mean is more sensitive to outliers than the median, so mean imputation is preferred only when the variable does not have significant outliers that could skew the mean. (Several of these imputation options are sketched in code after this list.)
  • Median Imputation: Replace missing numerical values with the median of the variable. Median imputation is more robust to outliers and is preferred when the distribution of the variable is skewed or non-normal: the median is not pulled by extreme values, so it gives a better estimate of central tendency when outliers would distort the mean.
  • Mode Imputation: Replace missing categorical values with the mode (most frequent category). Apply when missing categorical values are expected to occur frequently and replacing them with the mode maintains the overall distribution.
  • Interpolation: Use interpolation methods (e.g., linear interpolation) to estimate missing values based on neighboring observations. Interpolation is useful when missing values follow a pattern or trend within the dataset, such as time series data, where gaps can be estimated from neighboring time points, or spatial data, where gaps can be interpolated from the values of neighboring locations.
  • Predictive Modeling: Employ machine learning algorithms to predict missing values based on other variables in the dataset. Use when missing values are not missing completely at random and exhibit systematic patterns or relationships with other variables.
  • K-Nearest Neighbors (KNN) Imputation: Replace missing values with the average of nearest neighbors’ values. Suitable when missing values can be reasonably estimated by averaging the values of the nearest neighbors in the feature space.
  • Domain-Specific Handling: Leverage domain expertise to determine appropriate handling strategies based on the nature of the data and the context of the analysis.
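
As a minimal sketch of several of these treatments (hypothetical data; pandas and scikit-learn assumed), removal uses `dropna`, mean/median/mode imputation use `fillna`, interpolation uses `interpolate`, and KNN imputation uses scikit-learn's `KNNImputer`:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical dataset with missing numerical and categorical values.
df = pd.DataFrame({
    "age": [25, np.nan, 37, 41, np.nan, 29],
    "income": [50000, 62000, np.nan, 58000, 61000, np.nan],
    "gender": ["F", "M", None, "F", "M", "F"],
})

# Removal: drop rows that contain any missing value.
dropped = df.dropna()

# Mean / median imputation for numerical variables.
df["age_mean"] = df["age"].fillna(df["age"].mean())
df["age_median"] = df["age"].fillna(df["age"].median())

# Mode imputation for a categorical variable.
df["gender"] = df["gender"].fillna(df["gender"].mode()[0])

# Linear interpolation, e.g. for ordered or time-series data.
df["income_interp"] = df["income"].interpolate(method="linear")

# KNN imputation: fill gaps with the average of the nearest neighbours.
knn = KNNImputer(n_neighbors=2)
df[["age", "income"]] = knn.fit_transform(df[["age", "income"]])
```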

4. Outliers

Definition: Outliers are data points that significantly deviate from the rest of the observations in a dataset. They can occur due to various reasons, including measurement errors, data corruption, or genuinely rare events. Outliers can distort statistical analyses and machine learning models, leading to biased results and reduced model performance.

Identification: Use statistical measures such as the mean, median, standard deviation, and quartiles to detect outliers. Apply techniques like the z-score (a standardized measure of deviation from the mean), the interquartile range (IQR), or the modified z-score (which bases the calculation on the median and median absolute deviation instead of the mean and standard deviation) to identify observations that deviate significantly from the rest of the data. Visualization techniques such as box plots, scatter plots, histograms, or Q-Q plots can be used to inspect the distribution of the data and spot potential outliers visually.
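
The sketch below (hypothetical data; pandas, NumPy, and matplotlib assumed) applies the z-score and IQR rules to a variable with one injected outlier and draws a box plot for a visual check:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical variable: 200 roughly normal values plus one injected outlier.
rng = np.random.default_rng(0)
x = pd.Series(np.append(rng.normal(loc=50, scale=5, size=200), 120))

# Z-score rule: flag points more than 3 standard deviations from the mean.
z = (x - x.mean()) / x.std()
print(x[z.abs() > 3])

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1
print(x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)])

# Visual check with a box plot.
x.plot(kind="box")
plt.show()
```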

Treatments: Removal, Log Transformation, Winsorization, Box-Cox Transformation, IQR method, Capping, Flooring, Machine Learning Algorithm

  • Removal: Remove outliers from the dataset entirely, particularly if they are due to data entry errors or measurement issues. Use removal when outliers are likely due to errors in data collection or measurement. Apply removal sparingly, as it can lead to loss of information and potential bias if outliers represent valid observations.
  • Log Transformation: Apply logarithmic transformation to skewed data distributions to reduce the impact of outliers. Use log transformation for data with positively skewed distributions, where the majority of observations are clustered around lower values, and outliers are present in the higher end. Log transformation can stabilize variance and reduce the impact of extreme values, making the data more suitable for analysis.
  • Winsorization: Replace extreme values with less extreme ones, typically the values at a chosen percentile (e.g., the 95th percentile for upper winsorization and the 5th percentile for lower winsorization). Use winsorization when the data distribution contains extreme values that are genuine but may negatively impact the analysis or modeling task; it reduces the influence of outliers while preserving the overall shape of the distribution. (A code sketch of several of these treatments follows below.)
  • Box-Cox Transformation: Transform data using the Box-Cox method to stabilize variance and normalize the distribution. Use Box-Cox transformation when the data exhibits heteroscedasticity (unequal variance across the range of predictor variable) or non-normality. Box-Cox transformation optimizes the transformation parameter to stabilize variance and normalize the distribution, making it suitable for linear regression or other parametric models.
  • IQR Method: The interquartile range (IQR) is a measure of statistical dispersion that represents the range between the first quartile (25th percentile) and the third quartile (75th percentile) of a dataset. Outliers are typically defined as observations that fall below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. The IQR provides a robust measure of variability that is less influenced by extreme values than the range or standard deviation. Use the IQR method when the data distribution is skewed or contains outliers that deviate significantly from the bulk of the data; it is less sensitive to extreme values than mean- and standard-deviation-based methods, making it suitable for non-normal distributions or for outliers that represent genuine but unusual observations.
  • Capping: Set a threshold beyond which data points are considered outliers and replace them with the threshold value. Use capping to set upper and lower bounds for outlier values when the extreme values are genuine but may adversely affect the analysis or modeling task. Capping replaces outlier values beyond the specified thresholds with the threshold values, effectively limiting their impact on subsequent analysis. It may affect the distribution of data, especially if a large number of outliers are capped. Use cases: financial data analysis, quality control, where both high and low outliers may occur.
  • Flooring: Set a lower bound threshold for outlier values and replace values below this threshold with the threshold value. Use flooring when outliers are present in the lower end of the data distribution and need to be treated similarly to capping. Flooring replaces values below a specified threshold with the threshold value, mitigating the influence of extreme values while retaining the integrity of the data. It tends to preserve the distribution shape while limiting extreme values. Use cases: medical data, where outliers represent minimum thresholds for certain physiological measures.
  • Machine Learning Algorithms: Utilize robust statistical methods and machine learning algorithms that are less sensitive to outliers, such as robust regression, robust covariance estimation, or tree-based models. Ensemble methods like Random Forests or Gradient Boosting are generally less affected by outliers compared to linear models. Use machine learning algorithms that are inherently robust to outliers when outlier handling methods may not be sufficient.

Skewness in Data. Source: Biology for life.
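
As a minimal sketch of a few of these treatments (hypothetical, positively skewed data; pandas, NumPy, and SciPy assumed), log and Box-Cox transformations reshape the distribution, while percentile-based winsorization and IQR-based capping/flooring clip the extremes:

```python
import numpy as np
import pandas as pd
from scipy.stats import boxcox

# Hypothetical positively skewed variable with two extreme values appended.
rng = np.random.default_rng(0)
x = pd.Series(np.append(rng.lognormal(mean=3, sigma=0.5, size=200), [500.0, 700.0]))

# Log transformation to compress the long right tail.
x_log = np.log1p(x)

# Winsorization: cap at the 95th percentile and floor at the 5th percentile.
lower, upper = x.quantile(0.05), x.quantile(0.95)
x_wins = x.clip(lower=lower, upper=upper)

# IQR-based capping/flooring: clip anything outside Q1 - 1.5*IQR and Q3 + 1.5*IQR.
q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1
x_iqr = x.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

# Box-Cox transformation (requires strictly positive values).
x_boxcox, lam = boxcox(x)
```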

Check out my next post on Data Preprocessing Part B.

Connect with Me:

If you found this guide insightful, feel free to connect with me on LinkedIn. Let’s continue the conversation on the intriguing world of machine learning data preprocessing. Happy exploring! 🚀✨

