Chaos to Order: Data Pre-processing for Machine Learning Mastery

Honzik J
9 min read · Mar 10, 2023


Data pre-processing is like preparing a dish — if you don’t clean, peel and chop your ingredients, you won’t end up with a very nice dish.

Continuing with the culinary analogy, we’ve gathered our ingredients in our exploratory data analysis, and now it’s time to start preparing them for our dish. In this article, we’ll learn how to turn that mess of raw data into a well-organised and refined dataset that’s ready to power your models.

Data pre-processing is an essential step in machine learning that involves transforming raw data into a clean, structured dataset that machine learning algorithms can easily analyse. The goals of pre-processing are to remove noise, correct errors, and ensure that the data is accurate and complete. Without proper pre-processing, models may suffer from overfitting, underfitting, or other issues that can lead to inaccurate or unreliable results. When carried out well, data pre-processing can significantly improve the quality and reliability of your models.

Data cleaning

Checking data types

When it comes to data pre-processing, checking and verifying the data types of your features is crucial. This step ensures that each feature is correctly interpreted by your machine learning algorithms and can help you identify potential issues such as incorrect encoding or mixed data types. For example, you might have a column that is supposed to contain numeric values but instead holds strings due to formatting errors or missing values. Getting the types right matters for the rest of pre-processing, as different data types are handled differently.
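
As a quick illustration, here is a minimal pandas sketch, assuming a DataFrame `df` with hypothetical `price` and `order_date` columns that should be numeric and datetime respectively:

```python
import pandas as pd

# Inspect the data type pandas has inferred for each column
print(df.dtypes)

# Coerce a column that should be numeric; badly formatted entries
# become NaN so they can be dealt with as missing values later
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# Parse a date column stored as strings into proper datetime values
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
```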

Removing duplicate values

Duplicate values can occur for various reasons, such as data entry errors, merging data from multiple sources, or technical issues with data collection. If not addressed, duplicate values can lead to biased models, incorrect estimates of variability, and other issues that can impact the accuracy of your results. Therefore, it’s important to identify and remove these from your dataset. This can be done using various techniques, such as sorting the dataset and removing the duplicates or using specific functions in your code to identify and remove duplicates.
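
In pandas this is usually a one-liner; a minimal sketch, assuming a DataFrame `df` and a hypothetical `customer_id` column used as a uniqueness key:

```python
# Count exact duplicate rows before removing them
print(f"Found {df.duplicated().sum()} duplicate rows")

# Keep only the first occurrence of each duplicated row
df = df.drop_duplicates(keep="first")

# Or treat rows as duplicates based on a subset of columns,
# e.g. a hypothetical unique customer_id
df = df.drop_duplicates(subset=["customer_id"], keep="first")
```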

Removing outliers and anomalies

Outliers are data points that lie far outside the expected range of values for a particular feature, while anomalies are values that are completely unexpected and likely represent errors or inaccuracies in data collection or entry. These data points can lead to biased models and incorrect predictions, which can be detrimental in critical applications such as healthcare, finance, and transportation. Depending on your application, you may need to remove these from your dataset before building your models. This can be done using various statistical techniques such as the Z-score, Interquartile Range (IQR) or using visualisation tools such as scatterplots and boxplots to identify extreme values.
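
As a rough sketch of both statistical approaches, assuming a pandas DataFrame `df` with a hypothetical numeric `income` column:

```python
col = df["income"]

# IQR method: flag values more than 1.5 * IQR beyond the quartiles
q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
within_iqr = col.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Z-score method: flag values more than 3 standard deviations from the mean
within_z = ((col - col.mean()) / col.std()).abs() <= 3

# Keep only the rows that pass both checks
df_clean = df[within_iqr & within_z]
```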

Handling missing values

Missing values are a common problem in real-world datasets. They can occur for various reasons, such as the structure and quality of the data, data entry errors, data loss during transmission, or incomplete data collection. Missing values can affect the accuracy and reliability of machine learning models by introducing bias and skewing your results, and some models cannot handle them at all. Therefore, handling missing values appropriately is essential before you build your model.

There are several techniques for handling missing values. The following are a few common examples (a short code sketch of several of these follows the list):

  1. Deletion: This involves removing the rows or columns with missing values. This is usually done when the percentage of missing values is very small or when the missing values do not significantly impact the analysis or results.
  2. Mean/Median Imputation: Replacing the missing values with the mean or median of the feature. This should only be used for numerical features (and the mean only when the feature is roughly normally distributed).
  3. Mode Imputation: Replacing the missing values with the mode, or most commonly occurring value of the feature. This should only be used for categorical features where one category is dominant.
  4. Forward/Backward Fill: Filling the missing values with the value from the previous or next observation in the dataset. This method is used for time-series data or when the missing values occur in a pattern.
  5. Regression Imputation: Using a regression model to estimate the missing values based on other features in the dataset. This method works well when there is a strong correlation between features.
  6. Machine learning imputation: Using a machine learning model such as K-Nearest Neighbours (KNN) or a neural network to estimate the missing value.
  7. Best guess based on domain knowledge: Missing values can also be replaced with an estimate informed by domain knowledge and business understanding. This should be done with a subject matter expert who understands the domain and the data.
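
As an illustrative (not exhaustive) sketch of a few of these techniques, assuming a pandas DataFrame `df` with hypothetical columns `age` (numeric), `city` (categorical) and `temperature` (time-ordered):

```python
from sklearn.impute import KNNImputer

# 1. Deletion: drop any row that contains a missing value
df_dropped = df.dropna()

# 2. Median imputation for a numeric feature
df["age"] = df["age"].fillna(df["age"].median())

# 3. Mode imputation for a categorical feature
df["city"] = df["city"].fillna(df["city"].mode()[0])

# 4. Forward fill for time-ordered data
df["temperature"] = df["temperature"].ffill()

# 6. KNN imputation across the numeric columns
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = KNNImputer(n_neighbors=5).fit_transform(df[numeric_cols])
```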

Feature engineering

Feature engineering is a powerful step in the pre-processing pipeline, which involves transforming raw data into useful features that can help improve the performance of a model. Feature engineering is important because the performance of machine learning algorithms largely depends on the quality and relevance of the features used as inputs. This process requires a deep understanding of the data, its underlying patterns, and the problem domain.

Feature transformation

Many machine learning algorithms cannot directly handle categorical data in their original form. Instead, they require numerical data to operate properly. One popular encoding technique is one-hot encoding, which transforms each categorical feature into a set of binary features that represent each of the possible values.

Similarly, we can do the reverse and turn numerical features into categorical ones through bucketisation or binning. This aims to simplify the data and reduce the noise by grouping similar values together. This can make the data more manageable and easier to analyse and help identify patterns and relationships that may not be apparent in the raw data. There are two main types of binning: equal-width binning (fixed number of equally sized bins) and equal-frequency binning (the number of observations in each bin is equal).
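
A minimal pandas sketch of both directions, assuming a DataFrame `df` with hypothetical `colour` (categorical) and `age` (numeric) columns:

```python
import pandas as pd

# One-hot encode a categorical feature into binary indicator columns
df = pd.get_dummies(df, columns=["colour"], prefix="colour")

# Equal-width binning: four bins spanning equal ranges of age
df["age_bin_width"] = pd.cut(df["age"], bins=4)

# Equal-frequency binning: four bins with roughly equal numbers of rows
df["age_bin_freq"] = pd.qcut(df["age"], q=4)
```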

Feature construction

This involves combining or transforming existing features to create new ones that capture more complex relationships between the input features and our target. Domain knowledge is particularly valuable here.

Suppose we are working on a classification problem to predict whether a bank loan will default based on various demographic and financial variables. One of the variables is the borrower’s total income. We could use feature construction to create a new feature that captures the borrower’s debt-to-income ratio, which may better predict loan default risk than total income alone. To do this, we would divide the borrower’s total debt by their total income, creating a new feature that represents their level of debt relative to their income.
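
Assuming hypothetical `total_debt` and `total_income` columns in a pandas DataFrame `df`, the construction itself is a single line:

```python
import numpy as np

# Debt-to-income ratio: total debt relative to total income
# (zero incomes are replaced with NaN to avoid division-by-zero artefacts)
df["debt_to_income"] = df["total_debt"] / df["total_income"].replace(0, np.nan)
```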

Feature scaling

Feature scaling is a step where we bring all the features to a common scale. In short, feature scaling helps models achieve better performance and accuracy. It ensures that all features have the same level of importance and prevents certain features from dominating others. Scaling can also help improve the convergence rate of some algorithms, such as gradient descent. The scaling technique applied (or whether we use one at all) will depend on the algorithm we choose, as some algorithms only benefit from specific scaling techniques. The most common options are listed below, with a short code sketch after the list.

  • Standardisation: Scales the features to have a mean of 0 and a standard deviation of 1 by subtracting the mean and dividing by the standard deviation. This is useful for neural networks and other gradient-descent-based algorithms.
  • Normalisation: Scales the features to a range between 0 and 1 by subtracting the minimum value and dividing by the range. Helpful for algorithms that rely on similarity or distance measures such as clustering algorithms, k-Nearest Neighbours (KNN), support vector machines or principal component analysis (PCA).
  • Robust scaling: Similar to normalisation, it is designed to be more robust to outliers than other scaling methods. It scales the features using the interquartile range (IQR) instead of the range or standard deviation.
  • Max-abs scaling: Scales the features between -1 and 1 by dividing each feature by its maximum absolute value. Some examples of algorithms that can benefit from max-abs scaling are neural networks and algorithms used for image processing.
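
All four are available in scikit-learn; a minimal sketch, assuming `X_train` and `X_test` are numeric feature matrices that have already been split for modelling:

```python
from sklearn.preprocessing import (
    MaxAbsScaler, MinMaxScaler, RobustScaler, StandardScaler
)

# Swap in MinMaxScaler (normalisation), RobustScaler or MaxAbsScaler as needed
scaler = StandardScaler()

# Fit on the training data only, then apply the same transformation to the
# test data so that no test-set statistics leak into training
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```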

Feature selection

Feature selection techniques involve selecting the most relevant features for the problem at hand based on their importance, correlation, or other statistical measures.

Filter methods

Features are selected according to their relationship with the target. These methods evaluate each feature independently using statistical measures such as correlation and then select the top-ranked features. Here we can go back to our EDA and check which features were the most correlated with our target. Examples of filter methods include the Pearson correlation coefficient, the chi-square test, and information gain.
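
A minimal sketch of two such filters, assuming a pandas DataFrame `X` of non-negative numeric features (as the chi-square test requires) and a target series `y`:

```python
from sklearn.feature_selection import SelectKBest, chi2

# Rank features by their absolute Pearson correlation with the target
print(X.corrwith(y).abs().sort_values(ascending=False).head(10))

# Keep the 10 features with the highest chi-square score
# (chi2 requires non-negative feature values, e.g. counts or one-hot columns)
selector = SelectKBest(score_func=chi2, k=10)
X_selected = selector.fit_transform(X, y)
```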

Wrapper methods

These methods use a model to evaluate the performance of subsets of features and select the subset that yields the best performance (as per a defined performance metric). Because evaluating every possible combination of features is usually computationally infeasible, wrapper methods typically search greedily, for example through forward selection, backward elimination, or recursive feature elimination (RFE), fitting the specific machine learning algorithm we plan to use at each step and keeping the subset with the best result.
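
Recursive feature elimination is one widely used wrapper; a sketch assuming a feature DataFrame `X`, a target `y` and an illustrative logistic regression base model:

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Recursively drop the weakest feature until only 10 remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=10)
rfe.fit(X, y)

print(X.columns[rfe.support_])
```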

Embedded methods

These methods perform feature selection as part of the learning algorithm itself, combining some of the strengths of filter and wrapper methods: the algorithm learns and selects features simultaneously. Examples of embedded methods include Lasso regression (whose L1 penalty drives uninformative coefficients to exactly zero), decision trees, and random forests (via their feature importances).
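
A sketch of Lasso-based selection, assuming a regression problem with a standardised feature DataFrame `X` and target `y`:

```python
import pandas as pd
from sklearn.linear_model import LassoCV

# Fit a Lasso model with a cross-validated regularisation strength;
# the L1 penalty pushes uninformative coefficients to exactly zero
lasso = LassoCV(cv=5).fit(X, y)

coefs = pd.Series(lasso.coef_, index=X.columns)
selected = coefs[coefs != 0].index
print(f"Kept {len(selected)} of {X.shape[1]} features")
```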

Dimensionality reduction techniques

These reduce the number of features by projecting them onto a lower-dimensional space while retaining the most important information. Examples of dimensionality reduction techniques include Principal Component Analysis (PCA), Singular Value Decomposition (SVD), and t-distributed Stochastic Neighbor Embedding (t-SNE).
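
For example, a minimal PCA sketch that keeps enough components to explain 95% of the variance, assuming an already scaled feature matrix `X_scaled`:

```python
from sklearn.decomposition import PCA

# Keep the smallest number of components explaining 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print(f"Reduced from {X_scaled.shape[1]} to {X_reduced.shape[1]} dimensions")
```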

Splitting data for modelling

When we’re happy with our features, the final step is to split our data into training and testing sets (for supervised learning). Think of it like studying for a test — you wouldn’t just memorise the answers and assume you’re prepared for the whole exam. Similarly, you can’t just train your model on all of your data and hope it performs well on unseen data. By splitting the data into training and testing sets, you can train your model on one set and evaluate its performance on the other. This helps avoid overfitting and ensures your model generalises well to new, unseen data.

Holdout method

The simplest and most common method is to randomly divide the data into two non-overlapping sets: a training set and a testing set. The size of the testing set can be specified, and the remaining data is used for training the model. A split of 80% training and 20% testing is common, though other ratios can be used.
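
With scikit-learn this is a single call; a sketch assuming features `X` and a target `y` for a classification problem:

```python
from sklearn.model_selection import train_test_split

# 80/20 holdout split; random_state makes the split reproducible and
# stratify keeps class proportions similar (classification problems only)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```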

Time-based splitting

This method is used when the data is time-series data. In this method, the data is split into training and testing sets based on time, where the earlier data is used for training, and the later data is used for testing. This ensures that the model is trained on historical data and tested on future data, simulating real-world scenarios where the model is expected to make predictions on future data.

When using time-based splitting, the data must be sorted by time before splitting, preventing data leakage (where information from the future is inadvertently used to train the model).
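
A minimal sketch, assuming a pandas DataFrame `df` with a hypothetical `date` column, training on the oldest 80% of rows:

```python
# Sort chronologically so no future rows end up in the training set
df = df.sort_values("date")

# Train on the earliest 80% of rows, test on the most recent 20%
cutoff = int(len(df) * 0.8)
train_df, test_df = df.iloc[:cutoff], df.iloc[cutoff:]
```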

Splitting into three sets

There are situations where it can be helpful to split the data into three sets. The third set is known as the validation set and is used to evaluate the model's performance during hyperparameter tuning or model selection. In this scenario, the data is first split into a training set and a testing set. The training set is then used to train the model, and the testing set is used to evaluate the final performance of the model.

However, before the final evaluation, hyperparameters may need to be adjusted, or multiple models may need to be trained and compared. The validation set is used to make these decisions. This process is repeated with different hyperparameters or models until the best-performing model is selected. Once the best model is selected, it is evaluated on the testing set to estimate its performance on unseen data.
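
One common way to obtain the three sets is to split twice, for example into 60% training, 20% validation and 20% testing; a sketch assuming features `X` and target `y`:

```python
from sklearn.model_selection import train_test_split

# First hold out 20% of the data as the final test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Then split the remainder into training (60% overall) and validation (20% overall)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)
```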

Conclusion

Data pre-processing and feature selection are critical steps in machine learning that can significantly impact the accuracy and reliability of your models. By following best practices and using proven techniques, you can transform your raw data into high-quality features that unlock the true potential of your models. Remember to carefully analyse your data, identify and remove outliers, and select the most relevant features to ensure that your models are accurate and efficient. With these skills in your toolkit, you’ll be well on your way to becoming a data pre-processing pro and unleashing the full power of your machine learning models.

Once we’ve finished pre-processing our data, we can start building our models and get cooking!
