Machine Learning Step 2 (B) — Data Preprocessing

Ngu Hui En
5 min read · Mar 18, 2024


In Part B, we will discuss data encoding, data normalization, feature selection, and data splitting.

Steps of Data Cleaning. Created by NguHE.

5. Data Encoding

Definition: Data encoding is the process of converting categorical or textual data into numerical representations that can be easily processed by machine learning algorithms. Categorical data consists of non-numerical values representing discrete categories or groups, such as gender, color, or product type. Since most machine learning algorithms require numerical inputs, data encoding is essential to transform categorical data into a format that algorithms can understand and utilize effectively.

Treatments: Label Encoding, One-Hot Encoding, Binary Encoding, Ordinal Encoding, Frequency Encoding, Target Mean Encoding (a short code sketch follows the list below)

  • Label Encoding: Label encoding assigns a unique numerical label to each category in a categorical variable. Use label encoding when the categorical variable is ordinal, meaning there is a natural order or ranking among categories. Apply label encoding using libraries like scikit-learn’s LabelEncoder.
  • One-Hot Encoding: One-hot encoding converts categorical variables into binary vectors, where each category is represented by a binary digit (0 or 1). Use one-hot encoding when the categorical variable is nominal, meaning there is no inherent order among categories. Apply one-hot encoding using libraries like Pandas or scikit-learn’s OneHotEncoder.
  • Binary Encoding: Binary encoding converts categorical variables into binary representations, where each category is represented by its binary equivalent. Use binary encoding when dealing with categorical variables with a large number of unique categories to reduce dimensionality. Apply binary encoding using libraries like category_encoders in Python.
  • Ordinal Encoding: Ordinal encoding assigns numerical values to categories based on their order or rank. Define the order of categories and map them to numerical values accordingly.
  • Frequency Encoding: Frequency encoding maps each categorical level to a value between 0 and 1 based on its relative frequency in the dataset.
  • Target Mean Encoding: Target mean encoding replaces each categorical level with the mean of the target (response) variable for observations in that level.
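As a minimal sketch of several of these encoders, the snippet below uses a toy DataFrame; the column names, values, and the binary target 'y' are made-up assumptions for illustration, and in real pipelines encoders are typically fit on the training split only.

```python
# A minimal sketch, assuming a toy DataFrame with a nominal 'color' column,
# an ordinal 'size' column, and a hypothetical binary target 'y'.
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],  # nominal
    "size":  ["S", "M", "L", "M"],               # ordinal
    "y":     [1, 0, 1, 1],                       # target for mean encoding
})

# Label encoding: one integer per category
df["color_label"] = LabelEncoder().fit_transform(df["color"])

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df["color"], prefix="color")

# Ordinal encoding: integers that respect an explicit category order
df["size_ordinal"] = OrdinalEncoder(categories=[["S", "M", "L"]]).fit_transform(df[["size"]]).ravel()

# Frequency encoding: relative frequency of each level
df["color_freq"] = df["color"].map(df["color"].value_counts(normalize=True))

# Target mean encoding: mean of the response per level
# (in practice computed within cross-validation folds to avoid leakage)
df["color_target_mean"] = df["color"].map(df.groupby("color")["y"].mean())

# Binary encoding is available via category_encoders.BinaryEncoder (a separate package).
```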

6. Data Normalization

Definition: Data normalization is the process of rescaling numerical data to a standard range or distribution to ensure consistency and comparability across different features or variables. Normalization techniques adjust the scale of data without distorting its relative differences, making it easier for machine learning algorithms to converge efficiently and interpret the importance of features correctly.

Treatments: Min-Max Scaling, Z-score Normalization (Standardization), Robust Scaling (a short code sketch follows the list below)

  • Min-max Scaling: Min-Max scaling rescales data to a fixed range, typically between 0 and 1. Apply Min-Max scaling when the distribution of data is relatively uniform and bounded. Scale the data using libraries like scikit-learn’s MinMaxScaler.
  • Z-score Normalization (Standardization): Standardization transforms data to have a mean of 0 and a standard deviation of 1. Apply standardization when the distribution of data is approximately normal or when the algorithm assumes zero-centered data. Standardize the data using libraries like scikit-learn’s StandardScaler.
  • Robust Scaling: Robust scaling scales data based on robust statistics that are less sensitive to outliers, such as the median and interquartile range (IQR). Apply robust scaling when the data contains outliers or exhibits significant skewness. Scale the data using libraries like scikit-learn’s RobustScaler.
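The sketch below applies all three scalers to a single numeric feature; the values (including the outlier) are made up for illustration.

```python
# A minimal sketch, assuming one numeric feature with an obvious outlier.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

X = np.array([[1.0], [2.0], [3.0], [100.0]])  # the 100 acts as an outlier

# Min-max scaling: x' = (x - min) / (max - min), rescaled to [0, 1]
x_minmax = MinMaxScaler().fit_transform(X)

# Z-score standardization: z = (x - mean) / std, giving mean 0 and std 1
x_zscore = StandardScaler().fit_transform(X)

# Robust scaling: (x - median) / IQR, less affected by the outlier
x_robust = RobustScaler().fit_transform(X)
```

In practice, the scaler is fit on the training split only and then reused to transform the validation and test splits, so that information from unseen data does not leak into training.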

7. Feature Selection

Definition: Feature selection, also known as variable selection or attribute selection, is the process of selecting a subset of relevant features (variables, attributes) from a larger set of available features. The goal of feature selection is to improve model performance, reduce computational complexity, and enhance interpretability by focusing on the most informative and discriminative features while discarding redundant or irrelevant ones.

Treatments: Filter Methods (Correlation, Chi-square test, ANOVA), Wrapper Methods (Forward Selection, Backward Elimination, Recursive Feature Elimination), Embedded Methods (a code sketch of each family follows the list below)

  • Filter Methods: Filter methods assess the relevance of features independently of the machine learning algorithm. Apply statistical tests, correlation analysis, or information-theoretic measures to evaluate the relevance of features. Select features based on their individual scores or rankings. Common techniques include: Pearson correlation coefficient for numerical features, Chi-square test, mutual information, or ANOVA for categorical features.
  • Wrapper Methods: Wrapper methods evaluate feature subsets by training and evaluating the performance of a machine learning algorithm. Use search strategies such as forward selection, backward elimination, or recursive feature elimination (RFE) to iteratively select the best subset of features. Train a predictive model using each feature subset and evaluate its performance using cross-validation or other validation techniques. Select the feature subset that optimizes a performance metric such as accuracy, AUC, or F1-score. Wrapper methods can be computationally intensive but provide more accurate feature selection compared to filter methods.
  • Embedded Methods: Embedded methods perform feature selection as part of the model training process by incorporating feature selection mechanisms directly into the learning algorithm. Utilize algorithms that inherently perform feature selection during training, such as Lasso (L1 regularization), Ridge (L2 regularization), or tree-based methods like Random Forests or Gradient Boosting Machines. Train the model with regularization penalties or feature importance scores and let the algorithm automatically select the most relevant features during training.
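The sketch below shows one example from each family on a synthetic classification dataset; the dataset, the model choices, and keeping k = 4 features are assumptions for illustration (an L1-penalized logistic regression stands in for Lasso-style regularization on a classification target).

```python
# A minimal sketch of filter, wrapper, and embedded feature selection.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)

# Filter method: score each feature with the ANOVA F-test, keep the top 4
X_filtered = SelectKBest(score_func=f_classif, k=4).fit_transform(X, y)

# Wrapper method: recursive feature elimination around a logistic regression
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=4).fit(X, y)
print("RFE keeps features:", rfe.support_)

# Embedded method: L1 regularization zeroes out weak coefficients during training...
l1_model = LogisticRegression(penalty="l1", solver="liblinear").fit(X, y)
print("Non-zero L1 coefficients:", (l1_model.coef_ != 0).sum())

# ...and tree ensembles expose feature importances learned during training
forest = RandomForestClassifier(random_state=0).fit(X, y)
print("Random forest importances:", forest.feature_importances_.round(2))
```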

8. Data Splitting

Definition: Data splitting, also known as dataset splitting or partitioning, is the process of dividing a dataset into multiple subsets for training, validation, and testing purposes in machine learning tasks. The primary goal of data splitting is to evaluate the performance and generalization ability of machine learning models effectively by ensuring that they are tested on unseen data.

  • Train-Validation-Test Split: The dataset is split into three subsets: training set, validation set, and test set. Allocate the majority of the data (e.g., 70–80%) to the training set to train the model. Reserve a smaller portion (e.g., 10–15%) for the validation set to tune hyperparameters and monitor model performance during training. Keep a separate portion (e.g., 10–15%) for the test set to evaluate the final performance of the trained model on unseen data. Use libraries like scikit-learn’s train_test_split function to split the dataset into training, validation, and test sets.
  • Cross-validation: Cross-validation involves repeatedly partitioning the dataset into subsets, training the model on some subsets, and evaluating it on others. Choose a cross-validation method such as k-fold cross-validation or stratified k-fold cross-validation based on the dataset characteristics and modeling requirements. Divide the dataset into k folds, where each fold serves as both a training set and a validation set in different iterations. Repeat the process k times, each time using a different fold as the validation set and the remaining folds as the training set. Use the average performance across all iterations as the final performance metric (a code sketch of both splitting approaches follows this list).
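The sketch below assumes a 70/15/15 train/validation/test split and a logistic regression model; the dataset is synthetic and the ratios simply mirror the ranges mentioned above.

```python
# A minimal sketch of a train/validation/test split and of k-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)

# Split off 15% as the test set, then carve 15% of the original data out of the
# remainder as the validation set (0.15 / 0.85 of what is left).
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.15 / 0.85, stratify=y_temp, random_state=0)

# Alternative evaluation scheme: stratified 5-fold cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=cv)
print("Mean CV accuracy:", scores.mean().round(3))
```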

Check out Part A if you haven’t.

Connect with Me:

If you found this guide insightful, feel free to connect with me on LinkedIn. Let’s continue the conversation on the intriguing world of machine learning data preprocessing. Happy exploring! 🚀✨
