CRISP-DM: Transforming Data for Accurate Insights

Data Mastery Series — Episode 4: Data Preparation

Donato_TH
Donato Story
6 min read · Feb 11, 2023


If you are interested in articles related to my experience, please feel free to contact me: linkedin.com/in/nattapong-thanngam

CRISP-DM framework (Image by Author)

Data preparation is a crucial step in the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology, which is widely used for data science projects. It involves steps such as handling missing values, dealing with outliers, and transforming data for modeling. In this article, we focus on key topics in data preparation: handling missing values, encoding categorical data, and handling imbalanced datasets. We also cover data normalization, feature engineering, and feature selection, which are essential for building accurate and robust predictive models.

Preparing data (Image by Author)

1. Data Preparation:

  • Handling Missing Values and Imputation: Missing data can cause problems with data analysis and modeling. This step involves identifying missing values and handling them with imputation techniques such as mean or mode imputation, or with more complex methods like regression imputation or k-nearest neighbors (KNN) imputation (see the code sketch after this list).
  • Data Type: Data types can be categorical or numerical. Categorical data has a limited number of discrete values, while numerical data can be continuous or discrete. This step involves identifying the data type of each variable to decide which data pre-processing technique to use.
  • Handling Categorical Data (Encoding): Machine learning algorithms generally accept only numerical input, so categorical data needs to be converted into numerical form through techniques like one-hot encoding, ordinal encoding, or label encoding.
  • Imbalance Handling: Imbalanced datasets occur when the proportion of one class is much higher than the other(s). In this step, techniques like oversampling and undersampling are used to balance the dataset.
  • Outlier Handling: Outliers are data points that lie far from the rest of the data. They can be removed, capped, or transformed to keep them from distorting the results of the analysis.
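
To make this concrete, here is a minimal Python sketch using pandas and scikit-learn that walks through imputation, one-hot encoding, outlier capping, and naive oversampling. The DataFrame df, its columns, and the chosen thresholds are illustrative assumptions, not a prescribed recipe.

    # A minimal sketch of common data preparation steps (pandas + scikit-learn).
    # The DataFrame `df`, its columns, and the target name are illustrative assumptions.
    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer
    from sklearn.utils import resample

    df = pd.DataFrame({
        "age":    [25, np.nan, 47, 51, 62, np.nan, 30, 41],
        "city":   ["A", "B", "A", "C", "B", "A", "C", "B"],
        "target": [0, 0, 0, 0, 0, 1, 1, 0],
    })

    # 1) Missing values: mean imputation for the numeric column
    #    (KNNImputer is a drop-in alternative for more complex cases).
    df[["age"]] = SimpleImputer(strategy="mean").fit_transform(df[["age"]])

    # 2) Categorical encoding: one-hot encode the nominal column.
    df = pd.get_dummies(df, columns=["city"])

    # 3) Outliers: cap numeric values outside 1.5 * IQR (a simple, common rule of thumb).
    q1, q3 = df["age"].quantile([0.25, 0.75])
    iqr = q3 - q1
    df["age"] = df["age"].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

    # 4) Imbalance: naive random oversampling of the minority class
    #    (the imbalanced-learn package offers SMOTE and other strategies).
    majority = df[df["target"] == 0]
    minority = df[df["target"] == 1]
    minority_upsampled = resample(minority, replace=True,
                                  n_samples=len(majority), random_state=42)
    df_balanced = pd.concat([majority, minority_upsampled])
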
Normalizing data (Image by Author)

2. Data Normalization

Different variables may have different scales, and some algorithms are sensitive to the scale of the variables. Data normalization is the process of scaling the features of a dataset to a range that is easier for the algorithms to work with. Widely used normalization techniques include the following; a short code sketch follows the list:

  • Min-Max Scaling: Scales the data between 0 and 1.
  • Z-score Normalization: Standardizes the data to have a mean of 0 and a standard deviation of 1.
  • Log Transformation: Compresses the range of large values and spreads out differences among small values, which helps with right-skewed data.
  • Power Transformation: Adjusts the distribution of the data to make it more Gaussian-like.
  • Robust Scaling: Scales the data based on percentiles and is robust to outliers.
  • Unit Vector Scaling: Scales the data to have a unit norm.
  • Max-Abs Scaling: Scales each feature individually by its maximum absolute value, mapping values into the range [-1, 1].
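
As a concrete reference, here is a minimal Python sketch showing how each of these techniques can be applied with scikit-learn and NumPy. The small array X is an illustrative assumption; in practice each scaler should be fit on training data only and then applied to new data.

    # A minimal sketch of the scaling techniques listed above (scikit-learn + NumPy).
    # The array X is an illustrative assumption.
    import numpy as np
    from sklearn.preprocessing import (
        MinMaxScaler, StandardScaler, RobustScaler, MaxAbsScaler,
        PowerTransformer, Normalizer,
    )

    X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 5000.0]])

    x_minmax = MinMaxScaler().fit_transform(X)    # each feature scaled to [0, 1]
    x_zscore = StandardScaler().fit_transform(X)  # mean 0, standard deviation 1
    x_log    = np.log1p(X)                        # log(1 + x); compresses large values
    x_power  = PowerTransformer(method="yeo-johnson").fit_transform(X)  # more Gaussian-like
    x_robust = RobustScaler().fit_transform(X)    # centers on the median, scales by the IQR
    x_unit   = Normalizer(norm="l2").fit_transform(X)  # each row scaled to unit L2 norm
    x_maxabs = MaxAbsScaler().fit_transform(X)    # each feature scaled into [-1, 1]
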
Feature Engineering and Feature Selection (Image by Author)

3. Feature Engineering

Feature engineering is the process of creating new features or variables from existing data, either to improve the performance of machine learning algorithms or to extract more meaningful information from the data. The following are examples of widely used feature engineering techniques; a short code sketch follows the list:

  • Polynomial features: This involves creating new features that are a combination of the existing features raised to a certain power. For example, if we have a feature x, we can create a new feature x², which is the square of x. This can help capture non-linear relationships between features and the target variable.
  • Interaction features: This involves creating new features that are a product of two or more existing features. For example, if we have two features x and y, we can create a new feature x*y, which is the product of x and y. This can help capture the relationship between two features and how they affect the target variable.
  • Variable transformations: This involves transforming existing features to create more meaningful information. For example, we can apply a logarithmic transformation to a feature if the relationship between the feature and the target variable is exponential. Another example is applying a Box-Cox transformation to a feature to normalize its distribution. This can help improve the performance of the machine learning algorithm.
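
The sketch below illustrates these three techniques with scikit-learn, NumPy, and SciPy. The toy DataFrame and the column names x and y are illustrative assumptions.

    # A minimal sketch of the feature engineering techniques above.
    # The DataFrame and its columns are illustrative assumptions.
    import numpy as np
    import pandas as pd
    from scipy import stats
    from sklearn.preprocessing import PolynomialFeatures

    df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [10.0, 20.0, 15.0, 40.0]})

    # Polynomial and interaction features: x, y, x^2, x*y, y^2 (degree 2).
    poly = PolynomialFeatures(degree=2, include_bias=False)
    poly_feats = pd.DataFrame(
        poly.fit_transform(df[["x", "y"]]),
        columns=poly.get_feature_names_out(["x", "y"]),
    )

    # Variable transformations: log transform, and Box-Cox (requires strictly positive values).
    df["y_log"] = np.log(df["y"])
    df["y_boxcox"], fitted_lambda = stats.boxcox(df["y"])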

4. Feature Selection

Feature selection is the process of identifying the features that contribute most to a specific model, done by removing irrelevant features or reducing the number of features to simplify the model. Techniques for feature selection include correlation analysis, backward or forward selection, and regularization. A short code sketch follows the list below.

  • Univariate Selection: Univariate selection is a statistical method that selects features based on their individual relationship with the target variable, such as the correlation coefficient or chi-squared test.
  • Recursive Feature Elimination (RFE): RFE is a wrapper method that works by recursively removing features and building a model on the remaining features until the desired number of features is reached. The importance of each feature is determined by the performance of the model.
  • Model-specific selection: Some models have built-in feature selection methods, such as LASSO for linear regression, which automatically select relevant features and eliminate irrelevant ones.
  • Dimensionality Reduction: Dimensionality reduction techniques like Principal Component Analysis (PCA) and t-SNE can be used to reduce the number of features by creating new features that retain most of the information of the original data.
  • Feature Importance: Feature importance is a measure of how much a feature contributes to the performance of a model. This can be determined using methods such as the coefficient values in linear models or the importance values in tree-based models.
    - Feature importance is a metric that is often used in tree-based algorithms, such as Random Forest or Gradient Boosted Trees. It calculates the importance of each feature based on how much it contributes to the overall performance of the model. Features with higher importance scores are considered to be more important and can be selected for further analysis.
    - SHAP (SHapley Additive exPlanations) is a framework for interpreting the predictions of any machine learning model. It assigns each feature a SHAP value, which represents the contribution of that feature to the prediction for a specific instance. The SHAP values can be used to identify the most important features in the model, which can then be selected for further analysis.
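
To tie these approaches together, here is a minimal Python sketch on a synthetic dataset from scikit-learn. The dataset, parameter values, and the number of selected features are illustrative assumptions; SHAP appears only as a comment because it relies on the third-party shap package.

    # A minimal sketch of the feature selection approaches above, on a synthetic dataset.
    # All parameter values here are illustrative assumptions.
    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectKBest, RFE, f_classif
    from sklearn.linear_model import LogisticRegression, Lasso

    X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=42)

    # Univariate selection: keep the 4 features with the highest ANOVA F-scores.
    X_uni = SelectKBest(score_func=f_classif, k=4).fit_transform(X, y)

    # Recursive Feature Elimination: repeatedly drop the weakest feature of a wrapped model.
    rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=4).fit(X, y)
    selected_mask = rfe.support_

    # Model-specific selection: LASSO shrinks irrelevant coefficients to exactly zero.
    lasso = Lasso(alpha=0.1).fit(X, y)
    nonzero_features = (lasso.coef_ != 0)

    # Dimensionality reduction: project onto 4 principal components.
    X_pca = PCA(n_components=4).fit_transform(X)

    # Feature importance from a tree-based model (higher means larger contribution).
    forest = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)
    importances = forest.feature_importances_

    # SHAP values (third-party `shap` package) give per-prediction contributions, e.g.:
    #   import shap; explainer = shap.TreeExplainer(forest); shap_values = explainer.shap_values(X)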

In the next article, we will expand on selected topics, such as imputation and normalization, in more detail. By mastering these techniques and incorporating them into your data science projects, you can ensure that your data is properly prepared for modeling, leading to more accurate insights and better decision-making.

Please feel free to contact me; I am happy to share and exchange ideas on topics related to Data Science and Supply Chain.
Facebook:
facebook.com/nattapong.thanngam
Linkedin:
linkedin.com/in/nattapong-thanngam
