Part-2 :: Understanding and Handling Imbalanced Numerical Data in Machine Learning

Prathamesh Amrutkar
5 min read · Aug 1, 2024


Numerical data can present its own set of challenges, including scaling, normalization, and handling missing values. Proper preprocessing of numerical data is essential for building effective machine learning models. This section delves into various methods for handling numerical data, with detailed explanations and examples.

Table of Contents

A. Scaling and Normalization
B. Handling Missing Values
C. Feature Engineering
D. Dimensionality Reduction
E. Dealing with Outliers
F. Feature Selection

A. Scaling and Normalization

Scaling and normalization ensure that numerical features are on a comparable scale, which can improve the performance of machine learning algorithms.

  1. Standardization (Z-score normalization)
  • Explanation: Transforming the data to have a mean of 0 and a standard deviation of 1. This is useful when the data follows a Gaussian distribution.
  • Syntax:
from sklearn.preprocessing import StandardScaler 
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
  • Example: Consider a dataset with the features height (in centimeters) and weight (in kilograms). Standardizing these features will result in both having zero mean and unit variance, making them comparable.
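
To make this concrete, here is a minimal sketch with made-up height and weight values (the numbers are purely illustrative):

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical sample data: height in centimeters, weight in kilograms
data = pd.DataFrame({'height': [150, 160, 170, 180, 190],
                     'weight': [50, 60, 70, 80, 90]})

scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
print(scaled_data.mean(axis=0))  # approximately [0., 0.]
print(scaled_data.std(axis=0))   # approximately [1., 1.]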

2. Min-Max Scaling

  • Explanation: Scaling the data to a fixed range, usually [0, 1]. This is useful when the data does not follow a Gaussian distribution or when an algorithm expects features within a bounded range.
  • Syntax:
from sklearn.preprocessing import MinMaxScaler 
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(data)
  • Example: If we have exam scores ranging from 0 to 100, Min-Max scaling would transform these scores to a range of [0, 1].
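
For instance, with a handful of made-up exam scores:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical exam scores on a 0-100 scale
scores = pd.DataFrame({'score': [35, 60, 72, 88, 100]})

scaler = MinMaxScaler()  # default feature_range is (0, 1)
scaled_scores = scaler.fit_transform(scores)
print(scaled_scores.ravel())  # the minimum (35) maps to 0.0, the maximum (100) to 1.0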

3. Robust Scaling

  • Explanation: Using statistics that are robust to outliers (e.g., median and interquartile range).
  • Syntax:
from sklearn.preprocessing import RobustScaler 
scaler = RobustScaler()
scaled_data = scaler.fit_transform(data)
  • Example: For a dataset with household incomes, where most values are clustered but some are extremely high, robust scaling can prevent the outliers from skewing the scaling.
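
A quick illustration with made-up household incomes, one of them extreme:

import pandas as pd
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical incomes with one extreme outlier
incomes = pd.DataFrame({'income': [30_000, 35_000, 40_000, 45_000, 1_000_000]})

robust = RobustScaler().fit_transform(incomes)
standard = StandardScaler().fit_transform(incomes)
print(robust.ravel())    # typical incomes keep a meaningful spread; the outlier stays extreme
print(standard.ravel())  # standardization squashes the typical incomes together, since the outlier inflates the standard deviation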

B. Handling Missing Values

Handling missing values is crucial to avoid biases and inaccuracies in the model.

  1. Imputation

a. Mean/Median/Mode Imputation: Replacing missing values with the mean, median, or mode of the column.

  • Syntax:
from sklearn.impute import SimpleImputer 
imputer = SimpleImputer(strategy='mean')
imputed_data = imputer.fit_transform(data)
  • Example: For a column with ages, missing values can be replaced with the mean age.
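
A minimal sketch with made-up ages (NaN marks the missing entry):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical ages with one missing value
ages = pd.DataFrame({'age': [25, 30, np.nan, 40, 45]})

imputer = SimpleImputer(strategy='mean')
imputed_ages = imputer.fit_transform(ages)
print(imputed_ages.ravel())  # the NaN is replaced with the column mean, 35.0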

b. K-Nearest Neighbors Imputation: Using the nearest neighbors to estimate and replace missing values.

  • Syntax:
from sklearn.impute import KNNImputer 
imputer = KNNImputer(n_neighbors=5)
imputed_data = imputer.fit_transform(data)
  • Example: In a dataset with features height and weight, missing height values can be imputed based on the height values of the nearest neighbors in the feature space.
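
Here is a small sketch with made-up height/weight pairs; n_neighbors is set to 2 (rather than 5) only to keep the arithmetic easy to follow:

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical data with one missing height
data = pd.DataFrame({'height': [150, 160, np.nan, 180, 190],
                     'weight': [50, 60, 70, 80, 90]})

imputer = KNNImputer(n_neighbors=2)
imputed_data = imputer.fit_transform(data)
print(imputed_data)
# The missing height is filled with the average height (170) of the two rows
# whose weight is closest to 70, i.e. the 60 kg and 80 kg rows.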

c. Predictive Modeling: Using machine learning models to predict and impute missing values.

  • Syntax:
from sklearn.experimental import enable_iterative_imputer 
from sklearn.impute import IterativeImputer
imputer = IterativeImputer()
imputed_data = imputer.fit_transform(data)
  • Example: Using a regression model to predict missing values in a column based on other available features.
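
A minimal sketch where one feature is (by construction) a linear function of the other, so the imputed value is easy to sanity-check:

import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical data: y is roughly 2 * x, with one y missing
data = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                     'y': [2, 4, np.nan, 8, 10]})

imputer = IterativeImputer()  # uses BayesianRidge as the default estimator
imputed_data = imputer.fit_transform(data)
print(imputed_data)  # the missing y is predicted from x, roughly 6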

C. Feature Engineering

Feature engineering involves creating new features or transforming existing ones to improve the performance of machine learning models.

  1. Binning: Converting continuous numerical features into categorical bins.
  • Syntax:
import pandas as pd 
bins = [0, 10, 20, 30, 40, 50]
labels = ['0-10', '10-20', '20-30', '30-40', '40-50']
data['binned'] = pd.cut(data['feature'], bins=bins, labels=labels)
  • Example: Dividing ages into bins such as 0–18, 19–35, 36–50, and 51+.
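
Using the age-group example above, a minimal sketch with made-up ages:

import pandas as pd

# Hypothetical ages binned into the groups mentioned in the example
ages = pd.DataFrame({'age': [5, 17, 25, 40, 60]})
bins = [0, 18, 35, 50, 120]
labels = ['0-18', '19-35', '36-50', '51+']
ages['age_group'] = pd.cut(ages['age'], bins=bins, labels=labels)
print(ages)  # 5 and 17 -> '0-18', 25 -> '19-35', 40 -> '36-50', 60 -> '51+'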

2. Polynomial Features: Creating new features by raising existing features to a power.

  • Syntax:
from sklearn.preprocessing import PolynomialFeatures 
poly = PolynomialFeatures(degree=2)
poly_features = poly.fit_transform(data)
  • Example: If we have a feature x, adding a new feature x^2 can help capture non-linear relationships.
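
For a single made-up feature x, the transform looks like this:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical single feature
x = np.array([[1], [2], [3]])

poly = PolynomialFeatures(degree=2)
poly_features = poly.fit_transform(x)
print(poly_features)
# Columns are [1, x, x^2]; for example the row for x=3 is [1., 3., 9.]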

3. Interaction Features: Creating new features by multiplying existing features.

  • Syntax:
from sklearn.preprocessing import PolynomialFeatures 
poly = PolynomialFeatures(degree=2, interaction_only=True)
interaction_features = poly.fit_transform(data)
  • Example: For features height and weight, adding an interaction feature height * weight can provide additional information.
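
A small sketch with made-up height/weight pairs; include_bias=False is added here only to drop the constant column and keep the output easy to read:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical height (cm) and weight (kg) pairs
data = np.array([[170, 70], [180, 80]])

poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
interaction_features = poly.fit_transform(data)
print(interaction_features)
# Columns are [height, weight, height*weight]; squared terms are excluded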

4. Log Transform: Applying the logarithm function to skewed features to make them more normally distributed.

  • Syntax:
import numpy as np  
log_transformed_data = np.log(data + 1)
  • Example: If the income feature is highly skewed, applying a log transform can make it more normally distributed, improving the performance of algorithms that assume normality.
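
To see the effect on skewness, here is a sketch with simulated, lognormal-style incomes (the distribution parameters are made up):

import numpy as np
import pandas as pd

# Simulate strongly right-skewed incomes
rng = np.random.RandomState(0)
incomes = pd.Series(np.exp(rng.normal(10.5, 0.8, size=1000)))

log_incomes = np.log(incomes + 1)
print(round(incomes.skew(), 2))      # strongly positive (right-skewed)
print(round(log_incomes.skew(), 2))  # close to 0 after the transform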

5. Discretization: Converting continuous features into categorical bins automatically. Unlike manual binning with pd.cut, KBinsDiscretizer learns the bin edges from the data itself.

  • Syntax:
from sklearn.preprocessing import KBinsDiscretizer  
discretizer = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
discretized_data = discretizer.fit_transform(data)
  • Example: For a feature representing age, discretization can convert it into age groups like [0–10], [11–20], etc.
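
A minimal sketch with made-up ages, using equal-width bins:

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical ages
ages = np.array([[3], [12], [25], [47], [68]])

discretizer = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
age_bins = discretizer.fit_transform(ages)
print(age_bins.ravel())        # ordinal bin index per row, from 0 to 4
print(discretizer.bin_edges_)  # the equal-width edges learned from the data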

D. Dimensionality Reduction

Dimensionality reduction techniques are used to reduce the number of features in the dataset, which can help improve the performance and interpretability of machine learning models.

  1. Principal Component Analysis (PCA): Reducing the dimensionality of the data while retaining most of the variance.
  • Syntax:
from sklearn.decomposition import PCA 
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(data)
  • Example: For a dataset with 50 features, PCA can reduce it to a smaller set of principal components that explain most of the variance.
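
A minimal sketch with simulated data that has many correlated features (the dimensions are made up):

import numpy as np
from sklearn.decomposition import PCA

# Simulate 100 samples with 50 features driven by only ~5 underlying directions
rng = np.random.RandomState(42)
data = rng.rand(100, 5) @ rng.rand(5, 50)

pca = PCA(n_components=2)
reduced_data = pca.fit_transform(data)
print(reduced_data.shape)             # (100, 2)
print(pca.explained_variance_ratio_)  # share of the variance captured by each component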

2. t-Distributed Stochastic Neighbor Embedding (t-SNE): Reducing dimensionality for visualization purposes.

  • Syntax:
from sklearn.manifold import TSNE 
tsne = TSNE(n_components=2)
reduced_data = tsne.fit_transform(data)
  • Example: Visualizing high-dimensional data in 2D space to identify clusters.
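
As a quick sketch, using scikit-learn's built-in digits dataset purely for illustration:

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# 64-dimensional images of handwritten digits, 10 classes
X, y = load_digits(return_X_y=True)

tsne = TSNE(n_components=2, random_state=42)
embedded = tsne.fit_transform(X)
print(embedded.shape)  # (1797, 2) -- ready to scatter-plot and colour by digit label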

3. Linear Discriminant Analysis (LDA): Reducing dimensionality while maintaining class separability.

  • Syntax:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA 
lda = LDA(n_components=2)
reduced_data = lda.fit_transform(data, target)
  • Example: Reducing features in a classification problem while maximizing the separation between classes.
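
A minimal sketch using scikit-learn's built-in iris dataset (3 classes, so at most 2 discriminant components are possible):

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

X, y = load_iris(return_X_y=True)

lda = LDA(n_components=2)
reduced = lda.fit_transform(X, y)
print(reduced.shape)  # (150, 2); the projection maximizes separation between the classes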

E. Dealing with Outliers

Outliers can skew the results of machine learning models, so it’s important to detect and handle them appropriately.

  1. Z-score Method: Identifying outliers by their Z-scores.
  • Syntax:
import numpy as np
from scipy import stats
z_scores = np.abs(stats.zscore(data))
outliers = np.where(z_scores > 3)
  • Example: In a dataset of exam scores, identifying scores that are more than 3 standard deviations away from the mean as outliers.
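
A small sketch with simulated exam scores plus one extreme value (the numbers are made up):

import numpy as np
import pandas as pd
from scipy import stats

# 50 typical scores around 60, plus one extreme score of 150
rng = np.random.RandomState(0)
scores = pd.DataFrame({'score': np.append(rng.normal(loc=60, scale=5, size=50), 150)})

z_scores = np.abs(stats.zscore(scores))
outlier_rows = np.where(z_scores > 3)[0]
print(outlier_rows)  # only the row holding the extreme score exceeds a Z-score of 3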

2. IQR Method: Identifying outliers using the interquartile range (IQR).

  • Syntax:
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
outliers = data[((data < (Q1 - 1.5 * IQR)) | (data > (Q3 + 1.5 * IQR)))]
  • Example: For a dataset with house prices, identifying prices that are significantly lower or higher than the typical range as outliers.
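
A minimal sketch with made-up house prices (in thousands):

import pandas as pd

# Hypothetical house prices with one extreme value
prices = pd.Series([250, 270, 300, 310, 320, 330, 350, 360, 380, 2_500])

Q1 = prices.quantile(0.25)
Q3 = prices.quantile(0.75)
IQR = Q3 - Q1
outliers = prices[(prices < Q1 - 1.5 * IQR) | (prices > Q3 + 1.5 * IQR)]
print(outliers)  # only the 2,500 value falls outside the 1.5 * IQR fences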

3. Handling:

a. Removal: Simply removing outliers from the dataset.
  • Syntax:
# z_scores as computed with the Z-score method above
data_cleaned = data[(z_scores < 3).all(axis=1)]
  • Example: Removing outliers from a dataset of car prices to focus on the majority of typical car prices.

b. Transformation: Applying transformations to reduce the impact of outliers.

  • Syntax:
log_transformed_data = np.log(data + 1)
  • Example: Applying a log transformation to a feature with extreme values, like income, to reduce the impact of outliers.

F. Feature Selection

Feature selection involves selecting the most relevant features for the model, which can improve performance and reduce overfitting.

  1. Univariate Selection: Selecting features based on univariate statistical tests.
  • Syntax:
from sklearn.feature_selection import SelectKBest, chi2  
selector = SelectKBest(chi2, k=10)
selected_features = selector.fit_transform(data, target)
  • Example: Selecting the top 10 features with the highest chi-squared statistics in a dataset for classification.
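
A runnable sketch using scikit-learn's built-in breast cancer dataset, whose 30 features are all non-negative (a requirement of the chi-squared test):

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_breast_cancer(return_X_y=True)

selector = SelectKBest(chi2, k=10)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)                    # (569, 10)
print(selector.get_support(indices=True))  # indices of the 10 selected features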

2. Recursive Feature Elimination (RFE): Recursively selecting features by considering smaller and smaller sets of features.

  • Syntax:
from sklearn.feature_selection import RFE  
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
rfe = RFE(model, n_features_to_select=10)
selected_features = rfe.fit_transform(data, target)
  • Example: In a dataset for predicting customer churn, using RFE to select the top 10 features that contribute the most to the prediction.
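
A minimal sketch on simulated, churn-like classification data (the dataset shape is made up):

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# 1,000 samples, 25 features, only some of them informative
X, y = make_classification(n_samples=1000, n_features=25, n_informative=8, random_state=42)

model = LogisticRegression(max_iter=1000)
rfe = RFE(model, n_features_to_select=10)
X_selected = rfe.fit_transform(X, y)
print(X_selected.shape)  # (1000, 10)
print(rfe.support_)      # boolean mask marking the retained features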

Conclusion

Proper handling of numerical data is a critical step in the machine learning pipeline. Techniques such as scaling, normalization, imputation, feature engineering, dimensionality reduction, and outlier detection ensure that the data is in a suitable form for building effective models. Implementing these techniques can significantly enhance the performance and accuracy of your machine learning models.

For more information on classification techniques, handling imbalanced data, and other aspects of machine learning, please refer to Part-1: Understanding Imbalanced Classification Data with Machine Learning and Handling Imbalanced Data in Machine Learning.

For more insights and projects, you can connect with me on LinkedIn and explore my work on GitHub.
