Data Preprocessing Steps for Machine Learning in Python (Part 2)

Learn with Nas
Women in Technology
Oct 5, 2023

In the first part of this article, we walked through the first steps of data preprocessing: collecting the data and cleaning it, which includes handling missing values, outliers, and duplicate records. We also presented data transformation techniques using pandas functions such as groupby and pivot_table. The fourth step was a detailed exploration of feature engineering, specifically scaling, normalization, and standardization.

In this second part, we take a closer look at feature selection, handling imbalanced datasets, encoding categorical features, and data splitting.

Step 5: Feature Selection

Feature selection involves the careful selection of optimal features for your model. While the specific technique may vary, the primary objective remains consistent: to identify the features that exert a greater influence on your model’s performance [12].

Feature selection techniques in machine learning can be broadly categorized into the following groups:

  1. Supervised Techniques: These methods are applicable to labeled data and help identify pertinent features to enhance the performance of supervised models such as classification and regression. Examples include linear regression, decision trees, and SVM [13].
  2. Unsupervised Techniques: These methods are suitable for unlabeled data. Examples include K-Means Clustering, Principal Component Analysis, and Hierarchical Clustering [13].

Feature selection methods:

  1. Filter methods identify the inherent characteristics of features through univariate statistics rather than relying on cross-validation performance. They are quicker and less computationally intensive compared to wrapper methods. When working with high-dimensional data, filter methods are a more computationally efficient choice [13].
  2. Wrapper methods necessitate a mechanism to explore the entire array of potential feature subsets, evaluating their efficacy through training and assessing a classifier with that particular subset of features. This approach hinges on a specific machine learning algorithm tailored to the dataset at hand. It involves an exhaustive search, assessing all feasible feature combinations against the evaluation criterion. Typically, wrapper methods yield superior predictive accuracy compared to filter methods [13].
  3. Embedded methods combine the advantages of both wrapper and filter methods by considering feature interactions while still keeping computational costs manageable. These methods operate iteratively, managing each iteration of the model training process and selectively extracting features that significantly contribute to training for a given iteration [13].

You can consult the diagram below for insights into each method’s technique:

While I won't delve deeply into each technique here, Analytics Vidhya provides a comprehensive article with detailed descriptions of each one: https://www.analyticsvidhya.com/blog/2020/10/feature-selection-techniques-in-machine-learning/
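To make these three families concrete, here is a minimal sketch using scikit-learn. The synthetic X and y, the choice of estimators, and k=5 are assumptions made purely for illustration, not part of the project used in the examples below:

# Minimal sketches of the three feature-selection families using scikit-learn
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression

# a small synthetic dataset stands in for any numeric feature matrix and target
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# filter method: rank features with a univariate statistic (ANOVA F-test)
filter_selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)

# wrapper method: recursive feature elimination around a chosen estimator
wrapper_selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)

# embedded method: an L1-regularized model shrinks weak features toward zero while training
embedded_selector = SelectFromModel(LogisticRegression(penalty='l1', solver='liblinear')).fit(X, y)

# boolean mask of the selected features for each approach
print(filter_selector.get_support())
print(wrapper_selector.get_support())
print(embedded_selector.get_support())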

Implementing the Correlation Coefficient Technique and Anomaly Detection in Python:

To detect anomalous or outlier data, the KNN algorithm from the 'pyod' library will be used; it flags observations that are unusual or uncommon in the dataset. Before that, a correlation heatmap shows how strongly the features relate to one another and to the target.

# plot a correlation heatmap of all features
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(20, 15))
heatmap = sns.heatmap(df_final.corr(), annot=True)
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize': 12})

Findings

The correlation coefficient between the features sum_active_power and sum_reactive_power is very high (0.96). I will remove sum_reactive_power from the model so that it does not carry redundant, highly collinear information.

# drop sum_reactive_power column
df_final = df_final.drop('sum_reactive_power', axis=1)

The target variable is 'end_temp'. Next, I will evaluate the correlation between the features and the target.

df_final.corr()['end_temp'].sort_values()

Findings

1. From the correlation coefficients above, across the 25 columns most values fall between -0.1 and 0.1. This indicates that the relationship between the target and the other features is relatively weak.

2. Because of this, only features with a correlation coefficient greater than 0.1 or less than -0.1 will be used.

Action to take

Use the features with a correlation coefficient greater than 0.1 or less than -0.1 to check for anomalous data.

from pyod.models.knn import KNN

# features with a correlation to the target stronger than |0.1|
df_final.corr()['end_temp'][df_final.corr()['end_temp'] > 0.1]
df_final.corr()['end_temp'][df_final.corr()['end_temp'] < -0.1]

# keep only those features for outlier detection
outliers = df_final[['sum_active_power', 'start_temp', 'end_temp',
                     'bulk_12', 'bulk_6', 'wire_2', 'wire_4', 'wire_7']].copy()

# fit a KNN-based outlier detector from the pyod library and flag anomalies
model = KNN()
model.fit(outliers)
outliers['is_outlier'] = model.predict(outliers) == 1
outliers_knn = outliers['is_outlier'].sum()
print("Number of Anomalies:", outliers_knn)

Result:

These anomalous data will be dropped from the dataset.

# indices of the rows flagged as outliers
outlier_keys = list(outliers[outliers['is_outlier'] == 1].index)
good_keys = list(set(outliers.index) - set(outlier_keys))

# drop the anomalous rows from the dataset
df_final = df_final.drop(outlier_keys)
df_final.shape

You can visit my GitHub account to access the complete code related to the above example.

Step 6: Handling Imbalanced Data

Imbalanced data denotes datasets in which the distribution of observations within the target class is uneven. In other words, one class label has a significantly higher number of observations, while the other has a notably lower number [14].

Some approaches to handling imbalanced data set problem:

  1. Choose Proper Evaluation Metric: An initial consideration is the nature of the challenge you aim to address in your Machine Learning endeavor. Does it involve classifying data, predicting distinct labels like “spam” or “not spam”? Alternatively, is it a regression task, entailing predictions of continuous values like housing prices or customer ratings? Your choice of evaluation metrics, encompassing accuracy, precision, recall, or prediction error, will depend on the specific problem type [15].
  2. Resampling (Oversampling or Undersampling) involves adjusting the sample size of either the minority or majority class, either increasing or decreasing it. In the case of an imbalanced dataset, we can use oversampling to increase the representation of the minority class with replacement. Conversely, undersampling entails randomly removing rows from the majority class to align its size with that of the minority class. Following this sampling process, we obtain a well-balanced dataset with comparable sizes for both major and minor classes. This balance ensures that the classifier assigns equal significance to both classes when they have a similar number of records in the dataset [14].
  3. SMOTE (Synthetic Minority Oversampling Technique) is an alternative approach for oversampling the minority class. Mere duplication of minority class records doesn't usually bring new insights to the model. In SMOTE, new instances are created by synthesizing data from the existing records. To put it plainly, SMOTE examines instances in the minority class, selects a random nearest neighbor using k-nearest neighbors, and generates a synthetic instance randomly within the feature space [14] (see the sketches after this list).
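To illustrate point 1 above, here is a minimal sketch (using scikit-learn metrics on a made-up label vector) of why accuracy alone is misleading on imbalanced data, while precision, recall, and F1 reveal the problem:

# a classifier that always predicts the majority class (0) looks accurate
# but completely misses the minority class (1)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0] * 90 + [1] * 10   # 90% majority class, 10% minority class
y_pred = [0] * 100             # naive model: always predict the majority class

print("accuracy :", accuracy_score(y_true, y_pred))                    # 0.90
print("precision:", precision_score(y_true, y_pred, zero_division=0))  # 0.00
print("recall   :", recall_score(y_true, y_pred))                      # 0.00
print("f1       :", f1_score(y_true, y_pred))                          # 0.00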
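Point 3 is not covered in the walkthrough below, so here is a minimal SMOTE sketch using the imbalanced-learn library; the features_train and target_train names mirror the later examples but stand in here for any imbalanced training split:

# SMOTE synthesizes new minority-class rows instead of merely duplicating them;
# it requires the imbalanced-learn package (pip install imbalanced-learn)
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
features_smote, target_smote = smote.fit_resample(features_train, target_train)

# both classes now contain the same number of rows
print(target_smote.value_counts())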

Implementing Upsampling and Downsampling in Python:

# check data composition of target column in percentage
df['exited'].value_counts()/df.shape[0] * 100

Result:

The composition of the target data is not ideal due to an imbalance: most of the observations have the value 0, so the model will tend to predict 0, which results in poor performance on the minority class (value 1). To address this imbalance, techniques like upsampling (increasing the frequency of value 1) or downsampling (reducing the frequency of value 0) can be employed. Keep in mind that simple upsampling only duplicates existing minority rows and downsampling discards majority rows, so neither technique adds genuinely new information.

Upsampling:

# function to upsample the minority class (target == 1) by repeating its rows
import pandas as pd
from sklearn.utils import shuffle

def upsample(features, target, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)

    features_upsampled, target_upsampled = shuffle(
        features_upsampled, target_upsampled, random_state=42)

    return features_upsampled, target_upsampled

Before upsampling:

# check data composition
target_train.value_counts()
# apply the function
features_upsampled, target_upsampled = upsample(features_train, target_train, 4)

After upsampling:

# data composition after upsampling
target_upsampled.value_counts()

Result:

Downsampling:

# function to downsample the majority class (target == 0) by keeping only a fraction of its rows
def downsample(features, target, fraction):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_downsampled = pd.concat(
        [features_zeros.sample(frac=fraction, random_state=42)] + [features_ones])
    target_downsampled = pd.concat(
        [target_zeros.sample(frac=fraction, random_state=42)] + [target_ones])

    features_downsampled, target_downsampled = shuffle(
        features_downsampled, target_downsampled, random_state=42)

    return features_downsampled, target_downsampled

Before downsampling:

# check data composition
target_train.value_counts()

Result:

# apply the function
features_downsampled, target_downsampled = downsample(features_train, target_train, 0.3)

After downsampling:

# check data composition after downsampling
target_downsampled.value_counts()

Result:

You can visit my GitHub account to access the complete code related to the above example.

Step 7: Encoding Categorical Features

Encoding categorical features involves transforming categorical data into integer format, enabling its utilization in various models. Categorical data typically exists in the form of strings or object data types. However, machine learning algorithms exclusively operate on numerical values. Hence, it’s crucial to convert categorical features or data into a numeric representation [16].

How to Encode Categorical Features

Ordinal Encoding

When converting ordinal data to numeric form, it’s crucial to preserve the inherent order. In this process, each category is assigned a numerical value ranging from 0 to the total number of categories. For instance, if we have three categories like ‘bad,’ ‘average,’ and ‘good,’ they would be encoded as 0, 1, and 2 respectively, ensuring the order is preserved [16].
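The worked example later in this step only covers nominal encoding, so here is a minimal ordinal-encoding sketch; the 'quality' column, its values, and the chosen category order are assumptions made for illustration:

# ordinal encoding: map ordered categories to integers that preserve their order
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df_example = pd.DataFrame({'quality': ['bad', 'good', 'average', 'bad']})

# passing the categories explicitly fixes the order: bad -> 0, average -> 1, good -> 2
encoder = OrdinalEncoder(categories=[['bad', 'average', 'good']])
df_example['quality_encoded'] = encoder.fit_transform(df_example[['quality']]).ravel()
print(df_example)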

Nominal Encoding

Nominal data lacks any inherent order. Encoding categories in an ordinal manner (e.g., using numbers like 0, 1, 2) can mislead machine learning algorithms into assigning unwarranted importance (e.g., assigning higher importance to 2 over 0 and 1) when, in fact, the data is nominal and lacks such inherent order. Consequently, this approach is not suitable for nominal data. For nominal data, a better approach involves transforming each category into a distinct column and assigning binary values (0 or 1) based on the presence of the respective category. Consequently, this transformation increases the number of features, with each category represented as a separate column [16].

Implementing Nominal Encoding in Python:

The dataset:

# define nominal features
df_categorical = ['geography', 'gender']

We will implement get_dummies from the Pandas Library for nominal encoding, specifically a technique called one-hot encoding. One-hot encoding is a common method to convert categorical data (including nominal data) into a numerical format suitable for machine learning [17].

df = pd.get_dummies(df, drop_first=True, columns = df_categorical)

The output shows how each of the categories is transformed into a column.

You can visit my GitHub account to access the complete code related to the above example.

Step 8: Data Splitting

In the realm of data science and machine learning, data splitting is a crucial step. It involves partitioning the given dataset into two or more subsets and facilitating model training, validation, and testing. This process is fundamental, especially when building models reliant on the dataset. Typically, we divide the main dataset into two or three parts to achieve this [18].

How does Data Splitting work?

When engaged in supervised machine learning tasks, it’s advisable to partition the data into three distinct sets: the training set, the testing set, and the validation set. But what exactly does this entail?

  1. Training Set: This subset of the main dataset is utilized to educate the model, enabling it to grasp patterns and relationships within the data for accurate predictions. When selecting training data from the entire dataset, it’s essential to prioritize high data representativeness, ensuring an adequate representation for each class. Additionally, the quality of the extracted data should encompass impartiality, as biased data could compromise the accuracy of the model [18].
  2. Validation Set: This set is used to understand the performance of the model in comparison to different models and hyperparameter choices. During the process of constructing a machine learning model, our approach typically involves training multiple models by adjusting parameters or employing various algorithms. For instance, while developing a decision tree model for our dataset, we engaged in hyperparameter tuning, revealing multiple well-performing models under varying conditions. Consequently, the task at hand is to select the optimal model by considering different parameters. A critical observation is that employing the same data for both training and model tuning can result in overfitting, rendering the model incapable of generalization. This is where the validation set plays a crucial role, serving as independent and unbiased data. It facilitates a fair comparison of model performance, aiding in the selection of the best model algorithm or parameters. Once this data aids in identifying the most promising model and parameter combinations, we proceed to put the model into production, estimating its performance. However, it’s advisable not to employ the test data for evaluating the model before finalizing the optimal choice [18].
  3. Test set: As mentioned earlier, following the steps of training, validation, and model selection, the next phase involves putting the selected model into production after evaluating its performance on a specific subset of data known as the test set. Exercise caution during this phase, as premature execution can result in overfitting, yielding an unreliable model performance. The test set should be employed as the ultimate evaluation stage, once validation set usage is concluded, and the final model is chosen [18].

Implementing Data Splitting in Python:

# Split the dataset: 60% train, 20% validation, 20% test
from sklearn.model_selection import train_test_split

# random_state=42 keeps the split reproducible across executions
df_train_valid, df_test = train_test_split(df, test_size=0.2, random_state=42)
df_train, df_valid = train_test_split(df_train_valid, test_size=0.25, random_state=42)

# features and target for training dataset
features_train = df_train.drop('is_ultra', axis=1)
target_train = df_train['is_ultra']

# features and target for validation dataset
features_valid = df_valid.drop('is_ultra', axis=1)
target_valid = df_valid['is_ultra']

# features and target for test dataset
features_test = df_test.drop('is_ultra', axis=1)
target_test = df_test['is_ultra']

Result:

After completing the data preprocessing steps, you’re prepared to commence model training. It’s important to emphasize that data preprocessing encompasses various tasks, including data formatting and feature creation, tailored to the specifics of your AI project.

Thank you for investing your time in reading this article! Stay connected for more enriching content in my forthcoming articles :)

Connect with me on LinkedIn

References:

13. Aman Gupta, Feature Selection Techniques in Machine Learning (2023)

14. Saikat Mazumder, 5 Techniques to Handle Imbalanced Data For a Classification Problem (2023)

15. LinkedIn, What's the best way to pick an evaluation metric for your Machine Learning project?

16. Gowtham S R, Encoding Categorical Data- The Right Way (2022)

https://towardsai.net/p/l/encoding-categorical-data-the-right-way

17. Pandas Documentation

18. Data Science Wizards, A Guide to Data Splitting in Machine Learning (2022)

https://medium.com/@datasciencewizards/a-guide-to-data-splitting-in-machine-learning-49a959c95fa1
