[Data Analysis] Feature engineering (6/9)

Sam Taylor
8 min read · Oct 22, 2023


Learn how to preprocess, select, transform, create, and scale features for optimal results using Python on the Iris dataset.

[This guide is part 6 of a 9-article walkthrough.]

Key concepts:
Data analysis · Feature engineering · Data analysis process · Data analysis projects · VS Code · Python

Photo by Alex Kondratiev on Unsplash

In this guide, we’ll walk you through feature engineering for the Iris Flower dataset using Python and Jupyter notebooks in Visual Studio Code. Even if you’re new to this, we’ve got you covered with step-by-step examples.

Feature engineering with the Iris dataset

To remind ourselves where in the data analysis process feature engineering comes into play, here is a general outline of the data analysis process:

  1. Define Objectives: Clearly understand the goals of your analysis.
  2. Data Acquisition: Obtain the dataset you’ll be working with.
  3. Data Exploration: Explore the dataset to get an initial understanding of its structure and content.
  4. Data Cleaning: Preprocess the data to ensure its quality and consistency.
  5. Data Visualization: Create visualizations to gain insights into the data. Use libraries like Matplotlib, Seaborn, or Plotly to create plots, charts, and graphs.
  6. ➡️ Feature Engineering: Create new features or transform existing ones to enhance the dataset’s predictive power. This can involve:
    ◦ Encoding categorical variables (e.g., one-hot encoding).
    ◦ Scaling numerical features (e.g., standardization or normalization).
    ◦ Extracting relevant information from text or date columns.
    ◦ Creating interaction features.
  7. Statistical Analysis (if applicable): Conduct statistical tests or analyses to answer specific questions or hypotheses.
  8. Machine Learning (if applicable): Split the data into training and testing sets, select an appropriate algorithm, then train the model and evaluate its performance using metrics like accuracy, precision, recall, or F1-score.
  9. Present solution: Interpret the findings in the context of your objectives. Document your analysis process and create a report or presentation summarising your analysis.

Prerequisites

Step 1: Setting Up Your Environment

Before we start, make sure you have the necessary tools and libraries installed:

  • Visual Studio Code (VS Code): A coding environment.
    Step-by-step guide
  • Python: A coding language and the backbone of our data analysis.
    Step-by-step guide
  • Jupyter Extension for VS Code: For interactive notebooks within VS Code.
    Step-by-step guide
  • Pandas, Matplotlib, Seaborn: Python libraries for data manipulation and visualization.
    Step-by-step guide
Installing a Python package via the command terminal (macOS)

Step 2: Creating a Jupyter Notebook

Launch Visual Studio Code, create a new Jupyter Notebook, connect a kernel, and save the notebook with an appropriate name, such as “Iris_Flower_Feature_Engineering.ipynb”.
Step-by-step guide

Step 3: Importing Libraries

In your Jupyter Notebook, start by importing the necessary libraries:
Step-by-step guide

# Data manipulation
import pandas as pd

# Plotting and visualisation
import matplotlib.pyplot as plt
import seaborn as sns

Step 4: Loading the Iris Flower Dataset

We will use a dataset called the Iris dataset. The Iris dataset is a classic dataset in the field of data analysis and machine learning, often used for classification and data exploration.

  • Download the dataset: Download the Iris Flower dataset as a CSV file from a trusted source, for example Kaggle.
    Step-by-step guide
  • Upload the dataset to VS Code: Then, load the CSV dataset into a Pandas DataFrame:
    ◦ Replace ‘your_file_path’ with the actual path to your dataset.
    Step-by-step guide
# Import the iris dataset using pd.read_csv 
# Replace 'your_file_path' with the actual path to your dataset.
df = pd.read_csv('your_file_path/iris.csv')

Step 5: Data exploration

Now, we will take a quick look at the data to familiarise ourselves with it. To do so, we will use the .head() method.
◦ Click here for a more in-depth guide to data exploration.

# Check the first 5 rows of data
df.head()

Step 6: Data cleaning & preprocessing

Finally, we will clean the data by handling missing values, removing duplicates, and addressing outliers:
Step-by-step guide

# Handle missing values
df.dropna(inplace=True)

# Remove duplicates
df.drop_duplicates(inplace=True)

# Outlier removal (if needed)
# Example: remove rows where any numeric value lies more than 2 standard deviations from the mean
import numpy as np
from scipy import stats

numeric_cols = df.select_dtypes(include='number')
df = df[(np.abs(stats.zscore(numeric_cols)) < 2).all(axis=1)]

Feature Engineering

Now we come to the core of the article: feature engineering.

Feature engineering involves creating new features or transforming existing ones to enhance the dataset’s predictive power. This can involve:

  • Feature selection
    Example: correlation analysis
  • Encoding categorical variables
    Example: one-hot encoding
  • Scaling numerical features
    Example: standardisation (Z-score scaling)
    — Transforms data so that the mean is 0 and the standard deviation is 1.²
    Example: normalisation (min-max scaling)
    — Scales data to a smaller range, such as between 0 and 1.²
  • Extracting relevant information from text or date columns
    Example: extracting the month from a timestamp (see the short sketch after this list)
  • Creating interaction features
    Example: Dividing {sepal_length} by {sepal_width} to create a {sepal_ratio} column
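
Since the Iris dataset contains no text or date columns, here is a small, hypothetical sketch of the date-extraction idea mentioned above. The 'order_date' column and its values are invented purely for illustration:

# Hypothetical example: the Iris dataset has no date columns
# Build a tiny DataFrame with a date column stored as text
dates = pd.DataFrame({'order_date': ['2023-01-15', '2023-02-03', '2023-03-22']})

# Convert the text to datetime and extract the month as a new feature
dates['order_date'] = pd.to_datetime(dates['order_date'])
dates['order_month'] = dates['order_date'].dt.month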

Step 1: Feature selection

In the context of data analysis, feature selection is a critical step aimed at determining which variables or features from your dataset will be the most informative and useful for your analysis or model.

In real-world data, it’s common to encounter datasets with numerous features, some of which may be redundant, noisy, or have little impact on the analysis. In such cases, feature selection becomes vital.

To identify and choose the most relevant features, you can use statistical tests, correlation analysis, domain knowledge, or machine learning techniques. By doing so, you can simplify your analysis, reduce computational complexity, and potentially improve the performance of machine learning models by focusing on the most informative attributes.

Keep in mind that the decision of whether to select or drop features greatly depends on the specific dataset and the goals of your analysis. It’s a balance between retaining essential information and simplifying the model to prevent overfitting and improve interpretability.

In the Iris dataset, all four features (“sepal length,” “sepal width,” “petal length,” and “petal width”) are essential. Each feature contributes unique information that helps classify the flowers accurately.

✅ Therefore, in this specific dataset, there’s no need to drop any of these features.
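
To illustrate the correlation-analysis approach mentioned above, here is a minimal sketch that inspects how the four numeric features relate to one another. It assumes the column names used throughout this guide and is purely exploratory; we keep all four features either way:

# Correlation matrix of the four numeric features (exploratory only)
corr = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']].corr()

# Visualise the correlations as a heatmap
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation between Iris features')
plt.show()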

Iris dataset features

If we did want to drop a column, we could do so with the following code:
◦ You can replace {species} with any other column name.

# Drop the 'species' column
# axis=1 tells the .drop() to drop the column (as opposed to a row)

df = df.drop('species', axis=1)
Iris dataset with the species column dropped

Step 2: Encoding categorical variables

The decision of whether to encode the {species} column in the Iris dataset as categorical or to use one-hot encoding depends on the machine learning algorithms you plan to use and the specific requirements of your analysis.

Leaving it as categorical:

  • If you’re using machine learning algorithms that can handle categorical data directly (e.g., decision trees or random forests), you can leave the {species} column as categorical.

One-hot encoding:

  • If you’re using machine learning algorithms that require numerical input and do not inherently handle categorical data (e.g., logistic regression or support vector machines), you should perform one-hot encoding on the {species} column.
    ◦ One-hot encoding will create separate binary columns for each species, making it suitable for these algorithms.

The choice depends on your specific analysis and the machine learning techniques you intend to apply.

  • It’s worth noting that the Iris dataset is commonly used for educational and illustrative purposes, and it’s a relatively small dataset, so one-hot encoding may not lead to significant increases in dimensionality.
  • In practice, for larger datasets with many categories, one-hot encoding may have more noticeable implications for model complexity and performance.

✅ In our case, we will use a decision tree, so there’s no need to encode the species column of our dataset.

Here would be the code to encode the species column, should you wish:

# Perform one-hot encoding on the species column of the Iris dataset
df = pd.get_dummies(df, columns=['species'])
Iris dataset with the species column encoded
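
Alternatively, if you leave the column unencoded (as we do in this guide), you can still tell pandas to treat it explicitly as a categorical type. This is optional and shown only as a sketch:

# Optional: mark the species column as a pandas categorical type
# (skip this if you ran the one-hot encoding above, which removes the column)
df['species'] = df['species'].astype('category')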

Step 3: Feature transformation: scaling numerical features

Now that we have selected and encoded our features, we can standardise them, so that they are on the same scale.

Standardization (Z-Score Scaling):
Standardization, also known as Z-score scaling, transforms data to have a mean (average) of 0 and a standard deviation of 1.

Standardization is more robust to outliers and is useful when the distribution of the data is not known or is assumed to be Gaussian (normal distribution). It centers the data around zero and adjusts the spread.

The formula for standardization is:

Z = (X − μ) / σ

Where:
◦ X is the original data point.
◦ μ is the mean of the feature.
◦ σ is the standard deviation of the feature.
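
To make the formula concrete, here is a small sketch that applies it by hand to a single column. It uses the population standard deviation (ddof=0), which is what scikit-learn’s StandardScaler uses in the code below:

# Apply the Z-score formula manually to one column (illustrative only)
mean = df['sepal_length'].mean()
std = df['sepal_length'].std(ddof=0)  # population standard deviation

sepal_length_scaled = (df['sepal_length'] - mean) / std
print(sepal_length_scaled.head())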

Here’s the code to standardise our dataset:

# Standardise the length and width columns of the iris dataset
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
df[cols] = scaler.fit_transform(df[cols])
The iris dataset that has been standardised (Z-score scaling)
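
If you preferred min-max normalisation (described earlier) over standardisation, a minimal sketch using scikit-learn’s MinMaxScaler would look like this. Note that you would apply one scaling method or the other, not both:

# Alternative: min-max normalisation, scaling each feature to the 0-1 range
# Apply this INSTEAD of the StandardScaler step above, not in addition to it
from sklearn.preprocessing import MinMaxScaler

minmax_scaler = MinMaxScaler()
cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
df[cols] = minmax_scaler.fit_transform(df[cols])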

Step 4: Creating interaction features

In data analysis, interaction features are created by combining two or more variables to capture complex relationships and dependencies between them.

These features provide valuable insights into how the variables interact and influence each other, allowing data analysts to uncover hidden patterns and improve the performance of predictive models.

For example, in a dataset with variables like “temperature” and “humidity,” an interaction feature might be their product (temperature * humidity), as it can reveal the joint effect of both factors on a target variable like “rain.”

Interaction features are especially useful in scenarios where the relationships between variables are non-linear or when the combined effect of multiple variables holds critical information for analysis and prediction, helping data analysts extract more value from their data.

For the Iris dataset, let’s calculate the sepal and petal ratios:

# Create two new columns: sepal_ratio and petal_ratio

df['sepal_ratio'] = df['sepal_length'] / df['sepal_width']
df['petal_ratio'] = df['petal_length'] / df['petal_width']
Adding two new columns {petal_ratio} and {sepal_ratio} to the iris dataset

❗️Note: Since we’ve already scaled/standardised our original features, there’s no need to scale the newly created ones.

🎉 And there we have it! By following these steps, you have gained a good foundation in feature engineering. Don’t forget that feature engineering is an iterative process. So, feel free to experiment with different feature combinations and transformations.

Now that we have a clean dataset with standardised features, we can begin to analyse our dataset with statistics or through machine learning — topics which will be covered in the following articles.

In the meantime, happy analysing!

The data analysis process of the iris dataset up to the feature engineering step

Reference(s):

¹ Elgazar, E. (2023). ‘Most Used Feature Engineering Techniques’. Kaggle. https://www.kaggle.com/code/ebrahimelgazar/most-used-feature-engineering-techniques#4.-Feature-Crossing. Accessed: October 22, 2023.

² Simplilearn. (2023). ‘What is Data Standardization?’. https://www.simplilearn.com/what-is-data-standardization-article. Accessed: October 22, 2023.


Sam Taylor

Operations Analyst & Data Enthusiast. Sharing insights to support aspiring data analysts on their journey 🚀. Discover more at: https://samtaylor92.github.io