Mastering Exploratory Data Analysis

Prepare data, select and optimize models, and evaluate their performance for accurate machine learning solutions.

🐼 panData
Apr 7, 2023 · 10 min read
Photo by Hannes Köttner on Unsplash

Exploratory Data Analysis (EDA) is crucial for developing effective machine learning models. This article discusses the various techniques and methods used in EDA, such as scatter plots, histograms, box plots, and descriptive statistics, to identify trends and patterns in datasets. It also explains how to handle missing values and transform categorical and numerical data for optimal data preparation.

Additionally, the article covers the process of model selection and optimization, including hyperparameter tuning using GridSearchCV, evaluating model performance using metrics like accuracy, classification report, confusion matrix, and ROC-AUC score, and cross-validation. By following these steps, readers can create accurate and effective machine learning solutions.

Why is it important to perform EDA?

EDA reveals the structure and quality of a dataset before any modeling begins: it exposes trends, patterns, outliers, and missing values, and it guides decisions about data preparation, model selection, and evaluation. Skipping this step risks building models on flawed or misunderstood data.

Methods and techniques of EDA

There are several techniques and methods for performing an EDA, such as scatter plots, histograms, box plots, and descriptive statistics. The choice of techniques depends on the nature of the data and the goal of the analysis.

Data visualization

Data visualization is a powerful tool for identifying trends and patterns in datasets. Charts such as lines, bars, scatter, and box plots facilitate the identification of relationships between variables, frequency distributions, and the presence of outliers.

import seaborn as sns
import matplotlib.pyplot as plt

# Load the Iris dataset
iris = sns.load_dataset('iris')

# Create a pairplot with the species column used for coloring
plot = sns.pairplot(iris, hue='species', height=2.5)

# Add a title to the plot
plot.fig.suptitle("Pairwise relationships between features of the Iris dataset", y=1.05)

# Show the plot
plt.show()

Descriptive statistics

Descriptive statistics summarize important information about the data, such as mean, median, mode, range, variance, and standard deviation. These measures can help identify central tendencies, dispersion, and skewness in the data.
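As a quick illustration, all of these measures are available directly in pandas. The snippet below is only a minimal sketch, assuming the Iris dataset loaded via seaborn (the same dataset used throughout this article):

import seaborn as sns

# Load the Iris dataset (illustrative example)
iris = sns.load_dataset('iris')

# Central tendency
print(iris['sepal_length'].mean())     # mean
print(iris['sepal_length'].median())   # median
print(iris['sepal_length'].mode()[0])  # mode (first modal value)

# Dispersion
print(iris['sepal_length'].max() - iris['sepal_length'].min())  # range
print(iris['sepal_length'].var())      # variance
print(iris['sepal_length'].std())      # standard deviation

# Skewness
print(iris['sepal_length'].skew())

# Or summarize all numerical columns at once
print(iris.describe())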

Trends and patterns

Recognizing trends and patterns in the data is crucial for building effective machine learning models. These patterns can reveal valuable information about the data and provide insights to enhance models and improve predictions.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Option 1: load data from a local CSV file into a pandas DataFrame
data_path = 'data.csv'
df = pd.read_csv(data_path)

# Option 2: load the Iris dataset directly from the seaborn-data repository
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"
df = pd.read_csv(url)

# Rename the columns to snake_case (needed when the data uses sklearn-style
# names such as 'sepal length (cm)'; the seaborn CSV above already uses the
# snake_case names, so the rename is a no-op in that case)
new_column_names = {
    'sepal length (cm)': 'sepal_length',
    'sepal width (cm)': 'sepal_width',
    'petal length (cm)': 'petal_length',
    'petal width (cm)': 'petal_width'
}

df = df.rename(columns=new_column_names)

# Print the new column names
print(df.columns)

# Print the first 5 rows of the DataFrame
print(df.head())

# Check the shape of the DataFrame
print(df.shape)

# Check the data types of the columns
print(df.dtypes)

# Check for missing values
print(df.isnull().sum())

# Check basic statistics of the numerical columns
print(df.describe())

# Define function to plot box plot
def plot_boxplot(data, column):
    sns.boxplot(data=data, x=column)
    sns.despine()
    plt.show()

# Define function to plot histogram
def plot_histogram(data, column):
    sns.histplot(data=data, x=column, kde=True, color='purple')
    plt.xlabel(column)
    plt.ylabel('Count')
    plt.title(f'Distribution of {column}')
    sns.despine()
    plt.show()

# Define function to plot scatter plot
def plot_scatterplot(data, column1, column2):
    sns.scatterplot(data=data, x=column1, y=column2, color='green')
    plt.xlabel(column1)
    plt.ylabel(column2)
    plt.title(f'Scatter plot of {column1} and {column2}')
    sns.despine()
    plt.show()

# Define function to plot correlation heatmap (numeric columns only)
def plot_heatmap(data):
    sns.heatmap(data.select_dtypes(include='number').corr(), cmap='coolwarm', annot=True)
    plt.title('Correlation heatmap')
    plt.show()

# Define function to plot time-series data
def plot_timeseries(data, date_column, column):
    data = data.copy()  # work on a copy so the caller's DataFrame is untouched
    data[date_column] = pd.to_datetime(data[date_column])
    data = data.set_index(date_column)
    sns.lineplot(data=data, x=data.index, y=column, color='red')
    plt.xlabel('Date')
    plt.ylabel(column)
    plt.title(f'Trend of {column} over time')
    sns.despine()
    plt.show()

# Define function to plot count plot
def plot_countplot(data, column):
    sns.countplot(data=data, x=column, palette='pastel')
    plt.xlabel(column)
    plt.ylabel('Count')
    plt.title(f'Distribution of {column}')
    sns.despine()
    plt.show()

# Define function to plot pair plot
def plot_pairplot(data, columns):
    sns.pairplot(data=data, vars=columns)
    plt.show()

# Example usage of the defined functions on the Iris DataFrame
plot_boxplot(df, 'sepal_length')
plot_histogram(df, 'sepal_length')
plot_scatterplot(df, 'sepal_length', 'petal_length')
plot_heatmap(df)
# plot_timeseries(df, 'date_column', 'value_column')  # requires a dataset with a date column
plot_countplot(df, 'species')
plot_pairplot(df, ['sepal_length', 'sepal_width', 'petal_length'])

Data preparation and cleaning

Identifying and handling missing values

Missing values are common in datasets and can negatively impact the effectiveness of machine learning models. It is essential to identify and handle these values, either by filling them with appropriate information or removing them.

# Importing required libraries
import pandas as pd
import numpy as np

# Loading the dataset
data = pd.read_csv("path/to/your/dataset.csv")

# Checking for missing values
missing_values = data.isnull().sum()
print("Missing values by columns:\n", missing_values)

# Handling missing values

# 1. Removing rows with missing values
data_cleaned = data.dropna()

# 2. Filling missing values with a constant
data_filled_constant = data.fillna(value=0)

# 3. Filling missing values with the column mean (numeric columns only)
data_filled_mean = data.fillna(data.mean(numeric_only=True))

# 4. Filling missing values with the column median (numeric columns only)
data_filled_median = data.fillna(data.median(numeric_only=True))

# 5. Filling missing values with the mode of the column
data_filled_mode = data.fillna(data.mode().iloc[0])

# 6. Using forward-fill to propagate the last valid observation
data_filled_ffill = data.ffill()

# 7. Using backward-fill to propagate the next valid observation
data_filled_bfill = data.bfill()

# 8. Interpolating missing values in numeric columns
numeric_cols = data.select_dtypes(include='number').columns
data_filled_interpolate = data.copy()
data_filled_interpolate[numeric_cols] = data[numeric_cols].interpolate()

# Choose the appropriate method to handle missing values based on your specific dataset and save it as 'data_preprocessed'
data_preprocessed = data_filled_mean # For example, using the mean method

This template imports the necessary libraries, loads the dataset, checks for missing values, and provides various methods for handling them.

Treating categorical and numerical data

Categorical and numerical data must be treated differently during data preparation. It is necessary to transform categorical data into numerical variables using techniques such as one-hot encoding, while numerical data can be normalized or standardized.

# Importing required libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler, MinMaxScaler

# Assuming 'data_preprocessed' is the dataset after handling missing values

# Handling categorical data
# 1. Label encoding

label_encoder = LabelEncoder()

data_label_encoded = data_preprocessed.copy()
data_label_encoded['categorical_column'] = label_encoder.fit_transform(
data_preprocessed['categorical_column'])

# 2. One-hot encoding (pd.get_dummies creates one binary column per category;
#    sklearn's OneHotEncoder performs the same transformation inside pipelines)
data_one_hot_encoded = pd.get_dummies(
    data_preprocessed, columns=['categorical_column'])

# Choose the appropriate method to handle categorical data based on your specific dataset and save it as 'data_with_categorical_handled'
data_with_categorical_handled = data_one_hot_encoded
# For example, using one-hot encoding

# ---------------------------------------------------------------------------#

# Handling numerical data
# 1. Standard scaling
standard_scaler = StandardScaler()
data_standard_scaled = data_with_categorical_handled.copy()
data_standard_scaled['numerical_column'] = standard_scaler.fit_transform(data_with_categorical_handled[['numerical_column']])

# 2. Min-max scaling
min_max_scaler = MinMaxScaler()
data_min_max_scaled = data_with_categorical_handled.copy()
data_min_max_scaled['numerical_column'] = min_max_scaler.fit_transform(data_with_categorical_handled[['numerical_column']])

# Choose the appropriate method to handle numerical data based on your specific dataset and save it as 'data_with_numerical_handled'
data_with_numerical_handled = data_standard_scaled
# For example, using standard scaling

# Final preprocessed dataset
data_final_preprocessed = data_with_numerical_handled

I’ll explain each part of the code:

Handling categorical data:

a. Label encoding:

  • LabelEncoder() is an object from the sklearn.preprocessing library that converts categorical data into numerical labels.
  • A new DataFrame is created (data_label_encoded) as a copy of the preprocessed dataset.
  • The categorical column is transformed into numerical labels using the fit_transform method, and the result is saved in the new DataFrame.

b. One-hot encoding:

  • pd.get_dummies() is a pandas function that performs one-hot encoding on the specified columns, creating a new binary column for each category in the original categorical column (sklearn's OneHotEncoder offers the same transformation for use inside pipelines).
  • The appropriate method for handling categorical data is chosen based on the specific dataset, and the result is saved as data_with_categorical_handled. In this example, one-hot encoding is used.

Handling numerical data:

a. Standard scaling:

  • StandardScaler() is an object from the sklearn.preprocessing library that standardizes numerical data by transforming it to have a mean of 0 and a standard deviation of 1.
  • A new DataFrame is created (data_standard_scaled) as a copy of the dataset with categorical data handled.
  • The numerical column is standardized using the fit_transform method, and the result is saved in the new DataFrame.

b. Min-max scaling:

  • MinMaxScaler() is an object from the sklearn.preprocessing library that scales numerical data to a specific range (default is [0, 1]).
  • A new DataFrame is created (data_min_max_scaled) as a copy of the dataset with categorical data handled.
  • The numerical column is scaled using the fit_transform method, and the result is saved in the new DataFrame.
  • The appropriate method for handling numerical data is chosen based on the specific dataset, and the result is saved as data_with_numerical_handled. In this example, standard scaling is used.

Final preprocessed dataset:

  • The final preprocessed dataset, data_final_preprocessed, is created by assigning the DataFrame with both categorical and numerical data handled. In this example, it is the DataFrame with one-hot encoding and standard scaling applied.

Model Selection and Optimization

Choosing the appropriate model

Selecting the most suitable machine learning model depends on the nature of the data and the problem to be solved. Some options include linear regression, decision trees, SVM, and neural networks.
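Before committing to a single model, it can help to compare a few candidate families with cross-validation. The sketch below is only an illustration of that idea, assuming features X and target y have already been prepared as in the preprocessing steps above; the specific classifiers and settings are assumptions, not part of the original workflow:

# Minimal sketch: comparing candidate classifiers with 5-fold cross-validation.
# Assumes X (features) and y (target) are already defined.
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

candidates = {
    'logistic_regression': LogisticRegression(max_iter=1000),
    'decision_tree': DecisionTreeClassifier(random_state=42),
    'svm': SVC(random_state=42),
    'random_forest': RandomForestClassifier(random_state=42),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(f"{name}: mean accuracy = {scores.mean():.3f} (+/- {scores.std():.3f})")

The remainder of this section tunes one of these options, a Random Forest, with GridSearchCV.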

# Importing required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score

# Assuming 'data_final_preprocessed' is the dataset after handling missing values, categorical, and numerical data

# Splitting the dataset into features (X) and target (y)
X = data_final_preprocessed.drop('target_column', axis=1)
y = data_final_preprocessed['target_column']

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Creating a Random Forest Classifier
rfc = RandomForestClassifier(random_state=42)

# Hyperparameters to be tuned using GridSearchCV
param_grid = {
    'n_estimators': [10, 50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2']  # 'auto' was removed in recent scikit-learn versions
}

# Setting up GridSearchCV with the Random Forest Classifier
grid_search = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=5, scoring='accuracy', verbose=2, n_jobs=-1)

# Fitting the model
grid_search.fit(X_train, y_train)

# Getting the best parameters
best_params = grid_search.best_params_
print("Best parameters found:", best_params)

# Creating a Random Forest Classifier with the best parameters
best_rfc = RandomForestClassifier(**best_params, random_state=42)

# Fitting the model with the best parameters
best_rfc.fit(X_train, y_train)

# Making predictions
y_pred = best_rfc.predict(X_test)

# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

# ROC-AUC score (binary classification: use predicted probabilities for the positive class)
y_proba = best_rfc.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, y_proba)
print("ROC-AUC Score:", roc_auc)

# Cross-validation
kf = KFold(n_splits=5, random_state=42, shuffle=True)
cv_scores = cross_val_score(best_rfc, X, y, cv=kf, scoring='accuracy')
print("Cross-Validation Scores:", cv_scores)
print("Mean Cross-Validation Score:", cv_scores.mean())

This code snippet demonstrates how to train a Random Forest Classifier model using GridSearchCV for hyperparameter tuning and evaluates its performance. Here’s an explanation of each part of the code:

Importing required libraries:

  • train_test_split: A function from sklearn.model_selection to split the dataset into training and testing sets.
  • GridSearchCV: A method from sklearn.model_selection to perform a grid search for the best hyperparameters.
  • RandomForestClassifier: A classifier from sklearn.ensemble that uses a random forest algorithm for classification tasks.
  • accuracy_score, classification_report, confusion_matrix: Functions from sklearn.metrics to evaluate the performance of the model.

Preparing the data:

  • Assuming data_final_preprocessed is the dataset after handling missing values, categorical, and numerical data.
  • Split the dataset into features (X) and target (y) variables.
  • Split the data into training and testing sets using train_test_split().

Creating a Random Forest Classifier:

  • Instantiate a RandomForestClassifier object with a specified random_state to ensure reproducibility.

Setting up hyperparameters for GridSearchCV:

  • Define a dictionary param_grid containing the hyperparameters to be tuned.
  • Instantiate a GridSearchCV object with the Random Forest Classifier, hyperparameter grid, cross-validation, scoring metric, verbosity, and number of jobs to run in parallel.

Fitting the model and finding the best parameters:

  • Fit the grid_search object to the training data.
  • Retrieve the best hyperparameters found by GridSearchCV and print them.

Creating and fitting the best Random Forest Classifier:

  • Instantiate a new RandomForestClassifier object with the best hyperparameters and the same random_state.
  • Fit the model with the best hyperparameters to the training data.

Making predictions and evaluating the model:

  • Use the predict() method of the best Random Forest Classifier to make predictions on the test set.
  • Evaluate the model using accuracy_score, classification_report, and confusion_matrix, and print the results.

ROC-AUC score:

  • roc_auc_score: A function from sklearn.metrics that computes the Area Under the Receiver Operating Characteristic Curve (ROC-AUC) from prediction scores. ROC-AUC is a popular evaluation metric for binary classification problems.
  • Calculate the ROC-AUC score by passing y_test (the true labels) and the predicted probabilities for the positive class (obtained with predict_proba) to the roc_auc_score() function, and print the result. Note that this usage applies to binary targets; multiclass problems require per-class probabilities and a multi_class strategy.

Cross-validation:

  • KFold: A method from sklearn.model_selection that provides indices to split the dataset into train and validation sets for cross-validation.
  • cross_val_score: A function from sklearn.model_selection that evaluates a model using cross-validation.
  • Instantiate a KFold object with the specified number of splits, random_state for reproducibility, and shuffling enabled.
  • Calculate the cross-validation scores by passing the best Random Forest Classifier (best_rfc), the features (X), the target variable (y), the KFold object, and the scoring metric to the cross_val_score() function.
  • Print the cross-validation scores and the mean cross-validation score.

Conclusion

Exploratory Data Analysis (EDA) is essential to understand trends and patterns in data and identify opportunities to improve the effectiveness of machine learning models. By following the steps presented in this article, it is possible to properly prepare the data, select and optimize models, and evaluate their performance, ensuring the creation of effective and accurate solutions.

FAQs:

What is Exploratory Data Analysis?

Exploratory Data Analysis (EDA) is a process used to analyze and summarize datasets, identifying characteristics, patterns, and relationships in the data before applying machine learning techniques.

What are the main techniques and methods of EDA?

Some EDA techniques and methods include scatter plots, histograms, box plots, descriptive statistics, and correlation analysis.

How to handle missing values in a dataset?

Missing values can be handled in various ways, such as filling them with appropriate information (mean, median, mode), using interpolation methods, or simply removing them from the dataset.

What is hyperparameter optimization, and why is it important?

Hyperparameter optimization is the process of adjusting the parameters of a machine learning model to improve its performance. Optimization is important because it helps to find the best combination of hyperparameters that result in an effective and generalizable model.

What performance metrics can be used to evaluate the quality of machine learning models?

Some common performance metrics include RMSE (Root Mean Squared Error), precision, recall, F1-score, and coefficient of determination (R²). The choice of appropriate metrics depends on the type of problem and the analysis objectives.
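As a brief, illustrative sketch, all of these metrics are available in sklearn.metrics; the y_true/y_pred values below are made-up placeholders rather than outputs of the model trained above:

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score, precision_score, recall_score, f1_score

# Regression metrics (placeholder values)
y_true_reg = np.array([3.0, 2.5, 4.0, 5.1])
y_pred_reg = np.array([2.8, 2.7, 3.9, 5.0])
rmse = np.sqrt(mean_squared_error(y_true_reg, y_pred_reg))  # RMSE
r2 = r2_score(y_true_reg, y_pred_reg)                       # R²

# Classification metrics (placeholder values)
y_true_clf = [0, 1, 1, 0, 1]
y_pred_clf = [0, 1, 0, 0, 1]
precision = precision_score(y_true_clf, y_pred_clf)
recall = recall_score(y_true_clf, y_pred_clf)
f1 = f1_score(y_true_clf, y_pred_clf)

print(f"RMSE: {rmse:.3f}, R²: {r2:.3f}")
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")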

🐼 panData

Derived from the Latin word "pan," which means "all" or "every," a personal repository of Data I have studied and somewhat self-taught. 🐼❤️