The Ultimate Guide to Machine Learning: Exploratory Data Analysis (EDA) — Part 1

Simranjeet Singh
16 min read · Feb 26, 2023


Introduction

The power of machine learning to extract knowledge from data and offer insightful information for decision-making has made it a vital component of contemporary technology. Machine learning is a branch of study that allows machines to learn and improve at their tasks without being explicitly programmed. This article covers Exploratory Data Analysis in detail; Feature Engineering, Feature Selection, Machine Learning Algorithms, Hyperparameter Tuning, Docker, Kubernetes, and Model Deployment are covered in the other parts of this blog series.

EDA involves understanding the data through distribution analysis, pattern recognition, and anomaly detection. It helps detect data quality issues such as missing values, outliers, or inconsistencies. Consider a retail company that wants to analyse sales data in order to better understand customer behaviour. EDA can help it find popular buying times, customer demographics, and top-selling products.

👉 Before Starting the Blog, Please Subscribe to my YouTube Channel and Follow Me on Instagram 👇
📷 YouTube — https://bit.ly/38gLfTo
📃 Instagram — https://bit.ly/3VbKHWh

👉 Do Donate 💰 or Give me Tip 💵 If you really like my blogs, Because I am from India and not able to get into Medium Partner Program. Click Here to Donate or Tip 💰 — https://bit.ly/3oTHiz3

Fig.1 — Exploratory Data Analysis

To explain how EDA will enhance model performance and influence business outcomes, we’ll also go into real-world examples.

Table of Contents

1. Data Cleaning and Pre-Processing

2. Handling Missing Values

  • Data Type Conversion
  • Scaling Data

3. Data Visualisation and Exploration (All Plots Explained)

4. Statistical Analysis

  • Descriptive Statistics
  • Hypothesis Testing
  • Probability Distribution
  • Confidence Intervals
  • Regression analysis
  • Time Series Analysis
  • Cluster Analysis

5. Outlier Detection and Treatment

  • Visual Inspection using Box Plot
  • Z-Score Method
  • IQR Method
  • DBSCAN Method
  • Removing, Winsorization, or Transformation

6. Correlation Analysis

7. Data Distribution Analysis

8. Dimensionality Reduction

9. Identifying Relationships between variables

Exploratory Data Analysis (EDA)

EDA (Exploratory Data Analysis) is a critical component of any data science or machine learning project. It is the process of analysing, visualising, and comprehending a dataset in order to extract useful insights and patterns. EDA helps identify important features and relationships among variables, which aids in the development of a better predictive model. In this section, we will cover the various EDA techniques and how they are implemented in Python.

1. Data Cleaning and Preprocessing: Data cleaning and preprocessing are required before performing any analysis. This involves examining the data for missing or duplicate values, data types, and data consistency. For example, if you have a dataset with inconsistent date formats, you’ll need to use Python’s datetime module to convert them to a standardised format. It contributes to ensuring that the data is correct, complete, and ready for analysis. This step includes operations such as removing duplicates, dealing with missing data, converting data types, and scaling data.

Example: The age of a customer may be missing from a dataset that contains information about them. These missing data can either be deleted or filled in using the mean or median.
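For instance, duplicate rows and date strings (both mentioned above) can be handled directly with pandas. Here is a minimal sketch using a made-up customer dataframe:

import pandas as pd

# Hypothetical customer dataframe with a duplicated row and string dates
df = pd.DataFrame({
    'customer_id': [1, 2, 2, 3],
    'age': [34, 45, 45, 29],
    'signup_date': ['2021-01-05', '2021-02-05', '2021-02-05', '2021-03-03']
})

# Remove exact duplicate rows
df = df.drop_duplicates()

# Convert the date strings into a standardised datetime column
df['signup_date'] = pd.to_datetime(df['signup_date'])

print(df)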

2. Handling Missing Values: An incomplete dataset can make forecasts less accurate. Consequently, it is crucial to find and deal with any missing values before beginning any analysis. Rows with missing values can be removed, or the missing values can be filled in using, for example, the mean or median of the column.

Example: In a dataset that records customers' purchase history, a customer's most recent transaction could have missing values. The column mean can be used to fill in these missing values.

Fig.2 — Missing Values in Data

Here is an example of how to handle missing data in Python using Pandas:

import pandas as pd
import numpy as np

# Create a sample dataframe with missing data
df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [5, np.nan, np.nan, 8], 'C': [9, 10, 11, 12]})
print(df)

# Drop rows with missing data (returns a new dataframe, the original stays intact)
df_dropped = df.dropna()
print(df_dropped)

# Alternatively, fill missing values with each column's mean
df_filled = df.fillna(df.mean())
print(df_filled)

In this example, we first create a sample dataframe containing missing values. The dropna() function removes rows with missing values, while the fillna() function replaces missing values with the column mean. Note that each method is applied to the original dataframe separately, since dropping the incomplete rows first would leave nothing for fillna() to fill.

Data type conversion: Data type conversion is important because it ensures that the data is in a consistent format for analysis. For instance, we might need to convert a column of date strings into a datetime format, or convert numbers stored as strings into numerical data.

Here is an example of how to convert data types in Python using Pandas:

# Create a sample dataframe with mixed data types
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['4', '5', '6'], 'C': ['2021-01-01', '2021-01-02', '2021-01-03']})
print(df)

# Convert column 'B' from string to integer
df['B'] = df['B'].astype(int)
print(df)

# Convert column 'C' from string to datetime
df['C'] = pd.to_datetime(df['C'])
print(df)

In this example, we first create a sample dataframe containing mixed data types. We then convert column 'B' from string to integer using the astype() function, and column 'C' from string to datetime format using the pd.to_datetime() function.

Scaling data: Data scaling is crucial because it guarantees that variables have a comparable scale, making it simpler to compare them. For algorithms like KNN and clustering that employ distance measurements, scaling is especially crucial.

Here is an example of how to scale data in Python using Numpy:

import numpy as np

# Create a sample array
data = np.array([[1, 2], [3, 4], [5, 6]])

# Standardize data
mean = np.mean(data, axis=0)
std = np.std(data, axis=0)
data_scaled = (data - mean) / std
print(data_scaled)

In this example, we first create a sample array. Then, we use the mean() and std() functions to calculate the mean and standard deviation of each column. Finally, we use these values to standardize the data using the formula (x - mean) / std.
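In practice, the same standardisation is often done with scikit-learn, whose StandardScaler can be fit on training data and then reused on new data. Here is a minimal sketch on the same sample array:

from sklearn.preprocessing import StandardScaler
import numpy as np

# Same sample array as above
data = np.array([[1, 2], [3, 4], [5, 6]])

# Fit the scaler and transform the data to zero mean and unit variance
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
print(data_scaled)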

3. Data Visualization and Exploration: Data visualization is a powerful tool for exploring and understanding data. It can help identify patterns, trends, and relationships in the data. Heat maps, scatter plots, line plots, and histograms are a few common visualization techniques. Visualization is also useful for detecting outliers, anomalies, and the shape of the data distribution.

Example: In a dataset containing housing prices, you can create a scatter plot to visualize the relationship between the size of the house and the price.

Fig.3 — Data Visualisation and Exploration

Depending on the type of data and the problem statement, different visualization techniques may be used. Following are some typical visualization methods and the scenarios in which they are used:

Scatter Plot: To see the relationship between two continuous variables, use a scatter plot. The strength and direction of the association between the variables can be determined with its help. A scatter plot, for instance, can be used to show how the square footage of a house and the sale price correlate in a dataset of home pricing data.

import matplotlib.pyplot as plt

# Example data: replace x and y with your own continuous variables
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]

plt.scatter(x, y)
plt.xlabel('x-axis label')
plt.ylabel('y-axis label')
plt.title('Title')
plt.show()

Line Plot: A line plot shows the trend of a continuous variable over time or over another continuous variable. It is useful for finding patterns and trends in the data. For instance, a line plot can be used to show the trend of a stock's price over time in a dataset of stock prices.

import matplotlib.pyplot as plt

# Example data: replace x and y with e.g. time points and prices
x = [1, 2, 3, 4, 5]
y = [10, 12, 9, 14, 15]

plt.plot(x, y)
plt.xlabel('x-axis label')
plt.ylabel('y-axis label')
plt.title('Title')
plt.show()

Histogram: To see how a continuous variable is distributed, use a histogram. Knowing if the distribution is normal, skewed, or bimodal can be helpful. A histogram, for instance, can be used to show the distribution of scores in a dataset of test results.

import matplotlib.pyplot as plt

# Example data: a single continuous variable
data = [55, 60, 62, 65, 67, 70, 72, 75, 80, 95]

plt.hist(data, bins=10)
plt.xlabel('x-axis label')
plt.ylabel('y-axis label')
plt.title('Title')
plt.show()

Heat Map: A heat map represents the relationship between two categorical variables, or the pairwise correlation between several numeric variables, as a colour-coded grid. It helps show how frequently different combinations of categories occur and how strongly they are related. For instance, a heat map can be used to show how often different topics and sentiment labels co-occur in a dataset of customer reviews.

import seaborn as sns
import matplotlib.pyplot as plt

# Example data: a 2D table such as a cross-tabulation or correlation matrix
data = [[10, 4, 2], [3, 12, 5], [1, 6, 9]]

sns.heatmap(data, annot=True)
plt.xlabel('x-axis label')
plt.ylabel('y-axis label')
plt.title('Title')
plt.show()

Box Plot: A box plot is used to show how a continuous variable is distributed among many groups. Finding outliers and examining the distribution across various categories might be helpful. A box plot, for instance, can be used to show how incomes are distributed among various job categories in a dataset of salary data.

import pandas as pd
import matplotlib.pyplot as plt

# Example dataframe: replace 'column_name' with your own numeric column
df = pd.DataFrame({'column_name': [32, 45, 38, 51, 47, 120]})

plt.boxplot(df['column_name'])
plt.title('Box plot of column_name')
plt.xlabel('X-axis label')
plt.ylabel('Y-axis label')
plt.show()

Bar Plot: Bar plots compare the values of different categories or groups and are useful for showing the frequency or distribution of categorical data. For instance, a bar plot can be used to display the number of employees in each of a company's departments.

import matplotlib.pyplot as plt

# Example data: categories and their counts
x = ['HR', 'Sales', 'IT']
y = [12, 30, 18]

plt.bar(x, y)
plt.xlabel('x-axis label')
plt.ylabel('y-axis label')
plt.title('Title')
plt.show()

Pie Charts: Pie charts are another popular way of showing the proportion or percentage of different groups or categories, and they are helpful for comparing relative sizes. For instance, a pie chart can be used to display the proportion of the various product types sold by a company.

import matplotlib.pyplot as plt

# Example data: proportions and their labels
data, labels = [40, 35, 25], ['Product A', 'Product B', 'Product C']
plt.pie(data, labels=labels)
plt.title('Title')
plt.show()

Scatter Matrices: They are used for investigating the connections between various variables in a dataset. They can help in spotting trends or patterns in the data. A scatter matrix, for instance, can be used to illustrate the correlation between a group of people’s age, income, and educational achievement.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Example dataframe with a few numeric columns
df = pd.DataFrame({'age': [25, 32, 47, 51], 'income': [30, 45, 80, 95]})
sns.pairplot(df)
plt.show()

Parallel Coordinate Plots: They are used to visualise multivariate data and to investigate how several variables relate to one another. A parallel coordinate plot, for instance, can be used to illustrate the relationship between a car's characteristics, such as its weight, horsepower, and fuel efficiency.

import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

# Example dataframe: numeric features plus a 'Class' column to colour by
df = pd.DataFrame({'weight': [1.2, 2.3, 1.8], 'hp': [90, 150, 110], 'Class': ['A', 'B', 'A']})
parallel_coordinates(df, 'Class')
plt.show()

4. Statistical Analysis: Calculating summary statistics like mean, median, and standard deviation as well as evaluating data-related hypotheses are part of statistical analysis. This facilitates the identification of patterns, relationships, and trends in the data.

Example: In a dataset containing student test scores, you can calculate the mean and standard deviation of the scores to understand the distribution of the data.

Fig.4 — Statistical Analysis

Descriptive Statistics: Descriptive statistics are used to summarize the main features of a dataset. They include measures of central tendency (such as mean, median, and mode), measures of dispersion (such as standard deviation and variance), and measures of shape (such as skewness and kurtosis).

Python code snippet for calculating mean, median, and standard deviation using pandas:

import pandas as pd

# create a sample data frame
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [6, 7, 8, 9, 10]
})

# calculate mean
print(df.mean())

# calculate median
print(df.median())

# calculate standard deviation
print(df.std())

Hypothesis Testing: Hypothesis testing is a way to determine whether a hypothesis about the data is likely to be true or false. This can involve testing whether two samples come from the same population, or testing whether a sample mean is significantly different from a hypothesized population mean. Some common hypothesis tests include t-tests, ANOVA, and chi-squared tests.

Python code snippet for conducting a t-test using scipy:

from scipy.stats import ttest_ind

# create two samples
sample1 = [1, 2, 3, 4, 5]
sample2 = [6, 7, 8, 9, 10]

# conduct a t-test
t_stat, p_val = ttest_ind(sample1, sample2)

print('t-statistic:', t_stat)
print('p-value:', p_val)

Probability Distributions: Probability distributions are mathematical functions that describe the likelihood of different outcomes in a random process. Understanding the underlying probability distribution of a dataset can help identify outliers and understand the variability in the data. Some common probability distributions include the normal distribution, binomial distribution, and Poisson distribution.

Python code snippet for generating random numbers from a normal distribution using numpy:

import numpy as np

# generate 100 random numbers from a normal distribution with mean 0 and standard deviation 1
data = np.random.normal(0, 1, 100)

print(data)

Confidence Intervals: Confidence intervals are used to estimate the range of values within which a population parameter is likely to fall. This is typically done using the standard error of the mean and a specified level of confidence.

Python code snippet for calculating a confidence interval using scipy:

import numpy as np
from scipy.stats import t

# create a sample data set
data = [1, 2, 3, 4, 5]

# calculate the mean and sample standard deviation
mean = np.mean(data)
std = np.std(data, ddof=1)

# calculate the standard error of the mean
sem = std / np.sqrt(len(data))

# calculate the 95% confidence interval
ci = t.interval(0.95, len(data)-1, loc=mean, scale=sem)

print('Mean:', mean)
print('Standard deviation:', std)
print('Standard error of the mean:', sem)
print('Confidence Interval:', ci)

Regression Analysis: Regression analysis is a way to model the relationship between a dependent variable and one or more independent variables. This can help identify the factors that are most strongly associated with the dependent variable and can be used for prediction or causal inference. Some common regression techniques include linear regression, logistic regression, and polynomial regression.

import pandas as pd
import statsmodels.api as sm

# Load data
df = pd.read_csv('data.csv')

# Define variables
X = df['age']
y = df['salary']

# Add constant
X = sm.add_constant(X)

# Fit model
model = sm.OLS(y, X).fit()

# Print summary
print(model.summary())

Time series analysis: Time series analysis is a way to analyze data that is collected over time, such as stock prices or weather data. This involves identifying trends, seasonal patterns, and other patterns that may be present in the data. Some common time series analysis techniques include moving averages, autoregressive models, and seasonal decomposition.

import pandas as pd
import matplotlib.pyplot as plt

# Load data
df = pd.read_csv('data.csv')

# Set index to date column
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)

# Resample data to monthly frequency
df_monthly = df.resample('M').mean()

# Plot time series
df_monthly.plot(figsize=(12, 8))
plt.show()

Clustering analysis: Clustering analysis is a way to group data points together based on their similarities. This can help identify patterns in the data and can be used for segmentation or classification. Some common clustering techniques include k-means clustering, hierarchical clustering, and density-based clustering.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Load data
df = pd.read_csv('data.csv')

# Define variables
X = df[['age', 'salary']]

# Fit k-means clustering model
kmeans = KMeans(n_clusters=3, random_state=0).fit(X)

# Add cluster labels to data frame
df['cluster'] = kmeans.labels_

# Plot clusters
df.plot.scatter(x='age', y='salary', c='cluster', colormap='viridis', figsize=(12, 8))
plt.show()

5. Outlier Detection and Treatment: Outliers are data points that deviate significantly from the rest of the data. They can affect the accuracy of predictions and should be treated appropriately. This involves either removing them or transforming them using a suitable technique.

Example: In a dataset containing customer income, you can detect outliers by creating a box plot and removing them by filtering values outside of a certain range.

There are several techniques that can be used to detect and treat outliers, such as:

Visual Inspection: Visual inspection involves plotting the data and identifying any data points that are far away from the rest of the data. Box plots, scatter plots, and histograms can be useful for identifying outliers visually.

import matplotlib.pyplot as plt

# Example data containing a clear outlier
data = [23, 25, 27, 30, 31, 33, 35, 120]

plt.boxplot(data)
plt.show()

Z-Score Method: The Z-score method involves calculating the standard score for each data point. Data points whose absolute standard score exceeds a threshold value are considered outliers. The threshold is typically set to 3, meaning that any data point with an absolute Z-score greater than 3 is considered an outlier.

import numpy as np
from scipy import stats

# Example data with one extreme value; 3 is a common Z-score threshold
data = np.array([10, 11, 9, 10, 12, 11, 10, 9, 11, 10, 12, 9, 10, 11, 1000])
threshold = 3

z_scores = stats.zscore(data)
abs_z_scores = np.abs(z_scores)
outliers = data[abs_z_scores > threshold]

Interquartile Range (IQR) Method: The IQR method involves calculating the range between the first quartile (Q1) and the third quartile (Q3) of the data. Data points outside the range [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] are considered outliers.

import pandas as pd

# Example data as a pandas Series (replace with your own column)
data = pd.Series([23, 25, 27, 30, 31, 33, 35, 120])

q1 = data.quantile(0.25)
q3 = data.quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - (1.5 * iqr)
upper_bound = q3 + (1.5 * iqr)
outliers = data[(data < lower_bound) | (data > upper_bound)]

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) Method: DBSCAN is a clustering algorithm that can be used for outlier detection. Data points that do not belong to any cluster are considered outliers.

import numpy as np
from sklearn.cluster import DBSCAN

# Example 2D data with one isolated point; eps and min_samples need tuning
data = np.array([[1.0, 2.0], [1.1, 2.1], [0.9, 1.9], [1.2, 2.0], [8.0, 9.0]])

dbscan = DBSCAN(eps=0.5, min_samples=3)
dbscan.fit(data)
outliers = data[dbscan.labels_ == -1]

Once the outliers are identified, there are several ways to treat them, such as:

Removing Outliers: Removing outliers involves deleting the data points that are identified as outliers. However, this can result in a loss of data and can also affect the accuracy of the analysis.
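As a minimal sketch (reusing the IQR bounds from above on made-up data), removal simply keeps the values that fall inside the bounds:

import pandas as pd

# Made-up data with an obvious outlier
data = pd.Series([23, 25, 27, 30, 31, 33, 35, 120])

# Keep only the values that fall inside the IQR bounds
q1, q3 = data.quantile(0.25), data.quantile(0.75)
iqr = q3 - q1
cleaned = data[(data >= q1 - 1.5 * iqr) & (data <= q3 + 1.5 * iqr)]
print(cleaned)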

Winsorization: Winsorization is a technique that involves replacing the outliers with the nearest non-outlier value. This can help reduce the impact of outliers on the analysis.

import numpy as np

# Define a numpy array with some outliers
data = np.array([2, 3, 5, 7, 9, 11, 13, 15, 17, 19, 100, 200, 300, 400])

# Calculate the lower and upper bounds for Winsorization
lower_bound = np.percentile(data, 5)
upper_bound = np.percentile(data, 95)

# Replace the outliers with the lower and upper bounds
winsorized_data = np.clip(data, lower_bound, upper_bound)

print(winsorized_data)

Transformation: Transformation involves transforming the data, such as using a logarithmic or square root transformation, to reduce the impact of outliers on the analysis.

import numpy as np

# generate some data with outliers
data = np.random.normal(loc=10, scale=5, size=100)
data[0] = -100
data[1] = 200

# apply log transformation to treat outliers
transformed_data = np.log(data - min(data) + 1)

# plot the original and transformed data
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 2, figsize=(10, 5))

ax[0].hist(data, bins=20)
ax[0].set_title('Original data')

ax[1].hist(transformed_data, bins=20)
ax[1].set_title('Transformed data')

plt.show()

6. Correlation Analysis: Correlation analysis is an important technique used in data analysis to understand the relationship between variables. It helps identify the degree to which two or more variables are related. Correlation is measured by the correlation coefficient, which ranges from -1 to +1. A correlation coefficient of -1 indicates a perfect negative correlation, while a coefficient of +1 indicates a perfect positive correlation. A coefficient of 0 indicates no correlation.

Here is an example code snippet to calculate the correlation between two variables using pandas:

import pandas as pd

# Read data from a CSV file
data = pd.read_csv('data.csv')

# Calculate the correlation between two variables
corr = data['variable1'].corr(data['variable2'])

# Print the correlation coefficient
print(corr)

It is important to note that correlation does not imply causation. Just because two variables are strongly correlated does not mean that one causes the other.

There are different types of correlation coefficients that can be used depending on the type of data and the problem statement. For example, Pearson's correlation coefficient is appropriate for continuous, linearly related data, while Spearman's rank correlation coefficient is appropriate for ordinal data or for relationships that are monotonic but not linear.
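Both coefficients are available directly in pandas through the method argument of corr(). A short sketch on a made-up dataframe:

import pandas as pd

# Made-up dataframe with two numeric columns
df = pd.DataFrame({'variable1': [1, 2, 3, 4, 5], 'variable2': [2, 3, 5, 8, 13]})

# Pearson (the default) and Spearman correlation coefficients
pearson_corr = df['variable1'].corr(df['variable2'])
spearman_corr = df['variable1'].corr(df['variable2'], method='spearman')
print(pearson_corr, spearman_corr)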

It is also important to visualize the correlation between variables to get a better understanding of the relationship. Heat maps and scatter plots are common visualization techniques used for correlation analysis. Heat maps can help identify the strength and direction of the relationship between multiple variables, while scatter plots can help identify the relationship between two variables.

Here is an example code snippet to create a heat map using seaborn:

import seaborn as sns
import matplotlib.pyplot as plt

# Create a correlation matrix from the numeric columns of the dataframe loaded above
corr_matrix = data.corr()

# Create a heat map
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')

# Show the plot
plt.show()

7. Data distribution analysis: Data distribution analysis is an important aspect of EDA, which involves analyzing the distribution of data to understand its properties such as skewness, kurtosis, and central tendency. This helps in identifying any patterns or deviations from normality in the data. Commonly used techniques for data distribution analysis include histograms, density plots, and box plots.

Fig.5 — Data Distributions

It is important to note that data distribution analysis techniques may vary depending on the type of data and the problem statement. It is recommended to explore different visualization techniques and choose the most appropriate one for the specific problem.
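As a small sketch of what this can look like in practice, pandas exposes skewness and kurtosis directly, and a histogram shows the overall shape of the distribution (the data here is randomly generated for illustration):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Randomly generated, right-skewed example variable
data = pd.Series(np.random.exponential(scale=2.0, size=1000))

print('Skewness:', data.skew())
print('Kurtosis:', data.kurtosis())

# Histogram to inspect the shape of the distribution
data.plot.hist(bins=30)
plt.title('Distribution of the example variable')
plt.show()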

8. Dimensionality reduction techniques: This involves reducing the number of features in a dataset while retaining the relevant information. This is useful in situations where the number of features is large and complex, and it can improve the accuracy and efficiency of machine learning models. Some common techniques for dimensionality reduction include Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), and Linear Discriminant Analysis (LDA).
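For illustration, here is a minimal PCA sketch with scikit-learn on a randomly generated feature matrix; explained_variance_ratio_ shows how much of the variance each retained component captures:

import numpy as np
from sklearn.decomposition import PCA

# Randomly generated feature matrix: 100 samples, 4 features
X = np.random.rand(100, 4)

# Reduce to 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)
print(pca.explained_variance_ratio_)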

9. Identifying relationships between variables: This involves exploring the relationships between variables in a dataset to identify any patterns or correlations. This can help in understanding how different variables affect each other and in identifying important predictors for a target variable. Techniques for identifying relationships between variables include scatter plots, correlation analysis, and regression analysis.

Example: In a dataset containing customer data, you can use PCA to reduce the number of features and identify the most important ones for predicting customer churn. You can also use scatter plots and correlation analysis to identify any relationships between customer demographics and their purchasing behavior.
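One simple way to rank candidate predictors against a numeric target is to sort that target's column of the correlation matrix. A sketch on a hypothetical customer dataframe (column names are made up):

import pandas as pd

# Hypothetical customer data with a numeric target column 'spend'
df = pd.DataFrame({
    'age': [25, 32, 47, 51, 38],
    'visits_per_month': [2, 5, 3, 8, 4],
    'spend': [120, 340, 200, 560, 280]
})

# Correlation of every variable with the target, strongest first
print(df.corr()['spend'].sort_values(ascending=False))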

Sample Project on Complete EDA on Car Price Dataset

Comment below if you need the full source code.

Car Price Project on Full EDA and Dashboard

Conclusion

In conclusion, exploratory data analysis (EDA) is a critical step in the data analysis process that involves understanding the data, identifying patterns, trends, and relationships, and preparing the data for modeling. Key takeaways from EDA include:

  1. EDA helps to identify patterns and trends in the data and to prepare the data for modeling.
  2. Data cleaning and preprocessing are essential steps in EDA to ensure the data is accurate and consistent.
  3. Data visualization is a powerful tool for exploring and understanding data.
  4. Statistical analysis can help identify patterns, relationships, and trends in the data.
  5. Outlier detection and treatment can help improve the accuracy of predictions.
  6. Correlation analysis helps to identify which variables are strongly correlated and can be used to make predictions.
  7. Data distribution analysis is important in understanding the properties of the data.
  8. Dimensionality reduction techniques help to reduce the complexity of the data and improve model performance.

If you like the article and would like to support me make sure to:

👏 Clap for the story (100 Claps) and follow me 👉🏻Simranjeet Singh

📑 View more content on my Medium Profile

🔔 Follow Me: LinkedIn | Medium | GitHub | Twitter | Telegram

🚀 Help me reach a wider audience by sharing my content with your friends and colleagues.

🎓 Do you want to start a career in Data Science and Artificial Intelligence but do not know how? I offer data science and AI mentoring sessions and long-term career guidance.

📅 Consultation or Career Guidance

📅 1:1 Mentorship — About Python, Data Science, and Machine Learning

Book your Appointment

Simranjeet Singh

Data Scientist | Blogger | YouTuber | MLOPS | Machine Learning and Deep Learning | NLP | Azure/AWS