Eight Important Reasons for EDA in ML

Why you should stop skipping EDA in your ML Workflow

Published in

GDSC Babcock Dataverse

3 min read · Dec 23, 2023

EDA if data were a cave (Image generated with DALL-E)

If you are a data scientist or a machine learning enthusiast, you probably know that EDA stands for Exploratory Data Analysis.

But do you know why EDA is so important in the ML workflow?

In this blog post, I will try to explain why EDA is not just a preparatory step, but a critical one that can make or break your ML project.

What EDA is

EDA is the process of exploring and understanding your data before applying any ML algorithms or models.

It involves visualizing and summarizing your data, and looking for patterns, outliers, and anomalies in it.

EDA helps you to gain insights and intuition about your data, which can guide your ML choices and improve your results.

Why EDA is performed in ML:

1. To identify and fix data quality issues:

This includes missing values, incorrect labels, duplicates, or errors.

These issues can affect the performance and accuracy of your ML models, so it is better to deal with them early on.
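As a rough sketch of what a first data-quality pass might look like with pandas (the toy dataset, its column names, and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy dataset; column names and values are invented for this example.
df = pd.DataFrame(
    {
        "age": [25, 32, None, 32, 47],
        "city": ["Lagos", "Abuja", "Lagos", "Abuja", None],
        "label": ["yes", "no", "yes", "no", "Yes"],
    }
)

# Count missing values in each column.
missing_per_column = df.isna().sum()

# Count rows that exactly repeat an earlier row.
duplicate_rows = int(df.duplicated().sum())

# List the distinct label values to spot inconsistent spellings ("Yes" vs "yes").
label_values = sorted(df["label"].dropna().unique())

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print("label values:", label_values)
```

How you fix what you find (dropping, imputing, relabeling) depends on the dataset; the point here is only to surface the issues early.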

2. To understand the data:

EDA helps you become familiar with the distribution, range, and variability of your data.
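For instance, a few pandas calls can give a quick feel for distribution, range, and variability. This is a minimal sketch; the `income` values are hypothetical:

```python
import pandas as pd

# Hypothetical numeric column; the values are invented for this example.
income = pd.Series([20, 22, 25, 27, 30, 31, 35, 400], name="income")

# Count, mean, standard deviation, min, quartiles, and max in one call.
summary = income.describe()

# Range: distance between the largest and smallest value.
value_range = income.max() - income.min()

# Skewness: a positive value suggests a long right tail.
skewness = income.skew()

print(summary)
print("range:", value_range)
print("skewness:", round(skewness, 2))
```

Here the mean sits far above the median because of a single large value, which is exactly the kind of thing worth knowing before modelling.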

3. To choose the appropriate ML techniques:

Understanding your data through EDA helps you choose the right preparation techniques, such as scaling, normalization, transformation, or feature engineering, that can enhance your data and make it more suitable for machine learning.
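To illustrate how an EDA finding can drive that choice, here is a small sketch. It assumes a hypothetical right-skewed column; the log transform and standardization are just one reasonable option, not the only one:

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values, invented for this example.
values = pd.Series([1, 2, 4, 8, 16, 32, 64, 128, 256, 512], name="views")

# EDA finding: the raw column has a long right tail.
skew_before = values.skew()

# One possible response: a log transform (log1p also handles zeros safely).
log_values = np.log1p(values)
skew_after = log_values.skew()

# Then standardize to zero mean and unit (sample) standard deviation.
standardized = (log_values - log_values.mean()) / log_values.std()

print("skew before:", round(skew_before, 2))
print("skew after: ", round(skew_after, 2))
```

Without looking at the distribution first, there would be no way to know whether the transform was needed at all.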

4. To select the most relevant features:

EDA helps you to discover the relationships and correlations between your variables. This can help you to select the most relevant and informative features for your ML models, and avoid multicollinearity or redundancy.
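A correlation matrix is one common way to do this. The sketch below uses made-up columns and an arbitrary threshold of 0.9 to flag feature pairs that look redundant:

```python
import pandas as pd

# Hypothetical features; names and values are invented for this example.
features = pd.DataFrame(
    {
        "height_cm": [150, 160, 170, 180, 190],
        "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],  # same information as height_cm
        "shoe_size": [38, 43, 39, 44, 41],
    }
)

# Pairwise Pearson correlations between all numeric columns.
corr = features.corr()

# Flag pairs whose absolute correlation exceeds an (arbitrary) threshold.
threshold = 0.9
redundant_pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]

print(corr.round(2))
print("possibly redundant:", redundant_pairs)
```

A flagged pair is a prompt to investigate, not an automatic reason to drop a column.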

5. To generate or engineer new features:

EDA provides inspiration and reveals avenues for creating new features, e.g. by combining or transforming existing ones.
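As a small sketch of both ideas, here is one feature built by combining two columns and another built by transforming a date (all column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical dataset; names and values are invented for this example.
df = pd.DataFrame(
    {
        "weight_kg": [60.0, 80.0, 72.0],
        "height_m": [1.60, 1.80, 1.75],
        "signup_date": pd.to_datetime(["2023-01-15", "2023-06-30", "2023-12-23"]),
    }
)

# Combine two existing features into a new one.
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

# Transform an existing feature: pull the month out of a date.
df["signup_month"] = df["signup_date"].dt.month

print(df[["bmi", "signup_month"]])
```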

6. To detect and handle outliers and anomalies:

Outliers are extreme values that deviate from the normal range of your data, while anomalies are values that do not conform to the expected pattern or behavior of your data.

Both outliers and anomalies can affect the performance and generalization of your ML models, so it is important to identify them and decide how to deal with them (e.g., remove them, replace them, or keep them).
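One simple way to flag outliers is the 1.5 × IQR rule. The sketch below uses hypothetical values and shows the "remove" and "replace" options side by side; other detection rules exist, and this is just a common starting point:

```python
import pandas as pd

# Hypothetical values, invented for this example.
values = pd.Series([20.0, 22.0, 25.0, 27.0, 30.0, 31.0, 35.0, 400.0])

# Interquartile range and the usual 1.5 * IQR fences.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Flag anything outside the fences.
is_outlier = (values < lower) | (values > upper)
outliers = values[is_outlier]

# Option 1: remove the flagged rows.
removed = values[~is_outlier]

# Option 2: replace them by capping at the fences.
capped = values.clip(lower=lower, upper=upper)

print("outliers:", outliers.tolist())
print("after capping, max is:", capped.max())
```

Keeping the outliers is also a valid choice when they are real observations rather than errors.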

7. To test your assumptions and hypotheses about your data:

Note that EDA isn’t enough to draw definitive conclusions, but it helps you test your intuitive assumptions and hypotheses about your data.

For example, you might have some prior knowledge or expectations about what your data should look like, or how your variables should interact with each other. EDA can help you to validate or invalidate these assumptions and hypotheses, and adjust them accordingly.
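Here is a minimal sketch of checking two such assumptions on a made-up housing dataset (the column names and values are hypothetical). As noted above, a result like this supports or weakens an assumption; it does not prove it:

```python
import pandas as pd

# Hypothetical housing data; names and values are invented for this example.
df = pd.DataFrame(
    {
        "size_sqm": [40, 55, 70, 85, 100, 120],
        "price": [100, 130, 160, 150, 210, 240],
        "garden": ["no", "no", "yes", "no", "yes", "yes"],
    }
)

# Assumption 1: larger homes tend to cost more.
size_price_corr = df["size_sqm"].corr(df["price"])

# Assumption 2: homes with a garden cost more on average.
mean_price_by_garden = df.groupby("garden")["price"].mean()

print("size vs price correlation:", round(size_price_corr, 2))
print(mean_price_by_garden)
```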

8. To communicate and present your findings and insights to others:

Your insights are only valuable if others understand them.

See the other zens of Data Science here: 15 Key Things Every Data Scientist Needs to Know to be in the Top 1% (levelup.gitconnected.com)

EDA often involves creating visualizations, such as charts, graphs, plots, or maps, that can help you to convey complex information in a simple and intuitive way.

Visualizations can also help you to tell a story with your data, and highlight the key points and takeaways for your audience.
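As a sketch of what that can look like, the snippet below draws a histogram and a box plot of some hypothetical values with matplotlib. It assumes matplotlib is installed and uses the non-interactive `Agg` backend so it runs as a plain script:

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Hypothetical values, invented for this example.
incomes = [20, 22, 25, 27, 30, 31, 35, 400]

# Two views of the same column: overall shape, and the outlier at a glance.
fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(8, 3))
ax_hist.hist(incomes, bins=10)
ax_hist.set_title("Income distribution")
ax_hist.set_xlabel("income")
ax_hist.set_ylabel("count")
ax_box.boxplot(incomes)
ax_box.set_title("Income box plot")
fig.tight_layout()

# Render to an in-memory PNG; use fig.savefig("plot.png") to write a file instead.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
plt.close(fig)
```

In a notebook you would normally skip the backend line and the buffer and just let the figure display inline.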

Further reading: 20% of EDA Plots Data Scientists Use 80% of the Time, all the plots you need to know for EDA (medium.com)

In conclusion

Building ML models and selecting features on intuition alone, without carrying out careful EDA, is bad practice and can undermine your model's performance.

Thank you for reading! I hope you learned something new :)

Bye for now :)