Exploratory Data Analysis

Omkar Raut
Machine Learning Concepts
4 min read · May 21, 2021

In this post, I give a brief introduction to exploratory data analysis. We can do it once the data has been successfully preprocessed.

Exploratory data analysis tells us more about the provided dataset, including how the attributes relate to one another. It is especially useful when the number of features is large. As a rough rule of thumb, we feed the model about 10–12 features; this is a guideline rather than a hard limit, and the right number depends on the model and the data. If the dataset has more features than that, we have to choose a subset of them. How do we choose the features? It is a challenging question, and exploratory data analysis helps us answer it. We will see it in practice later in this post.

Exploratory data analysis is the initial investigation of the provided data. With the help of charts and graphs, it helps us spot patterns, form hypotheses, and decide which machine learning algorithms are suitable.

Now, let’s work through a practical example alongside the explanation of exploratory data analysis.

1) Take the Iris dataset as input: I am using the pandas library in Python to read the CSV file of the Iris dataset.

The CSV file of the Iris dataset has to be in the directory where we are running the Jupyter notebook.
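As a minimal sketch of this step: the file name `iris.csv` and the column names below are my assumptions, not necessarily the ones in the original file. A few rows of the Iris data are inlined so the snippet runs even without the file on disk.

```python
import io
import pandas as pd

# A few Iris rows inlined so the sketch runs without the CSV on disk.
# With the real file in the notebook's directory you would instead call
# pd.read_csv("iris.csv") (file name and column names are assumptions).
csv_text = """sepal_length,sepal_width,petal_length,petal_width,class
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.head())  # first rows of the dataset
```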

2) Data Preprocessing: We can clearly see that the class attribute is non-numerical, so we have to convert it into a numerical format. We have already seen the methods for doing this. Let’s apply one.

Now, let’s start with the exploratory data analysis. As we have seen, exploratory data analysis means gathering information about the dataset that helps us find patterns and make assumptions. The following methods also give information about the dataset, so they are exploratory data analysis techniques too.
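One possible way to do the conversion is an explicit mapping from class names to integers. This is only a sketch: the column names and the particular integer codes are my assumptions, and any other encoding method would work just as well.

```python
import pandas as pd

# Small sample with assumed column names.
df = pd.DataFrame({
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
              "Iris-versicolor", "Iris-virginica", "Iris-virginica"],
})

# Hypothetical mapping: one integer code per class name.
class_to_code = {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}
df["class"] = df["class"].map(class_to_code)

print(df["class"].tolist())
```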

3) Exploratory Data Analysis:

The info() method prints a summary of the attributes: their names, the number of non-null values, the data type of each attribute, and so on.

The describe() method returns summary statistics for the numerical attributes, such as count, mean, standard deviation, minimum, and maximum.
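Here is a small sketch of info() on a sample with assumed column names. Note that info() writes its summary to the screen instead of returning it; passing a buffer through the `buf` parameter is one way to capture the text.

```python
import io
import pandas as pd

# Small sample with assumed column names.
df = pd.DataFrame({
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
              "Iris-versicolor", "Iris-virginica", "Iris-virginica"],
})

df.info()  # prints column names, non-null counts, and dtypes

# Optional: capture the same summary as a string.
buf = io.StringIO()
df.info(buf=buf)
summary = buf.getvalue()
```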
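A short sketch of describe(), again on a small sample with assumed column names. By default it summarizes only the numerical columns, so the class column is left out here.

```python
import pandas as pd

# Small sample with assumed column names.
df = pd.DataFrame({
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal_width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
              "Iris-versicolor", "Iris-virginica", "Iris-virginica"],
})

stats = df.describe()  # count, mean, std, min, quartiles, max
print(stats)
```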

We can see that the class attribute is categorical. Let’s see some methods that work on the categorical data.

The unique() method returns all the unique values of an attribute.

The value_counts() method returns each unique value of an attribute along with its count.
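For example, on a small sample (the column name `class` is an assumption), unique() lists each class label once, in order of first appearance. The same call works on the integer-encoded column.

```python
import pandas as pd

# Small sample with an assumed column name.
df = pd.DataFrame({
    "class": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
              "Iris-versicolor", "Iris-virginica", "Iris-virginica"],
})

labels = df["class"].unique()  # array of distinct values
print(labels)
```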
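A quick sketch of value_counts(), using a deliberately unbalanced sample (my own illustration, not the real class balance of Iris) so that the counts differ. The result is sorted from the most frequent value to the least frequent.

```python
import pandas as pd

# Unbalanced illustrative sample; the column name is an assumption.
df = pd.DataFrame({
    "class": ["Iris-setosa", "Iris-setosa", "Iris-setosa",
              "Iris-versicolor", "Iris-versicolor", "Iris-virginica"],
})

counts = df["class"].value_counts()  # most frequent value first
print(counts)
```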

The corr() method gives the correlation between each pair of attributes in the dataset (Pearson correlation by default). The values range from -1 to 1.

The diagonal elements are 1 because each attribute is being compared with itself, so it is fully correlated. We can make the table more attractive using the seaborn library.
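As a sketch, here is corr() on a few Iris rows with the class already encoded as integers. The column names and the integer codes are assumptions.

```python
import pandas as pd

# A few Iris rows; column names and class codes are assumptions.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal_width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "class": [0, 0, 1, 1, 2, 2],
})

corr = df.corr()  # square table: one row and one column per attribute
print(corr)
```

If the class column still holds text labels, encode it or drop it first, because recent pandas versions do not silently skip non-numerical columns in corr().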
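A minimal sketch of the graphical view, assuming seaborn's heatmap() is the plotting function used. The sample rows and column names are my assumptions, and the Agg backend line is only there so the snippet also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# A few Iris rows with assumed column names.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal_width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
})

fig, ax = plt.subplots()
sns.heatmap(df.corr(), ax=ax)  # colored boxes only, no numbers yet
```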

As we can see here, the correlation table is converted into a graphical view, but the values are missing. We can read approximate values from the number line (the color bar) on the right side of the graph. Let’s improve it so that we get a more informative graphical view.

The above code returns the same graphical view with the values, which is more informative than the first one. It would be even better if we could change the color scheme of the graph. Let’s try.
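One way to get the values into the boxes is the `annot` parameter of seaborn's heatmap(). This is a sketch on a few sample rows with assumed column names, and it may differ from the exact call in the original notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# A few Iris rows with assumed column names.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal_width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
})

fig, ax = plt.subplots()
sns.heatmap(df.corr(), annot=True, ax=ax)  # write each value in its box
```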
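The color scheme can be changed with the `cmap` parameter. The sketch below uses "coolwarm", a standard matplotlib colormap that I picked only for illustration; the sample rows and column names are assumptions too.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# A few Iris rows with assumed column names.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal_width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
})

fig, ax = plt.subplots()
# "coolwarm" is an illustrative choice; any matplotlib colormap name works.
sns.heatmap(df.corr(), annot=True, cmap="coolwarm", ax=ax)
```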

Yes, I can. Also, I can change the width of the lines between the boxes.
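The width of the lines between the boxes is controlled by the `linewidths` parameter of heatmap(). The value 2 below is just an example, and the sample rows, column names, and colormap are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# A few Iris rows with assumed column names.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal_width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
})

fig, ax = plt.subplots()
# linewidths sets the gap drawn between neighbouring boxes (2 is an example).
sns.heatmap(df.corr(), annot=True, cmap="coolwarm", linewidths=2, ax=ax)
```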

We can also place the number line (the color bar) horizontally at the bottom of the graph.

Choosing features based on correlation is a part of feature selection, so we will cover it in that section in upcoming posts.
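Assuming this step refers to the color bar, one way to lay it out horizontally is to pass an orientation through `cbar_kws`, which seaborn forwards to matplotlib's colorbar. With a horizontal orientation, matplotlib puts the bar below the plot by default. The sample rows and column names are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# A few Iris rows with assumed column names.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal_width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
})

fig, ax = plt.subplots()
# cbar_kws is passed on to matplotlib's colorbar.
sns.heatmap(df.corr(), annot=True,
            cbar_kws={"orientation": "horizontal"}, ax=ax)

cbar_box = fig.axes[1].get_position()  # second axes holds the color bar
```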

Conclusion:

In this post, I have introduced you to exploratory data analysis and given a brief overview of the methods used for it. We have also seen how to build a graphical view of the correlations between all attributes using the seaborn library.

Code is available here: https://github.com/omkarsantoshraut/MachineLearningConcepts/tree/main/Exploratory%20Data%20Analysis

Thank you.

Omkar Raut
Machine Learning Concepts

Bachelor of Technology in Computer Engineer, at Dr. Babasaheb Ambedkar Technological University, Lonere, Raigad, India.