Importance of Exploratory Data Analysis in the Journey of Data Science Project (Part I)

Vishal Patil
Nov 26, 2021

This is the first part of the article. It covers the detailed EDA performed on the car dataset. The second part covers building a machine learning model and using it to predict the price of a used vehicle.

Source Code: GitHub | Car_Dataset: Kaggle

Every data science project goes through various stages, from data collection to model deployment, but the most important step is Exploratory Data Analysis. This article focuses on the importance of exploratory data analysis.

Exploratory Data Analysis (EDA) is an approach to extracting the information contained in the data and summarizing its main characteristics. It is considered a crucial step in any data science project. EDA is essential for a well-defined, structured project, and it should be performed before the machine learning modeling phase, because a better understanding of the data can help improve the accuracy of the model. EDA also helps us deal with missing values, duplicate values, and outliers, and lets us spot trends or patterns present in the dataset. In this article we focus on the EDA performed on the car dataset. So let's get started.

The first thing we need to do is load the libraries and the dataset on which we will perform the various operations. As data scientists, we should then check the contents of the dataset we have loaded.
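As a minimal sketch of this step, the snippet below loads a CSV with pandas and looks at its shape and first rows. The column names are an assumption about the Kaggle car dataset, and the four rows are made-up stand-ins rather than real records. With the actual file you would pass its path to `pd.read_csv` instead of the in-memory text.

```python
import io

import pandas as pd

# Hypothetical stand-in for the Kaggle CSV: column names are assumed and
# the rows are invented for illustration only.
csv_text = """Car_Name,Year,Selling_Price,Present_Price,Kms_Driven,Fuel_Type,Seller_Type,Transmission,Owner
car_a,2014,3.0,5.5,27000,Petrol,Dealer,Manual,0
car_b,2013,4.5,9.5,43000,Diesel,Dealer,Manual,0
car_c,2017,7.0,9.8,7000,Petrol,Individual,Automatic,0
car_d,2015,6.0,9.4,30000,CNG,Dealer,Manual,1
"""

# With the real dataset this would be pd.read_csv("<path to the csv file>").
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first rows, to check the contents look sensible
```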

The next step is to select the categorical columns and check whether the dataset contains any null values.
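One way this check might look is sketched below. The DataFrame is a small made-up stand-in with assumed column names, and one price is deliberately left missing so that the null check has something to report.

```python
import pandas as pd

# Made-up stand-in data; column names are assumptions about the car dataset.
df = pd.DataFrame({
    "Car_Name": ["car_a", "car_b", "car_c", "car_d"],
    "Year": [2014, 2013, 2017, 2015],
    "Selling_Price": [3.0, 4.5, 7.0, None],  # one value missing on purpose
    "Fuel_Type": ["Petrol", "Diesel", "Petrol", "CNG"],
    "Seller_Type": ["Dealer", "Dealer", "Individual", "Dealer"],
    "Transmission": ["Manual", "Manual", "Automatic", "Manual"],
})

# Inspect the distinct values of each categorical column.
categorical_cols = ["Fuel_Type", "Seller_Type", "Transmission"]
for col in categorical_cols:
    print(col, df[col].unique())

# Count missing values per column.
null_counts = df.isnull().sum()
print(null_counts)
```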

Now let's look at the columns and their data types using the info() method. Let's also get a quick statistical summary of our car dataset using the describe() method, which gives a good picture of the distribution of the numeric columns.
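A short sketch of both calls on a made-up stand-in DataFrame (assumed column names, invented values) is given here. By default `describe()` summarizes only the numeric columns of a mixed DataFrame, so the text column does not appear in its output.

```python
import pandas as pd

# Made-up stand-in data; column names are assumptions about the car dataset.
df = pd.DataFrame({
    "Year": [2014, 2013, 2017, 2015],
    "Selling_Price": [3.0, 4.5, 7.0, 6.0],
    "Kms_Driven": [27000, 43000, 7000, 30000],
    "Fuel_Type": ["Petrol", "Diesel", "Petrol", "CNG"],
})

# Column names, non-null counts and data types (printed directly).
df.info()

# Count, mean, std, min, quartiles and max of the numeric columns.
summary = df.describe()
print(summary)
```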

Next, we drop the unnecessary features from the original dataset, that is, those on which the price does not depend. We also derive a new variable, ‘No of year’, from the year the car was purchased and the current year: we first add a ‘Current_year’ column, then compute the new feature, and finally drop the Year and Current Year columns, as shown in the following figure.
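The steps above can be sketched as follows. The column names, the choice of `Car_Name` as the dropped feature, and the value 2021 for the current year (taken from the publication date) are all assumptions made for illustration, and the rows are invented.

```python
import pandas as pd

# Made-up stand-in data; column names are assumptions about the car dataset.
df = pd.DataFrame({
    "Car_Name": ["car_a", "car_b", "car_c", "car_d"],
    "Year": [2014, 2013, 2017, 2015],
    "Selling_Price": [3.0, 4.5, 7.0, 6.0],
})

CURRENT_YEAR = 2021  # assumption: the year the article was written

# Drop a feature assumed not to influence the price.
df = df.drop(columns=["Car_Name"])

# Add the current year, then derive the age of the car from it.
df["Current_Year"] = CURRENT_YEAR
df["No_of_Years"] = df["Current_Year"] - df["Year"]

# The two helper columns are no longer needed once the age is known.
df = df.drop(columns=["Year", "Current_Year"])
print(df)
```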

In the next step, categorical columns such as Fuel_type, Seller_type and Transmission need to be converted, because machine learning models only understand numerical data. We use the get_dummies method from the pandas library to convert the categorical data into dummy (indicator) variables. We also need to find the correlation between the variables.
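A minimal sketch of the encoding and correlation step, on made-up stand-in data with assumed column names, could look like this. Passing `drop_first=True` is an optional choice made here, not something the article specifies: it drops one level per category so the indicator columns are not redundant. `dtype=int` makes the indicators plain 0/1 integers.

```python
import pandas as pd

# Made-up stand-in data; column names are assumptions about the car dataset.
df = pd.DataFrame({
    "Selling_Price": [3.0, 4.5, 7.0, 6.0],
    "No_of_Years": [7, 8, 4, 6],
    "Fuel_Type": ["Petrol", "Diesel", "Petrol", "CNG"],
    "Seller_Type": ["Dealer", "Dealer", "Individual", "Dealer"],
    "Transmission": ["Manual", "Manual", "Automatic", "Manual"],
})

# Convert every text column into 0/1 indicator columns.
encoded = pd.get_dummies(df, drop_first=True, dtype=int)
print(encoded.columns.tolist())

# Pairwise Pearson correlation between all (now numeric) columns.
corr = encoded.corr()
print(corr)
```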

Finally, for a more detailed visualization, we use the heatmap function from the seaborn library, which plots rectangular data as a color-encoded matrix, to understand the correlation between the variables.
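A hedged sketch of such a plot is shown below, again on made-up numeric stand-in data with assumed column names. The colormap, figure size and output file name are arbitrary choices for illustration. The non-interactive `Agg` backend is selected so that the script also runs without a display. In a notebook you would normally omit that line and let the figure render inline.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit this in a notebook

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up, already-numeric stand-in data; column names are assumptions.
encoded = pd.DataFrame({
    "Selling_Price": [3.0, 4.5, 7.0, 6.0],
    "No_of_Years": [7, 8, 4, 6],
    "Kms_Driven": [27000, 43000, 7000, 30000],
})

corr = encoded.corr()

# Draw the correlation matrix as a color-encoded grid with the values annotated.
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, annot=True, cmap="RdYlGn", ax=ax)
ax.set_title("Correlation between variables")
fig.tight_layout()
fig.savefig("correlation_heatmap.png")  # hypothetical output file name
```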

Conclusion

This article focuses on the importance of the exploratory data analysis step in any data science project. The main purpose of performing EDA is to maximize insight into a dataset. It gives you a clear picture of the features and the relationships between them, and it helps identify essential variables and remove non-essential ones. This is the first part of the article, in which I have performed various EDA operations on car_dataset. The next part covers feature selection and the model building process.

This project is available at https://vishcarpriceprediction.herokuapp.com/

Part II link: https://medium.com/@vpatil12/feature-engineering-and-modeling-phase-in-the-journey-of-data-science-project-part-ii-4657d1865982
