Detailed Exploratory Data Analysis (EDA) on Used Cars Data
This article is part of a series I am starting today, where I will share practical EDA examples on various kinds of datasets. Before showing the code for EDA, I want to talk about what it is and why it is important, so that we can have a better understanding of it.
Why do we need EDA?
Exploratory Data Analysis is a vital step in a data science project or process and helps to better understand the data. There are many reasons why EDA is important, including the following:
- It helps to identify the relationship and patterns between various features in a dataset. These patterns cannot be found immediately by looking at the raw data.
- EDA also helps to find or identify any potential outliers or anomalies in the dataset. Outliers can have a significant impact on an ML model or on data analysis results, so removing or otherwise dealing with outliers becomes a critical part of the data science process. If you want to learn more about outliers, you can read my article here.
- EDA is also very useful in the data-cleaning process: it surfaces formatting issues and inconsistencies in the dataset, which data scientists and analysts can then fix while cleaning and preprocessing the data before further analysis.
- EDA helps to interpret the data using data visualization and statistical techniques. We can interpret our results using various graphs in an easier and more understandable way which can improve the communication of the results.
Hands-on EDA
I have divided the EDA process into small steps that will help us explore the data and draw conclusions from it. The libraries used for this example of EDA are:
- Pandas
- Numpy
- Seaborn
- Matplotlib
- Warnings
Step 1: Problem Statement and Data Collection
Even before importing required packages or libraries, it is a good practice to define the problem statement and give information about the data.
Step 2: Import Required Packages and Load Dataset
You should be able to import these libraries, but if you get a ModuleNotFoundError, search for the package on PyPI and install it from the command line. For example, to install seaborn you need to run pip install seaborn in the command prompt.
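The imports and the initial load can be sketched as below. Since the scraped CSV is not included here, the snippet reads a tiny inline stand-in with the same kind of columns; the filename `cardekho.csv` and the sample values are assumptions for illustration.

```python
import io
import warnings

import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import seaborn as sns

warnings.filterwarnings("ignore")  # silence noisy library warnings during EDA

# Small inline stand-in for the scraped used-cars data (hypothetical values).
# For the real dataset you would use something like pd.read_csv("cardekho.csv").
csv_data = """Unnamed: 0,car_name,vehicle_age,km_driven,fuel_type,transmission_type,seats,engine,max_power,selling_price
0,Hyundai i20,5,45000,Petrol,Manual,5,1197,81.8,550000
1,Maruti Swift Dzire,7,80000,Diesel,Manual,5,1248,74.0,450000
2,Ferrari GTC4Lusso,2,5000,Petrol,Automatic,4,6262,680.0,39500000
"""
df = pd.read_csv(io.StringIO(csv_data))
print(df.head())  # note the leftover index column "Unnamed: 0"
```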
Now, we can see there is a column named ‘Unnamed: 0’. This column must have been added while scraping and saving the data into a CSV file, so the next step is simply to drop that column.
Step 3: Perform Basic Functions on the Data
As discussed, we will drop the ‘Unnamed: 0’ column and then visualize the data again by showing the top 5 rows.
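Dropping the leftover index column is a one-liner; a minimal sketch with a stand-in frame (hypothetical values):

```python
import pandas as pd

# Tiny stand-in for the scraped dataset (hypothetical values).
df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "car_name": ["Hyundai i20", "Maruti Swift Dzire", "Honda City"],
    "selling_price": [550000, 450000, 700000],
})

# The index column picked up during scraping carries no information, so drop it.
df = df.drop(columns=["Unnamed: 0"])
print(df.head())  # top 5 rows, now without the leftover column
```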
3.1: Check the shape and columns in the dataset
The shape of the data means the number of rows and columns in the dataset.
3.2: Description of Numerical features and Info about each column
DataFrame.describe() gives a description of the numerical data. This helps us get values like the minimum, maximum, mean, standard deviation, etc. for each numerical column.
DataFrame.info() will give us information about the data type and number of non-null values in each column.
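These basic inspection calls can be sketched as follows, again on a small stand-in frame with assumed column names matching the article:

```python
import pandas as pd

# Stand-in rows (hypothetical values) with columns like the used-cars data.
df = pd.DataFrame({
    "vehicle_age": [5, 7, 2, 9],
    "km_driven": [45000, 80000, 5000, 120000],
    "selling_price": [550000, 450000, 3950000, 300000],
    "fuel_type": ["Petrol", "Diesel", "Petrol", "Diesel"],
})

print(df.shape)               # (rows, columns)
print(df.columns.tolist())    # column names
print(df.describe())          # count, mean, std, min, quartiles, max per numeric column
df.info()                     # dtype and non-null count per column
```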
3.3: Check Duplicate Values
Duplicate values can impact the accuracy of ML models and data analysis results, so it’s a good practice to always check for duplicate values in the dataset. In our dataset, there are 167 duplicate values.
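Checking and removing duplicates can be done with `duplicated()` and `drop_duplicates()`; a minimal sketch with one deliberately duplicated row (hypothetical values):

```python
import pandas as pd

df = pd.DataFrame({
    "car_name": ["Hyundai i20", "Hyundai i20", "Honda City"],
    "km_driven": [45000, 45000, 30000],
    "selling_price": [550000, 550000, 700000],
})

n_dupes = df.duplicated().sum()   # rows identical to an earlier row
print(f"Duplicate rows: {n_dupes}")

df = df.drop_duplicates()         # keep the first occurrence of each row
```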
3.4: Section Conclusion
I always suggest writing down the conclusions, or key takeaways from the data, after each section. They can help us later to identify issues in the dataset while performing data cleaning and feature engineering.
Step 4: Exploring the Data
Here we explore the dataset at a high level; this information will be useful during univariate, bivariate, and multivariate analysis.
As we can see, there is a seats value of 0, which is not possible. So we can explore the rows where seats is zero and then decide whether to remove those rows or replace the 0 values with something like the mode or mean. We can also divide the columns into numeric and categorical, which will be helpful for finding patterns in the data.
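Both ideas above can be sketched as follows: inspect the impossible seats values, impute them with the mode (one reasonable choice among those mentioned), and split columns by dtype. Column names and values are stand-ins.

```python
import pandas as pd

df = pd.DataFrame({
    "car_name": ["Hyundai i20", "Maruti Swift", "Honda City"],
    "seats": [5, 0, 5],              # a 0-seat car is physically impossible
    "engine": [1197, 1197, 1498],
    "fuel_type": ["Petrol", "Petrol", "Diesel"],
})

# Inspect the suspicious rows before deciding to drop or impute them.
print(df[df["seats"] == 0])

# Replace the impossible 0 with the mode of the valid values.
mode_seats = df.loc[df["seats"] != 0, "seats"].mode()[0]
df.loc[df["seats"] == 0, "seats"] = mode_seats

# Split columns by dtype for later univariate/multivariate analysis.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(include="object").columns.tolist()
print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```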
Step 5: Univariate Analysis
Univariate analysis means taking one column at a time and analyzing it. We can use graphs like KDE plots and boxplots for this.
A Kernel Density Estimation (KDE) plot reveals the skewness of the data. In the above graph, we can see that km_driven, max_power, selling_price, and engine are right-skewed (positively skewed), which suggests there are outliers in those columns.
Count Plot
It shows the count of categories in a column.
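A count plot sketch for a categorical column (the fuel-type counts here are made-up stand-ins):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Synthetic category column (hypothetical proportions).
df = pd.DataFrame({"fuel_type": ["Petrol"] * 6 + ["Diesel"] * 3 + ["LPG"]})

ax = sns.countplot(x="fuel_type", data=df)
ax.set_title("Ads per fuel type")
plt.savefig("fuel_counts.png")

counts = df["fuel_type"].value_counts()  # the numbers behind the bars
print(counts)
```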
Step 6: Multivariate Analysis
Multivariate analysis means taking two or more variables and analyzing them together.
Correlation:
We can find the correlation between features using the corr() function; visualizing the result as a heatmap makes strong relationships easy to spot.
This matters for feature selection: if two independent columns A and B are, say, 95–99% correlated, we can drop either one of them without meaningfully affecting the model.
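The corr() step and the drop-candidate logic can be sketched as below. The data is synthetic: `max_power` is constructed to track `engine` closely so the pair shows up as highly correlated; the 0.95 threshold is an assumed cutoff.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
engine = rng.uniform(800, 3000, 200)
df = pd.DataFrame({
    "engine": engine,
    # max_power tracks engine size closely -> strong positive correlation
    "max_power": engine * 0.06 + rng.normal(0, 5, 200),
    "km_driven": rng.uniform(1000, 150000, 200),
})

corr = df.corr(numeric_only=True)
print(corr.round(2))

# Flag one column of each highly correlated pair as a drop candidate.
threshold = 0.95
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c].abs() > threshold).any()]
print("Candidates to drop:", to_drop)
```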
Relationship between Continuous features and Target feature(selling price):
Here from the above scatter plot, we can see the following observations:
- Newer vehicles (lower vehicle age) have higher selling prices than older ones.
- Engine CC has a positive effect on selling price.
- Kilometers driven has a negative effect on selling price.
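The scatter-plot step for one continuous feature against the target can be sketched as below; the negative age–price relationship is built into the synthetic data to echo the observation, so the numbers are illustrative only.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(2)
age = rng.integers(1, 15, 300)
df = pd.DataFrame({
    "vehicle_age": age,
    # Older cars sell for less in this synthetic sample, echoing the article.
    "selling_price": 1_200_000 - 60_000 * age + rng.normal(0, 50_000, 300),
})

sns.scatterplot(x="vehicle_age", y="selling_price", data=df)
plt.savefig("age_vs_price.png")

corr = df["vehicle_age"].corr(df["selling_price"])
print(f"age-price correlation: {corr:.2f}")  # negative, as the plot suggests
```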
Step 7: Graphical Analysis
Graphical analysis is very important, as it helps us quickly and visually identify patterns and trends in the data.
Selling Price Distribution
Top 10 Most Sold Cars
As we can see in the graph, the Hyundai i20 accounts for 5.8% of the total ads posted on the website, followed by the Maruti Swift Dzire. The mean price of the top 10 most sold cars is 5.4 lakh INR. So, this feature has a significant impact on the target column.
Brand vs Selling Price
The costliest brand sold is Ferrari at 3.95 Cr INR, followed by Rolls-Royce at 2.42 Cr. The brand name has a very clear impact on the selling price.
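A brand-vs-price comparison like this is typically a groupby aggregation; a sketch with a handful of stand-in rows (the prices mirror the figures quoted above but are entered by hand, not computed from the real data):

```python
import pandas as pd

df = pd.DataFrame({
    "brand": ["Ferrari", "Rolls-Royce", "Maruti", "Maruti", "Hyundai"],
    "selling_price": [39_500_000, 24_200_000, 450_000, 500_000, 550_000],
})

# Highest sale per brand, sorted to surface the costliest brands first.
max_price = (df.groupby("brand")["selling_price"]
               .max()
               .sort_values(ascending=False))
print(max_price)
```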
Kilometer Driven vs Selling Price
Many cars were sold with between 0 and 20k kilometers driven, and low-km cars fetch higher selling prices than those with more kilometers on them.
Fuel Type vs Selling Price
Transmission Type vs Price
Step 8: EDA Report
- The data types and column names were correct, and there were 15,411 rows and 13 columns.
- The selling_price column is the target to predict, i.e. this is a regression problem.
- There are outliers in km_driven, engine, selling_price, and max_power.
- Dealers are the biggest sellers of used cars.
- Skewness was found in a few of the columns; we will revisit it after handling outliers.
- Vehicle age has a negative impact on the price.
- Manual cars are sold most often, but automatics have a higher average selling price than manuals.
- Petrol is the most preferred fuel on the used-car website, followed by diesel and LPG.
Overall, this dataset needs relatively little data cleaning.
Conclusion
EDA is an essential step in the data science process, and is crucial for understanding, cleaning, and preparing data for further analysis. By thoroughly exploring and summarizing a dataset, we can gain valuable insights and make informed decisions about how to proceed with our analysis.
You can find the code and dataset on my GitHub profile.
Thanks for reading this article! Leave a comment below if you have any questions. You can follow me on Linkedin and GitHub.