Detailed Exploratory Data Analysis (EDA) on Used Cars Data

Sachin Dev
8 min read · Dec 13, 2022



This article is part of a series I am starting today, in which I will share practical EDA examples on various kinds of datasets. Before showing the code for EDA, I want to talk about what it is and why it is important, so that we have a better understanding of it.

Why do we need EDA?

Exploratory Data Analysis is a vital step in a data science project or process and helps to better understand the data. There are many reasons why EDA is important, including the following:

  1. It helps to identify relationships and patterns between the features in a dataset — patterns that are not immediately visible in the raw data.
  2. EDA also helps to identify potential outliers or anomalies in the dataset. Outliers can have a significant impact on an ML model or on data analysis results, so dealing with them is a critical part of the data science process. If you want to read more about outliers you can read my article here.
  3. EDA is also very useful in the data-cleaning process. It can surface formatting issues in the dataset, and with this information data scientists can clean and preprocess the data before further analysis.
  4. EDA helps to interpret the data using data visualization and statistical techniques. Presenting results through graphs makes them easier to understand and improves communication of the findings.

Hands-on EDA

I have divided the EDA process into small steps which will help us to explore the data and make conclusions about it. Libraries used for this example of EDA are:

  1. Pandas
  2. Numpy
  3. Seaborn
  4. Matplotlib
  5. Warnings

Step 1: Problem Statement and Data Collection

Even before importing required packages or libraries, it is a good practice to define the problem statement and give information about the data.

Problem Statement
Data Description

Step 2: Import Required Packages and Load Dataset

You should be able to load these libraries, but if you still get a ModuleNotFoundError, search for the package on PyPI and install it from the command line with pip. For example, to install Seaborn you run pip install seaborn in the command prompt.

Importing Libraries
Load Dataset and show the top 5 rows.
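The original code screenshots are not reproduced here, so below is a minimal sketch of this step. The file name and the tiny stand-in frame are assumptions made so the snippet runs end to end; in practice you would load the real scraped file.

```python
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

warnings.filterwarnings("ignore")  # hide non-critical library warnings

# Tiny stand-in data so this sketch is runnable; replace with the real file.
toy = pd.DataFrame({
    "car_name": ["Hyundai i20", "Maruti Swift Dzire", "Honda City"],
    "vehicle_age": [5, 3, 8],
    "km_driven": [40000, 25000, 90000],
    "selling_price": [550000, 600000, 350000],
})
toy.to_csv("used_cars.csv")        # saving with the index adds 'Unnamed: 0'
df = pd.read_csv("used_cars.csv")

print(df.head())  # show the top 5 rows
```

Saving with `index=True` (the default) is exactly how the stray `Unnamed: 0` column discussed next gets created.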

Now, we can see there is a column named ‘Unnamed: 0’. This column must have been added while scraping and saving the data to a CSV file, so the next step is simply to drop it.

Step 3: Perform Basic Functions on the Data

As discussed, we will drop the ‘Unnamed: 0’ column and then visualize the data again by showing the top 5 rows.

Dropping Column
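A sketch of the drop, again on a small stand-in frame (column names other than ‘Unnamed: 0’ are illustrative):

```python
import pandas as pd

# Stand-in frame containing the stray index column from the CSV round-trip.
df = pd.DataFrame({"Unnamed: 0": [0, 1, 2],
                   "car_name": ["i20", "Swift Dzire", "City"],
                   "selling_price": [550000, 600000, 350000]})

df = df.drop(columns=["Unnamed: 0"])  # equivalently: df.drop("Unnamed: 0", axis=1)
print(df.head())
```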

3.1: Check the shape and columns in the dataset
The shape of the data means the number of rows and columns in the dataset.

Shape and Columns
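These checks are one-liners in pandas; a runnable sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({"car_name": ["i20", "City"],
                   "selling_price": [550000, 350000]})

print(df.shape)             # (rows, columns) tuple
print(df.columns.tolist())  # list of column names
```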

3.2: Description of Numerical features and Info about each column
DataFrame.describe() gives a summary of the numerical data. This helps us get values like the minimum, maximum, mean, and standard deviation for each numerical column.

Summary of Data

DataFrame.info() will give us information about the data type and number of non-null values in each column.

Data Information
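Both calls together, sketched on stand-in data:

```python
import pandas as pd

df = pd.DataFrame({
    "km_driven": [40000, 25000, 90000],
    "selling_price": [550000, 600000, 350000],
    "fuel_type": ["Petrol", "Diesel", "Petrol"],
})

print(df.describe())  # count, mean, std, min, quartiles, max per numeric column
df.info()             # dtype and non-null count for every column
```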

3.3: Check Duplicate Values
Duplicate values can impact the accuracy of ML models and data analysis results, so it’s good practice to always check for duplicates in the dataset. In our dataset, there are 167 duplicate rows.

Duplicate Values
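A sketch of the duplicate check (the toy frame below has one duplicate row; the article’s dataset had 167):

```python
import pandas as pd

df = pd.DataFrame({"car_name": ["i20", "i20", "City"],
                   "selling_price": [550000, 550000, 350000]})

n_dupes = df.duplicated().sum()  # number of fully duplicated rows
print(n_dupes)
df = df.drop_duplicates()        # keep the first occurrence of each row
```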

3.4: Section Conclusion
I always suggest writing down the conclusions or insights gained from the data after each section. They can help us later to identify issues in the dataset while performing data cleaning and feature engineering.

Section Conclusion.

Step 4: Exploring the Data

Next, we explore the categories present in each column; this information will be useful during univariate, bivariate, and multivariate analysis.

Categories in Columns

As we can see, there is a 0-seats category, which is not possible. We can explore the rows where seats is zero and then decide whether to remove them or replace the 0 values with something else, such as the mode or mean. We can also divide the columns into numerical and categorical groups, which will be helpful for finding patterns in the data.

Numerical and Categorical Features
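A sketch of both ideas — inspecting the bad seats category and splitting columns by dtype (column names are assumptions mirroring the article’s dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "seats": [5, 5, 0, 7],
    "fuel_type": ["Petrol", "Diesel", "Petrol", "Petrol"],
    "selling_price": [550000, 600000, 120000, 900000],
})

# Inspect category counts; a 0-seat car is clearly a data error.
print(df["seats"].value_counts())
print(df[df["seats"] == 0])  # rows to fix or drop later

# Split columns by dtype for the univariate/multivariate steps that follow.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(include="object").columns.tolist()
print(numeric_cols, categorical_cols)
```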

Step 5: Univariate Analysis

Univariate analysis means taking one column at a time and analyzing it. We can use graphs such as the KDE plot and boxplot for this.

Code for KDE Plot
KDE-Plot

A Kernel Density Estimation (KDE) plot shows the shape, and hence the skewness, of a distribution. In the graph above we can see that km_driven, max_power, selling_price, and engine are right-skewed (positively skewed), which suggests there are outliers in those columns.

Count Plot
It shows the count of categories in a column.

Count plot code
Count Plot
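A minimal count plot sketch on toy fuel-type data:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({"fuel_type": ["Petrol"] * 5 + ["Diesel"] * 3 + ["LPG"]})

ax = sns.countplot(x="fuel_type", data=df)  # one bar per category
ax.set_title("Count of cars by fuel type")
# plt.show()  # display when running interactively
```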

Step 6: Multivariate Analysis

Multivariate analysis means taking two or more variables and analyzing them together.

Correlation:
We can find the correlation between features using the corr() function. This is very useful in the EDA process because it quantifies how strongly pairs of features move together.

Correlation
Heatmap Code
Heatmap
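A sketch of the correlation matrix and heatmap. The synthetic data deliberately makes engine and max_power strongly correlated, mirroring the kind of pair discussed below:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
engine = rng.normal(1500, 300, 200)
df = pd.DataFrame({
    "engine": engine,
    "max_power": engine * 0.08 + rng.normal(0, 5, 200),  # near-duplicate signal
    "km_driven": rng.normal(50000, 15000, 200),
})

corr = df.corr()  # pairwise Pearson correlations
print(corr)
sns.heatmap(corr, annot=True, cmap="coolwarm")
# plt.show()  # display when running interactively
```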

This is important because, say we have two independent columns A and B that are 95% or 99% correlated. During feature selection we can drop either one of them without affecting the model.

Correlation Observations

Relationship between Continuous features and Target feature(selling price):

Creating Continuous Features variable
Continuous vs Target Feature Code
Continuous vs Target Feature Scatter Plot
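A runnable sketch of the scatter plots. The synthetic price formula is an assumption built to reproduce the relationships the article observes (age and kilometers push price down, engine size pushes it up):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "vehicle_age": rng.integers(1, 15, 300),
    "km_driven": rng.normal(50000, 20000, 300).clip(0),
    "engine": rng.normal(1500, 400, 300).clip(600),
})
df["selling_price"] = (900000 - 40000 * df["vehicle_age"]
                       - 2 * df["km_driven"] + 200 * df["engine"]
                       + rng.normal(0, 50000, 300))

continuous = ["vehicle_age", "km_driven", "engine"]
fig, axes = plt.subplots(1, len(continuous), figsize=(14, 4))
for ax, col in zip(axes, continuous):
    ax.scatter(df[col], df["selling_price"], s=10, alpha=0.5)
    ax.set_xlabel(col)
    ax.set_ylabel("selling_price")
# plt.show()  # display when running interactively
```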

Here from the above scatter plot, we can see the following observations:

  1. Vehicles with a lower age have higher selling prices than older vehicles.
  2. Engine CC has a positive effect on the selling price.
  3. Kilometers driven has a negative effect on the selling price.

Step 7: Graphical Analysis

Graphical analysis is very important, as it helps us quickly and visually identify patterns and trends in the data.

Selling Price Distribution

Selling Price Distribution

Top 10 Most Sold Cars

Top 10 cars sold
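A sketch of how the top-10 chart and listing shares can be computed (the toy counts are illustrative, not the article’s real figures):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"car_name": ["Hyundai i20"] * 6
                               + ["Maruti Swift Dzire"] * 4
                               + ["Honda City"] * 2
                               + ["Ferrari GTC4Lusso"]})

top10 = df["car_name"].value_counts().head(10)
share = top10 / len(df) * 100  # percentage of all listings
print(share.round(1))

top10.plot(kind="bar", title="Top 10 most listed cars")
# plt.show()  # display when running interactively
```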

As we can see in the graph, the Hyundai i20 accounts for 5.8% of the total ads posted on the website, followed by the Maruti Swift Dzire. The mean price of the top 10 most sold cars is 5.4 lakh INR. So, this feature has a significant impact on the target column.

Brand vs Selling Price

Brand vs Selling Price
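One way to sketch the brand comparison is a groupby over a (here invented) brand column, taking the costliest sale per brand:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "brand": ["Ferrari", "Rolls-Royce", "Maruti", "Maruti"],
    "selling_price": [39_500_000, 24_200_000, 500_000, 450_000],
})

brand_price = (df.groupby("brand")["selling_price"]
                 .max()                          # costliest car sold per brand
                 .sort_values(ascending=False))
print(brand_price)
brand_price.plot(kind="bar", title="Brand vs selling price")
# plt.show()  # display when running interactively
```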

The costliest brand sold is Ferrari at 3.95 Cr INR, followed by Rolls-Royce at 2.42 Cr INR. The brand name has a very clear impact on the selling price.

Kilometer Driven vs Selling Price

kms_driven vs Selling Price

Most cars were sold with between 0 and 20k kilometers driven, and cars with fewer kilometers driven have higher selling prices than cars with more.

Fuel Type vs Selling Price

Fuel Type vs Selling Price

Transmission Type vs Price

Transmission Type vs Price
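Both of these category-vs-price comparisons can be sketched with a boxplot; the transmission example below uses toy prices chosen to reflect the report’s finding that automatics average higher than manuals:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "transmission_type": ["Manual", "Manual", "Manual",
                          "Automatic", "Automatic"],
    "selling_price": [300000, 450000, 500000, 1200000, 900000],
})

ax = sns.boxplot(x="transmission_type", y="selling_price", data=df)
ax.set_title("Transmission type vs selling price")
# plt.show()  # display when running interactively
```

The same call with `x="fuel_type"` gives the fuel-type comparison.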

Step 8: EDA Report

  1. The data types and column names were correct, and there were 15,411 rows and 13 columns.
  2. The selling_price column is the target to predict, i.e., this is a regression problem.
  3. There are outliers in km_driven, engine, selling_price, and max_power.
  4. Dealers are the biggest sellers of used cars.
  5. Skewness was found in a few of the columns; we will revisit it after handling outliers.
  6. Vehicle age has a negative impact on the price.
  7. Manual cars are sold the most, but automatics have a higher average selling price than manuals.
  8. Petrol is the most preferred fuel on the used-car website, followed by diesel and LPG.
  9. Overall, this dataset needs relatively little data cleaning.

Conclusion

EDA is an essential step in the data science process and is crucial for understanding, cleaning, and preparing data for further analysis. By thoroughly exploring and summarizing a dataset, we can gain valuable insights and make informed decisions about how to proceed with our analysis.

You can find the code and dataset on my GitHub profile.

Thanks for reading this article! Leave a comment below if you have any questions. You can follow me on Linkedin and GitHub.
