Analytics Vidhya
Published in

Analytics Vidhya

EDA Using SQL

A complete walkthrough of the data analysis process using SQL.

Photo by Luke Chesser on Unsplash

INTRODUCTION

SQL stands for Structured Query Language. It is the standard language to interact with databases. SQL is the most important tool, a data analyst uses to manipulate and gain insights from the data. In this project, we will try to clean, process, and analyze the data, for this purpose we will use this automobile data from UCI Mchine Learning Repository.

For this project, I will be using MySQL Workbench. You can download it from here. MySQL is an open-source relational database management system. Feel free to use any other RDBMS software. So that now we have our software setup, then let’s import the dataset into our database.

Photo by Marc Kleen on Unsplash

IMPORTING DATA

To import the data in MySQL workbench follow the given steps, as the dataset is in CSV format we have to import the data using the Table Data Import Wizard menu.

  1. Create a database named automobile.
  2. Right click on the automobile database in the Schemas column and select the Table Data Import Wizard option from the menu.
  3. Now enter the path of the file, in the file selection window and click Next.
  4. In the Select Destination window click on Create New Table and set the name of the table as automobile_data and click Next.
  5. In the settings window, keep the values as it is and click on Next.
  6. Finally, click on the Next button to execute the task, and the dataset is successfully imported!!!
Importing the dataset

So now, after creating the database let’s start exploring the data and answer some questions. Okay before starting, let’s say you have started a used car dealership venture. You want to know which cars customers are mostly
— to purchase so that you know which cars to keep in stock. So that now our objective is clear let’s analyze the data!!!

DATA CLEANING

So let’s have a look at our data set, to see the data we will use this query:

-- Let's load the data in the database and inspect the data.
Use automobile;
-- Let's have a look at the data. This query will retrieve all the data from the table.
SELECT
*
FROM
automobile_data;

This will return all the rows and columns present in the table. So now let's get the shape of the data.

-- So the below query gives us the number of columns in the dataset.
SELECT
COUNT(*) AS NumberOfColumns
FROM
information_schema.columns
WHERE
table_name = 'automobile_data';
-- Now let's find out the number of rows.
SELECT
COUNT(*) AS NumberOfRows
FROM
automobile_data;

So there are 20 columns and 203 rows present in the data.

Now, let’s inspect each column and check for missing values, outliers, or any spelling mistakes present in the data which can hinder the analysis.

Let’s find out how many unique brands do we have,

-- selecting distinct brands
SELECT DISTINCT
make AS UniqueBrands
FROM
automobile_data;
-- checking missing values
SELECT
*
FROM
automobile_data
WHERE
make IS NULL or make = '';

There are 21 unique brands of cars available. And we have to find top 5 cars customers mostly purchase. While dealing with categorical data, the best way is to find the unique values which will help us to narrow down whether there are any errors in spelling or not. So here there are no errors and missing values present in this column.

Now, let's inspect column num_of_doors,

-- check for distinct values
SELECT DISTINCT
num_of_doors AS NumOfDoors
FROM
automobile_data;
-- check for missing values
SELECT
*
FROM
automobile_data
WHERE
num_of_doors IS NULL OR num_of_doors = '';

So there are two missing values in this column, one car of brand dodge and Mazda each have missing values in the num_of_doors column.

We will try to substitute the missing values, by looking at cars of the same features.

select * from automobile_data
where make = 'dodge' and body_style = 'sedan' and drive_wheels = 'fwd' and engine_location = 'front';

To find out the value to substitute, with this query we will find out the common number of doors, of brand dodge, of body style — sedan, fwd, and front engine location.

So three dodge cars are satisfying these conditions and as we can see that both of them have four doors so we can substitute this same value in the missing location. To substitute the value we will use the UPDATE statement of SQL.

update automobile_data
set num_of_doors = 'four'
where make = 'dodge' and body_style = 'sedan' and drive_wheels = 'fwd' and engine_location = 'front';

We have successfully filled the missing value with ‘four’. Similarly, we can follow the same procedures to substitute the value for the Mazda car.

-- similarly, for mazda
select * from automobile_data
where make = 'mazda' and fuel_type = 'diesel';
-- So we will replace the missing value with 'four' as it is a sedan which usually has four doors.-- substituting value
update automobile_data
set num_of_doors = 'four'
where make = 'mazda' and body_style = 'sedan' and fuel_type = 'diesel';

Now, moving on to the next column let’s inspect the drive_wheels column, as this is also a categorical column let’s find out the distinct values present.

-- check for distinct values
SELECT DISTINCT
drive_wheels AS DriveWheels
FROM
automobile_data;
-- check for missing values
SELECT
*
FROM
automobile_data
WHERE
drive_wheels IS NULL or drive_wheels = '';

In the result set, we can see ‘4wd’ appears twice, but there’s no spelling mistake then why does it appear twice?? This is because of white space present in one of the values, let’s confirm this by looking at the length of values. We use the length() function.

Select drive_wheels, length(drive_wheels) as length
from automobile_data
group by drive_wheels;

So, the result set shows that the length of one value is ‘4’ this is due to the presence of white space, to remove the white space we will use the trim() function.

Update automobile_data
set drive_wheels = trim(drive_wheels)
where length(drive_wheels) > 3;

So we have successfully updated the table. Let's move to the next column, num_of_cylinders. In this column, ‘two’ is misspelled as ‘tow’ so we need to correct this mistake and update the table with the correct value using the UPDATE statement.

SELECT DISTINCT
num_of_cylinders AS NumOfCylinders
FROM
automobile_data;
-- Okay, so here 'two' is misspelled as 'tow', so we need to replace it.UPDATE automobile_data
SET
num_of_cylinders = 'two'
WHERE
num_of_cylinders = 'tow';

So, there are two more columns, we need to check for — compression_ratio and price. Both columns have continuous values, so let’s check the range of values and try to see if any outlier is present or not.

select min(compression_ratio) as MinCopressionRatio, max(compression_ratio) as MaxCompressionRatio
from automobile_data;

This query will return the minimum and maximum values of the compression ratio. We can see that the maximum value exceeds the upper limit of the compression ratio in the result set. This can be verified from the data description. The maximum value according to the description is ‘23’ and the maximum value in the result set is ‘70', which is an outlier. As there is only one record and we don’t know which value to substitute exactly, we will delete this record using the DELETE statement.

delete from automobile_data
where compression_ratio = 70;

Let’s inspect the last column that is the price column.

select min(price) as MinPrice, max(price) as MaxPrice
from automobile_data;

Okay, so we can see some of the cars have a price ‘0', which is not correct, as the prices in the data description start from ‘5118’. So to deal with this we will substitute the values with the average price of all cars.

update automobile_data
set price = 12977
where price = 0;

We have now cleaned the data and are ready for the analysis!!!!

Analyzing the data

Let’s define our objective,

  1. To find the top 5 cars customers bought based on make.
  2. To find the top 5 cars based on body_style.

Let’s start with the first question, to find the top 5 cars based on make we will execute the following query.

select make, count(make) as CarsBought
from automobile_data
group by make
order by CarsBought DESC
Limit 5;

This query basically retrieves the count of cars for brand and sorts the result by descending order of count. As we can see in the result set most customers prefer to buy cars of Toyota brand.

Now let’s find out the top 5 cars based on body style.

select body_style, count(body_style) as CarsBought
from automobile_data
group by body_style
order by CarsBought DESC
Limit 5;

This result shows that the customers prefer buying sedans more than any other body style. Here we have got some concrete results that the customers prefer cars of Toyota brand and cars of sedan body style. So let’s drill down further into the results and determine which body style is preferred by the customer who buys the Toyota cars?

select body_style, count(body_style) as CarsBought
from automobile_data
where make = 'toyota'
group by body_style
order by CarsBought DESC;

So, the result of this query, tells us that the customers buy hatchback models of the Toyota brand, further drilling down the data we understood that the customers prefer buying Toyota cars in the hatchback segment and of gas type.

With this analysis of the data we are now aware of which type of car to keep in stock and can also further create a predictive model to predict the sales of cars but for that, we need some more data.

Thanks for reading! If you want to get in touch with me, feel free to reach me at varunbelgaonkar@gmail.com or my LinkedIn Profile. You can view my Github profile for different data analytics projects.

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store