Exploratory data analysis on a web scraped Checki data set
This is an EDA exercise to get insights from a CSV dataset scraped from Checki, an online automobile store. It is a data set containing information about cars for sale, their location, model year, etc. I executed the exercise in collaboration with Hard Core and @dakinbuwa, and the dataset is available on GitHub using the link at the end of this post.
Expected outcomes are insights such as the location with the highest number of car listings, car brand listings by state, car Model with the highest number of listings, the average price of cars per location, and so on. To achieve these, we have to clean our data and do some data engineering using pandas and NumPy for data manipulation, matplot, and seaborn for visualizations.
Time to import the necessary libraries
import numpy as npimport seaborn as snsimport pandas as pdimport matplotlib.pyplot as plt%matplotlib inline
I executed this exercise on google colab; thus, the next step is to load my google drive account where the data set is stored. You can skip this step if you intend on working locally using Jupyter or any other IDE. If you want to set up a google colab notebook, you can check out my previous post where I outlined how to set up a colab notebook using the link below.
Time to mount google drive to google colab
from google.colab import drivedrive.mount('/gdrive')
The next step is to create our file path, make sure you edit the file path to match the location of the data set on your system.
file_path='/gdrive/My Drive/Ai_projects/checki EDA/Cheki_ng_scraping.csv'cd= pd.read_csv(file_path)
Lets have a look at our data set
cd.head
let’s have a look at the data type of each column and check if there are any missing or null values
cd.info()
The price and mileage columns are of the object data type; we will change them to ‘int’ when necessary. From the above, we have no missing or null values.
Creating a new data frame from our data set using only necessary columns (dropping the unnamed column)
cols=['Car_name', 'Year', 'Price', 'Location', 'Status', 'Ratings','Mileage']cd_data = cd[cols]
cd_data.head(10)
Data cleaning and feature engineering
Mileage column
The mileage are in km and miles, we need to create a new column where all values for the mileages are converted to km
km=[] #creating and empty set to hold the converted mileagesfor i in cd_data.Mileage: # for each value in the mileage column if i[-2:]=='km': # if the last two characters are 'km' km.append(i[:-3].replace(',','')) #add the mileage to the empty set 'km' without the units while stripping of comma's by replacing them with an empty character. if i[-2:]=='mi': # if the last two characters are 'mi' j=(i[:-3].replace(',','')) #store the mileage with the 'j' variable without the units while stripping of comma's x=int(j)*1.6 #converting the unit from miles to km km.append(x) #add the mileage to the empty set 'km'
Creating a new column with the mileage all in km and having a look at the new data set
cd_data['Mileage_km']=kmcd_data
Now we can drop the mileage column.
cd_data.drop('Mileage', axis=1 ,inplace=True)cd_data
Price column
We need to remove the Naira symbol and comma’s from the values in our price column before finally changing their data type to int.
prices=[]for i in cd_data.Price:a=i[2:].replace("," , "") #replacing comma's with an empty character (nothing).prices.append(int(a))
creating a new column for the prices and dropping the old price column we have
cd_data['Prices']=pricescd_data.drop('Price',axis=1,inplace=True)cd_data
Creating a new column for the brand names
Let’s get the brand names from the car names
brand=[]for i in cd_data.Car_name:c = i.split(' ')[0] #splliting the car names at the space and taking the first value, which is the brand namebrand.append(c)cd_data['Brand']=brandcd_data
Extracting the car models from the car names
model=[]for i in cd_data.Car_name:c = i.split(' ')[1]model.append(c)cd_data['Model']=modelcd_data
Creating a column for a cars model and year
Create a new column that will contain the model and year of the car by combining data from the year and model column.
years=[]for i in cd_data.Year:i=' '+str(i)years.append(str(i))cd_data['Years']=yearscd_data['Model_year']=cd_data['Model']+cd_data['Years']cd_data
Extracting the states from the location column.
state=[]for i in cd_data.Location:b=i.split(',')[-1]state.append(b)cd_data['State']=statecd_data
Checking for outliers
checking the price column for outliers
sns.boxplot(x=cd_data['Prices'])
From the graph above there’s an outlier with an outlier in the price column. To find the outlier, we would sort the values in the price column
cd_data.sort_values('Prices',axis=0,ascending=False)
Apart from the first car in the table above, all other cars seem reasonably priced. It is an outlier and needs to be dropped using its index.
cd_data.drop(627, axis=0, inplace=True)sns.boxplot(x=cd_data['Prices'])
Data exploration
In this stage of our analysis, answers to the following questions are what we would be searching for:
Location with the highest number of cars
Car brand listings by state
Car brand with the highest number of listings
The average price of cars per location
Car listings by model year
car model with the highest rating
Most frequent rating of the cars
Most frequent status of the cars
Average price based on the condition (status) of the cars
Status of models
Location with the highest number of cars
plt.figure(figsize=(20,8))sns.countplot(data=cd_data, x='State')
Lagos has the highest number of cars.
Car brand listings by state
pd.pivot_table(cd_data,index='State',columns='Brand', values = 'Location', aggfunc='count')
The table above shows the states on the x-axis and the car brands on the y-axis with the number of cars filling the cells. Take note that NaN represents zero (0).
Car Brand with the highest number of listings
plt.figure(figsize=(40,8))sns.countplot(data=cd_data, x='Brand')
pd.pivot_table(cd_data,index='Brand', values = 'Mileage_km', aggfunc='count')
From the table and figure above, toyota has the highest number of listings.
Average price of cars per location
pd.pivot_table(cd_data,index='State', values = 'Prices', aggfunc='mean')
More detailed information on the distribution of car prices can be gotten by grouping the car prices and filtering.
Price_group=[]for c in cd_data.Prices:if c<2000000:Price_group.append("0-1.99m")elif 1999999<c<5000000:Price_group.append("2m-4.99m")elif 4999999<c<10000000:Price_group.append("5m-9.99m")elif 9999999<c<20000000:Price_group.append("10m-19.99m")elif 19999999<c<50000000:Price_group.append("20m-40.99m")elif 40999999<c<100000000:Price_group.append("50m-99.99m")else:Price_group.append("100m-150m")cd_data['Price_group']=Price_group
pd.pivot_table(cd_data,index='State',columns='Price_group', values = 'Mileage_km', aggfunc='count')
Car listings by model year
pd.pivot_table(cd_data,index='Car_name',columns='Years',values='Mileage_km',aggfunc='count')
#take note that NaN represents zero (0)
Most frequent status of the cars
sns.countplot(data=cd_data, x='Status')
Majority of the cars are foreign used.
Average price based on the status of the cars
pd.pivot_table(cd_data,index='Status',columns='Brand',values='Prices',aggfunc='mean')
Status of models
pd.pivot_table(cd_data,index='Status',columns='Brand', values = 'Mileage_km', aggfunc='count')
There are very few cars whose status are new from each brand, with some brands not having new cars at all while all brands except Isuzu have foreign used as their status.
Thats the end of the exercise, you can get the data set using the link below
you can access the google colab notebook via the link below.
Bye for now.