Exploratory data analysis on a web scraped Checki data set

Published in

Analytics Vidhya

7 min readJul 23, 2021

This is an EDA exercise to get insights from a CSV dataset scraped from Checki, an online automobile store. It is a data set containing information about cars for sale, their location, model year, etc. I executed the exercise in collaboration with Hard Core and @dakinbuwa, and the dataset is available on GitHub using the link at the end of this post.

Expected outcomes are insights such as the location with the highest number of car listings, car brand listings by state, car Model with the highest number of listings, the average price of cars per location, and so on. To achieve these, we have to clean our data and do some data engineering using pandas and NumPy for data manipulation, matplot, and seaborn for visualizations.

Time to import the necessary libraries

import numpy as npimport seaborn as snsimport pandas as pdimport matplotlib.pyplot as plt%matplotlib inline

I executed this exercise on google colab; thus, the next step is to load my google drive account where the data set is stored. You can skip this step if you intend on working locally using Jupyter or any other IDE. If you want to set up a google colab notebook, you can check out my previous post where I outlined how to set up a colab notebook using the link below.

A SIMPLE GUIDE TO BUILDING AN IMAGE CLASSIFIER WITH CNN (using google colab)

Today we are going to build an image classifier using CNN. For this project we would train our model to classify two…

usmanbiu.medium.com

Time to mount google drive to google colab

from google.colab import drivedrive.mount('/gdrive')

The next step is to create our file path, make sure you edit the file path to match the location of the data set on your system.

file_path='/gdrive/My Drive/Ai_projects/checki EDA/Cheki_ng_scraping.csv'cd= pd.read_csv(file_path)

Lets have a look at our data set

cd.head

let’s have a look at the data type of each column and check if there are any missing or null values

cd.info()

The price and mileage columns are of the object data type; we will change them to ‘int’ when necessary. From the above, we have no missing or null values.
Creating a new data frame from our data set using only necessary columns (dropping the unnamed column)

cols=['Car_name', 'Year', 'Price', 'Location', 'Status', 'Ratings','Mileage']cd_data = cd[cols]
cd_data.head(10)

Data cleaning and feature engineering

Mileage column

The mileage are in km and miles, we need to create a new column where all values for the mileages are converted to km

km=[]  #creating and empty set to hold the converted mileagesfor i in cd_data.Mileage: # for each value in the mileage column    if i[-2:]=='km':  # if the last two characters are 'km'        km.append(i[:-3].replace(',','')) #add the mileage to the               empty set 'km' without the units while stripping of comma's by replacing them with an empty character.    if i[-2:]=='mi': # if the last two characters are 'mi'        j=(i[:-3].replace(',',''))  #store the mileage with the 'j' variable without the units while stripping of comma's        x=int(j)*1.6  #converting the unit from miles to km        km.append(x) #add the mileage to the empty set 'km'

Creating a new column with the mileage all in km and having a look at the new data set

cd_data['Mileage_km']=kmcd_data

Now we can drop the mileage column.

cd_data.drop('Mileage', axis=1 ,inplace=True)cd_data

Price column

We need to remove the Naira symbol and comma’s from the values in our price column before finally changing their data type to int.

prices=[]for i in cd_data.Price:a=i[2:].replace("," , "") #replacing comma's with an empty character (nothing).prices.append(int(a))

creating a new column for the prices and dropping the old price column we have

cd_data['Prices']=pricescd_data.drop('Price',axis=1,inplace=True)cd_data

Creating a new column for the brand names

Let’s get the brand names from the car names

brand=[]for i in cd_data.Car_name:c = i.split(' ')[0] #splliting the car names at the space and taking the first value, which is the brand namebrand.append(c)cd_data['Brand']=brandcd_data

Extracting the car models from the car names

model=[]for i in cd_data.Car_name:c = i.split(' ')[1]model.append(c)cd_data['Model']=modelcd_data

Creating a column for a cars model and year

Create a new column that will contain the model and year of the car by combining data from the year and model column.

years=[]for i in cd_data.Year:i=' '+str(i)years.append(str(i))cd_data['Years']=yearscd_data['Model_year']=cd_data['Model']+cd_data['Years']cd_data

Extracting the states from the location column.

state=[]for i in cd_data.Location:b=i.split(',')[-1]state.append(b)cd_data['State']=statecd_data

Checking for outliers

checking the price column for outliers

sns.boxplot(x=cd_data['Prices'])

From the graph above there’s an outlier with an outlier in the price column. To find the outlier, we would sort the values in the price column

cd_data.sort_values('Prices',axis=0,ascending=False)

Apart from the first car in the table above, all other cars seem reasonably priced. It is an outlier and needs to be dropped using its index.

cd_data.drop(627, axis=0, inplace=True)sns.boxplot(x=cd_data['Prices'])

Data exploration

In this stage of our analysis, answers to the following questions are what we would be searching for:
Location with the highest number of cars
Car brand listings by state
Car brand with the highest number of listings
The average price of cars per location
Car listings by model year
car model with the highest rating
Most frequent rating of the cars
Most frequent status of the cars
Average price based on the condition (status) of the cars
Status of models

Location with the highest number of cars

plt.figure(figsize=(20,8))sns.countplot(data=cd_data, x='State')

Lagos has the highest number of cars.

Car brand listings by state

pd.pivot_table(cd_data,index='State',columns='Brand', values = 'Location', aggfunc='count')

The table above shows the states on the x-axis and the car brands on the y-axis with the number of cars filling the cells. Take note that NaN represents zero (0).

Car Brand with the highest number of listings

plt.figure(figsize=(40,8))sns.countplot(data=cd_data, x='Brand')

pd.pivot_table(cd_data,index='Brand', values = 'Mileage_km', aggfunc='count')

From the table and figure above, toyota has the highest number of listings.

Average price of cars per location

pd.pivot_table(cd_data,index='State', values = 'Prices', aggfunc='mean')

More detailed information on the distribution of car prices can be gotten by grouping the car prices and filtering.

Price_group=[]for c in cd_data.Prices:if c<2000000:Price_group.append("0-1.99m")elif 1999999<c<5000000:Price_group.append("2m-4.99m")elif 4999999<c<10000000:Price_group.append("5m-9.99m")elif 9999999<c<20000000:Price_group.append("10m-19.99m")elif 19999999<c<50000000:Price_group.append("20m-40.99m")elif 40999999<c<100000000:Price_group.append("50m-99.99m")else:Price_group.append("100m-150m")cd_data['Price_group']=Price_group
pd.pivot_table(cd_data,index='State',columns='Price_group', values = 'Mileage_km', aggfunc='count')

Car listings by model year

pd.pivot_table(cd_data,index='Car_name',columns='Years',values='Mileage_km',aggfunc='count')
#take note that NaN represents zero (0)

Most frequent status of the cars

sns.countplot(data=cd_data, x='Status')

Majority of the cars are foreign used.

Average price based on the status of the cars

pd.pivot_table(cd_data,index='Status',columns='Brand',values='Prices',aggfunc='mean')

Status of models

pd.pivot_table(cd_data,index='Status',columns='Brand', values = 'Mileage_km', aggfunc='count')

There are very few cars whose status are new from each brand, with some brands not having new cars at all while all brands except Isuzu have foreign used as their status.

Thats the end of the exercise, you can get the data set using the link below

GitHub - usmanbiu/Exploratory-data-analysis-on-a-web-scrapped-data-set-of-checki-automobiles: This…

This is an EDA excercise to get some insights from a csv dataset scrapped from checki, an online automobile store…

github.com

you can access the google colab notebook via the link below.

Google Colaboratory

Edit description

colab.research.google.com

Bye for now.