Exploratory Data Analysis (EDA) — Data Science With Craigslist Data Part 4
When you get your hands on some new data, the first questions go something like this: How much data is there, as in rows and columns? How much is missing? How many variables are there? How many numerical fields? What relationships exist in the data? Is the data skewed in a way that could lead to the wrong conclusions? The process of getting answers to these kinds of questions is exploratory data analysis.
In previous posts of this series, we scraped vehicle ads from Craigslist using Python in part 2 and then performed some data cleaning in part 3. In this part we’ll perform some EDA on the resulting dataset. You can find a copy of the Jupyter notebook on my GitHub if you’d like to follow along.
First things first: we load the necessary Python libraries, read the cleaned data into a Jupyter notebook environment, check the data types and reformat them as needed.
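As a rough sketch of that first step, the snippet below reads a tiny made-up CSV from a string instead of the real cleaned file, so the column names (`city`, `price`, `year`, `posted`) are assumptions and may not match the notebook.

```python
import io

import pandas as pd

# Tiny made-up stand-in for the cleaned CSV; the real file and its
# column names may differ.
csv_text = """city,price,year,posted
Miami,4500,2008,2021-10-01 09:15:00
Chicago,12000,2015,2021-10-02 14:30:00
Seattle,,2004,2021-10-03 18:05:00
"""

# Parse the posting timestamp as a datetime while reading.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["posted"])

print(df.shape)         # (rows, columns)
print(df.dtypes)        # data type of each column
print(df.isna().sum())  # missing values per column

# Reformat where needed, e.g. treat city as a categorical variable.
df["city"] = df["city"].astype("category")
```

With the real data you would pass the path of the cleaned file to `pd.read_csv` instead of the in-memory string.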
In total, we have 35,887 rows and 25 columns. We’ll start our exploration by seeing how many cities there are in the dataset and how much data we have on each city. A quick way to visualize this is using a histogram which we’ll create using the matplotlib library.
Most of the cities have around 3,000 records. The exceptions are Philadelphia, San Antonio, Washington DC, Boston and Nashville, which have between 800 and 1,800 records each. The reason is that Craigslist maintains a maximum of 3,000 records per city for each post category, and some cities do not have enough posts in the vehicle ads category to reach that cap.
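One way to get the per-city counts and the chart is sketched below. The data is made up and the `city` column name is an assumption. The `Agg` backend line is only there so the snippet runs outside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample; "city" is a hypothetical column name.
df = pd.DataFrame({"city": ["Miami"] * 3 + ["Boston"] + ["Chicago"] * 2})

# Number of ads per city, largest first.
city_counts = df["city"].value_counts()

fig, ax = plt.subplots()
city_counts.plot(kind="bar", ax=ax)
ax.set_xlabel("City")
ax.set_ylabel("Number of ads")
fig.tight_layout()
```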
This data was downloaded in late October 2021, but we can also see the date each vehicle ad was posted. Pandas is quite handy at handling time series data, so we’ll just group the data by year, month and day to see the trends.
The earliest post was on 9/24/2021 and the last one was on 10/24/2021. That’s a month’s worth of vehicle ads. But on what days were the ads posted?
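A minimal sketch of that grouping, assuming the posting timestamp lives in a datetime column called `posted` (a hypothetical name) and using a handful of made-up rows:

```python
import pandas as pd

# Made-up timestamps; "posted" is a hypothetical column name.
df = pd.DataFrame({
    "posted": pd.to_datetime([
        "2021-09-24 08:00",
        "2021-10-01 12:30",
        "2021-10-01 17:45",
        "2021-10-24 10:10",
    ])
})

first_post = df["posted"].min()
last_post = df["posted"].max()

# Count ads per calendar day by grouping on year, month and day.
posted = df["posted"]
posts_per_day = df.groupby([
    posted.dt.year.rename("year"),
    posted.dt.month.rename("month"),
    posted.dt.day.rename("day"),
]).size()

print(first_post, last_post)
print(posts_per_day)
```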
Saturday is the most popular day to post vehicle ads on Craigslist, followed by Sunday, Thursday, Wednesday and Monday in that order, and Tuesday is the slowest day for new ad posts. That’s great to know, but at what time of day do these posts happen?
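The weekday breakdown can be approximated with the `.dt.day_name()` accessor. This is only a sketch on made-up dates, with `posted` again standing in for whatever the timestamp column is really called.

```python
import pandas as pd

# Made-up dates; "posted" is a hypothetical column name.
df = pd.DataFrame({
    "posted": pd.to_datetime([
        "2021-10-01", "2021-10-02", "2021-10-02", "2021-10-05",
    ])
})

weekday_order = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]

# Count ads per weekday and keep calendar order, filling absent days with 0.
posts_by_weekday = (
    df["posted"].dt.day_name()
    .value_counts()
    .reindex(weekday_order, fill_value=0)
)

print(posts_by_weekday)
print("Busiest day:", posts_by_weekday.idxmax())
```

Reindexing to a fixed weekday order keeps the resulting bar chart readable, since `value_counts` on its own sorts by frequency.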
It looks like the hours between 9am and 6pm are the most popular for posting vehicle ads, with a peak between 10am and 3pm.
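To put a number on the daytime window, one option is to extract the hour and measure the share of posts that fall inside it. The timestamps below are made up and `posted` is a hypothetical column name.

```python
import pandas as pd

# Made-up timestamps; "posted" is a hypothetical column name.
df = pd.DataFrame({
    "posted": pd.to_datetime([
        "2021-10-01 08:30",
        "2021-10-01 10:15",
        "2021-10-02 10:45",
        "2021-10-02 14:00",
        "2021-10-03 21:20",
    ])
})

hours = df["posted"].dt.hour

# Ads per hour of day, in clock order.
posts_by_hour = hours.value_counts().sort_index()

# Hours 9 through 17 cover posts made from 9:00am up to 5:59pm.
daytime_share = hours.between(9, 17).mean()

print(posts_by_hour)
print(f"Share posted between 9am and 6pm: {daytime_share:.0%}")
```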
Now that we’ve examined the time components of the data, let’s look at some more variables. Let’s start with the price distribution across the data.
There are 34K-plus vehicles, with roughly 11K in the range of $0 to $5,000 and another 11K in the range of $5,000 to $10,000. This means that roughly 65% of all vehicle ads (about 22K of 34K) are priced below ten thousand dollars. Now I’m just burning to know when these cars were made.
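The price bands behind those numbers can be reproduced with `pd.cut`. The sketch below uses made-up prices and bin edges chosen for illustration, so the shares it prints are not the ones from the real data.

```python
import pandas as pd

# Made-up prices for illustration only.
prices = pd.Series(
    [1500, 4200, 4999, 7500, 9800, 12500, 31000, 8000], name="price"
)

# Left-closed bands: [0, 5000), [5000, 10000), and so on.
bins = [0, 5000, 10000, 20000, 50000]
price_bands = pd.cut(prices, bins=bins, right=False)

# Number of ads per band, in band order.
band_counts = price_bands.value_counts().sort_index()

# Share of ads priced below ten thousand dollars.
share_under_10k = (prices < 10000).mean()

print(band_counts)
print(f"Share under $10,000: {share_under_10k:.0%}")
```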
It turns out that most of these cars were made between 2004 and 2016, with a peak year of 2008. The modal age of a car on Craigslist is around 12 years. Next, I wonder which vehicle brands dominate these ads.
Across all 15 cities in our dataset, Ford, Chevrolet and Toyota are the top three car makes on Craigslist, in that order, together making up about 40% of all vehicles in the data. The top two are American brands, while ranks 3, 4 and 5 are Japanese brands. These numbers might change a bit if you group them by parent company, but the parent companies of Ford and Chevrolet would still have a significant lead over the other brands. We also have the color attribute, so we’ll go ahead and see what the most popular colors are.
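A combined share like the one for the top three makes can be computed with `value_counts(normalize=True)`. This is a sketch on a made-up list of makes, so the resulting percentage is illustrative only.

```python
import pandas as pd

# Made-up makes for illustration; the real column may be named differently.
makes = pd.Series([
    "ford", "ford", "chevrolet", "toyota", "honda",
    "ford", "chevrolet", "nissan", "toyota", "bmw",
])

# Fraction of all ads per make, most common first.
make_share = makes.value_counts(normalize=True)

# Combined share of the three most common makes.
top3_share = make_share.head(3).sum()

print(make_share.head(5))
print(f"Top 3 makes: {top3_share:.0%} of ads")
```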
Well, the color white is the most popular followed by black and in third place we have silver. Blue and red also appear a lot in ads making up around 5,000 of the 34K records.
We have three variables left to explore which are vehicle transmission, number of engine cylinders and title status. Let’s visualize those as well.
Automatic is by far the most common transmission type, and clean is the most common title status. Six-cylinder engines are the most common, but not by much compared to second-ranked 8 cylinders and third-ranked 4 cylinders. These top 3 account for more than 90% of all records. All other cylinder configurations appear in very small numbers.
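For several categorical columns at once, a small loop keeps things tidy. The column names and values below (`transmission`, `cylinders`, `title_status`) are assumptions made for this sketch.

```python
import pandas as pd

# Made-up rows; column names and category labels are hypothetical.
df = pd.DataFrame({
    "transmission": ["automatic", "automatic", "manual", "automatic"],
    "cylinders": ["6 cylinders", "8 cylinders", "6 cylinders", "4 cylinders"],
    "title_status": ["clean", "clean", "clean", "salvage"],
})

# For each column, record the most common value and its share of all rows.
summary = {}
for col in ["transmission", "cylinders", "title_status"]:
    shares = df[col].value_counts(normalize=True)
    summary[col] = (shares.index[0], float(shares.iloc[0]))

for col, (value, share) in summary.items():
    print(f"{col}: {value} ({share:.0%})")
```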
So far, we have familiarized ourselves with the variables and know where to start when asking deeper questions of the data, such as how one variable relates to another, or whether cars are cheaper in Miami than in, say, Chicago. More importantly, our EDA has also given us an inkling of the data’s limitations. Let’s take a look at price and vehicle age, for example.
While not enough to make a solid decision, we can see from the boxplots above that the average price of a car is lowest in Atlanta, Philadelphia and Washington DC. Phoenix and Los Angeles have the highest average price. Most of the high averages are driven by the high variance that exists in price within each city.
Looking at the vehicle ages, we can observe that, on average, Los Angeles and Seattle have the oldest cars in the group, at an average of 6 years old. Miami and New York have the newest cars, at an average of around 4½ years old. That’s not much of a difference, but it does give you an idea of which cities have more new cars than old ones. This point is further supported by the boxplot above, which shows that the 25th percentile of cars within those cities was manufactured around the year 2000.
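A per-city comparison like this can be sketched with a grouped summary plus a pandas boxplot. The data and the `city` and `price` column names are made up. Note that the center line of a boxplot is the median, so it helps to print the mean next to it when a few expensive listings stretch the average.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import pandas as pd

# Made-up rows; "city" and "price" are hypothetical column names.
df = pd.DataFrame({
    "city": ["Atlanta", "Atlanta", "Atlanta", "Phoenix", "Phoenix", "Phoenix"],
    "price": [3000, 5000, 7000, 4000, 6000, 50000],
})

# Mean, median and spread of price within each city.
city_stats = df.groupby("city")["price"].agg(["mean", "median", "std"])
print(city_stats)

# One box per city; a single outlier lifts the Phoenix mean, not its median.
ax = df.boxplot(column="price", by="city")
ax.set_ylabel("Price ($)")
```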
The histogram and KDE plots below show the same information as above but from a different angle.
The vehicle age distribution of cars in all 15 cities is skewed to the left with most cars made between 2004 and 2015. Philadelphia, San Antonio, Boston and Nashville have fewer cars overall which is explained by a much smaller sample size.
The price in all 15 cities is skewed to the right with most cars ranging from $5,000 to $10,000 in price. Price and vehicle age combined yield the scatterplot below which helps us understand some of the relationship that exists between these two factors.
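As a numeric cross-check of the skew seen in the plots, pandas exposes `Series.skew()`: a positive value means a long right tail, a negative value a long left tail. The values below are made up to mimic the shapes described, not taken from the dataset.

```python
import pandas as pd

# Made-up values shaped like the distributions described in the text.
prices = pd.Series([2000, 4000, 5000, 6000, 7000, 8000, 9000, 45000])
years = pd.Series([1965, 2004, 2006, 2008, 2010, 2012, 2014, 2015])

price_skew = prices.skew()  # a few expensive cars pull the tail to the right
year_skew = years.skew()    # a few very old cars pull the tail to the left

print(f"price skew: {price_skew:.2f}, mean {prices.mean():.0f}, median {prices.median():.0f}")
print(f"year skew:  {year_skew:.2f}")
```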
By analyzing the price vs vehicle age relationship, we can see that newer car models fetch a higher price than older ones. Furthermore, the worse the vehicle condition, the lower the price on average. We can also observe that very old vehicles, made between 1925 and 1975, tend to fetch a high price, with Los Angeles and Phoenix leading the count of these older vehicles.
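A scatterplot of this kind could be drawn roughly as follows, with one color per vehicle condition. The rows are made up and the `year`, `price` and `condition` column names are assumptions.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows; column names and condition labels are hypothetical.
df = pd.DataFrame({
    "year": [1967, 2004, 2008, 2012, 2016, 2019],
    "price": [38000, 3500, 5200, 9000, 16500, 24000],
    "condition": ["excellent", "fair", "good", "good", "excellent", "excellent"],
})

fig, ax = plt.subplots()

# One scatter series per condition so each gets its own color and legend entry.
for condition, group in df.groupby("condition"):
    ax.scatter(group["year"], group["price"], label=condition, alpha=0.7)

ax.set_xlabel("Model year")
ax.set_ylabel("Price ($)")
ax.legend(title="Condition")
```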
Finally, we’ll create a correlation matrix to give us a higher-level view of the numerical variables (columns) in the data. Some of the categorical variables can also be encoded as numerical variables for the correlation analysis. The columns that fit this categorical encoding are: year, condition, cylinders, drive (manual or automatic), fuel, size, title status, transmission and make. We use pandas.Categorical to do the encoding and create new numerical columns while keeping the old categorical ones.
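Here is a minimal sketch of that encoding step on made-up data. The `_code` suffix for the new columns is my own choice for illustration, not necessarily what the notebook uses.

```python
import pandas as pd

# Made-up rows; column names and labels are hypothetical.
df = pd.DataFrame({
    "condition": ["good", "excellent", "fair", "good"],
    "fuel": ["gas", "diesel", "gas", "gas"],
})

# Add an integer-coded copy of each categorical column, keeping the original.
# By default the codes follow the sorted order of the labels.
for col in ["condition", "fuel"]:
    df[col + "_code"] = pd.Categorical(df[col]).codes

# For a column with a natural ranking, the order can be given explicitly.
ranked = pd.Categorical(
    df["condition"], categories=["fair", "good", "excellent"], ordered=True
)
df["condition_rank"] = ranked.codes

print(df)
```

Keep in mind that default codes are just alphabetical positions, so a correlation computed on them is only meaningful for columns where the order carries information, such as condition. Missing values are coded as -1.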
Then we create a correlation matrix plot with Seaborn.
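A correlation heatmap along these lines can be sketched with `DataFrame.corr()` and `seaborn.heatmap`. The numbers are made up, and `odometer` is a hypothetical column standing in for mileage.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up rows; column names are hypothetical.
df = pd.DataFrame({
    "price": [3000, 6000, 9000, 15000, 22000],
    "year": [2004, 2008, 2011, 2015, 2019],
    "odometer": [180000, 140000, 110000, 60000, 20000],
})

# Pairwise Pearson correlation between the numeric columns.
corr = df.corr()

fig, ax = plt.subplots()
sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm", ax=ax)
fig.tight_layout()

print(corr.round(2))
```

Fixing `vmin` and `vmax` at -1 and 1 keeps the color scale comparable between plots, which helps when most correlations are weak.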
At a high level, it seems that none of the variables have a strong correlation with each other. While this may be true, we can now look more closely at individual variable values for hidden correlations that may not emerge at the moment because of the noise in the data.
When shopping for a car, everyone wants good value for their money. Value is mostly determined by looking at four basic things: price, age, mileage and vehicle condition. We also have drivetrain, fuel type and transmission, among other data points, which further inform the price. In the next section, part 5 of this series, we’ll do a thorough deep dive into these fields and tease out any interrelationships that exist between them. As always, please let me know if you have any questions in the comment section below.