Exploring and Analyzing Used Car Data Set

Irtasam Ali Wains
The Startup
Published in
11 min readNov 20, 2020

This blog post is a component of our undergraduate course of Data Science. Our group has chosen a dataset on Used Cars from Kaggle, that is between the years of 1923–2020 and contains the data on used car adverts on the craigslist website. The data set is restricted to adverts from the United States of America.

Our inspiration was how the biggest perk of buying a secondhand car is that you let the first owner handle the highest deprecation of the car. A brand new car loses its value the most in the first 2 or 3 years of use. Anyone can get a lot more for their money if they have the right information at their disposal!

Through our analysis we want to find interesting trends and correlations between the various features of the sale of the used cars.

Goals:

1. Identify the trends in the sales of cars and car industry in general.

2. Correlation between price and condition, which is then further extended by year, current location of car, number of cylinders in the vehicle and vehicle mileage.

3. Prediction of a used car’s price given its specifications

Data Set:

The Data Set before cleaning has 423857 entries with 24 columns. The data set has the following columns: URL, city, City URL, price, year, manufacturer, make, condition, cylinders, fuel, odometer, Title Status, transmission, VIN, drive, size, type, Paint color, Image URL, description, latitude, longitude. Austin Reese scraped the Craigslist’s website to obtain this data set. According to Austin’s brief introduction on the Kaggle page the data set is scraped and updated every few months.

Key Features of the Data Set:

ID — Unique ID given to every ad and is the primary key to the data set.

Price — Price is given in US dollar and has not been adjusted for inflation.

Year — The year in which the car was manufactured

Manufacturer — with 43 unique businesses engaged in the manufacture of automobiles.

Model — The exact model of the car. Like sierra classic 2500hd.

Condition — The condition of the car; excellent, good, fair, like new, salvage, new.

Cylinders — The number of cylinders in the car engine ranging from 3 to 12. Also has the ‘other’ category too.

Fuel — There were five types of fuel, ‘diesel’, ‘gas’, ‘electric’, ‘hybrid’ and ‘other’.

Odometer — This is the distance that the car has traveled after it being bought.

Status — The cars also had 6 types of statues; ‘clean’, ‘lien’, ‘rebuilt’, ‘salvage’ , ‘parts only’ and ‘missing’.

Drive — There are 3 types of drive transmissions; ‘4WD, ‘FWD’ and ‘RWD’. (Four wheel drive, forward wheel drive and rear wheel drive.)

Type — This feature identifies if a vehicle is a SUV or a mini-van. There 13 unique values in this feature.

State — The state is political territory and is represented in short form in the data set. Like “fl” is used for the state of Florida.

Latitude, Longitude — When both features are combined, they give the location of where the car is being sold at.

Data Cleaning Steps:

1) Categorizing NaN values and labelling them while reading the dataset from csv file.

2) Checking if NaN values can be replaced and removing columns that can’t be filled like the county column which was totally empty.

3) Removing redundant columns giving us the same information like region and county give us the location of the car along with State and latitude, longitude coordinates.

4) Checking for outdated information. Columns containing URLs were removed as the URLs were not functional anymore

5) Dropped all rows that had NaN values which could not be substituted

6) Ran a check for duplicate rows and removed the duplicate data entries

7) Confirmation of removal of duplicates using python’s pandas library’s unique function

8) Renamed columns with lengthy names to shorter ones

9) Changed data types of columns to make them coherent with their value. The data type of Number of cylinders in a vehicle was changed from string to integer.

10) Converted the condition of a car from labels to 0 to 5 numerical scale where 5 represents new and 0 represents salvaged such that is it is easier to see the correlation of condition and car price.

Exploratory Data Analysis (EDA):

For EDA we wanted to visualize the relation of different features with each other and their relation with price of a used car in particular. Apart from that we visualized different sale trends in used cars.

From the Correlation matrix, a few linear relations are clearly visible like the condition of a car and year it was manufactured. A car manufactured at a later year is expected to be in better condition then the ones manufactured in previous years. To further our analysis , different methods were used to visualize the features.

To start our journey of EDA we first decided to check where and how the used cars were available for sale in USA. For this purpose we used the latitude and longitude coordinates of each car and plotted it on an interactive map of USA. Each point on the map represents a used car with it’s manufacturer, model and price. From this we noticed that frequency of cars being sold was higher on the East coast then West coast of USA. Which could be because of the limitations of our data set. For demonstration purposes the graph can be seen below for the 10 percent of used cars available for sale in our data set.

Sale trends in different states of USA:

In terms of sheer volume of sales we can see that California and Florida lead the chart but the prices of these cars tell a different story.

Our hypothesis was that the average price of cars and number of cars available for sale would have an inverse relation. Average prices of cars show that Wyoming the least populous state of USA has the highest average price of cars which could be because of the volume of cars available in the region which are very low. Another key observation was that Florida has relatively cheaper cars then California even though there are more cars for sale in CA then FL. Delaware which has very few vehicles for sale but has least average prices for cars after Washington D.C. Both the cases mentioned above clearly invalidate our hypothesis.

Market Share of different car manufacturers:

Ford leads the pack with most cars up for sale from a manufacturer. This comes as no surprise since Ford’s F-Series line of pickups have been America’s Best Selling Truck for 43 years straight. For the past 38 years, Ford F-Series trucks have also been the Best Selling Vehicle in America. Electric vehicles are relatively new so we clearly see that vehicles manufactured by Tesla are higher in average price compared to other manufacturers . Porsche and Aston Martin are both manufacturers that produce luxurious super cars with premium features so it makes sense for them to have high priced vehicles.

Types of vehicle and the Fuel they use :

The graph shows the trend of hybrid and electric cars is slowly developing specially in hatch backs and sedans. In heavy weight vehicles like trucks , SUV, pickups the breakthrough is yet to occur. Diesel and gas is the primary source to power these vehicles still.

Trend of number of cars for sale according to the year they were manufactured in :

A decrease in trend in cars available for sale from 2009 show that fewer cars were manufactured in 2009 which can be due to the 2009 financial crisis, the decrease from 2014 onwards can be attributed to temporal proximity to the date the data was collected on. As the cars are already newer users are less likely to sale them ,this is confirmed by the steep decrease from 2015 onwards.

The above graphs shed light on the fact that due to inflation and other external factors with each passing year the average prices of cars and range of car prices increase.

Violin plot for condition of a car and the year it was manufactured in shows that older the car worse its condition .Car condition gets better as we move to newer model, this explains the rise in price against year boxplot as poor condition leads to lower price.

Relation of price with mileage , car engine type, condition, number of cylinders in car , type of drive of a car and status of a car:

By looking at box plots of different features and price we can draw multiple conclusions. We see that hybrid technology has reached economies of scale and average price of a hybrid car and car driven on gas is more or less the same. Moreover, the electric vehicles are more costly than hybrid and gas powered vehicles. Diesel powered vehicles are generally the most expensive , mainly due to the fact that heavy utility vehicles like trucks are the ones using a diesel engine. A general trend is seen of upper quartile range increasing as the condition of car improves. This hints that cars with better condition have more value. In terms of number of cylinders a car has , we see a mixed result and a car with 5 cylinders being the cheapest. Average price is similar for 3, 4 and 6 cylinder automobiles. Front wheel drive cars are cheaper than 4 wheel drive and rear wheel drive cars . This could be because 4 wheel drive vehicles are usually larger in size and consist of SUVs, pickups and trucks. Rear wheel drive cars are usually luxury and sport cars which are expensive because of their premium features. Cars with lien status have high average price which is seen as an anomaly since lien cars have due amount needed to paid to lender before you get full ownership of a car.

The plot shows a relation between odometer reading and the price of the vehicle. A decreasing trend can be observed. The plot also shows the relation of manufacturing year with price and odometer reading as the plot get lighter
towards the bottom right showing that older cars sell for a lower price and have travelled more. Also left side of the graph is less dense showing that there are fewer number of newer car models.

Machine Learning:

Our goal with the machine learning phase was that we wanted to train a model which when given the specifications of a car gave us the estimated price. For this purpose, we first converted all our columns to one hot encoded columns. For the odometer we normalized the values so that the range was not very large. We kept the price as it is as we will use it as our Y in the linear regression. Using all of these new columns we made a new data frame.

We trained two models, one linear regression and one Logistic Regression on this new dataset.

The logistic model was trained to predict other category based features other than the price and odometer.

The linear model gave us a very bad estimate of the prices which it always predicted as lower than the original. We figured out that this might be due to inflation and the difference in prices over the year.

To cater for this we made separate models for different years like one model for year 2018 to 2020 another for 2015 to 2017 and so on till the year 1990. We did not cater to car models below 1990 because very few people would need to buy cars of very old models.

After we had done the above our model was predicting better prices. While the predictions were not the best, they were pretty close to the original. This might be because of multiple reasons, like adjusting the prices of the vehicles for inflation. One other reason for not getting very accurate predictions we think was because these were secondhand car ads and people can put varying prices for secondhand things as they see fit. So for example person A and B might have the same car-A has a better condition car and B does not, but he would lie about the condition and say it’s good and both put up different prices for it. This way our model will predict the average price of all similar cars.

Conclusion:

After thorough analysis of the data set, we were successful in making reasonable conclusions regarding the sale trends in used cars and automobile industry generally. Many of our hypothesis regarding used cars and what factors affect their prices were proved right but some of them like more cars for sale mean cheaper cars were proved wrong. This highlights that there are more factors like taxes which are out of scope of our data set but play a significant part in affecting used car prices. For the final part we wanted to predict price of a car given the state it is being sold in , its type , title, drive, mileage and condition. We were somewhat successful in giving a general idea but because of limitations of time and data set we still have a long way to go. For experimental purposes we even trained a logistical model. Our logistical model was not accurate but could make an educated guess regarding any feature of a car given other features . Looking forward, we would like to acquire more data for accurate prediction of price so that if someone wants to buy a used car they have an idea of what it would cost them before hand and not end up paying more than the car’s worth.

--

--