Finding Similar Cars by Using K-Means Clustering

Pranav Vijay
INST414: Data Science Techniques
6 min read · Apr 11, 2024

Question: What cars in the market are similar to the car I own currently?

The stakeholder asking this question is someone shopping for a new car. They want the new car to have features similar to those of the car they own currently, because they are happy with those features and don’t want to adjust to different ones.

The decision the stakeholder will make after finding the answer to this question is buying a new car that is very similar to the one they own now.

The data that can answer this question is a dataset of cars from the last 10 years. The fields of the dataset would be features of a car that are of importance to the stakeholder, like the horsepower and fuel capacity of cars. This dataset is relevant to my question since I can use the features of cars to calculate similarity between cars and find clusters of cars similar to the car the user owns currently.

I collected a subset of this data from Kaggle, a website that hosts many downloadable datasets. The dataset I downloaded is called “Car Sales” and was uploaded by Gaganbhatia. I analyzed it in a Jupyter notebook running a Python kernel. I downloaded the dataset’s CSV file from Kaggle and loaded it into a Pandas dataframe with the read_csv() function. I imported the KMeans class from the sklearn.cluster module to run K-means clustering and find similar data points in my dataframe. I also imported the matplotlib.pyplot module to plot the Elbow Method graph, which helped me choose a value for k, the number of clusters for my dataset.
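The loading step described above might look roughly like the sketch below. The file name and the column names are assumptions based on the Kaggle listing, and the rows are made up so the snippet runs without the download.

```python
import io

import pandas as pd

# Stand-in for the downloaded CSV so the snippet runs anywhere. These rows
# are made up for illustration and are NOT taken from the real dataset.
# With the real download, pass the file path to pd.read_csv() instead
# (for example "Car_sales.csv", which is an assumed file name).
csv_text = """Manufacturer,Model,Horsepower,Width,Length,Fuel_capacity,Power_perf_factor
MakeA,Small,110,66.0,172.0,12.5,45.0
MakeB,Medium,200,71.0,193.0,17.5,82.0
MakeC,Large,300,74.0,198.0,22.0,125.0
"""

# read_csv() accepts a file path or any file-like object.
cars = pd.read_csv(io.StringIO(csv_text))

print(cars.shape)
print(cars.head())
```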

The similarity metric I am using is Euclidean distance. I measure similarity between data points with K-means clustering, which uses Euclidean distance to assign each data point to the cluster with the nearest center. The features I am using to measure similarity are the horsepower, width, length, fuel capacity, and power performance factor of the cars.
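To make the metric concrete, here is a small sketch of the Euclidean distance between two cars described by the five features above. Both cars and all of their values are hypothetical.

```python
import numpy as np

# Two hypothetical cars, each described by
# [horsepower, width, length, fuel capacity, power performance factor].
car_a = np.array([140.0, 67.0, 172.0, 13.0, 58.0])
car_b = np.array([225.0, 70.0, 192.0, 17.0, 91.0])

# Euclidean distance: square root of the summed squared differences.
distance = np.sqrt(np.sum((car_a - car_b) ** 2))

# np.linalg.norm of the difference vector computes the same quantity.
same_distance = np.linalg.norm(car_a - car_b)

print(distance, same_distance)
```

In this made-up pair, the horsepower gap (85) is the largest single difference, so it contributes the most to the distance.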

I selected a value for k, the number of clusters, by using the Elbow Method. In the Elbow Method graph I generated from my dataframe, the curve starts to level off at k=3, so I used 3 clusters for my dataset.
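One common way to produce an Elbow Method graph is to fit K-means for a range of k values and plot each model's inertia (the within-cluster sum of squared distances). The sketch below does this on made-up data. The range of k values, the random seed, and the synthetic feature matrix are all assumptions, not the notebook's actual settings.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

# Made-up feature matrix standing in for the five car features: three loose
# groups of 20 points each, so an elbow near k=3 is plausible here.
rng = np.random.default_rng(0)
centers = np.array([
    [120.0, 67.0, 175.0, 13.0, 50.0],
    [200.0, 71.0, 190.0, 17.0, 80.0],
    [300.0, 74.0, 200.0, 22.0, 125.0],
])
X = np.vstack([c + rng.normal(scale=3.0, size=(20, 5)) for c in centers])

k_values = list(range(1, 9))
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(X)
    # inertia_ is the sum of squared distances to the nearest cluster center.
    inertias.append(model.inertia_)

fig, ax = plt.subplots()
ax.plot(k_values, inertias, marker="o")
ax.set_xlabel("Number of clusters (k)")
ax.set_ylabel("Inertia")
ax.set_title("Elbow Method")
# In a notebook the figure renders inline; in a script, call plt.show().
```

The elbow is the k after which adding more clusters reduces the inertia only slightly.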

I took the average of each feature within each cluster to better understand how cars were assigned to clusters. The averages show that the width feature did not play a significant part in determining the clusters, since the average width values of the three clusters were very similar to each other.
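Per-cluster averages like these can be computed by labeling each row with its cluster and grouping on that label. This is a minimal sketch with made-up values. The column names are assumed to match the dataset, and the random seed is arbitrary.

```python
import pandas as pd
from sklearn.cluster import KMeans

feature_cols = ["Horsepower", "Width", "Length", "Fuel_capacity", "Power_perf_factor"]

# Made-up rows for illustration only (not from the real dataset).
features = pd.DataFrame(
    [
        [110, 66.5, 172.0, 12.5, 45.0],
        [120, 67.0, 175.0, 13.0, 48.0],
        [200, 70.5, 192.0, 17.0, 80.0],
        [210, 71.0, 194.0, 17.5, 84.0],
        [300, 73.5, 198.0, 22.0, 125.0],
        [310, 74.0, 200.0, 23.0, 130.0],
    ],
    columns=feature_cols,
)

# Fit K-means with k=3 and store each row's cluster label.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
features["Cluster"] = kmeans.fit_predict(features[feature_cols])

# Average of every feature within each cluster, plus the cluster sizes.
cluster_means = features.groupby("Cluster")[feature_cols].mean()
cluster_sizes = features["Cluster"].value_counts().sort_index()

print(cluster_means)
print(cluster_sizes)
```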

The first cluster (Cluster 0) represents cars that have the shortest average length, lowest average horsepower, lowest average fuel capacity, and lowest average power performance factor. 50 cars were assigned to the first cluster. Two examples from this cluster are the Acura Integra and the Toyota Tacoma.

The second cluster (Cluster 1) represents cars that have the second-longest average length, highest average horsepower, highest average fuel capacity, and highest average power performance factor. 11 cars were assigned to the second cluster. Two examples from this cluster are the Lexus LS400 and the Audi A8.

The third cluster (Cluster 2) represents cars that have the longest average length, second-highest average horsepower, second-highest average fuel capacity, and second-highest average power performance factor. 56 cars were assigned to the third cluster. Two examples from this cluster are the Buick Park Avenue and the Infiniti I30.

By using K-means clustering for my data about car models, I am able to group similar cars into clusters based on the features I selected. The stakeholder can find the car they own in one of the clusters, and they can see other cars in the same cluster that are similar to the car they own. If the user owns an Acura Integra car, four similar cars the user can buy are the Toyota Tacoma, Plymouth Voyager, Ford Ranger, and Volkswagen Golf. If the user owns a Lexus LS400 car, four similar cars are the Audi A8, Chevrolet Corvette, Cadillac DeVille, and Cadillac Eldorado. If the user owns a Buick Park Avenue car, four similar cars are the Infiniti I30, Chrysler Sebring Convertible, Pontiac Grand Am, and Porsche Boxter.
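The lookup described above can be sketched as a small helper. The similar_cars function is hypothetical and not taken from the notebook. The table reuses a few of the cluster assignments reported in this post, and the "Cluster" column name is an assumption.

```python
import pandas as pd

# A few cars with the cluster assignments reported in this post.
clustered = pd.DataFrame(
    {
        "Manufacturer": ["Acura", "Toyota", "Lexus", "Plymouth", "Audi", "Ford", "Volkswagen"],
        "Model": ["Integra", "Tacoma", "LS400", "Voyager", "A8", "Ranger", "Golf"],
        "Cluster": [0, 0, 1, 0, 1, 0, 0],
    }
)


def similar_cars(clustered, manufacturer, model, n=4):
    """Return up to n other cars from the same cluster as the owned car."""
    is_owned = (clustered["Manufacturer"] == manufacturer) & (clustered["Model"] == model)
    if not is_owned.any():
        raise ValueError(f"{manufacturer} {model} is not in the table")
    owned_cluster = clustered.loc[is_owned, "Cluster"].iloc[0]
    # Keep cars with the same cluster label, excluding the owned car itself.
    same_cluster = clustered[(clustered["Cluster"] == owned_cluster) & ~is_owned]
    return same_cluster[["Manufacturer", "Model"]].head(n)


result = similar_cars(clustered, "Acura", "Integra")
print(result)
```

Cars are returned in table order. Ranking them by distance to the owned car within the cluster would be a natural refinement.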

Here is a table of the first cluster with the average values of car features:

Here is a table of the second cluster with the average values of car features:

Here is a table of the third cluster with the average values of car features:

Here is a table for each of the three clusters with 5 cars identified for each cluster:

Here is a figure of the Elbow Method graph I generated to find the k value:

I cleaned up my data by checking for missing values. Some rows were missing data, so I used the dropna() function to drop the rows that contained null values. I also checked the dataset for duplicates and found none. Finally, I removed the columns that weren’t needed in my current data analysis: “Manufacturer,” “Model,” “Sales_in_thousands,” “__year_resale_value,” “Vehicle_type,” “Price_in_thousands,” “Wheelbase,” “Curb_weight,” “Engine_size,” “Fuel_efficiency,” and “Latest_launch.” These weren’t features that I wanted to focus on when creating clusters for the dataset. I later used the values in the “Manufacturer” and “Model” columns to print a sample of cars for each of the three clusters, since these columns make the cars easier to identify.
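The cleaning steps above could be sketched as follows. The rows are made up, only three of the dropped columns are shown, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; one row has a missing value.
raw = pd.DataFrame(
    {
        "Manufacturer": ["MakeA", "MakeB", "MakeC"],
        "Model": ["Small", "Medium", "Large"],
        "Sales_in_thousands": [16.9, 39.4, 14.1],
        "Horsepower": [110.0, np.nan, 300.0],
        "Width": [66.0, 71.0, 74.0],
        "Length": [172.0, 193.0, 198.0],
        "Fuel_capacity": [12.5, 17.5, 22.0],
        "Power_perf_factor": [45.0, 82.0, 125.0],
    }
)

# Drop every row that contains at least one null value.
cleaned = raw.dropna()

# Count fully duplicated rows (0 means there are none).
n_duplicates = int(cleaned.duplicated().sum())

# Keep the identifying columns aside for printing cluster samples later.
labels = cleaned[["Manufacturer", "Model"]]

# Remove the columns that are not used as clustering features.
unused = ["Manufacturer", "Model", "Sales_in_thousands"]
features = cleaned.drop(columns=unused)

print(len(raw), len(cleaned), n_duplicates)
print(list(features.columns))
```

Because dropna() keeps the original row index, labels and features stay aligned row for row.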

One limitation of my analysis is that the dataset only contains car sales from 2011 to 2012. The stakeholder may currently own a car that they bought after 2012, or may want to buy a car that was released more recently, so many car models they may be interested in are missing from this dataset. The analysis may also be biased because it considers only the horsepower, width, length, fuel capacity, and power performance factor of the cars. The stakeholder may prioritize other car features when looking for similar cars to buy.

Here is a link to my GitHub repository that contains the Jupyter notebook that I used to do K-means clustering on the dataset to group the cars into clusters. The GitHub repository also contains the CSV file of the original dataset that I analyzed.

Link: https://github.com/pvijay2024/module4

Here is a link to the original Kaggle dataset I used to create clusters:

Link: https://www.kaggle.com/datasets/gagandeep16/car-sales
