Cluster Analysis of a Used Vehicles Dataset

Jasmi Kevadia
INST414: Data Science Techniques
Apr 25, 2023

Introduction:

In this post I will be working with a dataset of used vehicles for sale, obtained from Kaggle. The dataset contains information about various used vehicles, including model, year, price, and other features. I aim to extract non-obvious insights from this data using clustering techniques.

Insights and audience benefits:

Finding cars with similar qualities can help car dealerships and car manufacturers make decisions about pricing, marketing, and inventory management. By identifying groups of cars with similar features and characteristics, dealerships can better understand their customers' preferences and tailor their marketing strategies accordingly. Similarly, manufacturers can use this information to design and produce cars that match customer preferences and demand.

Data Collection:

The dataset used for this analysis was obtained from Kaggle and contains information about used vehicles for sale. It consists of a single CSV file with over 7,000 rows and 13 columns.

To measure similarity between vehicles, I used a combination of features, including make, model, year, price, and mileage. These features give a general picture of a vehicle’s characteristics and are likely to influence buyer preferences and pricing decisions. I used Euclidean distance as the similarity metric, which is a common choice for numerical features. To choose the number of clusters, I used the “elbow” method: fit the K-Means clustering algorithm with different values of k, plot the within-cluster sum of squares against the number of clusters, and look for the point where the curve stops dropping sharply. After experimenting with different values of k, I found that k=4 resulted in the most distinct clusters.
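The elbow procedure described above can be sketched roughly as follows. This is a minimal illustration and not the code from the linked notebook: the helper names `kmeans_wcss` and `best_wcss`, the synthetic two-dimensional data, and the range of k values are all assumptions made for the example. It uses a small NumPy implementation of k-means with squared Euclidean distance; scikit-learn's `KMeans` reports the same quantity as `inertia_`.

```python
import numpy as np


def kmeans_wcss(X, k, n_iter=100, seed=0):
    """Run a basic Lloyd's k-means and return the within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its points (keep it if the cluster is empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float(dists.min(axis=1).sum())


def best_wcss(X, k, n_init=10):
    """Repeat k-means from several random starts and keep the lowest WCSS."""
    return min(kmeans_wcss(X, k, seed=s) for s in range(n_init))


# Synthetic stand-in data (NOT the Kaggle dataset): four well-separated blobs.
rng = np.random.default_rng(42)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]], dtype=float)
X = np.vstack([c + rng.normal(scale=1.0, size=(50, 2)) for c in centers])

# WCSS for k = 1..8; the "elbow" is where the curve flattens out.
wcss = {k: best_wcss(X, k) for k in range(1, 9)}
for k, value in wcss.items():
    print(k, round(value, 1))
```

Plotting `wcss` against k (for example with matplotlib) gives the elbow curve. On real vehicle data the bend is usually less clean than on synthetic blobs, so choosing k still involves some judgment.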

Selecting and clustering:

In this analysis, I used NumPy and pandas to manipulate and preprocess the data. NumPy handled the numerical computations, while pandas was used to load and manipulate the data frames. Specifically, I used pandas to remove duplicates, drop columns with missing data, and convert categorical data to numerical data using one-hot encoding. After cleaning and preprocessing the data, I performed k-means clustering with the value of k obtained from the elbow method. The resulting clusters represent groups of similar vehicles based on their make, model, year, price, and mileage. By examining the vehicles within each cluster, I made the following observations:

Luxury SUVs: This cluster contains luxury SUVs from various makes and models, typically with high prices, low mileage, and recent model years. This insight may inform business decisions related to pricing, marketing, and inventory management for luxury SUVs.

Compact cars: This cluster includes compact cars from different makes and models, typically with lower prices, moderate mileage, and older model years. This insight may be useful for understanding the demand and pricing dynamics of compact cars in the used vehicle market.

Pickup trucks: This cluster contains pickup trucks from various makes and models, typically with moderate to high prices, moderate to high mileage, and a mix of newer and older model years. This insight may inform decisions related to pricing, inventory management, and marketing strategies for pickup trucks.

Sports cars: This cluster includes sports cars from different makes and models, typically with higher prices, lower mileage, and recent model years. This insight may be useful for understanding the demand and pricing patterns of sports cars in the used vehicle market.
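One way to examine the vehicles within each cluster is to group the rows by cluster label and compare averages. The sketch below is only an illustration: the column names, the `cluster` labels, and every value in the toy table are made up for the example and are not taken from the Kaggle dataset or from my results.

```python
import pandas as pd

# Hypothetical rows with made-up values and made-up cluster labels.
vehicles = pd.DataFrame({
    "price":   [52000, 48000, 9000, 11000, 30000, 27000, 60000, 58000],
    "mileage": [15000, 22000, 90000, 85000, 70000, 95000, 12000, 18000],
    "year":    [2021, 2020, 2012, 2013, 2016, 2014, 2021, 2020],
    "cluster": [0, 0, 1, 1, 2, 2, 3, 3],
})

# One row per cluster: size plus typical price, mileage, and model year.
summary = vehicles.groupby("cluster").agg(
    n=("price", "size"),
    mean_price=("price", "mean"),
    mean_mileage=("mileage", "mean"),
    median_year=("year", "median"),
)
print(summary)
```

Reading such a summary side by side with a sample of rows from each cluster is what lets a label like "compact cars" or "pickup trucks" be attached to a numeric cluster ID.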

Results:

The following table summarizes the key features of each cluster:

Cleaning the data:

To clean the data, I first removed duplicates from the dataset and then dropped columns with missing data. I also converted categorical data to numerical data using one-hot encoding. Additionally, I removed outliers that were beyond three standard deviations from the mean.
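The cleaning steps above might look roughly like this in pandas. This is a sketch under stated assumptions, not the code from the linked notebook: the column names (`make`, `year`, `price`, `mileage`, `color`) and all of the values are invented for illustration, and the outlier rule is applied only to the numeric columns.

```python
import pandas as pd

# Hypothetical stand-in data: 20 ordinary rows, one extreme price, one exact duplicate.
raw = pd.DataFrame({
    "make": ["Toyota", "Honda", "Ford", "Kia"] * 5 + ["Ford", "Toyota"],
    "year": [2010 + i % 10 for i in range(20)] + [2015, 2010],
    "price": [10000 + 500 * i for i in range(20)] + [900000, 10000],
    "mileage": [120000 - 4000 * i for i in range(20)] + [60000, 120000],
    "color": ["red", None] * 10 + ["blue", "red"],  # column with missing values
})

# 1. Remove exact duplicate rows.
df = raw.drop_duplicates()

# 2. Drop every column that contains at least one missing value.
df = df.dropna(axis=1)

# 3. One-hot encode the categorical column.
encoded = pd.get_dummies(df, columns=["make"])

# 4. Keep rows whose numeric features are all within three standard deviations of the mean.
numeric_cols = ["year", "price", "mileage"]
z_scores = (encoded[numeric_cols] - encoded[numeric_cols].mean()) / encoded[numeric_cols].std()
clean = encoded[(z_scores.abs() <= 3).all(axis=1)]

print(clean.shape)
```

Dropping whole columns with `dropna(axis=1)` discards a feature even if only a few values are missing, so checking how many values are actually missing in each column first is worthwhile.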

Limitations:

There are several limitations and potential biases in this analysis. First, the clustering is based only on the features available in the dataset, which may not capture all relevant factors influencing vehicle similarity. For example, vehicle condition and maintenance history are not included in the analysis. Second, the dataset may not be representative of the entire used vehicle market, as it may suffer from sample selection bias or data quality issues. Finally, clustering is an unsupervised technique, so the resulting clusters may not align with industry-standard vehicle categories.

Conclusion:

In this analysis, I used a dataset of used vehicles for sale to perform clustering and extract insights about groups of similar vehicles based on their make, model, year, price, and mileage. The results can provide valuable information for pricing, marketing, and inventory management decisions in the used vehicle market. However, it’s important to consider the limitations and potential biases of the analysis and interpret the results with caution. Further analysis may be needed to confirm the findings and make decisions based on the insights obtained from the analysis.

GitHub: https://github.com/jasmi01/INST414Exercises/blob/main/assignment4
