What Cars Are Most Similar?

Brandon Fung
INST414: Data Science Techniques
6 min read · Mar 10, 2024

Introduction

In the vast and ever-changing world of cars, finding the perfect match for your needs and desires can feel a bit like hunting for treasure without a map. Whether you’re after a car that’s a looker, a saver at the pump, or packed with the latest tech, each one brings something special to the table. But what if you already have a dream car in mind and are curious about what else is out there that’s similar? This is where a little data magic comes into play. Through the lens of data analysis, we’ll dive into the quest of uncovering cars that share a resemblance to a model you love, focusing on aspects crucial to car enthusiasts and potential buyers. The insights derived aim to guide several pivotal decisions, including purchase considerations, maintenance expectations, and potential resale value.

Ideal Data

To identify the vehicles most similar to a chosen model, the ideal dataset would cover a wide range of automotive specifications and consumer data. It should include fields such as make and model, year of manufacture, price range, engine specifications (including horsepower, torque, and fuel efficiency), safety ratings, and consumer satisfaction indices. It would also benefit from fields like in-car technology options, interior and exterior dimensions, and environmental impact ratings. These fields matter because together they capture the aspects of a vehicle that influence a buyer’s decision. For example, engine specifications indicate performance and fuel economy, safety ratings describe how well the vehicle protects its occupants, and consumer satisfaction indices reflect reliability and overall satisfaction among current owners. Comparing vehicles across these dimensions would identify cars that are not only similar in specifications but also aligned with a buyer’s preferences and values, supporting a more informed and tailored choice.

While this dataset is ideal, data this complete is not easy to come by. Instead, I used a dataset from Kaggle that lists the most popular sports cars manufactured from 2010 to the present day. It only contains information on make and model, price, and engine specifications.

Cleaning

To clean my data, I first dropped any duplicate entries, which I defined as rows where the make, model, and year were all identical. Next, I converted every column used in my analysis to a float data type, since a distance can only be calculated on numeric values. Then I created a data frame containing just the features and an array containing just the vehicle names. Both share the same index, which simplifies cross-referencing and data retrieval.
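The cleaning steps described above could look roughly like the sketch below. It is not the original code: the make, model, and year column names are assumptions, the sample rows are made up, and dropping rows that fail to parse is an extra step added here so that distances stay well defined.

```python
import pandas as pd

# Assumed column names; the actual Kaggle file may label them differently.
KEYS = ["Car Make", "Car Model", "Year"]
FEATURES = [
    "Engine Size (L)",
    "Horsepower",
    "Torque (lb-ft)",
    "0-60 MPH Time (seconds)",
]


def clean(df: pd.DataFrame):
    """Drop duplicate cars, coerce features to float, split names from features."""
    # A duplicate is a row whose make, model, and year all match an earlier row.
    df = df.drop_duplicates(subset=KEYS).copy()
    for col in FEATURES:
        # Remove thousands separators, then coerce; unparseable values become NaN.
        as_text = df[col].astype(str).str.replace(",", "", regex=False)
        df[col] = pd.to_numeric(as_text, errors="coerce")
    # Assumption: rows with missing feature values are dropped.
    df = df.dropna(subset=FEATURES).reset_index(drop=True)
    names = (
        df["Year"].astype(str) + " " + df["Car Make"] + " " + df["Car Model"]
    ).to_numpy()
    # Features and names share the same positional index.
    return df[FEATURES].astype(float), names


# Tiny made-up sample, for illustration only.
raw = pd.DataFrame({
    "Car Make": ["A", "A", "B"],
    "Car Model": ["X", "X", "Y"],
    "Year": [2022, 2022, 2021],
    "Engine Size (L)": ["3.0", "3.0", "2.0"],
    "Horsepower": ["1,000", "1,000", "300"],
    "Torque (lb-ft)": ["400", "400", "295"],
    "0-60 MPH Time (seconds)": ["3.1", "3.1", "4.5"],
})
features, names = clean(raw)
```

Because the index is reset after cleaning, row i of the feature frame always corresponds to entry i of the names array.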

Analysis

For my analysis, I chose not to incorporate the price, as I wanted to come at the problem from the angle of a car enthusiast who does not have a budget. This way, my analysis only finds vehicles that are most similar according to performance. The specific fields used to measure similarity are the following: Engine Size (L), Horsepower, Torque (lb-ft), and 0–60 MPH Time (seconds). Since these are all performance-based metrics, I treated them as directly comparable and measured similarity using Euclidean distance as opposed to something like cosine similarity. One caveat is that the fields are on different numeric scales (horsepower and torque run into the hundreds, while engine size and 0–60 time are single digits), so unscaled Euclidean distances are influenced most by horsepower and torque.
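For reference, the Euclidean distance between two cars is the square root of the sum of squared differences across the four fields. A minimal sketch with NumPy, using made-up values rather than rows from the dataset:

```python
import numpy as np


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))


# Made-up values, in the order:
# engine size (L), horsepower, torque (lb-ft), 0-60 time (s)
car_a = [3.0, 400.0, 300.0, 4.0]
car_b = [3.0, 403.0, 304.0, 4.0]

print(euclidean(car_a, car_b))  # 5.0, since sqrt(3**2 + 4**2) = 5
```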

Porsche 911

The first car I wanted to analyze was the 2022 Porsche 911. Renowned for its engineering prowess, the 911 delivers unmatched performance and agility, enveloped in a timeless design that’s both iconic and evolving. Beyond its aesthetic and exhilarating drive, the 911 is celebrated for its superior build quality and reliability, offering a luxurious, driver-focused interior made from premium materials. Here is a list of the most similar vehicles to the 911:

Each entry in the list gives the make and model along with its year of production, followed by the “distance” between that vehicle and the 2022 Porsche 911. The lower the distance, the “closer” the vehicle is to the 911, and thus the more similar. As we can see, the 2021 Mercedes-Benz AMG A45 is the most similar to the 911 with a distance of 23.216. This means that, among the cars in the dataset, the AMG has the closest performance to the 911, and it comes in at half the price!
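A ranked list like the one described above can be produced by computing the distance from the chosen car to every other car and sorting in ascending order. The sketch below is an illustration, not the original code: the function name `most_similar` and the sample rows are hypothetical.

```python
import numpy as np


def most_similar(features, names, target, k=10):
    """Return up to k (name, distance) pairs closest to `target`, nearest first."""
    X = np.asarray(features, dtype=float)
    names = list(names)
    i = names.index(target)  # row of the car being compared
    # Euclidean distance from the target row to every row.
    dists = np.linalg.norm(X - X[i], axis=1)
    # Sort ascending and skip the target itself (distance 0).
    order = [j for j in np.argsort(dists) if j != i][:k]
    return [(names[j], float(dists[j])) for j in order]


# Made-up rows: engine size (L), horsepower, torque (lb-ft), 0-60 time (s)
sample_names = ["Target", "Near", "Far"]
sample_X = [
    [3.0, 400.0, 300.0, 4.0],
    [3.0, 403.0, 304.0, 4.0],
    [5.0, 600.0, 500.0, 3.0],
]

for name, dist in most_similar(sample_X, sample_names, "Target", k=2):
    print(f"{name}: {dist:.3f}")
```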

Lamborghini Huracan

The next car I wanted to see was the 2021 Lamborghini Huracan. Lamborghini is synonymous with extreme performance, and the Huracan is no exception, offering exhilarating speed and razor-sharp handling that reflects its motorsport heritage. Its reputation for luxury and innovation means you’re not just buying a car, but an experience that combines state-of-the-art technology with bespoke comfort. Here are the most similar cars to the Huracan:

The Huracan is in another class of luxury, and its price point, along with those of its competitors, reflects that. The 2022 McLaren GT is the most similar, with a distance of 24.198, and it comes in $60,000 cheaper at $213,195. Price is probably not a problem for anyone purchasing a Lamborghini, but it is interesting to see that all but two of the most similar cars according to performance are significantly cheaper. This suggests that you are not just buying the car for the performance, but also for the name and all that comes with it.

BMW M2

The last car I wanted to explore was the 2022 BMW M2. The M2 stands out as a more affordable entry into the world of luxury sports cars, without compromising on the driving excitement and quality craftsmanship associated with the BMW brand. Below are the most similar cars:

The top 3 most similar cars are the same model with just a slightly different spec and/or manufacturing year. Intuitively this makes sense, as you would expect the same car to be very similar if the only difference is the year. Ignoring these vehicles, the next most similar car is the 2021 Mercedes-Benz AMG C43 Coupe. I find this very interesting because BMW and Mercedes-Benz are in direct competition and the AMG is priced very similarly.
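Skipping variants of the query car can also be done in code rather than by eye. The sketch below is one rough way to do it, assuming each result carries separate make and model fields. The helper `drop_same_model`, the prefix-matching rule, and the sample entries are all hypothetical.

```python
def drop_same_model(ranked, make, model):
    """Remove results that look like variants of the query car.

    A result counts as a variant when its make matches and its model
    name starts with the query model name. This is a rough heuristic
    and can misfire on unrelated models that share a name prefix.
    """
    return [
        r for r in ranked
        if not (r["make"] == make and r["model"].startswith(model))
    ]


# Made-up ranked results, nearest first (distances are placeholders).
ranked = [
    {"make": "Brand A", "model": "Z1", "year": 2021, "distance": 1.0},
    {"make": "Brand A", "model": "Z1 Sport", "year": 2022, "distance": 2.0},
    {"make": "Brand B", "model": "Q5", "year": 2021, "distance": 3.0},
]

filtered = drop_same_model(ranked, make="Brand A", model="Z1")
print([r["model"] for r in filtered])  # ['Q5']
```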

Limitations

The main limitation of my analysis is the lack of data. After removing all duplicate entries, I was left with only about 300 sports cars. This is a very small sample to begin with, and it contains just one type of vehicle from a fixed time period, which strengthens the bias. Furthermore, I was limited to performance metrics, so my analysis does not factor in other important aspects like safety rating, gas mileage, etc.

The code for my analysis can be found here.
