Exploratory Analysis of Sports Cars

Danny Pham
INST414: Data Science Techniques
Sep 15, 2024

Questions and Stakeholders

Over the summer I started looking more into automobiles. I was curious to learn more about sports cars while practicing my data analytics skills, so I explored the automobile market with a focus on sports cars. This project aimed to deepen my understanding of the car market and its performance metrics while giving stakeholders information to make data-driven decisions.

Key Stakeholders:

  • Car Enthusiasts: Interested in performance comparisons and price distributions, and in comparing models with each other in discussion.
  • Manufacturers: Can gain insights into competitors’ offerings in the sports car segment and how to stand out from the competition.
  • Researchers and Analysts: Can use this data for further automotive market analysis, such as identifying trends in the market.

Dataset Overview

For this analysis, I used a dataset containing key specifications of sports cars. These fields, which describe the performance and pricing of each car, can be used to view the sports car market as a whole. They include:

  • Make: The manufacturer of the car.
  • Model: The car’s specific model.
  • Year: Model year of the car.
  • Horsepower (hp): The engine’s power output.
  • Torque: The rotational force the engine produces.
  • 0–60 mph Time (seconds): How long the car takes to reach 60 mph from a standstill.
  • Price (in USD): Retail price of the vehicle.
  • Engine Size (liters): The displacement of the engine.

Data Collection and Cleaning

  • I collected my data from an online forum. The dataset was well organized, but I ran into issues such as missing values and limited data. I used the Pandas and NumPy libraries for data wrangling and Matplotlib for visualization.
  • Challenges: Out of the 8,056 data points, 13 were missing. Since this EDA was based on ranking cars on various attributes, I decided to remove the affected entries; there were so few of them that dropping them shouldn’t impact my analysis much. If I were to continue this project, I could potentially predict these missing values with linear regression, possibly using cost as the second variable.
  • Data Range: This dataset had 1,007 car models with years ranging from 1965 to 2021. However, all of the cars in the dataset except for the one made in 1965 were being sold from the years 2014 to 2022. This limited range excludes many cars that enthusiasts may be expecting to see on the list, such as classics or simply slightly older models.
  • Data Types: The Price feature was stored as a string and needed to be converted to an integer before plotting.
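
The two cleaning steps described above (dropping rows with missing values and converting Price from a string to an integer) might look roughly like this in Pandas. This is a minimal sketch, not the exact code from the repository: the column names, the sample rows, and the `clean_prices` helper are all made up for illustration, and it assumes prices are whole-dollar strings with thousands separators.

```python
import pandas as pd

# Made-up sample rows for illustration; the real dataset's column
# names and values may differ.
raw = pd.DataFrame({
    "Make": ["Make A", "Make B", "Make C", "Make D"],
    "Model": ["Model 1", "Model 2", "Model 3", "Model 4"],
    "Price (in USD)": ["101,200", "2,400,000", None, "59,900"],
})


def clean_prices(df: pd.DataFrame, price_col: str = "Price (in USD)") -> pd.DataFrame:
    """Drop rows with any missing value and convert the price column to integers."""
    cleaned = df.dropna().copy()
    # Remove every non-digit character (commas, "$", spaces) before casting.
    # Assumption: prices are whole dollars, so no decimal point is expected.
    cleaned[price_col] = (
        cleaned[price_col]
        .astype(str)
        .str.replace(r"[^\d]", "", regex=True)
        .astype(int)
    )
    return cleaned.reset_index(drop=True)


cars = clean_prices(raw)
print(cars)
```

Dropping rows is the simplest choice when only a handful of values are missing; with a larger share of gaps, it would start to bias rankings toward cars with complete records.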

Results

I was able to visualize the density estimate of car prices. The vast majority of these prices were below six figures, meaning that most of the cars in this list could be attainable with proper planning. In addition, I created a box-and-whisker plot showing the price outliers present in the dataset. Looking at the sorted values of the dataset, it seems that the vast majority of the ten fastest 0–60 times come from Rimac cars.
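
The numbers behind these observations can be reproduced without any plotting. The sketch below is one possible way to do it, under assumptions: the column names and sample values are invented, and the outlier rule is the common 1.5 × IQR whisker convention (Matplotlib's box plot default), which may not match the settings used for the original plot.

```python
import pandas as pd

# Invented values for illustration only; not taken from the real dataset.
cars = pd.DataFrame({
    "Make": ["A", "B", "C", "D", "E", "F"],
    "0-60 MPH Time (seconds)": [5.0, 4.5, 4.0, 3.5, 3.0, 1.9],
    "Price (in USD)": [30000, 40000, 50000, 60000, 70000, 2000000],
})

PRICE = "Price (in USD)"
ZERO_TO_SIXTY = "0-60 MPH Time (seconds)"


def price_outliers(df: pd.DataFrame, price_col: str = PRICE) -> pd.DataFrame:
    """Return rows whose price falls outside the 1.5 * IQR whiskers."""
    q1 = df[price_col].quantile(0.25)
    q3 = df[price_col].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return df[(df[price_col] < lower) | (df[price_col] > upper)]


def fastest(df: pd.DataFrame, n: int = 10, time_col: str = ZERO_TO_SIXTY) -> pd.DataFrame:
    """Return the n cars with the lowest 0-60 time, fastest first."""
    return df.nsmallest(n, time_col)


# Share of cars priced below six figures.
share_below_six_figures = (cars[PRICE] < 100_000).mean()

outliers = price_outliers(cars)
top_three = fastest(cars, n=3)
print(f"{share_below_six_figures:.0%} of cars are under $100,000")
print(outliers)
print(top_three)
```

Calling `top_three["Make"].value_counts()` on the real data would show how many of the fastest cars come from each manufacturer.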

Limitations and Future Work

Limited Historical Data

  • The dataset excludes many classic cars, which might limit the analysis for certain enthusiasts.

Handling Missing Data

  • In future iterations, I could spend time manually filling in missing values in the dataset, or create a program that scrapes the prices from the manufacturers’ websites.
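
Another option, mentioned under Data Collection and Cleaning, is to estimate missing values with linear regression. Here is a minimal sketch of that idea, assuming price is used to predict a missing horsepower value. The pairing of those two columns, the sample numbers, and the `impute_linear` helper are all assumptions for illustration; I have not tested how well this works on the real data.

```python
import numpy as np
import pandas as pd

# Invented values for illustration; one horsepower value is missing.
cars = pd.DataFrame({
    "Price (in USD)": [20000, 40000, 60000, 80000, 100000],
    "Horsepower": [200.0, 300.0, np.nan, 500.0, 600.0],
})


def impute_linear(df: pd.DataFrame, target: str, predictor: str) -> pd.DataFrame:
    """Fill missing `target` values from a straight-line fit on `predictor`."""
    # Fit only on rows where both columns are present.
    known = df.dropna(subset=[target, predictor])
    slope, intercept = np.polyfit(known[predictor], known[target], deg=1)

    filled = df.copy()
    # Only rows that lack the target but have the predictor can be estimated.
    missing = filled[target].isna() & filled[predictor].notna()
    filled.loc[missing, target] = slope * filled.loc[missing, predictor] + intercept
    return filled


filled = impute_linear(cars, target="Horsepower", predictor="Price (in USD)")
print(filled)
```

A single-variable fit like this is only reasonable if the two columns are roughly linearly related, so it would be worth checking a scatter plot before trusting the filled-in values.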

Broader Market Trends

  • Incorporating a wider range of cars, including models from older years, would provide better insights into the market.

GitHub Repository

To take a look at the code used for this analysis, visit the GitHub Repository.
