Automotive Data Insights

Dante Caraballo
INST414: Data Science Techniques
8 min read · Apr 16, 2024

By Dante Caraballo and ChatGPT

In the evolving field of data analytics, supervised learning is a widely used technique, offering a structured pathway to uncover hidden patterns and predict future outcomes. This post delves into an intriguing application of supervised learning: predicting the fuel efficiency of vehicles based on their specifications. Such predictions not only shine a light on the relationship between a vehicle’s design and its performance but also pave the way for advancements in automotive technology and consumer choice.

Stakeholders and Impact

My primary research question concerns vehicle performance: “Can the fuel efficiency of a vehicle be accurately predicted using its specifications?” This question matters to a range of stakeholders, from automobile manufacturers striving for innovation in fuel economy to environmentally conscious consumers navigating the complexities of car ownership. Beyond its immediate relevance, the answer to this question informs broader discussions on energy consumption, environmental impact, and the future of transportation.

With the addition of a second prediction task and a clustering analysis, this post also serves beginner car enthusiasts who want to understand the relationship between a car’s specifications and its performance. For example, someone building a project car might use it to see how displacement relates to acceleration. My secondary research question, for the extension of my analysis, is “Can the acceleration of a vehicle be accurately predicted using its specifications?” A related question is how specifications such as displacement, number of cylinders, weight, and horsepower relate to acceleration.

The Dataset

The dataset serving as the foundation of the analysis is a rich source of information, featuring a wide range of vehicle attributes. Each entry in the dataset is a snapshot of automotive history, detailing the make and model of the vehicle, its engine specifications such as displacement and horsepower, physical characteristics like weight and acceleration, and the era of its manufacture. By using these features together, the dataset constructs a narrative of automotive evolution, making it ideal for this supervised learning endeavor.

Features:

  1. name: The name of the car, including its make and model (e.g., “chevrolet chevelle malibu”).
  2. mpg: Miles per gallon, a measure of fuel efficiency.
  3. cylinders: The number of cylinders in the car’s engine.
  4. displacement: The engine displacement in cubic inches, a measure of the engine’s size.
  5. horsepower: The power output of the car’s engine.
  6. weight: The weight of the car in pounds.
  7. acceleration: The time it takes for the car to accelerate from 0 to 60 miles per hour, measured in seconds.
  8. model_year: The year the car model was released.
  9. origin: The country of origin of the car (e.g., “usa”, “japan”, “europe”).

Data Collection and Preparation

I found the Automobile.csv dataset on Kaggle. In data science, the integrity of an analysis depends on the quality of its data. With that in mind, I cleaned the data by removing any rows with missing or incomplete values. This process not only refined the dataset but also brought to light the challenges inherent in preparing real-world data for analysis, from dealing with missing values to ensuring the consistency of data formats.

In the clustering section of my analysis, I handled missing data differently: I filled the missing horsepower values with the column mean. Only about 5 entries were missing, so imputing them this way should not noticeably skew the data. I did this because I wanted to keep as many data points as possible for clustering.
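The two ways of handling missing values described above can be sketched side by side. This is a minimal illustration with pandas, not the code from the original notebook: the rows are made-up toy values rather than entries from Automobile.csv, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy rows with made-up values (not taken from Automobile.csv).
cars = pd.DataFrame({
    "mpg": [18.0, 25.0, 31.0, 22.0],
    "horsepower": [130.0, np.nan, 65.0, 95.0],
    "weight": [3504, 2046, 1773, 2833],
})

# Strategy 1 (used for the regression): drop any row with a missing value.
cars_dropped = cars.dropna()

# Strategy 2 (used for the clustering): keep every row and fill the missing
# horsepower with the column mean. pandas skips NaN when computing the mean.
cars_filled = cars.copy()
horsepower_mean = cars_filled["horsepower"].mean()
cars_filled["horsepower"] = cars_filled["horsepower"].fillna(horsepower_mean)

print(len(cars_dropped), "rows after dropping")
print(len(cars_filled), "rows after mean imputation")
```

Dropping rows keeps only observed values, while mean imputation preserves the row count at the cost of pulling the filled entries toward the center of the distribution.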

Next, I encoded origin, a categorical variable, as integers so that it could be used in the analysis. I used label encoding to map the country of origin to the integers 0, 1, and 2, one for each of the USA, Europe, and Japan. I chose this approach because I wanted the region where a car was manufactured to inform the clusters. I used the LabelEncoder class from sklearn to encode the origin.
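Here is a small sketch of label encoding with LabelEncoder, assuming the origin column holds lowercase strings like those in the feature list. The sample values are toy data. One detail worth checking in any real run: LabelEncoder assigns codes in the sorted order of the labels, so the mapping is alphabetical rather than following the order in which regions appear.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy origin values in the same format as the feature list above.
origins = pd.Series(["usa", "japan", "europe", "usa"])

encoder = LabelEncoder()
origin_codes = encoder.fit_transform(origins)

# classes_ lists the labels in the order of their integer codes.
code_by_label = {str(label): code for code, label in enumerate(encoder.classes_)}

print(code_by_label)          # {'europe': 0, 'japan': 1, 'usa': 2}
print(origin_codes.tolist())  # [2, 1, 0, 2]
```

Label encoding also implies an ordering between regions that has no real meaning, which distance-based methods like K-means will still treat as numeric distance.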

Numerical variables like mpg, cylinders, displacement, horsepower, and weight were scaled to normalize their ranges. This was an essential step because it ensured that no one feature would skew or dominate the clustering process. I used StandardScaler from the sklearn library, which standardizes features by removing the mean and scaling to unit variance.
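The scaling step can be sketched as follows. The column names match the feature list, but the rows are made-up toy values and the variable names are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy rows with made-up values (not taken from Automobile.csv).
cars = pd.DataFrame({
    "mpg": [18.0, 25.0, 31.0, 22.0],
    "cylinders": [8, 4, 4, 6],
    "displacement": [307.0, 113.0, 79.0, 200.0],
    "horsepower": [130.0, 95.0, 65.0, 85.0],
    "weight": [3504, 2228, 1773, 2587],
})
numeric_cols = ["mpg", "cylinders", "displacement", "horsepower", "weight"]

# Each column is transformed to (x - column mean) / column standard deviation.
scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(cars[numeric_cols]),
    columns=numeric_cols,
    index=cars.index,
)

print(scaled.round(2))
```

After scaling, weight (in the thousands) and cylinders (single digits) sit on the same scale, so neither dominates the Euclidean distances that K-means relies on.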

The Supervised Learning Model

Predicting a vehicle’s mpg and acceleration is a regression problem, because both target variables are continuous. I chose a Linear Regression model for its simplicity, its interpretability, and the linear tendencies observed in the preliminary data analysis. The model’s premise is straightforward: by learning the relationship between a vehicle’s features and its mpg or acceleration, it can predict fuel efficiency and acceleration for other vehicles. Despite this simplicity, the fitted coefficients show how each specification relates to vehicle performance.

Training and Evaluation

The dataset was split into training and testing sets, with the former used to fit the model to the relationships between the features and each target (mpg and acceleration) and the latter used to evaluate its predictions on unseen data. The model’s performance was analyzed through metrics such as Mean Squared Error (MSE) and R-squared (R²), which measure the average squared prediction error and the share of variance in the target explained by the model, respectively.
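The split, fit, and evaluate workflow might look like the sketch below. Several things here are assumptions rather than details from the original analysis: the data is synthetic (generated so that mpg depends roughly linearly on weight and horsepower), and the 80/20 split and random_state values are arbitrary choices.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (assumed coefficients and noise).
rng = np.random.default_rng(0)
n = 200
weight = rng.uniform(1600, 5000, n)
horsepower = 40 + 0.03 * weight + rng.normal(0, 10, n)
mpg = 46 - 0.006 * weight - 0.03 * horsepower + rng.normal(0, 2, n)
cars = pd.DataFrame({"weight": weight, "horsepower": horsepower, "mpg": mpg})

X = cars[["weight", "horsepower"]]
y = cars["mpg"]

# Hold out 20% of the rows for testing (the split ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.2f}, R^2: {r2:.3f}")
```

Predicting acceleration follows the same pattern with a different target column; mpg can then be used as a feature or left out, depending on the question being asked.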

Insights from Prediction Discrepancies

Despite the model’s overall satisfactory performance, an examination of individual predictions unveiled instances of significant error. These outliers serve as reminders of the model’s limitations and the unpredictable nature of real-world data. By analyzing these anomalies, I gained valuable insights into the factors that may trip up the model, from unique vehicle designs to data recording anomalies, opening avenues for further refinement.

Actual vs Predicted Miles Per Gallon
Top 5 Prediction MPG Discrepancies
Actual vs Predicted Acceleration
Top 5 Prediction Acceleration Discrepancies
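One way to build a "top 5 discrepancies" table like the ones above is to rank test rows by absolute error. The sketch below uses made-up names and values purely for illustration; the column names are hypothetical.

```python
import pandas as pd

# Made-up actual and predicted values for illustration only.
results = pd.DataFrame({
    "name": ["car a", "car b", "car c", "car d", "car e", "car f", "car g"],
    "actual_mpg": [18.0, 25.0, 31.0, 22.0, 44.0, 15.0, 27.0],
    "predicted_mpg": [19.0, 23.5, 27.0, 22.5, 33.0, 15.2, 29.5],
})

# Absolute error ignores whether the model over- or under-predicted.
results["abs_error"] = (results["actual_mpg"] - results["predicted_mpg"]).abs()

# Keep the five rows where the prediction was furthest from the truth.
top5 = results.nlargest(5, "abs_error")
print(top5)
```

Looking at the names attached to the largest errors is often the quickest way to spot unusual vehicles or data entry problems.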

Clustering Analysis

I applied K-means clustering and chose the number of clusters with the elbow method. Based on the elbow plot below, I selected three clusters for my analysis.

Elbow Method
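The elbow method can be sketched by fitting K-means for a range of k values and recording the inertia (the within-cluster sum of squared distances) for each. The data below is synthetic, generated with make_blobs as a stand-in for the scaled car features, and the range of k values is an arbitrary choice.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled car features (three generated groups).
X, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

k_values = range(1, 9)
inertias = []
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

for k, inertia in zip(k_values, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

Plotting inertias against k_values gives the elbow curve; the "elbow" is the k after which adding more clusters yields only small reductions in inertia.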

To make the grouping of cars easier to interpret, I used PCA for dimensionality reduction to show the clusters in a 2D space. K-means itself alternates between two steps: first, it assigns each data point to the closest cluster center (centroid), and then it updates each centroid to the mean of the points assigned to it. These two steps repeat until the centroids no longer shift noticeably, a sign that the algorithm has converged on compact, distinct clusters.
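A minimal sketch of clustering followed by a PCA projection is shown below. It assumes synthetic data from make_blobs in place of the scaled car features. Note that the clusters are fit on all features, and PCA is used only afterward to get two coordinates for plotting.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled car features.
X, _ = make_blobs(n_samples=300, centers=3, n_features=6, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Fit K-means on the full feature space; labels holds one cluster id per row.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

# Project to two principal components for visualization only.
pca = PCA(n_components=2)
coords_2d = pca.fit_transform(X_scaled)

print("Points per cluster:", {c: int((labels == c).sum()) for c in range(3)})
print("Variance explained by 2 components:", pca.explained_variance_ratio_.sum())
```

A scatter plot of coords_2d colored by labels produces the kind of 2D cluster view described above. The explained variance ratio indicates how much of the original spread the two plotted dimensions retain.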

Clustering Visualisations and Insights

Cluster 0: This cluster is characterized by highly fuel-efficient vehicles with smaller displacements, which suggests smaller, more efficient cars. Cars in this cluster also have lower horsepower, lower weight, and higher values in the acceleration column. Because that column records the 0 to 60 mph time in seconds, a higher value means the car takes longer to reach 60 mph, so these vehicles are smaller and lighter than the cars in other clusters rather than quicker.

Cluster 1: This cluster contains cars with average statistics. Fuel efficiency, displacement, horsepower, weight, and acceleration are all at medium levels in comparison to the other clusters. This may indicate that the cars in this cluster are daily drivers for average consumers.

Cluster 2: This cluster consists of larger, newer, and less fuel-efficient cars with high power and lower values in the acceleration column (shorter 0 to 60 mph times, given how that feature is defined). These are likely powerful vehicles such as SUVs, trucks, or other vehicles for commercial use.

Mean Values of Features by Cluster
Mean Acceleration by Cluster

In this data, higher fuel efficiency, fewer cylinders (smaller engines), lower horsepower, and lower weight are all associated with higher values in the acceleration column, that is, with longer 0 to 60 mph times. These relationships are averages across clusters, however: for a given weight, a car with more horsepower will accelerate more quickly.
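Per-cluster summaries like the ones charted above can be computed with a pandas groupby. The sketch below uses made-up values and hypothetical cluster assignments, so the numbers illustrate the mechanics only and are not results from the analysis.

```python
import pandas as pd

# Made-up rows with hypothetical cluster labels (illustration only).
cars = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 2, 2],
    "mpg": [32.0, 30.0, 22.0, 24.0, 14.0, 16.0],
    "horsepower": [65.0, 70.0, 95.0, 100.0, 165.0, 150.0],
    "weight": [1800, 2000, 2700, 2900, 4100, 3900],
    "acceleration": [17.0, 16.0, 15.0, 15.5, 12.0, 11.0],  # 0-60 mph, seconds
})

# Mean of every feature within each cluster; one row per cluster.
cluster_means = cars.groupby("cluster").mean()
print(cluster_means)
```

When reading the acceleration column in a table like this, keep in mind that it is a time in seconds, so a smaller mean corresponds to a quicker group of cars.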

Implications and Next Steps

The implications of these findings stretch beyond the confines of this study. For manufacturers, the insights gained can inform the design of quicker and more fuel-efficient vehicles. For consumers and car enthusiasts, a deeper understanding of how various factors influence mpg and acceleration can guide more informed purchasing decisions. Moreover, this exploration contributes to the broader conversation on sustainability and environmental protection.

There is considerable room to extend this work by incorporating more data, exploring nonlinear models, and applying other machine learning techniques. Each of these steps could yield a more detailed understanding of the interplay between automotive design and performance.

Challenges

Throughout this analysis, I encountered and worked through a few obstacles, including handling missing values and standardizing data types. Data preparation of this kind is often overlooked, but it is crucial to the success of any data-driven inquiry.

Limitations and Opportunities for Growth

No analysis is without its limitations, and this study is no exception. One of the most significant constraints lies in the dataset’s historical and regional focus, which may not capture the full spectrum of global automotive trends or the latest advancements in vehicle technology. Additionally, the assumption of linear relationships by the Linear Regression model may not fully encapsulate the complexities and non-linear interactions inherent in real-world phenomena.

Recognizing these limitations opens the door to numerous opportunities for improvement. Future studies could leverage more sophisticated models, such as Random Forest or Neural Networks, which can handle non-linearities and complex interactions more effectively. Moreover, expanding the dataset to include more recent data or data from diverse geographical regions could enhance the model’s applicability and robustness.

Conclusion

By leveraging the power of supervised learning, we have taken a significant step toward understanding how various vehicle attributes influence fuel efficiency and acceleration. This endeavor not only aids manufacturers and consumers but also contributes to the larger dialogue on energy efficiency and environmental responsibility. The clustering analysis of the automobile dataset has revealed distinct groupings of cars based on their specifications and performance metrics. Consumers and car enthusiasts can use these findings to inform purchasing decisions and to understand how specifications relate to car performance.

For a deeper dive into the code that powered this analysis, visit the GitHub repository and view Module_6_Assignment_Extended.
