Used car price prediction using different machine learning models

Edwing Jimenez
3 min read · Jul 14, 2024


In this project, I train four machine learning models (Multiple Linear Regression, Decision Tree Regression, Random Forest Regression, and XGBoost) to predict used car prices. Car dealerships can use the approach to estimate prices and to understand the key factors that drive them. Code is available here.

1. Objective

The primary objective of this project is to predict the market price of used cars using various machine learning models. Accurate price predictions help potential buyers and sellers make more informed decisions.

2. Tasks

The project was divided into several key tasks:

  1. Data Import and Initial Exploration: Importing the dataset and displaying the initial features and structure.
  2. Data Cleaning and Preparation: Handling missing values and transforming data into a suitable format for analysis.
  3. Data Visualization: Generating visualizations to understand the relationships between different features.
  4. Feature Engineering: Creating and selecting relevant features for the prediction model.
  5. Model Training and Evaluation: Training multiple regression models and evaluating their performance.
  6. Comparison of Models: Comparing the performance of different models and selecting the best one based on key performance indicators.
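The article does not say which features were engineered in task 4. As one possible illustration only, the sketch below one-hot encodes a categorical column with pandas so that regression models can consume it. The rows, the "MakeA"/"MakeB" labels, and the choice of encoding are assumptions made for the example, not the project's actual steps.

```python
import pandas as pd

# Hypothetical rows; the column names follow the features mentioned in the article.
cars = pd.DataFrame({
    "Make": ["MakeA", "MakeB", "MakeA"],
    "EngineSize": [3.5, 1.8, 3.0],
    "MSRP": [36945.0, 25940.0, 33000.0],
})

# Separate the response variable from the predictors.
y = cars["MSRP"]

# One-hot encode the categorical column; numeric columns pass through unchanged.
X = pd.get_dummies(cars.drop(columns="MSRP"), columns=["Make"])

print(X.columns.tolist())
```

Tree-based models can work with other encodings as well; one-hot encoding is simply a common default that also suits linear regression.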

3. Findings and Results

Data Import and Initial Exploration

  • The dataset was successfully loaded, containing various features like Make, Model, Year, Engine Fuel Type, MSRP, etc.
  • Initial exploration revealed some missing values, which were subsequently handled.
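A first look at the data could be sketched as follows. The small DataFrame is a hypothetical stand-in for the real file (which would normally be loaded with `pd.read_csv`), and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical miniature of the dataset; the real project loads a CSV file instead.
cars = pd.DataFrame({
    "Make": ["MakeA", "MakeB", "MakeC", "MakeD"],
    "Model": ["ModelW", "ModelX", "ModelY", None],
    "MSRP": ["$36,945", "$25,940", "$37,000", "$22,180"],
    "EngineSize": [3.5, 1.8, 3.0, None],
    "Cylinders": [6, 4, 6, 6],
    "Horsepower": [265, 170, 225, 200],
})

print(cars.head())    # first rows of the table
print(cars.dtypes)    # column types; MSRP is still text at this stage

# Count missing values per column to decide how to handle them.
missing_per_column = cars.isnull().sum()
print(missing_per_column)
```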

Data Cleaning

  • Rows with missing values were identified and dropped, since there were only a few of them.
  • Price columns (MSRP) were cleaned by removing non-numeric characters to facilitate analysis.
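The two cleaning steps might look like the sketch below. It assumes prices are stored as text such as "$36,945"; the sample rows are invented, and the regular expression is one reasonable way to strip non-numeric characters rather than the project's confirmed code.

```python
import pandas as pd

# Hypothetical rows with a missing price and a missing horsepower value.
cars = pd.DataFrame({
    "Make": ["MakeA", "MakeB", "MakeC", "MakeD"],
    "MSRP": ["$36,945", "$25,940", None, "$22,180"],
    "Horsepower": [265, 170, 225, None],
})

# Drop every row that has at least one missing value.
cars_clean = cars.dropna().copy()

# Remove everything except digits and the decimal point, then convert to float.
cars_clean["MSRP"] = (
    cars_clean["MSRP"]
    .str.replace(r"[^0-9.]", "", regex=True)
    .astype(float)
)

print(cars_clean)
```

Dropping rows is reasonable only when few are affected, as the article notes; with many gaps, imputation would usually be preferable.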

Data Visualization

  • Scatter plots and histograms were created to explore the relationships between car features and their prices.
  • Visualizations indicated clear trends and strong correlations between features such as EngineSize, Cylinders, and Horsepower and our response variable, MSRP.
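The correlations seen in the plots can also be checked numerically. The sketch below computes Pearson correlations between each feature and MSRP with pandas; it is a numeric complement to the scatter plots, not a reproduction of them, and the five rows are invented for illustration.

```python
import pandas as pd

# Hypothetical, already-cleaned numeric rows.
cars = pd.DataFrame({
    "EngineSize": [1.8, 2.4, 3.0, 3.5, 4.6],
    "Cylinders": [4, 4, 6, 6, 8],
    "Horsepower": [140, 170, 225, 265, 320],
    "MSRP": [18000.0, 23000.0, 33000.0, 37000.0, 52000.0],
})

# Pearson correlation of every feature with the response variable.
correlations = cars.corr()["MSRP"].drop("MSRP").sort_values(ascending=False)
print(correlations)
```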

Model Training and Evaluation

  • Linear Regression: This model was trained and evaluated, providing a baseline performance.
  • Decision Tree Regressor: Trained on the dataset but showed overfitting tendencies.
  • Random Forest Regressor: Provided better performance compared to the decision tree by averaging multiple trees.
  • XGBoost Regressor: Exhibited the best performance, leveraging gradient boosting techniques to improve accuracy.
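The training loop for the four models could be sketched as below. Several things are assumptions: the data is synthetic (from `make_regression`), the 80/20 split and hyperparameters are arbitrary, and scikit-learn's `GradientBoostingRegressor` stands in for XGBoost so the example runs without an extra dependency. The scores it prints will not match the article's results.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the cleaned car features (X) and prices (y).
X, y = make_regression(n_samples=400, n_features=6, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Multiple linear regression": LinearRegression(),
    "Decision Tree regression": DecisionTreeRegressor(random_state=0),
    "Random Forest regression": RandomForestRegressor(n_estimators=100, random_state=0),
    # Stand-in for XGBoost; with the xgboost package installed,
    # xgboost.XGBRegressor() could be used here instead.
    "Gradient boosting regression": GradientBoostingRegressor(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # For regressors, .score() returns the R^2 coefficient of determination.
    train_r2 = model.score(X_train, y_train)
    test_r2 = model.score(X_test, y_test)
    scores[name] = (train_r2, test_r2)
    print(f"{name}: train R^2 = {train_r2:.2f}, test R^2 = {test_r2:.2f}")
```

Printing the training score next to the test score makes overfitting visible: an unpruned decision tree typically scores close to 1.0 on the training set while doing noticeably worse on held-out data.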

Model Performance

  • Multiple Linear Regression: Moderate accuracy (0.81), with room for improvement.
  • Decision Tree Regressor: High training accuracy but the lowest test accuracy (0.75), indicating overfitting.
  • Random Forest Regressor: Better generalization than the single decision tree (0.81), on par with linear regression.
  • XGBoost Regressor: Highest accuracy among all models (0.91), indicating its robustness and effectiveness for this task.

The scores were printed with:

print('Multiple linear regression: %.2f' % accuracy_LinearRegression)
print('Decision Tree regression: %.2f' % accuracy_DecisionTree)
print('Random Forest regression: %.2f' % accuracy_RandomForest)
print('XGBoost regression: %.2f' % accuracy_Xgboost)

Multiple linear regression: 0.81
Decision Tree regression: 0.75
Random Forest regression: 0.81
XGBoost regression: 0.91
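The article does not define "accuracy" for these regression models. If the values come from scikit-learn's `score` method, which is an assumption, they are R² scores: one minus the ratio of the squared prediction error to the total variance of the true prices. The sketch below computes that quantity by hand on invented numbers; `r2_score_manual` is a hypothetical helper name.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Squared error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always predicting the mean price.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Invented prices and predictions, for illustration only.
prices = [20000.0, 30000.0, 40000.0, 50000.0]
predictions = [22000.0, 29000.0, 41000.0, 48000.0]

score = r2_score_manual(prices, predictions)
print(f"R^2: {score:.2f}")
```

Under this reading, a score of 1.0 means perfect predictions and 0.0 means the model does no better than always predicting the average price.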

4. Conclusions

The XGBoost Regressor outperformed all other models in predicting the market price of used cars, scoring 0.91 against 0.81 for the next-best models. Its ability to handle various types of data and its robustness against overfitting make it a strong choice for this task.

5. Key Takeaways

  • Data Cleaning is Crucial: Proper handling of missing values and data transformation significantly impact model performance.
  • Visualizations Aid Understanding: Data visualizations help in comprehending relationships and trends within the data.
  • Model Selection: Different models have varying strengths. Ensemble methods like Random Forest and XGBoost often outperform a single decision tree on regression tasks, although here Random Forest only matched linear regression (0.81 each).
  • XGBoost Superiority: XGBoost’s performance in this project underscores its effectiveness for regression problems in structured datasets.
