Price analysis of 100,000 new and used cars from 2022
SHAP values of car age, mileage, brand, and more
6 min read · Sep 24, 2022
As a follow-up to my previous story, "Which factors form the used car price? SHAP values based on the Kaggle vehicle dataset", I have found a much larger public dataset with detailed data on more than 100 thousand cars available for sale in 2022. Full details of the analysis can be found in this public Kaggle notebook.
Can you recognise which car brand is the most expensive?
Step 1 — data preprocessing
Here, data preprocessing consists of the following steps:
- removing the 1% of cars with the largest selling prices and the 1% with the smallest;
- log-transformation of car prices and mileages;
- selecting cars from brands that have at least 100 records in the dataset;
- grouping close numerical values into larger buckets;
- replacing null values;
- finally, removing unused columns.
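
The steps above can be sketched in pandas roughly as follows. This is only an illustration, not the notebook's actual code: the column names (`price`, `mileage`, `brand`, `year`, `fuel_type`, `listing_id`), the 5-year bucket width, the use of `log1p` for mileage, and the null-filling rules are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def preprocess(df, min_brand_count=100, unused_cols=("listing_id",)):
    """Sketch of the six preprocessing steps; all column names are hypothetical."""
    # 1. Drop the 1% cheapest and the 1% most expensive cars.
    low, high = df["price"].quantile([0.01, 0.99])
    df = df[df["price"].between(low, high)].copy()

    # 2. Log-transform prices and mileages. Prices are assumed positive;
    #    log1p (an assumption) keeps zero-mileage new cars finite.
    df["log_price"] = np.log(df["price"])
    df["log_mileage"] = np.log1p(df["mileage"])

    # 3. Keep only brands with at least `min_brand_count` records.
    records_per_brand = df["brand"].map(df["brand"].value_counts())
    df = df[records_per_brand >= min_brand_count].copy()

    # 4. Group close numerical values into larger buckets
    #    (assumed here: production year rounded down to 5-year bins).
    df["year_bucket"] = (df["year"] // 5) * 5

    # 5. Replace null values (assumed rules: a placeholder label for
    #    categories, the median for numeric columns).
    df["fuel_type"] = df["fuel_type"].fillna("unknown")
    df["log_mileage"] = df["log_mileage"].fillna(df["log_mileage"].median())

    # 6. Remove columns that are no longer needed (the raw inputs are
    #    treated as unused once transformed, which is also an assumption).
    drop = ["price", "mileage", "year", *unused_cols]
    return df.drop(columns=drop, errors="ignore").reset_index(drop=True)
```

Calling `preprocess(cars)` on a DataFrame with these columns returns a trimmed frame holding `log_price`, `log_mileage`, `year_bucket`, and the remaining categorical columns. The price trimming is done first so that the brand counts in step 3 are computed on the rows that will actually be kept.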