Building a robust price prediction model for used cars

Tobia Albergoni
Published in ELCA IT
5 min read · May 5, 2021

Price estimation models have long attracted considerable attention from the machine learning community. The rise of large online marketplaces for all kinds of used objects has increased the need for automated tools that quickly and accurately predict reasonable price tags. Housing and vehicles are two classic examples of high-value items whose complex characteristics lead to non-trivial pricing and depreciation trends. Precise prediction models are extremely valuable to both individuals and businesses, and many digital marketplaces are investing heavily in their development without disclosing the details. In this short note, we present the motivations and goals of a project in which we attempted to develop a price estimation machine learning (ML) model for used cars on the Swiss market.

Our vehicle evaluation tool for the Swiss used cars market

Used cars market

Car trading gives rise to extremely large and active markets all around the world. With around 4.5 million registered passenger cars in 2018 and more than 300’000 new registrations in 2019, Switzerland is no exception: its motorization rate of 543 vehicles per 1’000 inhabitants is higher than the European average. Such a large influx of new cars generates a parallel market of similar magnitude for used vehicles.

Car price estimation can be framed as a standard regression problem. The target quantity is the sale price of a used car, which we want to predict from a set of features describing either the car’s original characteristics or its wear level. We think that providing consumers with unbiased and practical tools to predict the value of their items is a useful contribution to the current ecosystem of digital marketplaces. This is why we developed our machine learning model with this goal in mind and equipped it with a simple user interface.
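To make the regression framing concrete, here is a minimal sketch in Python. It is not our model: it fits a one-feature least-squares line to a handful of made-up listings. The `Listing` class, the choice of age as the only feature and all the numbers are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Listing:
    """One sale announcement (hypothetical, reduced to a single feature)."""
    age_years: float   # wear-level feature
    price_chf: float   # regression target


def fit_simple_regression(xs, ys):
    """Closed-form least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up toy data, not taken from any of the datasets in this article.
listings = [
    Listing(age_years=1, price_chf=27_000),
    Listing(age_years=3, price_chf=21_000),
    Listing(age_years=5, price_chf=15_000),
    Listing(age_years=7, price_chf=9_000),
]

intercept, slope = fit_simple_regression(
    [l.age_years for l in listings],
    [l.price_chf for l in listings],
)


def predict_price(age_years):
    """Predict a price in CHF from the car's age using the fitted line."""
    return intercept + slope * age_years
```

A realistic predictor would use several features and a more flexible model, but the structure is the same: features in, one price out.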

Data collection

Obtaining a large volume of data samples is a key challenge for most machine learning projects. Luckily, digital marketplaces are rich data sources for used items such as cars: each car sale announcement can be collected as a data sample. While websites differ in the layout and information contained in each offer, the most important car features are always present, along with the current asking price of the item.

An example of an online car sale announcement found on the AutoScout24 website

It is generally easy to collect large quantities of these announcements. We automatically collected three datasets:

  • The AutoScout24-CH dataset contains 119’414 car sale announcements.
  • AutoScout24-DE is a similarly structured set of 558’295 German sale announcements (extracted from the European website version).
  • A third Swiss dataset from another website (Comparis-CH, 111’972 samples) was collected in a second phase.

Challenges

The car price estimation problem poses several interesting challenges:

  • Item variety and complexity. The number of different car models produced by manufacturers is large and keeps growing as new cars are launched every year. Car models evolve, and the same car is available in variants belonging to different power classes, not to mention the huge range of possible options and accessories. Many factors contribute to item pricing, which makes it hard to choose a modeling approach.
  • The environment evolves rapidly. Building a predictor that stands the test of time and remains useful beyond a narrow time window is a considerable challenge.
  • Imbalanced datasets. Any car dataset that reflects the distribution of real-world markets will be imbalanced in terms of manufacturers and car models. The scarcity of sale announcements for many types of cars makes it difficult to define a single machine learning model that performs reasonably well for a wide range of car models. Finding ways to leverage similarities between different cars is another necessity of the predictor, especially for smaller markets like the Swiss one.
  • Unreliable data. Digital marketplaces show us only the seller’s proposed price. This estimate will in most cases follow real market trends and be a reasonable price tag, but we do not know the final selling price of the item after off-platform negotiations. It is also difficult to detect offers whose asking price is largely inaccurate. For these reasons, we should be aware that these crowd-based datasets are inherently noisy. The ML model must therefore account for high output variability and outliers.
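One common way to deal with noisy asking prices is to flag listings that sit far from the typical price of their own car model. The sketch below is an illustration of that idea and not the procedure used in our project. It uses the modified z-score based on the median absolute deviation (MAD). The function name, the threshold of 3.5 and the minimum group size are assumptions chosen for the example.

```python
from collections import defaultdict
from statistics import median


def flag_price_outliers(listings, threshold=3.5, min_group_size=5):
    """Return sorted indices of listings priced far from their model's median.

    listings: list of (model_name, price) tuples.
    A listing is flagged when its modified z-score,
    0.6745 * |price - median| / MAD, exceeds `threshold`.
    """
    by_model = defaultdict(list)
    for idx, (model, price) in enumerate(listings):
        by_model[model].append((idx, price))

    outliers = []
    for rows in by_model.values():
        if len(rows) < min_group_size:
            continue  # too few same-model listings to judge reliably
        prices = [price for _, price in rows]
        med = median(prices)
        mad = median(abs(price - med) for price in prices)
        if mad == 0:
            continue  # all prices (nearly) identical, nothing to flag
        for idx, price in rows:
            if 0.6745 * abs(price - med) / mad > threshold:
                outliers.append(idx)
    return sorted(outliers)


# Made-up example: one implausible asking price among similar listings.
sample = [
    ("Model A", 10_000),
    ("Model A", 10_500),
    ("Model A", 9_800),
    ("Model A", 10_200),
    ("Model A", 9_900),
    ("Model A", 45_000),   # suspicious listing
    ("Model B", 30_000),   # group too small, never flagged
    ("Model B", 90_000),
]
flagged = flag_price_outliers(sample)
```

Note how the rare "Model B" is skipped entirely: with only two listings there is no reliable notion of a typical price. This is the imbalance problem described above, and a reason to borrow information from similar cars.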

Representation of used cars

In keeping with our goal of a simple and interpretable model, we limit ourselves to the following set of car features. The inputs to our ML model consist of cars represented with this information.

Outcomes and price prediction interface

Our machine learning price prediction model performs remarkably well given the simplicity of our modeling approach. The mean absolute error is 2187 CHF, measured across all car models on the market, including luxury and very expensive ones. For the vast majority of cars, the average estimation error is less than 20% of the price variability of that particular car model in the training data.

The following user interface allows interacting with our predictor. The user can enter the characteristics of a car and obtain a price estimate, along with a confidence interval and the number of same-model cars that the model has seen during training. Be aware that the model was trained on 2019 data and has not been retrained or updated since.
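For readers who want to compute similar figures on their own data, the sketch below shows one possible way to obtain a mean absolute error and a per-model error relative to price variability. It assumes that "price variability" means the standard deviation of training prices for a given car model, which is only one reasonable reading. The function names and the toy numbers are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted prices."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def relative_error_per_model(models, actual, predicted, train_prices_by_model):
    """Per-model MAE divided by the std dev of that model's training prices.

    Assumption: price variability is measured with the population
    standard deviation of the training prices.
    """
    errors = defaultdict(list)
    for model, a, p in zip(models, actual, predicted):
        errors[model].append(abs(a - p))

    ratios = {}
    for model, errs in errors.items():
        spread = pstdev(train_prices_by_model[model])
        if spread > 0:
            ratios[model] = mean(errs) / spread
    return ratios


# Made-up evaluation data (prices in CHF).
models = ["MiTo", "MiTo", "Golf", "Golf"]
actual = [9_000, 11_000, 20_000, 22_000]
predicted = [9_500, 10_500, 19_000, 23_000]
train_prices = {
    "MiTo": [8_000, 10_000, 12_000],
    "Golf": [15_000, 20_000, 25_000],
}

mae = mean_absolute_error(actual, predicted)
ratios = relative_error_per_model(models, actual, predicted, train_prices)
```

A ratio below 0.2 for a given model would correspond to the "less than 20% of the price variability" criterion under this assumption.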

At the top, an announcement for an ALFA ROMEO MiTo found on AutoScout24 in March 2021. At the bottom, the prediction obtained through the interface, which in this case is quite accurate.

Price estimation for complex items such as cars is challenging, mostly because the available data has some key weaknesses that require well-thought-out solutions and a holistic approach. This is especially true for small markets such as the Swiss one, even more so when the goal is to propose a user-friendly price estimation tool. It is desirable to work our way to the best possible results while keeping the feature set small and interpretable for the general public, and this requires refined data preparation and feature engineering processes.

We hope that this quick overview of the car price estimation task and our simple tool was a helpful read for anyone intending to tackle similar problems in the future. The presentation of our full model architecture will be the subject of a future article.
