Predicting Car Auction Prices with Machine Learning

Look at this thing! As a dataset, BringATrailer.com has a lot going for it:
  • The layout of the site is dead simple, making it easy to see every auction listing,
  • All historical auctions and their bid histories are left online,
  • All auctions run on the same timeframe and the same basic rules.

Building a Model

I had a hunch that the BringATrailer.com auction market could be predicted to some degree, but I didn’t want to burn a week of time just to find out that the answer was no. As such, it was important to take only steps as big as were warranted along the way: if a step forward in modeling the market wasn’t met with a demonstrably better model, it’d be time to walk away from the project. As a first pass, taking just core vehicle features into account (i.e. features that would be present the second the auction was posted), the model did well but not great:

[Figure: STATA scatterplot of the first tpot model’s output, actual vs. predicted final auction prices]
[Figure: Regression output of the first tpot model; R² = 0.46 is good but not great]

Adding 24 hours of auction data to the features tightened things up considerably:

[Figure: A much tighter scatterplot of actual vs. predicted final auction prices, given 24 hours of data]
[Figure: Regression output; R² is now 0.82, which is nice]
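For the curious, a first-pass baseline along these lines might look like the sketch below. The file path, column names, and split here are hypothetical stand-ins for the real schema; the point is just how little code tpot needs to get to a workable model.

import pandas as pd
from sklearn.model_selection import train_test_split
from tpot import TPOTRegressor

# Hypothetical dump of listings with core vehicle features known at post time
listings = pd.read_csv("listings.csv")
features = listings[["year", "mileage", "make_id", "model_id"]]
target = listings["final_price"]

X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=42)

# Let tpot search over pipelines and hyperparameters instead of hand-tuning
model = TPOTRegressor(generations=5, population_size=50, scoring="r2",
                      verbosity=2, random_state=42)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))    # R² on held-out listings
model.export("baseline_pipeline.py")  # frozen sklearn pipeline for later use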

Building a Family of Models

Convinced that the results were promising, I decided to generate not a single model but 14 models at 12-hour intervals, starting the second an auction went online. Ideally, I’d train each model only on data available up to a particular time t hours into the auction. Time-indexed models like these are easy to get wrong, so I paused to implement a few pieces of code to keep the guardrails on my models (e.g. to avoid accidentally feeding data from t=96 into a model that’s trying to predict based on t=48):

  • Access to all observations went through two methods, Listing.test_cases and Listing.train_cases. Train cases were listings where an MD5 hash of the BringATrailer.com listing URL ended in a digit, and test cases were listings where the hash ended in a-f. This way, there could never be any train/test cross-contamination for any of the models.
  • A standard feature generator was implemented that explicitly returns the features for a Listing at time t. All feature generation was centrally located, and the underlying queries explicitly omit any data timestamped after the specified t (see the sketch below).
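A minimal sketch of those two guardrails, with hypothetical class, attribute, and feature names standing in for the real code:

import hashlib

def is_test_case(listing_url: str) -> bool:
    """Deterministic split: MD5 the listing URL; last hex char 0-9 -> train, a-f -> test."""
    return hashlib.md5(listing_url.encode("utf-8")).hexdigest()[-1] in "abcdef"

def features_at(listing, t_hours: float) -> dict:
    """Single chokepoint for feature generation, so nothing after t can leak in."""
    bids = [b for b in listing.bids if b.hours_since_post <= t_hours]
    return {
        "year": listing.year,        # core vehicle features, known at t=0
        "mileage": listing.mileage,
        "n_bids": len(bids),         # time-dependent features, cut off at t
        "high_bid": max((b.amount for b in bids), default=0),
    }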
[Figure: Accuracy measures for the 14 time-stamped models predicting auction outcomes]
[Code screenshot: function for generating an interpolated prediction at any t]
[Figure: Interpolated prediction for a 1978 Ford Bronco, initial modeling effort]
[Figure: Estimated final auction price, with error bars, for a 1967 MGB GT]
[Figure: Timestamped prediction from the model]
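The interpolation function survives above only as a screenshot, but the idea is simple enough to sketch. Assuming a dict mapping each 12-hour checkpoint to its fitted model, and the hypothetical features_at helper from the earlier sketch, a linear blend between the two nearest checkpoints might look like this (the actual implementation may differ):

import pandas as pd

def predict_at(models: dict, listing, t_hours: float) -> float:
    """Blend the predictions of the two checkpoint models bracketing t_hours."""
    checkpoints = sorted(models)  # e.g. [0, 12, 24, ..., 156]
    lo = max(c for c in checkpoints if c <= t_hours)
    hi = min((c for c in checkpoints if c >= t_hours), default=lo)
    p_lo = models[lo].predict(pd.DataFrame([features_at(listing, lo)]))[0]
    if hi == lo:                  # t falls exactly on a checkpoint
        return float(p_lo)
    p_hi = models[hi].predict(pd.DataFrame([features_at(listing, hi)]))[0]
    w = (t_hours - lo) / (hi - lo)  # linear weight between the two checkpoints
    return float((1 - w) * p_lo + w * p_hi)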

Productizing Models

Finally, I added a few nice touches to the model. I hate running Python in production, and I prefer writing my “glue” apps in Ruby. As a result, all the prediction work is done in Python by loading my joblib’ed models; the Python workers receive work requests via a Redis queue and respond with their predictions for the given observations on an output queue. The Ruby code handles database management and record reconciliation, along with collecting new data from BringATrailer.com. I also added a front-end in Node that lets people look up price predictions and sign up for alerts on predictions for given makes and models:

[Figure: Screenshot from the BAT Predictor showing a sample prediction]
[Figure: BAT Predictor architecture diagram]
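The Python side of that diagram is easy to sketch. The queue names and message shape below are guesses rather than the actual protocol, but the worker loop looks something like:

import json
import joblib
import redis

r = redis.Redis()
# one joblib'ed model per 12-hour checkpoint, loaded once at startup
models = {t: joblib.load(f"model_t{t}.joblib") for t in range(0, 168, 12)}

while True:
    _, raw = r.blpop("prediction_requests")  # block until the Ruby side enqueues work
    req = json.loads(raw)  # e.g. {"listing_id": ..., "t": 24, "features": [...]}
    pred = models[req["t"]].predict([req["features"]])[0]
    r.rpush("prediction_results", json.dumps(
        {"listing_id": req["listing_id"], "predicted_price": float(pred)}))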

Remarks

The point of sharing this whole explanation is severalfold:

  1. For car people, I want you to go try out the BAT Predictor for yourself and pass it along to anyone who’d want to subscribe to its alerts.
  2. Much more for the ML people, I wanted to share my process of developing an ML model, from a wager with a friend about whether something could be predicted at all, all the way to a website allowing people access to those predictions.
  3. Auction prediction teardowns and academic literature were not particularly helpful when I was thinking through this project, so here are the lessons that mattered most:
  1. Focus the majority of your time on getting good data and feature engineering. Packages like tpot go a long way in doing the more “machine learning engineer” parts of the job. The biggest way you can help yourself is not by fussing over parameters and algorithms, but by coming up with features that are as informative as possible, and by managing ways to keep those features flowing into the models.
  2. Build iteratively, and check for diminishing returns at each step. When I started this project, I built a model that took me about two hours to put together. Anything more elaborate would have been a waste of time if that model had ended up performing poorly. Scale your infrastructure investment to how successful the results have been so far.
  3. Build draft models that test end-to-end results before building final models that are as accurate as possible. Models can take a long time to optimize, and if you’re using rented machines in the cloud, that can cost money. If you’re pretty sure you’re going to go all the way with a project, set aside the model optimization question at first, and build an initial model in place of your final one that lets you focus on the rest of the architecture. In some projects, I’ve even stubbed a random number generator for a model, anticipating the replacement but wanting to focus on the rest of my stack.
  4. Isolate the machine learning part of your application. Just like in any other project, you don’t want to let concerns seep out of modules and work their way into the entire stack. What I like to do is build a clean layer of separation between my ML models and the rest of the system that uses them as a feature. I don’t want to think about error residuals in the front end. I don’t want to think about imputing missing feature values in the daily email I send to subscribers. Keep the concerns separated, and make your interface with the ML model as minimal as possible. Ideally, you have a single query point where you send in an observation and get back a prediction, and that’s all it does (a sketch combining this with the stubbed model from the previous point follows this list). Do the rest of the triaging and finessing in some middle layer, and when presenting the results, only concern yourself with displaying them neatly.
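As promised above, here’s a sketch of points 3 and 4 together: a single minimal query point that the rest of the stack codes against, with a stub standing in until the real model is ready. All names here are illustrative, not from the project.

import random
import joblib
import pandas as pd

class PricePredictor:
    """The one interface the rest of the stack sees: observation in, prediction out."""
    def predict(self, features: dict) -> float:
        raise NotImplementedError

class StubPredictor(PricePredictor):
    """Stands in while the real model is still being optimized, so the stack can ship."""
    def predict(self, features: dict) -> float:
        return random.uniform(5_000, 50_000)  # plausible-looking placeholder price

class TrainedPredictor(PricePredictor):
    """Drop-in replacement once the joblib'ed model exists; callers never change."""
    def __init__(self, path: str):
        self._model = joblib.load(path)
    def predict(self, features: dict) -> float:
        return float(self._model.predict(pd.DataFrame([features]))[0])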
