Predicting Car Auction Prices with Machine Learning

Look at this thing!
  • The layout of the site was dead simple, and it was easy to see all auction listings.
  • All historical auctions and bid histories are left online.
  • All auctions have the same timeframe and the same basic rules.

Building a model

Stata scatterplot of the first tpot model's output: actual vs. predicted final auction prices
Regression output of the first tpot model: an R² of 0.46 is good but not great
A much tighter scatterplot of actual vs. predicted final auction prices given 24 hours of data
R² is now 0.82, which is ... nice.

Building a Family of Models

  • Access to all observations went through one of two methods, Listing.test_cases or Listing.train_cases. Train cases were the cases where an MD5 hash of the listing URL ended in a digit, and test cases were the cases where the hash ended in a-f. This way, no listing could ever be used both to train and to test any model, so there was never any cross-contamination.
  • A standard feature generator was implemented which explicitly returns features for a Listing at time t. All feature generation was centrally located, and its queries explicitly omit any data timestamped later than the specified t.
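
The two rules above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the Bid and Listing dataclasses, the is_train_case and features_at helpers, and the three example features are all hypothetical names chosen for this sketch. Only the MD5-suffix rule and the "nothing later than t" rule come from the description above.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List


def is_train_case(url: str) -> bool:
    """True when the MD5 hex digest of the URL ends in a digit (0-9).

    Digests ending in a-f are treated as test cases.
    """
    last_char = hashlib.md5(url.encode("utf-8")).hexdigest()[-1]
    return last_char.isdigit()


@dataclass
class Bid:  # hypothetical shape of one bid record
    hours_since_start: float
    amount: float


@dataclass
class Listing:  # hypothetical stand-in for the real Listing model
    url: str
    bids: List[Bid] = field(default_factory=list)


def features_at(listing: Listing, t: float) -> Dict[str, float]:
    """Build features using only bids placed at or before hour t."""
    visible = [b for b in listing.bids if b.hours_since_start <= t]
    return {
        "bid_count": float(len(visible)),
        "current_price": max((b.amount for b in visible), default=0.0),
        "hours_elapsed": t,
    }
```

Because the split depends only on the URL, a listing always lands in the same set no matter when or how often it is loaded. A hex digest ends in one of sixteen characters, ten of which are digits, so roughly ten sixteenths of listings should fall into the training set.
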
Accuracy measures for 14 time-stamped models predicting auction outcomes.
Function for generating interpolated prediction at any t
Interpolated prediction for 1978 Ford Bronco, initial modeling effort
Estimated final auction price and error bars on final auction price for 1967 MGB GT
Timestamped prediction from the model
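
The original interpolation function is not reproduced here, so the following is only one plausible sketch of the idea: with models trained at fixed time stamps, a prediction for an arbitrary t can be formed by linearly blending the two models whose time stamps bracket t. The function name, the predict_at callable, and the choice of linear weighting are all assumptions made for illustration.

```python
from bisect import bisect_right
from typing import Callable, Sequence


def interpolated_prediction(
    t: float,
    model_hours: Sequence[float],
    predict_at: Callable[[float], float],
) -> float:
    """Blend the predictions of the two time-stamped models nearest to t.

    predict_at(h) is assumed to return the prediction of the model
    trained for hour h, given whatever features are available now.
    """
    hours = sorted(model_hours)
    # Outside the covered range, fall back to the nearest model.
    if t <= hours[0]:
        return predict_at(hours[0])
    if t >= hours[-1]:
        return predict_at(hours[-1])
    i = bisect_right(hours, t)
    lo, hi = hours[i - 1], hours[i]
    weight = (t - lo) / (hi - lo)  # 0 at lo, approaching 1 at hi
    return (1 - weight) * predict_at(lo) + weight * predict_at(hi)
```

For example, with models at hours 0, 24, and 48 that predict 1000, 2000, and 4000 respectively, a query at t = 12 returns 1500 and a query at t = 36 returns 3000.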

Productizing Models

Screenshot from BAT Predictor showing a sample prediction
BAT Predictor architecture diagram


  1. For car people, I want you to go try out the BAT Predictor for yourself and send it to people who’d want to subscribe to alerts for it.
  2. Much more for the ML people, I wanted to share my process of developing an ML model from a wager with a friend about whether or not something can be predicted all the way to a website allowing people access to those predictions.
  3. Auction prediction teardowns and academic literature were not particularly helpful when I was thinking through this project.

  1. Focus the majority of your time on getting good data and feature engineering. Packages like tpot go a long way in doing the more “machine learning engineer” parts of the job — the biggest way you can help yourself is not fussing over parameters and algorithms, but instead, coming up with features that are as informative as possible, and managing ways to keep those features flowing into the models.
  2. Build iteratively, and check for diminishing returns at each step. When I started this project, I built a model that took me about two hours to put together. Anything further could have been a waste of time if the model had turned out to perform poorly anyway. Take steps in building your infrastructure that are in line with how successful things have turned out already.
  3. Build draft models that test end-to-end results before building final models that are as accurate as possible. Models can take a long time to optimize, and if you’re using rented machines in the cloud, that can cost money. If you’re pretty sure you’re going to go all the way with a project, set aside the model optimization question at first, and build an initial model in place of your final one that lets you focus on the rest of the architecture. In some projects, I’ve even stubbed in a random number generator in place of a model, anticipating the replacement but wanting to focus on the rest of my stack.
  4. Isolate the machine learning part of your application. Just like any other project, you don’t want to let concerns seep out of modules and work their way into the entire stack. What I like to do is build a clean layer of separation between my ML models and the rest of the system that uses the ML models as a feature. I don’t want to think about error residuals in the front end. I don’t want to think about how to deal with imputed values for missing feature values in the daily email I send to subscribers. Keep the concerns separated, and make decisions about how to make your interface with the ML model as minimal as possible. Ideally, you have a simple query point where you send in an observation, and get out a prediction, and that’s all it does. Do the rest of the triaging and finessing in some middle layer, and then when presenting the results, only concern yourself with displaying those results neatly.
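
Points 3 and 4 combine naturally: if the rest of the stack only ever talks to a single predict call, a random-number stub can stand in for the real model until it is ready. The sketch below shows one way that could look. The PriceModel protocol, RandomStubModel, format_prediction, and the price range are hypothetical names and values invented for this example, not part of the BAT Predictor.

```python
import random
from typing import Dict, Optional, Protocol


class PriceModel(Protocol):
    """The only surface the rest of the stack is allowed to see."""

    def predict(self, features: Dict[str, float]) -> float: ...


class RandomStubModel:
    """Placeholder model: ignores its input and returns a random price."""

    def __init__(self, low: float = 5_000.0, high: float = 50_000.0,
                 seed: Optional[int] = None) -> None:
        self._rng = random.Random(seed)
        self._low, self._high = low, high

    def predict(self, features: Dict[str, float]) -> float:
        return self._rng.uniform(self._low, self._high)


def format_prediction(model: PriceModel, features: Dict[str, float]) -> str:
    """Presentation-layer code: knows nothing about how the model works."""
    price = model.predict(features)
    return f"Predicted final price: ${price:,.0f}"
```

Swapping in a trained model later means writing one more class with the same predict method. The formatting code, and anything else downstream, stays untouched.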




Niche machine learning models done neat. Currently serving and

Cognitive Surplus
