Validation with holdout sampling — why you (or your data provider) better be doing it

In our experience, the standard response models that major firms build for their clients are simple linear regressions whose performance is validated against the same data used for model training. This is one of the shortcomings in current industry practice that Leadscore addresses.

In this post we’re going to talk about (1) why validating with the training set is bad, (2) how holdout sampling works, and (3) what kind of evaluation methods your data provider should be using. If your data provider, marketing firm, or internal analytics team isn’t rigorously using holdout sampling to validate your response model performance, you’re almost certainly missing out on higher campaign ROIs.

(For those of you who want a peek under the hood, we’ve made our Vowpal Wabbit, R, and Python scripts available on a public Google Drive link. Details below.)

What’s wrong with training set validation?

“Validation” refers to the quantitative metrics used to assess a model’s performance. It matters because a strong validation gives marketers confidence that their targeting model will actually deliver real insights that maximize campaign ROI.

When you train a model, be it a simple linear regression or a complex machine learning algorithm, that model fits itself to the patterns contained in the data set it uses for training. Many of these patterns are real. Other patterns, however, are particular to that specific data set and do not generalize to new data. We refer to the real underlying patterns in the data as “signal” and all the other patterns that are unique to that particular population as “noise.” When we fit a model to signal we have a relationship that generalizes well to new populations. When we fit a model to noise the quality of the relationship suffers, and now our model that performs well on the training data is considerably weaker when applied to a new population. These models are “overtrained.”
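To make the signal-versus-noise distinction concrete, here is a minimal sketch on toy data we invented for illustration (not the customer data discussed below). A model that simply memorizes its training set fits the noise perfectly and looks flawless on the training data, while a plain linear fit generalizes better to a holdout sample:

```python
import random
import statistics

random.seed(0)

def make_data(n):
    # True signal: y = 2x; the gaussian term is noise.
    return [(x, 2 * x + random.gauss(0, 1)) for x in (random.uniform(0, 10) for _ in range(n))]

train, holdout = make_data(50), make_data(50)

# An "overtrained" model: memorize the training set (1-nearest neighbor).
def memorizer(x):
    return min(train, key=lambda p: abs(p[0] - x))[1]

# A simple model: an ordinary least-squares line fit to the training set.
def linear_fit(data):
    xs, ys = zip(*data)
    xm, ym = statistics.mean(xs), statistics.mean(ys)
    slope = sum((x - xm) * (y - ym) for x, y in data) / sum((x - xm) ** 2 for x in xs)
    return lambda x: slope * x + (ym - slope * xm)

linear = linear_fit(train)

def mse(model, data):
    return statistics.mean((model(x) - y) ** 2 for x, y in data)

print("memorizer:", mse(memorizer, train), mse(memorizer, holdout))
print("linear:   ", mse(linear, train), mse(linear, holdout))
```

The memorizer's training error is exactly zero, yet its holdout error is worse than the linear model's: it has learned the noise, not just the signal.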

The remedy: holdout sampling

A holdout sample is a separate data set that we dedicate exclusively to model validation. We develop our model using a training data set and then use it to predict outcomes in the holdout sample. We only consider one model to be “better” than another if it has superior performance in the holdout sample. If your data provider or marketing firm is validating your response models with training data sets, odds are that your targeting is suffering and that you’re missing out on even higher conversion rates.
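Mechanically, creating a holdout sample is simple: shuffle the rows and set a slice aside before any training happens. A minimal sketch in Python (the 80/20 split ratio is our assumption for illustration; pick whatever your data volume supports):

```python
import random

random.seed(42)

rows = list(range(1000))  # stand-ins for campaign records
random.shuffle(rows)

cut = int(0.8 * len(rows))  # assumed 80% train / 20% holdout split
train_rows, holdout_rows = rows[:cut], rows[cut:]

# The holdout rows never touch model training; they exist only for validation.
assert not set(train_rows) & set(holdout_rows)
```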

Methods for evaluating holdout sample performance

For this exercise we’ve used real customer data from multiple direct mail campaigns. The features are anonymized and the data is over-sampled to have heavier representation of the target variable (“buy”), but other than that this is exactly what we see when we work for this customer.
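Over-sampling of the kind described here can be done by replicating the rare positive rows so they carry more weight in training. A hypothetical sketch (the 1% base rate and 20% target rate are our assumptions; the post doesn't state the actual ratios):

```python
import random

random.seed(1)

# Toy records with a ~1% "buy" rate, typical of direct mail response data.
records = [{"buy": 1} if random.random() < 0.01 else {"buy": 0} for _ in range(10000)]

positives = [r for r in records if r["buy"] == 1]
negatives = [r for r in records if r["buy"] == 0]

# Resample buyers with replacement until they make up ~20% of the pool.
target_ratio = 0.20
n_pos = int(target_ratio * len(negatives) / (1 - target_ratio))
oversampled = negatives + [random.choice(positives) for _ in range(n_pos)]
random.shuffle(oversampled)
```

One caveat worth stating plainly: over-sample only the training set, never the holdout sample, or your validation scores will no longer reflect real-world response rates.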

The forecasting models that we use are a logistic regression and a neural network, both powered by an awesome tool called Vowpal Wabbit. We like VW for a few reasons: (1) it’s super-fast, (2) it streams data from disk rather than loading it all into RAM, (3) it can train models over an incredible number of feature interactions, (4) its performance compares favorably to what we’ve seen in R and Python. VW is quickly taking over our analytics stack, and we’re excited to share more about it with you in future posts.

The learners whose performance we’ll look at here are trained over every possible feature-feature interaction. VW’s algorithms learn by combing through the data one row at a time and are refined by making multiple “passes” through the data. On these graphs, when we say that we increase model complexity we mean that we’re increasing the number of passes through the data, i.e. the number of times that a learner gets exposed to the training data. After each pass we predict outcomes in the training and holdout sets and compare performance. Here’s what this looked like for the logistic model:

With each new pass through the data our model’s training set performance (blue line) improves, at first rapidly, and then by smaller and smaller marginal increments. The holdout sample performance, however, tells a different story: it improves up through the fourth pass and then steadily declines. The high point in the holdout sample performance is where we consider the model to have optimized the signal-versus-noise tension described above, training to the real patterns in the data that generalize to new cases.

The neural network reveals a similar pattern:

Even though we can continue to improve training set performance by continuing to make our model more complex, we should stop around iteration 4–6 to optimize the model’s usefulness as a scoring tool.
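The stopping rule the two plots suggest can be written down directly: score the holdout sample after every pass and keep the model from the best-scoring pass. A sketch with illustrative numbers (these scores are made up to mimic the curves described above, not the actual results):

```python
# Hypothetical holdout scores after each pass: rapid early gains,
# a peak, then decline as the model starts fitting noise.
holdout_scores = [0.41, 0.47, 0.50, 0.52, 0.51, 0.49, 0.46, 0.44]

best = max(range(len(holdout_scores)), key=lambda i: holdout_scores[i])
print("keep the model from pass", best + 1)  # passes are numbered from 1
```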

Your marketing agency or data provider should be able to clearly describe to you what scoring metrics they use to validate model performance and how they tune and select models based on holdout sample validation. They may not have the same pretty pictures that we do, but they should definitely have a process in place. VW’s modeling framework does this automatically — we actually had to turn its internal holdout validation routine off for this study — but the tools that most marketers are using do not natively incorporate holdout sample validation for model training. Our attention to this important process is part of what makes our models better than our competitors’.

For our fellow nerds:

Reproducing the analysis

All the data and code used in this post are available here. To get going: (1) download the file, (2) unpack it in your downloads directory, and (3) open the file and execute the first few lines, which create a new folder in your root directory and move the contents over from your downloads folder. This should run fine on Mac OS X and Linux, though you may have to play with the file paths a bit. Once you’ve unpacked the file and set the new folder as your working directory in a terminal window, you can reproduce this analysis by manually passing the arguments on the command line. Note that you need to have Vowpal Wabbit set up, along with R and Python. Setting up VW on Mac OS X can be tricky, but we’ve written a StackOverflow post about that to get you started. Email us if you have any questions.

Scoring metrics

The question of what scoring metric you use to benchmark model performance is key to any validation exercise. For many of our applications we’re forecasting rare events, e.g. conversion to customer in response to a direct mail campaign, which usually occur less than 1% of the time. Rather than a simple minimization of loss, we favor the Normalized Weighted Gini Index (NWGI), which is what we’ll use here. In short, the NWGI measures how effective a model is at sorting an outcome variable of interest such that the positive outcomes (“bought something”) are at the top and the negative outcomes (e.g. “no response”) are at the bottom. A score of 0 means the model performs no better than random, and a score of 1 means that the model perfectly sorts all positive and negative events. The NWGI is our preferred scoring metric for models that forecast rare but important events.
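For readers who want the metric itself, here is a minimal normalized Gini implementation in Python. One hedge: this is the common unweighted form; the “weighted” part of the NWGI would multiply each row by a case weight, which we omit here for brevity.

```python
def gini(actual, predicted):
    """Gini score: sort actuals by predicted score (descending) and
    accumulate; higher means positives are concentrated near the top."""
    n = len(actual)
    order = sorted(range(n), key=lambda i: (-predicted[i], i))  # ties keep row order
    ranked = [actual[i] for i in order]
    total = sum(ranked)
    running, g = 0.0, 0.0
    for a in ranked:
        running += a
        g += running
    return (g / total - (n + 1) / 2.0) / n

def normalized_gini(actual, predicted):
    # 1.0 = perfect sort, 0 = no better than random, negative = actively harmful.
    return gini(actual, predicted) / gini(actual, actual)

buys = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
print(normalized_gini(buys, buys))                   # a perfect ranking
print(normalized_gini(buys, [1 - b for b in buys]))  # the worst possible ranking
```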

Come visit us to learn more about our work, or call +1 (619) 365–4231.
