Matching Cities for Small Sample Experiments

We use a generalized form of propensity score matching to determine which characteristics of a city are most relevant to us for an experimental design.

Our marketing team wanted to run an experiment on the impact of a certain type of advertising. Because of the difficulties in cross-device attribution, we decided to isolate the test to certain geographies and look for the impact not at the user level, but rather at the geography level. This is a pretty straightforward experimental design, except for the wrinkle that we only have budget to run this test in a small number of geographies, maybe 3 or 4 at most.

There’s a lot of literature on how to accurately estimate significance in a small-sample experiment, but in particular we chose to follow the example outlined by Bowers and Panagopoulos and use randomization inference to determine the statistical significance of the impact. Additionally, we want to match cities into pairs before assigning treatment to one city in each pair. This allows us to use a technique known as ‘block random assignment’, which reduces the standard error of the estimated treatment effect and allows for a much more precise estimate of the impact of the treatment. (Gerber and Green’s Field Experiments, Chapter 3, provides an excellent treatment of this subject.)
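As a rough illustration of block random assignment with matched pairs, here is a minimal Python sketch. The function name `assign_within_pairs`, the seed, and the placeholder city names are assumptions made for this example; they are not taken from our actual experiment.

```python
import random


def assign_within_pairs(pairs, seed=None):
    """For each matched pair (block), randomly pick one city for treatment."""
    rng = random.Random(seed)
    assignment = []
    for city_a, city_b in pairs:
        # A fair coin flip inside each block decides which city is treated.
        if rng.random() < 0.5:
            treated, control = city_a, city_b
        else:
            treated, control = city_b, city_a
        assignment.append({"treatment": treated, "control": control})
    return assignment


# Hypothetical matched pairs (placeholder names, not real data).
pairs = [("City A", "City B"), ("City C", "City D"), ("City E", "City F")]
assignment = assign_within_pairs(pairs, seed=42)
for block in assignment:
    print(block)
```

Because the coin is flipped separately inside each pair, every pair always contributes exactly one treated and one control city.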

Outside of the statistical benefits associated with this matching design, it’s also simply easier for key stakeholders to understand and immediately ‘eyeball’ the impact of the experiment, which in many business applications can be very valuable.

While we’ll save the implementation of randomization inference for another post, here we’ll discuss how we chose the pairs of cities for the experiment. We want to choose cities that 1) minimize the variance in the outcome of interest and 2) maximize the likelihood of observing a statistically significant effect. To do this, we want to pair cities that are as similar as possible. But what does it mean for cities to be as similar as possible?

For us, it means we want them to have similar demographics and similar population sizes. We used data from the ACS to pull the following data for every metropolitan / micropolitan statistical area in the United States [1]:

  • Total population
  • Share of population that is male
  • Median age (of males)
  • Race (specifically, we calculated share white, share black, share Asian, and share other)
  • Employment status (share of total population employed, and share of total population unemployed)
  • Median income

Once we pull together this data, we have to create a measure of ‘similarity’. This is actually trickier than it seems at first glance because measuring the similarity between two cities in total population is easy, but measuring the similarity between two cities in terms of both population and median age is harder.

If you just use the simple multi-dimensional (Euclidean) distance which you might remember from high-school math, you’ll end up treating all of the attributes the same. But that doesn’t make sense in this situation. Not only are the cities’ attributes denominated differently (percents vs total counts vs dollars), but we have many reasons to believe that these variables shouldn’t be treated equally.
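To see how an unweighted distance can mislead, here is a small Python sketch. The three cities and all of their numbers are made up for illustration.

```python
import math

# Made-up cities: (total population, median age, median income in dollars).
cities = {
    "City A": (500_000, 30.0, 55_000),
    "City B": (510_000, 45.0, 55_000),  # similar size, much older
    "City C": (530_000, 30.5, 55_500),  # somewhat larger, similar age
}


def euclidean(u, v):
    """Plain multi-dimensional distance that treats every attribute equally."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


d_ab = euclidean(cities["City A"], cities["City B"])
d_ac = euclidean(cities["City A"], cities["City C"])
print(f"A-B: {d_ab:.1f}  A-C: {d_ac:.1f}")
```

In this toy data, City B comes out "closer" to City A than City C does, even though B's median age is 15 years higher. The population column is measured in much larger numbers than the age column, so it swamps everything else.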

Which is more important, population or age? What if there’s one city that’s closer in total population but farther apart in median age, and for another city it’s reversed? As you add more variables, this problem gets harder and harder, because you have to trade off ‘similarity’ in each of these different variables against the others.

Rather than trying to guess which variables are most important, _we can let the data tell us which variables are most important to us_. The demographics of a city that matter most to us (selling men’s grooming products) might be different than what matters to a company like Casper (selling mattresses) or Glossier (skincare marketed to women). So, the trick here is that we can use our own data to estimate the appropriate weights for each of these variables to determine the overall similarity between cities. This technique is very frequently used in healthcare analytics to match patients who are similarly likely to experience an adverse health event, and is often called ‘propensity score matching’.

While you can imagine a lot of algorithms for estimating these weights (pretty much any form of regression will work), we went with the most easily interpretable: linear regression. It turns out that the coefficients estimated in a linear regression are exactly what we’re looking for. Each coefficient can be interpreted as how much variation in a given variable impacts the outcome of interest (while holding the other variables fixed), which tells us how much we should weight differences in that variable when trying to determine overall similarity.
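To make the weighting concrete, here is a sketch in notation that we are introducing only for this illustration: let y_i be the outcome of interest for city i, x_ik its value on characteristic k (population, median age, and so on), and β_k the regression coefficient on that characteristic.

```latex
y_i = \beta_0 + \sum_{k=1}^{K} \beta_k x_{ik} + \varepsilon_i,
\qquad
\hat{y}_i = \hat{\beta}_0 + \sum_{k=1}^{K} \hat{\beta}_k x_{ik}

\hat{y}_i - \hat{y}_j = \sum_{k=1}^{K} \hat{\beta}_k \, (x_{ik} - x_{jk})
```

The second line follows from the first because the intercept cancels. The gap between two cities' predicted outcomes is the sum of the gaps in their characteristics, each weighted by its estimated coefficient.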

So we estimate the linear model, predict the responses, sort the data by prediction, and voila! All of the cities in our database are ranked by their similarity, with their characteristics weighted according to their relevance to Harry’s. In particular, the number of orders we receive on average per week in a given city is a very important metric to us, so we use that as the dependent variable in our regression, and use the predicted values as our key distance metric.
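The fit, predict, and sort steps can be sketched in plain Python as follows. This is an illustration under assumptions, not our production code: the helper names (`solve`, `fit_ols`, `predict`), the two features, and every number are invented, and population is expressed in units of 100,000 only to keep the toy arithmetic well behaved.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(xs, ys):
    """Least-squares coefficients (intercept first) via the normal equations."""
    design = [[1.0] + list(x) for x in xs]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)


def predict(beta, x):
    """Predicted outcome for one city's feature tuple."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))


# Made-up data: (population in 100k, median age) -> average weekly orders.
cities = ["A", "B", "C", "D", "E", "F"]
features = [(5.0, 30.0), (5.2, 41.0), (12.0, 33.0),
            (11.5, 29.0), (2.0, 38.0), (2.4, 35.0)]
orders = [110.0, 92.0, 230.0, 228.0, 40.0, 52.0]

beta = fit_ols(features, orders)
scores = {c: predict(beta, x) for c, x in zip(cities, features)}
ranked = sorted(scores, key=scores.get)  # cities ordered by predicted orders
for city in ranked:
    print(city, round(scores[city], 1))
```

Cities that end up next to each other in `ranked` are the candidates for pairing. In practice you would more likely reach for a statistics package than hand-roll the solver; it is written out here only so the sketch has no dependencies.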

Note: the data in this example are completely fake, but you get the idea.

From here, there are a number of algorithms one can imagine for choosing pairs of cities (these algorithms become increasingly important as the sample size you need to pair grows), but we simply ‘eyeballed’ it to choose cities that were adjacent in weighted similarity scores and within a tolerance range of actual observed orders [2]. Since we are only choosing a few cities for our experiment, we didn’t see a strong reason to over-complicate this piece.
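We did this step by eye, but for a larger sample one simple mechanical stand-in would be a greedy pass over the ranking. The sketch below is an assumption about how that might look, not what we ran: the function `pair_adjacent`, the `max_order_gap` tolerance, and all of the numbers are hypothetical.

```python
def pair_adjacent(rows, max_order_gap):
    """Greedily pair neighbours in the score ranking whose observed orders are close."""
    ranked = sorted(rows, key=lambda r: r["score"])
    pairs = []
    i = 0
    while i < len(ranked) - 1:
        a, b = ranked[i], ranked[i + 1]
        if abs(a["orders"] - b["orders"]) <= max_order_gap:
            pairs.append((a["city"], b["city"]))
            i += 2  # both cities are now used
        else:
            i += 1  # leave a unpaired and try b with its next neighbour
    return pairs


# Made-up similarity scores (predicted orders) and observed weekly orders.
rows = [
    {"city": "A", "score": 101.0, "orders": 110.0},
    {"city": "B", "score": 104.0, "orders": 92.0},
    {"city": "C", "score": 229.0, "orders": 230.0},
    {"city": "D", "score": 231.0, "orders": 228.0},
    {"city": "E", "score": 44.0, "orders": 40.0},
    {"city": "F", "score": 47.0, "orders": 52.0},
]
pairs = pair_adjacent(rows, max_order_gap=15.0)
print(pairs)
```

In this toy data, A and B are adjacent in score but their observed orders differ by more than the tolerance, so they are left unpaired. Only the E/F and C/D pairs survive.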

So, what we end up with are 3 or 4 pairs of cities that we know are very similar in terms of their demographics, weighted by their relevance to Harry’s. From here, all we have to do is run the experiment (and evaluate the results, but that’s a story for a later blog post).


1: The ACS package for R was indispensable for the munging process here, as was Census Reporter.

2: In this example, we want to minimize both the difference in similarity scores and the difference in actual observed orders. We need to be mindful of the ‘error’ between the similarity score and the actual observed number of orders, because a large error could indicate that one city is particularly over-saturated or under-saturated with Harry’s, which could impact the results of our experiment.

Originally published on June 21, 2016.

Harry's Engineering

The engineering blog of Harry's

Written by Michael Kaminsky
