Recovering from Selection Bias with Hierarchical Bayes

Brian Callander
ResearchGate
Apr 15, 2020

Surveys are a valuable method for gathering data on user experience. However, how can we be sure that the views expressed by those who give feedback are representative of the broader population of users? In this article, I explain how we are doing this at ResearchGate.

For example, if we hear in a user interview that a user finds the research stats overview feature on ResearchGate helpful and would be interested in more detailed stats, a survey can help us quantify how many users would find a more detailed stats overview helpful.

There is, however, a challenge with surveys: survey respondents are usually not representative of the overall user base. They typically suffer from a self-selection bias, with more active users being more likely to respond. Since we display surveys on-site, users must be logged in to respond, and more active users are more likely to log in while the survey is running.

Another challenge is that survey response behavior differs across cultures. For example, native English speakers might have a better experience on an English-only platform than non-native English speakers. Users from different cultures might also be over- or under-represented relative to the overall population of interest, which introduces another source of bias.

If we do not take those biases into account, we might draw misleading conclusions from our survey and consequently make decisions that do not serve our users’ needs. To correct this bias in user survey data, we applied an approach to adjust the survey responses using a hierarchical Bayesian model.

Who responds to surveys?

Online survey responses are sometimes collected from whoever happens to respond first (convenience sampling). Our more engaged users are around more often and are more willing to respond to surveys.

On the other hand, a lot of insight can be found in the undersampled, medium-engaged segment, since these users get enough value to use the platform but not enough to make it a core part of their workflow. Highly engaged users are typically also more enthusiastic about the product, so gauging product satisfaction of the user base from raw survey data alone would overestimate satisfaction.

This issue can appear in different forms. For example, different countries have different social and academic cultures, and at ResearchGate we consistently see certain countries with higher response rates than others.

A simple solution

The difficulty is that the raw survey responses are not a faithful representation of the average population response: a survey respondent is more likely to be more engaged, and thus to have more favorable views of the platform. A simple solution is just to collect data without selection bias. This requires some preparation before launching the survey:

  • Define your population of interest: for example, questions about experiences with the peer-review process of academic publishers are only relevant to those with peer-reviewed articles. This can be a subset of your total user base.
  • Discuss relevant population features: the researcher’s current position or discipline is relevant to many academic survey questions, such as funding needs.
  • Find out which users love to respond: as described above, factors such as engagement level and country are often relevant factors in ResearchGate surveys.

Now you can write a query for your population of users to get the relative proportion of each demographic cell above, then collect responses from those cells in those proportions. This method often goes by the name of ‘quota sampling’.
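For illustration, here is a minimal sketch of such a query using dplyr. The `population` data frame and its columns are hypothetical stand-ins for whatever user table you have; the resulting `weights` table is the same kind of object used for post-stratification later on.

library(dplyr)

# Hypothetical table with one row per user in the population of interest
# and the demographic features identified above.
weights <- population %>%
  count(position, country, engagement, name = "n") %>%
  mutate(weight = n / sum(n)) %>% # relative proportion of each cell
  select(-n)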

This is effective and simple and can get you pretty far, but there are a number of situations where it is not ideal.

Certain demographic cells can take a long time to fill up, which can be a problem in a fast-paced startup. It’s also possible that somebody else collected the data without some of the planning described above. Or maybe you would simply like a model-based approach to explore the implications of different assumptions.

In such situations, you’ll need some method of correction.

MRP

MRP (multilevel regression and post-stratification) is a model-based framework to adjust for the known differences between sample and population. It was originally developed by Andrew Gelman in the context of U.S. election polling and is illustrated very nicely in the context of an Xbox survey. It consists of two steps:

  • Multilevel Regression: use a regression model to estimate the probability of agreement in each cell; and
  • Post-stratification: average the estimates in each cell, weighted by how common each cell is in the population.

For example, suppose we want to estimate satisfaction, Y, and the engagement level, E, is the only relevant feature. We indicate selection into the survey by S = 1. The regression model gives us estimates of ℙ(Y | E, S = 1), the probability of being satisfied given the engagement level, among those who responded. If the cells are sufficiently fine, then the selection bias will be negligible, i.e.

ℙ(Y | E, S = 1) ≅ ℙ(Y | E).

However, if the levels are too fine, we could end up with very noisy estimates due to the low sample sizes in each level. Multilevel models offer a good compromise between the two extremes (pooling all levels together versus estimating each level separately) by modelling the standard deviation of the level effects.

When the standard deviation is close to zero, each level's offset is shrunk towards zero, just as if we hadn't included the factor at all. This constraint is loosened as the standard deviation, σ, increases.
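As a sketch for the engagement factor alone, the model structure looks like this (the priors match those used in the brms model below):

Y ~ Bernoulli(logit⁻¹(α + β_e)), for a respondent with engagement level e
α ~ Normal(1, 2)
β_e ~ Normal(0, σ), one offset per engagement level
σ ~ Half-Normal(0, 1)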

With our cell-level estimates in hand, we can then post-stratify engagement away:

ℙ(Y) = ∑ₑ ℙ(Y | E = e, S = 1) × ℙ(E = e)

The post-stratification weights, ℙ(E = e), are simply the population proportions.
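For illustration with made-up numbers: if the cell-level estimates for low, medium, and high engagement are 0.4, 0.5, and 0.8, and those levels make up 60%, 30%, and 10% of the population, then ℙ(Y) = 0.6 × 0.4 + 0.3 × 0.5 + 0.1 × 0.8 = 0.47, noticeably lower than a raw average dominated by highly engaged respondents.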

Causal diagrams

MRP gives us the tools to estimate these effects well. To help reason about these kinds of problems in general, it can be useful to turn to causal diagrams. Bareinboim, Tian, and Pearl even have a paper devoted to precisely the problem of selection bias, which is discussed in more detail by Adam Kelleher.

The general idea is to create a graph where the vertices are the features relevant to your problem and the edges indicate the direction of causality. The simplest graph representing selection bias could look like the following:
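In plain text, with square brackets standing in for the colored S-node:

J ──▶ Y ◀── E ──▶ [S]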

Here, job position J and engagement level E are the causal factors for satisfaction Y. Engagement is the only factor causing selection S into the sample. We colored the S-node to indicate that our data is implicitly conditioned on selection. This is important because engagement is a confounder for selection and outcome in this graph, so conditioning on selection changes the statistical properties of the outcome. In order to estimate the probability of the outcome, we need to block the effect of selection. Causal theory tells us that we do this by conditioning on engagement, i.e. ℙ(Y | E, S = 1) = ℙ(Y | E). This is, of course, consistent with the intuition behind MRP above.

Results

We’ll use the powerful brms package to create our multilevel model. The formula below can be read as “model the (binary) outcome with an overall mean level of satisfaction, with offsets for each position, for each country, and for each engagement level”. The notation `(1 | engagement)` tells brms that we want multilevel effects. The effectiveness of MRP will depend on the suitability of your model.

library(brms)

model <- brms::brm(
  family = bernoulli(),
  formula = outcome ~ 1 +
    (1 | position) +
    (1 | country) +
    (1 | engagement),
  prior = c(
    prior(normal(1, 2), class = "Intercept"),
    prior(normal(0, 1), class = "sd")
  ),
  data = df,
  cores = 4,
  chains = 4,
  warmup = 1500,
  iter = 3000,
  control = list(adapt_delta = 0.99), # get rid of divergences
  seed = 59658,                       # for reproducibility
  file = "mrp"                        # cache the trained model
)

As this is a Bayesian model, we’ll need some priors. The normal(1, 2) prior on the intercept roughly translates to the expectation that the overall satisfaction rate can be pretty much anything, but most likely somewhere around 73%. We use the same half-normal(0, 1) prior on the standard deviations of all the multilevel effects, since a unit increase (on the logit scale) is typically a large effect in this context.
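A quick way to check what the intercept prior implies on the probability scale (a sketch in base R):

# Push samples from the normal(1, 2) intercept prior through the
# inverse-logit to get the implied overall satisfaction rate.
plogis(1)                                    # ≈ 0.73, the prior's centre
quantile(plogis(rnorm(1e5, mean = 1, sd = 2)),
         probs = c(0.05, 0.5, 0.95))         # wide spread across (0, 1)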

Now we take our population weights, add in predictions from our model, then post-stratify.

library(dplyr)

draws <- weights %>%
  tidybayes::add_predicted_draws(
    model,
    allow_new_levels = TRUE, # some cells not observed!
    prediction = "p"         # name the cell-level probability
  ) %>%
  group_by(position, .draw) %>%
  summarise(probability = sum(weight * p)) # post-stratification

It is important to include allow_new_levels = TRUE as there can be cells in your population for which you have no respondents. This is another area where the multilevel model shines: having modelled the dependence between the observed levels of a factor, we can often make reasonable estimates for unseen levels.
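To turn these draws into point estimates with uncertainty intervals per position, we can then summarise them, for example:

draws %>%
  group_by(position) %>%
  summarise(
    estimate = mean(probability),
    lower = quantile(probability, 0.05),
    upper = quantile(probability, 0.95)
  )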

As expected, adjusting the raw responses for selection bias has decreased our estimate for the average satisfaction rate.
