A practical approach to evaluating client profiles from population survey data

Surveys with nonresponse. Horvitz-Thompson estimator. Surveys of subpopulations. Nonresponse bias. Selection bias.

Alexey Bogatyrev
3 min read · May 5, 2022


TLDR

In this article I describe an approach I used to estimate client characteristics from a customer survey. The problem is to estimate characteristics of the entire customer population from a random subsample obtained by surveying customers. I am looking for a simple, practical way to handle nonresponse and selection bias. I encourage everyone to discuss this approach and share thoughts on its validity and correctness.

Motivation

The problem with clients is that in most cases we don’t really know who they are, what they want, how they behave, or how satisfied they are. Understanding its clients is critical for a business, but profiling every customer can be hard and expensive. The good news is that we can ask some customers to tell us about themselves and then extrapolate the survey results to the entire population. There are two things to consider:

  1. Different clients have different willingness to tell us about themselves. This is known as nonresponse bias.
  2. Since we only observe a subpopulation, we need to know how to extrapolate the results correctly to the entire population and avoid bias towards the selected individuals. This is known as selection bias.

Solution

In general, we are trying to estimate some population characteristic y from a subsample of the population. The subsample is acquired through a customer survey: a number of clients are randomly selected from the entire customer population and asked to provide some information about themselves. Their answers constitute the subsample. The goal is to build an unbiased estimate of some statistic of y from the subsample.

The probabilities of being selected for the survey and of responding differ from customer to customer. For instance, a cosmetics online store plausibly has different numbers of male and female customers, so the likelihood of selection differs. Likewise, males and females may have different response rates.

Although the probabilities differ, we assume that the population can be divided into groups within which they are equal. For instance, one can assume that people of the same gender share the same probability of response. Let’s call these population groups classes. Of course, in real-life scenarios these groups have a more sophisticated structure, and it is worth discussing with the business what they should be.

It is also important to check the response distributions within groups to make sure that the above assumption holds. For instance, one can randomly divide a group into two subgroups and test whether their response probabilities are equal. Repeating this several times guards against a test result that arises by chance.
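The split-and-test check described above could be sketched roughly as follows. This is only an illustration under assumptions of mine: the choice of a pooled two-proportion z-test, the function names `two_proportion_p_value` and `split_half_p_values`, and the input format (one 0/1 flag per surveyed customer of a single class, 1 meaning "responded") are hypothetical and not taken from the original analysis.

```python
import math
import random
from statistics import NormalDist


def two_proportion_p_value(successes_a, n_a, successes_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Everyone (or no one) responded: no evidence of a difference.
        return 1.0
    z = (successes_a / n_a - successes_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def split_half_p_values(responded, n_splits=20, seed=0):
    """Randomly halve one class n_splits times and test equal response rates.

    responded: iterable of 0/1 flags, one per surveyed customer of the class
    (hypothetical input format, assumed for this sketch).
    """
    flags = list(responded)
    if len(flags) < 2:
        raise ValueError("need at least two customers to split a class")
    rng = random.Random(seed)
    half = len(flags) // 2
    p_values = []
    for _ in range(n_splits):
        rng.shuffle(flags)
        a, b = flags[:half], flags[half:]
        p_values.append(two_proportion_p_value(sum(a), len(a), sum(b), len(b)))
    return p_values
```

A class would look suspicious if small p-values showed up much more often than the chosen significance level suggests. The same `two_proportion_p_value` helper can also compare subgroups formed by another attribute (for example, an age band) instead of a random split, which probes the equal-probability assumption more directly.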

The problem of nonresponse is well known in the literature. Here I use the Horvitz-Thompson estimator approach [1].
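In code, the idea is to weight each respondent by the inverse of the probability of ending up in the sample, that is, of being selected and responding. Below is a minimal sketch, assuming that the class sizes in the full customer population are known (for example, from the customer database) and that this probability is estimated per class as respondents divided by class size. The function name `ht_mean_by_class` and the input format are hypothetical choices for illustration.

```python
from collections import Counter


def ht_mean_by_class(respondents, population_counts):
    """Horvitz-Thompson style estimate of the population mean of y.

    respondents: list of (class_label, y) pairs, one per customer who answered.
    population_counts: dict mapping class_label to the number of customers of
    that class in the whole population (assumed known).
    """
    answered = Counter(label for label, _ in respondents)
    weighted_total = 0.0
    for label, y in respondents:
        # Estimated probability that a customer of this class is selected
        # AND responds: respondents in the class / class size.
        inclusion_prob = answered[label] / population_counts[label]
        weighted_total += y / inclusion_prob
    # Divide the estimated population total by the population size.
    return weighted_total / sum(population_counts.values())


# Two classes of 50 customers each; class "f" answers far less often.
sample = [("f", 1.0)] * 2 + [("m", 0.0)] * 8
estimate = ht_mean_by_class(sample, {"f": 50, "m": 50})
```

Here the unweighted mean of the answers is 0.2, while the weighted estimate is about 0.5, because the two "f" respondents stand in for as many customers as the eight "m" respondents. A class with no respondents contributes nothing to the total in this sketch, so such classes would need to be merged with similar ones first.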

Formally:
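One common way to write the estimator under the class assumption is sketched below. The notation is mine and is introduced only for illustration: N is the population size, N_c the size of class c, r the set of respondents, r_c the number of respondents in class c, \pi_i the probability that customer i is selected, and p_i the probability that customer i responds once selected.

```latex
% Horvitz-Thompson estimate of the population total and mean of y
\hat{Y}_{HT} = \sum_{i \in r} \frac{y_i}{\pi_i \, p_i},
\qquad
\hat{\bar{y}} = \frac{\hat{Y}_{HT}}{N}

% Class assumption: \pi_i p_i = \theta_{c(i)} for every customer i of class c(i),
% with \theta_c estimated by \hat{\theta}_c = r_c / N_c (requires r_c > 0)
\hat{\bar{y}} = \frac{1}{N} \sum_{c=1}^{C} \frac{N_c}{r_c} \sum_{i \in r,\; c(i) = c} y_i
```

In words, each respondent is weighted by the inverse of the estimated probability of being selected and responding, so under-represented classes count for more. With the class-level estimate of that probability, the result is the class-size-weighted average of the per-class respondent means.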

Conclusion

An advantage of this solution is its simplicity. Still, I would like to hear other opinions on its pros and cons, and to learn what other approaches people use for this kind of problem. Let’s discuss this in the comments.

References

[1] The propensity score and estimation in nonrandom surveys: an overview
