Estimating Counts of Events in Behavioral Product Testing

Choosing the right statistical model can affect the life and well-being of millions or even billions of people

Consumer protection and industry product testing both rely on statistics. When people get food poisoning or are injured in a car crash, statistics help us discover if the problem is widespread and preventable. We also use statistics to test the effects of ideas to reduce those risks. When problems persist, government regulation and legal cases sometimes rely on statistical methods that shape the fate of entire industries and the people who work in them.

With such high stakes, choosing the right statistical model can affect the lives, communities, and well-being of millions or even billions of people.

In 1956, Consumer Reports found that 2/3 of seatbelts they sampled across the auto industry failed under simulated crash conditions. Source: Manion et al (2006) Consumer Reports. Arcadia Publishing.

In this post, I describe seven approaches to modeling counts of incidents, a kind of outcome that is central to digital product testing and consumer protection. I then compare the performance of three of those models on pragmatic and ethical grounds. This is an open question I’m investigating as I develop standard operating procedures for my research, and I’m looking for a statistician to do further work on this question, so I welcome your feedback, ideas, and references. Thanks!

Counting Events in Behavioral Product Testing

In an era of behavioral products like mobile phones, online ads, and social media, I’ve argued that we have an obligation to test products for behavioral and algorithmic consumer protection. At Princeton, I teach a class on the craft and ethics of field experiments, where students learn research methods to test the impacts of digital products and policies.

Tests of behavioral products often involve counts of incidents–discrete occurrences in a place or over time. For example, I’ve tested efforts to reduce how many people engage in online harassment and how many harassing actions they take over time. Widely-read studies by other researchers have estimated differences in rates for eating disorder activity, junk news, and attacks on immigrants, just to name a few.

The stakes for these studies are high: when industry-independent researchers are wrong, we misdirect policymakers and companies from real risks, siphon resources away from ideas that actually make a difference, and can ultimately make things worse for people. When we do get things right and companies disagree with us, we need analyses that stand up to reasonable scrutiny.

I personally encountered the risks of flawed methods when studying online learning several years ago. Researchers had concluded that learners on MIT’s Scratch became less fluent at coding over time. Their peer-reviewed study was celebrated as a pioneering example of big-data analysis of online learning. When I revisited the study, I learned that their statistical methods had led them astray. The population-level answer was the opposite of their initial conclusion–what researchers call a “Type S error.” Rather than un-learn how to code, participants expanded their coding depth and breadth over time.

Requirements for Modeling Counts of Incidents

Why standardize the analysis of experiments with count outcomes? In my academic work and nonprofit CivilServant, I’m scaling industry-independent behavioral research. I hope eventually to support hundreds of studies every year. To achieve that goal, we need reliable, standardized modeling approaches at a level of rigor that matches the high stakes.

Standardizing models has two other benefits. First, since I pre-register statistical models before data collection, I have good reason to choose good models beforehand. Second, standardized models make it easier to pool discoveries, replicate findings, and extend knowledge systematically.

Is a standard approach possible, and if so, how can we choose models? Here are some provisional goals for choosing a model in public-interest behavioral experiments online.

Kinds of data: A workable model needs to handle counts of occurrences. Examples include counts of the number of actions a person takes over time, the number of incidents in a discussion or a region, or the number of people who respond to content.

Many studies look at incidents where the count is often zero but can get very high. Examples of this include the number of edits a Wikipedia contributor makes in the month after their first day, the number of views or likes received by a post, or the number of harassing comments posted to an online discussion.

Kinds of results: Models need to estimate several basic things in a way that is explainable to a public audience:

  • the direction of any effect– does it increase or decrease the outcome on average?
  • magnitude– just how large is the effect? This is especially important for field research on pragmatic risks and benefits. Some real effects are inconsequential, and decision-makers often need to set priorities based on these magnitudes.
  • confidence intervals– what range can we expect any effect to fall within, given the current state of evidence? Confidence intervals can also guide decisions about further research.

Research quality: Model decisions should manage the following priorities:

  • minimize type S errors. How much we worry about false positives and false negatives will depend on the context of a study. Across this research, we want to avoid being confident about results where the true effect is opposite from our finding.
  • support clustered experiment designs. In experiments, we often assign an intervention to a discussion, group, or region–but actually observe individuals or events within those groupings. Models should include ways to adjust our confidence accordingly.
  • adjust results for interference. In complex digital environments, we cannot always reliably assume that the outcomes we measure are independent from each other, an assumption that could lead to inaccurate results. While we can sometimes reduce this problem in experiment design, it’s often important to account for interference in analysis.
  • if possible, support adjustment for imbalances in the random sample.

Illustration of Type S and type M errors from the blog post for a paper by Gelman and Carlin

Statistical Ethics: Models need reliable, precise methods for choosing sample sizes on ethical grounds. With high-risk studies, ethical considerations provide strong reasons to include enough participants to detect a meaningful effect, but not more than necessary.

If the sample size is too small, researchers could expose people to a risk without any chance of learning what works. If the sample size is too large, researchers slow down the pace of learning and also limit the number of people who can benefit from a successful intervention. In some cases, over-conservative sample size calculations could lead researchers to think that potentially-valuable ideas cannot be tested.

Seven Ways to Model Count Outcomes in Online Field Experiments

In conversations with industry researchers and in academic papers, I’ve seen seven common frequentist modeling approaches to analyzing experiments with count outcomes:

Logistic regression on grouped categories of activity: In this approach, researchers convert counts into binary values and fit logistic regression models to those values. For example, instead of estimating the number of actions taken by an account, researchers look at whether the account took any actions at all.

  • Usage note: If your main question is whether something happens at all, this model could work. If you need to know how many times it happens, this model will be useless.
  • Pro: using this model, researchers can estimate the direction, magnitude, and confidence intervals, but only for binary questions
  • Pro: calculating statistical power from observed data is very simple, and there are high quality open source experiment sample size calculators
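
As a rough illustration of how simple the binary case is, here is a minimal Python sketch of the textbook sample size formula for comparing two proportions with a two-sided z-test. It approximates, rather than reproduces, a power calculation for logistic regression with a single treatment indicator. The function name and the 10% and 12% rates are hypothetical, not taken from any study.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_two_proportions(p_control, p_treatment, alpha=0.05, power=0.80):
    """Per-arm sample size for a two-sided z-test comparing two proportions."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = z.inv_cdf(power)           # quantile matching the desired power
    p_bar = (p_control + p_treatment) / 2
    null_sd = sqrt(2 * p_bar * (1 - p_bar))
    alt_sd = sqrt(p_control * (1 - p_control) + p_treatment * (1 - p_treatment))
    numerator = (z_alpha * null_sd + z_beta * alt_sd) ** 2
    return ceil(numerator / (p_control - p_treatment) ** 2)


# Hypothetical rates: 10% of control accounts take any action, 12% in treatment
n_per_arm = sample_size_two_proportions(0.10, 0.12)
print(n_per_arm)
```

Halving the difference between the two rates roughly quadruples the required sample, which is why the choice of a minimum meaningful effect matters so much for the ethics of sample size.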

Log-transformed linear regression: Here, researchers log-transform the dependent variable and conduct an OLS linear regression. This is the most common approach I’ve seen in industry, where A/B testing systems typically support two kinds of models: logistic regression and linear regression.

  • Pro: researchers can estimate the direction, magnitude, and confidence intervals, with adjustments for interference and cluster experiments
  • Pro: linear regression is taught in introductory statistics classes, so a large number of managers, policymakers, and engineers can interpret the results
  • Pro: the software for linear regression is very mature, reliable, and well-documented, making it very attractive for people who wish to create reliable software
  • Con: linear regression relies on assumptions about the distribution and variance of residuals– assumptions that are often violated by count data. If your intervention has a meaningful effect on the rate of incidents, the assumption of homoscedasticity will likely be violated, since the variance of count data tends to grow with the mean, which leaves the group with the higher rate with more error variance than the other. This can cause systematic errors in estimation. Just what are those errors and how bad are they? That’s one of my open questions.
  • Con (maybe?): if power calculations based on these assumptions are inaccurate enough, this approach could fail the ethics requirement by leading to studies with sample sizes that are too small or too large (see below)
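
A minimal sketch of the log-transformed approach in Python, under two assumptions that the description above does not specify: counts are transformed with log(1 + y) so that zeros stay defined, and the standard error allows unequal variances between groups instead of using the pooled classical OLS formula. With a single treatment indicator and no covariates, the OLS coefficient is simply the difference in group means of the transformed outcome. The counts are made up for illustration.

```python
from math import log1p, sqrt
from statistics import mean, variance


def log_ols_treatment_effect(control_counts, treatment_counts):
    """Difference in mean log(1 + count) with an unequal-variance standard error."""
    log_c = [log1p(y) for y in control_counts]
    log_t = [log1p(y) for y in treatment_counts]
    estimate = mean(log_t) - mean(log_c)
    # Each group contributes its own variance, so unequal spread is allowed
    se = sqrt(variance(log_t) / len(log_t) + variance(log_c) / len(log_c))
    return estimate, se


# Hypothetical counts of incidents per discussion
control = [0, 0, 1, 0, 3, 0, 2, 12, 0, 1]
treatment = [0, 1, 2, 0, 5, 1, 3, 15, 0, 2]

estimate, se = log_ols_treatment_effect(control, treatment)
ci_95 = (estimate - 1.96 * se, estimate + 1.96 * se)
print(estimate, se, ci_95)
```

The estimate is on the log(1 + count) scale, so it is not directly a change in the incidence rate, which is part of why explaining magnitudes from this model takes care.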

Weighted Least Squares regression: This variant on OLS regression, as I understand it, allows for the error to be heteroscedastic–i.e. for the variance in the error to be inconsistent between treatment and control groups. David Yokum, Anita Ravishankar, and Alexander Coppock used this method in their evaluation of the effects of police-worn body cameras in Washington DC. I haven’t used this method enough to evaluate its pros and cons.

Poisson regression: This generalized log-linear model estimates incidence rates of occurrences, making the assumption that the mean of the distribution is equal to its variance (source: Long, 1997).

  • Pro: researchers can estimate the direction, magnitude, and confidence intervals, with adjustments for interference and cluster experiments
  • Pro: this model is actually designed for the kind of outcomes being considered
  • Con: many real-world situations violate the assumptions of this model, since the mean of a distribution is often lower than its variance (overdispersion)
  • Con (maybe?): in practice, the standard errors for this model are often deceptively small, leading researchers to overconfidence and potentially a high rate of type S errors. At Google, data scientists manage this by bootstrapping their standard errors. (see below)
  • Con (maybe?): unlike OLS, maximum likelihood estimation sometimes (rightly) fails, which can make it tricky to create reliable, automated power analysis and estimation software. This might not be a disadvantage, since statistical software that silently recovers from errors can mislead.
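
One quick way to see whether the Poisson assumption is plausible is to compare the sample variance to the sample mean. The sketch below uses hypothetical counts and a hypothetical helper name. A ratio far above 1 suggests overdispersion, though this is a rough diagnostic and not a formal test.

```python
from statistics import mean, variance


def dispersion_ratio(counts):
    """Sample variance divided by sample mean; close to 1 for Poisson-like data."""
    return variance(counts) / mean(counts)


# Hypothetical counts: mostly zeros with a few very active accounts
counts = [0, 0, 0, 1, 0, 2, 0, 0, 7, 0, 1, 0, 25, 0, 3]
ratio = dispersion_ratio(counts)
print(ratio)
```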

Negative binomial regression: This generalized log-linear model also estimates incidence rates but does not assume that the mean is equal to the variance (source: Long, 1997).

  • Pro: researchers can estimate the direction, magnitude, and confidence intervals, with adjustments for interference and cluster experiments
  • Pro: with counts of incidents online, the assumptions of this model are, in principle, violated less often than those of Poisson or log-transformed OLS models
  • Con (maybe?): if this model is too conservative and leads to larger samples than would be needed for credible research with other models (such as Poisson), it might be less desirable on ethical grounds. I test this below.
  • Con (maybe?): unlike OLS, maximum likelihood estimation sometimes (rightly) fails (see above)

Wilcoxon-Mann-Whitney U test (also called Wilcoxon rank-sum): With this two-sided nonparametric test, researchers test whether values in one group tend to be greater or lesser than values in the other. As a non-parametric test, it doesn’t make assumptions about the distribution of the variables or the errors. This method is often described as comparing medians rather than means, though strictly that reading holds only when the two distributions have the same shape. I haven’t used this method often enough to evaluate its pros and cons.
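
To make concrete what this test measures, here is a small Python sketch that computes the U statistic by direct pairwise comparison, counting ties as half a win, and rescales it into the share of treatment-control pairs in which the treatment value is larger. The function names and data are hypothetical, and the sketch omits the p-value. In this example both groups have the same median, yet the statistic still registers a difference.

```python
from statistics import median


def mann_whitney_u(treatment, control):
    """U statistic for the treatment group, counting ties as half a win."""
    u = 0.0
    for t in treatment:
        for c in control:
            if t > c:
                u += 1.0
            elif t == c:
                u += 0.5
    return u


def probability_of_superiority(treatment, control):
    """Share of (treatment, control) pairs where treatment is larger (ties = 0.5)."""
    return mann_whitney_u(treatment, control) / (len(treatment) * len(control))


# Hypothetical counts: both medians are 0, but treatment has a heavier tail
control = [0, 0, 0, 1, 1]
treatment = [0, 0, 0, 5, 9]

ps = probability_of_superiority(treatment, control)
print(median(control), median(treatment), ps)
```

The pairwise loop is quadratic in the sample size, so it only suits small examples. Rank-based implementations compute the same statistic far faster.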

Two recently-created but not well-documented R packages claim to satisfy some of the requirements. The clusrank R package provides rank-sum tests for clustered data. The MultNonParam R package provides code for multivariate tests, which in principle could allow for adjusting models for interference and imbalance in samples.

Zero inflated and hurdle models: These models consider special cases where a separate process influences whether something happens at all, in contrast with the process that determines how many incidents occur. I used a zero-inflated model in an experiment on reddit, where a moderator’s decision to remove a top-level post for being off-topic is theoretically independent from the behavior of commenters. Without modeling this independent process, I would have under-estimated the average treatment effect.

These models have a major risk: if the cause of the zeroes (or your measurement of the cause) is entangled with something influenced by the treatment, the analysis could overestimate or underestimate effects due to conditioning results on post-treatment variables.

Measurement Issues when Estimating Effects on Counts of Events

This post focuses primarily on modeling, but I should point out two common measurement issues that also affect the modeling decision:

  • reliability and comparability of the measures. Behavioral data can sometimes have serious problems with reliability. For example, measurement is a hard problem in studies of harassment, where some cases are never reported and many instances of non-harassment are also reported. If an intervention increases reporting rates while reducing the true rates, an experiment could disastrously conclude that an effective intervention had failed.
  • unit of observation. For time-based and geography-based estimates of count data, results can be highly sensitive to differences in the unit of observation. (A) Small adjustments in the observed time period can lead to divergent results (do you observe events over one day, two days, a week, a month, or a year?). (B) Small adjustments in geographical boundaries can also lead to divergent results.

Comparing Models for Experiments with Count Outcomes

How can we choose between these modeling approaches? At CivilServant, we’re still deciding what approach to take, and I’m looking for a statistician to help answer these questions (email or tweet me if you’re interested!).

Update, Jan 2019: I have now published example code on GitHub.

As a start, I did some simulations over the weekend that compared OLS to Poisson and negative binomial models, using observed and simulated data. I wanted to compare three things at different sample sizes:

  • Type S rate: The rate at which the model reported a statistically-significant effect opposite from the true effect
  • False positive rate: The rate at which the model reported a statistically significant effect from data where the distributions were equal
  • Statistical power: the minimum sample size required to have an 80% chance of observing a statistically-significant result, for a given model
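
These three quantities can be computed from a batch of simulated experiments along the following lines. This is a sketch, not the code used for the results below: `SimResult` is a hypothetical container, and the type S rate here divides by all simulations (some authors divide by significant results only).

```python
from dataclasses import dataclass


@dataclass
class SimResult:
    """One simulated experiment (hypothetical container for this sketch)."""
    estimate: float  # estimated treatment effect
    p_value: float   # two-sided p-value for that estimate


def significance_rate(results, alpha=0.05):
    """Share of simulations with a significant result.

    On no-effect data this is the false positive rate;
    on data with a true effect it is the estimated power.
    """
    return sum(r.p_value < alpha for r in results) / len(results)


def type_s_rate(results, true_effect, alpha=0.05):
    """Share of simulations that are significant with the wrong sign."""
    wrong_sign = sum(
        r.p_value < alpha and r.estimate * true_effect < 0 for r in results
    )
    return wrong_sign / len(results)


def min_sample_size_for_power(power_by_n, target=0.80):
    """Smallest simulated sample size whose estimated power meets the target."""
    adequate = [n for n, power in power_by_n.items() if power >= target]
    return min(adequate) if adequate else None


# Hypothetical results from four simulated experiments with a positive true effect
results = [
    SimResult(0.20, 0.01),
    SimResult(-0.10, 0.03),
    SimResult(0.05, 0.40),
    SimResult(0.30, 0.001),
]
print(significance_rate(results), type_s_rate(results, true_effect=0.2))
print(min_sample_size_for_power({10_000: 0.55, 20_000: 0.81, 30_000: 0.93}))
```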

To compare models, I created four simulated datasets, using data from one of the experiments in my PhD dissertation, where one dependent variable was the number of newcomers in an online discussion. I used two different methods to simulate an experiment dataset (full code here):

  • Sampling directly from the experiment data (I use this for the charts below): I generated a per-observation treatment effect from a normal distribution centered around a pre-specified average treatment effect. I then added these treatment effects to observations that were randomly sampled into a treatment group.
  • Generating two negative binomial distributions for treatment and control: I did this by modeling the observed data with a negative binomial model and using the parameters (μ and θ) to simulate control and treatment groups. When drawing data for the treatment group, I added the average treatment effect to μ.
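
The two data-generating methods above can be sketched in Python as follows. Several details are assumptions made for this sketch, not facts about the original code: observations are resampled with replacement, per-observation effects are left unrounded, the negative binomial is drawn as a gamma-Poisson mixture with variance μ + μ²/θ, and the parameter values are invented.

```python
import math
import random
from statistics import mean, variance


def draw_poisson(rng, lam):
    """Knuth's multiplication method; adequate for the modest rates used here."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def draw_negative_binomial(rng, mu, theta):
    """Gamma-Poisson mixture with mean mu and variance mu + mu**2 / theta."""
    lam = rng.gammavariate(theta, mu / theta)  # shape theta, scale mu / theta
    return draw_poisson(rng, lam)


def simulate_from_observed(rng, observed, n_per_arm, ate, effect_sd):
    """Method 1: resample observed outcomes, then add noisy treatment effects."""
    control = rng.choices(observed, k=n_per_arm)
    treatment = [
        y + rng.gauss(ate, effect_sd) for y in rng.choices(observed, k=n_per_arm)
    ]
    return control, treatment


def simulate_from_negative_binomial(rng, n_per_arm, mu, theta, effect_on_mu):
    """Method 2: draw both arms from negative binomials, shifting mu for treatment."""
    control = [draw_negative_binomial(rng, mu, theta) for _ in range(n_per_arm)]
    treatment = [
        draw_negative_binomial(rng, mu + effect_on_mu, theta)
        for _ in range(n_per_arm)
    ]
    return control, treatment


rng = random.Random(42)
# Hypothetical parameters: mu = 2.0 with a 20% increase, theta = 0.5
control, treatment = simulate_from_negative_binomial(
    rng, n_per_arm=5000, mu=2.0, theta=0.5, effect_on_mu=0.4
)
print(mean(control), mean(treatment), variance(control))
```

Because method 1 adds normally distributed effects to counts, the treated outcomes are no longer integers unless they are rounded, which matters for models that expect counts.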

For each method of generating data, I simulated two experiments: one where treatment and control were drawn from the same distribution (no effect), and one where the average treatment effect was a 20% increase in the incidence rate. I considered 40 different sample sizes from 10,000 to 100,000 observations, simulating 50 experiments at each sample size.

Comparing Type S Errors

Models in this simulation rarely resulted in findings that were opposite from the true effect, perhaps because I chose a fairly large magnitude effect for the research question (a 20% increase). While the type S error rates for negative binomial and OLS models were both mostly at zero, the rate for Poisson models was as high as 6% at one point.

Comparing False Positives

Poisson models have a huge false positive rate, with 68% of models showing a statistically-significant effect when the samples are from the same distribution. Across all sample sizes, negative binomial models have a 5.7% false positive rate and OLS has a 5.2% false positive rate, each within the other’s confidence interval.
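
An inflated false positive rate is what theory predicts when a Poisson model meets overdispersed data, because its standard errors assume the variance equals the mean. The Python sketch below illustrates the mechanism without reproducing the numbers above. It draws both arms from the same overdispersed distribution, then tests the log rate ratio twice: once with the Poisson-model standard error, and once with a delta-method standard error built from each arm's observed variance (a stand-in chosen for this sketch, not the bootstrap mentioned earlier). All parameter values are invented.

```python
import math
import random
from statistics import NormalDist, mean, variance


def draw_overdispersed_count(rng, mu, theta):
    """Gamma-Poisson draw with mean mu and variance mu + mu**2 / theta."""
    threshold = math.exp(-rng.gammavariate(theta, mu / theta))
    k, p = 0, rng.random()
    while p > threshold:  # Knuth's multiplication method for the Poisson step
        k += 1
        p *= rng.random()
    return k


def log_rate_ratio_z_scores(control, treatment):
    """z-scores for log(mean_t / mean_c) under two standard error formulas."""
    m_c, m_t = mean(control), mean(treatment)
    log_rr = math.log(m_t / m_c)
    # Poisson-model SE: assumes variance equals the mean in each arm
    se_poisson = math.sqrt(1 / sum(control) + 1 / sum(treatment))
    # Delta-method SE: uses each arm's observed variance instead
    se_empirical = math.sqrt(
        variance(control) / (len(control) * m_c ** 2)
        + variance(treatment) / (len(treatment) * m_t ** 2)
    )
    return log_rr / se_poisson, log_rr / se_empirical


def false_positive_rates(n_sims=300, n_per_arm=200, mu=2.0, theta=0.5,
                         alpha=0.05, seed=1):
    """Share of no-effect simulations flagged significant under each SE."""
    rng = random.Random(seed)
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    hits_poisson = hits_empirical = 0
    for _ in range(n_sims):
        control = [draw_overdispersed_count(rng, mu, theta) for _ in range(n_per_arm)]
        treatment = [draw_overdispersed_count(rng, mu, theta) for _ in range(n_per_arm)]
        z_poisson, z_empirical = log_rate_ratio_z_scores(control, treatment)
        hits_poisson += abs(z_poisson) > critical
        hits_empirical += abs(z_empirical) > critical
    return hits_poisson / n_sims, hits_empirical / n_sims


poisson_fpr, empirical_fpr = false_positive_rates()
print(poisson_fpr, empirical_fpr)
```

With these parameters the variance is five times the mean, so the Poisson standard error is too small by a factor of about the square root of five, and its false positive rate should land far above the nominal 5%.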

Comparing Statistical Power

While OLS and negative binomial models in this simulation have similarly low false positive and type S error rates, they differ substantially in statistical power. Working from observed data, I found that the OLS model required a sample size of 35,000 to reach an 80% chance of observing a statistically-significant result, whereas the negative binomial model required close to half that, around 19,000. At that smaller sample size, the OLS model had an estimated 25 percentage point lower chance of observing a statistically-significant effect than the negative binomial model.

Incidentally, I also got to see the value of calculating statistical power from observed data where possible. When using simulated data, the same code suggested far larger sample sizes (47,000 and 70,000). In both cases, the negative binomial model consistently required a smaller sample size.

Disclaimer: While a negative binomial model was the clear winner here, please don’t adopt negative binomial models based on this narrow simulation. Your needs will likely vary.

The Ethics of Choosing the Right Statistical Model

Over the next six months, I’m hoping to develop a standardized approach to model selection and power analysis that will allow me and the CivilServant nonprofit to navigate those issues wisely.

That’s why I’m looking for an experienced statistician to investigate these questions in depth and work with us to create open source software for model selection, power analysis, and experiment pre-registration. If you or someone you know has the experience and interest to help, please email or tweet at me.

I opened this post by pointing out the high stakes for industry-independent research. When people’s lives, communities, and well-being are on the line, statistical decisions affect what questions can be asked, how quickly society learns, and the rate of risky mistakes. I’m hopeful that with careful methods, we can serve the common good and earn the public’s trust on these issues.