Small Data Still Matters

John Lewis
Published in Upstart Tech
Aug 10, 2020

Small Data Will Always Matter

Small data matters. It will always matter. No matter how awesome it is to implement machine learning models that rely on (dare I say?) “big” data, distributed processing, and hours of training, small data still matters. It always will.

Why will it always matter? Simply put, important business questions sometimes arise quickly, don’t come with a ton of data, and must be answered yesterday. We often must assemble data quickly to provide at least a partial answer to the question at hand. This process typically results in ‘small’ data as measured by the ability to easily do the analysis on a standard laptop.

Think Hierarchically

No matter how many fancy models you learn about, to answer the types of business questions I describe above, you need to understand the problem you are trying to solve and how the data you have relates to it. Simply put, you need to think about the data generating process and how it relates to your question. This will often mean you must think hierarchically.

What do I mean by “think hierarchically”? First, thinking hierarchically means you recognize that data often come in groups that share similarities. For example, geographical regions often provide natural groupings. Here, we expect some group-level differences but also some similarities, since the same phenomenon is being studied in each group. Additionally, there can be nested groupings, like cities within states. In this example, cities within the same state are expected to be more similar to each other than to cities in different states. The right grouping for your particular analysis will often be obvious if you understand the data-generating mechanism and the variables involved.

Second, thinking hierarchically means you recognize the varying levels of information you have on the various groups. If our groupings are by cities and states, we’d expect to have many data points from large cities (i.e. lots of information) and fewer for small cities (i.e. little information).

Analyzing Hierarchical Data

Once you recognize a hierarchical structure in your data, it often makes sense to account for it in your analysis. For example, instead of city-level means and standard deviations, you could compute state-level means and standard deviations. While this increases the sample size behind each estimate, it also coarsens your analysis: every city within a state is treated identically, which is not quite right. We’d like our analysis to be flexible enough to detect meaningful differences between the groups.
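
For concreteness, here is a minimal sketch of those two extremes in R, assuming a hypothetical data frame `df` with columns `value`, `city`, and `state` (names are mine, not from the original post):

```r
# Hypothetical data frame `df` with columns: value, city, state.

# Unpooled: a separate mean and sd for every city (noisy for small cities).
city_summary <- aggregate(value ~ city + state, data = df,
                          FUN = function(x) c(mean = mean(x), sd = sd(x)))

# Completely pooled within states: every city in a state shares one estimate.
state_summary <- aggregate(value ~ state, data = df,
                           FUN = function(x) c(mean = mean(x), sd = sd(x)))
```

Hierarchical models sit between these two extremes.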

Given this, the biggest level-up to your modeling skills is to know how to specify and fit hierarchical models.

Main Benefit of Hierarchical Modeling

The main benefit of hierarchical modeling is often stated as allowing individual group estimates to borrow (or pool) information from similar groups. This is most beneficial for groups with small sample sizes, because their sample means are highly variable. Instead of using this (‘unpooled’) sample mean as the group estimate, the hierarchical model uses a ‘pooled’ mean. Without going into too much statistical detail, you can think of this as a weighted mean across similar groups. For small groups, the other similar groups get more weight; for large groups, they get less, and the estimate converges to the group’s own sample mean as its sample size grows.
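
To make the weighted-mean intuition concrete, here is a minimal sketch of the classic partial-pooling formula for a single group mean, assuming a normal model with known within-group variance `sigma2` and between-group variance `tau2` (in a real model both are estimated, and the numbers below are made up for illustration):

```r
# Partial pooling of one group's mean (a sketch, not a full model fit).
# y_bar_j : the group's own sample mean
# n_j     : the group's sample size
# mu      : the overall (grand) mean across groups
# sigma2  : within-group variance; tau2 : between-group variance
partial_pool <- function(y_bar_j, n_j, mu, sigma2, tau2) {
  w <- (n_j / sigma2) / (n_j / sigma2 + 1 / tau2)  # weight on the group's own mean
  w * y_bar_j + (1 - w) * mu                       # shrinks toward mu when n_j is small
}

partial_pool(y_bar_j = 10, n_j = 3,   mu = 8, sigma2 = 4, tau2 = 1)  # pulled toward 8
partial_pool(y_bar_j = 10, n_j = 200, mu = 8, sigma2 = 4, tau2 = 1)  # stays close to 10
```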

Let’s try to demonstrate this with some simulated data. I generated data from a bunch of different groups with sizes ranging from 3 to 200. The data were generated so that certain subgroups would have similar values. The figure displays the difference between the hierarchical estimates and sample means versus group size. For small groups the two estimates can differ a lot, but for large sample sizes they get closer together.
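
My exact simulation isn’t reproduced here, but a comparable setup is easy to build. The sketch below makes some simplifying assumptions of its own (a plain normal model, group sizes drawn between 3 and 200, and the lme4 package): it generates grouped data, fits a varying-intercept model, and compares the partially pooled group estimates to the raw sample means.

```r
library(lme4)

set.seed(42)
n_groups   <- 40
true_means <- rnorm(n_groups, mean = 0, sd = 1)        # group-level differences
sizes      <- sample(3:200, n_groups, replace = TRUE)  # very uneven group sizes

df <- data.frame(
  group = factor(rep(seq_len(n_groups), times = sizes)),
  y     = rnorm(sum(sizes), mean = rep(true_means, times = sizes), sd = 2)
)

# Unpooled estimates: each group's own sample mean.
sample_means <- tapply(df$y, df$group, mean)

# Hierarchical (partially pooled) estimates from a varying-intercept model.
fit          <- lmer(y ~ 1 + (1 | group), data = df)
pooled_means <- coef(fit)$group[["(Intercept)"]]

# Small groups shrink the most; large groups stay near their sample means.
plot(sizes, pooled_means - sample_means,
     xlab = "group size", ylab = "hierarchical estimate - sample mean")
```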

The two estimates are clearly different, but different doesn’t mean better. The advantage of the hierarchical estimates is their smaller variance, which makes sense because they use more information from similar groups. We show this in the following graphic, which plots the standard errors of the two estimates against each other. The size of each point represents the group size, and the line is where the standard errors are equal. The cluster of points in the upper right comes from small groups; their standard errors are much larger, especially for the unpooled estimates. The cluster of points in the lower left, bunched around the line, comes from larger groups, where the standard errors of the two estimates are roughly the same.
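
One way to reproduce that kind of comparison, continuing the hypothetical lme4 fit sketched above and using the conditional standard deviations that lme4 attaches to the random effects as an approximation to the hierarchical standard errors:

```r
# Unpooled standard errors: sd / sqrt(n) within each group.
se_unpooled <- tapply(df$y, df$group, sd) / sqrt(sizes)

# Approximate standard errors of the hierarchical group estimates:
# the conditional standard deviations stored with the random effects.
re        <- ranef(fit, condVar = TRUE)$group
se_pooled <- sqrt(as.numeric(attr(re, "postVar")))

plot(se_unpooled, se_pooled, cex = log(sizes),
     xlab = "unpooled SE", ylab = "hierarchical SE")
abline(0, 1)  # points below the line: the hierarchical estimate is less variable
```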

No Free Lunch

It’s not all good, right? Well, no. It’s not. I can think of two main trade-offs. First, you have to specify a good model. This takes some practice, but if you understand the data-generating process, it is often doable. Second, the pooled estimates may lose some information at the level of each group. We partially combine the groups to reduce the variance of the estimates, at the cost of not getting completely independent estimates. When the groups really are similar, reducing variance is usually worth it, and there are countless examples of hierarchical models delivering better predictive performance in that setting.

Further Reading

Hierarchical modeling is not new; there are many good texts and articles on the subject. I will undoubtedly leave some good ones out, but a few of my favorites are below, ranked roughly in order of accessibility.

  1. Gelman, A. (2006). Multilevel (hierarchical) modeling: what it can and cannot do. Technometrics, 48(3), 432–435.
  2. Kruschke, J. (2014). Doing Bayesian data analysis: A tutorial with R, JAGS, and Stan. Academic Press.
  3. Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models (Vol. 1). New York, NY, USA: Cambridge University Press.
  4. Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian data analysis. CRC press.

Software

There are a lot of good software packages for fitting these models with either frequentist or Bayesian methods. In R, good options include lme4 (for linear models) and nlme (for linear and non-linear models). Many Bayesian versions can be fit using rstan, R’s interface to Stan. Stan is a highly flexible statistical modeling language that lets you specify models and generate samples from their posterior distributions using a variety of Markov chain Monte Carlo (MCMC) techniques. Stan also interfaces with many other languages, including Python, MATLAB, Julia, Stata, Mathematica, and Scala (last time I checked). Its flexibility can make it somewhat intimidating to learn, but the documentation is great. In addition, R has rstanarm and brms, which also interface with Stan but use familiar R-like modeling syntax to fit many common models.
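
As a quick illustration of how similar the syntax is, the same varying-intercept model from the sketch above could be specified in lme4 and in brms like this (a hypothetical example, not code from the post):

```r
# Frequentist fit with lme4.
library(lme4)
fit_freq <- lmer(y ~ 1 + (1 | group), data = df)

# Bayesian fit with brms (compiles a Stan model under the hood and samples via MCMC).
library(brms)
fit_bayes <- brm(y ~ 1 + (1 | group), data = df)
```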
