“What about bias in the SafeGraph dataset?”

Ryan Fox Squire
Published in SafeGraph · 3 min read · Oct 18, 2019

Quantifying Sampling Bias in SafeGraph Patterns

This post is available as an interactive Google Colaboratory notebook. Click here to see the full post, see the results, and play with the code yourself.

Below, we’ve copied the Introduction and Highlights.

Introduction: “What about bias in your dataset?”

SafeGraph Patterns measures foot-traffic patterns to 3.6 million commercial points of interest from over 45 million mobile devices in the United States, providing a monumental window into American commerce. SafeGraph data users look through this window to ask detailed questions about consumer behavior (e.g., What is a brand’s true customer demographic? How far do people travel to go grocery shopping? What is the impact of opening a national-brand coffee shop on all the other coffee shops in a neighborhood?).

A common type of question we hear from SafeGraph Patterns customers is “What about bias in your dataset?” “Does your panel really represent the true American public?” “How do we know that your panel isn’t oversampling wealthier people?” This is the kind of sophisticated data skepticism we love to hear. A key part of SafeGraph’s vision is to “Seek the Truth About the World…Of course, data can never be 100% true, but we should strive to make it 100% true.”

And although SafeGraph Patterns aggregates data from roughly 10% of mobile devices in the United States (a very impressive sample, if we don’t say so ourselves!), this sample is not a perfectly representative subset of the population. Like all samples, the SafeGraph dataset has sampling error.

SafeGraph’s sampling correlates very highly with the true Census populations. For example: USA counties (Pearson correlation coefficient, r = 0.97), educational attainment (r = 0.99), and household income (r = 0.99).

Above is a scatter plot with the Census population on the x-axis and the SafeGraph sample on the y-axis. Each point is a different Census category of household income. The highly linear relationship indicates that the SafeGraph data samples each of these groups at a proportion very similar to its share of the true population.
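To make the idea concrete, here is a minimal sketch in Python of how such a correlation check can be computed. The counts below are hypothetical, invented purely for illustration; they are not real SafeGraph or Census numbers, and this is not SafeGraph’s actual analysis code.

```python
import numpy as np

# Hypothetical counts (illustration only): true population of each
# household-income bucket vs. devices observed in the sample.
census_pop = np.array([14_000_000, 22_000_000, 30_000_000, 25_000_000, 18_000_000])
sample_devices = np.array([1_350_000, 2_250_000, 3_050_000, 2_480_000, 1_790_000])

# Pearson correlation between the two count vectors; values near 1.0 mean
# the sample captures each bucket roughly in proportion to its population.
r = np.corrcoef(census_pop, sample_devices)[0, 1]
print(f"Pearson r = {r:.3f}")

# A complementary check: compare each bucket's share of the population
# with its share of the sample; near-zero gaps indicate low sampling bias.
pop_share = census_pop / census_pop.sum()
sample_share = sample_devices / sample_devices.sum()
print(np.round(sample_share - pop_share, 4))
```

A high r alone can hide small proportional gaps, which is why the share-by-share comparison is a useful second check.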

Curious what this means, exactly how we got these numbers, and how to test other census sub-populations? See the full notebook.

This is a technical blog post designed for data scientists and analysts to answer the following questions:

  • What is Sampling Bias and why is it a problem?
  • How biased is the SafeGraph Patterns dataset? How do we quantify sampling bias?

We also include a short preview of a future post:

  • How do we correct the sampling bias to answer questions about consumer behavior? (e.g., What is a brand’s true customer demographic? Or how far do people travel to visit my stores?)
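As a preview of what such a correction can look like, here is a minimal post-stratification sketch: each stratum (a hypothetical household-income bucket here) is weighted by its population share divided by its sample share. This is a standard reweighting technique, not necessarily the method SafeGraph uses, and all numbers are invented for illustration.

```python
import numpy as np

# Hypothetical counts (illustration only): true population and sampled
# devices per income bucket.
census_pop = np.array([14_000_000, 22_000_000, 30_000_000, 25_000_000, 18_000_000])
sample_devices = np.array([1_500_000, 2_600_000, 3_200_000, 2_300_000, 1_400_000])

pop_share = census_pop / census_pop.sum()
sample_share = sample_devices / sample_devices.sum()

# Post-stratification weight per bucket: > 1 where a bucket is
# undersampled, < 1 where it is oversampled.
weights = pop_share / sample_share

# After reweighting, the sample shares match the population shares.
reweighted_share = sample_share * weights
print(np.round(weights, 3))
```

In practice these per-stratum weights would be applied to each device’s visit counts before aggregating, so that over- or under-sampled groups contribute in proportion to their true population.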

See the full post on Google Colab

To read the full post, see the results, and play with the code yourself, click here!

Want to see a different question answered with SafeGraph data?

Please send your ideas, feedback, bug discoveries, and suggestions to datastories@safegraph.com or leave a comment below.



Neuroscientist turned Data Scientist, former DS @Lumosity, @SafeGraph. Owner Lembas Data Science.