A Framework and Repository for Easy Wrangling of Census Data and Data Visualization

Part 3 of a 4-part series exploring how to analyze customer demographics using SafeGraph Patterns data.

Ryan Fox Squire
SafeGraph
5 min read · Nov 15, 2019


This is Part 3 of a series of blog posts exploring how to analyze customer demographics using SafeGraph Patterns data. In Part 1, we explained the power of SafeGraph Patterns for demographic analysis and showed a simple example of how to turn SafeGraph Patterns into powerful insights. In Part 2: Measuring and Correcting Sampling Bias, we discussed how to control for sampling bias using post-hoc stratified re-weighting.

In Parts 1 and 2, we used a very simple example: analyzing a single demographic dimension (Ethnicity) with only two demographic segments (Hispanic or Latino Origin vs Not Hispanic Or Latino Origin). But we want to analyze many demographic dimensions across many locations and many brands.

How do we easily scale our approach to analyze many demographic dimensions?

This is not a deeply complex problem, and data wrangling is not particularly glamorous. Here in Part 3, we introduce some helper functions in Python and a framework to make this as easy as possible.

Open Census Data makes analyzing Census Data 1000% easier

95% of the challenges of wrangling Census data are already solved simply by using Open Census Data. We’ve extolled the virtues of Open Census Data before. In short, we’ve taken 7500+ demographic attributes tracked by the US Census and American Community Survey, all tied to the highest-resolution geography reported by the census (the census block group), and bundled them into a single convenient download of CSV files.

Census Data is overly granular for many use cases

Some of the most common demographic dimensions SafeGraph customers care about are:

  1. Age (or Sex By Age)
  2. Race
  3. Ethnicity aka Hispanic Or Latino Origin
  4. Education aka Educational Attainment For The Population 25 Years And Over
  5. Income aka Aggregate Household Income In The Past 12 Months (In 2016 Inflation-Adjusted Dollars)

After the help from Open Census Data, one of the key remaining sources of complexity for wrangling Census data is that the Census often reports data at a higher granularity than you need or want. For example, the census reports on 16 different levels of Annual Household Income and 24 different levels of Educational Attainment. At least for my initial analysis, I’d rather reduce this to 3 groups and 5 groups, respectively:

  1. Less than $60,000
  2. $60,000 To $99,999
  3. $100,000 Or More

And

  1. Less than High School Diploma
  2. High School Diploma or GED
  3. Some College and/or Associate’s Degree
  4. Bachelor’s Degree
  5. Master’s, Doctorate, and/or Prof School Degree

This isn’t complicated, but re-aggregating census variables in code takes up some space, and there are tedious details and census table_ids to keep track of.

In the repository safegraph_demo_profile, I have pre-organized the table_id codes for all of the relevant Census measures into aggregations that I think are a useful starting point. For example, the function get_household_income_groups() aggregates the Census data into my 3 levels of Household Income.
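
As a rough illustration of this kind of re-aggregation, here is a minimal sketch in plain Python. It is not the repository's get_household_income_groups(). The group keys, the helper name aggregate_groups, and the column ids of the form B19001e2 through B19001e17 are assumptions made for illustration, and the ids should be checked against the field descriptions shipped with Open Census Data before use.

```python
# Hypothetical mapping from 3 coarse income groups to the 16 fine-grained
# household income columns. The column ids are assumed, not taken from the repo.
INCOME_GROUPS = {
    "under_60k": [f"B19001e{i}" for i in range(2, 12)],
    "60k_to_100k": ["B19001e12", "B19001e13"],
    "100k_plus": [f"B19001e{i}" for i in range(14, 18)],
}


def aggregate_groups(cbg_row, groups):
    """Sum fine-grained counts for one census block group into coarse groups."""
    return {
        label: sum(cbg_row.get(col, 0) for col in cols)
        for label, cols in groups.items()
    }


# Toy row (made-up counts): 10 households in each of the 16 fine-grained levels.
row = {f"B19001e{i}": 10 for i in range(2, 18)}
coarse = aggregate_groups(row, INCOME_GROUPS)
print(coarse)
```

Keeping the mapping in a dictionary, separate from the summing logic, means a new aggregation (for example the 5 education groups) only needs a new dictionary.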

The framework is easy to adapt to different and new aggregations, as you like.

Easy to analyze a single location or 1000s of locations

Often users are interested in comparing different brands or sets of POI to each other. The helper functions also make it easy to simultaneously analyze multiple brands and multiple locations using brands whitelists or safegraph_place_id whitelists.
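
The whitelist idea can be sketched as a simple row filter. This is an illustration only, not the repository's code: the function name filter_patterns and its parameters are hypothetical, the place ids are made up, and the sketch assumes each Patterns row exposes brands and safegraph_place_id fields. When both whitelists are given, this sketch keeps only rows that satisfy both.

```python
def filter_patterns(rows, brands_whitelist=None, place_id_whitelist=None):
    """Keep rows matching the given whitelists; None means 'do not filter'."""
    brands = set(brands_whitelist) if brands_whitelist is not None else None
    place_ids = set(place_id_whitelist) if place_id_whitelist is not None else None
    kept = []
    for row in rows:
        if brands is not None and row["brands"] not in brands:
            continue
        if place_ids is not None and row["safegraph_place_id"] not in place_ids:
            continue
        kept.append(row)
    return kept


# Toy rows with made-up place ids.
rows = [
    {"safegraph_place_id": "sg:001", "brands": "Target"},
    {"safegraph_place_id": "sg:002", "brands": "Walmart"},
    {"safegraph_place_id": "sg:003", "brands": "Costco"},
]
selected = filter_patterns(rows, brands_whitelist=["Target", "Walmart"])
print([r["safegraph_place_id"] for r in selected])
```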

To show off exactly how easy this is, let’s consider the following analysis question:

What are the differences in the demographics of customers to Target vs Walmart?

The top-level helper function is called master_demo_analysis(). It is designed to read Open Census Data and Patterns data interchangeably from (i) your local machine, (ii) a mounted Google Drive (in Google Co-Lab), or (iii) a public Google Drive containing demonstration data hosted by SafeGraph.
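
Wherever the data is read from, the core computation behind a brand comparison can be sketched roughly as follows. This is not master_demo_analysis() itself. The function name brand_demo_profile and the toy inputs are hypothetical, and the sketch assumes that visitor_home_cbgs is a JSON string mapping census block group ids to visitor counts and that visitors mirror the demographic mix of their home block group. It also omits the sampling-bias correction from Part 2.

```python
import json
from collections import defaultdict


def brand_demo_profile(patterns_rows, cbg_group_counts):
    """Estimate each brand's visitor share per demographic group.

    patterns_rows: dicts with 'brands' and 'visitor_home_cbgs' (JSON string
        mapping census block group id -> visitor count).
    cbg_group_counts: block group id -> {group label: population count}.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for row in patterns_rows:
        for cbg, visitors in json.loads(row["visitor_home_cbgs"]).items():
            groups = cbg_group_counts.get(cbg)
            cbg_pop = sum(groups.values()) if groups else 0
            if cbg_pop == 0:
                continue  # no usable census data for this block group
            for label, count in groups.items():
                # Split visitors across groups in proportion to the home CBG mix.
                totals[row["brands"]][label] += visitors * count / cbg_pop
    profiles = {}
    for brand, group_totals in totals.items():
        brand_total = sum(group_totals.values())
        if brand_total > 0:
            profiles[brand] = {k: v / brand_total for k, v in group_totals.items()}
    return profiles


# Toy inputs (made-up block groups and counts, for illustration only).
census = {
    "cbg_A": {"low_income": 50, "high_income": 50},
    "cbg_B": {"low_income": 20, "high_income": 80},
}
patterns = [
    {"brands": "Target", "visitor_home_cbgs": json.dumps({"cbg_A": 10, "cbg_B": 10})},
    {"brands": "Walmart", "visitor_home_cbgs": json.dumps({"cbg_A": 30, "cbg_B": 10})},
]
profiles = brand_demo_profile(patterns, census)
print(profiles)
```

The result is one dictionary of group shares per brand, which is the shape needed for a stacked-bar comparison.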

If you want to see the helper functions fully explained inline with coded examples, see the Teacher Jupyter Co-Lab Notebook: Wrangling Census Data.

If you want to just use the helper functions without pedagogy, use the Workbook for Demographic Analysis Jupyter Co-Lab Notebook.

The above charts were produced from the code below (logos and annotations were added manually in Google Presentation).

Other data views are also possible, such as non-stacked bars or lines, which can make individual comparisons easier.

Together, SafeGraph’s free Open Census Data and open code repository safegraph_demo_profile remove 99.9% of the challenges of wrangling census data.

We’d love your feedback on any and all of this. What can we do to make this even more helpful?

Taking Demographic Profiles to the Next Level

If you are following our series of blog posts closely, you may notice that the above charts include hints of error bars. Where do those come from? Part 4 discusses how to quantify your certainty in these demographic analyses and produce rigorous confidence intervals on your estimates.

Thanks for reading! If you found this useful or interesting please upvote and share with a friend.

You are strongly encouraged to try out a sample of SafeGraph Patterns data for free, no strings attached, at the SafeGraph Data Bar. Use coupon code AnalyzeDemographics for $200 worth of free data!

Contact:

Ryan Fox Squire
SafeGraph

Neuroscientist turned Data Scientist, former DS @Lumosity, @SafeGraph. Owner Lembas Data Science.