Do Your Civic Duty and Help Out a Data Scientist

Erica Mason
Published in Doma
4 min read · Mar 30, 2020

Census Day!

April 1, 2020 is Census Day in the United States, when the constitutional mandate to count the country’s population every ten years is fulfilled. The estimated cost of $15.6 billion is a wise investment: local governments, businesses, and researchers use census data to make informed decisions about how to serve the population. Further, the results are used to apportion each state’s seats in the House of Representatives and to draw the lines of the congressional districts.

Even in the face of the coronavirus pandemic, the Census Bureau is plowing ahead with collecting responses, as this is exactly the information the government needs to prepare adequately for disasters. But if all those reasons don’t get you super excited to fill out your form, I’ll give you one more: to help out your friendly neighborhood data scientist!

Three Census Data Use Cases for Data Scientists

  1. Market Penetration Analysis
    Every startup investor deck includes a simple TAM (Total Addressable Market) estimate, but census data lets an analyst go further and measure how deeply the business has penetrated the market at a more granular level. For example, imagine your business sells widgets targeted at women over 55 with a household income of over $60,000, and you have received 400 orders from unique customers in each of Miami-Dade County, Florida and Fillmore County, Nebraska. Dividing those orders by the census count of people who match that profile tells you that you are 0.1% penetrated in Miami-Dade and 31% in Fillmore. Depending on your marketing strategy, this may tell you that you should stop spending in Fillmore and start spending more in Miami-Dade, or that the marketing creative you’ve been using isn’t resonating in Florida.
  2. Geographic Lead Generation
    If you really want to get fancy, you can take your analysis even further to find ‘geographic leads’ for your business. Continuing from the previous example, you’ve just found out that your product is really rocking it in Fillmore County, Nebraska, so the question becomes: are there similar regions of the country where your product would do really well? Using census data, you can build a nearest neighbors model to find counties that are demographically similar to Fillmore County and rank these leads for your marketing/sales team. Depending on your business, you may want to run this analysis at the DMA (Designated Market Area) level rather than the county level because DMAs are more actionable from a media buying standpoint.
  3. Features in Machine Learning (ML) Models
    A hot topic in the real estate tech world over the last few years is AVMs, or Automated Valuation Models: ML models used to predict property values, removing the need to wait for manual appraisals. The best-known AVM is Zillow’s Zestimate product. While these models rely heavily on property-level data (such as square footage, number of bedrooms, and parking availability), we all know that the most important thing about real estate is Location, Location, Location! And how better to encode locational features into your model than with census data that describes the neighborhood, such as the percentage of owner-occupied homes, the average education level, and the population density? This blog post does a great job of explaining why this approach is much better than simply relying on ZIP codes in your models.
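
The first two use cases can be sketched in a few lines of plain Python. This is a minimal illustration, not a production model: the county names, customer counts, population counts, and demographic features below are all invented, and the "nearest neighbors model" is reduced to a z-scored Euclidean distance ranking.

```python
from math import sqrt
from statistics import mean, pstdev

# Hypothetical data: every name and number here is invented for illustration.
# "target_pop" stands for the census count of people who match your customer
# profile in that county; the remaining fields are demographic features.
COUNTIES = {
    "Metro A":  {"customers": 400, "target_pop": 400_000,
                 "median_age": 39.0, "median_income": 52_000, "pct_owner": 52.0},
    "Rural B":  {"customers": 400, "target_pop": 1_300,
                 "median_age": 46.0, "median_income": 55_000, "pct_owner": 76.0},
    "Rural C":  {"customers": 0, "target_pop": 1_500,
                 "median_age": 45.0, "median_income": 54_000, "pct_owner": 75.0},
    "Suburb D": {"customers": 10, "target_pop": 60_000,
                 "median_age": 37.0, "median_income": 85_000, "pct_owner": 68.0},
}
FEATURES = ["median_age", "median_income", "pct_owner"]


def penetration(county):
    """Share of the addressable population that has already ordered."""
    return county["customers"] / county["target_pop"]


def standardize(counties, features):
    """Z-score each feature so income (dollars) does not swamp age (years)."""
    scaled = {name: [] for name in counties}
    for feature in features:
        values = [c[feature] for c in counties.values()]
        mu, sigma = mean(values), pstdev(values)
        for name, c in counties.items():
            scaled[name].append((c[feature] - mu) / sigma if sigma else 0.0)
    return scaled


def similar_counties(seed, counties, features):
    """Rank every other county by Euclidean distance to the seed county."""
    scaled = standardize(counties, features)

    def distance(name):
        return sqrt(sum((a - b) ** 2
                        for a, b in zip(scaled[seed], scaled[name])))

    return sorted((n for n in counties if n != seed), key=distance)


# Use case 1: find the county where you are most deeply penetrated.
best = max(COUNTIES, key=lambda name: penetration(COUNTIES[name]))
# Use case 2: rank the other counties as geographic leads.
leads = similar_counties(best, COUNTIES, FEATURES)
print(best, f"{penetration(COUNTIES[best]):.1%}")
print(leads)
```

Standardizing before measuring distance matters because raw income differences are in the thousands while age differences are in single digits. With real data you would likely use many more features and a library implementation of nearest neighbors.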

With Great Power Comes Great Responsibility

This past Christmas, my mother gave me a book called Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. (I haven’t read it yet, but something tells me I’ll have some extra time to stay in and read in the upcoming weeks…) The title is pretty self-explanatory, and beyond this book there have been many high-profile cases of biased ML models, including the recent Apple Card sexism scandal. With the wealth of demographic information the census provides, it is important to check for systematic biases in our models and also to be mindful that the underlying data is not perfect. The Census Bureau itself warns about nonresponse bias.
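
One simple way to start checking for systematic bias is to compare a model’s average signed error across demographic groups. The sketch below assumes you already have predictions and actual values tagged with a group label; the group names and dollar figures are made up, and this is only a first sanity check, not a full fairness audit.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (group label, actual value, model prediction).
# Group names and values are invented for illustration.
records = [
    ("group_a", 200_000, 205_000),
    ("group_a", 300_000, 295_000),
    ("group_b", 150_000, 135_000),
    ("group_b", 180_000, 170_000),
]


def mean_signed_error_by_group(rows):
    """Average of (prediction - actual) per group.

    A value far from zero for one group but not another suggests the
    model systematically over- or under-predicts for that group.
    """
    errors = defaultdict(list)
    for group, actual, predicted in rows:
        errors[group].append(predicted - actual)
    return {group: mean(values) for group, values in errors.items()}


bias = mean_signed_error_by_group(records)
print(bias)
```

In this toy data the errors for group_a cancel out, while group_b is consistently under-predicted, which is the kind of pattern worth investigating before a model ships.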

Logistics

Instructions for completing the 2020 Census can be found at https://my2020census.gov/. As for accessing the data for use in your analyses and models, there are multiple options:

  • The CensusData Python package and accompanying documentation allow you to pull census data from 2010 as well as the American Community Survey (ACS) yearly updates without having to deal directly with the census API.
  • Similarly, the tidycensus R package returns census data as tidyverse-ready dataframes.
  • Of course, you can also integrate directly from the source using the API documentation provided by the Census Bureau.
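
For the direct API route, here is a rough standard-library sketch. It assumes the Census Data API lives at https://api.census.gov/data, that "dec/sf1" is the 2010 decennial summary file, and that P001001 is the total population variable; please confirm all three against the Bureau’s API documentation before relying on them. The helper names (build_url, rows_to_dicts, fetch) are my own, not part of any library.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL for the Census Data API; verify against the official docs.
BASE_URL = "https://api.census.gov/data"


def build_url(year, dataset, variables, geography, within=None, key=None):
    """Build a query URL from the API's get/for/in/key parameters."""
    params = {"get": ",".join(variables), "for": geography}
    if within:
        params["in"] = within
    if key:
        params["key"] = key  # API key, if you have registered for one
    # Keep ":", "*" and "," unescaped so the URL stays readable.
    return f"{BASE_URL}/{year}/{dataset}?{urlencode(params, safe=':*,')}"


def rows_to_dicts(payload):
    """The API returns a list of lists whose first row is the header."""
    header, *rows = payload
    return [dict(zip(header, row)) for row in rows]


def fetch(url):
    """Download and parse one query (requires network access)."""
    with urlopen(url) as response:
        return rows_to_dicts(json.load(response))


# Total population for every county in Nebraska (state FIPS code 31).
url = build_url(2010, "dec/sf1", ["NAME", "P001001"],
                "county:*", within="state:31")
print(url)
```

Calling fetch(url) should then give you one dictionary per county, with every value returned as a string, so remember to cast population counts to integers before doing arithmetic on them.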
