Got zip code data? Prep it for analytics.

Using fine-grained U.S. Census data and Jupyter Notebooks to better understand your customers

Published in Center for Open Source Data and AI Technologies

6 min read · Aug 15, 2017

Who are those people lurking behind the statistics in your data? Whether you are looking at retail shoppers, insurance policy holders, banking customers or political constituents, the more you can flesh out the lives of the people behind the numbers, the better you will do at deriving useful insights into how to serve them. This is why demographic market segmentation is such an interesting industry.

Block party

Market segmentation is the process of dividing a target population into groups, or segments, based on some common characteristics. The strategies for creating these groups range from the simple — age, sex, race, income — to the sophisticated — “Uptown Individuals” or “Cozy Country Living.” Products such as Tapestry Segmentation from Esri or PRIZM from Claritas/Nielsen live at the sophisticated end, and carry a price tag to match. If you are not ready to take the plunge, however, you can do a lot on your own with U.S. Census data, some basic analytics skills, and a Jupyter notebook.

The U.S. Census is a treasure trove of free demographic data, as I’ve written about before. You can find detailed statistics on age, income, race, housing, and occupation from the national level down to the block group (a very small area consisting of about 2,000 people in most places). That’s just the tip of the iceberg. There are many more interesting statistics you can tease out of Census data with a little analytics skill.

“Block groups are statistical divisions of census tracts and generally contain between 600 and 3,000 people.” Source: U.S. Census Bureau.

The core of the problem

Some cities are denser than others. But where are those dense cores so you can finely target them?

One statistic I find really interesting is how urban a person is. Do they live in the dense city, the suburbs, or out in the rural countryside? Depending on your question, location can be a more useful fact to know than age or income or family size.

You might think that it’s pretty easy to figure out what places are city, suburban, and rural, but it turns out to be a bit of a challenge. For example, take the map of eastern Massachusetts below. The City of Boston is shaded in gray in the center of the picture. That’s a pretty poor representation of urban, as many towns around Boston are just as urban as the city (Cambridge, Somerville, and others).

The Census has a place type called “Urban Areas,” which for the Boston area is the red line you see in the picture. It stretches waaaaaay out from the city to even go into New Hampshire to the north, and almost to Cape Cod to the south. This may make some sense when you look at the country as a whole, comparing Massachusetts to Minnesota for example, but it does a poor job of capturing true urban-ness. The dashed gray line is an even less useful designation from the Census called “Metropolitan Statistical Areas.”

Depending on your definition, “urban” can mean a lot of different kinds of places. For instance, Boston’s urban core is mostly walkable; however, if you’re in Phoenix, you’ll need a car.

Now look at the map below derived from the data I’ve prepared. Instead of using the most detailed level of Census data — block groups — I use zip codes because you’ll always have a zip code for your customers.

Data geek note: these are actually “zip code tabulation areas” (ZCTAs), not true zip codes. ZCTAs are a zip code-esque structure the Census created to make zip code data better for mapping and spatial analysis.

It shows most of Boston, and some neighboring zips, in red — true urban areas, places where people live primarily in multi-family housing, condos, or apartments. Toward the south, you can also see little red spots in Providence, RI; New Bedford, MA; and Fall River, MA.

The orange color depicts areas called “Early Suburban.” Here you’ll find people living primarily in single-family homes, with lot sizes usually around 1/4 to 1/2 acre. Then, in light orange, you’ll see areas that are closer to rural, with single-family homes on lots of 1 acre or larger. Finally, in a light tan color, is everything else: truly rural areas consisting primarily of 1+ acre residential lots, farms, and forests.

Picking the core urban areas out of a wider, more suburban metro area.

Methodology: before and after the car

The methodology used to build this model comes from an academic article, “From Jurisdictional to Functional Analysis of Urban Cores & Suburbs” in New Geography. From that work, my notebook uses the following classifications for urban-ness:

Urban (pre-auto urban core): density > 2,900 people per sq. km
Auto suburban, early: median house built 1946 to 1979, density > 100 and < 2,900 people per sq. km
Auto suburban, later: median house built after 1979, density > 100 and < 2,900 people per sq. km
Auto exurban: all others
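To make the rules concrete, here is a minimal sketch of them as a plain Python function. It is an illustration, not the notebook's actual code: the function name, argument names, and sample values are my own assumptions. It follows the thresholds literally, so an area of middling density whose median house predates 1946 falls into "all others."

```python
def classify_urbanity(density_per_sq_km, median_year_built):
    """Classify one area using the four rules listed above.

    Hypothetical helper for illustration; names are assumptions.
    """
    if density_per_sq_km > 2900:
        return "Urban"
    if 100 < density_per_sq_km < 2900:
        if 1946 <= median_year_built <= 1979:
            return "Auto suburban, early"
        if median_year_built > 1979:
            return "Auto suburban, later"
    # Everything that matched no rule above, including low-density areas
    # and mid-density areas with a pre-1946 median house.
    return "Auto exurban"


# Made-up inputs, one per category.
print(classify_urbanity(6500, 1939))  # dense area
print(classify_urbanity(1200, 1962))  # mid density, postwar housing
print(classify_urbanity(800, 1995))   # mid density, newer housing
print(classify_urbanity(40, 1971))    # low density
```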

From the requirements above, the key data needed to reproduce the model are population and the median age-of-home in an area, plus each area's size so that population can be turned into density. We can easily get the population and housing-age data from the U.S. Census American Community Survey. The instructions for doing this yourself, if you are so inclined, are in the Jupyter notebook referenced below.

Show me the data

If you are less interested in the details of the analysis, and just want the data to use in your own work, we’ve provided a public download of the CSV file in this GitHub repo. If you want to see the details of how it was built, read on.
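If you just want to attach the classification to your own records, a left join on zip code is probably all you need. The sketch below assumes column names (`zcta`, `urbanity`, `zip`) and uses invented placeholder rows in place of the real download, so check the actual CSV header before copying it. The one detail worth keeping is reading zip codes as strings: New England zips start with a zero, and a numeric column would silently drop it.

```python
import io

import pandas as pd

# Stand-in for the downloaded CSV. Column names and rows are invented
# placeholders, not values from the real file.
urbanity_file = io.StringIO(
    'zcta,urbanity\n'
    '00001,Urban\n'
    '00002,"Auto suburban, early"\n'
)

# dtype=str keeps leading zeros ("00001" stays "00001", not 1).
urbanity = pd.read_csv(urbanity_file, dtype={"zcta": str})

# Hypothetical customer table keyed by zip code.
customers = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "zip": ["00001", "00002", "99999"],
})

# A left join keeps every customer, even when the zip has no match.
merged = customers.merge(urbanity, left_on="zip", right_on="zcta", how="left")

print(merged[["customer_id", "zip", "urbanity"]])
print(merged["urbanity"].value_counts(dropna=False))
```

Customers whose zip has no matching row come back with a missing `urbanity` value, which is a quick way to spot PO-box-only zips or typos in your own data.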

Oh, the urbanity!

I analyzed the data using Python in a Jupyter Notebook called urbanity.ipynb in the same GitHub repo. It uses the Pandas read_csv function to extract statistics on zip code areas, population counts, and median housing age from three larger data files. In the notebook, I then join those statistics into a single DataFrame and calculate population density per square kilometer.
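The load-and-join step described above might look roughly like this. It is a sketch under stated assumptions rather than a copy of urbanity.ipynb: the three inline tables stand in for the larger source files, every column name is hypothetical, and I assume land area arrives in square meters.

```python
import io

import pandas as pd

# Inline stand-ins for the three larger files; all values are made up.
areas_file = io.StringIO(
    "zcta,land_area_sq_m,water_area_sq_m\n"
    "00001,2000000,0\n"
    "00002,50000000,100000\n"
)
population_file = io.StringIO(
    "zcta,population,margin_of_error\n"
    "00001,6000,150\n"
    "00002,5000,200\n"
)
housing_file = io.StringIO(
    "zcta,median_year_built\n"
    "00001,1939\n"
    "00002,1985\n"
)

# usecols pulls only the needed columns out of each wider file;
# dtype=str preserves leading zeros in the zip code key.
key = {"zcta": str}
areas = pd.read_csv(areas_file, usecols=["zcta", "land_area_sq_m"], dtype=key)
population = pd.read_csv(population_file, usecols=["zcta", "population"], dtype=key)
housing = pd.read_csv(housing_file, dtype=key)

# Join the three tables into a single DataFrame on the shared key.
zips = areas.merge(population, on="zcta").merge(housing, on="zcta")

# Convert square meters to square kilometers, then compute density.
zips["land_area_sq_km"] = zips["land_area_sq_m"] / 1_000_000
zips["density_per_sq_km"] = zips["population"] / zips["land_area_sq_km"]

print(zips[["zcta", "density_per_sq_km", "median_year_built"]])
```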

From there it’s a simple matter of running some SQL-like queries on the DataFrame to classify the zip codes into the four categories of interest. That’s it for the initial analysis.
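One way to write those SQL-like queries is with pandas' `DataFrame.query`, which takes a filter expression as a string. This is a minimal sketch with made-up rows and assumed column names (`density`, `median_year_built`), so the notebook's own queries may well look different.

```python
import pandas as pd

# Made-up rows; zcta labels are placeholders, not real zip codes.
zips = pd.DataFrame({
    "zcta": ["A", "B", "C", "D", "E"],
    "density": [6500.0, 1200.0, 800.0, 40.0, 1500.0],
    "median_year_built": [1939, 1962, 1995, 1971, 1930],
})

# Start everything as "all others", then overwrite the rows each query matches.
zips["urbanity"] = "Auto exurban"

mid_density = "density > 100 and density < 2900"
urban = zips.query("density > 2900").index
early = zips.query(
    f"{mid_density} and median_year_built >= 1946 and median_year_built <= 1979"
).index
later = zips.query(f"{mid_density} and median_year_built > 1979").index

zips.loc[urban, "urbanity"] = "Urban"
zips.loc[early, "urbanity"] = "Auto suburban, early"
zips.loc[later, "urbanity"] = "Auto suburban, later"

print(zips[["zcta", "urbanity"]])
```

Because the three queries are mutually exclusive, the order of the assignments does not matter; anything none of them touches keeps the "Auto exurban" default.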

Looking around the U.S.

The Jupyter notebook goes on to create an interactive map using Mapbox technology, which I’ll describe in detail in a forthcoming post. For now, I want to focus on what this map can tell us.

As with the Boston example, other views from around the country each tell different stories about the composition of urban-ness, which when combined with your own data, can lead to deeper insights into customers or constituents.

The dense Mid-Atlantic region from New York City to Baltimore. By contrast, the South shows almost no dense urban areas. Combining both extremes, the stretch from Los Angeles to the San Francisco Bay shows dense areas alongside large swaths of rural ones.

If you find the data useful, or want to know more about how to use it to build a custom analysis, please leave a comment here. Whether you’re in a Pre-Auto Urban Core or an Auto Exurban municipality, thank you for reading!
