Wrangling Census Data without the API

Ashley White
Jan 28 · 4 min read

After taking on a team project to examine the economic impact (via wage loss) of natural disasters on a specific locale, our plan was to use a Seasonal Auto-Regressive Integrated Moving Average (SARIMA) forecast to estimate what historical employment, unemployment, and wage levels would have been sans a natural disaster. By comparing our predictions to the actual effects post-disaster, we could approximate the economic impact of that event.

Given the time crunch, I immediately panicked thinking about the possibility of having to navigate a clunky SQL interface or manually download disparate CSVs for each state to pull the information I needed. Furthermore, I wasn't even quite sure of the metrics or time frame that I would ultimately need. With all this uncertainty, I figured my best option would be to leverage an API or webscraper to get the data I would need. However, my API & BeautifulSoup skills still need some work, so I began to search for an alternative.

Fortunately, the Python community is full of creative developers who have built libraries and wrappers, like CenPy, that make it easier to interact with data. According to the creators, “CenPy (sen - pie) is a package that exposes APIs from the US Census Bureau and makes it easy to pull down and work with Census data in Pandas.” While I will explore a couple of features that I used, I would encourage you all to check out their GitHub and introductory notebook for more information.

Getting Started

The only imports I used for this exercise were CenPy and Pandas:

import pandas as pd
import cenpy as cen

By calling explorer.available, you can access a list of the identifiers of all the APIs that cenpy knows about. Fortunately, I knew which API/data series I would want to access, but if you want to see all the available APIs, you can browse the dictionary output or convert it to a Pandas dataframe for easier readability.

# Call list of available datasets, verbose = True to include dataset title
datasets = list(cen.explorer.available(verbose=True).items())
# Convert dictionary to dataframe
datasets = pd.DataFrame(datasets, columns = ['code', 'dataset name'])
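Since the resulting dataframe can contain hundreds of entries, a substring match on the title is a quick way to locate a series. A minimal sketch on a toy dataframe standing in for the real one (the sample codes and names here are illustrative, not actual cenpy output):

```python
import pandas as pd

# Toy stand-in for the datasets dataframe built above
datasets = pd.DataFrame(
    [('QWISA', 'Quarterly Workforce Indicators'),
     ('ACSDT5Y2015', 'American Community Survey')],
    columns=['code', 'dataset name'],
)

# Case-insensitive substring match on the dataset title
qwi = datasets[datasets['dataset name'].str.contains('workforce', case=False)]
print(qwi['code'].tolist())  # ['QWISA']
```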

code is important here, because this will allow you to specify to which database you’d like to establish a connection. The connection class allows you to construct a query string and make requests from the Census server. The result is then parsed from JSON and returned. I began my initial queries with the Quarterly Workforce Indicators (QWI), a set of 32 economic indicators including employment, job creation/destruction, wages, hires, and other measures of employment flows.

qwi_connection = cen.base.Connection('QWISA')

Constructing the Query

According to the Census API documentation, there are a few required parameters you must include when calling their API: endpoint, indicator, geography, and time. Default values for the other categorical variables are assumed if not specified. Here is an example of the minimum requirements for a fully formed basic query:

api.census.gov/data/timeseries/qwi/sa?get=Emp&for=state:02&year=2012&quarter=1&key=[userkey]
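Under the hood, a query like this is just an HTTP GET with a handful of query-string parameters, which is essentially what cenpy assembles for you. A hedged sketch of building that same URL by hand with the standard library (the key value is a placeholder, not a real key):

```python
from urllib.parse import urlencode

base = 'https://api.census.gov/data/timeseries/qwi/sa'
params = {
    'get': 'Emp',        # indicator
    'for': 'state:02',   # geography
    'year': '2012',      # time
    'quarter': '1',
    'key': 'userkey',    # placeholder; substitute your Census key
}
url = f'{base}?{urlencode(params)}'
print(url)
```

Note that urlencode percent-encodes the colon in `state:02`, which the Census server accepts just the same.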

I started by setting some default parameters:

# Specify all counties
g_unit = 'county:*'
# Start with one state (chosen arbitrarily)
g_filter = {'state': '01'}
# Specify time period
time = 'from 2003-Q1 to 2018-Q1'
# Uses .varslike to pull in all indicator names
cols = qwi_connection.varslike('Emp') # Employment
hir = qwi_connection.varslike('HirA') # Hiring
earns = qwi_connection.varslike('Earn') # Earning
payroll = qwi_connection.varslike('Pay') # Payroll
firm = qwi_connection.varslike('Frm') # Firm Job Stats
sep = qwi_connection.varslike('sep') # Separations
# Extend cols to add additional variables
cols.extend(hir)
cols.extend(earns)
cols.extend(payroll)
cols.extend(firm)
cols.extend(sep)
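One thing to watch: because varslike pattern-matches variable names, the patterns above can overlap and leave duplicate names in `cols`, and duplicates still count against the 50-variable limit per query. A small order-preserving dedupe, shown here on a toy list:

```python
def dedupe(variables):
    """Remove duplicates while preserving first-seen order."""
    seen = set()
    return [v for v in variables if not (v in seen or seen.add(v))]

# Toy example; in the article, `cols` is the combined list built above
cols = ['Emp', 'EmpEnd', 'HirA', 'Emp']
cols = dedupe(cols)
print(cols)  # ['Emp', 'EmpEnd', 'HirA']
```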

Looping over all States

I wanted to construct a loop that would iterate over all states and concatenate the results into one master dataframe. I started by creating a ‘master’ dataframe from the first state’s query, rather than a blank dataframe, to ensure the column order would match and subsequent queries would append cleanly.

# Create the first query / dataframe (with state 01)
master = qwi_connection.query(cols = cols, time = time, geo_filter = g_filter, geo_unit = g_unit)

I then leveraged this state FIPS codes dictionary to create a list of state codes over which to iterate:

state_codes = {
'WA': '53', 'DE': '10', 'DC': '11', 'WI': '55', 'WV': '54', 'HI': '15',
'FL': '12', 'WY': '56', 'NJ': '34', 'NM': '35', 'TX': '48',
'LA': '22', 'NC': '37', 'ND': '38', 'NE': '31', 'TN': '47', 'NY': '36',
'PA': '42', 'AK': '02', 'NV': '32', 'NH': '33', 'VA': '51', 'CO': '08',
'CA': '06', 'AL': '01', 'AR': '05', 'VT': '50', 'IL': '17', 'GA': '13',
'IN': '18', 'IA': '19', 'MA': '25', 'AZ': '04', 'ID': '16', 'CT': '09',
'ME': '23', 'MD': '24', 'OK': '40', 'OH': '39', 'UT': '49', 'MO': '29',
'MN': '27', 'MI': '26', 'RI': '44', 'KS': '20', 'MT': '30', 'MS': '28',
'SC': '45', 'KY': '21', 'OR': '41', 'SD': '46'
}
# Extract numerical state codes, skipping '01' since it is already in master
states = [code for code in state_codes.values() if code != '01']

Finally, I created a for loop to iterate over all state codes and joined the results with my existing master dataframe:

import requests

for s in states:

    print(f'Scraping {s}')

    try:
        # Iterate over state 's'
        g_filter = {'state': s}

        df = qwi_connection.query(cols=cols, time=time, geo_filter=g_filter, geo_unit=g_unit)

        # Concat new df with master df
        master = pd.concat([master, df])

    except requests.exceptions.HTTPError:
        pass
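One design note: growing `master` with pd.concat inside the loop copies the accumulated data on every iteration. A common alternative is to collect each state’s result in a list and concatenate once at the end. A sketch with toy frames standing in for the query results:

```python
import pandas as pd

frames = []
for s in ['02', '04']:  # stand-in for the full states list
    # df = qwi_connection.query(...) in the real loop; toy frame here
    df = pd.DataFrame({'state': [s], 'Emp': [100]})
    frames.append(df)

# Single concatenation at the end, with a fresh index
master = pd.concat(frames, ignore_index=True)
print(len(master))  # 2
```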

Please note you can include up to 50 variables in a single API query and can make up to 500 queries per IP address per day. More than 500 queries per IP address per day requires that you register for a Census key. That key becomes part of your data-request URL string, which you specify when you establish the connection.

I found that my query of all the counties of all the states exceeded the overall IP limit, even after using my API key, and it timed me out of the system. This forced me to split up my query into two parts over two days.
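If you hit the daily limit partway through, splitting the state list into batches and running one batch per day is a simple workaround. A sketch of the split (the batch size of 25 is arbitrary, and the FIPS list here is a toy stand-in):

```python
def batch(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

states = [f'{n:02d}' for n in range(1, 51)]  # toy stand-in for the FIPS list
batches = batch(states, 25)
print(len(batches), len(batches[0]))  # 2 25
```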

I hope this has been a helpful primer for you, and that you are able to explore all the possibilities of CenPy!
