American Community Survey Data in Python

Internet Access Denied (part II)

Acquire data via API to explore digital poverty in the U.S.

Jamel Dargan
9 min read · Nov 20, 2020

In the previous article, we obtained CSV data via download to approach an understanding of United States Census Bureau data tables with the ultimate goal of learning about discrepancies in broadband internet access. In the process, we saw that working with data sometimes requires many steps before we even begin typical data preparation tasks. The present article continues our study in advance of using geospatial visualization to explore our data and consider opportunities for predictive analysis.

Here we will focus on the following:

  • recalling the serialized data dictionary and variable objects we identified as being relevant to broadband in order to target specific subsets of data
  • obtaining an authentication key for the United States Census Bureau’s application programming interface (API)
  • understanding and building customized API calls to the Census Bureau’s American Community Survey (ACS) data source
  • importing datasets via API
  • visualizing data statistics relative to our subject

In our first notebook, we used Python’s pickle module to serialize and save to file our data dictionary and a list of variables, from our downloaded dataset, related to computers and internet use. We can restore these objects for reference as we build out a query to retrieve the data via the Census Bureau’s API.

# printing an object-restoration message
print("Result of restoration attempt:\n")
# restoring the data dictionary and broadband variable list from the serialized files
# (read_pickle is a small helper from our project's helpers package)
metadata_dict = read_pickle("pickles/metadata-dict.pkl")
broadband_vars = read_pickle("pickles/broadband-vars.pkl")

Result of restoration attempt:

Object restored from pickles/metadata-dict.pkl
Object restored from pickles/broadband-vars.pkl
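The `read_pickle` function used above is a project helper rather than a standard-library call. A minimal sketch of such a helper, assuming serialized files like those shown, might look like this:

```python
import pickle

def read_pickle(path):
    """Load a serialized object from `path` and report the restoration."""
    with open(path, "rb") as f:
        obj = pickle.load(f)
    print(f"Object restored from {path}")
    return obj
```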

We can view the loaded objects.

  • First, our list:
# verifying list restoration
broadband_vars

['DP02_0153E', 'DP02_0153M', 'DP02_0153PE', 'DP02_0153PM']

Note that only the suffix is different among the four broadband variables. The “E” suffix represents a numeric estimate, and “M” refers to the margin of error. The percent of the total and the margin of error percentage are indicated by “PE” and “PM” suffixes, respectively.

  • Next, our dictionary:
# verifying data dictionary restoration
metadata_dict

{'NAME': ['Geographic Area Name'],
 'DP02_0001E': ['Estimate!!HOUSEHOLDS BY TYPE!!Total households'],
 'DP02_0001M': ['Margin of Error!!HOUSEHOLDS BY TYPE!!Total households'],
 'DP02_0001PE': ['Percent!!HOUSEHOLDS BY TYPE!!Total households'],
 'DP02_0001PM': ['Percent Margin of Error!!HOUSEHOLDS BY TYPE!!Total households'],
 ...}

In total, there are more than 600 variable/label pairs in the dictionary. Viewing the data dictionary gives us the option to plan which variables we require to build a relevant dataset prior to calling them from the data source and loading them into memory.
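With more than 600 entries, scanning the dictionary by hand is tedious; a small keyword search over the labels can shortlist candidate variables. The entries below are illustrative stand-ins for the real labels, and `find_vars` is a hypothetical helper, not part of the Census tooling:

```python
# illustrative entries in the style of the restored metadata_dict
sample_dict = {
    "NAME": ["Geographic Area Name"],
    "DP02_0001E": ["Estimate!!HOUSEHOLDS BY TYPE!!Total households"],
    "DP02_0153E": ["Estimate!!COMPUTERS AND INTERNET USE!!With a broadband Internet subscription"],
}

def find_vars(metadata, keyword):
    """Return variable codes whose label contains `keyword` (case-insensitive)."""
    return [code for code, labels in metadata.items()
            if keyword.lower() in labels[0].lower()]
```

Run against the full dictionary, a search like `find_vars(metadata_dict, "broadband")` would surface the DP02_0153 family we restored earlier.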

API Requests

Any user can publicly request data via the Census Data API; however, there are limits. Without a registered key, API data retrieval is restricted to 50 variables per call. In addition, making more than 500 queries in one day from a single IP address requires a registered API key.

Before making use of the API, we will request a key by signing up on the census.gov site.

The Census Bureau’s simple API key request page contains only two text fields, a check-box, and a submit button.
Screen capture of the Census Bureau’s simple API key request page. All images by the author unless otherwise indicated.

The sign-up page (http://api.census.gov/data/key_signup.html) requires only a name, an email address, and acknowledgment of the terms of service. After this information is successfully submitted, the key will be sent to the provided email address in the form of a long, alphanumeric string. We can save the string to a local environment variable, a key manager, or simply to a text file that will not be made public.

We will copy our key string to a file as CensusDataKey="<key>" and save it in the main project directory as “.env”. Later, we will load the key using the dotenv library to execute an authenticated API request.

Constructing an API Request

We previously obtained data for “Selected Social Characteristics in the United States” via direct download from the data.census.gov site. Sharp-eyed readers may have observed from our screen capture of the download dialog that the API link for our dataset was inactive.

Screen capture of “Download / Print / Share” dialog with “Embed”, “Share” “API”, and “Print” links grayed-out.
(source)

Let’s figure out how to obtain the same (or similar) data via the Census Bureau’s API.

Census Data API queries are structured as follows:

  1. The start of the query references the API that is being addressed: “https://api.census.gov/”.
  2. The next segment describes the data source. The 2019 ACS table we have been working with (DP02) is a 1-year estimates table. Since this is a “Data Profile” table, we extend the API call as follows: “https://api.census.gov/data/2019/acs/acs1/profile”.
  3. We then add the “Get Function,” which announces that we are about to indicate the variables we wish to include in the retrieved dataset: “https://api.census.gov/data/2019/acs/acs1/profile?get=”.
  4. Our specific variables make up the “Variable List”. We use commas to separate the variables we wish to retrieve. For a start, we will request “GEO_ID” and the “NAME” variable describing our geographies: “https://api.census.gov/data/2019/acs/acs1/profile?get=GEO_ID,NAME”.
  5. After the variables, we can add a “Predicate” clause to indicate how the records should be filtered. The Predicate starts with an ampersand (&) and accepts “for” and “in” modifiers.
  6. A “Geography” tag may be added to filter for specific geographic area(s) of interest. We can use FIPS code “51” to limit results to the State of Virginia: “https://api.census.gov/data/2019/acs/acs1/profile?get=GEO_ID,NAME&for=state:51”. We can alternately select a specific county in the state by adjusting our Predicate and Geography (for example, “&for=county:015&in=state:51” to filter for Augusta County in Virginia). To include all counties, we can use an asterisk as a wildcard to produce the following query: “https://api.census.gov/data/2019/acs/acs1/profile?get=GEO_ID,NAME&for=county:*&in=state:51”.

The API query will return a JSON-formatted string. You can follow the above hyperlinks to view the data in-browser. We will read our test query into a Pandas dataframe in Python.
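The steps above can be sketched as a small query builder. This is an illustration of the URL structure, not part of the Census API itself; `build_acs_query` is a hypothetical helper:

```python
def build_acs_query(year, dataset, variables, geo_for, geo_in=None):
    """Assemble a Census Data API query string from its parts."""
    query = f"https://api.census.gov/data/{year}/{dataset}"
    query += "?get=" + ",".join(variables)  # the Get Function and Variable List
    query += f"&for={geo_for}"              # the Predicate's geography filter
    if geo_in:
        query += f"&in={geo_in}"            # optional containing geography
    return query

# reproduces the all-counties-in-Virginia query from step 6
url = build_acs_query(2019, "acs/acs1/profile",
                      ["GEO_ID", "NAME"], "county:*", "state:51")
```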

Making the Call

We can preview the data in Pandas, as follows:

# previewing data returned from the api test-query in pandas
pd.read_json("https://api.census.gov/data/2019/acs/acs1/profile?get=GEO_ID,NAME,DP02_0153E,DP02_0153M,DP02_0153PE,DP02_0153PM&for=county:*&in=state:51").head()
First 5 rows of data returned from the API — columns labeled sequentially, 0–7, with variable names in the 0th row.
The first 5 rows returned from the API call.

We see the requested variables in the 0th row of each dataframe column, including columns for the “state” and “county” geography filters. The query result suggests that we might consider performing the following tasks:

  • Instantiate session variables for the API URL and the authentication key.
  • Define row 0 data as column headers.
  • Leave the column header case-styling as it is, to remain consistent with our dictionary.

We instantiate our key and our URL to build an authenticated query to submit to the API.

# importing modules for environment-variable handling
import os
from dotenv import load_dotenv

# loading the saved key `CensusDataKey` from the local ".env" file
load_dotenv()
print("Environment variables loaded.")
# instantiating the stored api key
key = os.environ.get("CensusDataKey")
# instantiating the constructed api call
api_url = "https://api.census.gov/data/2019/acs/acs1/profile?get=GEO_ID,NAME,DP02_0153E,DP02_0153M,DP02_0153PE,DP02_0153PM&for=county:*&in=state:51"
# appending the private key to the api URL
keyed_query = api_url + f"&key={key}"

The following function submits the query, reads the returned ACS data into Pandas, and updates the column headers:

def api_json_to_df(keyed_query):
    '''Returns the JSON result of an authenticated API query
    as a dataframe with proper column headers.

    - INPUT: keyed_query = formatted query string with key appended
      (keyed_query = api_url + f"&key={key}")
    - OUTPUT: a dataframe with column headers
    '''
    # reading the query result into a dataframe
    df = pd.read_json(keyed_query)
    # extracting column headers from the first row's values
    headers = df.iloc[0]
    # creating a new dataframe with the extracted column headers
    ex_df = pd.DataFrame(df.values[1:], columns=headers)
    # returning the new dataframe
    return ex_df
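As an aside, an equivalent approach (assuming the requests library is installed) passes the key as a request parameter rather than concatenating it onto the URL, which delegates the URL encoding to the library. This is a sketch under those assumptions, not the method used in this notebook:

```python
import requests
import pandas as pd

def rows_to_df(rows):
    """Convert the API's list-of-lists JSON (headers in row 0) to a dataframe."""
    return pd.DataFrame(rows[1:], columns=rows[0])

def fetch_acs(api_url, key):
    """Submit the query with the key passed as a request parameter."""
    response = requests.get(api_url, params={"key": key})
    response.raise_for_status()
    return rows_to_df(response.json())
```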

As documented, we can create the dataframe with the following line:

# executing the function to acquire and instantiate data as a dataframe
api_df = api_json_to_df(keyed_query)

We call the assigned variable to view the result:

# viewing the dataframe
api_df
Screen capture: dataframe of the first 5 rows of data returned from the API with ACS variable headers.
First 5 rows of data returned from the API with ACS variable headers.

There are a total of thirty rows in the dataset, representing the thirty counties in Virginia with populations of 65,000 or more.

Planning for Exploratory Data Analysis (EDA)

Earlier, we restored our data dictionary containing the variables available in our data source. Scrolling through the dictionary, we find variables that we can add to our dataset for more targeted exploration. For example, the variables DP02_0011(E/M/PE/PM) make it possible to focus on statistics for the households of single mothers with children under the age of 18.

Note: Variables in other tables can be used to narrow our focus even further (e.g., school-age children — those between the ages of six and seventeen).

Keep in mind that our table enables us to view counts and percentages relative to the county populations. While we can explore data relationships, we will not be able to determine from this example the number and percentage of single-mother households with children under the age of 18 that have a broadband internet connection.

To add this subcategory to our dataset, we first update the API URL and ‘recalculate’ keyed_query:

# updating the api url
api_url = "https://api.census.gov/data/2019/acs/acs1/profile?get=GEO_ID,NAME,DP02_0153E,DP02_0153M,DP02_0153PE,DP02_0153PM,DP02_0011E,DP02_0011M,DP02_0011PE,DP02_0011PM&for=county:*&in=state:51"
# updating the keyed_query
keyed_query = api_url + f"&key={key}"

We then can execute our function as done before to view the resulting dataframe.

All columns in our dataframe are string objects. We can convert “E” and “M” columns to integers. The “PE” and “PM” columns represent percentages. We can convert them to floats.

This particular dataset is small enough to see at a glance that none of the columns we need to convert contain empty values, and none of them contain values that cannot be converted to numeric data. We can create modules in the helpers package to perform our conversion tasks. My modules use .astype conversions (df_cols.astype('int64') for ‘estimate’ and ‘margin of error’ columns, df_cols.astype('float') for percentage columns), but there are other options.
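A minimal version of those helpers, matching the .astype conversions just described, might look like the following (the function names mirror the ones used below, but this is a sketch, not the exact contents of the helpers file):

```python
def string_to_integer(df_cols):
    """Convert string-typed estimate/margin-of-error columns to 64-bit integers."""
    return df_cols.astype('int64')

def string_to_float(df_cols):
    """Convert string-typed percentage columns to floats."""
    return df_cols.astype('float')
```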

Since I will use these functions again, I add them to our “helpers” package. After appending the functions to “helper_func.py”, I append the import statements to the package’s “__init__.py” and also import the modules into the notebook.

%%writefile -a C:\\Users\\jamel\\myprojects\\acs-api\\helpers\\__init__.py
from .helper_func import string_to_integer
from .helper_func import string_to_float

# importing the conversion modules into the notebook from "helpers" (in a separate cell)
from helpers import string_to_float, string_to_integer

Of course, you can skip this bit of engineering and simply define the functions inline, listing the relevant variables and passing them as function arguments:

# listing variables to convert to int/float
int_vars = ["DP02_0153E", "DP02_0153M", "DP02_0011E", "DP02_0011M"]
float_vars = ["DP02_0153PE", "DP02_0153PM", "DP02_0011PE", "DP02_0011PM"]
# converting to integers with function
api_df[int_vars] = string_to_integer(api_df[int_vars])
# converting to floats with function
api_df[float_vars] = string_to_float(api_df[float_vars])

Once the column data types have been converted, we can use the describe() method to view summary statistics for numeric columns. Those statistics may be more meaningful if we first rename our column headers to reflect the labels in our data dictionary.

# viewing column names
api_df.columns
# renaming columns with short labels based on `metadata_dict`
dict_df = api_df.rename(columns={
    'DP02_0153E': 'broadband_est', 'DP02_0153M': 'broadband_margin',
    'DP02_0153PE': 'broadband_pct_est', 'DP02_0153PM': 'broadband_pct_margin',
    'DP02_0011E': 'single_mother_est', 'DP02_0011M': 'single_mother_margin',
    'DP02_0011PE': 'single_mother_pct_est', 'DP02_0011PM': 'single_mother_pct_margin'})
# viewing the renamed dataframe
dict_df
# viewing summary statistics
dict_df.describe()
Screen capture of statistics for numeric columns generated via the pandas describe() method, with updated column headers.
Detail: statistics for numeric columns.

A Quick Plot

We now have a few analysis pools that we can dip our toes into. We can already see that, among these thirty largest counties, the average broadband-access rate is approximately 88.18%, and the minimum is 77.2%. We can plot and compare the percentage estimates of broadband-connected households by county or visualize the percentage of single-mother households.

Let’s take a quick look at broadband and single-mother estimates.

# instantiating plotting variables
head_df = dict_df.head()
x = head_df['NAME']
y1 = head_df['broadband_pct_est']
y2 = head_df['single_mother_pct_est']

We will create two plots in a single column and two rows.

# creating the broadband plot
plt.subplot(2,1,1)
plt.ylabel('Broadband percentage')
plt.bar(x, y1)
plt.xticks([],[])
# creating the single-mothers plot
plt.subplot(2,1,2)
plt.bar(x, y2)
plt.ylabel('Single-mother household percentage')
plt.xticks(rotation=70, ha='right')
# titling and displaying
plt.suptitle("Broadband and Single-Mother Percentages")
plt.show()
Bar chart subplots: percentages for broadband-connected and single-mother households, in the dataset’s first five rows.
Visualizing percentages for broadband-connected and single-mother households, in the first five rows.

Can we draw any meaningful conclusions, yet (other than, “that plot looks like a troubling dental exam” and, “there must be better ways to visualize data”)?

Well, maybe not yet.

Next steps

This is a good place for us to break.

We now have an authorization key for the U.S. Census Bureau’s API, and we know how to acquire custom datasets related to the American Community Survey. In addition, we can serialize and recall any local variables we wish to save for future use.

Now, we can actually get around to our purpose of exploring discrepancies in broadband internet access. We will import additional tools and explore data relationships more meaningfully in the next notebook.

Heck, we might even make it pretty.

References

A full Census reference guide is available electronically from the Census Bureau.

The developer site includes API examples by table type.
