American Community Survey Data in Python

Internet Access Denied (Part I)

Obtaining data to explore digital poverty in the US

Jamel Dargan
The Startup

--

This article is the first in a series looking into strategies for exploring United States Census Bureau public data with Python. We will first preview data via online tools and then obtain CSV data via download to gain an initial understanding of the data in Pandas. In a subsequent article, we will build a customized API call, obtain a key, and import data via the API. We will then use geospatial visualization to explore our data and consider opportunities for predictive analysis.

Much of what I have discovered or experienced in the areas of data science and analysis might have been much more difficult for me to realize without reliable access to the internet. Downloading datasets, collaborating and sharing work via online repositories, and experimenting with open source tools have all been made possible by that access. And while it pains me to no end when weather or a maintenance event interrupts my online access for even one hour in a single month, many have it much worse.

Overview

Reliable, affordable internet access is not a given. Differences in availability can be significant across various continents, countries, or cultures. Differences may also be notable between and among lower-level geographies.

My own household heavily depends on internet connectivity for everything from communication and entertainment to learning and livelihood. This has been especially true since responses to the Coronavirus pandemic escalated the necessity of virtual meetings for distance learning and working from home.

When quarantine efforts during the ‘first wave’ of COVID-19 caused businesses and government offices to reduce hours or temporarily close, there was a lot of discussion about what such conditions would mean for workers and for the economy. I do not recall hearing nearly as much about citizens who were already jobless or those who already relied on public computer labs and free Wi-Fi.

I started to wonder where I could look to get a sense of what ‘internet access’ looks like.

In this article we will describe the following:

  • Locating the data source
  • Identifying the dataset we want
  • Obtaining the data for exploration in Python
  • Subsetting relevant variables with metadata labels
  • A preview of the next article’s tasks

Locating the data source

My curiosity led me to the United States Census Bureau’s American Community Survey (ACS). The bureau collects survey data on an ongoing basis and releases it each year. The data includes 5-year counts and 1-year estimates of households with “a broadband Internet subscription” and a report of the population percentage represented by that number.

The Census Bureau provides several avenues for obtaining its data and for exploring the data online. Among them are a Quick Facts data tool that provides high-level overviews of select statistics for states and counties, full Detail Tables available via FTP that are composed of more than 31,000 variables, and Data profiles “presented as population counts and percentages” that contain datasets with more than 1,000 variables. In addition, the Census Bureau maintains API pages for its ACS datasets.

The API seems like a natural choice for a method that is parametric, programmable, and Python-friendly. However, ACS table names and data labels are notoriously confounding. Variable names change over time and vary based on the geography being analyzed. To be clear about the tables we retrieve and the data they represent, we will have to perform a bit of preliminary research.

The tables

The Census Bureau maintains a Table IDs Explained webpage. The page provides an example table ID and an explanation of how it is built to describe specific content and format.

Image of Census.gov’s interactive explanation table for its Table ID numbering system (source).

ACS table IDs comprise up to five elements. The example table B06004APR can be deconstructed as follows (a short parsing sketch appears after this list):

  1. The initial character(s) indicates the table type, and already we have an exception. IDs that begin with “B” or “C” indicate the detail table type: base tables start with “B”, while “C” indicates a collapsed table with lower-level detail grouped (unless the ID starts with “CP”, which is a different table type). Other types of tables include subject tables (“S”) with estimates and percentages for a particular subject, such as veterans, and data profile tables (“DP”) containing high-level social, economic, housing, and demographic information. There are a total of nine types of tables, including comparison tables (“CP”). Five of the types are profile tables, which I mention because…
  2. The second element denotes the subject, but not for profile table types. This element is made up of a two-digit number (01–29, 98, and 99). Since “06” is the code for the ‘place of birth’ subject, we can expect our example table “B06…” to be a detail type table with place-of-birth statistics.
  3. Element 3 is a three-digit number described as uniquely identifying “the table within a given subject.” That does not tell us much, but it is represented by “004” in our example.
  4. The fourth element is an alphabetic suffix, “A” through “I”, for the “nine major race and Hispanic or Latino groups.” Somewhat confusingly, “A” indicates “White Alone”, whereas “H” represents “White Alone, Not Hispanic or Latino”. So far, our example appears to reference a detail type table of place-of-birth statistics for the “White Alone” population.
  5. Element 5 is an additional suffix used for select tables; it is specific to tables from the Puerto Rico Community Survey questionnaire. We finally can describe our example table as a detail type table of place-of-birth statistics for the “White Alone” population in Puerto Rico.
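To make the naming convention concrete, here is a small, illustrative Python sketch that splits an ID like B06004APR into the five elements described above. The parse_table_id helper is our own invention for this article, not a Census Bureau utility, and it covers only detail-type (“B”/“C”) IDs.

import re

# Illustrative only: a simplified parser for detail-type ACS table IDs,
# based on the five elements described above; it does not handle the
# "S", "DP", or "CP" table types.
TABLE_ID_PATTERN = re.compile(
    r"^(?P<type>[BC])"      # element 1: base ("B") or collapsed ("C") table
    r"(?P<subject>\d{2})"   # element 2: two-digit subject code
    r"(?P<table>\d{3})"     # element 3: table number within the subject
    r"(?P<race>[A-I])?"     # element 4: optional race/ethnicity suffix
    r"(?P<pr>PR)?$"         # element 5: optional Puerto Rico suffix
)

def parse_table_id(table_id):
    """Return the named elements of a detail-type table ID, or None."""
    match = TABLE_ID_PATTERN.match(table_id)
    return match.groupdict() if match else None

parse_table_id("B06004APR")
# {'type': 'B', 'subject': '06', 'table': '004', 'race': 'A', 'pr': 'PR'}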

Having a sense of these table-naming conventions will likely prove useful when the time comes to construct an API call, though we cannot know for certain until we have some understanding of the columnar data referenced by the table elements. So let’s refocus our attention on the data we want to retrieve.

Data Preview

Recall that we are concerned with disparities in broadband internet access. We will use the Commonwealth of Virginia to define the scope of our geographic interest. Since we would like to compare differences between and among lower-level geographies within the state, we need to know how granular the data available from the ACS is.

Element 2 of the table ID does not contain a code for “Broadband,” and the Table IDs Explained webpage’s interactive description does not indicate what the Element 3 identifiers are. Fortunately, we can get closer to identifying the sources we require for working with ACS data in Python by turning to the data.census.gov website.

Entering the search term “broadband” in a dialog on the data.census.gov site.

Searching the term “broadband” brings us to a results page with links to tables, maps, and pages related to the search term.

Detail: first three table cells in a linked list of search results on data.census.gov (source).

The first result looks promising. Table S2801, which we can expect to be a subject type table based on the “S” prefix, is identified as “Types of Computers and Internet Subscriptions” and is available for the years 2015–2019. Table B28002 also appears relevant, particularly if we plan to perform feature engineering, since the “B” prefix suggests that it is an uncollapsed detail type table. Useful 5-year counts and 1-year estimates should be accessible from either of these two tables. Selecting the first result takes us to a page where we can preview labels and summary data for the most recent year.

Detail of preview table for a selected search result (source: https://data.census.gov/cedsci/table?q=broadband&tid=ACSST1Y2019.S2801).

For results spanning multiple years, a specific product may be selected from a drop-down selector at the top of the page. The left sidebar reports that there are a total of 39 results for our search term. We select a different result and preview its data.

The “Presence and Types of Internet Subscriptions in Household” detail table for 2019 appears to report the data with which we are concerned. We also have an option to filter the results. The filter gives us options for specifying topics and geography in addition to years, surveys, and codes. These should get us closer to the level of geographical detail that interests us:

  • For surveys, we will select the ACS 1-Year Estimates Data Profiles
  • We will choose the year 2019
  • We can specify all counties in Virginia for our geography (1-year estimates data granularity is limited to areas with populations of 65,000+)

Selecting Done returns us to the table preview, filtered to our requirements.

We can see that the table is identified as DP02. For labels (many of which have been collapsed for the above screen capture), we see count estimates, a margin of error, percent (of estimated population), and percent margin of error.

We can take note of the preview URL: https://data.census.gov/cedsci/table?q=broadband&g=0400000US51.050000&y=2019&d=ACS%201-Year%20Estimates%20Data%20Profiles&tid=ACSDP1Y2019.DP02&hidePreview=false. Later we can see how this URL reflects the API call required to import the data with Python. For the moment, we will acquire the dataset by selecting the sidebar’s “DOWNLOAD” link and completing the dialog prompt that appears as shown below.
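Before downloading, it is worth pausing to see how much structure that URL already carries. A minimal sketch using the standard library’s urllib.parse (the parameter names come from the URL above, not from any API documentation) decodes the query string into its parts:

from urllib.parse import parse_qs, urlsplit

url = (
    "https://data.census.gov/cedsci/table?q=broadband"
    "&g=0400000US51.050000&y=2019"
    "&d=ACS%201-Year%20Estimates%20Data%20Profiles"
    "&tid=ACSDP1Y2019.DP02&hidePreview=false"
)

# parse_qs decodes the percent-encoding and maps each parameter to a list
for key, values in parse_qs(urlsplit(url).query).items():
    print(key, "=", values[0])
# q = broadband
# g = 0400000US51.050000
# y = 2019
# d = ACS 1-Year Estimates Data Profiles
# tid = ACSDP1Y2019.DP02
# hidePreview = false

Already we can recognize our search term (q), the year (y), a geography code (g), and the table ID (tid).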

View of data.census.gov data download dialog, with selections for a 1-year CSV file of 2019 data.
(source)

The resulting download is a compressed file containing the CSV data, a CSV metadata file, and an additional text file. The metadata provides a data dictionary for the dataset columns. The included text file is actually a markdown file with descriptions, definitions, and hyperlinks to explanatory information supporting the data table.

Into Python

Before we work out the syntax needed to obtain the data we ultimately wish to explore, we will use our downloaded files to come to an understanding of our data and the features relevant to our investigation. Decompressing the ZIP file in Python would be a low-value exercise for our current purposes, so we can simply extract its contents to the project data folder in the operating system.
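(If you would rather stay in Python anyway, the standard library’s zipfile module can do the extraction in a few lines. The archive name below is an assumption; substitute whatever filename your download received.)

import zipfile

# extracting the downloaded archive into the project data folder;
# the archive name here is hypothetical
with zipfile.ZipFile('../data/ACSDP1Y2019.DP02.zip') as archive:
    archive.extractall('../data')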

We begin by importing the os and pandas packages.

import os
import pandas as pd

We are working in a Jupyter Notebook environment, from a ‘notebooks’ subdirectory. The data files are located in a sibling directory (see illustrated directory structure example, below, with truncated filenames).

.
└── main/
    ├── notebooks/
    │   ├── 1-web-data
    │   └── ...
    └── data/
        ├── ACSDP1Y2019.DP02_data_with_overlays...csv
        ├── ACSDP1Y2019.DP02_metadata_...csv
        └── ACSDP1Y2019.DP02_table_title_...txt
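Because the full filenames carry long suffixes (truncated above), one optional trick is to let Python locate them for us, using the os module we just imported. This sketch assumes only that the filename fragments shown in the tree are present:

# locating the downloaded files by substring rather than typing full names
data_dir = os.path.join('..', 'data')
for name in sorted(os.listdir(data_dir)):
    if 'DP02' in name:
        print(os.path.join(data_dir, name))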

We then read our downloaded dataset and metadata into Python and view the dataset’s summary information.

# instantiating the data file as a dataframe
data = pd.read_csv('../data/ACSDP1Y2019.DP02_data_with_overlays_...csv')
# instantiating the metadata file as a dataframe
metadata = pd.read_csv('../data/ACSDP1Y2019.DP02_metadata_...csv')
# viewing dataset info
data.info()
# viewing the number of rows and columns
data.shape

Viewing dataset info reveals that there are 31 rows and 614 columns in this dataset. Viewing the first few rows of the dataframe can give us a sense of how the data is structured.

# viewing first rows
data.head()

Each row appears to represent a county or county-equivalent. There are 133 such geographies in Virginia. The fact that our data consists of only 31 rows tells us that only a portion of those geographies are included in the 1-year estimates. The NAME column indicates the name of each locality.
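As a quick, optional check, we can list the NAME column to see exactly which localities made the cut:

# listing the geographies present in the 1-year file; note that the
# first entry is a column description rather than a locality
data['NAME'].tolist()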

The other column labels are coded. The first row contains what seem to be descriptions for each column of data. For example, in the column labeled DP02_0001E, “Estimate!!HOUSEHOLDS BY TYPE!!Total households” is the first entry. Let’s verify that this matches our metadata file.

# viewing first rows from metadata
metadata.head()

The entries match.

Data Subset

Let’s make a copy of data without the 0th row. We will also reset the index.

# deleting the 0th row
update_data = data.drop([data.index[0]])

# resetting the index to increment from 0
update_data.reset_index(drop=True, inplace=True)

update_data

We now have one fewer row in our dataset. We can create a subset of the metadata rows whose labels contain the string “broadband”.

# searching for the term "broadband" in metadata's `id` column
filtered_meta = metadata[metadata['id'].str.contains("broadband")]
filtered_meta

The filtered dataframe has only four rows. Its GEO_ID column shows us which data column codes are relevant to broadband. Let’s list the labels.

# listing the filtered `GEO_ID` column labels
broadband_labels = filtered_meta['GEO_ID'].tolist()
# viewing the list of filtered labels
broadband_labels

We can use the broadband_labels list to create a subset of our data containing only columns relevant to households reporting broadband internet connections.

# subsetting `update_data` to include only columns listed in `broadband_labels`
update_data[broadband_labels]
Detail: first five rows of `update_data` dataframe with broadband-relevant columns.
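One caveat worth noting: because the CSV’s first row held text descriptions, pandas will have read these columns in as strings. If we want numbers right away, a short, optional conversion sketch can handle it (errors='coerce' turns any non-numeric annotations into NaN):

# converting the broadband columns from strings to numeric values
broadband = update_data[broadband_labels].apply(pd.to_numeric, errors='coerce')
# summarizing the converted columns
broadband.describe()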

We can create a dictionary from metadata to more easily reference descriptions for the columns in our dataset. This will be useful for identifying columns that may interest us for analysis as we go forward.

# creating a metadata dictionary
metadata_dict = metadata.set_index('GEO_ID').T.to_dict('list')
# viewing the dictionary
metadata_dict
Detail: first 5 items of the data dictionary.

This is an improved visual reference. We could remove the first entry and break the string descriptions out of their lists, but we would not gain much, if anything, for our trouble, so we will leave the dictionary as it is.
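With the dictionary in place, looking up the description for any coded column becomes a one-liner, using the DP02_0001E code we saw earlier:

# looking up the description for a single column code
metadata_dict['DP02_0001E']
# ['Estimate!!HOUSEHOLDS BY TYPE!!Total households']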

At this point, we can begin to explore our data and engineer useful subsets in Python.

Next steps

Now we know where we can obtain relevant data, how to interpret its column labels, and which variables are central to questions about broadband access. In the next article, we will obtain custom data via calls to the Census Bureau’s API and perform exploratory data analysis.
