Entering Data USA

Before working on Data USA, I had not heard of the acronyms ACS, PUMS, CBP or O*NET. In terms of my daily life, the only way I felt somewhat in tune with US government data was that every so often I would hear something on the news about employment statistics. For some reason it never occurred to me to investigate the source of the data or try to learn more. It just seemed like something boring and remote.

At the outset of the Data USA project, the collaborators met in Cambridge. During the early discussions, I was surprised to learn how much “Census data” was available from the American Community Survey (ACS). My data ignorance showed again here: as far as I knew, the Census was something conducted once every ten years. It turns out that is only the decennial census. The ACS produces data every year for public consumption. A lot of data.

The first meeting discussing Data USA

After getting acquainted with a portion of what was out there, we set out to establish what stories we wanted to tell, and what could be told through the data. We worked in an iterative process to hammer out what we ideally wanted to show across four main dimensions: locations, occupations, industries and educational courses.

The second team meeting

Once we had established the stories we wanted to tell, we started acquiring the requisite data. A big part of creating compelling visualizations is not just showing the data as is, but also using the available metadata to give it context. In many of our tree map visualizations we group similar industries by color. To achieve this, we rely on the nesting of the industrial classification system (NAICS). Sometimes we also generate the metadata ourselves. For example, we use the latitude-longitude coordinates of institutions, published by the Department of Education, to geospatially join its IPEDS data with ACS data. This lets us summarize the education data at geographic levels beyond those explicitly listed in the data. Beautiful, easy-to-understand visualizations depend on having the right data.
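
To make the geospatial join concrete, here is a minimal sketch of the general idea, not the code we run in production. It assigns each institution to a region with a ray-casting point-in-polygon test and then sums a value per region. The region shapes, institution records, field names and the `degrees` measure are all made up for illustration; real boundaries would come from shapefiles and would need a proper geospatial library.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]  # (longitude, latitude)


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray-casting test: count how many polygon edges a ray going
    right from the point crosses; an odd count means "inside"."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_region(point: Point, regions: Dict[str, List[Point]]) -> Optional[str]:
    """Return the id of the first region containing the point, if any."""
    for region_id, polygon in regions.items():
        if point_in_polygon(point, polygon):
            return region_id
    return None


def summarize_by_region(institutions: List[dict],
                        regions: Dict[str, List[Point]]) -> Dict[str, int]:
    """Sum the (hypothetical) 'degrees' field for each region."""
    totals: Dict[str, int] = {}
    for inst in institutions:
        region = assign_region((inst["lon"], inst["lat"]), regions)
        if region is not None:
            totals[region] = totals.get(region, 0) + inst["degrees"]
    return totals


# Toy data: two adjacent square "regions" and four institutions.
REGIONS = {
    "region_a": [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)],
    "region_b": [(2.0, 0.0), (4.0, 0.0), (4.0, 2.0), (2.0, 2.0)],
}
INSTITUTIONS = [
    {"name": "A", "lon": 1.0, "lat": 1.0, "degrees": 100},
    {"name": "B", "lon": 0.5, "lat": 1.5, "degrees": 50},
    {"name": "C", "lon": 3.0, "lat": 1.0, "degrees": 70},
    {"name": "D", "lon": 5.0, "lat": 5.0, "degrees": 10},  # outside both
]

if __name__ == "__main__":
    print(summarize_by_region(INSTITUTIONS, REGIONS))
```

Institutions that fall outside every region are simply left out of the totals here; a real pipeline would probably want to log them instead.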

One of the trickier aspects of building the site was that the datasets were not all cleanly linked. For instance, some agencies use their own customized versions of standard classification codes for identifying industries, occupations and even educational courses. Furthermore, not every dataset is available at the same levels of geographic resolution (and some are available only at the national level). Coarse resolution is sometimes a deliberate choice to respect privacy. PUMS data, for example, is intentionally released at restricted geographic resolution to protect individuals’ anonymity: even the smallest PUMA (Public Use Microdata Area) is meant to contain around 100,000 people.

To deal with these issues and others, we built a feature into the API that redirects queries based on what the user asks for and what is available. Here’s a quick example: County Health Rankings (CHR) data are only available at the state and county levels, but if a user asks for the data at the place level, we convert that place into a county; see here (note the “subs” field). We use a similar technique to crosswalk between the ACS PUMS and BLS versions of the NAICS industrial classification system.
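
The redirect logic can be sketched roughly as follows. This is an illustration of the idea rather than the actual API code: the dataset names, the `level:id` identifier format, the lookup tables and the shape of the `subs` mapping are all assumptions invented for the example.

```python
from typing import Dict, List

# Hypothetical: the geographic levels each dataset is published at.
DATASET_LEVELS: Dict[str, List[str]] = {
    "chr": ["state", "county"],
    "acs": ["state", "county", "place"],
}

# Hypothetical lookup from a place to the county that contains it.
PLACE_TO_COUNTY: Dict[str, str] = {
    "place:p1": "county:c1",
    "place:p2": "county:c1",
}


def level_of(geo_id: str) -> str:
    """Extract the level from an id such as 'county:c1'."""
    return geo_id.split(":", 1)[0]


def resolve_geo(dataset: str, geo_id: str) -> dict:
    """Return the geography to query, plus a record of any substitution.

    If the dataset has data at the requested level, the id is used as is.
    Otherwise a place is swapped for its containing county, and the swap
    is reported in 'subs' so the caller can tell it happened.
    """
    available = DATASET_LEVELS[dataset]
    if level_of(geo_id) in available:
        return {"geo": geo_id, "subs": {}}

    substitute = PLACE_TO_COUNTY.get(geo_id)
    if substitute is not None and level_of(substitute) in available:
        return {"geo": substitute, "subs": {geo_id: substitute}}

    raise LookupError(f"no data or substitute for {geo_id!r} in {dataset!r}")


if __name__ == "__main__":
    print(resolve_geo("chr", "place:p1"))  # redirected to the county
    print(resolve_geo("acs", "place:p1"))  # served directly
```

Reporting the substitution alongside the result matters: a user who asked about a place should be able to see that the numbers they got back describe the surrounding county. A classification crosswalk can follow the same pattern, with a code-to-code lookup table in place of the place-to-county one.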

These ostensibly small technical features, such as geospatially joining datasets or crosswalking industry classifications, have a large effect on what we are able to deliver to users and help us paint a more coherent story. By not merely taking the data as is, but structuring, cross-linking, and enriching it, we are able to tell millions of data-driven stories about the economy, education and demographics of the USA.