Aroostook County, ME profile photo c/o U.S. Fish and Wildlife Service Northeast Region Flickr

An Introduction to the Data USA API

Published in Datawheel Blog, Jun 20, 2016

An API, or Application Programming Interface, is the umbilical cord that feeds raw data from the database to the front-end. The Data USA API is the backbone of datausa.io: its data stream is used to build the visualizations dynamically in the browser using D3plus. Separating these two aspects of the site provides the following benefits:

Modularity: API logic can be maintained independently of the front-end codebase.
Reusability: the same data calls can be used for multiple different visualizations.
Interactivity: building the visualizations dynamically on the front-end (vs. sending all required data with the page request) makes it possible to code interactive functionality like tooltips and other mouse behaviors.
Performance: visualizations are only created when they are viewed by the user, reducing unnecessary data transfer.

⚠️ Warning: Deprecated ⚠️

The following article discusses the original Python-based Data USA JSON API, which is no longer in production. Many of the links and examples in this article may be broken. For up-to-date usage information on the Data USA API, please refer to the following page: https://datausa.io/about/api/

Constructing a Query

Making 7 different datasets available under one roof was no small feat. To do this we had to encode all the possible combinations of filters on the data into the query parameters of a URL. All API requests start with the same root:

http://api.datausa.io/api/

Followed by some combination of predetermined URL encoded query filters:

?column_name=value&column_name2=value2

For further information on what it means to filter the data this way, take a look at the documentation. An example data call like the following:

http://api.datausa.io/api/?show=naics&sumlevel=0&where=num_ppl:>10000000

Means “show” or return NAICS (industries) at the “0” sumlevel (explained later) where the “num_ppl” column is greater than 10 million.
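As a rough illustration of the encoding described above, the following sketch assembles a request URL from a plain object of filters. The helper name `buildQuery` is hypothetical (it is not part of the Data USA codebase), and it relies only on the standard `URL` class available in browsers and Node.js. Special characters in the values, such as `:` and `>`, are percent-encoded in the resulting string.

```javascript
// Root of all API requests, as given in the article.
const API_ROOT = "http://api.datausa.io/api/";

// Hypothetical helper: turns an object of filters into a full request URL.
function buildQuery(filters) {
  const url = new URL(API_ROOT);
  Object.keys(filters).forEach(function (name) {
    // searchParams.set() takes care of URL-encoding each value.
    url.searchParams.set(name, String(filters[name]));
  });
  return url.toString();
}

// The example call from above: industries at sumlevel 0
// where num_ppl is greater than 10 million.
const exampleUrl = buildQuery({
  show: "naics",
  sumlevel: 0,
  where: "num_ppl:>10000000"
});
console.log(exampleUrl);
```

Keeping the filters in an object makes it easy to reuse the same call with one parameter changed, which is the kind of reusability the API was designed for.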

Attributes

On all of the visualization sites we build, we make a distinction between “attributes” and “data”. Some people refer to this as the difference between data and metadata. Imagine an API request for the counties that employ the most Web Developers. The data that is returned contains only the IDs of the locations requested, so a user who wants to show any greater detail about these locations (such as their full names or an associated image) needs to load the location attributes. Separating these 2 types of data saves bandwidth, since the metadata is provided only when requested. This can be a significant performance boost if a page makes 10 different data calls that all show the same attributes.

The following is an example of fetching the geography attributes, 1 of 4 different types of attributes available on Data USA (the other 3 being occupations, industries and higher education courses).

Geographies

http://api.datausa.io/attrs/geo/

Example result:

{
  data: [
    [
      "beverly-ma", 
      "Beverly, MA",
      "geo/16000US2505595.jpg", 
      "https://flic.kr/p/5BsPCU", 
      "160", 
      null, 
      "Mr.TinDC", 
      "Beverly", 
      "16000US2505595", 
      "Beverly, MA"
    ],
    ...
  ],
  headers: [
    "url_name", 
    "display_name",
    "image_path", 
    "image_link", 
    "sumlevel", 
    "image_meta", 
    "image_author", 
    "name_long", 
    "id", 
    "name"
  ]
}
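To illustrate how a client might use this response, here is a minimal sketch that indexes attribute rows by their `id`, so that data rows (which carry only IDs) can be matched to display names. The function name `indexAttrs` is an assumption made for illustration, and the sample object contains only the single row shown above.

```javascript
// One-row sample shaped like the attribute response above.
const geoAttrs = {
  data: [
    ["beverly-ma", "Beverly, MA", "geo/16000US2505595.jpg", "https://flic.kr/p/5BsPCU",
     "160", null, "Mr.TinDC", "Beverly", "16000US2505595", "Beverly, MA"]
  ],
  headers: ["url_name", "display_name", "image_path", "image_link", "sumlevel",
            "image_meta", "image_author", "name_long", "id", "name"]
};

// Hypothetical helper: builds a Map from attribute ID to an object keyed by header.
function indexAttrs(attrs) {
  const idColumn = attrs.headers.indexOf("id");
  const lookup = new Map();
  attrs.data.forEach(function (row) {
    const entry = {};
    attrs.headers.forEach(function (header, i) {
      entry[header] = row[i];
    });
    lookup.set(row[idColumn], entry);
  });
  return lookup;
}

const geoById = indexAttrs(geoAttrs);
console.log(geoById.get("16000US2505595").display_name); // prints "Beverly, MA"
```

Because the lookup is built once, any number of data calls on the same page can share it.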

Sumlevels

Summary levels, or sumlevels as they’re often abbreviated, are the different depths at which the data is aggregated. For geography, Data USA supports the following sumlevels:

Nation
State
County
Public Use Microdata Area (PUMA)
Metropolitan Statistical Area (MSA)
Census Designated Place

Some nest cleanly, like Nation > State > County, while others, like Nation > MSA, don’t. We say the first example nests cleanly because a county can only ever be found in a single state, whereas MSAs don’t respect state lines and often cross them. Boston’s MSA (Boston-Cambridge-Newton, MA-NH Metro Area) includes parts of Massachusetts and southern New Hampshire. Some MSAs even include an entire state as well as parts of others. For example, the Washington D.C. MSA (Washington-Arlington-Alexandria, DC-VA-MD-WV Metro Area) includes all of D.C. and parts of Virginia, Maryland and West Virginia.

We support filtering attributes in the API with the “sumlevel” query param.

http://api.datausa.io/attrs/geo/?sumlevel=state

The Data USA API wiki has further documentation on this.
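The geography IDs in the samples in this article appear to follow the Census Bureau GEOID convention, in which the first three characters are the summary level code (the Beverly row has a `sumlevel` of "160" and an `id` starting with `160`). Assuming that convention holds for every ID, the sketch below splits an ID into its parts and shows why counties and places nest within states: their codes begin with the two-digit state FIPS code. The function names and the code-to-label table are illustrative and not part of the API.

```javascript
// Census summary level codes, labeled with the geography types listed above.
const SUMLEVEL_NAMES = {
  "010": "Nation",
  "040": "State",
  "050": "County",
  "160": "Census Designated Place",
  "310": "Metropolitan Statistical Area (MSA)",
  "795": "Public Use Microdata Area (PUMA)"
};

// Hypothetical helper: splits an ID such as "16000US2505595" into its parts.
function parseGeoId(id) {
  const parts = id.split("US");
  const sumlevel = parts[0].slice(0, 3);
  return {
    sumlevel: sumlevel,
    level: SUMLEVEL_NAMES[sumlevel] || "Unknown",
    code: parts[1] || ""
  };
}

// Hypothetical helper: returns the parent state ID for counties and places,
// or null for levels handled here as not nesting in a single state.
function parentState(id) {
  const geo = parseGeoId(id);
  const nestsInState = geo.sumlevel === "050" || geo.sumlevel === "160";
  return nestsInState ? "04000US" + geo.code.slice(0, 2) : null;
}

console.log(parseGeoId("16000US2505595").level); // prints "Census Designated Place"
console.log(parentState("16000US2505595"));      // prints "04000US25" (FIPS 25 is Massachusetts)
```

An MSA ID passed to `parentState` returns null, which mirrors the point above: an MSA has no single parent state.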

Data

Constructing a data query is slightly more complicated due to the nuances of each specific dataset being queried. For example, we have demographic data from the American Community Survey (ACS) broken down by location and language. A query to find the top non-English languages spoken in Beverly, MA would look something like the following.

http://api.datausa.io/api/?show=language&year=latest&geo=16000US2505595&sumlevel=all

Breaking down each of the query parameters will help us understand what’s happening here. The first step is to find the table we want to query in the documentation, which tells us which variables we can use. The table is called acs.ygl_speakers, which, although a bit obscure, tells us that the underlying data source is the American Community Survey (ACS), that the primary keys are Year, Geography and Language (YGL), and that the central data values are the number of speakers.

show=language: A ‘show’ parameter is always required to indicate which attribute is being shown.
year=latest: Sets the year of data being fetched; ‘latest’ always maps to the latest year available.
geo=16000US2505595: The specific geography to filter the data on, in this case Beverly, MA. A full list of geographic IDs can be found in the classifications section of the data documentation.
sumlevel=all: Tells the API to look at all available sumlevels for the variable being shown. In this case, languages do not have summations so we can just use “all”.

And here is a sample of the data returned:

{
  data: [
    [
      35636.0, 
      590.0,
      1.23418, 
      2014,
      "16000US2505595", 
      "002"
    ],
    ...
  ],
  headers: [
    "num_speakers", 
    "num_speakers_moe",
    "num_speakers_rca", 
    "year", 
    "geo", 
    "language"
  ]
}

Since the “show” in our query parameters was “language” we know that each data point here corresponds to a different language spoken, along with the returned number of speakers values. In the documentation we can find a list of the IDs that correspond to each language.
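Because the query asks for the top languages, a client would typically sort the returned rows by `num_speakers`. Here is a small sketch of that step. The helper name `topRows` is hypothetical, and only one row of the sample comes from the response above; the other is a made-up placeholder so that the sort has something to order.

```javascript
const response = {
  data: [
    [100, 10, 1, 2014, "16000US2505595", "placeholder"], // made-up row, for illustration only
    [35636, 590, 1.23418, 2014, "16000US2505595", "002"] // row from the sample above
  ],
  headers: ["num_speakers", "num_speakers_moe", "num_speakers_rca", "year", "geo", "language"]
};

// Hypothetical helper: returns the `limit` rows with the largest value in `column`.
function topRows(json, column, limit) {
  const index = json.headers.indexOf(column);
  if (index === -1) {
    throw new Error("Unknown column: " + column);
  }
  return json.data
    .slice() // copy so the original response is not reordered
    .sort(function (a, b) { return b[index] - a[index]; })
    .slice(0, limit);
}

const top = topRows(response, "num_speakers", 1);
console.log(top[0][response.headers.indexOf("language")]); // prints "002"
```

Looking up the column position from `headers` rather than hard-coding it keeps the code working if the column order ever differs between tables.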

Data Fold

One nuance of the way the API returns data (that astute readers may have already noticed) is that the rows and headers are separate. This is an optimization made to reduce the bandwidth footprint when requesting large numbers of rows. By stating the headers only once and returning arrays of arrays, the file size can be significantly smaller. This does come at the cost of requiring the user to “fold” their data with the headers if they prefer to work with an array of JSON objects. Below is a reusable function written for this express purpose.

function datafold(json) {
  // Turn each row (an array of values) into an object keyed by header name.
  return json.data.map(function(row) {
    return json.headers.reduce(function(obj, header, i) {
      obj[header] = row[i];
      return obj;
    }, {});
  });
}
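To make the trade-off concrete, the sketch below applies `datafold` to a response and compares the serialized size of the two forms. The 100 rows are simply the sample row from earlier repeated, so the exact sizes are not representative of real responses; only the direction of the difference matters.

```javascript
// Same function as above, repeated so this example runs on its own.
function datafold(json) {
  return json.data.map(function (row) {
    return json.headers.reduce(function (obj, header, i) {
      obj[header] = row[i];
      return obj;
    }, {});
  });
}

const sampleRow = [35636, 590, 1.23418, 2014, "16000US2505595", "002"];
const compact = {
  // The sample row repeated 100 times: placeholder data for size comparison only.
  data: Array.from({ length: 100 }, function () { return sampleRow.slice(); }),
  headers: ["num_speakers", "num_speakers_moe", "num_speakers_rca", "year", "geo", "language"]
};

const folded = datafold(compact);
console.log(folded[0].language); // prints "002"

// The folded form repeats every header name in every row, so it serializes larger.
const compactSize = JSON.stringify(compact).length;
const foldedSize = JSON.stringify(folded).length;
console.log(compactSize < foldedSize); // prints true
```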

Conclusion

We’ve now seen how the Data USA API is structured, how to formulate both attribute and data requests, and what the returned JSON data looks like. Good next steps would be to spend a bit more time with the API documentation and to inspect various API requests in the browser console while browsing the Data USA front-end.