Data USA Launch: A Developer’s Perspective

Alex Simoes
Datawheel Blog

--

Now that the veil has been lifted from Data USA, a project that’s consumed the majority of my waking hours over the past 14 months, I wanted to share a bit of personal perspective on the project.

Going Big

Here at Datawheel, our bread and butter is building massive open data visualization engines. Building a visualization engine rather than a single data visualization changes the technical landscape enormously. It puts a much heavier burden on the user interface and overall site design than on any single visualization. This is the reason we built D3plus in the first place: to abstract and modularize the visualization code away from user interface concerns.

The goal with the visualizations used throughout the site was to make the data as simple and digestible as possible. We spent a lot of time thinking about the best way to do this and would often eschew a more complex design for one that was simpler and easier to understand. Take, for example, the first visualization shown on the location profiles, a very simple bar chart comparing the selected location to a relevant comparison group. The Chicago profile shows Chicago’s median household income compared with that of the counties it is in, the metro area, the state and finally the US as a whole.

6 Different Datasets All Under One Roof

From a purely technical standpoint, the biggest challenge was marrying the 6 datasets under one database. Each of these datasets is maintained by a different organization or bureau in the US government, so getting them to work together was novel territory. We solved this by developing a logic layer that sits between the API and the front-end and always returns the most relevant datapoint.

This logic layer does smart replacements on attributes so that data is always returned for a given request, even if the specific query didn’t return any direct results. To give an example, we have wage data from the American Community Survey (ACS) Public Use Microdata Sample (PUMS) for every state and every Public Use Microdata Area (PUMA), but not for individual cities. So when a user visits the site and navigates to, say, the Beverly, MA profile, instead of removing wage data we substitute the missing datapoint with data from the Salem, Beverly, Gloucester & Newburyport Cities PUMA.
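To make the substitution idea concrete, here is a minimal Python sketch of a fallback lookup. It is not the actual Data USA code: the identifiers, the `FALLBACKS` table, the `resolve` function, and the wage numbers are all hypothetical placeholders chosen for illustration.

```python
from typing import Dict, List, NamedTuple, Optional

# Hypothetical wage table keyed by geography id. The numbers are
# placeholders, not real ACS PUMS figures.
AVG_WAGE: Dict[str, float] = {
    "puma:salem-beverly-gloucester-newburyport": 111.0,
    "state:ma": 222.0,
}

# Hypothetical fallback chain: which broader geographies may stand in
# for a geography that has no direct data, in order of preference.
FALLBACKS: Dict[str, List[str]] = {
    "place:beverly-ma": [
        "puma:salem-beverly-gloucester-newburyport",
        "state:ma",
    ],
}


class Datapoint(NamedTuple):
    requested: str     # geography the user asked for
    source: str        # geography the value actually comes from
    value: float
    substituted: bool  # True when a fallback geography was used


def resolve(geo: str,
            table: Dict[str, float],
            fallbacks: Dict[str, List[str]]) -> Optional[Datapoint]:
    """Return the most relevant datapoint for geo, or None if nothing fits."""
    # Try the requested geography first, then each fallback in order.
    for candidate in [geo] + fallbacks.get(geo, []):
        if candidate in table:
            return Datapoint(geo, candidate, table[candidate], candidate != geo)
    return None


if __name__ == "__main__":
    print(resolve("place:beverly-ma", AVG_WAGE, FALLBACKS))
```

Returning the source geography and a `substituted` flag alongside the value is one way to let a front-end tell the visitor that the number shown comes from the surrounding PUMA rather than the city itself.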

The Importance of Design

As best we could, we tried to separate the different aspects of the design of this tool. First came the data design: how will the data be exposed to the front-end for use in visualizations? Next came the content design, because what you choose to show, as well as how you show it, can paint a very different portrait of the underlying data. Lastly, the graphic design was incorporated into the tool, giving prominence to specific navigation features and highlighting the data visualizations as clearly as possible. Because a website like Data USA had never been built before, we had to think about these aspects iteratively and see how they could work together. We determined early on that because the visualizations rely so heavily on color, the overall site needed to use as little color as possible. Yet, because the visualizations all use the same palette (rearranged based on the underlying data), we landed on the idea of using photography in the splash image at the top of every profile. This enabled us to give each page a distinct feel and unique character based on that location’s geography.

Learning From Past Mistakes

One of the concepts we learned from building a predecessor project, DataViva (dataviva.info), was the idea of structuring the site around profiles of different types and providing cross-links between them. Users can begin navigating the site by searching for something specific they may be interested in (say, their hometown or current occupation) and then explore other profiles related to their original search via these cross-links. The true value of the site comes from this user experience: rather than overwhelming visitors with the massive volume of data, it shows them the topics most relevant to their interests and lets them explore this network of interwoven profiles themselves.

SEO

It turns out this methodology of building profiles that all link to one another is also great for search engine optimization. Another important lesson learned from previous sites was the importance of writing coherent narratives for each of the visualizations we were showing. One of the technologies built for Data USA that enabled this was a custom YAML configuration text parser. This allowed us to write “mad-libs” style sentences like:

The most common jobs in <<geography>> by number of employees, are <<top show=occupations|order=num_emp|sort=desc|limit=3>>.

This sentence then gets run through a parser on the backend to fill in the blanks and ship a fully constructed dynamic narrative to the page.
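As a rough illustration of how such a fill-in-the-blanks pass might work, here is a short Python sketch. It is not the Data USA parser: the `render` and `join_names` functions, the shape of the `tables` argument, and the sample rows are assumptions made for this example, and the template is hard-coded rather than loaded from a YAML file.

```python
import re

# Matches a <<...>> placeholder and captures its contents.
TAG = re.compile(r"<<(.+?)>>")


def join_names(names):
    """Join names as 'A', 'A and B', or 'A, B, and C'."""
    if len(names) <= 1:
        return "".join(names)
    if len(names) == 2:
        return " and ".join(names)
    return ", ".join(names[:-1]) + ", and " + names[-1]


def render(template, variables, tables):
    """Fill <<name>> tags from variables and <<top ...>> tags from tables."""
    def fill(match):
        body = match.group(1).strip()
        if body.startswith("top "):
            # Parse "show=occupations|order=num_emp|sort=desc|limit=3".
            params = dict(pair.split("=", 1) for pair in body[4:].split("|"))
            rows = sorted(
                tables[params["show"]],
                key=lambda row: row[params["order"]],
                reverse=params.get("sort") == "desc",
            )
            limit = int(params.get("limit", 1))
            return join_names([row["name"] for row in rows[:limit]])
        # Plain variable substitution, e.g. <<geography>>.
        return str(variables[body])

    return TAG.sub(fill, template)


if __name__ == "__main__":
    template = (
        "The most common jobs in <<geography>> by number of employees, "
        "are <<top show=occupations|order=num_emp|sort=desc|limit=3>>."
    )
    # Placeholder rows, not real employment data.
    tables = {"occupations": [
        {"name": "Occupation A", "num_emp": 10},
        {"name": "Occupation B", "num_emp": 40},
        {"name": "Occupation C", "num_emp": 30},
        {"name": "Occupation D", "num_emp": 20},
    ]}
    print(render(template, {"geography": "Example City"}, tables))
```

With the placeholder rows above, this prints "The most common jobs in Example City by number of employees, are Occupation B, Occupation C, and Occupation D."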

The Value of Open Data

The value of making data open is that you are putting it back into the hands of the citizens who were responsible for it in the first place. The problem is that making it open is only one part of the solution. Many of the datasets used in Data USA have been open and publicly available for years, but until they are curated and organized coherently for a non-technical user, they are useless. We’ve made non-technical users the priority when designing the site so that the data is organized in a way that a human can understand and navigate. This means giving visual representation to as much of the data as possible, along with textual descriptions in the form of coherent sentences. This not only provides a narrative for the visualization you are looking at but also helps search engines find and index the data being shown, so that users who search the web for “most common jobs in Boston Massachusetts” are given an answer and not a spreadsheet of data.
