Making Census Data Open, Accessible and Useful for Real People

Published in Nightingale · 13 min read · Oct 12, 2020

Dot density mapping of housing types in London, house = blue, apartment = magenta (Credit: Ahmad Barclay/ONS)

This year and next, 10-year national population and housing censuses are taking place in scores of countries around the world (India, Brazil, Nigeria, USA, Germany, Turkey, etc.), and with them will come huge volumes of data to understand and communicate.

Back in April this year, I joined the data visualisation team at the Office for National Statistics (ONS) in the UK as the first person working full-time to explore the kinds of visual outputs we will be developing for the 2021 Census here, contributing to a broad aim of “making census data open, accessible and useful for real people.” This has been a kind of deep-dive effort to get to grips with a bewildering array of possible formats and technologies available to us.

A couple of months after starting this effort, I had the opportunity — along with my ONS colleague Henry Lau — to host a Data Visualization Society Slack discussion to open up our own internal conversations with members of the DVS community.

This article shares some of the insights from this discussion, and also from the exploratory research and prototyping that we’ve been doing within the ONS. My aim here is to write the article that I wish I could have read all those months ago when I started work in my present role, to offer some starting points for people interested in visually communicating census data, and to identify where things may be headed in this space over the next couple of years.

Slack poll results showing that the majority of respondents have made use of census data in their work.

Why should we care about census data?

Unlike the kind of national indicator data that we get from sources like the UN, World Bank and OECD, census data drills down to a much more granular level, allowing us to explore demographic, social and economic data right down to a neighbourhood level.

Navigating through the hierarchy of UK census geographies (Credit: Ahmad Barclay/ONS)

The level of granularity of the census data released by different countries varies a lot, but in the case of the UK, the smallest areas amount to roughly 100 households, or 200–300 individuals, and each of these areas may have upwards of 50 tables of data, covering age, gender, ethnicity, social class, employment, health, housing and a range of associated indicators.

In some countries, under-counting of communities in the census may lead directly to a reduction in central government funding (Credit: Mona Chalabi)

Why do we gather all this data? Well, one primary driver for collecting such granular data is in order to properly allocate government resources in an evidence-led manner, for example, to make sure that there will be enough primary school places in an area that’s seen a recent baby boom, or that health services are properly funded where the population is getting older.

What does it mean to have this data? Aside from allocating government resources, census data helps us to build up a picture not only of how a country is changing over time, but also of the huge variation within a country, from city to city, rural to urban, or even from neighbourhood to neighbourhood, and what this tells us about our societies, for good or bad… And, without doubt, there is an important role for data visualisation in comprehending and communicating such a complex picture.

The art of the possible: formats

One of our main motivations in reaching out to the DVS community was to harness an unparalleled collective dataviz “hive mind” in seeking out some of the most interesting and effective examples of big population data visualisation already out there on the web, and the emerging technologies and design patterns that are soon likely to be at the forefront.

In terms of formats, we saw five broad typologies highlighted in the discussion, which I’ll describe below with examples:

Exploratory platforms
Micro-outputs
Guided experiences
Personalised/localised outputs
Games/gamified outputs

1. Exploratory platforms

“Exploratory” data visualisation tools and platforms typically aim to offer users a means to visually navigate large and complex datasets in an intuitive way, while allowing them to find their own meaning. These will typically incorporate a range of different data visualisation typologies, often within a dynamic dashboard-type interface.

Left: DataShine UK census mapping (Credit: Oliver O’Brien & James Cheshire) | Right: Data USA area profile (Credit: Datawheel/Deloitte)

One of the biggest platforms for me was DataShine. You could map pretty much any question that was asked in the UK’s 2011 Census. — Henry Lau

It should be noted that the more effective exploratory tools and platforms tend to be designed around the kinds of insights that users are likely to gain from the data (e.g., trends, correlations or comparisons), rather than offering an entirely agnostic “chart builder”-type interface.

2. Micro-outputs

Static data visualisation may feel a little old school alongside some of the emerging formats, but there is still huge value in the work of synthesising a large dataset to pull out the little nuggets of information that can help to communicate the big picture.

Left: “A day in the life of Americans” (Credit: Matt Hong) | Right: “200 Years of American Immigration” (Credit: Pedro Cruz, et al.)

Matt Hong’s data comic on “A day in the life of Americans” stands out to me. It’s not just because of the unconventional format, but it’s also how it humanised the statistics in the time use survey. — Alexandra Khoo

These micro-outputs are also valuable as a social media and mobile-friendly entry point into more exploratory formats. However, one related format that does seem to be in decline, at least on the web, is the large-scale static infographic, which is no longer accessible to the majority of web users who browse primarily on mobile devices.

3. Guided experiences

This typology combines the narrative-driven nature of data journalism and static infographics with the exploratory possibilities of dynamic and interactive data visualisation. At present, guided experiences are virtually synonymous with the “scrollytelling” format, where one or more embedded data visualisations respond dynamically to a user scrolling down a web page, with synchronised text, captions — and occasionally audio — along the way to tell the story, sometimes with an opportunity to explore the data more freely at the end.

Left: “Rich School, Poor School” (Credit: ABC Australia) | Middle: College mobility in the US (Credit: Raj Chetty/NY Times) | Right: “The Evolution of the American Census” (Credit: Alec Barret/The Pudding)

The ABC in Australia are the masters of turning our census data into scrollytelling pieces. — Darragh Murray

Scrollytelling has been pretty ubiquitous in dataviz for a while now, and we should probably be expecting new successor formats to emerge. But — regardless of the specific format — guided, narrative-driven dataviz is effective for its potential to bridge the divide between the freedom of “exploratory” dataviz, and the instant accessibility of “explanatory” formats like traditional infographics.

4. Personalised/localised outputs

This is a kind of catch-all title for dataviz products that harness some form of “automation” (described in the next section) to generate outputs localised to different areas and/or personalised to the characteristics or interests of a particular user.

Left: “How Many Households Are Like Yours” (Credit: NY Times) | Right: SALCO “semi-automated local content” (Credit: BBC News Labs)

This category includes “robo-journalism” — or “semi-automated journalism” — where data- and logic-driven article templates are used to produce hundreds or thousands of localised variants of a given story. There are also some interesting examples in this category of projects that take a big population dataset (like census) and allow users to find out more about people/households that are like them.
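To make the idea of a data- and logic-driven template more concrete, here is a minimal sketch in Python. It is not how any particular newsroom tool works; the `AreaStats` fields, the area names, the figures and the 1% "barely changed" threshold are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class AreaStats:
    """Hypothetical per-area figures; names and values are illustrative only."""
    name: str
    population_now: int
    population_before: int


def describe_change(area: AreaStats) -> str:
    """Return one localised sentence, choosing the wording from the data."""
    change = area.population_now - area.population_before
    pct = abs(change) / area.population_before * 100
    # Assumed editorial rule: changes under 1% are reported as "barely changed".
    if pct < 1:
        return f"The population of {area.name} has barely changed, at {area.population_now:,}."
    verb = "grew" if change > 0 else "fell"
    return f"The population of {area.name} {verb} by {pct:.1f}% to {area.population_now:,}."


# Made-up example areas; a real run would loop over hundreds or thousands.
areas = [
    AreaStats("Area A", 11_000, 10_000),
    AreaStats("Area B", 9_500, 10_000),
    AreaStats("Area C", 10_020, 10_000),
]
stories = [describe_change(a) for a in areas]
```

The same template yields a differently worded sentence for each area, depending on whether its figures went up, went down or stayed roughly flat. A real template would chain many such rules together and would still need editorial review.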

5. Games/gamified outputs

An interactive called “Spotlight” helped people to understand how Australia’s 2011 census data related to them as individuals (Credit: Australian Bureau of Statistics)

This is another broad category, covering data-driven games and quizzes, and also other kinds of outputs that have game-like qualities. These kinds of outputs may not be part of a core informational offering, but can often prove to be a valuable way of engaging a broader public/audience that might not see the census as relevant to them.

For example, my ONS colleagues had an experience where the release of a “how well do you know your area?” style quiz overloaded the website’s API due to unexpectedly heavy usage.

The art of the possible: technology

Moving on to technology, we discussed three main areas for innovation:

Front-end libraries
Back-end infrastructure
Automation

1. Front-end

In terms of front-end technologies, the two that stood out from the discussion for me were emerging WebGL-based visualisation libraries and JavaScript frameworks that pre-compile code to optimise performance (Svelte being the best-known example).

I use Svelte with d3 and it’s a blast […] Svelte has a ton of things built in that I would have needed to import a bunch of libraries for. — Sandy Guberti-Ng

WebGL offers the possibility to render hundreds of thousands of data points in the browser, and even to animate them in 2D or 3D with the benefit of hardware acceleration. This offers a potential step-change for big dataviz compared to SVG or 2D Canvas rendering, and is already built into a few visualisation-focused libraries, for example Mapbox GL and the vis.gl suite (which includes deck.gl and kepler.gl), as well as the more all-purpose Three.js.

Left: High-performance rendering with Mapbox GL (Credit: Morphocode) | Middle: Mapping individual daily commutes in New Zealand using Three.js (Credit: Karim Douieb) | Right: Dot density mapping calculated on-the-fly in Mapbox GL (Credit: Ahmad Barclay/ONS)

In the case of pre-compiled JavaScript frameworks, the logic is that they do away with the runtime abstraction layer that frameworks like React and Vue rely on (the “virtual DOM”), and instead compile your components at build time into only the “vanilla” JavaScript code your app needs to run. This means that, in most scenarios, they can offer faster performance and a smaller file size. So you get the benefits of a component-based framework (importing all the bits you need from dataviz libraries like D3.js, plus a bunch of built-in features), but at the same time it rarely feels like an overkill approach, even for smaller projects.

2. Back-end

In relation to back-end data delivery, our discussion on the DVS Slack remained fairly superficial, so I thought it better here to look at some of the challenges we’re trying to tackle within the ONS in delivering data on demand to tens of thousands of concurrent users.

First, in the UK Census, we’re looking at a raw dataset of over 50 million “rows” of data (one for each individual resident), each containing scores of observations (on age, gender, employment, social class, ethnicity, health, housing, etc.), all of which have the potential to be cross-tabulated. Furthermore, this data always needs to be aggregated in a way that doesn’t disclose data that could identify individual people or households, so it can’t just be delivered raw.
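As a rough illustration of what “aggregate, but don’t disclose” can mean in practice, the sketch below cross-tabulates a handful of made-up person-level records and blanks out any cell with a very small count. Small-cell suppression is used here only as a simple stand-in; it should not be read as the disclosure control method the ONS applies, and the threshold of 3 is an arbitrary assumption.

```python
from collections import Counter

# Made-up person-level records: (age_band, employment_status).
records = [
    ("16-24", "employed"), ("16-24", "employed"), ("16-24", "employed"),
    ("16-24", "unemployed"),
    ("25-64", "employed"), ("25-64", "employed"), ("25-64", "employed"),
    ("25-64", "employed"), ("25-64", "unemployed"), ("25-64", "unemployed"),
    ("25-64", "unemployed"),
]

MIN_CELL_COUNT = 3  # assumed threshold, chosen only for illustration


def cross_tabulate(rows, min_count=MIN_CELL_COUNT):
    """Count each (age_band, status) cell; replace small counts with None."""
    counts = Counter(rows)
    return {cell: (n if n >= min_count else None) for cell, n in counts.items()}


table = cross_tabulate(records)
# The single unemployed 16-24 record is suppressed rather than published.
```

Real disclosure control is considerably more involved than this. For example, a suppressed cell can sometimes be recovered from row and column totals, which is one reason this kind of logic needs to live in the data platform rather than in each dataviz app.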

Within ONS, there are at least three parallel data architecture solutions being explored for the 2021 Census, with each one playing some kind of balancing act between speed and flexibility (e.g., allowing highly customised queries). I should note that these platforms aim to deliver data for a lot of different purposes, with dataviz apps only being one of them.

At one end of the spectrum, we are investigating pipelines (from simple batch scripts to cloud-based solutions) to pre-package chunks of data into flat files (JSON, CSV, etc.) that would allow us to return common datasets instantly without the need to query an API. At the other end of the spectrum, a custom-built “flexible table builder” would allow people to access pretty much any cut of data they’re interested in, with built-in controls to prevent individual households from being identified, offering maximum flexibility with some inevitable degree of tradeoff on speed.
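The flat-file end of this spectrum can be sketched as a simple batch script. The example below is an assumption about what such a pipeline might look like in its most basic form, not a description of the ONS pipelines: the area codes, indicator names and the `write_area_files` helper are all hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical, already-aggregated rows; codes and values are made up.
rows = [
    {"area_code": "A01", "indicator": "population", "value": 1200},
    {"area_code": "A01", "indicator": "households", "value": 510},
    {"area_code": "A02", "indicator": "population", "value": 980},
]


def write_area_files(rows, out_dir: Path) -> list[Path]:
    """Group rows by area and write one small JSON file per area."""
    by_area: dict[str, dict[str, int]] = {}
    for row in rows:
        by_area.setdefault(row["area_code"], {})[row["indicator"]] = row["value"]

    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for code, indicators in sorted(by_area.items()):
        path = out_dir / f"{code}.json"
        path.write_text(json.dumps(indicators, sort_keys=True))
        paths.append(path)
    return paths


# Write to a temporary folder here; a real pipeline would publish to static hosting.
out_dir = Path(tempfile.mkdtemp()) / "areas"
written = write_area_files(rows, out_dir)
```

Once files like these sit on a static file server or CDN, a dataviz app can fetch the file for a given area directly, with no query being run at request time.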

There are also efforts being invested in solutions somewhere in the middle of this spectrum, for example an API that can flexibly aggregate small chunks of data from a graph database, while attempting to square the performance circle by caching results, so that any common query (like names/codes for geographic areas) can be returned almost instantaneously.
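The caching idea can be shown with a few lines of Python using the standard library’s `functools.lru_cache`. This is only a sketch of the general pattern: the `query_backend` function stands in for a slow database lookup, and the area codes and names are invented.

```python
from functools import lru_cache

# Stand-in for a slow database lookup; the codes and names are made up.
_AREA_NAMES = {"A01": "Area One", "A02": "Area Two"}
backend_calls = 0


def query_backend(area_code: str) -> str:
    """Pretend to hit the database, counting how often that happens."""
    global backend_calls
    backend_calls += 1
    return _AREA_NAMES[area_code]


@lru_cache(maxsize=1024)
def area_name(area_code: str) -> str:
    """Return an area's name, only querying the backend on a cache miss."""
    return query_backend(area_code)


# Four requests, but only two distinct codes, so only two backend queries.
names = [area_name(code) for code in ["A01", "A02", "A01", "A01"]]
```

An in-process cache like this only helps a single server process. A production API would more likely put a shared cache in front of the database, but the principle of answering repeated common queries without recomputing them is the same.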

3. Datasets for automation

Datasets for area profiles on Census Reporter are pre-generated to reduce the need for multiple API calls (Credit: Census Reporter)

Another topic that we’re exploring is how to deliver data formats suitable for semi-automated journalism. A typical use case here, to generate data-driven “area profiles” for hundreds or thousands of similar geographies, requires us to produce a data file for each of those places, which might include a series of indicators on population age, density and ethnicity, along with housing, health and labour market conditions. It should also show how these indicators have changed over time, and how they rank alongside all the other similar geographies and/or compare to regional- and national-level indicators.

Producing these datasets will almost certainly require “pipelines” (as referred to above) to pre-package data, since aggregating this data on-the-fly would mean requesting data from many different tables at once for an essentially static output, which simply doesn’t make sense at scale. The Knight Foundation-funded Census Reporter project adopts this kind of approach to formatting datasets for the tens of thousands of areas covered by its area profiles.
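A pre-packaged area profile of the kind described above needs each indicator to carry its own context, such as its rank among similar areas and its difference from the national figure. The sketch below shows one way that enrichment step might look. The indicator, the values, the national benchmark and the field names are all made up, and this is not the Census Reporter data format.

```python
# Hypothetical indicator: median age per area (values are made up).
median_age = {"Area A": 34.0, "Area B": 41.5, "Area C": 38.0, "Area D": 45.0}
national_median_age = 40.0  # assumed national benchmark, for illustration


def build_profiles(values: dict[str, float], benchmark: float) -> dict[str, dict]:
    """Attach a rank (1 = highest value) and a benchmark difference to each area."""
    ordered = sorted(values, key=values.get, reverse=True)
    ranks = {area: position + 1 for position, area in enumerate(ordered)}
    return {
        area: {
            "value": value,
            "rank": ranks[area],
            "of": len(values),
            "vs_national": round(value - benchmark, 1),
        }
        for area, value in values.items()
    }


profiles = build_profiles(median_age, national_median_age)
```

Because ranks depend on every other area, they are cheap to compute once in a batch and awkward to compute per request, which is part of why pre-packaging suits this use case.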

The art of the useful

But, to return to the core aim of “making census data open, accessible and useful for real people” — whether they be policy makers, journalists, NGOs or activists fighting for their communities — and, given that time and resources are finite, what kinds of outputs should be prioritised?

Detroit census response rate map (Credit: City of Detroit)

While most of DVS would consider this to be a very simplistic visualization, the overlay that includes internet access SIGNIFICANTLY repositioned the narrative on the digital divide in the City of Detroit. — Mary Aviles

We ran a poll on the Slack discussion. Admittedly, this is not the most representative of samples, but the results were interesting. Almost nobody favoured exploratory platforms, which have long been a staple of “big data” visualisation, and static “micro-outputs” won equally little support. Instead, the favourites were “guided experiences” and “personalised/localised outputs”.

Slack poll that shows respondents prefer guided experiences and personalised/localised outputs.

I think this response speaks to a growing realisation that there is a relatively low level of data literacy in the public at large, and that it is important to offer accessible stories and insights that can connect with the specific issues and places that matter to individual people.

It is also worth noting here that our early user testing suggests that the expert users capable of engaging with a more open-ended data exploration experience often prefer to download the raw data and visualise it themselves using whichever one of the vast array of dataviz tools on the market best suits their specific needs. Thus, it’s quite possible that custom-built exploratory dataviz platforms and chart builders may be serving an ever-diminishing audience.

However, at the other end of the spectrum, I tend to disagree with the apparent writing off of made-for-mobile/social media dataviz, which I would see as playing an important role in bringing insights to a wider general public audience. In this age of “fake news”, I would see clear and relatable visuals from trusted sources as an essential means to disseminate accurate information, as we have seen during the COVID-19 pandemic.

The future?

I hope this article has helped to highlight the broad spectrum of possibilities in the realm of census dataviz in 2020, and some of the formats and web technologies that we may see becoming more commonplace in the near future.

In relation to my own work, this is only the first step in a pretty long journey. In the UK, the next census is planned to take place in March 2021, with the data to be released sometime in mid-2022, and I think it’s fair to say that we can expect to see important innovations in this space between now and then.

Acknowledgements

A big thank you to everyone who contributed in one way or another to this article. Firstly, to my ONS colleagues Henry Lau, Rob Fry and Darren Waters, who helped to frame the questions for the Slack discussion; to Molly Pettit, convener of the DVS “Topics in DataViz” Slack channel; and to Nightingale editors Jason Forrest and Isaac Levy-Rubinett for their patience with this write-up.

And finally, of course, a big thank you to the many DVS members who contributed to the Slack discussion, including Mary Aviles, Alexandra Khoo, Amanda Makulec, Ben Oldenburg, Andrew Whitby, Nicole Edmonds, Darragh Murray, Phil Hawkins, Alok, Karl Tryggvason, Sandy Guberti-Ng, Guillaume Meigniez, Jan Zak, Steffen Kühne, Alan Wilson and Darren Lim.

Ahmad Barclay is an architect and UX designer presently working on census data visualisation with the Office for National Statistics (ONS) in the UK. He was previously a founding partner with the award-winning social impact dataviz studio Visualizing Impact and co-initiated the historical mapping and open data project Palestine Open Maps.