Nothing Standard About It: Age and Open COVID-19 Data

Nicole Contaxis
Published in digitalinterests · 3 min read · Apr 29, 2022

In the summer of 2020, we investigated state-level COVID-19 open data, including how age was reported. Every state reported age in categories: a state would report positive cases for people within age ranges rather than each individual’s age. While this is a reasonable and expected provision to help protect the privacy of the people represented in that data, we were surprised to see how much these age groups varied across states. In Arizona, everyone aged 0–20 is reported in one group, while in Utah the first age bin is 0–1. In Hawaii, everyone aged 60 and older is included in one category, a deeply concerning choice considering the clinical implications of age for COVID-19.

The graph below demonstrates how each state separated age groups and the variety across states. For example, in this graph, South Carolina has 22 separate bins for age while Hawaii has 3.

Figure 1: Created by Hope Muller

The above graph of our findings wonderfully demonstrates the amount of variety in age groups, but it does not help us understand why this variety exists. As a data librarian in an academic medical center, I have some experience with clinical data standards for collection and reporting, and as our findings became apparent, I began to wonder if any of this variety could be tied to clinical data standards.

Clinical data standards make collecting, aggregating, and reporting on clinical and biomedical research data more uniform. The standards limit variability in data reporting, including age groups. Although the word ‘standard’ suggests otherwise, there can be multiple data standards used to describe the same phenomenon. Navigating these standards and understanding which to use when is a challenge, even for the most informed researchers.

I investigated whether the age categories found in the state-level open COVID-19 data corresponded to any well-known clinical data standards. First, I looked to the Medical Subject Headings (MeSH) from the National Library of Medicine. MeSH provides definitions for medically relevant terms, so it provides age ranges for specific developmental stages (e.g., Adolescent, Aged). Yet, after tracking all age-related terms in MeSH, not one seemed to correspond to how any of the states created their age categories.
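To make the comparison concrete, here is a minimal sketch of how one might check a set of age bins against a reference scheme. It is not the method used in our investigation, which was done by inspection. The bin values below are hypothetical placeholders, not actual state data or actual MeSH ranges, and the function names are my own.

```python
from typing import List, Optional, Tuple

# An age bin is (lower, upper), both inclusive.
# upper=None means the bin is open-ended, e.g. "60 and older".
Bin = Tuple[int, Optional[int]]


def _upper(b: Bin) -> float:
    """Treat an open-ended upper bound as infinity."""
    return float("inf") if b[1] is None else b[1]


def matches(bins: List[Bin], reference: List[Bin]) -> bool:
    """True if the two schemes contain exactly the same bins."""
    by_lower = lambda b: b[0]
    return sorted(bins, key=by_lower) == sorted(reference, key=by_lower)


def nests_within(bins: List[Bin], reference: List[Bin]) -> bool:
    """True if every bin falls entirely inside some reference bin,
    so counts could be summed up into the reference scheme."""
    return all(
        any(ref[0] <= b[0] and _upper(b) <= _upper(ref) for ref in reference)
        for b in bins
    )


# Hypothetical reference scheme and two hypothetical reporting schemes.
reference_scheme = [(0, 17), (18, 64), (65, None)]
fine_bins = [(0, 4), (5, 17), (18, 44), (45, 64), (65, 74), (75, None)]
coarse_bins = [(0, 20), (21, 59), (60, None)]

fine_matches = matches(fine_bins, reference_scheme)
fine_nests = nests_within(fine_bins, reference_scheme)
coarse_nests = nests_within(coarse_bins, reference_scheme)
```

In this sketch, the finer scheme does not match the reference exactly but can be rolled up into it, while the coarser scheme cannot, because a bin such as 0–20 straddles a reference boundary. That second situation is the one that makes cross-state comparison difficult.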

I then decided to look to the Centers for Medicare & Medicaid Services (CMS) data as well as the Surveillance, Epidemiology, and End Results (SEER) dataset from the National Cancer Institute. While neither of these truly counts as a standard, they do contain large amounts of data on human subjects. In other words, these datasets, controlled by federal agencies, need to protect the anonymity of the subjects and accurately demonstrate patterns of health and illness nationally. As such, they seem like good touchstones for a state trying to report COVID-19 data yet respect the anonymity of its residents. Yet, neither the CMS nor the SEER data used age categories that corresponded to any of the states’ open COVID-19 data.

When I began to look into how the FDA reports age, I saw what may be part of the issue with employing standards for age categories. In its guidance on reporting age for medical device clinical studies, the agency writes, “When evaluating age-specific data, clinical studies should plan to group subjects by age groups as appropriate for the disease condition.” In other words, the FDA’s stance is that age categories need to be generated in a way that reflects context. Studies on a device that works only for the geriatric population, for example, may require more granular categories than a device that is intended for use by the public at large. Considering the sheer amount we still need to learn about COVID-19, it therefore seems unreasonable to set a standard for age categories. Unlike, for example, a device for heart health, we are not yet sure what information is clinically relevant.

Our ignorance about COVID-19 thus makes age categories far more complicated than the team originally realized. For this reason, I would argue that, rather than creating a standard for age categories, it would be helpful for analysts to explain how they chose their age categories. These explanations could help those re-using the data to contextualize their own analyses and could assist in later attempts to understand national-level patterns.
