A Crash Course on the Census Bureau’s TopDown Algorithm

A Primer for Local and State Census Data Users

Izzy Youngs
Georgetown Massive Data Institute
6 min read · Mar 29, 2022

--

The US Census Bureau is responsible for counting the entire population of the United States — and ensuring that individual responses are confidential, as required by Title 13 of the US Code, which prohibits the Census Bureau from releasing identifiable data:

“furnished by any particular establishment or individual.”

— Title 13, Section 9, of the U.S. Code

That means that everything that the Census Bureau releases must meet rigorous standards of confidentiality before being published. The US Census Bureau produces many data products from the Decennial Census program that satisfy several needs, including apportionment, redistricting, and distributing federal funds. The data also reveal how demographics in our communities are changing over time and provide critical information about housing and households. Beyond the decennial census, additional intercensal data files, such as the American Community Survey (ACS) and population estimates, provide even more information.

The TopDown Algorithm (TDA) is one method used in the Census Bureau’s disclosure avoidance system, often referred to as DAS. The US Census Bureau used the TDA for the redistricting data file (PL 94–171) and will use it for the Demographic and Housing Characteristics (DHC) File. The TDA was not used on the apportionment file; the released apportionment counts were not altered by any disclosure avoidance algorithm.

What is the TopDown Algorithm?

The Census Bureau applied TDA to census data to balance accuracy and privacy. If the data were perfectly accurate, it would be easy to re-identify respondents; if they were perfectly private, the data would be unusable for a variety of important applications, including public health, local planning, and federal and state programs.

Finding an acceptable balance between privacy and accuracy involves setting the privacy-loss budget: the more of the budget that is spent, the less noise is added and the more accurate the published statistics, at the cost of weaker privacy protection. The US Census Bureau has a multi-step process to implement the TDA:

  1. As part of the pre-processing step, the microdata from the 2020 census are tabulated. Also, certain statistics are set as “invariants” — they do not have noise added to them. Invariants for the Redistricting Data File included the total population at the state level, the number and type of occupied group quarters facilities at the block level, and the number of housing units (whether occupied or not) at the block level.
  2. To reduce re-identification risks, the US Census Bureau implements a mathematical definition of privacy called differential privacy. A method that satisfies differential privacy adds carefully calibrated random noise to the statistics before they are released, limiting how much any single respondent’s data can affect the published results and thereby preserving the privacy of respondents.
  3. After the positive and negative noise values are added, some of the tabulations need further adjustment to avoid impossible results. For example, counts cannot be negative, and subpopulation totals at the county level cannot be inconsistent with the state totals, so post-processing rules further adjust the statistics (a simplified sketch of steps 2 and 3 follows this list).
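
To make steps 2 and 3 concrete, here is a minimal, purely illustrative sketch in Python. The counts are invented, the Laplace mechanism stands in for the Bureau’s actual noise distributions (the production TDA uses discrete, integer-valued noise and a far more elaborate optimization-based post-processing), and the rounding-and-clipping step is only a stand-in for that post-processing. Note how spending a larger privacy-loss budget (epsilon) means less noise and therefore more accuracy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented block-level counts for one tract (step 1 tabulates these from microdata).
true_counts = np.array([12, 0, 83, 5, 41])

def laplace_mechanism(counts, epsilon):
    """Add Laplace noise with scale 1/epsilon (sensitivity 1 for simple counting queries)."""
    return counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)

# Step 2: a larger privacy-loss budget (epsilon) means less noise and more accuracy.
for epsilon in (0.1, 1.0):
    noisy = laplace_mechanism(true_counts, epsilon)
    # Step 3 (simplified post-processing): round to integers and forbid negative counts.
    released = np.clip(np.round(noisy), 0, None).astype(int)
    print(f"epsilon={epsilon}: {released}")
```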

This process produces statistics that are most robust at higher levels of geography, where the added random noise is small relative to the counts, and least robust at the smallest levels of geography, where the noise is large relative to the counts. Previous censuses used different methods of preserving privacy, such as data swapping. But, in the last few decades, advances in standard computing technology have increased re-identification risks. At the same time, new privacy-preserving frameworks have been developed. For the 2020 decennial census, the Census Bureau used one of these new frameworks: differential privacy.
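
As a back-of-the-envelope illustration with invented numbers, the same noise draw that is invisible at the state level can be substantial for a single block:

```python
# Invented counts: a +/-5 change is negligible for a state but large for one block.
state_pop, block_pop, noise = 5_000_000, 40, 5

print(f"state relative error: {noise / state_pop:.4%}")  # 0.0001%
print(f"block relative error: {noise / block_pop:.1%}")  # 12.5%
```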

Did the Census Bureau Count Everyone?

The US Census Bureau’s primary obligation is to produce counts of the population for apportionment, and it did NOT alter the published apportionment counts. However, the US Census Bureau also produces data with richer geographic and demographic detail for cities, local governments, public health agencies, and other users, and that detail carries a much higher re-identification risk. The Census Bureau applies TDA to some of these datasets, altering statistics slightly so that individuals cannot be re-identified. The TDA was used or will be used on the following datasets:

  • PL 94–171, used for redistricting
  • Demographic and Housing Characteristics (DHC) File (the successor to Summary File 1)
  • Demographic profiles

Where Can the Effects of the TopDown Algorithm Be Seen?

The 2020 Census Disclosure Avoidance System, using the TDA, injects small positive and negative changes into the statistics that will be published; it does not change the underlying microdata. These small changes protect the privacy of individuals in the statistics while maintaining the accuracy of the overall statistical distribution. A few invariant counts receive no noise, but there are far fewer invariants in the 2020 decennial census than in the 2010 and earlier decennial censuses, which means noise was added to more measures in 2020 than in previous censuses. For example, comparing categories between the 2010 and 2020 Redistricting Data Files, many more state-level statistics are noisy in the 2020 census results than in 2010, when the primary disclosure avoidance method was swapping.

In more rural and homogeneous communities, individual outliers are more likely to be re-identified. While noise may be added to all non-invariant statistics, post-processing is needed more in rural areas and among small or homogeneous populations. It is easier to preserve the privacy of individuals in larger, denser, and more diverse communities.

However, like whack-a-mole, tamping down errors through post-processing in one area can cause distortions in another. If population counts are shifted toward rural areas, slight drops in population in more urban counties are necessary to keep the overall state population constant. These changes to substate census counts may complicate resource allocation and planning. For example, a change in the under-18 or over-85 population may affect school or aging-services planning, and shifts among racial categories may distort the picture of actual community conditions.
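
As a rough, hypothetical illustration of that constraint (the counts are invented, and the simple proportional adjustment below is only a stand-in for the Bureau’s actual optimization-based post-processing), noisy county counts can be rebalanced so they still sum to the invariant state total:

```python
# Invented example: noisy county counts must still sum to the invariant state total.
state_total = 1_000_000                       # invariant: never altered by noise
noisy_counties = [402_310, 348_776, 249_051]  # after noise; sums to 1,000,137

surplus = sum(noisy_counties) - state_total   # 137 people too many statewide
# Spread the correction across counties in proportion to size, so a gain in one
# county is offset by small losses in the others.
adjusted = [round(c - surplus * c / sum(noisy_counties)) for c in noisy_counties]
# Absorb any leftover rounding gap in the largest county.
adjusted[adjusted.index(max(adjusted))] += state_total - sum(adjusted)

print(adjusted, sum(adjusted))  # sums exactly to 1,000,000
```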

Is the TopDown Algorithm Good or Bad?

These changes in reported statistics are intended to reduce re-identification risks so that the US Census Bureau meets its federally mandated requirement to preserve respondent confidentiality. TDA distorts some counts in lower-level geographies, particularly impacting the utility of block-level data in the Redistricting Data File, and likely affecting the accuracy of demographic and household data in the forthcoming DHC. A growing number of researchers question the Census Bureau’s claims about the risks of re-identification; the methods and assumptions used in its risk assessments have not been published or peer reviewed. Legacy disclosure avoidance systems based on suppression and swapping also affected data quality, and there was even less transparency into those methods: in past censuses, extreme outliers were swapped through an ad hoc, behind-the-scenes process. Differential privacy and the TDA are transparent, so the methods can be discussed more openly, and the amount of privacy protection applied to the data can be “tuned” to provide more protection to certain geographic or demographic statistics.

The full impact of differential privacy, other aspects of the TDA (such as post-processing), COVID-19 disruptions, college closures, presidential executive orders, and changes to census data collection and imputation methods is still unclear. To explore local conditions and indicators that may help explain local impacts on data quality in the 2020 decennial census, please check out the Massive Data Institute’s Census County Assessment Tool.
