Lifting the Fog on the Census, Differential Privacy, and Swapping (Greenwood)

This is a guest post from Ruth Greenwood:

Civil rights groups, redistricting litigators, political scientists, and computer scientists largely fell into two camps when it came to privacy protections for the 2020 census: Team TopDown and Team Swapping.* The acrimony between the groups arose over concerns that the 2020 decennial census data would either not protect privacy, not produce sufficiently accurate and unbiased estimates, or both. This led to lawsuits and a general mistrust between the teams and the Census Bureau (“Bureau”).

The concerns on both sides were legitimate and had real-world consequences: if the Bureau cannot guarantee the privacy of its data, then people won’t answer the census and it will produce inaccurate and biased estimates. If inaccuracies are random and small, then census data can still be fit for use. But if population counts, and in particular racial demographic data, are inaccurate in a biased way, then everything from federal funding to political representation could be skewed. And, as we all should know by now, if political power is skewed it is almost inevitable that it will tilt in favor of white people and away from people of color.

Jeff Zalesin and I had a hunch that with a bit more data we could find out whether the two methods produced accurate and/or unbiased estimates. So, we dipped our toe into the pond, and after many discussions with experts like Cynthia Dwork, Gary King, Terri Ann Lowenthal, and Terry Ao Minnis, we decided that getting the intermediate files used to create the decennial data, the noisy measurements files (“NMFs”), could allow really smart people who know how to use the data (i.e. not us) to investigate the accuracy and bias questions.

It turns out that getting the files was a little harder than we thought: first we (Cynthia, Gary, and I) asked publicly; then we (us three plus around 50 academics) asked directly; then we (the Election Law Clinic, on behalf of Prof. Justin Phillips) asked formally via a FOIA; then we (the Election Law Clinic and Selendy Gay Elsberg PLLC) filed a lawsuit to enforce that FOIA; then we found out that half the data we needed had been deleted by the Bureau; and finally, once the Bureau had recreated and released that first half of the data, we settled the lawsuit on the condition that the other half of the data would be forthcoming.

Thankfully, this Odyssean adventure seems to have paid off.

Kenny et al. have analyzed the first NMF released and answered the questions of whether TopDown and Swapping are accurate and unbiased. Happily, what they find is good news for both teams. They report that TopDown and Swapping are “similarly accurate in terms of . . . bias and noise,” and that “[t]hese patterns hold across census geographies with varying population sizes and racial diversity.” In terms of racial demographics, TopDown and Swapping produce data with almost identical (and extremely small) levels of inaccuracy and bias. It also turns out that the higher error rates TopDown was known to have in one area (racial demographics for small geographies, like census blocks) also appear when Swapping is used.

The main concern raised by the Kenny et al. paper is that people who select Hispanic/Latino for their ethnicity, or who select multiple races, tend to get much noisier (less accurate), but not necessarily more biased, numbers regardless of whether TopDown or Swapping is used. This is a problem associated with the separate ethnicity and race categories. And there is a whole separate debate about how that should be resolved.

Two other comments from the Kenny et al. paper are that TopDown introduces errors that can be relatively large in geographies with small populations (while Swapping does not add these errors), and that the NMF itself has too much noise to be used in place of the final decennial data at any level.
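For readers who want to see the difference between “bias” and “noise” concretely, here is a minimal sketch in Python. It is a toy illustration only, not the Bureau’s TopDown algorithm and not the method Kenny et al. used: the function names, the Laplace noise, the noise scale of 5, and the two example populations are all assumptions made up for this example. It adds zero-mean random noise to a true count many times, then reports the average signed error (bias), the average absolute error (noise), and the noise as a share of the true count.

```python
import random
import statistics


def noisy_count(true_count, scale, rng):
    """Add zero-mean Laplace noise to a count (hypothetical stand-in for a privacy mechanism).

    The difference of two independent exponential draws with mean `scale`
    follows a Laplace distribution centered at zero.
    """
    return true_count + rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def error_summary(true_count, scale, trials=20_000, seed=0):
    """Simulate repeated noisy releases of one count and summarize the errors."""
    rng = random.Random(seed)
    errors = [noisy_count(true_count, scale, rng) - true_count for _ in range(trials)]
    bias = statistics.fmean(errors)                    # average signed error
    noise = statistics.fmean(abs(e) for e in errors)   # average absolute error
    return {"bias": bias, "noise": noise, "relative_noise": noise / true_count}


# Assumed example geographies: a 20-person block and a 200,000-person county.
small_block = error_summary(true_count=20, scale=5.0)
large_county = error_summary(true_count=200_000, scale=5.0)

print("small block: ", small_block)
print("large county:", large_county)
```

Under these assumptions, both geographies show bias near zero and about the same absolute noise, but that noise is a far larger share of the 20-person block than of the 200,000-person county. That is the sense in which an unbiased method can still produce errors that are relatively large in geographies with small populations.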

The 2020 NMF was just released today, so, provided it doesn’t show that something went wrong in the application of TopDown in 2020 (I look forward to the next in the series of Kenny et al. papers), we can rest easy knowing that the 2020 decennial census data is as accurate and unbiased as prior decades’ data, while still protecting privacy.

Does this mean that Team TopDown v. Team Swapping was all a lot of sound and fury signifying nothing? To the contrary. It seems likely that Kenny et al.’s earlier paper, along with detailed submissions from groups like MALDEF and AAJC, put pressure on the Bureau to revise its TopDown process to improve accuracy and reduce bias. And without the use of TopDown, there were real concerns that increasingly sophisticated external actors could have launched a successful reidentification attack on the new census data.

The real mystery here turns out to be why the Bureau took two years and a lawsuit to release data that would have quelled fears and improved relations with all involved. Why didn’t they listen to groups like the Leadership Conference on Civil and Human Rights when they sought clarity on whether the proposed DAS changes would cause people of color to be even more underrepresented than they already are in the decennial census data? Why didn’t they respond when over 50 academics asked for the data they needed to verify claims the Bureau had made?

Perhaps if the Bureau is a little more transparent and responsive in the leadup to the 2030 census, we can all be more confident it will produce a fair and accurate count.

* A quick note on terminology: The Census Bureau refers to its work to meet the statutory requirement of privacy protection for census responses as its “Disclosure Avoidance System” (DAS). The DAS for 2020 is referred to as “TopDown” and includes both the application of a differentially private algorithm and post-processing. The term “Swapping” refers to the DAS used by the Bureau in 1990, 2000, and 2010 (whereby records for households whose information is likely to lead to the identification of individuals are swapped with records for households in nearby census blocks). A recent paper notes that Swapping can “satisf[y] the classic notion of pure differential privacy.”
