To Reclaim Data Anonymity, Give Up The Proofs

Researcher Paul Francis tells Teb’s Lab why we might have to give up on mathematically certain methods of data anonymization in order to advance the field.

Courtesy of Paul Francis

There is an axiom in security that your security doesn’t need to be that strong; it only needs to be stronger than the next guy’s.

No one really knows how to do static anonymization except on a case-by-case basis. The Census Bureau, for example, releases statically anonymized data, but to do that they go through a very manual process of deciding how much they can get away with. For example, they have to choose how much to obscure each column. Plus, the census data is typically very simple. Most columns, such as gender or race, have just a small number of choices.

You can say with mathematical confidence that something is differentially private, but the data often loses its utility in the process.
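To make that trade-off concrete, here is a minimal sketch of one standard differentially private technique, the Laplace mechanism applied to a simple counting query. It is an illustration only, not the method Francis or anyone quoted here uses; the function names, the example count of 12, and the epsilon values are all assumptions chosen for the demonstration.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(true_count, epsilon, rng):
    """Answer a counting query (sensitivity 1) with Laplace noise of scale 1/epsilon."""
    return true_count + laplace_noise(1.0 / epsilon, rng)


def mean_abs_error(true_count, epsilon, trials=20000, seed=0):
    """Average distance between the noisy answer and the true count."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += abs(private_count(true_count, epsilon, rng) - true_count)
    return total / trials


if __name__ == "__main__":
    # Hypothetical small group of 12 people; a smaller epsilon means stronger privacy.
    for eps in (1.0, 0.1):
        print(f"epsilon={eps}: typical error about {mean_abs_error(12, eps):.1f}")
```

The typical error grows as 1/epsilon, so under a strict privacy setting the noise on a small count can be about as large as the count itself. The guarantee still holds, but the answer may no longer be useful.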

Even if this specific story isn’t true, stuff like that can definitely happen. It’s hard to do good data analysis under normal circumstances, and people make mistakes all the time. It takes a lot of skill just to use raw data properly, and anonymization only makes it harder. So, if you don’t really understand what’s going on under the hood, you can make a big mistake.

Nobody wants to write a headline like, “Company Able To Do Anonymity a Little Bit Better.”

In encryption, people gave up on mathematical proofs and certainty in order to go faster, and this is similar to what we have in anonymity. We want high utility in our data, and this means we have to give up on the proof, just as we gave up the proof in encryption to gain speed. In most cases we will not get the utility we need from the data using these mathematically certain methods.


