Using Simple Arithmetic to Fill Gaps in Categorical Data

And potentially circumventing privacy protections in the process.

One of our clients is using employment figures from the Bureau of Labor Statistics (BLS) in a machine learning model we’re helping them build. We recently noticed that there are frequently gaps in the data:

BLS Employment Data for Phoenix, Arizona is missing data for Oilseed and Corn Farming, among other industries.

Walk-through

We can use iterative proportional fitting (IPF) if we have data we can organize in a matrix or table, and we have both row totals and column totals. In the BLS example, this might look like the following (this is a fictional example, and the missing value is 8):

To perform IPF we need a two-dimensional table of data with row and column totals.
Seed each gap with a random guess.
Divide the seed by the current row total to turn it into a proportion.
Now we multiply the proportion by the true row total.
Then we’ll perform the same operation for each column.
  1. For each row, adjust the seeds to be proportions of the current row total.
  2. Multiply each proportion by the true row total.
  3. For each column, perform #1 and #2.
  4. Go to #1 until you converge.
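As a minimal sketch, here is how those steps might look in NumPy on a toy table with a single gap. The 3x3 numbers, the `ipf_fill` name, the seed of 1.0, and the tolerance are all assumptions made for illustration. So is the choice to rescale only the seeded cells and leave the known cells untouched.

```python
import numpy as np

# Hypothetical 3x3 table. The cell at row 0, column 1 plays the suppressed value.
true_table = np.array([
    [12.0,  8.0, 20.0],
    [ 5.0, 15.0, 10.0],
    [ 3.0,  7.0, 30.0],
])
row_totals = true_table.sum(axis=1)  # the published row totals
col_totals = true_table.sum(axis=0)  # the published column totals

gap = np.zeros_like(true_table, dtype=bool)
gap[0, 1] = True

def ipf_fill(table, gap, row_totals, col_totals, tol=1e-9, max_iter=10_000):
    """Rescale the seeded cells by rows, then by columns, until the totals match."""
    x = table.copy()
    for _ in range(max_iter):
        # Row pass: seed / current row total * true row total.
        row_factor = row_totals / x.sum(axis=1)
        x = np.where(gap, x * row_factor[:, None], x)
        # Column pass: the same operation against the column totals.
        col_factor = col_totals / x.sum(axis=0)
        x = np.where(gap, x * col_factor[None, :], x)
        # Stop once every row and column total is reproduced.
        row_err = np.abs(x.sum(axis=1) - row_totals).max()
        col_err = np.abs(x.sum(axis=0) - col_totals).max()
        if max(row_err, col_err) < tol:
            break
    return x

seeded = true_table.copy()
seeded[gap] = 1.0  # arbitrary positive seed for the gap
filled = ipf_fill(seeded, gap, row_totals, col_totals)
print(filled[0, 1])  # should land very close to 8, the value we hid
```

With one gap the seed creeps toward the hidden value a little more on every pass, which is why the loop needs a tolerance and an iteration cap instead of a fixed number of passes.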

Let’s try it!

First let’s generate some fake data:

Generating fake data we’ll use to test IPF.
Removing small values to simulate BLS suppression.
Filling suppressed values with a random seed.
The distribution of errors of our random seeds.
The meat and potatoes of IPF.
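The code behind these captions is not reproduced here, so what follows is only a sketch of what such an experiment could look like, not the original implementation. The table size, the suppression threshold, the seed range, and the `ipf_fill` helper are all assumptions. As in the walk-through, only the seeded cells are rescaled, and convergence is not guaranteed for every pattern of gaps.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Fake data: a 20x20 table of positive counts, plus its true totals.
truth = rng.integers(1, 200, size=(20, 20)).astype(float)
row_totals = truth.sum(axis=1)
col_totals = truth.sum(axis=0)

# Simulate suppression: hide every cell below a (hypothetical) threshold.
THRESHOLD = 5
gap = truth < THRESHOLD

# Fill each suppressed cell with a random positive seed.
seeded = truth.copy()
seeded[gap] = rng.uniform(1.0, 10.0, size=int(gap.sum()))

# Errors of the raw seeds, before any fitting.
seed_errors = seeded[gap] - truth[gap]

def ipf_fill(table, gap, row_totals, col_totals, tol=1e-6, max_iter=5_000):
    """Rescale only the seeded cells, rows then columns, until totals match."""
    x = table.copy()
    for _ in range(max_iter):
        x = np.where(gap, x * (row_totals / x.sum(axis=1))[:, None], x)
        x = np.where(gap, x * (col_totals / x.sum(axis=0))[None, :], x)
        worst = max(np.abs(x.sum(axis=1) - row_totals).max(),
                    np.abs(x.sum(axis=0) - col_totals).max())
        if worst < tol:
            break
    return x

filled = ipf_fill(seeded, gap, row_totals, col_totals)
fit_errors = filled[gap] - truth[gap]

print("suppressed cells:", int(gap.sum()))
print("mean |error| of seeds:", np.abs(seed_errors).mean())
print("mean |error| after IPF:", np.abs(fit_errors).mean())
```

Comparing the two error summaries is the quickest check on whether the fit improved on the random seeds. A histogram of `seed_errors` and `fit_errors` would show the same thing in more detail.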

Intuitions

You can think of this process as solving a system of equations. The row and column totals are giving you constraints you can use to solve for the missing values. If there are too many missing values, it’s possible you won’t be able to converge [2].
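To make the system-of-equations view concrete, the sketch below writes one linear equation per row total and per column total, with one unknown per gap, and then checks whether those equations pin the gaps down. The toy table and the `gap_equations` helper are hypothetical, and the rank check is our own illustration of why some gap patterns cannot be recovered.

```python
import numpy as np

def gap_equations(table, gap, row_totals, col_totals):
    """Build A @ u = b: one equation per row/column total, one unknown per gap."""
    cells = list(zip(*np.nonzero(gap)))
    known = np.where(gap, 0.0, table)  # values in gap cells are never read
    n_rows, n_cols = table.shape
    A = np.zeros((n_rows + n_cols, len(cells)))
    for k, (i, j) in enumerate(cells):
        A[i, k] = 1.0           # the gap contributes to its row total
        A[n_rows + j, k] = 1.0  # and to its column total
    # Right-hand side: what each total has left once known cells are subtracted.
    b = np.concatenate([row_totals - known.sum(axis=1),
                        col_totals - known.sum(axis=0)])
    return A, b, cells

truth = np.array([
    [12.0,  8.0, 20.0],
    [ 5.0, 15.0, 10.0],
    [ 3.0,  7.0, 30.0],
])
rows, cols = truth.sum(axis=1), truth.sum(axis=0)

# Three gaps that the totals determine uniquely.
gap = np.zeros_like(truth, dtype=bool)
gap[0, 1] = gap[1, 1] = gap[1, 2] = True
A, b, cells = gap_equations(truth, gap, rows, cols)
solution, *_ = np.linalg.lstsq(A, b, rcond=None)
determined = np.linalg.matrix_rank(A) == len(cells)
print(dict(zip(cells, solution.round(6))), determined)

# Four gaps in a 2x2 block: more unknowns than independent equations.
block = np.zeros_like(truth, dtype=bool)
block[:2, :2] = True
A2, b2, cells2 = gap_equations(truth, block, rows, cols)
underdetermined = np.linalg.matrix_rank(A2) < len(cells2)
print(underdetermined)
```

In the second pattern you can add an amount to two diagonal gaps and subtract it from the other two without changing any total, so the totals alone cannot identify the hidden values.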

The number of iterations needed increases with the number of missing values. We capped this experiment at 50,000 iterations. In general, we were able to converge with up to 1% of the values missing in a 100x100 matrix.

Ethical Ruminations

In 2007, researchers were able to de-anonymize individual users in the Netflix Prize dataset by linking movie ratings on Netflix to public data on IMDb. This is yet another reminder that even if you take pains to protect the privacy of your users, sufficiently motivated parties may still break your privacy protections in new and interesting ways. In this case, it’s not even particularly difficult. There are companies that have been selling de-suppressed versions of BLS data for years (presumably leveraging techniques like IPF) [3].

End Notes & Thanks

Many thanks to Nishan Subedi for careful reading and feedback.

Related Works

Related Works is a boutique consultancy in NYC that…
