Using Simple Arithmetic to Fill Gaps in Categorical Data

And potentially circumventing privacy protections in the process.

One of our clients is using employment figures from the Bureau of Labor Statistics (BLS) in a machine learning model we’re helping them build. We recently noticed that there are frequently gaps in the data:

BLS Employment Data for Phoenix, Arizona is missing data for Oilseed and Corn Farming, among other industries.

It turns out that if a particular combination of Industry and Geography, like Corn Farming in Phoenix, has a small number of establishments (places of employment), the BLS will suppress data for that Industry+Geography, in order to protect the confidential data about the firms there. Otherwise you could do things like figure out how many employees a private competitor has and how much they are paying them on average.

While this makes sense in the abstract, it makes the data tremendously difficult to use. Imagine a small town that has a single large employer, like a private hospital or university. They are a critical part of the local economy, but the BLS can tell you nothing about them! What if you want to calculate year-over-year growth rates to see which industries are up-and-coming in a particular geography? If healthcare jobs were suppressed last year, there’s no way for you to know how much they’ve grown even if the figures are published this year!

Suppressed categorical data is a common problem with public datasets. Iterative proportional fitting (IPF) is a dead-simple numeric technique you can use to fill gaps like these.

Walk-through

We can use IPF if we have data we can organize in a matrix or table, and we have row totals and column totals. In the BLS example, this might look like (this is a fictional example, and the missing value is 8):

To perform IPF we need a two-dimensional table of data with row and column totals.

The column totals in this toy example would be the total number of jobs in a given industry in the state of New York, and the row totals would be the total number of jobs in each city.

We start by taking a guess for the value in each gap. You can try to leverage other data sources to take an informed guess, or just select one randomly. We’ll call these guesses seeds.

Seed each gap with a random guess.

For each row, we’ll calculate the current total. Then we’ll divide each seed by the current row total to turn it into a proportion:

Divide the seed by the current row total to turn it into a proportion.

Then we’ll multiply each proportion by the true row total:

Now we multiply proportion with the true row total.

Congrats! We’ve successfully nudged our initial guess closer towards the missing value (8). Now perform the same operation for each column:

Then we’ll perform the same operation for each column.

That completes one iteration. Just repeat the process until it converges (i.e. the seeds stop changing). To summarize:

  1. Seed each missing value with a guess.
  2. For each row, adjust the seeds to be proportions of the current row total.
  3. Multiply each proportion by the true row total.
  4. For each column, perform #2 and #3.
  5. Go to #2 until you converge.

If you need whole numbers (like in the case of jobs), you’ll need to do a little rounding once you’re all done.

Let’s try it!

First let’s generate some fake data:

Generating fake data we’ll use to test IPF.

To simulate confidentiality policies like the BLS’, we’ll remove any data that is less than 10.

Removing small values to simulate BLS suppression.

Now we’ll initialize each gap with a random seed between 1 and 10:

Filling suppressed values with a random seed.

To see how well IPF works, we can compare our final result for each gap to the original values we removed. Here’s a graph of the absolute errors (i.e. abs(actual — random_seed)):

The distribution of errors of our random seeds.

On average, our random guesses are off by 2.5, though you can see some outliers that are much worse. How much closer can IPF get us?

Here’s what the whole thing looks like:

The meat and potatoes of IPF.

After ~30k iterations, our error looks like this:

The average seed is within 0.4 of its actual value. Pretty rad for 25 lines of code and ~30k iterations.

Intuitions

You can think of this process as solving a system of equations. The row and column totals are giving you constraints you can use to solve for the missing values. If there are too many missing values, it’s possible you won’t be able to converge [2].

The number of iterations needed increases with the number of missing values. This experiment was capped at 50K max iterations. Generally we were able to converge on up to 1% missing values in a 100x100 matrix.

You can also do this in more than two dimensions, assuming you have totals along those additional dimensions. For instance, imagine having employment figures broken down by a combination of industry, geography, and race. Each additional dimension gives you another constraint to triangulate against, and thus potentially improves your fit.

Ethical Ruminations

In 2007, researchers were able to de-anonymize individual users in the Netflix Prize dataset by linking movie ratings on Netflix to public data on IMDB. This is yet another reminder that even if you take pains to protect the privacy of your users, there is potential for sufficiently motivated parties to break your privacy protections in new and interesting ways. In this case, it’s not even particularly difficult. There are companies that have been selling de-suppressed versions of BLS data on the market for years (presumably leveraging techniques like IPF) [3].

There is undeniably tension between protecting the privacy of users (or in this case, companies) and making data usable for statistics, machine learning, and transparency. As it stands, leveraging BLS data to do anything interesting is remarkably difficult (ironic for an organization whose raison d’etre is publishing data for people to analyze).

As we enter the GDPR Era and try to recover from the jaw-dropping disclosures of how tech companies like Facebook have grossly abused our trust, navigating this tension has never been more important. But I don’t think the solution is to stop disclosing information, either publically, or within our organizations. Doing so would be akin to deciding in 1900 that the whole “industrial revolution” thing really isn’t for us and and turning back to horse-drawn carriages.

Differential Privacy is a field focused on the idea that you can introduce noise into a dataset, such that it’s provably impossible to know anything about a particular individual or data point, but you can still leverage the dataset in aggregate. I can imagine a scheme where rather than suppressing figures, noise is introduced to BLS data such that the confidentiality of individual firms is protected, but the data is still usable for time-series analysis. Apple’s embrace of this approach [4] gives me hope, but it’s still early days, and it’s not clear that their techniques will be directly applicable to data at the BLS’ much smaller scale, and fairly different set of analytic requirements.

Ultimately, I think we’ll need to pour way more resources into these sorts of problems if we hope to live in a just and data-empowered world.

End Notes & Thanks

Many thanks to Nishan Subedi for careful reading and feedback.

[1] — IPF is typically concerned with adjusting all the cells in the matrix wrt to column/row totals, and there are many different strategies for seeding cells. This blog post presents a variant where we hold known/true values constant, and only adjust unknown values.

[2] — Because we’re doing a weird variant of IPF, it’s not clear that typical assertions about convergence apply.

[3] — In the BLS’ defense, they have used software to audit the feasibility of re-generating suppressed data, and in those tests found that no suppressed cells could be regenerated within 20% of their true values. In all likelihood, the feasibility of using IPF-like techniques for BLS data depends highly on your choice of seeds, your ability to leverage other datasets for seeds, or your level of error tolerance.

[4] — Great nuts and bolts blog post here.