Aggregation isn’t a privacy guarantee. Here’s what we do instead.

Allison Bishop · Published in Proof Reading · Feb 3, 2022 · 22 min read

The average American believes they are exceptional. So why should they believe that aggregation alone achieves data privacy? If you are a tall tree in a world of short bushes, there’s no way to hide in a crowd.

And yet, a common practice in the financial industry is to assure clients that their sensitive data will only be used in an “aggregated” fashion. This is supposed to prevent revelation of individual clients’ secrets, but there is no guarantee that it does so. Aggregation is just a mechanism, not really a privacy definition in itself. Many critical questions go unaddressed. Questions like: what could someone learn from the aggregated data? What harm might come to participants from the shared aggregated data? What are the pros and cons of opting in or out?

TL;DR: Proof is instituting a much stronger standard for client privacy in our handling of data. Many aspects of this follow naturally from our conception that client data fundamentally belongs to the client rather than Proof, making policies like our prohibition against selling the data in any form a no-brainer. These kinds of policies were recently covered here. Now we’re going to flesh out the part of our privacy policy that is more mathematically delicate — how can we release basic statistics about our algorithms’ performance while still assuring client privacy in a meaningful way? Spoiler alert: we’re not just going to say “it’s aggregated.” Instead, we want to ensure that any statistic we release about our overall data set would not qualitatively change if we removed the data from any one particular client. Defining exactly what “would not qualitatively change” should mean and understanding the nature of this privacy protection is the subject of the rest of this post.

As part of our general culture of accountability through transparency, Proof would like to publicly release statistics about our trading performance. Of course, any such statistics are computed on the aggregated trading data of our clients. This raises the question: how do we make sure that the statistics we release do not compromise client privacy? And what does “compromise client privacy” mean, anyway? There are some things that are clearly compromises of client privacy (like sharing client data directly with others — don’t do that!), and some things that are clearly not (like simply saying the total number of clients Proof has). But there’s a lot of gray area in between. Is it ok to release an aggregated stat that is heavily influenced by one client’s data, because they are an outlier? We think not. But aggregation alone does not prevent this.

An effective privacy definition should address: if a malicious party is trying to infer information about our individual clients by looking at our released statistics, what barriers do they face in doing this? Are we protecting against damaging types of inference in all reasonable circumstances?

Many commonly used standards can fail this test. For example, saying data has been scrubbed of any “personally identifying information” does not prevent a malicious party from recognizing individuals in the data set through fields that were not classified as “personally identifying”. Think that removing names counts as “anonymous” data? Latanya Sweeney showed in 2000 that about 87% of the US population could be identified uniquely by only their 5-digit zip code, gender, and date of birth. Think that removing zip codes, gender, and birthdates will do the trick? Arvind Narayanan and Vitaly Shmatikov re-identified some users in the supposedly anonymized Netflix data set using their reported movie preferences alone!

These kinds of examples show that we have to stop thinking from the perspective of data providers who want to do the bare minimum and check a box saying they’ve legally achieved “privacy.” We have to start thinking like adversaries who want to mine any ill-advised data disclosures for all they’re worth.

Better privacy standards can be developed if we start from this adversarial mindset. Let’s put ourselves in the shoes of a hostile party who wants to learn something about one of Proof’s individual client’s data. We’ll call our hostile party “Company C” (where the “C” is for “creepy”) and we’ll call our targeted client “Client T”.

A natural question is: what can company C learn about client T in a world where we release our overall statistics that company C can’t learn in a world where we don’t? Unfortunately, having a fully satisfying and airtight answer to this question is largely impossible, due to the phenomenon of auxiliary information. That is the fancy academic way of saying — it depends on what company C already knows about client T. As an extreme example, imagine that before we release some aggregated performance stat for all of Proof’s trading, company C already knows that “the individual performance of client T was twice as good as the average trading performance.” They just don’t yet know the value of the average trading performance. This piece of knowledge relating the unknown value of client T’s individual performance to the unknown value of the overall performance is called “auxiliary information.” In this case, when we release the overall average statistic, company C can precisely infer the individual performance of client T, something they couldn’t do before.

This kind of example motivates a really fascinating and active area of scientific research known as “differential privacy.” But before getting into that, I do want to note — this kind of example is also why people hate academics. Is it bizarre and incredibly unrealistic to imagine that company C would know this ahead of time? Yes. Is this a case we realistically need to protect against? No. But ruling it out and narrowing our scope to more realistic scenarios requires defining boundaries on the kinds of auxiliary information company C might have. This can be difficult, risky, and highly domain specific.

Differential privacy takes a different approach. Instead of comparing the world where the overall stat is released to the world where it isn’t, differential privacy compares the world where client T participates in the overall stat to the world where client T opts out of their data being included. Crucially, in both worlds an overall stat is released. We can imagine (as we at Proof like to do!) that Proof has many clients, and so the overall stat may be quite similar in both of these worlds. In that case, our extreme example goes similarly badly for client T in both worlds. Company C can infer new information about the performance of client T based on the combination of their auxiliary information and the released average stat, but client T opting in or out of inclusion has little to do with it.

This suggests we could target a privacy definition that says something like: “whether you participate in the overall stat or not, an adversary will be able to infer roughly the same information about you, so you may as well participate.” We’d like to be able to say that this holds no matter what auxiliary information the adversary might have. However, one major barrier remains: there’s another type of auxiliary information that can be quite a pain to deal with.

What if, for example, company C knows nothing about the relationship between client T’s data and the average, but they do know the exact performance of everyone who is not client T? In a world where an overall average is released with client T’s data included, company C can then infer the exact performance of client T. In a world where client T opts out, company C learns nothing new about client T from the released average (which is just an average over individual performances that company C already knows).
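
To see how mechanical that inference is, here is a toy sketch in code. Everything in it is invented for illustration (an unweighted average over five hypothetical clients), but it shows how an exactly released average plus full knowledge of everyone else pins client T down completely:

```python
# Toy illustration with invented numbers: company C knows every client's
# value except client T's, plus the client count and the exact released average.
known_values = [3.0, -1.5, 2.0, 0.5]   # hypothetical per-client values (bps), everyone but T
num_clients = 5

released_average = 1.4                 # exact, un-noised average over all 5 clients

# The sum of all values must equal released_average * num_clients,
# so T's value is whatever is left over after subtracting the known values.
inferred_t = released_average * num_clients - sum(known_values)
print(f"Inferred value for client T: {inferred_t:.2f} bps")   # -> 3.00
```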

To grapple with this, differential privacy requires that some random noise be added to an overall statistic before release. The magnitude of this noise needs to be great enough to provide plausible deniability for the inclusion/exclusion of client T’s data.

As a concrete example, suppose we are computing a basic statistic like the percentage of shares that traded passively. Perhaps if we include client T’s data, we get an overall percentage of 62%, and if we exclude client T’s data, we get a percentage of 65%. [*note: all numbers in this blog post are made up hypotheticals and have nothing to do with Proof’s actual trading data.] It is pretty unavoidable that these exact numbers are going to be different, so if we release either number in its raw form, we are creating a divergence between the worlds with and without client T’s data that company C can notice and exploit with certain kinds of auxiliary information. But let’s suppose that we sample a random value r that tends to fall within plus or minus 5%. If r is +4%, for example, and we release 62% + r = 66% as an approximate statistic in the world where client T’s data is included, we could still explain this value’s release in the world without client T’s data being included, just with a different value of r, namely r = +1%, so that 65% + 1% = 66%. Since we don’t want the noise to be large (or our so-called approximations will become wildly inaccurate), we’ll want to place higher probability on smaller values of r, so this released value of 66% won’t be equally likely in both worlds, but it will at least be plausible in both. More quantitatively, we can place bounds on how different the probability of each possible released value can be when we compare the two worlds (the one with client T’s data and the one without).
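
Here is a minimal sketch of that plausible-deniability story in code. It uses Laplace noise (a common choice in the differential privacy literature) as a stand-in for “noise that tends to fall within plus or minus 5%,” and the 62%/65% figures are the same made-up hypotheticals as above:

```python
import math
import numpy as np

rng = np.random.default_rng(seed=7)

# Hypothetical raw values of the stat (percent of shares traded passively).
with_t = 62.0      # world where client T's data is included
without_t = 65.0   # world where client T's data is excluded

# Laplace noise with this scale keeps most draws within roughly +/- 5 points.
noise_scale = 2.5
released = round(with_t + rng.laplace(loc=0.0, scale=noise_scale))
print(f"released value: {released}%")

# The same released value can be explained from either world, just with
# different noise draws (e.g. +4 in one world vs. +1 in the other).
def laplace_density(x, scale):
    return math.exp(-abs(x) / scale) / (2 * scale)

p_with = laplace_density(released - with_t, noise_scale)
p_without = laplace_density(released - without_t, noise_scale)
print(f"relative likelihood (with T vs. without T): {p_with / p_without:.2f}")
```

The two explanations aren’t equally likely, but the ratio between their likelihoods is bounded, which is exactly the kind of quantitative bound referred to above.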

This is ultimately where the name “differential privacy” comes from. Using this technique of adding sufficient noise, we can assure data contributors that for any possible outcome, the probability of this outcome in the world with your data is not that different from the probability of this outcome in the world without your data. So contributing your data can only raise the probability of any undesired event by a small, controlled amount. Personally, I think this is pretty cool. Many people agree with me and have used differential privacy in practice, including in the data releases for the 2020 US census. Some people disagree with me and think that adding noise to data in any fashion is sketchy. Some such people challenged the census use of differential privacy in court, which was fun because I got to cover it as “breaking news” in the data privacy course I teach at City College. In case you’re curious about this, an excellent discussion of the issues at play written by several of the scientists involved in the development of differential privacy techniques can be found here.
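
For readers who want the textbook formalization (this is the standard definition from the differential privacy literature, not something specific to Proof): a randomized mechanism M is ε-differentially private if, for every pair of data sets D and D′ that differ in one participant’s data, and every set S of possible outputs,

```latex
\Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon} \cdot \Pr[\,M(D') \in S\,]
```

The smaller ε is, the closer the two worlds are in the probability they assign to any outcome you might care about.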

Blanket objections to the addition of noise aside, however, there is still a major challenge in applying this kind of framework to trading performance statistics. It lies in determining how much noise is “sufficient” to bridge the divide between the two worlds. Intuitively, this is a function of how much the raw value of the statistic can change when we remove one client’s data. This is called the “sensitivity” of the statistic. Differential privacy was largely designed with non-financial data in mind, where common statistical queries tend to have low sensitivity. For example, if you have a database of patient medical data, and you ask, “how many patients are diagnosed with disease D?”, this is a number that can only change by ±1 when you remove a single patient. If your overall population of patients is quite large and the disease D is reasonably common, adding noise that is sufficient to cover a change of ±1 is likely to be a lower-order term that doesn’t meaningfully affect the quality of your analyses. (In many cases, it’s probably smaller than the kinds of errors that are likely to be present in the data recording process or the kinds of issues that arise from sampling biases.)
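
As a sketch of what that looks like for a low-sensitivity counting query, here is the standard Laplace mechanism with purely hypothetical numbers (the count and the choice of ε are placeholders):

```python
import numpy as np

rng = np.random.default_rng(seed=11)

true_count = 4_812    # hypothetical: number of patients diagnosed with disease D
sensitivity = 1       # removing one patient changes the count by at most 1
epsilon = 0.5         # privacy parameter: smaller means more private, more noise

# Laplace mechanism: noise scale is sensitivity / epsilon.
noisy_count = true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)
print(f"released count: {noisy_count:.0f}")   # typically within a handful of the true count
```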

But many basic trading performance stats have high or even infinite sensitivity. Take slippage vs. arrival, for instance. This is just a weighted average of numbers that can individually be arbitrarily high or arbitrarily low. So there’s no bound on how high a positive value or how low a negative value we can get when we compute this on trading data. And if a certain client’s performance is a relative outlier, there’s no bound on how much their individual value can change the overall averaged stat. So if we want to add noise that is capable of covering potential outliers, our noise will be so high that our results will likely be meaningless.
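
A quick sketch of why that is (again with invented numbers): the leave-one-out change in a weighted average grows without bound as one client’s value does, so no fixed noise scale can cover it.

```python
def weighted_average(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Four "typical" hypothetical clients, plus one outlier whose slippage we crank up.
values = [2.0, -1.0, 0.5, 1.5]    # bps, invented
weights = [0.3, 0.3, 0.2, 0.1]
baseline = weighted_average(values, weights)

for outlier in (10.0, 100.0, 1000.0):
    with_outlier = weighted_average(values + [outlier], weights + [0.1])
    print(f"outlier={outlier:7.1f}  avg with outlier={with_outlier:8.2f}  "
          f"avg without={baseline:5.2f}")
```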

At this point, let’s rewind back to the motivation we had for adding noise in the first place. The direct example we considered was when company C knows literally everything else about the data except client T’s performance. Protecting ourselves against such an effectively evil and enterprising company C would certainly be nice, but it’s probably way over-kill for our use case. One could argue that it’s over-kill in all use cases, but I think that’s considerably less true in some domains, where auxiliary information is widely abundant through social media and a vast ecosystem of shadowy data brokers. But in the context of trading performance at least, most companies are pretty paranoid about their data, and things are generally kept pretty close to the chest. So company C having super granular knowledge about the performance of all other clients is not realistic.

We should be careful though, because company C having very granular knowledge about at least one client is realistic — company C could be a client of Proof themselves! In this case, company C could know everything there is to know about their own trading performance. In fact, Proof will gladly give them any info they could possibly want along these lines. There are also some basic general parameters about Proof’s overall trading that we should assume are known — in particular, the number of clients that Proof has, and the total notional value of shares traded. [These are metrics we release to investors, for example, as ways to track our growth.]

Overall, it seems reasonable to assume that company C might know: 1) the number of Proof clients, 2) the total amount that Proof traded in a given time frame (say quarterly), 3) the exact performance of a single client (themselves) for every possible stat, 4) what fraction of the total data set that known client represents. [Note: item 4) is a consequence of knowing both 2) and 3), but we feel it is helpful to spell it out separately, as it will be relevant below.]

Is aggregation enough to protect client T’s data from such a company C? Generally no — at least in the sense that the world where the released stat includes client T’s data might look quite different to company C than a world where the released stat does not include client T’s data. If client T’s performance is an outlier compared to other clients included in the aggregated stat, client T’s data might skew the released aggregated stat considerably.

Releasing stats that are meaningfully skewed by a single client’s data is also problematic for reasons beyond privacy concerns. Such stats are not very robust and not truly representative. Releasing them could be misleading and set false expectations for either the nature or consistency of Proof’s trading results.

So what can we do to release some aggregated stats while controlling these risks to privacy and robustness? Here’s a test that can prevent us from releasing a stat that is meaningfully skewed by any one client’s data:

1. Define a set of ranges for the stat. [E.g. for a stat that represents a percentage, we might specify that we will round to the nearest multiple of 10%, thereby grouping all possible answers that round to the same thing into a “range.”]

2. Compute the stat on our full data set with all client data aggregated. The range that this result falls into becomes the candidate for release.

3. For each client, remove all of that’s client’s trading data from the dataset. Re-compute the stat and the range it belongs to. If the range does not match the value that is a candidate for release, terminate the process and do not release any value. If the range does match, restore that client’s data and go on to repeat the check for the next client.

4. Once we have checked that all removals of a single client’s data result in a stat that falls within the same candidate range, release the range. (This means the range can be reported in sales materials, published on our website, etc.)

Before we get deep into a discussion of what this means from the perspective of the maliciously curious company C, let’s just quickly walk through a hypothetical example to practice how this mechanically works. Suppose that Proof has 5 clients, and we are computing a stat like “the percentage of shares that traded on dark pools (as opposed to exchanges).” First, we’ll pick a level of rounding — say we’d like to report this percentage to the nearest multiple of 10%. Next we’ll compute this stat over our whole dataset, and suppose we get 16%, which we round up to 20% as our candidate value. Then we’ll remove client 1’s data from our dataset, and compute the stat again over the remaining data. Suppose that this time, we get an average of 18%, which again rounds up to 20%. So far so good. Next we’ll restore client 1’s data and remove client 2’s data. When we compute the stat again on what remains, we get 21.5%, which once again rounds to 20%. We’ll repeat this same check for what happens when we remove solely client 3’s data, and then solely client 4’s data, and finally, solely client 5’s data. If we get an answer that rounds to 20% every time, then we can release a statement like the following: “The percentage of shares that traded on dark pools falls in the range from 15% to 25%.” If for any of these checks we get a different rounded answer, we’ll terminate the process and release no range value for this stat.
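
Here is a minimal Python sketch of that check. It is not our production code; the stat (a weight-averaged dark pool percentage), the rounding grid, and the client data are all placeholders, but it follows the four steps above:

```python
from typing import Optional

def round_to_grid(value: float, grid: float) -> float:
    """Round a stat to the nearest multiple of `grid` (e.g. 10.0 for 10%)."""
    return round(value / grid) * grid

def weighted_stat(clients: dict[str, tuple[float, float]]) -> float:
    """Weighted average of per-client stat values, weighted here by shares traded."""
    total_weight = sum(w for _, w in clients.values())
    return sum(v * w for v, w in clients.values()) / total_weight

def releasable_range(clients: dict[str, tuple[float, float]],
                     grid: float = 10.0) -> Optional[tuple[float, float]]:
    """Return the releasable range, or None if removing any single client's
    data would move the rounded stat into a different range."""
    candidate = round_to_grid(weighted_stat(clients), grid)
    for name in clients:
        remaining = {k: v for k, v in clients.items() if k != name}
        if round_to_grid(weighted_stat(remaining), grid) != candidate:
            return None                      # check failed: release nothing
    return (candidate - grid / 2, candidate + grid / 2)

# Hypothetical data: client -> (percent of shares traded on dark pools, shares traded).
clients = {
    "client_1": (18.0, 2_000_000),
    "client_2": (16.0, 1_500_000),
    "client_3": (22.0, 1_000_000),
    "client_4": (19.0, 2_500_000),
    "client_5": (17.0, 3_000_000),
}
print(releasable_range(clients))   # -> (15.0, 25.0) for this made-up data
```

With real data, the weights would come from however the stat itself is defined, and (as discussed further below) the rounding grid should be chosen before looking at the data.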

One thing is important to note here: if the value of this stat for each individual client’s data falls inside the same range, then clearly every one of our weighted averages over different subsets of clients will fall inside this same range also, and we will be able to release the stat. But what we are requiring is something weaker. There may be an individual client whose data alone would lead to a different range for the stat. This might be ok! Especially if the weight of that client’s data relative to the other clients is relatively low. This is helpful because if we required each individual client’s value to fall inside the same range, clients with small sample sizes of trading activity could effectively block us from releasing useful stats due to their noisy individual numbers. Using our requirements above instead, we will likely be able to release more stats (and with narrower ranges) as our client base grows, rather than it getting harder and harder to release stats as we add clients.

Now let’s think about what the results of this process could mean from the perspective of company C, in the potentially challenging scenario where company C is one of the clients, and we are concerned about company C’s ability to infer some information about the value of the stat for another client T. It turns out the process above has a somewhat cute consequence: for any value of a stat that we release, there is an alternative explanation for that same value being released that is consistent with company C’s knowledge and doesn’t involve client T’s data at all.

This may sound kind of obvious — after all, didn’t we say that the value of the stat would fall inside the same range if we excluded client T’s data? But that alone is not a sufficient argument. Remember that company C may know exactly how many clients Proof has, and also what percentage of the data set that their own data represents. Hence the world with only 4 clients that results from removing client T does not count as an explanation that is consistent with company C’s knowledge. But we can get around this.

This part gets technical, so I’m going to put the full proof below, after I sum up what all of this means. If you want to skip the math that follows, the main idea is this: we can come up with an alternative data set that fully explains the release of the stat within the given range, and we can build this alternative data set solely from numbers that fall inside the experience of clients other than T. This hypothetical data set preserves the total number of clients, the value of the stat for company C, as well as the relative weighting of company C’s data inside the overall data set. In this way, we feel it is a plausible alternative explanation that is consistent with everything company C should know.

There are still a few important caveats to note here. One is that plausibility is not as good as “equally probable.” If we wanted to argue more rigorously about how company C might use Bayesian methods to infer probabilistic information despite plausible alternative explanations, we’d have to go down the rabbit hole of thinking about company C’s prior knowledge distribution, etc. I don’t currently feel that this level of complication is worthwhile for the relatively simple kind of stats that we expect to release.

Secondly, everything we’ve talked about here treats each statistic we release individually. There may be a concern that though plausible alternative explanations exist for each stat individually, perhaps there is no single alternative explanation that holistically explains the full collection of stats that we release. This is a concern that differential privacy has answers for: composition theorems in that domain give some satisfying techniques for bounding the overall privacy impact of many parallel statistical queries. However, in our case, we think it is reasonable to assume that individual stats are somewhat independent. It is quite possible, for example, to have great markouts on a 1 second time horizon and horrible slippage on a day time scale. In other words, we do not expect the information content of our stats to be that much greater than the sum of its parts. Combining plausible alternative explanations for individual stats into a holistic and plausible alternative explanation for all of our stats seems like a relatively easy exercise in our case, and not worth worrying too much about for now.

Thirdly, it might be tempting to sometimes choose ranges after looking at the underlying data. This can create a back channel for client data to influence the ranges themselves, thereby potentially leaking information through those choices. To mitigate this, we try to pick ranges somewhat naturally from our general experience before looking at the data in question.

Closing thoughts

So what does all of this mean for Proof? Well, it probably means we’ve made our lives a bit harder for not too much reason. We feel proud that our privacy policy for client data sets us apart from the rest of the industry. But did we do it because current or potential clients were clamoring for it? No. Don’t get me wrong, we think they like it — but do we realistically expect it to drive adoption of our products? No. Let’s face it, nobody chooses products based on their privacy policies. But we think it’s something that should be given more thought. And if you have any thoughts on what we’re doing, we’re happy to hear them!

Appendix: proof that a suitable alternative data set exists to explain any individual stat released under this framework.

Hi math enthusiasts! I hope we’re alone now. I need to confess something to you — I expect there is a much easier proof of this, but unfortunately, this is the one I came up with. Also, Medium’s lack of support for math notation is incredibly annoying, so apologies in advance.

It will help to have some more notation in place, so we’ll let c denote the value of the stat on company C’s data alone, and t denote the value of the stat on client T’s data alone. We’ll let R denote the released range.

One case is pretty easy: suppose that the value of c falls inside the range R. Then, for all company C knows, it’s entirely possible that we have the specified number of clients, and all individual clients have values falling within the range of R. This would explain the release of R as the range. But wait — the careful reader interjects — we computed R in the first place as the overall stat for the dataset that included client T’s data. So how does this explanation count as not involving client T’s data? Well, because of our checks, it must be true that the average would have fallen inside range R even without client T’s data, so we feel that this still qualifies.

Another relatively mild case is when the value of t falls inside the minimum and maximum of the other clients’ individual values. In this case, the value of t itself can be explained as some weighted average of the other values. In some sense, this means we could replace client T’s real data with synthetic data generated by resampling data points from the other clients in some way, and the value of t would be the same. We might think of this as some kind of “FrankenClient” assembled from the pieces of other client data. Since this can be made from parts that don’t include client T’s data, we feel this is a compelling argument that client T’s data is reasonably protected. It should be noted though that the form of the composition is a function of client T’s value t (as this value is preserved).

A more delicate case occurs when c does not fall inside the range R for this stat, and the value of t does not fall inside the min and the max of all the other values. In particular, let’s assume that t is above the maximum of all the other values. (The case where it is below would follow analogously.)

Let’s define r to be the weighted average that results when we remove company C’s data. We know this value r is inside the range R, and also that t > r. We’ll let m denote the maximum of all the individual client values other than t.

Now, we know there must be some individual client values (not including c) that are < r. We’ll let u₁, …, uₖ denote these values. We’ll let uₖ₊₁, …, uₙ denote the remaining client values that are ≥ r (if any). We’ll let wᵢ denote the weight for each of these values respectively in our averaging over the whole data set. We’ll let w₀ denote the weight for client T, and w_c denote the weight for company C’s data (so that all of the weights together sum to 1). We claim that we can set new weights z₀, …, zₖ such that z₀·m + z₁·u₁ + … + zₖ·uₖ = w₀·t + w₁·u₁ + … + wₖ·uₖ, while keeping the total weight the same: z₀ + z₁ + … + zₖ = w₀ + w₁ + … + wₖ.

We can also choose the zᵢ values to ensure that the weighted average of all the values other than m (under the new weights) is at least as large as the weighted average of all the values other than t was under the original weights, i.e. (w_c·c + z₁·u₁ + … + zₖ·uₖ + wₖ₊₁·uₖ₊₁ + … + wₙ·uₙ) / (1 − z₀) ≥ (w_c·c + w₁·u₁ + … + wₙ·uₙ) / (1 − w₀).

This must be possible because the value of the righthand sum in the first equation, once we divide it by the total weight w₀ + w₁ + … + wₖ, is between m and the uᵢ’s. Why can’t it be greater than m, you ask astutely? Well, if it were, then the total weighted average when we include the uₖ₊₁, …, uₙ values as well would be greater than r, and this would contradict our definition of r in the first place. Now, there exists a way to redistribute the weight to hit any such value that falls inside the extremes of the values being averaged. In particular, we can have z₀ > w₀, and zᵢ < wᵢ for each i ≥ 1. In other words, we are making the value m heavier to make up for the fact that it is replacing a higher value t, and we can take away weight disproportionately from the lowest values among u₁, …, uₖ to compensate and to make the overall weighted average of the values other than m at least as high as it was previously. So suitable weights zᵢ must exist.

We can now define an alternative possible data set with the same number of clients, where all individual client values and weights are unchanged, except t is replaced with m, and each wᵢ involved in the expression above is replaced with the corresponding zᵢ. The overall weighted average for this data set is the same as our original one, so that still falls inside the range R. Next we verify that if we remove any one client’s value from this alternative data set, the weighted average of what remains will again fall inside the range R.

If the removed value is not the m that we inserted for t, or one of the uᵢ’s whose weights we changed, then the weighted average of what’s left is the same as it was for our original data set, and hence still falls inside R.

Let’s consider the case where we remove the new value m that we inserted in place of t. The sum of the weights that remain is 1-z₀. We can write out the weighted average of what remains as:

First, we note that this value cannot be too high to fall inside R. If it were, the overall average when we add m back in could not fall inside R either. So our only concern is that this value might be too low to fall inside R. But, this value is actually greater than or equal to what happened for our original data set when we left out the corresponding value of t with its original weight w₀, due to our requirements for the zᵢ’s. Hence it must still fall inside the range R.

Finally, let’s consider the case where we remove one of the uᵢ values with its new weight zᵢ (i.e. i ≤ k). Let’s compare this to what happens when we compute the weighted average over the whole data set, without leaving anything out (a computation which we know falls inside the range R). We’ll call this overall weighted average X.

Removing uᵢ only increases the relative weighted average of the values other than c, as uᵢ is below r. Also, since zᵢ < wᵢ, the relative weight of this larger value compared to c only increases, potentially moving the average closer to this value that is > r. This means that either the weighted average with the uᵢ value removed is larger than X (in which case it can’t be too small to fall in R), or it is > r, and so also cannot be too small to fall inside R.

We are left with the concern that the weighted average when we remove this particular uᵢ value could be too high. This definitely cannot happen if uᵢ is greater than or equal to X, since removing a value that is at least as large cannot make the remaining weighted average any higher. So we can assume that uᵢ < X. We claim that the weighted average of what remains is less than what it was in the original data set when we removed this uᵢ value with the original weight wᵢ. In other words, we claim that (X − zᵢ·uᵢ) / (1 − zᵢ) < (X − wᵢ·uᵢ) / (1 − wᵢ).

If we multiply out the denominators and cancel out common terms on both sides, we can see this follows naturally from the fact that wᵢ > zᵢ and uᵢ < X.

Putting all of this together, we know that we can always come up with an alternate data set that has the right number of clients, the right value and weight for company C’s data, and all of the other values falling within the range of values experienced by clients who are not client T.
