Differential Privacy in the Real World

Aggregation of differentially private models to boost classification performance

Sophia Collet
Bluecore Engineering
8 min read · Mar 4, 2019


Developing robust and performant machine learning models requires large datasets. Yet, amassing a sufficient amount of data can be cumbersome and in some cases infeasible. At Bluecore, we build predictive models to help e-commerce marketers acquire, engage, and retain customers. These models are built on proprietary datasets of first-party data, which we collect on our clients’ behalf. These datasets uniquely leverage three critical dimensions of how consumers engage with brands — at the product level, against anonymous and known identities, and across a variety of browsing, searching, and purchase behaviors. Bluecore employs a strict multi-tenant protocol that ensures data access isolation: each client’s data is stored in a silo separate from the others. When training a model for a client, only that client’s data is used, as illustrated in Figure 1.

Figure 1: Representation of Bluecore’s data silos for model training and prediction.

While some of our clients’ websites see a dense flux of customers each day, others have a narrower reach and see much less frequent online interactions. Because most of our predictive models are designed to rely more heavily on recent data, we are effectively constrained in the amount of usable data we can collect per client, making it impossible to gather large datasets for these smaller clients. Additionally, since data collection only starts when a client signs up for our services, we systematically face a cold-start problem: it can take a while before our models have the data needed for optimal performance.

However, if we could stop seeing Bluecore’s client base as a collection of disjoint and isolated datasets and find a way to combine them, there would collectively be enough data to build better-performing models for every client. The issue is that companies are understandably unwilling to share their data in this way, out of concern for the privacy of the end consumer and the confidentiality of their own proprietary business data.

Privacy concerns & attacks

There exists a large body of literature studying the privacy issues stemming from releasing sensitive datasets or exposing the parameters/outputs of machine learning models that have been trained on private data. Some glaring privacy issues have been exposed in the past few years, perhaps the most notorious one being around the release of a private anonymized dataset by Netflix for its recommendation prize. The identities of the customers present in that database were almost entirely exposed using public data from IMDb.

Granted, at Bluecore, we do not share our models or their parameters with our clients. We only provide each client with their respective models’ outputs for the individuals in their customer base. One naive approach could therefore be to train one model on a collection of clients’ data, but then only release to each client the outputs obtained for their own customer base. However, this approach would violate the requirement that each client’s data remain in their individual silo. Furthermore, even ignoring this problem, it has been shown that machine learning models can leak information about the data they were trained on, even with only black-box access to the model. For instance, research has shown that membership inference attacks can indicate whether a specific data point was used during training, and that model inversion attacks can recover the values of certain features of the training data. These types of attacks are illustrated in Figure 2.

Figure 2: Membership inference and model inversion attack mechanisms.

A solution to the privacy concerns mentioned above is to inject random noise somewhere in the process: the hope is that the added noise makes it impossible to reverse-engineer the model (i.e., to recover individual records in the underlying data) while still allowing the model to provide utility.

One of the simplest illustrations of this noise-injection practice is randomized response. Imagine that you want to conduct a phone survey to evaluate the response of a population to a yes/no question. Now, imagine that the question is somewhat delicate, for instance “have you ever cheated on your spouse?”. You would assume that some people may be reluctant to answer “yes” truthfully. Instead of asking for the answer outright and risking statistics that are not representative of the population’s true response, ask each respondent to toss a coin in secret. If it comes up heads, they should respond “yes”, and if it comes up tails, they should tell the truth (“yes” or “no”) as shown in Figure 3. That way, every respondent has plausible deniability for their answer, and you still have a way of estimating the population’s average response. Indeed, about half of the respondents will answer “yes” simply because their coin showed heads, while the other half will answer truthfully, so the observed “yes” rate is roughly 50 + x/2 percent, where x is the true percentage of the population that would answer “yes”. Therefore, if the survey finds that y% of the responses were “yes”, one can estimate x = 2*(y - 50), i.e. that x% of the population has actually cheated on their spouse, without breaching any individual respondent’s privacy. In this illustration, the random noise is injected at data collection time and comes from the random outcome of the coin toss.

Figure 3: Randomized response example in a survey.
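
As a quick illustration (a minimal Python sketch, not part of the original study), the following simulates the coin-toss mechanism and recovers the population rate with x = 2*(y - 50):

```python
import random

def randomized_response(true_answer):
    """One respondent: heads -> always answer 'yes', tails -> tell the truth."""
    if random.random() < 0.5:   # coin shows heads
        return True
    return true_answer          # coin shows tails

def estimate_true_rate(responses):
    """Recover the true 'yes' rate from the noisy responses via x = 2*(y - 50)."""
    y = 100.0 * sum(responses) / len(responses)  # observed 'yes' percentage
    return 2.0 * (y - 50.0)

if __name__ == "__main__":
    true_rate = 0.30  # assume 30% of the population would truthfully answer 'yes'
    population = [random.random() < true_rate for _ in range(100_000)]
    noisy = [randomized_response(answer) for answer in population]
    print(f"estimated true 'yes' rate: {estimate_true_rate(noisy):.1f}%")
```

Running this prints an estimate close to 30%, even though no individual answer can be trusted on its own.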

Differential privacy

In order to protect individuals’ privacy in our clients’ datasets, we follow a similar approach at Bluecore by relying on the concept of differential privacy. Intuitively, differential privacy ensures that the output of a query or an algorithm does not change drastically when it is run on two datasets that are identical except for a single entry. Differential privacy therefore aims to protect individual records in a database. Its mathematical definition involves a parameter epsilon that quantifies how much the outcomes may differ: the lower the epsilon, the smaller the allowed discrepancy and the stronger the privacy.
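
For reference, the standard formal definition (a well-known result, not spelled out in the original post) states that a randomized mechanism M is epsilon-differentially private if, for every pair of datasets D and D′ differing in a single record and every set of possible outputs S:

$$\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S]$$

With epsilon = 0.01, as used in our experiments below, e^epsilon ≈ 1.01, so adding or removing any single customer changes the probability of any model outcome by at most about 1%.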

Back to our problem, there exist several ways of injecting noise into the training process: into the training data itself, during optimization, into the parameters output by the model, or at query time (each time the model is used), as shown in Figure 4. This last method implies that the model itself is not differentially private, but the querying mechanism is: each query to the model leaks a small amount of information about the original data. Therefore we can only use that type of model a limited number of times before the accumulated privacy loss exceeds our privacy budget, risking the exposure of individual records. This, along with our requirement of respecting the integrity of per-client data silos, is why having a differentially private model is more practical in our case.

Figure 4: Possible noise injections in the data collection/training/serving process.

Our solution

Our approach, which is the result of a collaboration with the Impact Team at Georgian Partners and is detailed in this paper, relies on model parameter perturbation. Our experiments were performed on Bluecore’s propensity-to-convert model, a logistic regression classifier that predicts whether a customer is likely to purchase soon.

The new framework we propose comprises two steps. First, we build a differentially private model for each client separately: using only that client’s data, we train a regularized logistic regression model and add noise to the resulting parameters. We use a technique called Bolt-On Differentially Private Permutation-based Stochastic Gradient Descent (DPPSGD), which can be applied to any learning algorithm that relies on performing Stochastic Gradient Descent on a convex loss function. This results in a collection of differentially private classifiers, one per client.
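
As a rough illustration of this output-perturbation idea, here is a minimal Python sketch. It is not the exact DPPSGD procedure or noise calibration from the paper; the regularization strength, clipping assumption, and sensitivity formula below are simplified placeholders:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def train_private_logreg(X, y, epsilon, alpha=1e-3, clip=1.0, seed=0):
    """Output-perturbation sketch: fit an L2-regularized logistic regression
    with SGD, then perturb the learned weights before releasing them.

    Assumes each row of X has L2 norm <= `clip`; the sensitivity formula
    2 * clip / (n * alpha) is a simplified bound for strongly convex,
    L2-regularized losses, not the paper's exact calibration.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape

    clf = SGDClassifier(loss="log_loss", penalty="l2", alpha=alpha, random_state=seed)
    clf.fit(X, y)

    # L2 sensitivity of the learned weight vector (simplified).
    sensitivity = 2.0 * clip / (n * alpha)

    # Sample noise with density proportional to exp(-epsilon * ||b|| / sensitivity):
    # a uniformly random direction with a Gamma-distributed norm.
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    noise_norm = rng.gamma(shape=d, scale=sensitivity / epsilon)

    clf.coef_ = clf.coef_ + noise_norm * direction  # released, perturbed parameters
    return clf
```

A production implementation would also handle the intercept term and use the exact sensitivity analysis from the DPPSGD paper.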

The second step consists of aggregating these models by training an XGBoost classifier for each (target) client:

  1. we feed the target client’s data into each of the differentially private models we have and collect their predictions,
  2. we then use these predictions as a P-dimensional input (where P is the total number of clients) to an XGBoost classifier to get final predictions for the target client’s data.

This approach is detailed in Figure 5 for one specific client.

Figure 5: Aggregation framework.
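
A minimal sketch of this aggregation step, assuming each per-client private model exposes a scikit-learn-style predict_proba interface and using the xgboost package for the meta-classifier (the function and parameter names here are illustrative, not our production code):

```python
import numpy as np
from xgboost import XGBClassifier

def build_aggregated_model(private_models, X_target, y_target):
    """Stack the P differentially private per-client models into a single
    classifier for one target client.

    private_models: list of P already-trained, differentially private classifiers.
    X_target, y_target: the target client's own data (the only data touched here).
    """
    # Step 1: each private model scores the target client's data, yielding
    # a P-dimensional feature vector per customer.
    meta_features = np.column_stack(
        [model.predict_proba(X_target)[:, 1] for model in private_models]
    )

    # Step 2: train an XGBoost classifier on these P-dimensional inputs.
    aggregator = XGBClassifier(n_estimators=100, max_depth=3)
    aggregator.fit(meta_features, y_target)
    return aggregator
```

At prediction time, the same P private models score the new customers and the resulting P-dimensional vectors are fed to the aggregator.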

This second aggregation step in itself is not differentially private, because the gradient boosting model we used is not, but remember:

  1. the individual client-specific logistic regression classifiers are, and therefore (by the post-processing property of differential privacy) we may use their predictions as we please and still preserve the privacy of the data they were built on,
  2. only the target client’s data is directly used in this step, ensuring that the only data that could leak to each client is their own.

Our experiments showed very encouraging results. While the individual perturbed (private) models performed slightly worse than their non-private counterparts, aggregation managed to counterbalance that effect and provide utility. We gained a lift in both cold-start and warm-start scenarios for most clients in the cohort, with fairly good privacy guarantees (epsilon = 0.01). Our aggregated models provided an average lift of 9.72% in performance (area under the ROC curve) in a cold-start setting over a non-private, non-aggregated baseline. We also observed that while mostly low- and medium-traffic clients saw great improvements, a few of the bigger clients also saw a substantial lift in performance. This suggests that data quantity is not the sole driver of model quality, and that all types of clients would benefit from putting this aggregated differentially private model into production.

Figure 6: Performance gain of our differentially private aggregate model over the baseline.

Conclusion

In all, our study showed that we are in fact able to bridge the gaps between our clients’ siloed data in a manner that respects the privacy of their customers. However, implementing differentially private models does come at a high engineering cost and requires considerable theoretical and experimental effort. Luckily, our bet that aggregation could overcome the loss in performance induced by imposing privacy on our models paid off, and we showed that all types of clients could benefit from it. This is particularly rewarding given that differential privacy is a relatively new field, where successful applications “in the wild” have been very rare.
