Building an ensemble of IV-weighted Naïve Bayes models on differentially private data to predict ad click rate


Manoj Kumar Rajendran, Principal data scientist, MiQ

As we embark on an era of data-driven decisions, preserving the privacy of the underlying data is of utmost importance. Tech giants and data vendors gather users' personal and social interactions every second, accumulating a vast pool of usable data. Organizations are scrambling to make their datasets GDPR- and CCPA-compliant while struggling to protect the privacy of opt-in customers when sharing and analyzing the data.

The Netflix Prize (2006) and the NYC taxi data release were instances in which the most common privacy-preserving approach, hashing/anonymization, was exploited and brought to its knees. Through reverse engineering and linkage attacks, attackers showed the world that millions of customer records can be de-anonymized within a few hours. Hence, we need a modernized approach to privacy protection that safeguards personal data far better than traditional methods. Say hello to "differential privacy".

The work by Cynthia Dwork et al. (2006) on differential privacy has been gaining a lot of traction in recent times and is a key area of focus for companies in digital marketing and programmatic advertising, which rely heavily on sensitive user behaviors and preferences to create state-of-the-art predictive models.

Introduction to differential privacy

Differential privacy is a strong, mathematical definition of privacy in the context of statistical and machine learning analysis. This concept enables analysts to study information about the population of interest while masking the presence of an individual in that population.

Fig. 1 — Querying on Differentially private DB

For example, let us query a database to count how many people visited a particular website. This query can be considered "differentially private" if, by examining the output, an analyst cannot deduce whether a given individual was present in the database or not. Intuitively, this means the presence of an individual shouldn't change the response/output of the query.

A process S is ε-differentially private if, for all databases D1 and D2 that differ in only one individual, and for all possible outputs O:

P[S(D1) = O] ≤ e^ε ⋅ P[S(D2) = O]

ε is the privacy budget. Smaller ε makes e^ε ≈ 1, so the output probability distributions of the process S on the neighbouring datasets D1 and D2 are roughly the same. Smaller values of ε therefore provide stronger privacy guarantees but hurt the utility of the function, while larger values do the opposite (much like the bias vs. variance trade-off). Depending on the desired trade-off between privacy and accuracy, the budget ε can be chosen accordingly.
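
As a quick worked example: with ε = 0.1, e^ε ≈ 1.105, so adding or removing one individual changes the probability of any output by a factor of at most about 1.105; with ε = 5 (the budget used later in this article), e^ε ≈ 148, a far looser bound.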

Mathematics behind differential privacy

The crux of differential privacy is “adding carefully calibrated statistical noise” to obscure the underlying sensitive data.

DP output = Original output + Noise

The two factors that decide the magnitude of noise are:

  • ε- The privacy budget (discussed already)
  • Sensitivity of the query

Let us quickly understand what sensitivity stands for. It is the maximum amount by which the output of a query can change upon the addition or deletion of a single entity/individual from the dataset. For counting queries, the sensitivity is 1; for a sum query, it is the largest value a single individual can contribute.

The added noise can be drawn from the Laplace, Gaussian, or exponential distributions.

Given a dataset D and a function f: D → R^d with global sensitivity Δf, the mechanism

A(D) = f(D) + noise

satisfies ε-differential privacy if the noise follows the Laplace distribution with location parameter (LP) zero and scale parameter (SP) Δf/ε; that is, noise ~ Lap(Δf/ε).
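
To make the mechanism concrete, below is a minimal Python sketch of a Laplace-noised counting query (sensitivity Δf = 1). The counts and epsilon values are illustrative, not from the original analysis:

import numpy as np

def laplace_count(true_count, epsilon, sensitivity=1.0):
    # Laplace mechanism: location parameter 0, scale parameter b = Δf / ε
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# Counting query: how many people visited the website?
print(laplace_count(1000, epsilon=0.1))  # heavy noise (b = 10), strong privacy
print(laplace_count(1000, epsilon=5.0))  # light noise (b = 0.2), weak privacy

Note that each release spends privacy budget: answering the same query repeatedly and averaging would erode the guarantee, which is why the budget is tracked across queries.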

Leveraging differential privacy in a tech ecosystem

  • The United States 2020 Census data was released with differential privacy, and a press release announced that all future census data will be published with similar privacy guarantees
  • Google incorporated a "local differential privacy" mechanism in building RAPPOR (Randomized Aggregatable Privacy-Preserving Ordinal Response) inside Chrome to learn statistics about how unwanted software hijacks users' settings
  • Apple developed a "differential privacy schema at scale" to study user text and emoji patterns and improve the user experience

On a side note, with many data vendors following in the footsteps of these tech giants in adopting differential privacy, most sensitive data in the future will only be available in a differentially private form (aggregated with noise). Hence the onus is on us to be prepared and to research how to use this data effectively.

Tackling differentially private data

As discussed before, the concept of differential privacy applies not just to queries but to models too. Its maxim, "the output shouldn't differ by inclusion or exclusion of an individual," is an alternate way of saying "learn from population trends and do not memorize any single individual." This has also helped many SaaS (software-as-a-service) companies address the "cold start" problem.

As most DP research is done on the machine learning side (such as PATE and bolt-on SGD), very little work or knowledge has been shared on how to use the aggregated, noisy DP data itself. Pretty much a low-hanging fruit for a data scientist in an AdTech company!

Now this aggregated noisy DP data presents us with two problems to solve:

  • How do we remove/control the noise factor in the aggregated data?

As differential privacy provides mathematically rigorous privacy guarantees and the noise is intentionally added to obscure sensitive data, there isn't much an analyst can do about the noise. Hence the data vendor largely controls both the privacy and the usability of the DP data. If the epsilon value (privacy budget) is made public, an analyst can form a rough expectation of how prediction models built on the DP data will perform (naturally, their performance will be lower than that of models built on the clean, sensitive data).

  • How do we use only aggregated data to make predictions at a granular level? (Example: assume that ad click rate is available by a) domain, b) creative type, and c) ad position; however, the click prediction must be made per instance, i.e., given a domain, creative type, and ad position, predict the probability of a click.)

The crux of the solution is Naïve Bayes. The standard packages in languages such as Python and R build a Naïve Bayes classifier on granular (event-level) data. The idea is to mimic the same steps on the aggregated data: by estimating the event likelihood for every feature level from the aggregates, it is possible to make predictions for a given instance at the granular level, as the sketch below illustrates. More details follow in the next section.
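
Here is a minimal sketch of that idea in Python, assuming hypothetical dictionaries of noisy single-feature aggregates of the form {level: (impressions, clicks)}; it illustrates the likelihood estimation from aggregates, not the exact production code:

import math

# Hypothetical noisy aggregates per feature: {level: (impressions, clicks)}
domain_agg   = {"d1": (5000, 150), "d2": (800, 10)}
creative_agg = {"banner": (4000, 120), "video": (1800, 40)}
TOTAL_IMPS, TOTAL_CLICKS = 5800, 160   # overall (noisy) totals

def likelihoods(agg, level, floor=0.5):
    imps, clicks = agg[level]
    clicks = max(clicks, floor)              # guard against negative noisy counts
    non_clicks = max(imps - clicks, floor)
    return clicks / TOTAL_CLICKS, non_clicks / (TOTAL_IMPS - TOTAL_CLICKS)

def predict_ctr(instance):
    # instance: list of (aggregate_table, level) pairs, one per feature
    log_c = math.log(TOTAL_CLICKS / TOTAL_IMPS)       # log prior P(click)
    log_n = math.log(1 - TOTAL_CLICKS / TOTAL_IMPS)   # log prior P(no click)
    for agg, level in instance:
        p_c, p_n = likelihoods(agg, level)
        log_c += math.log(p_c)                        # Naïve Bayes: multiply likelihoods
        log_n += math.log(p_n)
    m = max(log_c, log_n)                             # normalize in log space
    pc, pn = math.exp(log_c - m), math.exp(log_n - m)
    return pc / (pc + pn)

print(predict_ctr([(domain_agg, "d1"), (creative_agg, "video")]))

The floor parameter is an assumed guard that clips non-positive noisy counts (a problem discussed below) so the logarithms stay defined.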

Problem statement

Build a click prediction model using only the aggregated differentially private data, and benchmark its performance against a state-of-the-art XGBoost classification model built on the small granular-level data.

Two datasets are considered in the analysis:

  • Small granular-level data (100k entries): noise-free, sampled from a larger dataset of 100M entries
  • Aggregated noisy data: differentially private queries over the larger dataset with Gaussian noise, ε = 5 and δ = 1e-10, spread across 190 queries (single and cross features)

Sample queries to extract aggregated data

SELECT feat1, COUNT(click), SUM(click) FROM private_dataset GROUP BY feat1 WITH DIFFERENTIAL_PRIVACY(delta, epsilon);

SELECT feat1, feat2, COUNT(click), SUM(click) FROM private_dataset GROUP BY (feat1, feat2) WITH DIFFERENTIAL_PRIVACY(delta, epsilon);
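
The SQL above is vendor pseudo-syntax; as a point of reference, here is a hedged pandas sketch of how such Gaussian-noised aggregates could be simulated. The sigma calibration uses the classical Gaussian-mechanism formula σ = Δf·√(2·ln(1.25/δ))/ε, which strictly holds only for ε ≤ 1 and is used purely for illustration; the column names and data are hypothetical:

import numpy as np
import pandas as pd

def gaussian_sigma(sensitivity, epsilon, delta):
    # Classical Gaussian-mechanism calibration (illustrative only)
    return sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon

def dp_group_aggregate(df, group_cols, epsilon, delta):
    agg = df.groupby(group_cols)["click"].agg(imps="count", clicks="sum").reset_index()
    sigma = gaussian_sigma(1.0, epsilon, delta)
    agg["imps"] = agg["imps"] + np.random.normal(0, sigma, len(agg))
    agg["clicks"] = agg["clicks"] + np.random.normal(0, sigma, len(agg))
    return agg   # negative values are possible in low-count buckets

# Hypothetical event-level frame: feat1, feat2, click in {0, 1}
df = pd.DataFrame({"feat1": np.random.choice(["a", "b"], 10_000),
                   "feat2": np.random.choice(["x", "y"], 10_000),
                   "click": np.random.binomial(1, 0.02, 10_000)})
print(dp_group_aggregate(df, ["feat1"], epsilon=5, delta=1e-10))
print(dp_group_aggregate(df, ["feat1", "feat2"], epsilon=5, delta=1e-10))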

Granular data (event level)

Fig. 2 — Granular data view that won’t be available in a DP world

Single feature aggregated noisy data

Fig. 3 — Aggregation of events by single feature

Cross features aggregated noisy data

Fig. 4 — Aggregation of events by cross features

Things to note:

  • Features are hashed, meaning an analyst cannot make sensible groupings or decisions based on the feature values
  • Negative count values are observed for some combinations, as the added noise may exceed the true counts (the noise is very significant in low-count buckets and negligible in high-count buckets)
  • Some of the features have very high cardinality (more than 1,000 unique levels/values)

Proposed solution: Ensemble of information-value-based Naïve Bayes likelihood estimators using single and paired noisy aggregates

Key pointers:

  • As the impact of the added noise is smallest for the features with the fewest levels, a feature's weight in the model varies inversely with its number of levels
  • The information value (IV) of each feature is critical in determining its importance in the final prediction model (see the IV sketch after this list)
  • The conventional Naïve Bayes approach is tweaked to handle only aggregates, and an ensemble of models is used to predict the final event probability
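
As a hedged illustration of the weighting step, the sketch below computes information value from the noisy aggregates using the standard formula IV = Σ (%events - %non-events) · WoE, with WoE = ln(%events / %non-events). The smoothing constant and the normalization into weights are assumptions, not the article's exact recipe:

import math

def information_value(agg, smooth=0.5):
    # agg: {level: (impressions, clicks)} noisy single-feature aggregates
    clicks = {l: max(c, smooth) for l, (i, c) in agg.items()}
    nons   = {l: max(i - c, smooth) for l, (i, c) in agg.items()}
    tc, tn = sum(clicks.values()), sum(nons.values())
    iv = 0.0
    for level in agg:
        pe, pn = clicks[level] / tc, nons[level] / tn   # % events, % non-events
        iv += (pe - pn) * math.log(pe / pn)             # contribution = diff * WoE
    return iv

domain_agg   = {"d1": (5000, 150), "d2": (800, 10)}     # hypothetical aggregates
creative_agg = {"banner": (4000, 120), "video": (1800, 40)}

ivs = {"domain": information_value(domain_agg),
       "creative": information_value(creative_agg)}
weights = {f: v / sum(ivs.values()) for f, v in ivs.items()}   # normalized IV weights
print(weights)

One plausible way to apply these weights is a convex blend of the per-aggregate Naïve Bayes probabilities, p = Σ w_f · p_f; the article's exact ensembling over single and paired aggregates is not reproduced here.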

Methodology for the skyline model (benchmark):

As all the features are categorical and hashed, the straightforward approach is to target-encode them. Target encoding converts each level of a categorical variable to the mean of its response value, in our case the average click rate. To avoid overfitting, a small amount of Gaussian noise is added to the encoded values. A hyper-parameter-tuned XGBoost model is then fit on the data. As the granular dataset is already very small compared to the original dataset, the model is scored on the same data it was built on (in-sample predictions). This results in very high performance (accuracy, sensitivity, specificity, etc.), serving as a true benchmark.
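
Below is a minimal sketch of this skyline pipeline under stated assumptions: hypothetical column names and synthetic data, xgboost and pandas available, and placeholder hyper-parameters standing in for the tuned ones:

import numpy as np
import pandas as pd
import xgboost as xgb

# Hypothetical granular frame with hashed categorical features
df = pd.DataFrame({"domain": np.random.choice([f"h{i}" for i in range(1000)], 100_000),
                   "creative": np.random.choice(["c1", "c2", "c3"], 100_000),
                   "click": np.random.binomial(1, 0.02, 100_000)})

# Target encoding: replace each level with its mean click rate,
# plus small Gaussian noise to discourage memorization
for col in ["domain", "creative"]:
    means = df.groupby(col)["click"].transform("mean")
    df[col + "_te"] = means + np.random.normal(0, 0.01, len(df))

X, y = df[["domain_te", "creative_te"]], df["click"]

model = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X, y)

# In-sample scoring, as described above (the granular set is already small)
probs = model.predict_proba(X)[:, 1]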

Results:

Our model is at a disadvantage, as it is built on aggregated noisy data, whereas the skyline model is built on granular, noise-free data.

Proposed model performance:

Fig. 5 — Key metrics comparison

The big area for improvement, as seen in the performance tables, is specificity (76.9% vs. 94.6%), which ties directly back to the percentage of concordant pairs and the average predicted probabilities. The proposed model identifies true positives well and with strong conviction (0.58 vs. 0.65); however, some non-events are predicted as events with high probability (0.16 vs. 0.04). It is worth noting that the proposed solution captures nearly 41% of events in the first decile (rank-ordered by probability), resulting in better lift metrics than the skyline model's 36%.

Conclusion and future work

As we shift towards a privacy-first ecosystem, it is imperative that we are prepared for a future with just aggregated data. Be it click prediction or bid allocation, the learnings from this work can be leveraged. With the cookieless world edging closer to programmatic advertising faster than ever before, it is important to acquire the knowledge to work with differentially private data and avoid being left behind. Given the niche nature of the problem we are trying to solve, the results are satisfactory and give us scope to research and build models that can challenge, and even surpass, the skyline model in the future.

References

1) Netflix Prize (2006): https://money.cnn.com/galleries/2010/technology/1012/gallery.5_data_breaches/index.html

2) NYC taxi data release: https://www.theguardian.com/technology/2014/jun/27/new-york-taxi-details-anonymised-data-researchers-warn

3) Cynthia Dwork et al. (2006) on differential privacy: https://link.springer.com/content/pdf/10.1007/11761679_29.pdf

4) United States Census 2020 data: https://www.ncsl.org/research/redistricting/differential-privacy-for-census-data-explained.aspx

5) Google's RAPPOR: https://www.chromium.org/developers/design-documents/rappor

6) Apple's "Learning with Privacy at Scale": https://machinelearning.apple.com/research/learning-with-privacy-at-scale

7) Differential privacy fundamentals: https://desfontain.es/privacy/

8) Criteo Privacy Preserving ML Competition @ AdKDD: https://competitions.codalab.org/competitions/31485
