Let’s talk about ‘Model privacy’

Not about ‘data privacy’

Purvanshi Mehta
NeuralSpace
3 min read · May 18, 2021


There is a lot of talk about data privacy: every user’s data needs to be protected and their privacy respected. But what about our learned models? If a model is already deployed, someone can query it repeatedly, collect the input-output pairs, and train their own supervised model on them. Never thought about this? It indirectly hampers data privacy too!

Model Privacy is not an intrinsic quantity associated with a model; instead, it is a measure of information that arises from interactions between the model and its queries.

I was going through last week’s ICLR papers and found this interesting problem, which according to the authors has not been addressed before. Enhancing the privacy of an already-learned model can be very beneficial for use cases where machine learning is offered as a service (MLaaS). Two important questions to ask when enhancing model privacy are:

  1. How can privacy be enhanced for an already-learned model?
  2. All privacy-preserving algorithms come with a utility-privacy tradeoff, so how can we balance it systematically?

Information Laundering

The main idea is to manipulate X and Y so that the model is neither built on the exact inputs nor returns its exact outputs. In other words, we place kernels (K1 and K2) over the input and the output. Obviously this comes with a dip in accuracy, but can we optimize the two together?

Manipulate X into X̃ and Y into Ỹ, and build the released system on X̃ and Ỹ instead of on X and Y. That takes care of the privacy part. For the utility part, the new system K needs a high information overlap with the original model K*; in ML terms, the KL divergence between the two should be small. The loss function can be defined as follows.
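
In rough notation, reconstructed from the description above (see the paper for the exact form and the precise direction of the divergence), the objective looks like this: the KL term keeps the released system K close to the authentic model K* (utility), while the two mutual-information terms penalize how much the released answers reveal about the true inputs and outputs (privacy).

L(K1, K2) = E_x[ KL( pK*(· | x) ‖ pK(· | x) ) ] + β₁ I(X; X̃) + β₂ I(Ỹ; Y)

where the released system composes the three stages:

pK(y | x) = Σ over x̃, ỹ of pK1(x̃ | x) · pK*(ỹ | x̃) · pK2(y | ỹ)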

This can be solved to find the optimal pK1(x̃ | x) and pK2(y | ỹ), i.e. our kernels! The β₁ and β₂ weights tune the privacy-utility tradeoff: small values of β push K to be (almost) exactly equal to K*.
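
To make the utility term concrete, here is a tiny illustration (not from the paper; the distributions are made up) of how one query’s laundered response distribution can be compared to the original model’s response distribution:

```python
import numpy as np
from scipy.stats import entropy

# Hypothetical class probabilities returned by the original model K* for one query x
p_original = np.array([0.70, 0.20, 0.05, 0.05])

# Hypothetical probabilities released by the laundered system K for the same query
# (e.g. after the output kernel K2 has smoothed the prediction)
p_laundered = np.array([0.55, 0.25, 0.10, 0.10])

# KL divergence KL(K* || K) for this query: the utility term wants this to stay small
kl = entropy(p_original, p_laundered)
print(f"KL(K* || K) for this query: {kl:.4f}")
```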

Applying to text classification

Let us look at how this can be applied and tested practically. This experiment was part of the appendix of the paper.

The ‘20 newsgroups’ dataset was used, keeping the first 4 classes for classification. The model consists of TF-IDF embeddings and a Naïve Bayes classifier, the ‘Real Classifier’. By querying this model, the authors generate another labelled dataset on which they train a second Naïve Bayes classifier. The comparison between the performance of the two models forms the baseline.
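
A minimal sketch of this baseline, assuming the ‘first 4 classes’ are the first four labels of scikit-learn’s 20-newsgroups loader and that the attacker reuses the same TF-IDF vocabulary (the authors’ exact query set and evaluation protocol may differ):

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Assumed "first 4 classes" (alphabetical order of the 20-newsgroups labels)
categories = ["alt.atheism", "comp.graphics",
              "comp.os.ms-windows.misc", "comp.sys.ibm.pc.hardware"]
train = fetch_20newsgroups(subset="train", categories=categories)
test = fetch_20newsgroups(subset="test", categories=categories)

# "Real Classifier": TF-IDF embeddings + Naive Bayes (the deployed model)
vec = TfidfVectorizer()
real_clf = MultinomialNB().fit(vec.fit_transform(train.data), train.target)

# Attacker: query the deployed model with unlabeled documents and keep its answers
n = len(test.data) // 2
query_docs, eval_docs = test.data[:n], test.data[n:]
stolen_labels = real_clf.predict(vec.transform(query_docs))

# Surrogate model trained purely on the queried (document, predicted-label) pairs
surrogate_clf = MultinomialNB().fit(vec.transform(query_docs), stolen_labels)

# Baseline leakage: how often the surrogate mimics the real model on fresh documents
agreement = accuracy_score(real_clf.predict(vec.transform(eval_docs)),
                           surrogate_clf.predict(vec.transform(eval_docs)))
print(f"Surrogate/real-model agreement on unseen queries: {agreement:.3f}")
```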

Information laundering (the proposed method) is then applied, and the performance of both models is tested again.
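
Continuing the sketch above, here is a crude stand-in for the laundering step: a fixed randomized-response kernel on the output labels rather than the optimized K1/K2 the paper derives from the objective. It only illustrates the mechanism of releasing perturbed answers.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes = len(categories)

def laundered_predict(docs, flip_prob=0.2):
    """Crude stand-in for the output kernel K2: with probability flip_prob,
    replace the deployed model's answer with a uniformly random class."""
    preds = real_clf.predict(vec.transform(docs)).copy()
    flip = rng.random(len(preds)) < flip_prob
    preds[flip] = rng.integers(n_classes, size=int(flip.sum()))
    return preds

# The attacker now learns from laundered answers instead of the exact ones
surrogate_laundered = MultinomialNB().fit(vec.transform(query_docs),
                                          laundered_predict(query_docs))

agreement_laundered = accuracy_score(real_clf.predict(vec.transform(eval_docs)),
                                     surrogate_laundered.predict(vec.transform(eval_docs)))
print(f"Surrogate/real-model agreement after laundering: {agreement_laundered:.3f}")
```

With more aggressive flipping the surrogate tracks the deployed model less closely (better model privacy), but legitimate users also get noisier answers; the β weights in the objective are what balance these two effects systematically.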

As expected, the proposed approach produces less leakage and thus better privacy.

Conclusion

This paper opens an interesting and important direction in machine learning privacy. It would be exciting to see applications in areas like NLP and vision.
