Summary: Towards Robust and Privacy-preserving Text Representations (ACL 2018)
Authors: Yitong Li, Timothy Baldwin, Trevor Cohn.
On the surface, machine learning and privacy are diametrically opposed: one aims to learn as much about the data as possible in order to optimize some objective (e.g. predicting sentiment), while the other tries to limit the use and spread of sensitive information. It seems intuitive that, in most cases, a gain in privacy would come at the cost of classification accuracy. This paper shows that by being selective about the type of information we obfuscate, we can actually get a more accurate and generalizable classifier.
Let’s say we have a dataset of restaurant reviews and we want to predict each review’s sentiment. Assume that each review comes with attributes about its author, like gender or age, and that we would like to force our model to “protect” (i.e. ignore) these attributes.
Modeling
More formally, a review x is the input to the model (a BiLSTM), which produces a hidden representation h, which is then used to make a prediction. We would like h to not be predictive of the protected attributes. The authors model this as adversarial learning: a discriminator (a linear layer) is added for each protected attribute, and the learning objective is to find parameters that maximize the model’s ability to predict the target while minimizing the discriminators’ ability to predict the value of each attribute. Notice that we are obfuscating information not in the training data, but in our latent representations.
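To make the setup concrete, here is a minimal PyTorch sketch (not the authors’ code) of one standard way to realize this minimax objective: a gradient-reversal layer in front of the discriminator, so that the encoder and the adversary can be trained jointly. The pooling choice, the dimensions, and the trade-off weight `lambd` are my own assumptions for illustration.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips (and scales) the gradient on the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class AdversarialTagger(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hid_dim=128,
                 n_tags=17, n_protected=2, lambd=1e-3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, bidirectional=True, batch_first=True)
        self.tagger = nn.Linear(2 * hid_dim, n_tags)              # main task head
        self.discriminator = nn.Linear(2 * hid_dim, n_protected)  # linear adversary
        self.lambd = lambd

    def forward(self, tokens):
        # h: (batch, seq_len, 2 * hid_dim) hidden representations from the BiLSTM
        h, _ = self.encoder(self.embed(tokens))
        tag_logits = self.tagger(h)
        # The adversary sees a pooled, gradient-reversed sentence representation,
        # so its loss pushes the encoder to hide the protected attribute.
        pooled = h.mean(dim=1)
        attr_logits = self.discriminator(GradReverse.apply(pooled, self.lambd))
        return tag_logits, attr_logits
```

During training, the total loss would be the tagging cross-entropy plus the discriminator’s cross-entropy on the protected attribute: because of the reversed gradient, the discriminator still learns to predict the attribute from h, while the encoder is pushed to make that prediction harder.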
The authors experiment on two tasks, sentiment classification and part-of-speech (POS) tagging, both on the TrustPilot reviews dataset, which includes the sex and age of the reviewer. I’ll go over the POS tagging task since I think the results are more interesting. The authors protect both the sex and the age of the reviewer, with sex taking the values ‘M’ and ‘F’ and age being over-45 (O45) or under-35 (U35). I don’t know where the ages of 35–45 went, but I didn’t look into it. The authors then report results on the test set with the accuracy stratified by the protected attributes.
The authors also report POS tagging results on an African-American Vernacular English (AAVE) dataset, using the model trained on TrustPilot.
The adversarial model (ADV) does better on TrustPilot, but really shines when predicting on out-of-domain data. The authors attribute the better performance to the fact that the model “is capturing the bias signal less and [is] more robust to the tagging task”.
Thoughts and Conclusion
The first and most obvious limitation is that the protected attributes must be explicitly given and must be separate from the training data. However, I can see follow-up work to this paper where, for example, if we want our model to not be gender biased (e.g. assuming doctors are male), we use this adversarial learning process to remove the bias by “white-ing out” pronouns like “he” and “she” within our dataset and minimizing the ability of a discriminator to predict the gender of these pronouns.
The second limitation is in the evaluation. The TrustPilot dataset is small (600 reviews), which makes me wonder whether, on a larger dataset, the accuracy gains from removing bias would be smaller, albeit still present.
Overall, I liked this paper. A simple idea that is easy to implement, gets good results, and probably only required a single GPU to train on. As a plus, I found the paper to be very well written.