Arguing with Edward Snowden

A Data Scientist’s take on defending Machine Learning models

Introduction

I recently read Edward Snowden’s Permanent Record during my holiday. It is a great book that I highly recommend to basically anyone, but it is particularly interesting for IT folks, for obvious reasons. It is the story of a guy who grows up together with the internet, starts to serve his country in a patriotic fervour after 9/11, and becomes a whistleblower when he notices that the US has gone too far in violating privacy in the name of security. On top of that, the book describes a paradox that a Data Scientist can easily relate to.

The systems that collect data about one’s browsing (basically anything you do on the internet) were an engineering masterpiece. They surely did something the NSA had no mandate for, but when you are building something brilliant, it is easy to miss the big picture and end up handing great tools to malignant actors.

Think about this in terms of machine learning! I am quite sure, although I cannot know, that the mass surveillance network behind the Chinese Social Credit System makes use of some state-of-the-art Deep Learning concepts. It may even do some things more brilliantly than publicly available research would suggest. But the purpose it is used for raises, at best, some very serious questions about individual rights. As IT professionals, we cannot afford to miss the big picture, and we have to be mindful of the consequences of our work!

When discussing the threats of massive private data collection by governments, Mr. Snowden makes some controversial statements about data science as well. Firstly, he incorrectly suggests that machine learning models in general are total black boxes whose decisions cannot be explained afterwards, and he uses this to argue that algorithms now make, in an obscure way, decisions that people should make in a transparent way. Secondly, he states that recommendations are just a way to pressure an individual into buying popular products. I aim to argue against both of these statements.

Model explainability

There is an example in the book about COMPAS, a widely used risk-assessment algorithm in the judicial system of the USA. The point made is that an algorithm made a decision with a substantial effect on someone’s life, and neither we nor the algorithm can explain why. I think this is an inherently wrong and ill-disposed point of view.

Some models are explainable by nature, which is one of the main reasons practitioners use them. Take linear regression for a regression problem: the product of a feature’s value and its corresponding coefficient (beta) is the amount that feature contributed to the predicted target value. In classification problems, the widely used logistic regression behaves almost the same way, since it is a linear model whose output is passed through the logistic function: each product is a contribution to the log-odds of the predicted class rather than to the prediction itself.
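To make the linear regression case concrete, here is a minimal sketch in plain Python. The coefficients, the intercept and the house-price setting are all made up for illustration; they do not come from any real fitted model.

```python
# Hypothetical coefficients of an already fitted house-price model
# (illustrative numbers only, not estimated from real data).
INTERCEPT = 50_000.0
COEFFICIENTS = {"area_m2": 1_200.0, "rooms": 8_000.0, "age_years": -500.0}


def explain_linear_prediction(features, coefficients=COEFFICIENTS, intercept=INTERCEPT):
    """Return the prediction and the additive contribution of each feature."""
    # Contribution of a feature = its coefficient (beta) times its value.
    contributions = {name: coefficients[name] * features[name] for name in coefficients}
    # The prediction is the intercept plus the sum of all contributions.
    prediction = intercept + sum(contributions.values())
    return prediction, contributions


house = {"area_m2": 80.0, "rooms": 3.0, "age_years": 20.0}
prediction, contributions = explain_linear_prediction(house)

for name, value in contributions.items():
    print(f"{name:>10}: {value:+,.0f}")
print(f"{'prediction':>10}: {prediction:,.0f}")
```

Every number in the output can be traced back to one feature, which is exactly what makes this kind of model easy to defend. For a logistic regression the same sum would be read as log-odds before it is turned into a probability.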

A decision tree produces the exact series of decisions the algorithm learned to be useful for determining the target value. However, bagging and boosting algorithms make use of numerous trees built independently or sequentially (Random Forest, XGBoost, Extremely Randomized Trees, LightGBM, etc.). These ensembles of voting trees, or the high-dimensional data in the case of Support Vector Machines, are harder to interpret or visualize concisely. Moreover, in a deep Neural Network millions of matrix multiplications take place, and keeping track of them all sounds intimidating, of course.
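As a rough illustration of why a single decision tree is easy to read, the sketch below walks a sample through a tiny tree and records every decision on the way. The tree, its thresholds and the loan-approval setting are hypothetical and written by hand; a real tree would be learned from data.

```python
# Hypothetical, hand-written tree for a loan decision (not learned from data).
TREE = {
    "feature": "income", "threshold": 30_000,
    "left": {"leaf": "reject"},
    "right": {
        "feature": "debt_ratio", "threshold": 0.4,
        "left": {"leaf": "approve"},
        "right": {"leaf": "reject"},
    },
}


def predict_with_path(tree, sample):
    """Return the leaf label and the list of decisions that led to it."""
    path = []
    node = tree
    while "leaf" not in node:
        name, threshold = node["feature"], node["threshold"]
        value = sample[name]
        if value <= threshold:
            path.append(f"{name} = {value} <= {threshold}")
            node = node["left"]
        else:
            path.append(f"{name} = {value} > {threshold}")
            node = node["right"]
    return node["leaf"], path


decision, path = predict_with_path(TREE, {"income": 45_000, "debt_ratio": 0.25})
print(decision)
for step in path:
    print("  because", step)
```

An ensemble repeats this for hundreds of trees and combines the results, which is why the same kind of trace quickly stops being readable.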

However, there are several methods that make these algorithms more transparent. This article shows how deep neural networks can be more interpretable in breast cancer research. Another great article elaborates on different model explainability techniques: permutation importance, partial dependence plots and SHAP values. There is room for improvement in non-technical, human-readable interpretations of complex machine learning models, but there are techniques to explain why a model predicted a given output.
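Of the techniques mentioned above, permutation importance is the simplest to sketch: shuffle one feature column, and see how much worse the model gets. The version below is a minimal plain-Python sketch under my own assumptions (a toy "black box" function, mean squared error as the metric, made-up data); it is not the implementation used in the linked articles.

```python
import random


def mean_squared_error(model, rows, targets):
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, n_repeats=5, seed=0):
    """Average increase in error after shuffling each feature column."""
    rng = random.Random(seed)
    baseline = mean_squared_error(model, rows, targets)
    importances = []
    for col in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            # Rebuild the dataset with only this one column permuted.
            permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
            increases.append(mean_squared_error(model, permuted, targets) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def black_box(row):
    # Stand-in for any trained model: it only ever looks at the first feature.
    return 3.0 * row[0]


# Toy data: the target depends on feature 0; feature 1 is pure noise for the model.
rows = [[float(i), float((i * 7) % 5)] for i in range(20)]
targets = [3.0 * r[0] for r in rows]

importances = permutation_importance(black_box, rows, targets)
print(importances)
```

The ignored feature gets an importance of zero while the one the model relies on gets a clearly positive score, and the function never had to look inside the model. In practice one would probably reach for a library version, such as `permutation_importance` in `sklearn.inspection`, rather than hand-rolling it.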

Consequently, if a model is not explained well, that almost certainly stems from an omission or failure of a human actor. On top of that, the accusation that algorithms are biased with respect to socio-economic factors appears increasingly often. It is important to note, again, that this is a failure of the modeller, not the model. The data these models are trained on are reported to contain bias, and accounting for that is a challenging task that we as modellers have to overcome. Luckily, the theoretical foundations and tools exist to assess the “fairness” of algorithms, for example between two racial groups.
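One way such a fairness check could look is to compare the false positive rate of a classifier between two groups. The sketch below is only an illustration: the labels, predictions and group names are invented, and false positive rate parity is just one of several fairness criteria one might choose.

```python
def false_positive_rate(y_true, y_pred):
    """Share of truly negative cases that were predicted positive."""
    false_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    negatives = sum(1 for t in y_true if t == 0)
    return false_positives / negatives if negatives else float("nan")


def fpr_by_group(y_true, y_pred, groups):
    """False positive rate computed separately for every group label."""
    rates = {}
    for group in sorted(set(groups)):
        idx = [i for i, g in enumerate(groups) if g == group]
        rates[group] = false_positive_rate(
            [y_true[i] for i in idx], [y_pred[i] for i in idx]
        )
    return rates


# Invented data: 1 = flagged as high risk, 0 = not flagged.
y_true = [0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1]
y_pred = [1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0]
groups = ["A"] * 6 + ["B"] * 6

rates = fpr_by_group(y_true, y_pred, groups)
gap = abs(rates["A"] - rates["B"])
print(rates, gap)
```

A large gap means people in one group who did nothing wrong are flagged more often than people in the other, which is something the modeller can measure and then act on.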

Advertisements and recommendations

A second argument I did not like in the book was about recommendations in general. The author states that recommendations are just about softly pressuring the customer to buy what others have bought. I think this argument misses the real point here.

There are advertisements everywhere. I would certainly agree that advertisements, apart from conveying information about a product, are a means of putting some pressure on people to buy it. Nonetheless, advertisements are natural and necessary in a market-driven economy, and in a world packed with so many products and services.

But if we accept the premise that some sort of advertising is going to exist, which kind would you prefer? The kind with no personalization whatsoever, or the kind where the advertisers’ goal of making you buy happens to result in more relevant advertisements for products that you may actually need? I’d prefer the latter. A sophisticated recommender system takes your history into account, along with the histories of other people whose records are similar to yours. If done right, its recommendations are much more than just popular products.
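To show the difference from a plain popularity list, here is a minimal sketch of user-based collaborative filtering in plain Python. The users, the purchase histories and the choice of Jaccard similarity are my own assumptions for illustration; production recommender systems are considerably more involved.

```python
def jaccard(a, b):
    """Overlap between two purchase histories, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(user, histories, top_n=3):
    """Score unseen items by the similarity of the users who bought them."""
    own = histories[user]
    scores = {}
    for other, items in histories.items():
        if other == user:
            continue
        similarity = jaccard(own, items)
        if similarity == 0.0:
            continue  # people with nothing in common do not get a vote
        for item in items - own:
            scores[item] = scores.get(item, 0.0) + similarity
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


# Invented purchase histories.
histories = {
    "ann": {"tent", "hiking boots"},
    "bob": {"tent", "hiking boots", "camp stove"},
    "cat": {"tent", "sleeping bag"},
    "dan": {"phone case", "charger"},
    "eve": {"phone case", "earbuds"},
    "fay": {"phone case", "charger", "earbuds"},
}

print(recommend("ann", histories))
```

In this toy data the phone case is the most popular product overall, yet it is never suggested to ann, because nobody with a history like hers bought one.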

Conclusions

In general, I really liked the book. I also admire the bravery of Mr. Snowden, who started a discussion about privacy and about the trade-off between privacy invasion and crime prevention. But I also think that the book expresses a negative attitude towards everything connected with using large amounts of data. In contrast, I believe that statistical models built on top of massive datasets can greatly benefit humanity if they are used for the right purposes, transparently and responsibly.

ML&NLP Engineer @ Bold360AI. Text mining and predictive statistics enthusiast.
