Let’s treat machine learning models like human brains

Here’s a synthesis of my talk at the recent debate on big data governance organised by the Big Innovation Centre in association with APPG AI.

My baby boy loves playing with toys. Until recently, when he dropped his favourite teddy out of view it was as if it had disappeared off the face of the earth. Then he started to learn — through observing many similar scenes — the abstract concepts of object permanence and gravity. Now he looks over the edge of his highchair when he drops any of his toys.

To draw a parallel with machine learning: when your data is fed into a machine learning algorithm, patterns in the data are transformed into an abstract set of rules in the model. The model is no longer about you — something about you just made it cleverer.

Privacy, bias and forgetting

There should be significantly less concern about the privacy of personal information living in machine learning models than in databases. Why? Beyond data, algorithms and compute power, models work by building highly efficient inferences and assumptions about the world. Feature engineering and the layered composition of deep learning lead to such high levels of abstraction that it is practically impossible to recover personal information by interpreting the raw logic.

Credit: “Why Should I Trust You?”: Explaining the Predictions of Any Classifier — Ribeiro, Singh & Guestrin (2016)

But there are valid concerns about machine learning models making biased inferences. For example, a deep learning model trained to distinguish wolves from huskies learned to rely on snow in the background, classifying huskies photographed in snow as wolves. Bias can persist even when sensitive fields are removed from the data, but there are methods for detecting and correcting it, which often result in a fairer model at the cost of lower performance metrics.
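To illustrate the kind of detection method meant here, below is a minimal sketch of one common bias check: comparing the rate of positive predictions between groups, sometimes called the demographic parity gap. The predictions, group labels and warning threshold are all hypothetical.

```python
# Minimal sketch of a demographic parity check: does the model give
# positive outcomes to different groups at noticeably different rates?
# All data below is illustrative, not from the article.

def positive_rate(predictions, groups, group):
    """Share of positive (1) predictions among members of `group`."""
    in_group = [p for p, g in zip(predictions, groups) if g == group]
    return sum(in_group) / len(in_group)

def demographic_parity_gap(predictions, groups):
    """Largest difference in positive-prediction rate between groups."""
    rates = [positive_rate(predictions, groups, g) for g in set(groups)]
    return max(rates) - min(rates)

# Hypothetical loan-approval predictions for two groups, A and B.
preds  = [1, 1, 1, 0, 1, 0, 0, 0, 1, 0]
groups = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"]

gap = demographic_parity_gap(preds, groups)
print(f"demographic parity gap: {gap:.2f}")  # group A: 0.80, group B: 0.20
if gap > 0.2:  # illustrative threshold
    print("warning: model outcomes differ substantially between groups")
```

Correcting such a gap (e.g. by reweighting training data or adjusting decision thresholds per group) is what typically trades away some raw performance for fairness.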

The right to be forgotten allows for your personal data to be removed from the database so that it’s no longer accessible in the system. But the models that were trained on your data should not have to be retrained immediately in all cases. For example, the safety improvements gained from the miles you drove in a self-driving car shouldn’t simply be forgotten when you erase your personal data from the database. Large models are also slow and expensive to retrain and redistribute – e.g. the Inception v3 image classification model cost an estimated $30,000 and around two weeks to train on 8 Tesla K40 GPUs.

You can’t force (without painful therapy) a human brain to delete experiences that shaped its consciousness and then calibrate its understanding of the world accordingly. So, if we want to use machine learning to solve the world’s biggest problems, why should we treat large-scale machine learning models differently?

Interpreting the GDPR right to an explanation

Statistical algorithms like random forests are less accurate but can explain what they’re doing in a more interpretable way, whilst neural networks are more accurate but don’t offer meaningful explanations. There is a trade-off between accuracy and explainability.

The question of how to interpret complex models leads to some confusion around the GDPR right to an explanation. But a fear of uninterpretability should not prevent organisations from deploying machine learning, because the definition of “meaningful information” is itself open to interpretation. Sandra Wachter, a lawyer and researcher in digital ethics at the Oxford Internet Institute, wrote in a recent paper: “GDPR mandates that data subjects receive meaningful, but properly limited, information (Articles 13–15) about the logic involved, as well as the significance and the envisaged consequences of automated decision-making systems, what we term a ‘right to be informed’”.

Instead of locking down a definition for “meaningful information” in the Data Charter (the UK’s post-Brexit adoption of GDPR), regulators should be allowed to tighten it on a more granular basis depending on the industry and use case. There is a significant amount of research into explaining model black boxes, including open-source tools such as LIME. Providing a score for the impact that each feature had on a prediction goes beyond GDPR requirements and is a more than sufficient level of detail for most use cases.
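To show what per-feature impact scores might look like, here is a toy sketch using a hypothetical linear credit-scoring model, where each feature’s contribution is simply its weight times its value. LIME itself is more sophisticated, fitting a local surrogate model around each individual prediction, but the shape of the output is similar: a ranked list of features and their signed impact on one decision.

```python
# Hypothetical linear scoring model; the feature names and weights
# are invented purely for illustration.
weights = {"income": 0.5, "debt": -0.8, "years_employed": 0.3}
bias = 0.1

def predict_with_explanation(features):
    """Return a score plus the signed contribution of each feature."""
    contributions = {name: weights[name] * value
                     for name, value in features.items()}
    score = bias + sum(contributions.values())
    # Rank features by absolute impact, largest first.
    ranked = sorted(contributions.items(), key=lambda kv: -abs(kv[1]))
    return score, ranked

score, impacts = predict_with_explanation(
    {"income": 1.2, "debt": 2.0, "years_employed": 0.5})
print(f"score = {score:.2f}")
for name, impact in impacts:
    print(f"  {name}: {impact:+.2f}")
```

An applicant could be told not just the decision, but that (in this invented example) debt pulled the score down far more than income pushed it up — precisely the kind of explanation the article argues exceeds what GDPR demands.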

Final thoughts

Here are some points to consider in the debate on data governance and AI:

  1. We should treat a machine learning model like a human brain — you can’t unlearn abstract concepts.
  2. GDPR requires companies to inform individuals of the existence of automated decision-making, but “meaningful information” is open to interpretation.
  3. Data privacy and compliance are essential, but improving data quality and availability will make or break companies in the 4th industrial revolution.
  4. The debate will shift from data governance to competition dominance as large companies use the data network effect to increase monopoly power, driving further cycles of convergence and data concentration.

Thanks to Vanessa Barnett from Keystone Law for her input.
