Machine Learning Algorithms: Surprises at Deployment?
--
Machine learning (ML) algorithms are being used to generate predictions in every corner of our decision-making life. Methods range from “simple” algorithms such as trees, forests, naive Bayes, linear and logistic regression models, and nearest-neighbor methods, through improvements such as boosting, bagging, regularization, and ensembling, to computationally intensive, black-box deep learning algorithms.
The new fashion of “apply deep learning to everything” has produced breakthroughs as well as alarming disasters. Is this due to the volatility of deep learning algorithms? I argue it is instead due to the growing divorce between predictive algorithm developers, their deployment context, and their end users’ actions.
Machine learning is based on correlations, not causation. This is the strength of ML and also its weakness. It means we can get good results if we train and evaluate the ML solution in the correct context of its deployment. But if we stray from that context (different training and deployment data; misunderstanding the end user’s actions, ML literacy, motivation, and trust; and so on), we are in for surprises. As in Goethe’s poem “The Sorcerer’s Apprentice” (featured in Disney’s Fantasia), the apprentice learns only to mimic the sorcerer’s actions without understanding them, with disastrous results.
The question we must ask when developing an ML solution is “how will the ML solution be used to generate an action?” Answering it requires understanding how the end user will use the system and the predicted values or scores. For example:
- will they apply the solution to a new type of data?
- will they understand and/or trust the resulting score?
- can they translate the reported algorithm performance level into practical implications? (e.g. costs of over-prediction vs. under-prediction)
- how will the user translate the predicted score into an action? (see the sketch after this list)
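To make the last two questions concrete, here is a minimal sketch in Python (with simulated scores and hypothetical costs, not taken from any real system) of how asymmetric costs of over-prediction versus under-prediction shift the score threshold at which the user should act:

```python
import numpy as np

# Simulated risk scores and outcomes; all numbers are hypothetical.
rng = np.random.default_rng(0)
scores = rng.uniform(0, 1, size=1000)              # predicted risk scores
outcomes = rng.uniform(0, 1, size=1000) < scores   # true outcomes, correlated with the scores

def expected_cost(threshold, cost_false_alarm, cost_missed_case):
    """Average cost when the action is triggered for every score >= threshold."""
    act = scores >= threshold
    false_alarms = np.sum(act & ~outcomes)          # over-predictions
    missed_cases = np.sum(~act & outcomes)          # under-predictions
    return (cost_false_alarm * false_alarms + cost_missed_case * missed_cases) / len(scores)

thresholds = np.linspace(0, 1, 101)

# Compare symmetric costs with missing a true case being 5x as costly as a false alarm.
for c_miss in (1, 5):
    costs = [expected_cost(t, cost_false_alarm=1, cost_missed_case=c_miss) for t in thresholds]
    print(f"cost ratio 1:{c_miss} -> cost-minimizing threshold {thresholds[int(np.argmin(costs))]:.2f}")
```

The specific numbers do not matter; the point is that the “right” action rule depends on cost information that only the end user can supply.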
The ability to answer these crucial questions requires a dialogue between the algorithm developer (data scientist) and the end user, and often also the data collector. It is a difficult dialogue, where the different sides speak different languages, and there can be many misunderstandings. It means the data scientists must immerse themselves in the context of deployment, not only in terms of data, but also in terms of humans and decision makers.
In a recent arXiv paper by a large group of Google researchers (plus two EE/CS professors and a PhD student), “Underspecification Presents Challenges for Credibility in Modern Machine Learning”, the authors report discovering a new underlying cause of deployment surprises: “under-specification”. That is, algorithms that seem equally good at development time (they all give a similar “solution”, and therefore the problem is under-specified) can perform dramatically differently during deployment, in terms of performance on subgroups.
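For a purely illustrative, simulated sketch of what under-specification can look like (this is not the paper’s experiment; all data below are synthetic), consider two equally simple “models”, each using a single proxy feature: they are nearly indistinguishable on overall accuracy but behave very differently on a small subgroup.

```python
import numpy as np

# Simulated illustration of under-specification: two "models" (each relying on
# one proxy feature) reach almost the same overall accuracy, but only one of
# the proxies carries signal inside a small minority subgroup.
rng = np.random.default_rng(1)
n = 100_000
minority = rng.uniform(size=n) < 0.01            # 1% minority subgroup
y = rng.uniform(size=n) < 0.5                    # true binary outcome

# x1 tracks the outcome everywhere; x2 is pure noise inside the minority.
x1 = np.where(rng.uniform(size=n) < 0.9, y, ~y)
x2 = np.where(minority,
              rng.uniform(size=n) < 0.5,
              np.where(rng.uniform(size=n) < 0.9, y, ~y))

for name, pred in [("model A (uses x1)", x1), ("model B (uses x2)", x2)]:
    overall = np.mean(pred == y)
    subgroup = np.mean(pred[minority] == y[minority])
    print(f"{name}: overall accuracy {overall:.3f}, minority-subgroup accuracy {subgroup:.3f}")
```

Aggregate test metrics would barely distinguish the two models, yet their consequences for the minority subgroup during deployment differ dramatically.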
Is the above paper’s discovery a new insight? Are deployment surprises a specialty of deep learning? Is “under-specification” a deep learning problem? Is under-specification in fact a problem in prediction?
That predictive algorithms can perform dramatically differently on data subgroups is well known. Simpson’s Paradox is an extreme example, where a correlation between an input and an output changes direction when examining subgroups of the data. The larger the number of predictors, the larger the chance of encountering a Simpson’s Paradox. Predictive models are also easily “fooled” when the training dataset includes a minority group with a different input-output relationship than the rest of the training data. Models are fooled because the metrics used to train and evaluate algorithms give equal weight to each observation (e.g. least squares or maximum likelihood for training; RMSE and accuracy metrics for evaluation).
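Here is a minimal simulated sketch of Simpson’s Paradox (all numbers are synthetic, chosen only for illustration): the pooled correlation between input and output is positive, yet within each of the two subgroups it is negative.

```python
import numpy as np

# Simulated Simpson's Paradox: within each subgroup, y decreases with x,
# but the subgroup with larger x also has a much larger y on average,
# so the pooled correlation flips to positive.
rng = np.random.default_rng(2)
n = 5_000
group = rng.integers(0, 2, size=n)               # two subgroups, 0 and 1

x = rng.normal(loc=group * 4.0, scale=1.0)       # group 1 sits at higher x values
y = -x + group * 10.0 + rng.normal(scale=1.0, size=n)

print("overall correlation :", round(np.corrcoef(x, y)[0, 1], 2))
for g in (0, 1):
    mask = group == g
    print(f"group {g} correlation:", round(np.corrcoef(x[mask], y[mask])[0, 1], 2))
```

In this simulated example, a model trained on the pooled data with an equal-weight loss would learn the positive relationship, even though it is the wrong one within each subgroup.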
While the abstract of the Google researchers’ paper concludes with a vague sentence that might mislead readers into thinking there is a technological solution (“Our results show the need to explicitly account for underspecification in modeling pipelines that are intended for real-world deployment in any domain”), in several places in the 59-page paper the authors conclude:
“This confirms the need to tailor and test models for the clinical settings and population in which they will be deployed.”
or
“This is compatible with previous findings that inputting medical/domain relational knowledge has led to better out of domain behaviour…, performance… and interpretability… of ML models.”
The paper closes with a proposal to circumvent the need for a context-specific dialogue between the data scientist and the end user by building models that favor “predictors that approximately respect causal structure”. While using causal structure is feasible and useful in some domains, especially in low-dimensional problems, the areas where ML shines are exactly those where causal relationships are difficult to specify. Explanation and prediction both have their merits, and predictive solutions can still be sound and useful even without underlying causal modeling if developers and users collaborate and communicate during the entire loop of design, testing, deployment, and post-deployment feedback.
At their core, deployment surprises are a failure to understand the limitations of ML models, or even of statistical models. These models all rely on many human choices: by data scientists, data collectors, data engineers, the people on whom data are collected, the end users (e.g. decision makers), and more.
In judicial decision making, a growing number of studies have identified issues related to deployment disasters, triggered by the 2016 ProPublica report on glaring errors of the COMPAS system used in several judicial decision-making contexts. Many of the issues stem from discrepancies between the data used to train the algorithm and the data encountered during deployment, but many other context-related issues surface when we ask “how will the ML solution be used to generate an action?” We can then ask what data the judicial decision maker will feed into the system and compare that to the inputs used to train the algorithm (different populations, different definitions of “recidivism”, etc.). We can compare the action that will be triggered (e.g. a parole decision) to the action used to define the output in the training data. These are examples of the critical knowledge a dialogue would uncover.
In our recent paper “The Hidden Inconsistencies Introduced by Predictive Algorithms in Judicial Decision Making”, we uncover four inconsistencies likely to be hidden from end users: judges, parole officers, lawyers, and other decision makers. These inconsistencies involve different human elements (data scientists, data engineers, data subjects, data collectors, judicial decision makers). They include the choice of measured outcome and predictors, the choice and quality of training data, the predictive accuracy for subgroups (and the reference class problem), and the communicated risk scores. None of these can be solved by removing the human from the loop: it is impossible to identify a causal structure underlying this complex and dynamic process, and aside from causal structure, there are serious issues of measurement.
The bottom line: embedding predictive ML algorithmic solutions in human decision-making applications can be useful and stable, but that requires close and ongoing dialogue, collaboration, and understanding among the data scientists, the end users, and the other humans involved.
Note: This article does not deal with the ethical question of whether ML algorithms should be used in decision making. Rather, it focuses on “surprises” that can appear at deployment.