Backstage: The Hitchhiker’s Guide to Responsible Machine Learning
A month ago we released an educational comic in the area of Responsible Machine Learning/Explanatory Model Analysis titled “The Hitchhiker’s Guide to Responsible Machine Learning”. In 52 pages we present methods, tools and good practices for building and verifying predictive models, using covid mortality analysis as the running example. Although the book itself is not long and can be read in two hours, the idea for this comic matured over several years.
Below I share some of my thoughts on the creation of this (comic) book. The English-language version is available online at https://betaandbit.github.io/RML/. The Polish translation will appear online and in bookstores in January 2022.
The Experience Economy
One of the books I recently read was The Experience Economy by Joseph Pine and James Gilmore. It is a very interesting read that shows the growing role of experience in how people choose products and services. In education, we are seeing the growing popularity of edutainment, which combines education with entertainment. Of course, there is always the question of how much education is actually there, and what kind of entertainment we are talking about: intellectual entertainment or just jokes. Even if not all the offers in this area are palatable, sometimes you will come across a gem. For me, Hans Rosling’s work, from his TED talks to his book Factfulness, is an example of how the two areas can be skilfully combined, telling the story of seemingly uninteresting statistics in an understandable and engaging way.
But what will happen next? Is it possible to plan an experience in which the participant not only listens to the story but also takes part in it? The RPG industry has shown that it is possible. Not only can we read about the adventures of Geralt the Witcher in books, we can also experience some of these adventures in a computer game. In classic textbooks, exercises are the trigger for such experiences. Not only do we read about facts, but by doing the exercises we can experience and understand more deeply the issue being discussed.
We decided on a similar solution in the comic book “The Hitchhiker’s Guide…”. The excerpts of the story are accompanied by sample code and data that can be run in the R console (and in the future also in Python). This way we don’t have to passively watch the adventures on the pages of the comic, but can look at the data ourselves and try a different model or a different model validation technique.
In fact, based on these examples we ran a whole hands-on workshop at the useR! 2021 conference. For 3 hours the participants went through the same adventures as Beta and Bit, building, verifying and deploying a predictive model.
The Process of Explanatory Model Analysis
The process of building predictive models is surrounded by many myths. One of them is the myth of fully automated model building, according to which one throws data into a tool, clicks a button and, voilà, a big file pops up which is the model.
In the RML comic we try to dispel this myth in several ways. First, we show four iterations of building a model, with each iteration producing an increasingly complex but also more effective model. Second, the first model is created before we have access to new data. Often, especially in medicine, a huge amount of domain knowledge is already available, enough to build the first iteration of the model without any raw data. Third, at each step of the modelling process we learn something new about the problem being analysed, and this new knowledge can be used in the next modelling step.
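The second point can be illustrated with a minimal sketch in Python. It is not the model from the comic: the age bands, the risk values and the small validation set below are all invented placeholders, and the helper names are hypothetical. The idea is only that a table of rates taken from domain knowledge is already a model that can be validated once data arrives.

```python
# Hypothetical age-band risks standing in for published domain knowledge.
# These numbers are placeholders, NOT real epidemiological estimates.
ASSUMED_RISK_BY_AGE = [
    (30, 0.001),           # (upper bound of the age band, assumed risk)
    (50, 0.005),
    (70, 0.03),
    (float("inf"), 0.10),
]


def knowledge_based_risk(age):
    """Return the assumed risk of the first band whose upper bound exceeds `age`."""
    for upper_bound, risk in ASSUMED_RISK_BY_AGE:
        if age < upper_bound:
            return risk
    raise ValueError("age must be a finite number")


def auc(scores, labels):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Made-up validation data: ages and outcomes (1 = died, 0 = survived).
ages = [25, 40, 45, 62, 68, 75, 81, 90]
died = [0, 0, 0, 0, 1, 0, 1, 1]

scores = [knowledge_based_risk(a) for a in ages]
print(f"AUC of the knowledge-based model: {auc(scores, died):.3f}")
```

Such a baseline also gives later iterations something concrete to beat.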
Due to its limited size, the comic does not go very deeply into the mathematical details of the individual methods. On even-numbered pages, the intuitions behind each technique are presented. The whole, however, is based on the textbook “Explanatory Model Analysis”, where you can read in detail how the different methods work and what their advantages and disadvantages are.
The methods we show are often referred to as Interpretable Machine Learning or Explainable Artificial Intelligence. Both of these names are, however, in some sense wrong. Not every model is actually interpretable, and our goal is neither to interpret the model nor to interpret the prediction. Similarly, the term explainable too often shifts the discussion of XAI methods to the psychological basis of explainability. In reality, however, we rarely want the model to explain anything in the way a teacher explains to students or a parent explains to children. Very often what we expect is a justification of the model’s prediction, so that we can question it. Properly naming things helps in understanding them, so in the comic we try to consistently use the term Explanatory Model Analysis to emphasise that we are talking about model understanding, just as Exploratory Data Analysis is about data understanding.
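As a rough illustration of what it means to justify a prediction so that it can be questioned, here is a minimal Python sketch of a ceteris paribus profile, one of the techniques described in the Explanatory Model Analysis textbook. The model below is a hypothetical stand-in with invented coefficients, not the model from the comic, and the function and variable names are assumptions made for this example.

```python
import math


def predict_risk(patient):
    """Hypothetical stand-in for a fitted model; the coefficients are invented."""
    score = -7.0 + 0.07 * patient["age"] + 1.0 * patient["has_comorbidity"]
    return 1.0 / (1.0 + math.exp(-score))


def ceteris_paribus(predict, observation, variable, grid):
    """Predictions for copies of `observation` in which only `variable` changes."""
    profile = []
    for value in grid:
        modified = dict(observation)   # copy, so the original stays untouched
        modified[variable] = value
        profile.append((value, predict(modified)))
    return profile


patient = {"age": 50, "has_comorbidity": 1}
profile = ceteris_paribus(predict_risk, patient, "age", range(20, 101, 20))

for age, risk in profile:
    print(f"age={age:3d}  predicted risk={risk:.3f}")
```

Reading the profile answers a "what if" question about a single patient: how would the prediction change if only the age were different? If the answer contradicts domain knowledge, that is a reason to question the model.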
In the case of this comic, the story was written by life itself. In the first half of 2020, our team, in collaboration with the MOCOS group founded by Professor Tyll Kruger, took part in modelling the mortality of covid infection on the basis of very detailed epidemiological data.
As it turned out, the model that emerged had many more stakeholders than we expected, because not only were epidemiological services interested in it, but also many outsiders were curious or concerned about the complications and possible death in case of infection.
We decided to make the model itself publicly available at https://crs19.pl/. We were surprised to find that it was noticed by major media in Poland and Germany. And if so, what better way to show EMA in action than to base it on a real case where these techniques were used in response to a real need?
In the comic we used the characters Beta and Bit, which I had previously created for data literacy books (most of them are available only in Polish). She is fascinated by maths and statistics; he is a programmer experimenting with machine learning. Together they are a great team to show the value of predictive modelling.
There was not enough space in the comic to show all the models we tested during the actual modelling. In particular, we left out the boosting model with monotonicity constraints and the logistic regression model with cubic splines, both of which gave very promising results. Well, maybe one day there will be a Part II describing other modelling techniques as well.
Let’s go somewhere.