Bayesian Networks: Combining Machine Learning and Expert Knowledge into Explainable AI

Markus Dollmann
Eliiza-AI
7 min read · Jul 3, 2020

Modern machine learning models are often hard-to-explain black boxes: the inputs are known, but the path from input to prediction is opaque. If the dataset is also small or limited, it becomes hard to extract meaningful results from the data and to have high confidence in them.

Encoding expert knowledge into such models is hard without detailed, labour-intensive feature engineering.

At Eliiza, in contrast, we aim to create bespoke solutions tailored to your specific needs. One of the methods in our arsenal is the use of Bayesian Networks. This technique allows for quick, good results as well as direct expert input, resulting in better models even on small datasets. Moreover, querying a Bayesian Network provides immediate insight into the importance and influence of each variable on a specific outcome.

Confidence in results is essential, especially for important business decisions. Bayesian Networks provide this confidence through the intrinsic calculation of confidence scores; most machine learning methods cannot, and instead require costly post-hoc computation of confidence scores.

The results and code snippets discussed here can be found in this notebook/repo.

Introduction to Bayesian Networks and Graphs

Bayesian Networks operate on graphs, which are objects consisting of “edges” and “nodes”. The image below shows a graph describing the situation around lunch time with three nodes (hungry, cooking, lunch time), and edges between them (arrows).

A simple graph created around lunch time.

This graph is not a good description of reality, as being hungry does not depend on it being lunch time (it could also be dinner time).

Implementing this “expert” knowledge, the graph can be turned into a directed acyclic graph (DAG). The expert knowledge will have to stand the test of experiment though, and if it does not adequately encode the problem at hand, it will have to be altered.

Here, this graph encodes that cooking something may mean it is lunch time, but it being lunch time does not mean you are cooking something — you could also be eating out or skipping lunch.

Example of a directed acyclic graph (DAG) for our very specific scenario around lunch time.

In summary, unlike most machine and deep learning methods, Bayesian Networks allow for immediate and direct expert knowledge input. This knowledge is used to control the direction and existence of edges between nodes, therefore encoding knowledge into a directed acyclic graph (DAG).

The Joint Probability and Conditional Probability Distributions

Having built a graph to match our problem, we can now encode the problem in probabilities, the basis of which will be the joint probability distribution:

p(h,c,lt) = p(lt|h,c)*p(c|h)*p(h)

which says that the probability of “hungry ( h ) and cooking ( c ) and lunch time ( lt )” equals the probability of it being lunch time given I am hungry and cooking, times the probability I am cooking given I am hungry, times the probability I am hungry. This may sound complex, yet this single equation encodes our entire problem, nodes and edges, and can be used to create inferences on our graph and counterfactual analyses (see below).

Building a joint probability distribution covering all the different cases is tedious and expensive, whereas working with the individual conditional probability distributions is a lot quicker and easier, especially as Bayes' theorem can be employed to simplify some terms.
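As a minimal sketch of this factorisation, we can evaluate the chain rule directly. The probabilities below are made up for illustration, not taken from the article's data:

```python
# Hypothetical conditional probabilities for the lunch-time graph.
p_h = 0.6                                # p(hungry)
p_c_given_h = {True: 0.5, False: 0.1}    # p(cooking | hungry)
p_lt_given_hc = {                        # p(lunch time | hungry, cooking)
    (True, True): 0.8,
    (True, False): 0.4,
    (False, True): 0.6,
    (False, False): 0.2,
}

def joint(h: bool, c: bool, lt: bool) -> float:
    """p(h, c, lt) = p(lt | h, c) * p(c | h) * p(h), via the chain rule."""
    ph = p_h if h else 1 - p_h
    pc = p_c_given_h[h] if c else 1 - p_c_given_h[h]
    plt = p_lt_given_hc[(h, c)] if lt else 1 - p_lt_given_hc[(h, c)]
    return plt * pc * ph

# The eight joint probabilities sum to 1, as any distribution must.
total = sum(joint(h, c, lt) for h in (True, False)
            for c in (True, False) for lt in (True, False))
```

Note that we only ever need to store the three small conditional tables; the full joint table is reconstructed on demand.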

Inference on Bayesian Networks

Having encoded our expert knowledge, the next step is to ready ourselves for some inferences, that is to make predictions.

Imagine I gathered some data over the course of a week, resulting in the table below:

Experiment results for five days, listed in the form of a truth table.

From this table, we need to create the conditional probability tables p(lt|h,c), p(c|h), and p(h), to then create the joint probability distribution.

Conditional probability tables for p(lt|h,c), p(c|h) and p(h)

These probability tables describe the relations between the different variables. Plugging these values into the joint probability function from above, we can calculate, for example, the probability of being hungry while not cooking and it not being lunch time:

For three variables and a small dataset, it is relatively straightforward to calculate this “by hand”, but for larger problems we will want to automate this with our Python package of choice (e.g. pymc3, pomegranate, causalnex, or others).
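The conditional probability tables above are just counts and ratios over the observed days. A sketch of that estimation, using a hypothetical week of observations (the article's actual table is shown as an image, so these rows are invented):

```python
# Hypothetical observations as (hungry, cooking, lunch_time) per day.
observations = [
    (True,  True,  True),
    (True,  False, True),
    (False, False, False),
    (True,  True,  True),
    (False, False, True),
]

# p(h): fraction of days I was hungry.
p_h = sum(h for h, _, _ in observations) / len(observations)

# p(c | h): among the hungry days, the fraction on which I cooked.
hungry_days = [(h, c) for h, c, _ in observations if h]
p_c_given_h = sum(c for _, c in hungry_days) / len(hungry_days)
```

The remaining table, p(lt | h, c), is built the same way by conditioning on each (hungry, cooking) combination.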

Applying Bayesian Networks to a “Real World” Problem

As a real world application, let’s take a fictitious dataset that reduces job applications to numerical scores for:

  • Education level
  • Work experience
  • Drive
  • Social commitment
  • The layout and font used in the application (shortened to font hereafter), and
  • Whether the respective candidate was hired.

The dataset encodes a bias with respect to font, making applicants much more (or less) likely to be hired for a certain score, and is shown below.

Histograms of the different variables in the synthetic dataset, and on the right the heatmap of their correlations.

A quick look at the correlation matrix, which contains the linear correlation between each pair of variables, shows that most variables are independent (values <0.05), with the exception of the `hired/font` and the `education&social_commitment/drive` pairs. This matches the intent with which the dataset was prepared.

Creating a directed acyclic graph (DAG) to describe data exactly can be computationally very expensive, as the structure-finding problem scales super-exponentially with the number of variables/nodes. The NOTEARS algorithm circumvents this issue, reducing the computational complexity to about O(n³). The DAG used for this dataset is shown below. In order to speed up our calculations, the continuous distributions were discretised into up to five levels.
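To get a feel for why exhaustive structure search is infeasible, we can count the candidate structures: Robinson's recurrence gives the number of labelled DAGs on n nodes, which grows super-exponentially.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def num_dags(n: int) -> int:
    """Number of labelled DAGs on n nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * num_dags(n - k)
               for k in range(1, n + 1))

# Already at 6 nodes (our job-application example) there are
# roughly 3.8 million candidate structures to score exhaustively.
six_node_dags = num_dags(6)
```

NOTEARS sidesteps this enumeration entirely by recasting the acyclicity constraint as a smooth function, turning structure learning into a continuous optimisation problem.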

DAG of the job candidate dataset.

Fitting the conditional probability distributions on a training set comprising 85% of the data yields a model with an area under the curve of ~0.95.

The area under the curve (AUC) is a good measure of classification model quality for a balanced dataset. A perfect model has an AUC of 1, which here would imply every candidate’s hireability is correctly classified. For comparison, a model classifying candidates at random, e.g. by flipping a coin, yields an AUC of 0.5; an AUC below 0.5 means the model is worse than random guessing. The AUC also allows for an easy comparison between different models, where the better model covers the larger area.

ROC curve with indicated AUC=0.95 and the random choice curve with AUC=0.5.
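Equivalently, the AUC is the probability that a randomly chosen positive sample is scored above a randomly chosen negative one. A small sketch of that pairwise view, with made-up scores rather than our model's actual outputs:

```python
def auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly
    (ties count half). Equivalent to the area under the ROC curve."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    pairs = [(p, n) for p in pos for n in neg]
    correct = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs)
    return correct / len(pairs)

# Made-up scores: the model mostly ranks hired (1) above not hired (0).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
```

Here `auc(scores, labels)` counts one mis-ranked pair out of nine, so the model sits well above the random-choice diagonal.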

At this point, the Bayesian nature of the problem enters: it is not just possible to accurately determine a candidate’s hireability, but also the probabilities of belonging to either category (hired/not hired). A result from such a query would look similar to this:

A query of the Bayesian network to determine the likelihood of hypothetical candidate `3917` getting hired:

As the entire distribution of values is approximated, a confidence score between 0 and 1 is given for each category. A score of 1 for `hired` would imply the candidate will definitely be hired; unfortunately for candidate 3917, their score of 1 for `not hired` means they will not be.
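Under the hood, such a query amounts to summing the joint distribution over the unobserved variables and normalising. A minimal sketch of exact inference by enumeration, using the earlier lunch-time network with illustrative (made-up) probabilities:

```python
# Illustrative conditional probabilities, not fitted values.
p_h = 0.6                                # p(hungry)
p_c_h = {True: 0.5, False: 0.1}          # p(cooking | hungry)
p_lt_hc = {(True, True): 0.8, (True, False): 0.4,
           (False, True): 0.6, (False, False): 0.2}

def joint(h, c, lt):
    """p(h, c, lt) via the chain-rule factorisation."""
    return ((p_lt_hc[(h, c)] if lt else 1 - p_lt_hc[(h, c)])
            * (p_c_h[h] if c else 1 - p_c_h[h])
            * (p_h if h else 1 - p_h))

def query_hungry(evidence_c, evidence_lt):
    """p(h | c, lt): score both values of h, then normalise."""
    scores = {h: joint(h, evidence_c, evidence_lt) for h in (True, False)}
    z = sum(scores.values())
    return {h: s / z for h, s in scores.items()}

# It is lunch time and I am not cooking — how likely am I hungry?
posterior = query_hungry(evidence_c=False, evidence_lt=True)
```

The returned dictionary is exactly the kind of per-category confidence score the candidate query above produces; packages like causalnex wrap this enumeration (and smarter algorithms) behind a query API.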

Counterfactual Analysis and Interventions

A counterfactual test for this dataset shows the impact of the font variable on getting hired. For individual candidates, the impact can be big.

Preventing bias in a machine learning model (think sexism, racism, discrimination, etc.) can be an arduous task, and spurious non-causal correlations need to be prevented and excluded from the modelling. Similarly, concept drift proves an issue for these models, meaning that your model may get increasingly inaccurate over time.

By using Bayesian Networks, all these issues can be addressed through counterfactual analysis. For most machine learning techniques, counterfactual analysis can only be performed in a sample by sample pseudo-manner, that is, by altering one input variable and looking at the effect on the outcome for a single sample. As the (joint) probability distributions are known in this case, we can elegantly change the entire underlying distribution in a counterfactual intervention. For example, by changing the distribution of values for font towards more favourable values we see a notable change in hiring:

A counterfactual intervention for the variable `font`: Changing the distribution towards a much better font (3) results in an overall 51% higher chance of people getting hired, and a 39% lower chance of not getting hired.
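A deliberately simplified sketch of such an intervention, reduced to the two variables involved and with made-up numbers (in the fitted network the change propagates through the full joint distribution, e.g. via causalnex's `InferenceEngine.do_intervention`):

```python
# Hypothetical marginal over font scores and p(hired | font);
# illustrative values, not the article's fitted distributions.
p_font = {1: 0.4, 2: 0.4, 3: 0.2}
p_hired_given_font = {1: 0.2, 2: 0.4, 3: 0.8}

def p_hired(font_dist):
    """Marginal hiring probability under a given font distribution."""
    return sum(font_dist[f] * p_hired_given_font[f] for f in font_dist)

before = p_hired(p_font)
# Intervention: force every application to use the best font (3).
after = p_hired({1: 0.0, 2: 0.0, 3: 1.0})
```

Comparing `before` and `after` quantifies the effect of the intervention on the whole population, rather than on one altered sample at a time.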

Through counterfactual analyses, all predictions of the Bayesian network can be fully understood. The network and associated predictions can also adapt quickly to changes in circumstances or changes of the underlying distributions, for example due to concept drift.

Conclusion

Bayesian Networks fill an important gap in the machine learning world, admirably bridging the divide between simple and fast models (linear, logistic, …) that lack probability information (read: the certainty of a prediction), and computationally heavy, data-hungry methods like deep Bayesian neural networks. Moreover, they offer the rare opportunity to study causality. Performing counterfactual analyses (“If I change variable A, what happens to the overall outcome?”) allows us to study data in a way that is not yet open to powerful deep learning methods (although work on this is happening).

In our example of the job candidate dataset, a counterfactual analysis showed that the single most important parameter is font, which has an outsized effect on hireability. This reveals the bias encoded in a variable that should have no effect on an application's outcome at all.
