Ben Horsburgh — Jr Principal ML Engineer, QuantumBlack
In December, the QuantumBlack team were fortunate enough to attend NeurIPS 2019 in Vancouver, where we hosted an expo workshop exploring how to deploy causal inference and reinforcement learning to generate models which consider cause and effect.
This session proved very popular and so we wanted to share the key elements with those who were unable to attend. Across the next two Medium articles we will explore how data scientists can harness both Causal Reasoning and Reinforcement Learning to build models which respect cause and effect.
The Causal Blind Spot
Advanced analytics is often deployed to decide where to make an intervention in order to influence a target. However, many traditional ML methodologies, from linear regression to deep learning, do not consider causality and instead model only correlation between data points. They may identify that a relationship exists between variables without defining what that relationship is or how the variables influence each other.
This can have a drastic impact on the model's suggested intervention, diluting its effectiveness or even producing entirely irrelevant recommendations. For example, a non-causal model aiming to mitigate drought may recognise a relationship between rising drought and rising ice cream sales, but may spuriously conclude that banning ice cream would mitigate drought.
In causal modelling, ML is used to create an initial structure, visualised as a graphical network, which highlights the perceived relationship between features. This is then reviewed by domain experts who can update the structure to highlight how each feature influences the other — in our example, an irrigation expert would highlight that accessible drinking water would be a far more accurate driver to mitigating drought than banning ice cream.
This process is known as Causal Reasoning, and this article will cover each of the three phases required to deploy it.
Phase 1: Structure Learning
Causal models need to be informed of the causal structure between features. In an ideal world a domain expert would input this structure, but this is often unfeasible: a model with just 50 variables would require just under 2,500 cause-effect relationships (50 × 49 = 2,450 possible directed edges) to be considered and explained.
Moreover, cause and effect chains make an already time-intensive process even more complex — changes to one feature may impact another, which in turn influences another. It is easy to overlook these chains when building structures by hand, and even easier to mistakenly create cyclical, chicken-egg chains which are then difficult to fix.
Recent advances, particularly the publication of DAGs with NO TEARS at NeurIPS 2018, have improved the efficiency and accuracy of the structure learning algorithms that build these networks, streamlining the process and avoiding cyclical, chicken-and-egg structures. Importantly, these algorithms do not confirm causality; they estimate it. When working with non-experimental data, an iterative, collaborative process is necessary to verify predictions: domain experts must review and verify the structure's causality, cross-referencing relationships against respected sector-specific publications, surveys and wider expert opinion. It is the augmentation of data and method with input from domain experts that allows us to take a step towards a causal interpretation.
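At its core, NO TEARS replaces a combinatorial acyclicity check with a smooth function that is zero exactly when a weighted adjacency matrix describes a DAG, so structure can be learned by continuous optimisation. A minimal sketch of that measure, using the polynomial variant of the paper's tr(e^{W∘W}) − d constraint and toy matrices of our own invention:

```python
import numpy as np

def notears_acyclicity(W: np.ndarray) -> float:
    """Polynomial form of the NO TEARS acyclicity measure:
    h(W) = tr((I + W∘W / d)^d) - d.

    h(W) == 0 exactly when the weighted adjacency matrix W
    describes a directed acyclic graph; h(W) > 0 otherwise.
    """
    d = W.shape[0]
    M = np.eye(d) + (W * W) / d  # W∘W is the elementwise square
    return float(np.trace(np.linalg.matrix_power(M, d)) - d)

# A 3-node DAG: 0 -> 1 -> 2, no cycles
dag = np.array([[0.0, 1.5, 0.0],
                [0.0, 0.0, 2.0],
                [0.0, 0.0, 0.0]])

# Adding the edge 2 -> 0 closes a cycle
cyclic = dag.copy()
cyclic[2, 0] = 0.7

print(notears_acyclicity(dag))     # 0 for a DAG
print(notears_acyclicity(cyclic))  # > 0, penalising the cycle
```

Because this penalty is differentiable, it can be added to a standard regression loss and minimised with off-the-shelf optimisers, which is what makes the approach scale.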
This process helps inform insights — cause-effects that data scientists may find surprising are often well understood by experts, and even those that surprise experts are sometimes well understood by others in their field and can be verified through a search of wider materials.
The learned structure includes nodes (variables that hold information) and edges (directed connections between nodes, which can also hold information). Most structure learning algorithms output edge weights, which are useful for directing conversations between data scientists and experts. Presenting edges from highest to lowest weight makes the review process more efficient, but we should be careful not to attach too much interpretation to the weights themselves: they are usually not probabilities, nor values that humans can interpret directly. Moreover, even low-weight edges can sometimes be important, and statistical testing of their significance is difficult.
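To illustrate, a sketch of that review ordering on a hypothetical edge list; the variable names and weights here are invented for the drought example:

```python
# Hypothetical learned edges as (cause, effect, weight) triples.
edges = [
    ("ice_cream_sales", "drought", 0.3),
    ("rainfall", "drought", -2.1),
    ("irrigation", "drought", -1.4),
]

# Order edges by absolute weight so the strongest candidate
# relationships are put in front of domain experts first.
ranked = sorted(edges, key=lambda e: abs(e[2]), reverse=True)
for cause, effect, weight in ranked:
    print(f"{cause} -> {effect} (weight={weight})")
```

An expert reviewing this list top-down would confirm or reject the strongest candidate relationships first, which is where review time is best spent.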
Once we have identified what the causes are, we can progress to learning how they behave.
Phase 2: Probability Learning
Structure learning may identify that the price of coffee is influenced in some way by population density, but will not specifically identify how — it is unable to indicate whether a rising population increases or decreases price, or whether there is a more complex relationship at play.
Probability learning estimates how much each cause drives each effect by learning the underlying Conditional Probability Distributions (CPDs). Each CPD describes the likelihood of an effect, given the states of its causes.
We have found that discrete CPDs are more practical than continuous CPDs. Continuous distributions are often limited to Gaussians and so struggle to describe many relationships. Discrete CPDs can describe any shape of distribution, albeit with less precision, and are well supported across libraries.
Here, too, domain experts can help make the choice. Data scientists and domain experts should agree upon a data discretisation strategy at the outset, taking into account the goals of the project. For instance, if your project requires comparisons to be made, then percentile discretisation would likely suit.
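As a sketch, percentile discretisation (here into tertiles) can be done with the standard library alone; the coffee price values are invented:

```python
from statistics import quantiles

prices = [2.1, 2.4, 2.6, 2.8, 3.0, 3.2, 3.5, 3.9, 4.4, 5.0]

# Tertile cut points: the 33rd and 67th percentiles of the data.
cuts = quantiles(prices, n=3)

def discretise(value, cuts, names=("low", "medium", "high")):
    """Map a continuous value onto percentile-based buckets."""
    for cut, name in zip(cuts, names):
        if value <= cut:
            return name
    return names[-1]

labels = [discretise(p, cuts) for p in prices]
print(labels)
```

Because the cut points come from the data's own percentiles, each bucket holds a comparable share of samples, which is what makes downstream comparisons between states meaningful.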
That being said, be careful to avoid over-discretising CPDs, as every probability estimate needs to be described and they can quickly accumulate. For a binary effect with three binary causes, a CPD would need to estimate 16 possible eventualities. For an effect with 10 states and three causes, each with their own 10 states, 10,000 possible eventualities must be estimated. For small datasets with fewer samples than possibilities, most eventualities will never be observed, and those that are will not be well represented. Even with large datasets, over-discretisation means CPDs will include many highly improbable eventualities, diluting the power of the model and increasing computation time.
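The combinatorics above can be checked directly; a tiny helper of our own:

```python
from math import prod

def cpd_size(effect_states, cause_states):
    """Number of probability entries a discrete CPD must estimate:
    one per combination of effect state and cause states."""
    return effect_states * prod(cause_states)

print(cpd_size(2, [2, 2, 2]))      # binary effect, three binary causes: 16
print(cpd_size(10, [10, 10, 10]))  # 10-state effect, three 10-state causes: 10000
```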
Learned probabilities should be evaluated by both data scientists and domain experts. For data scientists, treat this as a standard classification problem: learn the model probabilities on a training set, then evaluate how accurate the probabilistic predictions are for any given node on the test set.
Meanwhile, domain experts can read CPD tables and sense-check values. This is often where the more improbable probabilities can be eliminated.
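A minimal sketch of this evaluation loop, using an invented discrete dataset and simple count-based CPD estimation; in practice a library would fit these, and proper scoring rules can complement plain accuracy:

```python
from collections import Counter, defaultdict

# Toy discrete dataset: (population_density, coffee_price) pairs.
train = [("high", "expensive")] * 8 + [("high", "cheap")] * 2 + \
        [("low", "expensive")] * 3 + [("low", "cheap")] * 7
test = [("high", "expensive"), ("low", "cheap"), ("high", "cheap")]

# Learn P(price | density) by counting co-occurrences.
counts = defaultdict(Counter)
for density, price in train:
    counts[density][price] += 1

cpd = {
    density: {price: n / sum(c.values()) for price, n in c.items()}
    for density, c in counts.items()
}

def predict(density):
    """Classify by the most probable price given the density state."""
    return max(cpd[density], key=cpd[density].get)

accuracy = sum(predict(d) == p for d, p in test) / len(test)
print(cpd["high"], accuracy)
```

The same CPD table that the data scientist scores here is what the domain expert reads row by row, which is why the two reviews complement each other.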
Phase 3: Inference
By now we understand the cause-effect relationship structure of our dataset and how the relationships behave. This enables us to make inferences — essentially testing actions and theories to gauge response.
Inference can be split into observational and interventional. In observational inference, we observe the state of any variable(s) and then query how this observation changes the likelihood of any state of any other variable. Answering such a query means playing out all cause and effect relationships, achieved mathematically by marginalising probabilities over the CPDs. For example, observing that a coffee shop is in a city centre, we can conclude that it is likely to incur expensive commercial rent, and that subsequently the price of a coffee is likely to be high.
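A toy version of this marginalisation, on an invented chain of location, rent and coffee price with hand-set CPDs:

```python
# P(rent | location)
p_rent = {
    "city":   {"high": 0.9, "low": 0.1},
    "suburb": {"high": 0.2, "low": 0.8},
}
# P(price | rent)
p_price = {
    "high": {"expensive": 0.8, "cheap": 0.2},
    "low":  {"expensive": 0.3, "cheap": 0.7},
}

def price_given_location(location):
    """Marginalise over the intermediate rent variable
    to obtain P(price | location)."""
    out = {"expensive": 0.0, "cheap": 0.0}
    for rent, p_r in p_rent[location].items():
        for price, p_p in p_price[rent].items():
            out[price] += p_r * p_p
    return out

print(price_given_location("city"))
```

Observing a city-centre location pushes the chance of an expensive coffee up to 0.75 here, purely by playing the observation through the rent CPD.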
In interventional inference, we intervene on the state of any variable(s), setting the likelihood of its states to whatever we choose and effectively asking 'what if X were different?' For example, we could hypothesise employees working a four-day week instead of five, and then observe the effect this has on productivity.
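Continuing with invented numbers, intervening amounts to replacing a variable's distribution with one of our choosing (Pearl's do-operator), severing the influence of that variable's own causes before marginalising:

```python
# P(price | rent), with hand-set toy values.
p_price = {
    "high": {"expensive": 0.8, "cheap": 0.2},
    "low":  {"expensive": 0.3, "cheap": 0.7},
}

def do_rent(distribution):
    """Force rent to follow a chosen distribution, regardless of
    what its upstream causes say, then marginalise to get P(price)."""
    out = {"expensive": 0.0, "cheap": 0.0}
    for rent, p_r in distribution.items():
        for price, p_p in p_price[rent].items():
            out[price] += p_r * p_p
    return out

# 'What if rent were guaranteed low?'
print(do_rent({"high": 0.0, "low": 1.0}))
```

The key difference from observation is that the forced distribution ignores rent's parents entirely: we are simulating a change to the world, not updating beliefs about it.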
Deciding where it is most appropriate to intervene can be achieved through sensitivity analysis. Every time we make an observation, we can see how this affects the state of a target we want to change. If we were to make thousands of separate, subtle observations across all variables, we could estimate which variables our target is most sensitive to. This is the basis of sensitivity analysis, although there are more efficient means to achieve it.
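A crude sketch of that idea with invented numbers: nudge each candidate cause by the same small amount and compare how far the target moves:

```python
# Toy values: P(price = expensive | cause state) for two causes.
p_expensive_given = {
    "rent":         {"high": 0.80, "low": 0.30},
    "foot_traffic": {"high": 0.55, "low": 0.45},
}

def p_expensive(cause, dist):
    """P(price = expensive) when the cause follows distribution `dist`."""
    return sum(p * p_expensive_given[cause][state] for state, p in dist.items())

baseline = {"high": 0.5, "low": 0.5}
nudged = {"high": 0.6, "low": 0.4}  # a small, subtle shift

# Sensitivity: how far the same nudge to each cause moves the target.
sensitivity = {
    cause: p_expensive(cause, nudged) - p_expensive(cause, baseline)
    for cause in p_expensive_given
}
print(sensitivity)
```

Here the same nudge moves the price five times further through rent than through foot traffic, so rent is where an intervention would earn the most.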
Sensitivity analysis is a particularly powerful tool because it helps us understand where to focus our efforts. It is not always possible to intervene on sensitive causes: for example, there is no point in altering a customer's address, since our eventual model has no way of controlling that. However, these more sensitive causes can still play a role in determining conditional interventions.
ML developments may have helped streamline structure creation but a collaborative, hybrid learning process between humans — specifically data scientists and domain experts — is still fundamental when reaching beyond correlation to identify causation.
Challenges remain with Causal Reasoning: it can be time-intensive, and completing a full project is difficult due to the high number of separate software libraries required during the testing phase. However, it remains an effective technique for building causal models. To support this, QuantumBlack has recently released its latest open source offering, CausalNex. This software library provides a far more streamlined process, helping models avoid spurious conclusions and ultimately produce more intelligent and impactful analytics interventions.
Causality is increasingly coming under the microscope and it is a topic we are committed to exploring further in future, both with CausalNex and wider research. For instance, we will present a paper at AISTATS in June, which modifies NO TEARS to learn the structure of variables across time in addition to the intra-temporal relations. In the meantime, do stay updated with upcoming CausalNex developments.