**What’s new in CausalNex v0.10?**

***Paul Beaumont**, Data Scientist, **Hiep Nguyen**, Data Scientist, **Philip Pilgerstorfer**, Data Scientist, **Zain Patel**, Software Engineer, QuantumBlack*

CausalNex is an open source Python library that helps data scientists and domain experts to co-develop models that go beyond correlation and consider causal relationships. CausalNex provides a practical ‘what if’ library, deployed to test scenarios using Bayesian Networks (BNs), interpretable, graphical models.

Since our first release in January 2020, CausalNex has been well received by the community and we are very grateful to everyone who has helped us reach the 1,000 “GitHub stars” milestone. Community input has been crucial in the library’s ongoing development and we are delighted to now release CausalNex v0.10.0 — an update to provide data scientists with an improved experience when building and querying Bayesian Networks.

**What’s new?**

This release focuses on features that boost model performance and speed up inference. It includes:

- **Advanced discretisation strategies**: functionality to help find optimal thresholds when discretising continuous variables.
- **Faster inference**: calculating the **Markov blanket** of a graph to simplify inference without losing any pertinent information, and an extension of our Inference Engine’s `.query()` functionality to support **multiprocessing** and speed up inference.
- **A new tool to simplify fitting probabilities from a Bayesian Network**: an sklearn-compatible class that supports fitting conditional probability distributions (CPDs), discretising features, and making predictions.

Given that it’s been a while since our last communication update, readers may be unaware that previous releases removed the controversial Boston Housing dataset from all of our tutorials, replacing it with the Diabetes dataset.

This article will dive deeper into each of the new functionalities and provide code snippets that demonstrate their usage.

**Advanced discretisation strategies**

One of the original design decisions in CausalNex was to use discrete probability distributions instead of supporting continuous variables. At QuantumBlack, we found that modelling continuous features relied too heavily on unrealistic normality assumptions about real-world data, and that discretisation, whilst inherently “losing” some information, led to better models overall.

For that reason, continuous features or categorical features with a high number of classes need to be discretised before fitting in CausalNex. However, this does mean that care is required when choosing a discretisation method, as this will directly impact the output of the final model.

One new feature of this release applies supervised learning algorithms to find the optimal split points for a continuous variable. In our experiments, the supervised approach outperformed the existing unsupervised methods such as uniform, fixed and quantile. This release supports two new discretisation methods: decision tree and the MDLP algorithm. Essentially, the discretisation thresholds are the split points chosen for a feature when we optimise how well that feature alone predicts the target (e.g. by minimising Gini impurity in a decision tree).
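To make the idea of a supervised split point concrete, here is a minimal pure-Python sketch of how a single threshold can be chosen by minimising weighted Gini impurity. It is not the CausalNex implementation: the helper names `gini` and `best_split` and the toy data are assumptions made for illustration, and a real decision tree repeats this search recursively.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, score) minimising the weighted Gini impurity."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_threshold, best_score = None, float("inf")
    for i in range(1, n):
        left_v, right_v = pairs[i - 1][0], pairs[i][0]
        if left_v == right_v:
            continue  # identical values cannot be separated by a threshold
        threshold = (left_v + right_v) / 2  # midpoint between neighbours
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Toy, made-up data: one continuous feature and a binary target.
feature = [0.1, 0.4, 0.5, 0.9, 1.6, 1.8, 2.2, 2.5]
target = [0, 0, 0, 0, 1, 1, 1, 1]

threshold, impurity = best_split(feature, target)
print(threshold, impurity)
```

An unsupervised method such as uniform or quantile binning would place its cut points without looking at `target` at all, whereas here the threshold lands exactly where the labels change.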

Throughout this article, we will use the Diabetes Data Set to demonstrate this feature.

import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from causalnex.discretiser import Discretiser
from IPython.display import Image
from causalnex.plots import plot_structure, NODE_STYLE, EDGE_STYLE
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler

diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target
names = diabetes["feature_names"]
ss = StandardScaler()
X = ss.fit_transform(X)
y = (y - y.mean()) / y.std()

raw_data = pd.DataFrame(X, columns=names)
raw_data['target'] = y
struct_data = raw_data.copy()
data = raw_data.copy()

data.head()

As we can see, all of the features in our dataset are continuous and cannot be fitted to a Bayesian Network as they are. Assuming that we are trying to predict `target` from the other features, we can discretise those features using a supervised learning approach. The example below uses the new `DecisionTreeSupervisedDiscretiserMethod`:

from causalnex.discretiser.discretiser_strategy import (
    DecisionTreeSupervisedDiscretiserMethod,
)

features = list(data.columns.difference(['target']))
tree_discretiser = DecisionTreeSupervisedDiscretiserMethod(
    mode="single",
    tree_params={"max_depth": 2, "random_state": 2021},
)
tree_discretiser.fit(
    feat_names=features,
    dataframe=data,
    target_continuous=True,
    target="target",
)

Output:

DecisionTreeSupervisedDiscretiserMethod(
    mode='single',
    split_unselected_feat=False,
    tree_params={'max_depth': 2, 'random_state': 2021},
)

The `tree_discretiser` object has now learned the thresholds for each of the input features in `feat_names`. We can apply these thresholds to our data using the `transform` method:

for col in features:
    data[col] = tree_discretiser.transform(data[[col]])

data.head()

We can also inspect the discretisation thresholds through the `map_thresholds` attribute:

`tree_discretiser.map_thresholds`

Output:

{
    'age': array([-1.52877724, 0.15135724, 0.91505471]),
    'sex': array([0.06347591]),
    'bmi': array([-0.45903838, 0.19809295, 1.53501534]),
    'bp': array([-1.01193172, 0.49603105, 1.24398059]),
    's1': array([-0.77064317, 0.12611715, 0.38646692]),
    's2': array([-1.20278984, 0.36409968, 0.37068325]),
    's3': array([-2.11218154, -0.33193551, 0.59688854]),
    's4': array([-1.03200924, -0.28336067, 0.64372227]),
    's5': array([-0.71241763, -0.07908702, 0.5547165 ]),
    's6': array([-0.67577836, 0.71754661, 2.58982706]),
}

After a few simple steps, all features are now categorical and ready to be used in our Bayesian Network. Note that for a small dataset we recommend setting a low value (roughly 3 or less) for the `max_depth` parameter to avoid having too many categories per feature.
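As a rough illustration of what applying learned thresholds means, the sketch below maps continuous values to category indices using the `bmi` thresholds printed above. The helper `to_bin` and the sample values are hypothetical, and the handling of values that sit exactly on a threshold is an assumption rather than a statement about CausalNex’s behaviour.

```python
from bisect import bisect_right

# Thresholds for 'bmi' as printed in map_thresholds above.
bmi_thresholds = [-0.45903838, 0.19809295, 1.53501534]


def to_bin(value, thresholds):
    """Index of the interval that `value` falls into (0 .. len(thresholds))."""
    return bisect_right(thresholds, value)


# Made-up standardised bmi values, one per interval.
samples = [-1.0, 0.0, 0.5, 2.0]
bins = [to_bin(v, bmi_thresholds) for v in samples]
print(bins)  # [0, 1, 2, 3]
```

This also shows why `max_depth` controls the number of categories: a binary tree of depth d has at most 2^d - 1 internal split nodes, so `max_depth=2` yields at most three thresholds and four categories per feature, as seen in the output above.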

**Faster inference**

CausalNex enables users to harness learned Bayesian Networks to answer pertinent questions of interest. However, inference can sometimes be a slow process, so this update includes two new methods intended to alleviate this.

**Reducing a graph to its Markov Blanket**

The Markov blanket (MB) of a variable is the subset of nodes in the Bayesian Network that contains all the useful information for predicting that variable. In other words, once the nodes in the MB are known, nodes outside it provide no additional information about the variable of interest.

The concept is particularly useful when we have a large graph and a variable of interest. Instead of considering the whole graph, we need only to consider the Markov blanket subgraph in order to make more efficient inference. To demonstrate the new feature, we continue with the diabetes dataset:

from causalnex.structure.notears import from_pandas

sm = from_pandas(data)
sm.remove_edges_below_threshold(0.3)
sm = sm.get_largest_subgraph()

viz = plot_structure(
    sm,
    graph_attributes={"scale": "0.5"},
    all_node_attributes=NODE_STYLE.WEAK,
    all_edge_attributes=EDGE_STYLE.WEAK,
)
Image(viz.draw(format='png'))

Now, assume that `target` is our variable of interest. We do not need all the nodes in the network, only the MB of `target`. To obtain it, we use the `get_markov_blanket` function from CausalNex:

from causalnex.network import BayesianNetwork
from causalnex.utils.network_utils import get_markov_blanket

bn = BayesianNetwork(sm)
blanket = get_markov_blanket(bn, 'target')

`blanket` is now a `BayesianNetwork` object that contains the structure of the MB of `target` in the original `bn` network. This means that if we only care about `target` and the nodes that directly inform it, we only need to consider the nodes contained in `blanket`.

viz = plot_structure(
    blanket.structure,
    graph_attributes={"scale": "0.5"},
    all_node_attributes=NODE_STYLE.WEAK,
    all_edge_attributes=EDGE_STYLE.WEAK,
)
Image(viz.draw(format='png'))

As a result, our region of interest has been reduced to only seven variables, for which the `InferenceEngine` can compute marginals more quickly than for the larger graph.
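For readers who want to see the definition in action, the following sketch computes a Markov blanket directly from an edge list, using the standard definition for a directed graph: the node’s parents, its children, and the other parents of its children. The helper `markov_blanket` is hypothetical and is not claimed to mirror how `get_markov_blanket` is implemented. The edges are those of the structure model as printed in the classifier output later in this article.

```python
# Edge list of the structure model, copied from the classifier output
# printed later in this article.
edges = [
    ("age", "s3"), ("sex", "age"), ("sex", "bp"), ("sex", "s4"),
    ("sex", "s6"), ("sex", "target"), ("bmi", "s4"), ("bmi", "target"),
    ("bp", "age"), ("bp", "bmi"), ("bp", "s3"), ("s1", "s5"),
    ("s2", "s1"), ("s2", "s5"), ("s4", "s2"), ("s4", "s5"),
    ("s5", "target"), ("s6", "age"), ("s6", "bmi"), ("s6", "bp"),
    ("s6", "s4"), ("target", "s3"),
]


def markov_blanket(edges, node):
    """Parents, children and co-parents of `node` in a directed graph."""
    parents = {u for u, v in edges if v == node}
    children = {v for u, v in edges if u == node}
    co_parents = {u for u, v in edges if v in children and u != node}
    return parents | children | co_parents


blanket_nodes = markov_blanket(edges, "target")
print(sorted(blanket_nodes))  # ['age', 'bmi', 'bp', 's3', 's5', 'sex']
```

The six nodes returned, together with `target` itself, appear consistent with the seven variables mentioned above.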

**Accepting and multiprocessing lists of observations**

The second feature in this release that reduces inference time is support for multiprocessing. We leverage `pathos.multiprocessing` to run CausalNex’s `InferenceEngine` in parallel over multiple inputs. When a user wishes to evaluate many observations, the new support for a list of observation dictionaries (as opposed to the single dictionary supported previously) and the ability to compute these in parallel will improve computation time:

discretised_data = data.copy()
discretised_data['target'] = Discretiser(
    method="fixed",
    numeric_split_points=[-0.5, 1],
).transform(discretised_data["target"].values)

target_map = {0: "Low", 1: "Mid", 2: "High"}
discretised_data['target'] = (
    discretised_data['target'].map(target_map)
)

bn = bn.fit_node_states(discretised_data)
bn = bn.fit_cpds(
    discretised_data,
    method="BayesianEstimator",
    bayes_prior="K2",
)

discretised_data.head()

Now, `bn` is a fitted Bayesian Network and we can use `InferenceEngine` to query the marginals with a list of observations. For example, given two new observations, we can see how each one affects the marginal distributions as follows:

from causalnex.inference import InferenceEngine

ie = InferenceEngine(bn)

observation_1 = {"age": 2, "sex": 1, "s3": 3, "s5": 0, "bmi": 1}
observation_2 = {"age": 1, "sex": 1, "s3": 2, "s5": 0, "bmi": 2}

marginals = ie.query([observation_1, observation_2])
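To illustrate what querying marginals under an observation computes, here is a toy two-node network (`bmi` → `target`) solved by brute-force enumeration. All probabilities are made up for illustration and the `query` function below is a hypothetical stand-in, not CausalNex’s `InferenceEngine`, which uses a more efficient algorithm on larger graphs.

```python
# Made-up probabilities for a two-node network bmi -> target (illustrative only).
p_bmi = {0: 0.5, 1: 0.3, 2: 0.2}
p_target_given_bmi = {
    0: {"Low": 0.7, "High": 0.3},
    1: {"Low": 0.5, "High": 0.5},
    2: {"Low": 0.2, "High": 0.8},
}


def query(observation):
    """Marginals of both nodes given a dict of observed states."""
    # Keep only the joint states that agree with the observation.
    joint = {}
    for b, pb in p_bmi.items():
        for t, pt in p_target_given_bmi[b].items():
            state = {"bmi": b, "target": t}
            if all(state[k] == v for k, v in observation.items()):
                joint[(b, t)] = pb * pt
    total = sum(joint.values())
    # Normalise and sum out the other variable to get each marginal.
    # States ruled out by the observation are simply absent from the result.
    marginals = {"bmi": {}, "target": {}}
    for (b, t), p in joint.items():
        marginals["bmi"][b] = marginals["bmi"].get(b, 0.0) + p / total
        marginals["target"][t] = marginals["target"].get(t, 0.0) + p / total
    return marginals


# One result per observation, in the same order as the input list.
observations = [{"bmi": 2}, {"target": "High"}]
results = [query(obs) for obs in observations]
for obs, res in zip(observations, results):
    print(obs, res)
```

Observing a parent (`bmi`) simply selects a row of the child’s CPD, while observing the child (`target`) updates the belief about the parent through Bayes’ rule.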

With only two observations, speed is unlikely to be a concern and the overhead of multiprocessing outweighs any gain. In such cases, the fact that `.query()` now accepts a list of dictionaries should simply make CausalNex easier to use. However, with a large number of observations (say 100), the new multiprocessing feature becomes beneficial. To trigger multiprocessing, we simply set the `parallel` parameter of `query` to `True`:

pseudo_observation = [observation_1, observation_2] * 50  # generate a hundred observations

import time

start = time.time()
marginals_multi = ie.query(
    pseudo_observation,
    parallel=True,
    num_cores=16,
)
print("Using multiprocessing, the query took {:.1f} seconds to run".format(time.time() - start))

start = time.time()
marginals = ie.query(pseudo_observation)
print("Without multiprocessing, the query took {:.1f} seconds to run".format(time.time() - start))

Output:

Using multiprocessing, the query took 4.7 seconds to run
Without multiprocessing, the query took 10.4 seconds to run

As we can see, the time difference for the same task is significant. However, `parallel=False` by default, because the overhead can cost more than the task itself unless a user requests roughly 100 or more observed marginals.
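The general shape of such an interface (one observation or a list, with an opt-in parallel path) can be sketched as follows. This is only an illustration of the pattern: `query_one` is a hypothetical placeholder for a single inference call, and a thread pool from the standard library is used as a stand-in for `pathos.multiprocessing`, which is process-based.

```python
from concurrent.futures import ThreadPoolExecutor


def query_one(observation):
    """Hypothetical placeholder for a single inference call."""
    return {"n_observed": len(observation)}


def query(observations, parallel=False, num_cores=None):
    """Accept one observation dict or a list of them."""
    if isinstance(observations, dict):
        return query_one(observations)  # single dict in, single result out
    if not parallel:
        return [query_one(obs) for obs in observations]
    # Parallel path: only worthwhile when there is enough work to
    # amortise the cost of starting and coordinating the workers.
    with ThreadPoolExecutor(max_workers=num_cores) as pool:
        return list(pool.map(query_one, observations))  # order is preserved


single = query({"age": 2})
batch = [{"age": 2, "sex": 1}, {"age": 1}] * 50
serial_results = query(batch)
parallel_results = query(batch, parallel=True, num_cores=4)
print(len(parallel_results), serial_results == parallel_results)
```

Keeping the serial path as the default mirrors the reasoning above: for a handful of observations, setting up workers costs more than the work itself.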

**Fitting Bayesian CPDs with scikit-learn’s syntax**

Previously, building a classifier with CausalNex typically consisted of the following steps:

- discretise the data,
- fit node states,
- fit CPDs, and
- make predictions.
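The four steps above can be sketched in plain Python for the simplest possible graph, a single edge from one feature to the label. The data, the split point and the variable names are all made up for illustration, and the code is not how CausalNex implements these steps. The only borrowed idea is the K2 prior, which is a Dirichlet prior with every pseudo-count set to 1.

```python
from bisect import bisect_right
from collections import Counter, defaultdict

# Made-up toy data: one continuous feature and a binary label.
bmi = [-1.2, -0.8, -0.3, 0.1, 0.4, 0.9, 1.3, 1.7]
label = [0, 0, 0, 0, 1, 0, 1, 1]

# 1. Discretise the feature with a fixed split point (hypothetical value).
split_points = [0.25]
bmi_state = [bisect_right(split_points, v) for v in bmi]

# 2. Fit node states: the set of values each node can take.
states = {"bmi": sorted(set(bmi_state)), "label": sorted(set(label))}

# 3. Fit the CPD P(label | bmi) from counts, adding one pseudo-count
#    per cell (K2 prior).
counts = defaultdict(Counter)
for b, y in zip(bmi_state, label):
    counts[b][y] += 1
cpd = {}
for b in states["bmi"]:
    total = sum(counts[b].values()) + len(states["label"])
    cpd[b] = {y: (counts[b][y] + 1) / total for y in states["label"]}


# 4. Predict the most probable label for new observations.
def predict(values):
    predictions = []
    for v in values:
        b = bisect_right(split_points, v)
        predictions.append(max(cpd[b], key=cpd[b].get))
    return predictions


print(cpd)
print(predict([-0.5, 1.0]))  # [0, 1]
```

Note how the pseudo-counts keep every probability strictly positive, even for the label value that never occurs with the lower `bmi` state in the toy data.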

While we encourage users to go through each step to better understand the graph and the causal relationships in the data, we believe that a tool that combines all the steps and outputs predictions directly can come in handy in many situations. We therefore developed `BayesianNetworkClassifier`, which helps build models with scikit-learn syntax. `BayesianNetworkClassifier` inherits from scikit-learn’s `BaseEstimator` and `ClassifierMixin` and can be used as a standard model in a scikit-learn pipeline.

Let’s build a simple classifier with the diabetes dataset to demonstrate this new feature:

from sklearn.model_selection import train_test_split
from causalnex.network.sklearn import BayesianNetworkClassifier

raw_data["target"] = Discretiser(
    method="fixed",
    numeric_split_points=[-0.25],
).transform(
    # convert target variable to categorical
    raw_data["target"].values
)

label = raw_data["target"]
input_data = raw_data.drop(["target"], axis=1)

# train test split
X_train, X_test, y_train, y_test = train_test_split(
    input_data, label, test_size=0.05, random_state=7
)

# Specify arguments for the model
edge_list = list(sm.edges)
discretiser_alg = {val: "tree" for val in list(raw_data)[:-1]}
discretiser_param = {
    "max_depth": 1,
    "random_state": 2020,
}

# we will discretise all features using this parameter
feature_discretiser = {
    val: discretiser_param for val in list(raw_data)[:-1]
}

# discretising and probability fitting
clf = BayesianNetworkClassifier(
    edge_list,
    discretiser_alg=discretiser_alg,
    discretiser_kwargs=feature_discretiser,
)
clf.fit(X_train, y_train)

Output:

BayesianNetworkClassifier(
    bayesian_kwargs={
        'bayes_prior': 'K2',
        'method': 'BayesianEstimator',
    },
    discretiser_alg={
        'age': 'tree',
        'bmi': 'tree',
        'bp': 'tree',
        's1': 'tree',
        's2': 'tree',
        's3': 'tree',
        's4': 'tree',
        's5': 'tree',
        's6': 'tree',
        'sex': 'tree',
    },
    discretiser_kwargs={
        'age': {
            'max_depth': 1,
            'random_state': 2020,
        },
        'bmi': {
            'max_depth': 1,
            'random_state': 2020,
        },
        'bp': {
            'max_depth': 1,
        ...
        'sex': {
            'max_depth': 1,
            'random_state': 2020,
        },
    },
    list_of_edges=[
        ('age', 's3'),
        ('sex', 'age'),
        ('sex', 'bp'),
        ('sex', 's4'),
        ('sex', 's6'),
        ('sex', 'target'),
        ('bmi', 's4'),
        ('bmi', 'target'),
        ('bp', 'age'),
        ('bp', 'bmi'),
        ('bp', 's3'),
        ('s1', 's5'),
        ('s2', 's1'),
        ('s2', 's5'),
        ('s4', 's2'),
        ('s4', 's5'),
        ('s5', 'target'),
        ('s6', 'age'),
        ('s6', 'bmi'),
        ('s6', 'bp'),
        ('s6', 's4'),
        ('target', 's3')],
    return_prob=False,
)

Finally, we can make predictions using the CPDs the model has learned:

`clf.predict(X_test)`

Output:

`array([0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1])`

After a few steps, we now have a classifier `clf`, which contains the CPDs learned from the training data and can be used to make predictions for new observations with the `.predict` method.

**What’s next?**

We are very thankful for the community’s input so far and QuantumBlack will continue to develop CausalNex with its users in mind. We encourage all users to contribute by reporting issues and adding new features.

*If you have used CausalNex and found the library useful, we would really appreciate it if you starred it on GitHub. We are very much looking forward to reaching our next milestone of **2,000 GitHub stars**.*