Transparent ML for Enterprise Decisions — Rule Sets

Jul 28, 2022

Note: this is Part III in a series of articles on transparent machine learning models. The other parts are Part I (Introduction), Part II (Linear Models) and Part IV (Scorecards).

Intro to rule sets

Business rules for decision making have been used extensively for several decades. Historically they stem from pre-digital times, when policy manuals were used as guidance for how company representatives should conduct business. Declarative rules were, and still are, used to specify in detail how to perform anything from order validation and approvals to quoting, discounting and much more. Similarly, predictive rules have historically been crafted by hand, sometimes helped by statistical software, to assess risk, recommend products or detect fraud.

Within the machine learning community, learning rules from data (also known as rule learning or rule induction) has been studied since at least the 1980s. Initially, most approaches either extracted rules from decision trees or incrementally grew rule sets by generalizing directly from examples in the training data. In the last decade, newer methods have appeared that are based on mathematical optimization or neural networks.

To illustrate how rule sets can be used for predictions, let’s consider the simple example of customer churn, which is a binary classification problem. We want to predict whether a customer will cancel their subscription, based on a set of variables:

Age
Income
Payment method
Monthly usage
Gender

There are two commonly used rule formats: rule sets and decision lists.

Rule Sets: A rule set is a set of If-Then rules, which all predict the same class. For example:

IF Age ≤ 25 and Monthly Usage ≥ 100 and Gender = Male
THEN Churn = “True”
IF Age ≤ 50 and Monthly Usage ≥ 150 and Income ≥ 40,000
THEN Churn = “True”
IF Monthly Usage ≤ 40 and Payment Method = Cheque
THEN Churn = “True”
…
By default Churn = “False”

The rule set above is not ordered, i.e. since all rules predict the same outcome it doesn’t matter in what sequence the rules are checked. By convention, there’s a default rule at the end that applies if none of the rules trigger, predicting the opposite outcome (Churn = “False” in this example).
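As a minimal sketch, the unordered rule set above could be evaluated like this in Python. Only the three rules shown are encoded (the elided ones are left out), and the names `Customer`, `RULES` and `predict_churn` are illustrative assumptions, not part of any particular rule engine.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    age: int
    income: float
    payment_method: str
    monthly_usage: float
    gender: str


# Each rule is a predicate on a customer; every rule predicts Churn = True.
RULES = [
    lambda c: c.age <= 25 and c.monthly_usage >= 100 and c.gender == "Male",
    lambda c: c.age <= 50 and c.monthly_usage >= 150 and c.income >= 40_000,
    lambda c: c.monthly_usage <= 40 and c.payment_method == "Cheque",
]


def predict_churn(customer, rules=RULES):
    """Return True if any rule fires, else the default outcome (False)."""
    return any(rule(customer) for rule in rules)


young_heavy_user = Customer(22, 20_000, "Card", 120, "Male")
print(predict_churn(young_heavy_user))                         # first rule fires
print(predict_churn(young_heavy_user, list(reversed(RULES))))  # same answer in any order
```

Because every rule predicts the same class, `any(...)` is enough, and shuffling the rules cannot change the result.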

Decision Lists: Also known as “rule lists”, this is an ordered set of If-Then-ElseIf-Else rules, where each condition can lead to a different prediction. For example:

IF Age ≤ 25 and Monthly Usage ≥ 100 and Gender = Male
THEN Churn = “True”
ELSE IF Age ≥ 40 and Monthly Usage ≤ 100
THEN Churn = “False”
ELSE IF Monthly Usage ≤ 40 and Payment Method = Cheque
THEN Churn = “True”
…
ELSE Churn = “False”

Notice that the first rule predicts Churn to be “True”, the second rule predicts “False”, while the third predicts “True” again. A decision list of this type is said to be “blended”, and this mixing is the reason why a decision list has to be evaluated in sequence from top to bottom.
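To make the importance of ordering concrete, here is a small Python sketch of the decision list above, again limited to the three rules shown. The function and variable names, and the sample customer, are hypothetical and chosen only for illustration.

```python
def predict_with_decision_list(case, rules, default=False):
    """Evaluate rules top to bottom and return the outcome of the first match."""
    for condition, outcome in rules:
        if condition(case):
            return outcome
    return default  # the final ELSE branch


DECISION_LIST = [
    (lambda c: c["age"] <= 25 and c["monthly_usage"] >= 100 and c["gender"] == "Male", True),
    (lambda c: c["age"] >= 40 and c["monthly_usage"] <= 100, False),
    (lambda c: c["monthly_usage"] <= 40 and c["payment_method"] == "Cheque", True),
]

# A hypothetical customer who matches both the second and the third rule.
customer = {"age": 45, "monthly_usage": 30, "payment_method": "Cheque", "gender": "Female"}

print(predict_with_decision_list(customer, DECISION_LIST))  # second rule wins: False

# Swapping the second and third rule changes the prediction for the same customer.
swapped = [DECISION_LIST[0], DECISION_LIST[2], DECISION_LIST[1]]
print(predict_with_decision_list(customer, swapped))        # third rule now wins: True
```

The same customer gets a different prediction once two rules trade places, which is exactly why a blended decision list cannot be treated as an unordered set.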

The (mathematical) format of rule conditions

While in principle the condition of a rule could take any form, traditionally most rule learning algorithms have generated the type of conditions shown above. More generally, such a condition is a conjunction of constraints, each on a single variable. For example, the condition “Age ≥ 40 and Monthly Usage ≤ 100” from above follows this format.
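The original formula did not survive in this copy of the article, so what follows is a reconstruction from the description in the text, not the author's exact notation. The symbols are assumptions: x_j stands for a variable, v_j for a threshold or category value, and c for the predicted outcome.

```latex
\text{IF } (x_{1} \bowtie_{1} v_{1}) \wedge (x_{2} \bowtie_{2} v_{2}) \wedge \dots \wedge (x_{k} \bowtie_{k} v_{k})
\ \text{ THEN } y = c,
\qquad \bowtie_{j} \in \{\le,\ \ge,\ =\}
```

With x_1 = Age, v_1 = 40, x_2 = Monthly Usage and v_2 = 100, this gives the condition “Age ≥ 40 and Monthly Usage ≤ 100”.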

ML lingo (optional): Mathematically, these types of rules with univariate conditions form n-dimensional axis-aligned boxes (hyperrectangles, loosely called hypercubes) in the decision space. All boundaries formed by these boxes are orthogonal to the variable axes.

Visualizing rule sets

The rule set example from above can be visualized in the following way, if we ignore all conditions except those on Age and Monthly Usage:

The rule set example from above, projected and visualized in the 2-dimensional space formed by Age and Monthly Usage. Areas in red are classified as Churn.

You will have to imagine how these rectangles expand beyond your flat screen, forming multi-dimensional cubes that are constrained in space by the additional conditions imposed on other variables (not just Age and Monthly Usage shown here).

The strong points of rule set models

Rule set models are, if reasonable in number and complexity, easily interpretable, and natural for business analysts to read or even modify. They are also good at capturing certain types of patterns in the data:

Interacting variables: since the condition in each rule is a combination of constraints on different variables, there is good opportunity for the rule learning algorithms to identify rules that capture important interactions between variables.
Example: in the decision list above, the combination of young age and high monthly usage raises the risk for churn for men, whereas the combination of high age and low monthly usage does the inverse.
Non-linearities: a set of rules can in combination be used to capture a non-linear effect for a variable.
Example: in the first rule set above, observe how every rule involves a condition on Monthly Usage, testing whether it is ≥ 100, ≥ 150 or ≤ 40 respectively. By “slicing up” the range of Monthly Usage this way, with different outcomes for different slices, we can capture a relationship that is not linear.
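The non-linear effect can be seen by holding a customer fixed and sweeping Monthly Usage through the three rules shown in the first rule set. This is only a sketch: the customer profile and the usage values are made up for illustration.

```python
def churn_rule_set(age, income, payment_method, monthly_usage, gender):
    """The three rules shown in the rule set example; the default is False."""
    return (
        (age <= 25 and monthly_usage >= 100 and gender == "Male")
        or (age <= 50 and monthly_usage >= 150 and income >= 40_000)
        or (monthly_usage <= 40 and payment_method == "Cheque")
    )


# Hypothetical customer; only Monthly Usage varies.
profile = dict(age=22, income=45_000, payment_method="Cheque", gender="Male")

sweep = {
    usage: churn_rule_set(monthly_usage=usage, **profile)
    for usage in (20, 40, 70, 100, 150)
}
print(sweep)
# → {20: True, 40: True, 70: False, 100: True, 150: True}
```

Churn is predicted at low and at high usage but not in between, a pattern that a single linear term on Monthly Usage cannot express.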

It is worth noting that these strong points are exactly the weak points of linear models that we discussed in Part II of this article series, and also that rule sets complement scorecards (Part IV), which can model non-linearities but not interactions between variables.

The weaknesses of rule set models

While transparent and flexible, rule sets have some inherent limitations:

Size of rule sets: when the true decision boundaries are not orthogonal to the axes, we need more of the cubes depicted above to closely follow the boundaries that separate cases of churn from cases of non-churn. Put differently, this is one reason why striving for good model accuracy might leave us with large rule sets.
Complexity of conditions: while rules are good at capturing interactions between variables, doing so might result in conditions with constraints on many variables, which will increase complexity and reduce transparency.
Regression: above we focused on classification examples, which include use cases such as churn risk and fraud detection. When learning rules to predict a number, such as a house price or the lifetime value of a customer, a common approach is to take the (weighted) average of the right-hand-side values of all the rules that triggered. While this approach might work with enough rules, it might provide a worse accuracy/transparency ratio than e.g. scorecards or even linear regression. Often, scorecards (Part IV) will prove a better choice for transparent regression models.
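The weighted-average approach for regression mentioned above can be sketched as follows. The house-price rules, their values and their weights are entirely made up for illustration, and `predict_value` is a hypothetical helper, not a function from any library.

```python
def predict_value(case, rules, default):
    """Weighted average of the right-hand-side values of all rules that fire."""
    fired = [(value, weight) for condition, value, weight in rules if condition(case)]
    if not fired:
        return default  # no rule triggered
    total_weight = sum(weight for _, weight in fired)
    return sum(value * weight for value, weight in fired) / total_weight


# (condition, predicted value, weight): illustrative numbers only.
HOUSE_RULES = [
    (lambda h: h["area_m2"] >= 100, 400_000.0, 2.0),
    (lambda h: h["rooms"] >= 4, 350_000.0, 1.0),
    (lambda h: h["area_m2"] < 60, 150_000.0, 1.0),
]

house = {"area_m2": 120, "rooms": 5}  # triggers the first two rules
estimate = predict_value(house, HOUSE_RULES, default=250_000.0)
print(round(estimate, 2))
```

Here the estimate is (2 × 400,000 + 1 × 350,000) / 3. Explaining such a number to a business user requires showing every rule that fired and its weight, which hints at why the accuracy/transparency trade-off can be less favourable than for classification.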

Rule sets as Decision Tables

Business rules can also be displayed and managed in the form of decision tables. As a concrete example, our rule set from above would be captured the following way as a decision table:

The example rule set represented as a decision table. Each row is a rule.

In a decision table, each row represents a rule, and each cell contains a condition on a single variable. Just like with If-Then rules, there’s a “hit policy” defining how the rules in the table should be evaluated. For a regular rule set, we use a “first hit” policy, which means the table will be evaluated from top to bottom, and the rule in the last row will trigger if no rule above it did.
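As a rough sketch of the idea, the three rules shown in the rule set example can be stored as table rows and evaluated with a “first hit” policy. The data layout, the `-` placeholder for empty cells and the helper names are assumptions made for this illustration, not the format of any specific rule management product.

```python
import operator

OPS = {"<=": operator.le, ">=": operator.ge, "=": operator.eq}
COLUMNS = ["Age", "Income", "Payment Method", "Monthly Usage", "Gender"]

# One row per rule: (cells, outcome). A missing column means an empty cell.
TABLE = [
    ({"Age": ("<=", 25), "Monthly Usage": (">=", 100), "Gender": ("=", "Male")}, True),
    ({"Age": ("<=", 50), "Monthly Usage": (">=", 150), "Income": (">=", 40_000)}, True),
    ({"Monthly Usage": ("<=", 40), "Payment Method": ("=", "Cheque")}, True),
    ({}, False),  # last row has no conditions, so it always matches (default)
]


def first_hit(table, case):
    """Return the outcome of the first row whose cells all match the case."""
    for cells, outcome in table:
        if all(OPS[op](case[var], value) for var, (op, value) in cells.items()):
            return outcome
    raise ValueError("no row matched")


def render(table):
    """Format the table as text, one line per rule, '-' for empty cells."""
    lines = [" | ".join(COLUMNS + ["Churn"])]
    for cells, outcome in table:
        row = [f"{cells[c][0]} {cells[c][1]}" if c in cells else "-" for c in COLUMNS]
        lines.append(" | ".join(row + [str(outcome)]))
    return "\n".join(lines)


print(render(TABLE))
case = {"Age": 22, "Income": 20_000, "Payment Method": "Card",
        "Monthly Usage": 120, "Gender": "Male"}
print(first_hit(TABLE, case))
```

The last row carries no conditions, so it plays the role of the default rule and is only reached when no row above it matched.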

Whereas a list of rules is semantically equivalent to a decision table, visually there are cases where a table might be preferable (and vice versa).

Sparse rule sets occur when you have relatively few conditions compared to the number of variables. Displayed as a table, this means there will be a lot of columns and a lot of empty cells. Sparse rule sets might be better visualized and managed as individual If-Then rules.
Dense rule sets are the inverse situation, i.e. there is a relatively small number of variables and most of them are used in conditions, leading to a table where many cells are filled. In this case, a table format will not only be more compact, it will also make analysis of the rules more feasible.
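One simple way to put a number on “sparse” versus “dense” is the fraction of filled cells in the table. This measure is an assumption made here for illustration, not a standard metric from the article, and the article does not give a threshold that separates the two cases.

```python
def table_density(rows, variables):
    """Fraction of table cells that hold a condition (1.0 = fully dense)."""
    filled = sum(len(cells) for cells in rows)
    return filled / (len(rows) * len(variables))


VARIABLES = ["Age", "Income", "Payment Method", "Monthly Usage", "Gender"]

# The three rules shown in the rule set example, listed by the variables they constrain.
ROWS = [
    {"Age", "Monthly Usage", "Gender"},
    {"Age", "Monthly Usage", "Income"},
    {"Monthly Usage", "Payment Method"},
]

print(round(table_density(ROWS, VARIABLES), 2))
# → 0.53
```

With 8 of 15 cells filled, the example sits somewhere in between. A rule set with hundreds of variables and three conditions per rule would score close to zero and is likely easier to read as individual If-Then rules.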

Regardless of whether a rule set is dense or sparse, your rule management software might provide different ways of analyzing the rules when represented as a table or list of rules. For example, in a table you might be able to re-order columns or sort the rows (be careful if you have a “first hit” policy!), which might make the rules much easier to understand.

Summary

In conclusion, using machine learning to generate business rules from data is very compelling, because rules are very familiar to business analysts and can be managed and executed alongside other rules used to encode policy or business strategy. For classification tasks with lots of interactions between variables, rule sets can perform relatively well with a reasonable amount of rule complexity.

Greger works for IBM and is based in France. The above article is personal and does not necessarily represent IBM’s positions, strategies or opinions.

Written by Greger Ottosson