Naive Bayes Classifier Amplification
Machine learning and artificial intelligence are increasingly entering our lives, gradually penetrating every area of activity. Behind the high-tech magic we see around us are algorithms both very complex and very simple, such as Naive Bayes classification.
However, despite their naive design and apparently oversimplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. Naive Bayes is an excellent starting algorithm for bringing machine learning into your work before you gradually move on to more complex solutions as you gain experience.
In this post, I would like to dig into the internals of the Naive Bayes classifier to show what can be squeezed out of this simple machine learning algorithm beyond plain classification. For example:
- Incremental training
- Incremental feature expansion
- Automatic feature selection
- Best category recommendation
To be more descriptive and avoid code overload, I use a custom implementation of the algorithm based on https://habr.com/ru/articles/120194/.
As training data, I use the standard weather example for predicting whether a game will be played depending on the weather (the full dataset appears in the code below).
A quick note: This post is not for those who are just about to learn machine learning and Naive Bayes. There will be no explanations here about how the algorithm works.
Incremental training
Incremental training means that an already trained model can be trained further with additional data. It’s available in packages like scikit-learn (as partial_fit), but can also be implemented by hand:
import math
from collections import defaultdict

def train(data, classes, model=None):
    classes_stat, features_stat, length = model if model else (defaultdict(int), defaultdict(int), 0)
    length += len(data)
    for features, cls in zip(data, classes):
        classes_stat[cls] += 1
        for idx, feature in enumerate(features):
            features_stat[cls, feature, idx] += 1
    return classes_stat, features_stat, length

def proba_log(classifier, features):
    classes_stat, _, length = classifier
    func = lambda cls: math.log(classes_stat[cls] / length) + \
        sum(math.log(feature_proba(classifier, cls, feature, idx)) for idx, feature in enumerate(features))
    return {cls: func(cls) for cls in classes_stat}

def feature_proba(classifier, cls, feature, idx):
    classes_stat, features_stat, _ = classifier
    return (features_stat.get((cls, feature, idx), 0) / classes_stat[cls]) or \
        (1 / (classes_stat[cls] + 1))  # NOTE: a simple work-around for the zero-frequency problem
Splitting the weather data into two batches, incremental training looks like this:
data = [
['Rainy', 'Hot', 'High', 'false'],
['Rainy', 'Hot', 'High', 'true'],
['Overcast', 'Hot', 'High', 'false'],
['Sunny', 'Mild', 'High', 'false'],
['Sunny', 'Cool', 'Normal', 'false'],
['Sunny', 'Cool', 'Normal', 'true'],
['Overcast', 'Cool', 'Normal', 'true'],
]
classes = [
'no',
'no',
'yes',
'yes',
'yes',
'no',
'yes',
]
model = train(data, classes)
data_inc = [
['Rainy', 'Mild', 'High', 'false'],
['Rainy', 'Cool', 'Normal', 'false'],
['Sunny', 'Mild', 'Normal', 'false'],
['Rainy', 'Mild', 'Normal', 'true'],
['Overcast', 'Mild', 'High', 'true'],
['Overcast', 'Hot', 'Normal', 'false'],
['Sunny', 'Mild', 'High', 'true'],
]
classes_inc = [
'no',
'yes',
'yes',
'yes',
'yes',
'yes',
'no',
]
model = train(data_inc, classes_inc, model)
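The two-stage training can be sanity-checked end to end. Below is a condensed, self-contained variant of the functions above, trained in two small batches of illustrative rows (not the full weather set) and then asked for a prediction:

```python
import math
from collections import defaultdict

def train(data, classes, model=None):
    # Accumulate class and feature counters; an existing model is extended in place.
    classes_stat, features_stat, length = model if model else (defaultdict(int), defaultdict(int), 0)
    length += len(data)
    for features, cls in zip(data, classes):
        classes_stat[cls] += 1
        for idx, feature in enumerate(features):
            features_stat[cls, feature, idx] += 1
    return classes_stat, features_stat, length

def feature_proba(classifier, cls, feature, idx):
    classes_stat, features_stat, _ = classifier
    return (features_stat.get((cls, feature, idx), 0) / classes_stat[cls]) or \
        (1 / (classes_stat[cls] + 1))  # zero-frequency work-around

def proba_log(classifier, features):
    classes_stat, _, length = classifier
    score = lambda cls: math.log(classes_stat[cls] / length) + \
        sum(math.log(feature_proba(classifier, cls, feature, idx))
            for idx, feature in enumerate(features))
    return {cls: score(cls) for cls in classes_stat}

# First batch...
model = train([['Rainy', 'Hot', 'High', 'false'],
               ['Overcast', 'Hot', 'High', 'false']],
              ['no', 'yes'])
# ...and a second batch refining the same model.
model = train([['Rainy', 'Mild', 'High', 'true'],
               ['Overcast', 'Cool', 'Normal', 'false']],
              ['no', 'yes'], model)

scores = proba_log(model, ['Overcast', 'Cool', 'Normal', 'false'])
print(max(scores, key=scores.get))  # → yes
```

The model stays an ordinary tuple of counters, so every extra batch of rows just bumps the counts.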
Incremental feature expansion
Incremental feature expansion is a less obvious capability of Naive Bayes that follows from its core assumption: features are independent of each other. An already trained model can therefore be extended with new features. All it takes is a small change in the code:
def train(data, classes, model=None, f_inc=False):
    classes_stat, features_stat, rows, cols = model if model else (defaultdict(int), defaultdict(int), 0, 0)
    idx_start = cols if f_inc else 0
    for features, cls in zip(data, classes):
        if not f_inc:
            classes_stat[cls] += 1  # class counts were already collected during the initial training
        for idx, feature in enumerate(features):
            features_stat[cls, feature, idx_start + idx] += 1
    rows = rows if f_inc else rows + len(data)
    cols = cols + len(data[0]) if f_inc else (cols or len(data[0]))
    return classes_stat, features_stat, rows, cols
Now, feature columns can be trained into the same model separately (proba_log and feature_proba should be updated to unpack the new four-element model tuple):
data = [
['Rainy', 'Hot'],
['Rainy', 'Hot'],
['Overcast', 'Hot'],
['Sunny', 'Mild'],
['Sunny', 'Cool'],
['Sunny', 'Cool'],
['Overcast', 'Cool'],
['Rainy', 'Mild'],
['Rainy', 'Cool'],
['Sunny', 'Mild'],
['Rainy', 'Mild'],
['Overcast', 'Mild'],
['Overcast', 'Hot'],
['Sunny', 'Mild'],
]
classes = [
'no',
'no',
'yes',
'yes',
'yes',
'no',
'yes',
'no',
'yes',
'yes',
'yes',
'yes',
'yes',
'no',
]
model = train(data, classes)
new_features = [
['High', 'false'],
['High', 'true'],
['High', 'false'],
['High', 'false'],
['Normal', 'false'],
['Normal', 'true'],
['Normal', 'true'],
['High', 'false'],
['Normal', 'false'],
['Normal', 'false'],
['Normal', 'true'],
['High', 'true'],
['Normal', 'false'],
['High', 'true'],
]
model = train(new_features, classes, model, f_inc=True)
Whenever you later need to extend the model with new features, it can be done just as easily.
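As a sanity check, training the first two feature columns and then appending the remaining two with f_inc=True should reproduce exactly the statistics of training on the full rows. A minimal self-contained sketch with a couple of illustrative rows:

```python
from collections import defaultdict

def train(data, classes, model=None, f_inc=False):
    classes_stat, features_stat, rows, cols = model if model else (defaultdict(int), defaultdict(int), 0, 0)
    idx_start = cols if f_inc else 0  # new columns continue after the existing ones
    for features, cls in zip(data, classes):
        if not f_inc:
            classes_stat[cls] += 1  # class counts were collected during the initial training
        for idx, feature in enumerate(features):
            features_stat[cls, feature, idx_start + idx] += 1
    rows = rows if f_inc else rows + len(data)
    cols = cols + len(data[0]) if f_inc else (cols or len(data[0]))
    return classes_stat, features_stat, rows, cols

rows = [['Rainy', 'Hot', 'High', 'false'],
        ['Overcast', 'Cool', 'Normal', 'true']]
classes = ['no', 'yes']

full = train(rows, classes)                            # all four columns at once
split = train([r[:2] for r in rows], classes)          # first two columns...
split = train([r[2:] for r in rows], classes, split, f_inc=True)  # ...then the rest

print(full == split)  # → True
```

The feature indices line up because idx_start picks up where the previous columns ended, so the counters end up identical.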
Automatic feature selection
As can be seen in the proba_log function above, the overall probability is influenced by the individual feature probabilities. Some features have a greater impact than others, so features that do not significantly bias the probability toward any class value can be excluded from the model to save memory and computation time.
To understand how to evaluate a feature’s impact on biasing the probability toward a class value, let’s look at a simple example.
Windy | Play
------+-----
true | no
true | no
true | yes
true | yes
Even without training a model, it’s obvious that the category true of the Windy feature doesn’t affect Play. Still, train the model on this data:
model = train([['true'], ['true'], ['true'], ['true']], ['no', 'no', 'yes', 'yes'])
And predict the Play class for Windy: true:
print(proba_log(model, ['true']))
{'no': -0.6931471805599453, 'yes': -0.6931471805599453}
The numbers are the same for both classes, so the category true of the feature Windy is useless for classification. To decide whether a feature category is useless, compare its weight for both classes: if the difference is less than a given threshold, the category can be considered useless:
|P(true|no) - P(true|yes)| <= threshold
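For the Windy example, this threshold test can be sketched directly on the model’s counters (the counts below are written out by hand for illustration rather than taken from a trained model):

```python
def category_usefulness(features_stat, classes_stat, feature, idx):
    # |P(feature|c1) - P(feature|c2)| for the two-class case
    p = [features_stat.get((cls, feature, idx), 0) / classes_stat[cls]
         for cls in classes_stat]
    return abs(p[0] - p[1])

# Counters matching the Windy/Play table: 'true' appears twice under each class.
classes_stat = {'no': 2, 'yes': 2}
features_stat = {('no', 'true', 0): 2, ('yes', 'true', 0): 2}

threshold = 0.1
useless = category_usefulness(features_stat, classes_stat, 'true', 0) <= threshold
print(useless)  # → True
```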
This formula works for the simple case of exactly two classes. However, Naive Bayes classification isn’t limited to two classes; it can predict three, four, or even more. To see how to decide whether a feature category is useless with more than two classes, let’s illustrate the situation above as a picture:
As yes and no are opposite classes, the feature probabilities can be presented as vectors along an axis, with the vector length showing the probability value. If you add one more class value or, more precisely, one more vector direction, such as maybe, it can be presented as:
In general, feature category usefulness can be calculated as the length of the sum of these vectors. For example:
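One way to realize the vector picture in code is to place the k classes at evenly spaced unit directions on a circle and take the magnitude of the sum of the probability vectors: equal probabilities cancel out, and for two classes this reduces to the |P(true|no) − P(true|yes)| difference from the two-class formula. This geometric layout is my own sketch of the idea, not a standard formulation:

```python
import math

def category_usefulness(probs):
    """Magnitude of the sum of class-probability vectors, with the
    classes placed at evenly spaced directions on the unit circle."""
    k = len(probs)
    x = sum(p * math.cos(2 * math.pi * i / k) for i, p in enumerate(probs))
    y = sum(p * math.sin(2 * math.pi * i / k) for i, p in enumerate(probs))
    return math.hypot(x, y)

# Equal probabilities for 'no', 'yes', 'maybe' cancel each other out...
print(round(category_usefulness([0.5, 0.5, 0.5]), 6))  # → 0.0
# ...while a category biased toward one class keeps a large magnitude.
print(round(category_usefulness([0.9, 0.1, 0.1]), 2))  # → 0.8
```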
Now that there is a way to evaluate the usefulness of each feature category, the further steps in feature selection are up to you.
Best category recommendation
Naive Bayes classification lets you calculate the probability of a class, but from a practical point of view, simply reporting the probability of an event may not be enough. It is often more helpful to recommend what to do to increase the probability of the desired event.
Since the Naive Bayes model is a statistic of values, it can also provide information about which value will have the greatest impact on biasing the probability toward the desired class.
The calculations are similar to the previous example of evaluating feature usefulness. You could also use vector addition, but as practice shows, you can get by with operations on scalar probability values.
Let’s imagine we are selling a product whose sales success depends on, say, its color. In that case, in addition to the likelihood of selling a new product, we would like a recommendation on which color to choose to increase the likelihood of a sale.
Goods color | When sold
------------+----------
blue | asap
red | soon
green | never
Let’s say we are interested in the class asap. Then we need to compare the following ratios for each category (blue, red, green) of the Goods color feature, since we want the probability of the required class to be at a maximum and the probability of the other classes at a minimum:
P(blue|asap) / (P(blue|soon) + P(blue|never))
P(red|asap) / (P(red|soon) + P(red|never))
P(green|asap) / (P(green|soon) + P(green|never))
The category with the highest ratio value is the most preferred option, maximizing the probability among the categories of considered features.
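These ratios can be computed straight from the model’s counters. A self-contained sketch, where the recommend helper and the counts are illustrative assumptions rather than parts of the model above:

```python
def recommend(features_stat, classes_stat, categories, idx, target):
    def proba(cls, feature):
        # P(feature|cls) with the same zero-frequency work-around as before
        return (features_stat.get((cls, feature, idx), 0) / classes_stat[cls]) or \
            (1 / (classes_stat[cls] + 1))
    def ratio(feature):
        # P(feature|target) over the summed probabilities of the other classes
        others = sum(proba(cls, feature) for cls in classes_stat if cls != target)
        return proba(target, feature) / others
    return max(categories, key=ratio)

# Hand-made counters: blue sells fast, green tends not to sell at all.
classes_stat = {'asap': 4, 'soon': 3, 'never': 3}
features_stat = {
    ('asap', 'blue', 0): 3, ('asap', 'red', 0): 1,
    ('soon', 'red', 0): 2, ('soon', 'green', 0): 1,
    ('never', 'green', 0): 2, ('never', 'red', 0): 1,
}

print(recommend(features_stat, classes_stat, ['blue', 'red', 'green'], 0, 'asap'))
# → blue
```

With these counts, blue maximizes P(color|asap) while keeping the other classes’ probabilities low, so it comes out on top.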
Despite its simplicity, Naive Bayes classification offers good opportunities for extracting additional knowledge from the model, speeding up training, and improving predictions. It lets you work with real-time data and eases debugging thanks to the transparency of the calculations and of the data stored in the model.