Naive Bayes Classifier Amplification
Machine learning and artificial intelligence are increasingly entering our lives, gradually penetrating every area of activity. Behind the high-tech magic we see around us are algorithms both very complex and very simple, such as Naive Bayes classification.
However, despite their naive design and apparently oversimplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. Naive Bayes is an excellent starting algorithm for bringing machine learning into your work before you gradually move on to more complex solutions as you gain experience.
In this post, I would like to dig into the internals of the Naive Bayes classifier to show what can be squeezed out of this simple machine learning algorithm beyond plain classification. For example:
- Incremental training
- Incremental feature expansion
- Automatic feature selection
- Best category recommendation
To be more descriptive and avoid code overload, I use a custom implementation of the algorithm based on https://habr.com/ru/articles/120194/.
As training data, I use the standard weather example for predicting whether a game will be played depending on the weather (the full dataset appears in the code below).
A quick note: This post is not for those who are just about to learn machine learning and Naive Bayes. There will be no explanations here about how the algorithm works.
Incremental training
Incremental training means that an already trained model can be trained further with additional data. It’s available in packages like scikit-learn (as partial_fit), but can also be implemented by hand:
import math
from collections import defaultdict

def train(data, classes, model=None):
    classes_stat, features_stat, length = model if model else (defaultdict(int), defaultdict(int), 0)
    length += len(data)
    for features, cls in zip(data, classes):
        classes_stat[cls] += 1
        for idx, feature in enumerate(features):
            features_stat[cls, feature, idx] += 1
    return classes_stat, features_stat, length

def proba_log(classifier, features):
    classes_stat, _, length = classifier
    func = lambda cls: math.log(classes_stat[cls] / length) + \
        sum(math.log(feature_proba(classifier, cls, feature, idx)) for idx, feature in enumerate(features))
    return {cls: func(cls) for cls in classes_stat}

def feature_proba(classifier, cls, feature, idx):
    classes_stat, features_stat, _ = classifier
    return (features_stat.get((cls, feature, idx), 0) / classes_stat[cls]) or \
        (1 / (classes_stat[cls] + 1))  # NOTE: a simple work-around for the zero-frequency problem
Splitting the weather data into two batches, incremental training looks like this:
data = [
['Rainy', 'Hot', 'High', 'false'],
['Rainy', 'Hot', 'High', 'true'],
['Overcast', 'Hot', 'High', 'false'],
['Sunny', 'Mild', 'High', 'false'],
['Sunny', 'Cool', 'Normal', 'false'],
['Sunny', 'Cool', 'Normal', 'true'],
['Overcast', 'Cool', 'Normal', 'true'],
]
classes = [
'no',
'no',
'yes',
'yes',
'yes',
'no',
'yes',
]
model = train(data, classes)
data_inc = [
['Rainy', 'Mild', 'High', 'false'],
['Rainy', 'Cool', 'Normal', 'false'],
['Sunny', 'Mild', 'Normal', 'false'],
['Rainy', 'Mild', 'Normal', 'true'],
['Overcast', 'Mild', 'High', 'true'],
['Overcast', 'Hot', 'Normal', 'false'],
['Sunny', 'Mild', 'High', 'true'],
]
classes_inc = [
'no',
'yes',
'yes',
'yes',
'yes',
'yes',
'no',
]
model = train(data_inc, classes_inc, model)
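The two-stage training can be sanity-checked end to end. Below is a condensed, self-contained variant of the functions above, trained in two small batches of illustrative rows (not the full weather set) and then asked for a prediction:

```python
import math
from collections import defaultdict

def train(data, classes, model=None):
    # Accumulate class and feature counters; an existing model is extended in place.
    classes_stat, features_stat, length = model if model else (defaultdict(int), defaultdict(int), 0)
    length += len(data)
    for features, cls in zip(data, classes):
        classes_stat[cls] += 1
        for idx, feature in enumerate(features):
            features_stat[cls, feature, idx] += 1
    return classes_stat, features_stat, length

def feature_proba(classifier, cls, feature, idx):
    classes_stat, features_stat, _ = classifier
    return (features_stat.get((cls, feature, idx), 0) / classes_stat[cls]) or \
        (1 / (classes_stat[cls] + 1))  # zero-frequency work-around

def proba_log(classifier, features):
    classes_stat, _, length = classifier
    score = lambda cls: math.log(classes_stat[cls] / length) + \
        sum(math.log(feature_proba(classifier, cls, feature, idx))
            for idx, feature in enumerate(features))
    return {cls: score(cls) for cls in classes_stat}

# First batch...
model = train([['Rainy', 'Hot', 'High', 'false'],
               ['Overcast', 'Hot', 'High', 'false']],
              ['no', 'yes'])
# ...and a second batch refining the same model.
model = train([['Rainy', 'Mild', 'High', 'true'],
               ['Overcast', 'Cool', 'Normal', 'false']],
              ['no', 'yes'], model)

scores = proba_log(model, ['Overcast', 'Cool', 'Normal', 'false'])
print(max(scores, key=scores.get))  # → yes
```

The model stays an ordinary tuple of counters, so every extra batch of rows just bumps the counts.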
Incremental feature expansion
Incremental feature expansion is a less obvious capability of Naive Bayes that follows from its core assumption: features are independent of each other. An already trained model can therefore be extended with new features. All it takes is a small change in the code:
def train(data, classes, model=None, f_inc=False):
    classes_stat, features_stat, rows, cols = model if model else (defaultdict(int), defaultdict(int), 0, 0)
    idx_start = cols if f_inc else 0
    for features, cls in zip(data, classes):
        if not f_inc:
            classes_stat[cls] += 1  # class counts were already collected during the initial training
        for idx, feature in enumerate(features):
            features_stat[cls, feature, idx_start + idx] += 1
    rows = rows if f_inc else rows + len(data)
    cols = cols + len(data[0]) if f_inc else (cols or len(data[0]))
    return classes_stat, features_stat, rows, cols
Now, feature columns can be trained into the same model separately (proba_log and feature_proba should be updated to unpack the new four-element model tuple):
data = [
['Rainy', 'Hot'],
['Rainy', 'Hot'],
['Overcast', 'Hot'],
['Sunny', 'Mild'],
['Sunny', 'Cool'],
['Sunny', 'Cool'],
['Overcast', 'Cool'],
['Rainy', 'Mild'],
['Rainy', 'Cool'],
['Sunny', 'Mild'],
['Rainy', 'Mild'],
['Overcast', 'Mild'],
['Overcast', 'Hot'],
['Sunny', 'Mild'],
]
classes = [
'no',
'no',
'yes',
'yes',
'yes',
'no',
'yes',
'no',
'yes',
'yes',
'yes',
'yes',
'yes',
'no',
]
model = train(data, classes)
new_features = [
['High', 'false'],
['High', 'true'],
['High', 'false'],
['High', 'false'],
['Normal', 'false'],
['Normal', 'true'],
['Normal', 'true'],
['High', 'false'],
['Normal', 'false'],
['Normal', 'false'],
['Normal', 'true'],
['High', 'true'],
['Normal', 'false'],
['High', 'true'],
]
model = train(new_features, classes, model, f_inc=True)
Whenever you later need to extend the model with new features, it can be done just as easily.
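As a sanity check, training the first two feature columns and then appending the remaining two with f_inc=True should reproduce exactly the statistics of training on the full rows. A minimal self-contained sketch with a couple of illustrative rows:

```python
from collections import defaultdict

def train(data, classes, model=None, f_inc=False):
    classes_stat, features_stat, rows, cols = model if model else (defaultdict(int), defaultdict(int), 0, 0)
    idx_start = cols if f_inc else 0  # new columns continue after the existing ones
    for features, cls in zip(data, classes):
        if not f_inc:
            classes_stat[cls] += 1  # class counts were collected during the initial training
        for idx, feature in enumerate(features):
            features_stat[cls, feature, idx_start + idx] += 1
    rows = rows if f_inc else rows + len(data)
    cols = cols + len(data[0]) if f_inc else (cols or len(data[0]))
    return classes_stat, features_stat, rows, cols

rows = [['Rainy', 'Hot', 'High', 'false'],
        ['Overcast', 'Cool', 'Normal', 'true']]
classes = ['no', 'yes']

full = train(rows, classes)                            # all four columns at once
split = train([r[:2] for r in rows], classes)          # first two columns...
split = train([r[2:] for r in rows], classes, split, f_inc=True)  # ...then the rest

print(full == split)  # → True
```

The feature indices line up because idx_start picks up where the previous columns ended, so the counters end up identical.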
Automatic feature selection
As can be seen in the proba_log function above, the overall probability is influenced by the individual feature probabilities. Some features have a greater impact than others, so features that do not significantly bias the probability toward any class value can be excluded from the model to save memory and computation time.
To understand how to evaluate a feature’s impact on biasing the probability toward a class value, let’s look at a simple example.
Windy | Play
------+-----
true | no
true | no
true | yes
true | yes
Even without training a model, it’s obvious that the category true of the Windy feature doesn’t affect Play. Still, train the model on this data:
model = train([['true'], ['true'], ['true'], ['true']], ['no', 'no', 'yes', 'yes'])
And predict the Play class for Windy: true:
print(proba_log(model, ['true']))
{'no': -0.6931471805599453, 'yes': -0.6931471805599453}
The numbers are the same for both classes, so the category true of the feature Windy is useless for classification. To decide whether a feature category is useless, compare its weight for both classes: if the difference is less than a given threshold, the category can be considered useless:
|P(true|no) - P(true|yes)| <= threshold
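For the Windy example, this threshold test can be sketched directly on the model’s counters (the counts below are written out by hand for illustration rather than taken from a trained model):

```python
def category_usefulness(features_stat, classes_stat, feature, idx):
    # |P(feature|c1) - P(feature|c2)| for the two-class case
    p = [features_stat.get((cls, feature, idx), 0) / classes_stat[cls]
         for cls in classes_stat]
    return abs(p[0] - p[1])

# Counters matching the Windy/Play table: 'true' appears twice under each class.
classes_stat = {'no': 2, 'yes': 2}
features_stat = {('no', 'true', 0): 2, ('yes', 'true', 0): 2}

threshold = 0.1
useless = category_usefulness(features_stat, classes_stat, 'true', 0) <= threshold
print(useless)  # → True
```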
This formula works for the simple case of exactly two classes. However, Naive Bayes classification isn’t limited to two classes; it can predict three, four, or even more. To see how to decide whether a feature category is useless with more than two classes, let’s illustrate the situation above as a picture:
As yes and no are opposite classes, the feature probabilities can be presented as vectors along an axis, with the vector length showing the probability value. If you add one more class value or, more precisely, one more vector direction, such as maybe, it can be presented as:
In general, feature category usefulness can be calculated as the length of the sum of these vectors. For example:
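One way to realize the vector picture in code is to place the k classes at evenly spaced unit directions on a circle and take the magnitude of the sum of the probability vectors: equal probabilities cancel out, and for two classes this reduces to the |P(true|no) − P(true|yes)| difference from the two-class formula. This geometric layout is my own sketch of the idea, not a standard formulation:

```python
import math

def category_usefulness(probs):
    """Magnitude of the sum of class-probability vectors, with the
    classes placed at evenly spaced directions on the unit circle."""
    k = len(probs)
    x = sum(p * math.cos(2 * math.pi * i / k) for i, p in enumerate(probs))
    y = sum(p * math.sin(2 * math.pi * i / k) for i, p in enumerate(probs))
    return math.hypot(x, y)

# Equal probabilities for 'no', 'yes', 'maybe' cancel each other out...
print(round(category_usefulness([0.5, 0.5, 0.5]), 6))  # → 0.0
# ...while a category biased toward one class keeps a large magnitude.
print(round(category_usefulness([0.9, 0.1, 0.1]), 2))  # → 0.8
```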
Now that there is a way to evaluate the usefulness of each feature category, the further steps in feature selection are up to you.
Best category recommendation
Naive Bayes classification lets you calculate the probability of a class, but from a practical point of view, simply reporting the probability of an event may not be enough. It is often more helpful to recommend what to do to increase the probability of the desired event.
Since the Naive Bayes model is a statistic of values, it can also provide information about which value will have the greatest impact on biasing the probability toward the desired class.
The calculations are similar to the previous example of evaluating feature usefulness. You could also use vector addition, but as practice shows, you can get by with operations on scalar probability values.
Let’s imagine we are selling a product whose sales success depends on, say, its color. In that case, in addition to the likelihood of selling a new product, we would like a recommendation on which color to choose to increase the likelihood of a sale.
Goods color | When sold
------------+----------
blue | asap
red | soon
green | never
Let’s say we are interested in the class asap. Then we need to compare the following ratios for each category (blue, red, green) of the Goods color feature, since we want the probability of the required class to be at a maximum and the probability of the other classes at a minimum:
P(blue|asap) / (P(blue|soon) + P(blue|never))
P(red|asap) / (P(red|soon) + P(red|never))
P(green|asap) / (P(green|soon) + P(green|never))
The category with the highest ratio value is the most preferred option, maximizing the probability among the categories of considered features.
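These ratios can be computed straight from the model’s counters. A self-contained sketch, where the recommend helper and the counts are illustrative assumptions rather than parts of the model above:

```python
def recommend(features_stat, classes_stat, categories, idx, target):
    def proba(cls, feature):
        # P(feature|cls) with the same zero-frequency work-around as before
        return (features_stat.get((cls, feature, idx), 0) / classes_stat[cls]) or \
            (1 / (classes_stat[cls] + 1))
    def ratio(feature):
        # P(feature|target) over the summed probabilities of the other classes
        others = sum(proba(cls, feature) for cls in classes_stat if cls != target)
        return proba(target, feature) / others
    return max(categories, key=ratio)

# Hand-made counters: blue sells fast, green tends not to sell at all.
classes_stat = {'asap': 4, 'soon': 3, 'never': 3}
features_stat = {
    ('asap', 'blue', 0): 3, ('asap', 'red', 0): 1,
    ('soon', 'red', 0): 2, ('soon', 'green', 0): 1,
    ('never', 'green', 0): 2, ('never', 'red', 0): 1,
}

print(recommend(features_stat, classes_stat, ['blue', 'red', 'green'], 0, 'asap'))
# → blue
```

With these counts, blue maximizes P(color|asap) while keeping the other classes’ probabilities low, so it comes out on top.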
Despite its simplicity, Naive Bayes classification offers good opportunities for extracting additional knowledge from the model, speeding up training, and improving predictions. It lets you work with real-time data and eases debugging thanks to the transparency of the calculations and of the data stored in the model.