Essential Math for Machine Learning: Bayes’ Theorem

Dagang Wei
Apr 13, 2024


This article is part of the series Essential Math for Machine Learning.

Introduction

Machine learning has revolutionized how we solve problems, from product recommendations to medical diagnoses. And behind this revolution lies an impressive mathematical framework. One fundamental concept you’ll often encounter in the world of machine learning is Bayes’ Theorem. This seemingly simple formula has far-reaching implications in probability, reasoning with uncertainty, and building intelligent learning systems.

What is Bayes’ Theorem?

At its core, Bayes’ Theorem provides us with a way to update our beliefs about an event when we have new evidence. It can be expressed in the following form:

P(A|B) = (P(B|A) * P(A)) / P(B)

Let’s break down the terminology:

  • P(A|B): The probability of event A happening given that event B has already happened (posterior probability).
  • P(B|A): The probability of event B happening given that event A has already happened (likelihood).
  • P(A): The probability of event A happening (prior probability).
  • P(B): The probability of event B happening (marginal probability or evidence).

Interpretation of Bayes’ Theorem

In essence, Bayes’ Theorem describes how the evidence B updates our belief about the probability of event A, from the prior P(A) to the posterior P(A|B).

Let me explain with a simple example. Suppose you have a medical test for a rare disease. Here’s how Bayes’ Theorem comes into play:

  • Prior Probability P(A): Let’s say the disease affects 1% of the population.
  • Likelihood P(B|A): The test is very accurate — if a person has the disease, there’s a 98% chance of a positive result.
  • False Positive Rate P(B|-A): However, there’s a 3% chance of a false positive (positive result even if you don’t have the disease).

Now, you test positive. Uh-oh. Should you panic? Bayes’ Theorem helps us calculate the actual probability of having the disease given the positive result, where the denominator P(B) is expanded with the law of total probability, P(B) = P(B|A) * P(A) + P(B|-A) * (1 - P(A)):

P(Disease | Positive Test)
= (0.98 * 0.01) / (0.98 * 0.01 + 0.03 * 0.99)
≈ 0.25

Despite the positive test, there’s only about a 25% chance you have the disease. Notice how Bayes’ Theorem updates our belief from the initial 1% prior to a far more realistic estimate once the test result is taken into account.
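If you want to check the arithmetic yourself, here is a minimal Python sketch of the same calculation (the variable names are ours, chosen just for this illustration):

# Bayes' Theorem for the medical test example
p_disease = 0.01            # prior P(A): 1% of the population
p_pos_given_disease = 0.98  # likelihood P(B|A): true positive rate
p_pos_given_healthy = 0.03  # P(B|-A): false positive rate

# Evidence P(B) via the law of total probability
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

posterior = p_pos_given_disease * p_disease / p_pos
print(round(posterior, 3))  # 0.248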

Why is Bayes’ Theorem Important in Machine Learning?

Handling Uncertainty: Machine learning models often deal with incomplete or noisy data. Bayes’ Theorem allows us to systematically reason about uncertainty and update our models’ predictions as new evidence becomes available.
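To make this concrete, here is a small sketch that reuses the illustrative numbers from the medical test above (and assumes repeated tests are independent given the true disease status): each posterior simply becomes the prior for the next piece of evidence.

# Sequential Bayesian updating: yesterday's posterior is today's prior
def update(prior, p_pos_given_disease=0.98, p_pos_given_healthy=0.03):
    evidence = p_pos_given_disease * prior + p_pos_given_healthy * (1 - prior)
    return p_pos_given_disease * prior / evidence

belief = 0.01            # initial prior: 1% of the population
belief = update(belief)  # after one positive test -> ~0.25
belief = update(belief)  # after a second positive test -> ~0.92
print(round(belief, 2))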

Naive Bayes Classifiers: A direct application is in the popular Naive Bayes algorithms, used in spam filtering, text classification, and sentiment analysis. These classifiers heavily rely on Bayes’ Theorem to calculate the probability that an item (an email, a document) belongs to a certain class.

Bayesian Networks: These are powerful graphical models representing probabilistic relationships between variables. They find applications in areas like risk modeling, decision analysis, and medical diagnosis.
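As a tiny illustration (our own example, not tied to any particular library): a network with one parent node Disease and two child test nodes encodes the factorization

P(Disease, Test1, Test2) = P(Disease) * P(Test1 | Disease) * P(Test2 | Disease)

so the full joint distribution is described by a few small local conditional probability tables rather than one giant table over all variables.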

Example: Spam Email Classifier

Imagine you’re building your own email filtering system. You want to automatically categorize incoming emails as spam or legitimate (ham). The Naive Bayes algorithm provides a powerful tool for this task. It allows us to calculate the probability of an email belonging to a particular class (spam or ham) based on the presence of certain words.
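Under the “naive” assumption that words are conditionally independent given the class, Bayes’ Theorem reduces classification to comparing two simple products (the notation w1, …, wn for the words of an email is ours):

P(spam | w1, ..., wn) ∝ P(spam) * P(w1 | spam) * ... * P(wn | spam)
P(ham | w1, ..., wn) ∝ P(ham) * P(w1 | ham) * ... * P(wn | ham)

Whichever score is larger wins. In the implementation below, the products are computed as sums of logarithms to avoid numerical underflow.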

The code is available in this Colab notebook.

import math
import random
import string

# Word lists for spam and ham
spam_words = ["free", "offer", "promotion", "click", "winner", "deal"]
ham_words = ["meeting", "report", "project", "urgent", "schedule", "important"]

# Function to generate random emails
def generate_email(word_list, num_words):
    return " ".join(random.choices(word_list, k=num_words))

# Generate training data
spam_emails = [generate_email(spam_words, random.randint(5, 15)) for _ in range(200)]
ham_emails = [generate_email(ham_words, random.randint(5, 15)) for _ in range(200)]
emails = spam_emails + ham_emails
labels = ["spam"] * 200 + ["ham"] * 200

# Naive Bayes Implementation

def train_naive_bayes(emails, labels):
    # Count word occurrences
    word_counts = {'spam': {}, 'ham': {}}
    for email, label in zip(emails, labels):
        words = email.split()
        for word in words:
            word_counts[label].setdefault(word, 0)
            word_counts[label][word] += 1

    # Calculate probabilities of P(word|label) with add-one smoothing
    vocab = set()  # Collect unique words
    for label in word_counts:
        vocab.update(word_counts[label].keys())

    probabilities = {}
    for label in word_counts:
        total_words_in_class = sum(word_counts[label].values()) + len(vocab)
        probabilities[label] = {}
        for word in vocab:
            # Give a small probability to words which are not in the class
            probabilities[label][word] = (word_counts[label].get(word, 0) + 1) / total_words_in_class

    return probabilities, vocab

def classify_email(email, probabilities, vocab):
    # Class priors are equal here (200 spam vs. 200 ham emails), so they cancel out and are omitted
    spam_prob, ham_prob = 0.0, 0.0
    words = email.split()

    for word in words:
        if word in vocab:
            # Sum log probabilities instead of multiplying raw probabilities to avoid underflow;
            # smoothing guarantees every vocab word has a probability, so the tiny fallback is only a safeguard
            spam_prob += math.log(probabilities["spam"].get(word, 1e-10))
            ham_prob += math.log(probabilities["ham"].get(word, 1e-10))

    return "spam" if spam_prob > ham_prob else "ham"

# Train the model
probabilities, vocab = train_naive_bayes(emails, labels)

# Classify a new email
new_email = "Free offer! Click now to get a special promotion"
prediction = classify_email(new_email, probabilities, vocab)
print("New email 1 classification:", prediction)

new_email = "Let's schedule a meeting for the project report."
prediction = classify_email(new_email, probabilities, vocab)
print("New email 2 classification:", prediction)

Output:

New email 1 classification: spam
New email 2 classification: ham

Explanation:

  1. Data Generation: We define lists of words typically found in spam and non-spam emails (ham). The generate_email function creates email samples with random lengths, drawing words from the respective lists. This creates our training data.
  2. Train Naive Bayes: The train_naive_bayes function is the heart of the algorithm. It goes through each email and builds a dictionary word_counts that tracks how many times each word appears in spam and ham emails. It then calculates the probability of each word appearing in a spam or ham email using add-one smoothing. This technique prevents zero probabilities for unseen words during classification. The function also creates a vocabulary vocab containing all unique words encountered in the training data.
  3. Classify New Email: The classify_email function takes a new email, splits it into words, and keeps only the words that exist in the vocabulary. It then sums the log probabilities of those words under the 'spam' and 'ham' classes and picks the class with the higher score. By working with logarithms instead of raw products, we avoid underflow errors when dealing with very small probabilities; the short snippet after this list illustrates the issue.
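As a standalone illustration of the last point (this snippet is not part of the classifier above): multiplying even a hundred modest word probabilities underflows a float to zero, while summing their logarithms stays comfortably in range.

import math

p = 1e-5         # a typical small word probability
product = 1.0
log_sum = 0.0
for _ in range(100):
    product *= p            # underflows to exactly 0.0
    log_sum += math.log(p)  # stays finite: 100 * log(1e-5) ≈ -1151.3

print(product)  # 0.0
print(log_sum)  # -1151.29...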

Beyond the Basics

While we’ve provided a simplified overview, Bayes’ Theorem’s importance extends far beyond these examples. Bayesian inference forms a cornerstone of modern machine learning, providing a sound theoretical foundation for probabilistic models and decision-making under uncertainty.
