Naïve Bayes Algorithm | Maximum A Posteriori in Machine Learning

Rizwana Yasmeen
9 min read · Jul 4, 2023


In machine learning, problems with labeled training data fall into two broad groups: classification and regression. When you dive a bit deeper into classification, you will come across the Naïve Bayes algorithm, one of the most fundamental algorithms.

The Naïve Bayes algorithm is a basic and popular classification technique that utilizes a probabilistic approach. It is based on Bayes’ theorem, a fundamental concept in probability theory. Naïve Bayes is commonly used for text classification tasks such as spam detection and sentiment analysis.

In machine learning terms, given a data frame, Naïve Bayes assumes that the features are conditionally independent of one another given the class label.

Naïve Bayes is based on Bayes’ theorem, which in turn is derived from conditional probability.

To understand Naïve Bayes, we first need to know a few probability concepts.

Conditional probability:

The probability of occurrence of any event A, when another event B in relation to A has already occurred, is known as conditional probability. It is denoted by P(A|B).

P(A|B) = P(A ∩ B) / P(B), provided P(B) ≠ 0

In words, the conditional probability of event A given event B is equal to the probability of both events A and B occurring together divided by the probability of event B.

Independent Event:

An event whose occurrence or non-occurrence does not affect the probability of the occurrence of another event. In other words, the outcome of one event has no influence on the outcome of the other event.

In simpler terms, when events A and B are independent, the probability of both events occurring together is the product of their individual probabilities: P(A ∩ B) = P(A) × P(B). The occurrence of event B does not change the probability of event A happening, so P(A|B) = P(A).

Mutually Exclusive Event:

Mutually exclusive events are events that cannot occur simultaneously. If one event happens, the other event cannot happen at the same time. In other words, the occurrence of one event excludes the possibility of the other event occurring.

Formally, two events A and B are considered mutually exclusive if their intersection (the event where both A and B occur) is an empty set. Mathematically, this can be represented as A ∩ B = Ø

In simple words, two events A and B are said to be mutually exclusive when P(A ∩ B) = 0.

P(A|B) = P(A ∩ B) / P(B); since P(A ∩ B) = 0, it follows that P(A|B) = 0.

The probability of A given B, when the two events are mutually exclusive, is therefore zero.

For example, consider tossing a coin or throwing a die. A head and a tail cannot occur on the same toss of a coin; similarly, a single throw of a die cannot show 2, 3, 4, and 5 at the same time. Every toss or throw produces exactly one outcome. Independent events can have common outcomes, while mutually exclusive events cannot: there is no common area between any two mutually exclusive events.

Two mutually exclusive events have nothing in common, so P(A ∩ B) = 0.

In general, P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

For mutually exclusive events A and B, the equation simplifies to:

P(A ∪ B) = P(A) + P(B)

The probability of the union of two mutually exclusive events A and B is equal to the sum of their individual probabilities. Since these events cannot occur together, there is no overlap and no intersection to subtract.

Note:

If events are mutually exclusive, then only one of them can occur at a time.

Independent events can both occur, but the occurrence of one does not affect the probability of the other.

Joint Probability:

Joint probability refers to the probability of two or more events occurring simultaneously. It is a measure of the likelihood that multiple events will happen together.

A joint probability is the likelihood that event B takes place at the same time as event A, written P(A ∩ B). When the two events are independent, the joint probability is simply the product of their individual probabilities.

A joint probability is also known as the intersection of two or more events.

It is different from conditional probability, which refers to the probability that one event will happen when another event takes place.

Bayes’ Theorem:

Bayes’ theorem is a fundamental concept in probability theory and statistics named after Thomas Bayes, an 18th-century British mathematician. It provides a way to calculate conditional probabilities, which are the probabilities of an event occurring given that another event has already occurred.

In its simplest form, Bayes’ theorem can be stated as follows:

P(A|B) = (P(B|A) * P(A)) / P(B)

Bayes’ theorem is derived from the definition of conditional probability.

Each term in this formula has a specific name:

P(A|B) is the posterior probability

P(B|A) is the likelihood

P(B) is the marginal probability (the evidence)

P(A) is the prior probability
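As a quick numeric illustration of the formula, here is a minimal Python sketch. The spam-filtering scenario and all three input probabilities are hypothetical values chosen only for this example:

```python
# A minimal sketch of Bayes' theorem with hypothetical numbers:
# A = "email is spam", B = "email contains the word 'offer'".
p_A = 0.20          # prior P(A): assume 20% of all emails are spam
p_B_given_A = 0.60  # likelihood P(B|A): assume 60% of spam contains "offer"
p_B = 0.25          # marginal P(B): assume 25% of all emails contain "offer"

# Posterior P(A|B) = P(B|A) * P(A) / P(B)
p_A_given_B = (p_B_given_A * p_A) / p_B
print(p_A_given_B)  # 0.48 -> given "offer" appears, the email is spam with probability 0.48
```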

Naïve Bayes Algorithm:

Consider a basic situation in which you need to learn a machine learning model from a given set of attributes. You describe a hypothesis, a relation between the attributes and a response variable, and then use this relation to predict the response for a given set of attributes.

Using Bayes’ theorem, you can create a learner that predicts the probability of the response variable belonging to each class, given a new set of attributes.

Consider a data frame of shape n × d, where each point Xi has features (f1, f2, f3, …, fd), and Y takes one of the discrete values, i.e., the classes (C1, C2, C3, …, CK).

Using the probabilistic approach, given a query point Xq we compute the probability of Xq belonging to each class and assign Xq to whichever class has the maximum probability,

i.e., we compare P(C1|Xq), P(C2|Xq), P(C3|Xq), …, P(CK|Xq).

The query point Xq belongs to the class Ci for which P(Ci|Xq) is the highest.

In Naïve Bayes we are trying to maximize the posterior probability, so the method is also called Maximum A Posteriori (MAP).

By Bayes’ theorem, P(Ci|Xq) ∝ P(Ci) × P(f1, f2, …, fd | Ci). Decomposing this joint probability one feature at a time into a product of conditional probabilities is known as the chain rule of conditional probability. The naïve assumption of conditional independence then reduces it to P(Ci|Xq) ∝ P(Ci) × P(f1|Ci) × P(f2|Ci) × … × P(fd|Ci); since P(Xq) is the same for every class, it can be ignored when comparing classes.
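To make the MAP rule concrete, here is a minimal from-scratch sketch for categorical features. The small weather-style table, the feature names, and the query are all hypothetical (this is not the article's example table), but the scoring follows P(Ci) × P(f1|Ci) × P(f2|Ci) exactly:

```python
from collections import Counter

# Hypothetical training data (not the article's original table):
# each row is (Humidity, Temperature), the label is whether it rained.
X_train = [("High", "Hot"), ("High", "Mild"), ("Moderate", "Cool"), ("High", "Cool"),
           ("Moderate", "Hot"), ("Low", "Hot"), ("Moderate", "Mild")]
y_train = ["Yes", "Yes", "Yes", "Yes", "No", "No", "No"]

def naive_bayes_map(X, y, x_query):
    n = len(y)
    class_counts = Counter(y)
    scores = {}
    for c, c_count in class_counts.items():
        score = c_count / n                      # prior P(Ci)
        for j, value in enumerate(x_query):      # multiply by P(fj | Ci) for each feature
            match = sum(1 for xi, yi in zip(X, y) if yi == c and xi[j] == value)
            score *= match / c_count
        scores[c] = score
    return max(scores, key=scores.get), scores   # MAP: pick the class with the largest score

label, scores = naive_bayes_map(X_train, y_train, ("Moderate", "Hot"))
print(label, scores)   # 'No' wins: P(No) * P(Moderate|No) * P(Hot|No) is the larger product
```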

Example for Naïve Bayes Algorithm:

Comparing these probabilities, we see that P(Yes|Xq) = 2/55 is smaller than P(No|Xq) = 2/33.

Therefore, based on these probabilities, the query Xq (Humidity = Moderate and Temperature = Hot) is classified as “No” for Rain.
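Since the original frequency table is not reproduced above, here is an illustrative sketch of the same kind of example using scikit-learn's CategoricalNB on a small invented table with the same two features; the numbers are hypothetical, so the resulting probabilities will not match the 2/55 and 2/33 values:

```python
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical weather-style table (not the article's original data)
X = [["High", "Hot"], ["High", "Mild"], ["Moderate", "Cool"], ["High", "Cool"],
     ["Moderate", "Hot"], ["Low", "Hot"], ["Moderate", "Mild"]]
y = ["Yes", "Yes", "Yes", "Yes", "No", "No", "No"]   # Rain?

encoder = OrdinalEncoder()            # CategoricalNB expects integer-coded categories
X_enc = encoder.fit_transform(X)

model = CategoricalNB(alpha=1.0)      # alpha=1.0 applies Laplace (add-one) smoothing
model.fit(X_enc, y)

x_query = encoder.transform([["Moderate", "Hot"]])
print(model.predict(x_query))         # ['No']
print(model.predict_proba(x_query))   # posterior probabilities for classes ['No', 'Yes']
```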

Advantages of the Naïve Bayes Algorithm:

Simplicity and speed: Naïve Bayes is simple to understand and implement. It has a minimal computing cost and can function well with limited training data. It is very scalable and performs effectively on large datasets.

Handling High-Dimensional Data: Naïve Bayes is especially effective in dealing with high-dimensional data, such as text classification problems, where the number of features is large compared to the number of instances. It works well in these scenarios and can achieve good results.

Quick Training: In the training process, basic probabilities are calculated based on the occurrences of features in the training data. This procedure is quick and doesn’t require extensive computing or optimization steps.

Strong Performance on Categorical Features: When dealing with categorical features or discrete data, it works effectively. It assumes feature independence, which can be a legitimate assumption in some contexts.

Handles Missing Data: The algorithm can handle missing data gracefully by ignoring missing attribute values during probability computations. This attribute is useful when working with real-world datasets that often contain missing values.

Disadvantages of the Naïve Bayes Algorithm:

Zero Probability Issue: If a categorical variable in the test dataset contains a category that wasn’t present in the training dataset, the model will give it a 0 probability and be unable to predict anything. This issue can be handled using smoothing techniques such as Laplace smoothing or add-one smoothing.
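A minimal sketch of the add-one (Laplace) smoothing fix mentioned above, using hypothetical counts for a single categorical feature within one class:

```python
# Hypothetical counts of one feature's categories within a single class
counts = {"High": 3, "Moderate": 1, "Low": 0}    # "Low" never co-occurred with this class
total = sum(counts.values())
k = len(counts)                                  # number of possible categories

# Without smoothing, P(Low | class) = 0, which zeroes out the whole product.
unsmoothed = {v: c / total for v, c in counts.items()}

# Laplace (add-one) smoothing: add 1 to every count and k to the denominator.
smoothed = {v: (c + 1) / (total + k) for v, c in counts.items()}

print(unsmoothed)   # {'High': 0.75, 'Moderate': 0.25, 'Low': 0.0}
print(smoothed)     # {'High': 0.571..., 'Moderate': 0.285..., 'Low': 0.142...}
```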

Numerical Instability Error: When probabilities are multiplied together, especially if the probabilities are small, the final value might become extremely small, possibly underflowing the numerical precision of the computer. This can lead to numerical errors and inaccurate results. Log transformation is used to prevent numerical instability.
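And a small sketch of the log transformation: summing log-probabilities instead of multiplying raw probabilities avoids underflow while preserving which class scores highest (the probabilities below are hypothetical):

```python
import math

# Many small hypothetical per-feature likelihoods for one class
probs = [1e-5] * 100

product = 1.0
for p in probs:
    product *= p                               # underflows to 0.0 in double precision
print(product)                                 # 0.0

log_score = sum(math.log(p) for p in probs)    # stays finite: 100 * log(1e-5)
print(log_score)                               # about -1151.29
```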

Sensitivity to Feature Distribution: Naïve Bayes makes the assumption that features have a certain probability distribution (e.g., Gaussian, multinomial, or Bernoulli). If the data does not conform to these assumptions, the algorithm’s performance may suffer.

Assumption of Feature Independence: Given the class label, Naïve Bayes assumes that all features are independent of each other. Although this assumption makes computations easier, it may not hold in many real-world situations. Correlations among features can negatively impact the accuracy of the algorithm.

Limited Expressiveness: Because of the independence assumption, Naïve Bayes may have trouble capturing complex relationships and interactions between features. It may not be appropriate for tasks where feature dependencies are important.

Applications of Naïve Bayes

Text Classification: The Naïve Bayes algorithm is almost always used as a classifier and is a great choice for spam filtering of your emails or news categorization on your smartphone.

Recommendation Systems: Naïve Bayes can be used with collaborative filtering to create recommendation systems for you. On Netflix, there is a section called “Because you watched ___” that does just that.

Sentiment analysis: Naïve Bayes is a useful algorithm for determining whether a target group (consumers, audience, etc.) has positive or negative opinions. Consider IMDb reviews and feedback forms.
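To connect the text-classification and sentiment-analysis applications back to the algorithm, here is a minimal sketch using scikit-learn's CountVectorizer with MultinomialNB; the tiny review corpus is invented purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny invented corpus for illustration only
reviews = ["great movie, loved it", "terrible plot and bad acting",
           "what a wonderful film", "awful, boring, waste of time"]
labels = ["positive", "negative", "positive", "negative"]

# Bag-of-words counts feed a multinomial Naïve Bayes classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(reviews, labels)

print(model.predict(["a wonderful and great film"]))   # likely ['positive']
print(model.predict(["boring and bad acting"]))        # likely ['negative']
```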

Conclusion:

The Naïve Bayes algorithm is a basic yet powerful classification algorithm that uses a probabilistic approach. It uses Bayes’ theorem and the premise of feature independence to estimate class probabilities based on observed data. The algorithm performs well in some situations, such as text classification, multiclass problems, real-time classification, data containing categorical features, and handling missing data.

Overall, Naïve Bayes is a useful tool in machine learning, especially in certain scenarios and domains. It is worth considering as an initial baseline model or in situations where its assumptions align well with the problem at hand. However, careful study of the specific characteristics and requirements of the data is required to evaluate whether Naïve Bayes is the best method for a particular task.

Thank you for reading. Please let me know if you have any feedback.

My Other Posts

K-Nearest Neighbor(KNN) Algorithm in Machine Learning
