Naive Bayes Classification

Piyush Tyagi
5 min read · Dec 29, 2018


Naive Bayes classification works on the concept of conditional probability. It answers questions such as: what is the probability that a given tuple of a data set belongs to a particular class label? Some people get stuck the moment they hear about (conditional) probability. Let me start from the basics so that any misconceptions about this beautiful classification technique are eliminated. Before diving deep into the Bayesian classifier, let’s start with the base of this technique, i.e., Bayes’ theorem.

Bayes Theorem

Consider a data set D with a tuple X. In Bayes’ theorem, X works as the “evidence.” Let H be some hypothesis, such as that the data tuple X belongs to a specified class C. For classification problems, we want to determine P(H|X), the probability that the hypothesis H holds given the “evidence,” i.e., the observed data tuple X. In other words, we want to know the probability that tuple X belongs to class C given that we know the attribute description of X (don’t worry if none of this makes sense yet). Suppose that in D our data tuples are described by income and age only (these are the attributes), that one of the tuples says X is a customer aged 35 with an income of $40,000, and that H is the hypothesis that this customer will buy a computer. Then the following are the probabilities we need to consider:

P(H|X): the probability that customer X will buy a computer, given that we know the customer’s age and income.

P(X|H): the probability that customer X is 35 years of age and earns $40,000, given that we know he/she will buy a computer.

P(X): the probability that a customer from the set of customers is 35 years old and earns $40,000.

P(H): the probability that a customer will buy a computer, regardless of age or income.

Note: Here P(H|X) and P(X|H) are called posterior probabilities (a posteriori probabilities), while P(X) and P(H) are called prior probabilities (a priori probabilities).

How do we calculate these probabilities?

Bayes’ theorem provides a way of calculating the posterior probability P(H|X) from P(X|H), P(X), and P(H).

Bayes’ theorem is:

P(H|X) = (P(X|H) * P(H)) / P(X)

So in our case the posterior P(H|X) can be computed from P(X|H), P(H), and P(X).
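
As a quick sanity check, Bayes’ theorem is a one-liner in code. The Python sketch below uses made-up numbers for the customer example above; 0.6, 0.2, and 0.15 are purely illustrative, not taken from any real data set:

```python
# Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)
# All three inputs are hypothetical values for the customer example.
p_x_given_h = 0.60  # P(X|H): chance a computer buyer is 35 and earns $40,000
p_h = 0.20          # P(H): chance an arbitrary customer buys a computer
p_x = 0.15          # P(X): chance an arbitrary customer is 35 and earns $40,000

p_h_given_x = p_x_given_h * p_h / p_x
print(f"P(H|X) = {p_h_given_x:.2f}")  # -> P(H|X) = 0.80
```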

Naive Bayesian Classification

  1. Let a training data set D consist of many tuples, and let X be one of them with n attributes A1, A2, A3, ..., An. Let there be m associated class labels C1, C2, C3, ..., Cm.
  2. The Bayesian classifier predicts that tuple X belongs to class Ci if and only if P(Ci|X) is the maximum over all classes, i.e., we want to maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the maximum posteriori hypothesis.
  3. Now the question arises: how do we maximize this? We know P(Ci|X)=(P(X|Ci)*P(Ci))/P(X). Since P(X) is constant for all classes, we only need to maximize P(X|Ci)*P(Ci).
  4. If the class prior probabilities are not known, it is commonly assumed that the classes are equally likely (i.e., each class label is treated as equally probable a priori), that is, P(C1)=P(C2)=...=P(Cm), and we would therefore maximize P(X|Ci) alone. Otherwise, we maximize P(X|Ci)*P(Ci).
  5. P(Ci)=|Ci,D|/|D|, where |Ci,D| is the number of training tuples of class Ci in D and |D| is the total number of training tuples.
  6. For a data set with many attributes it would be computationally expensive to evaluate P(X|Ci) over all attributes jointly. “To reduce computation in evaluating P(X|Ci), the naive assumption of class-conditional independence is made. This presumes that the attributes’ values are conditionally independent of one another,” so P(X|Ci)=P(x1|Ci)*P(x2|Ci)*P(x3|Ci)*...*P(xn|Ci), i.e., the product of the n per-attribute probabilities.
  7. Recall that here xk refers to the value of attribute Ak for tuple X. To compute P(xk|Ci), we consider the following cases:
  • If Ak is categorical, then P(xk|Ci) is the number of tuples of class Ci in D having the value xk for Ak, divided by |Ci,D|, the number of tuples of class Ci in D.
  • If Ak is continuous-valued, then we need to do a bit more work, but the calculation is still straightforward. A continuous-valued attribute is typically assumed to follow a Gaussian distribution with mean u and standard deviation sd, defined by

g(x, u, sd) = (1 / (sqrt(2*pi) * sd)) * exp(-((x - u)^2) / (2 * sd^2))

so that P(xk|Ci) = g(xk, u(Ci), sd(Ci)), where u(Ci) and sd(Ci) are the mean and standard deviation of attribute Ak computed from the training tuples of class Ci. (A short Python sketch of the whole procedure is given after this list.)
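
To tie the steps together, here is a minimal Python sketch of the procedure just described. It is only an illustration, not the author’s code or a library API: the class `NaiveBayes` and its methods are hypothetical names, categorical attributes are handled with per-class frequency counts (step 7, first case), and numeric attributes with the Gaussian density above (second case).

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Minimal naive Bayes sketch following steps 1-7 above (hypothetical helper)."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.class_sizes = Counter(y)
        # Step 5: prior P(Ci) = |Ci,D| / |D|
        self.priors = {c: self.class_sizes[c] / len(y) for c in self.classes}
        self.cat_counts = defaultdict(Counter)  # (class, attribute index) -> value counts
        self.gauss = {}                         # (class, attribute index) -> (mean, sd)
        for c in self.classes:
            rows = [row for row, label in zip(X, y) if label == c]
            for j in range(len(X[0])):
                column = [row[j] for row in rows]
                if isinstance(column[0], (int, float)):  # continuous attribute
                    mean = sum(column) / len(column)
                    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column)) or 1e-9
                    self.gauss[(c, j)] = (mean, sd)
                else:                                    # categorical attribute
                    self.cat_counts[(c, j)].update(column)
        return self

    def _p_xk_given_ci(self, c, j, value):
        # Step 7: P(xk|Ci) for a single attribute value
        if (c, j) in self.gauss:
            mean, sd = self.gauss[(c, j)]
            return math.exp(-((value - mean) ** 2) / (2 * sd ** 2)) / (math.sqrt(2 * math.pi) * sd)
        return self.cat_counts[(c, j)][value] / self.class_sizes[c]

    def predict(self, x):
        # Steps 2, 3, and 6: pick the class Ci maximizing P(Ci) * product of P(xk|Ci)
        best_class, best_score = None, -1.0
        for c in self.classes:
            score = self.priors[c]
            for j, value in enumerate(x):
                score *= self._p_xk_given_ci(c, j, value)
            if score > best_score:
                best_class, best_score = c, score
        return best_class
```

Multiplying many small probabilities this way can underflow on data sets with lots of attributes, so practical implementations usually sum log-probabilities instead; the direct product is kept here only because it mirrors the formulas above.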

That was all the raw theory; now let’s consider an example for a better understanding.

[Training data table (source: Chegg): 14 customer tuples described by age, income, student, and credit_rating, with class label buys_computer (9 yes, 5 no).]

Suppose we have a tuple X = (age = youth, income = medium, student = yes, credit_rating = fair) and we need to predict its class label (yes or no). P(Ci), for i = 1, 2, the prior probability of each class, can be computed from the training tuples:

Here we have x1 = age, x2 = income, x3 = student, x4 = credit_rating.

P(C1) = P(buys_computer = yes) = 9/14 = 0.643 (since 9 of the 14 training tuples have the class label yes)

P(C2) = P(buys_computer = no) = 5/14 = 0.357.

To compute P(X|Ci), for i=1, 2, we compute the following conditional probabilities:

P(age = youth | buys_computer = yes) = 2/9 = 0.222 (this is P(x1|C1) = P(age = youth and buys_computer = yes) / P(buys_computer = yes) = (2/14)/(9/14) = 2/9)

P(age = youth | buys_computer = no) = 3/5 = 0.600 (this is P(x1|C2))

P(income = medium | buys_computer = yes) = 4/9 = 0.444 (this is P(x2|C1))

P(income = medium | buys_computer = no) = 2/5 = 0.400 (this is P(x2|C2))

P(student = yes | buys_computer = yes) = 6/9 = 0.667 (this is P(x3|C1))

P(student = yes | buys_computer = no) = 1/5 = 0.200 (this is P(x3|C2))

P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667 (this is P(x4|C1))

P(credit_rating = fair | buys_computer = no) = 2/5 = 0.400 (this is P(x4|C2))

Using these probabilities, we obtain P(X|buys_computer = yes) = P(age = youth | buys_computer = yes) * P(income = medium | buys_computer = yes) * P(student = yes | buys_computer = yes) * P(credit_rating = fair | buys_computer = yes) = 0.222 * 0.444 * 0.667 * 0.667 = 0.044.

Similarly, P(X|buys_computer = no) = 0.600 * 0.400 * 0.200 * 0.400 = 0.019.

To find the class Ci that maximizes P(X|Ci)*P(Ci), we compute

P(X|buys_computer = yes) * P(buys_computer = yes) = 0.044 * 0.643 = 0.028

P(X|buys_computer = no) * P(buys_computer = no) = 0.019 * 0.357 = 0.007

Therefore, the naive Bayesian classifier predicts buys_computer = yes for tuple X.
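
If you want to check the arithmetic, the whole example fits in a few lines of Python. Every number below is taken directly from the counts above:

```python
# Priors from the 14 training tuples: 9 "yes" and 5 "no".
prior_yes, prior_no = 9 / 14, 5 / 14

# Conditional probabilities for X = (youth, medium, student=yes, credit_rating=fair)
p_x_given_yes = (2 / 9) * (4 / 9) * (6 / 9) * (6 / 9)  # P(X|buys_computer=yes) ~ 0.044
p_x_given_no = (3 / 5) * (2 / 5) * (1 / 5) * (2 / 5)   # P(X|buys_computer=no)  ~ 0.019

score_yes = p_x_given_yes * prior_yes  # ~ 0.028
score_no = p_x_given_no * prior_no     # ~ 0.007

print("prediction:", "yes" if score_yes > score_no else "no")  # -> prediction: yes
```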

What if one of the conditional probabilities turns out to be zero? Because P(X|Ci) is a product, a single zero would cancel out the effect of all the other attributes, so we need a way to avoid computing probability values of zero. Suppose that for the class buys_computer = yes in some training database D containing 1000 tuples, we have 0 tuples with income = low, 990 tuples with income = medium, and 10 tuples with income = high. The probabilities of these events are 0, 0.990 (from 990/1000), and 0.010 (from 10/1000), respectively. Using the Laplacian correction (also known as the Laplace estimator) for the three quantities, we pretend that we have 1 more tuple for each income value. In this way, we instead obtain the following probabilities (rounded to three decimal places):

1/1003 = 0.001, 991/1003 = 0.988, and 11/1003 = 0.011, respectively. The “corrected” probability estimates are close to their “uncorrected” counterparts, yet the zero probability value is avoided.
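
The same correction is easy to express as a small helper function; the Python sketch below (the name `laplace_corrected` is mine, not a standard API) reproduces the 1/1003, 991/1003, and 11/1003 values from the example:

```python
def laplace_corrected(counts, add=1):
    """Laplacian correction: pretend each value was seen `add` extra times."""
    total = sum(counts.values()) + add * len(counts)
    return {value: (count + add) / total for value, count in counts.items()}

# Income counts for buys_computer = yes from the example above.
income_counts = {"low": 0, "medium": 990, "high": 10}
print(laplace_corrected(income_counts))
# -> {'low': 0.000997..., 'medium': 0.988..., 'high': 0.01097...}
```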
