Probability Theory — An Essential Ingredient for Machine Learning

Souvik Majumder
Published in Analytics Vidhya · Apr 1, 2020 · 5 min read

This article describes the important concepts of Probability Theory that are required for understanding Machine Learning.

What is Probability?

Probability is a numerical description of how likely an event is to occur or how likely it is that a proposition is true. Probability is a number between 0 and 1.

Let us take a basic example.

When a coin is tossed, there are two possible outcomes:

  • heads (H) or
  • tails (T)

We say that the probability of the coin landing H is ½

And the probability of the coin landing T is ½

When a single die is thrown, there are six possible outcomes: 1, 2, 3, 4, 5, 6.

The probability of any one of them is 1/6

Marginal Probability

The probability that any one of several mutually exclusive events will occur is equal to the sum of the events’ individual probabilities. This is also called the Sum Rule.

The marginal probability of a random variable is given by
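
P(x) = Σy P(x, y)

that is, the joint probability P(x, y) summed over every possible value y of the other variable.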

Product Rule

The probability of two (or more) independent events occurring together can be calculated by multiplying the individual probabilities of the events. For example, if you roll a six-sided die once, you have a 1/6 chance of getting a six. If you roll two dice at once, your chance of getting two sixes is: (probability of a six on die 1) x (probability of a six on die 2) = (1/6) x (1/6) = 1/36
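
As a quick sanity check, the product rule for the two dice can be verified with a short simulation (an illustrative sketch that is not part of the original article; the variable names are mine):

    import random

    # Monte Carlo check of the product rule for two independent dice:
    # P(six on die 1 AND six on die 2) should be close to 1/36 ≈ 0.0278.
    trials = 100_000
    double_sixes = sum(
        1
        for _ in range(trials)
        if random.randint(1, 6) == 6 and random.randint(1, 6) == 6
    )
    print(double_sixes / trials)  # prints a value close to 0.0278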

In its general form, the Product Rule can also be written as
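
P(x, y) = P(y|x) x P(x)

where P(y|x) is the probability of y given that x has occurred; for independent events P(y|x) = P(y), which recovers the simple multiplication used above.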

Rearranging the above equation and solving for P(y|x) gives the conditional probability formula, better known as Bayes' Theorem:
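
P(y|x) = [P(x|y) x P(y)] / P(x)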

  • P(y|x) is the posterior probability, i.e., the probability of the outcome given the observed evidence.
  • P(x|y) is the likelihood, i.e., the probability of the evidence given the outcome.
  • P(y) is the prior probability.
  • P(x) is the probability of the evidence.
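
The formula translates directly into code. Here is a minimal sketch (the function name and argument names are my own, chosen purely for illustration):

    def bayes_posterior(prior, likelihood, evidence):
        """Return P(y|x) = P(x|y) * P(y) / P(x)."""
        return likelihood * prior / evidence

    # Example with made-up numbers: P(y) = 0.5, P(x|y) = 0.6, P(x) = 0.51
    print(bayes_posterior(prior=0.5, likelihood=0.6, evidence=0.51))  # ≈ 0.588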

Examples

Example 1: Bag I contains 4 white and 6 black balls while another Bag II contains 4 white and 3 black balls. One ball is drawn at random from one of the bags and it is found to be black. Find the probability that it was drawn from Bag I.

The problem states that the ball drawn was black.

It asks us to find the probability that this ball was drawn from Bag 1.

To apply the Bayes' Theorem formula, we first collect the required probabilities:

P(Bag 1) = 1/2 ... (1)

P(Black) = P(Black | Bag 1) x P(Bag 1) + P(Black | Bag 2) x P(Bag 2)

The above equation follows from the law of total probability: the black ball must have come either from Bag 1 or from Bag 2. First a bag is chosen at random (each with probability 1/2), and then a ball drawn from that bag turns out to be black. Because both bags are equally likely, the common factor of 1/2 can be pulled out.

Therefore,

P(Black) = 1/2 x (6/10 + 3/7) ... (2)

P(Black | Bag 1) = 6/10 ... (3)

Substituting (1), (2) and (3) in the Bayes Theorem Conditional Probability Formula,
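
P(Bag 1 | Black) = [P(Black | Bag 1) x P(Bag 1)] / P(Black)
                 = (6/10 x 1/2) / [1/2 x (6/10 + 3/7)]
                 = (3/10) / (18/35)
                 = 7/12 ≈ 0.583

So the probability that the black ball was drawn from Bag 1 is 7/12.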

Example 2: In a particular clinic, 10% of patients are prescribed narcotic pain killers. Overall, five percent of the clinic’s patients are addicted to narcotics (including pain killers and illegal substances). Out of all the people prescribed pain pills, 8% are addicts. If a patient is an addict, what is the probability that they will be prescribed pain pills?

In this problem, the given condition (the evidence) is that the patient is an addict.

The problem asks us to find the probability that such a patient has been prescribed pain killers.

So the Bayes Theorem Equation becomes,
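
P(Prescribed Pain Killers | Addict) = [P(Addict | Prescribed Pain Killers) x P(Prescribed Pain Killers)] / P(Addict)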

P(Prescribed Pain Killers), the probability that a patient has been prescribed pain killers = 10% = 0.10

P(Addict), the probability that a patient is an addict = 5% = 0.05

P(Addict | Prescribed Pain Killers), the probability that a patient is an addict given that they have been prescribed pain killers = 8% = 0.08

Substituting the above data in the Bayes Theorem Conditional Probability Formula,
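
P(Prescribed Pain Killers | Addict) = (0.08 x 0.10) / 0.05 = 0.16

So there is a 16% chance that a patient who is an addict has been prescribed pain killers.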

Bayes' Theorem is an important concept that is used, in particular, in Naive Bayes classification problems. In a Naive Bayes classifier, where there are multiple independent variables in the data set, the prediction for the dependent variable is made by comparing the posterior probability with a cut-off value: if the posterior is greater than the cut-off, the predicted value of the dependent variable is 1, else 0.

The typical Conditional Probability Distribution looks like,
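
P(Y = k | X1, X2, …, XN) = [P(X1 | Y = k) x P(X2 | Y = k) x … x P(XN | Y = k) x P(Y = k)] / P(X1, X2, …, XN)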

where Y is the dependent variable, k takes the value 0 or 1 in the training data set, and X1, X2, …, XN are the independent discrete variables.

Where some of the independent variables are continuous, the likelihood of the evidence is instead given by a Probability Density Function, commonly the Gaussian (normal) density:
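
P(Xi = x | Y = k) = (1 / √(2π σk²)) x exp( −(x − μk)² / (2σk²) )

where μk and σk² are the mean and variance of Xi computed over the training rows belonging to class k.

Putting the pieces together, this is what a Gaussian Naive Bayes classifier does. The sketch below uses scikit-learn's GaussianNB on a tiny made-up data set (the data and the 0.5 cut-off are purely illustrative, not taken from this article):

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    # Tiny illustrative data set: two continuous features, binary target.
    X = np.array([[1.0, 2.1], [0.9, 1.8], [3.2, 4.0],
                  [3.0, 4.2], [1.1, 2.0], [2.9, 3.9]])
    y = np.array([0, 0, 1, 1, 0, 1])

    # GaussianNB models each feature with a class-conditional Gaussian density.
    model = GaussianNB()
    model.fit(X, y)

    # Posterior probabilities P(Y = k | X) for a new observation.
    posterior = model.predict_proba([[1.0, 2.0]])[0]

    # Compare the posterior of class 1 with a 0.5 cut-off, as described above.
    prediction = 1 if posterior[1] > 0.5 else 0
    print(posterior, prediction)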
