Probability Basics for Machine Learning & Data Science in 10 Minutes

Lakshmi Prakash
Design and Development
7 min read · Oct 8, 2022

One needs to learn probability and statistics to learn data science and machine learning. There is no machine learning without probability. While these subjects might not be too easy, they are not too hard to learn either! If you have not started learning probability yet, today would be a good day to start! 😉 Just for the record, I ❤️ probability! I hope you would, too!

Sets Basics:

A “set” is simply a collection of distinct elements. A set may contain no elements at all, and a set may itself contain other sets; when every element of a smaller set also belongs to a larger set, the smaller set is called a “subset” of the larger one. A set with no elements in it is called the “empty set” or “null set”.

Trial and Experiment:

A trial is a single attempt that we make or an action we perform, expecting to see one result from among all the possible results we can get. The set containing all possible outcomes of our experiment is called the “sample space”, and it is usually denoted by “Ω”. An experiment involves a series of such trials.

For example, drawing a card from a deck of 52 cards could be a trial in an experiment of one user drawing ten cards from the deck, one card at a time. And for an experiment of tossing two coins in parallel, you could say that the sample space here is {HH, HT, TH, TT} because these are all the possible outcomes we can expect in this case.
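As a quick sketch, the two-coin sample space above can be built in Python by pairing every outcome of the first coin with every outcome of the second:

```python
from itertools import product

# Build the sample space for tossing two coins: every outcome of
# the first coin paired with every outcome of the second.
coin = ["H", "T"]
sample_space = {a + b for a, b in product(coin, coin)}

print(sample_space)  # {'HH', 'HT', 'TH', 'TT'} (in some order)
```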

Union and Intersection of sets:

The “union” and “intersection” of two sets A and B would be represented by “A ∪ B” and “A ∩ B” respectively.

Let us say that there are two sets, A and B: A is the set of programmers who code in Python, and B is the set of programmers who code in Java. The “union” of sets A and B includes all of these programmers: those who code in Python, those who code in Java, or both. And the “intersection” of sets A and B contains only those who code in both Python and Java.

Union and Intersection of Sets

Here, in set A, which includes programmers who can code in Python, we have 18 elements, and in set B, which includes programmers who can code in Java, we have 9 elements, of which 6 programmers can code in both languages.

Therefore, A ∩ B has 6 elements, and A ∪ B has 18 + 9 − 6 = 21 elements.

Complements of Sets:

The complement of a set A is another set, denoted Aᶜ (read “A-complement”), which includes all elements in the sample space except the elements of the set A.
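Python’s built-in sets make these operations concrete. Here is a small sketch with made-up programmer IDs (the numbers are purely illustrative), including a check of the inclusion-exclusion rule |A ∪ B| = |A| + |B| − |A ∩ B|:

```python
# Hypothetical example: programmers identified by number.
python_devs = {1, 2, 3, 4, 5}        # set A
java_devs = {4, 5, 6, 7}             # set B
everyone = {1, 2, 3, 4, 5, 6, 7, 8}  # sample space Ω

union = python_devs | java_devs          # A ∪ B
intersection = python_devs & java_devs   # A ∩ B
complement_a = everyone - python_devs    # Aᶜ: everything in Ω not in A

# Inclusion-exclusion: |A ∪ B| = |A| + |B| − |A ∩ B|
assert len(union) == len(python_devs) + len(java_devs) - len(intersection)
print(union)  # {1, 2, 3, 4, 5, 6, 7}
```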

Probability, as the name implies, refers to the “likelihood” of an event, a numerical quantity that tells us how possible it is for something in particular to take place from among all the different options that can happen.

Probability of an event, for example, the probability of event A is denoted using the notation “P(A)”.

Basic Axioms of Probability:

These are the three basic rules or principles that probability is founded on. Any theorem, formula, or derived rule in the world of probability must satisfy these conditions and can be derived from these principles.

Non-negativity: Probability can never be negative. Maybe probability is an optimist by nature? Okay, okay, no jokes. Probability, being a measure of how likely an event is to occur, cannot take a negative value. Try to make sense of it: what would “when you toss this coin, there is a −5 chance of seeing heads” even mean?

P(A) ≥ 0

Normalization: The probability of the entire sample space is always equal to 1; equivalently, the probabilities of all the different possible outcomes in a situation always sum to 1. The maximum value a probability can take is 1. If there is a probability of 1 that an event will occur, it means the event will occur with absolute certainty and that there is no alternative outcome.

P(Ω) = 1

Additivity: When it comes to two or more mutually exclusive events, the probability of the union of all these events is equal to the sum of the probabilities of these events. For disjoint events A and B, P(A ∪ B) = P(A) + P(B).

Note:

In general, P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

As you can see here, P(A) + P(B) will always be either equal to or greater than P(A ∪ B).

In the case of mutually exclusive or disjoint events, “A ∩ B” would be an empty set, so, in line with the axiom of additivity, for disjoint events, P(A ∪ B) = P(A) + P(B).

Also, don’t let phrases like “A or B” confuse you. In English, when we say “A or B”, we mean “either A or B”. But in probability, the probability of event A or B would be P(A ∪ B), not either P(A) or P(B).
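The inclusion-exclusion formula above can be verified with a tiny sketch: one roll of a fair die, with classical probability computed as favourable outcomes over total outcomes (the events chosen here are just illustrative):

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}   # sample space of one die roll
A = {2, 4, 6}                # event: even number
B = {4, 5, 6}                # event: greater than 3

def prob(event):
    # Classical probability: favourable outcomes / total outcomes.
    return Fraction(len(event), len(omega))

# P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
assert prob(A | B) == prob(A) + prob(B) - prob(A & B)
print(prob(A | B))  # 2/3
```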

De Morgan’s Laws:

De Morgan’s Laws are a pair of rules that follow from the basics of set theory.

  1. The complement of the intersection of two sets is equal to the union of complements of the two sets.
  2. The complement of the union of two sets is equal to the intersection of the complements of the two sets.

Counting:

Counting comes up often in some of the most common probability problems we have all come across, such as those involving permutations and combinations.

Why is learning counting important? Because in several cases, even in real-life scenarios, directly calculating the probability of an event can be hard, but you can count the number of favourable outcomes and the total number of outcomes and use those counts to calculate the probability.

A brief explanation of the counting principle as given by Sheldon Ross:

“If r experiments that are to be performed are such that the first one may result in any of n₁ possible outcomes, and if for each of these n₁ possible outcomes there are n₂ possible outcomes of the second experiment, and if for each of the possible outcomes of the first two experiments there are n₃ possible outcomes of the third experiment, and if, … , then there are a total of n₁ · n₂ ⋯ nᵣ possible outcomes of the r experiments.”
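A small sketch of the counting principle, using a made-up three-course menu (the dishes are purely illustrative): with 3 starters, 4 mains, and 2 desserts, there are 3 × 4 × 2 possible meals, and we can confirm this by enumerating them.

```python
from itertools import product

# Hypothetical menu: 3 starters, 4 mains, 2 desserts.
starters = ["soup", "salad", "bread"]
mains = ["pasta", "curry", "pizza", "stew"]
desserts = ["cake", "fruit"]

# Enumerate every possible three-course meal.
meals = list(product(starters, mains, desserts))

# The counting principle: n1 * n2 * n3 total outcomes.
assert len(meals) == len(starters) * len(mains) * len(desserts)
print(len(meals))  # 24
```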

“I believe that we do not know anything for certain, but everything probably.” — Christiaan Huygens

Random Variables:

A random variable can be considered a function that assigns values to each one of the elements in the sample space. You have discrete random variables, continuous random variables, and mixed random variables.
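To make the “function on the sample space” idea concrete, here is a sketch with three coin tosses, where the random variable X maps each outcome to its number of heads:

```python
from itertools import product

# Sample space: all outcomes of tossing three coins.
omega = ["".join(t) for t in product("HT", repeat=3)]

# A random variable is a function on the sample space;
# here X maps each outcome to its number of heads.
def X(outcome):
    return outcome.count("H")

values = {outcome: X(outcome) for outcome in omega}
print(values["HHT"])  # 2
```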

Conditionality: The thing about probability is that it is calculated based on the knowledge we have. When we gain new or additional information, we know more about the events concerned than we did before, and we should update our calculations accordingly to keep them accurate. This is what conditioning does in probability: when we gain new information, how does that change the probabilities we already had, given (conditioning on) the information we just learnt?

Example: The probability that it might rain on a random day in the year can be “x”. But when we are told that this day falls in the monsoon season, that naturally changes the probability, and x evidently has to increase, right?

“Probability of A given B” is written as “P(A|B)”.

P(A|B) = P(A ∩ B)/P(B)
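This formula can be sketched by counting outcomes on one roll of a fair die (the events here are illustrative): knowing the roll was even shrinks the sample space from six outcomes to three, so the probability of a six rises from 1/6 to 1/3.

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}   # one die roll
A = {6}                      # event: rolled a six
B = {2, 4, 6}                # event: rolled an even number

def prob(event):
    # Classical probability: favourable outcomes / total outcomes.
    return Fraction(len(event), len(omega))

# P(A|B) = P(A ∩ B) / P(B)
p_a_given_b = prob(A & B) / prob(B)
print(p_a_given_b)  # 1/3
```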

Independence: Independence is one of the easiest concepts to understand and process, and still one of the most important concepts we need to master in order to work efficiently with probability. Before we proceed further, I must tell you that “independent events” and “disjoint events” are not the same. Beginners often misunderstand “independence” to mean “disjoint events”.

When two sets A and B are disjoint, it means that they have nothing in common, that they don’t intersect. On the other hand, when two events are said to be “independent”, it means that knowing something about event A tells us nothing new about event B, and vice versa.

For example, let us say John has street food on weekends and cooks food at home on weekdays. Let A be an indicator of whether it is a weekday or not and B be the indicator of whether John has food at home or not. Now, knowing the value of A would give us some insight about B, right? This means that A and B are not independent.

On the other hand, consider these two measures: C and D, let C be an indicator of whether you have Covid-19 or not and D be an indicator of whether you have diabetes or not — these two are independent because knowing whether one has diabetes does not tell us anything about whether they have Covid-19 or not and vice versa.

An Example for Joint Probability Distribution of two random variables
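As a sketch of a joint probability distribution, here is a made-up table of P(X=x, Y=y) for two binary random variables (the numbers are illustrative), along with the standard independence check: X and Y are independent exactly when every joint probability equals the product of the corresponding marginals.

```python
from fractions import Fraction

# Hypothetical joint distribution P(X=x, Y=y) of two binary
# random variables; the four probabilities sum to 1.
joint = {
    (0, 0): Fraction(1, 4), (0, 1): Fraction(1, 4),
    (1, 0): Fraction(1, 4), (1, 1): Fraction(1, 4),
}

# Marginals: sum the joint table over the other variable.
p_x = {x: sum(p for (xv, _), p in joint.items() if xv == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yv), p in joint.items() if yv == y) for y in (0, 1)}

# Independent iff P(X=x, Y=y) = P(X=x) * P(Y=y) for every pair.
independent = all(joint[(x, y)] == p_x[x] * p_y[y]
                  for x in (0, 1) for y in (0, 1))
print(independent)  # True
```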

Bayes’ Rule:

Bayes’ theorem, also known as Bayes’ rule, states that for any two events A and B (with P(B) > 0), P(A|B) = [P(B|A) × P(A)]/P(B). This is particularly useful when it comes to Bayesian inference.
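A classic sketch of Bayes’ rule uses a medical test; the numbers below are purely hypothetical. Even with a fairly accurate test, a rare disease means most positive results are false positives, which Bayes’ rule makes precise:

```python
from fractions import Fraction

# Hypothetical medical-test numbers, chosen only for illustration.
p_disease = Fraction(1, 100)           # P(A): prior probability of disease
p_pos_given_disease = Fraction(9, 10)  # P(B|A): test sensitivity
p_pos_given_healthy = Fraction(5, 100) # P(B|Aᶜ): false-positive rate

# Total probability: P(B) = P(B|A)P(A) + P(B|Aᶜ)P(Aᶜ)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(p_disease_given_pos)  # 2/13, about 0.154
```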

There is a lot more in probability theory, but for the basics, I think this should do. What do you think? :)


Lakshmi Prakash
Design and Development

A conversation designer and writer interested in technology, mental health, gender equality, behavioral sciences, and more.