Probability-101 for Data Science

A complete theoretical guide to probability and concepts required for data science and machine learning.

Aayush Ostwal
Hands-On Data Science
5 min read · Aug 24, 2020


The first question that comes to mind is: why is probability even necessary for machine learning and data science? After some web searching, I came to a few important conclusions about why probability is vital.

Why Probability?

Probability appears in many predictive settings. Looking at a few of them will help us understand why it is indispensable.

  1. Classification Problem: A classification problem requires us to predict the probability that an input example belongs to a particular class. Whether it is image classification or object detection, we predict the probability of the input belonging to each class.
  2. Models based on a Probability framework: Models like linear regression and logistic regression can be viewed through a probabilistic lens: minimizing their respective loss functions is equivalent to maximizing a likelihood.
  3. Models trained by a Probability framework: Many machine learning models are trained using iterative algorithms designed under a probabilistic framework. Maximum Likelihood Estimation (MLE) is a good example.
  4. Models tuned by a Probability framework: Typical approaches include grid searching over ranges of hyperparameters or randomly sampling hyperparameter combinations. Bayesian optimization is a more efficient approach to hyperparameter optimization: it performs a directed search of the space of possible configurations, focusing on those most likely to yield better performance.
  5. Models evaluated with probabilistic measures: For classification problems, the cost function is often log-loss (the cross-entropy function), which requires the predicted probabilities.

Terminology

  1. Experiment: Any uncertain situation that can have several possible outcomes, e.g. the weather condition for the next day may be rainy, sunny, cloudy, stormy, windy, etc.
  2. Outcome: The result of an actual trial, i.e. the actual condition the next day. It has a single answer.
  3. Events: An event is one or more outcomes of an experiment, e.g. the weather types (sunny, rainy, etc.) are all possible events for the next day.
  4. Random Variable: A random variable is a numerical description of the outcome of a statistical experiment.
  5. Probability: The likelihood of an event occurring, e.g. the probability of sunny weather tomorrow. For equally likely outcomes, it is defined as the number of favorable outcomes divided by the total number of possible outcomes.

A Bernoulli Trial


A Bernoulli trial is an experiment that has only two possible outcomes, usually labeled success (positive) and failure (negative).

Binomial Distribution

The probability distribution of the number of successes (positive outcomes) in n independent Bernoulli trials is called the Binomial distribution.

Example: Suppose team A has a winning probability of 0.75 against team B. There is a series of five matches and we need to find the probability of team A winning the series.

The general formula for the probability of exactly k successes in n trials is given by: P(X = k) = C(n, k) * p^k * (1 - p)^(n - k), where p is the probability of success in a single trial.

And when we plot all these probabilities, we obtain the Binomial distribution.
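The series example above can be sketched in plain Python (no external libraries). One assumption: "winning the series" here means winning at least 3 of the 5 matches.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent Bernoulli trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

p_win = 0.75  # team A's probability of winning a single match (from the example)
n = 5         # matches in the series

# Team A wins the series by winning at least 3 of the 5 matches (assumed rule).
p_series = sum(binomial_pmf(k, n, p_win) for k in range(3, n + 1))
print(f"P(team A wins the series) = {p_series:.4f}")  # → 0.8965
```

Plotting `binomial_pmf(k, n, p_win)` for k = 0…5 gives the Binomial distribution described above.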

Here team A's chances of winning and losing are unequal, but when the chances are equal, or the number of matches grows large, this binomial distribution tends toward the well-known Normal Distribution.

Increasing the number of matches keeping the winning probability = 0.75, where n is the number of matches.
Increasing the number of matches for the winning probability = 0.5; where n is the number of matches

Around 20 matches, we can observe a clear normal shape when the winning probability is 0.5. When the probability of success is 0.75, the distribution still looks normal but is shifted toward higher values.
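This convergence can be checked numerically. The sketch below (n = 20, p = 0.5, matching the plots described above) compares the binomial probabilities against a normal density with the same mean np and standard deviation √(np(1 − p)):

```python
from math import comb, exp, pi, sqrt

n, p = 20, 0.5
mu, sigma = n * p, sqrt(n * p * (1 - p))  # matching normal parameters

def binom_pmf(k):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def normal_pdf(x):
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

# The largest gap between the two curves is already small at n = 20.
max_gap = max(abs(binom_pmf(k) - normal_pdf(k)) for k in range(n + 1))
print(f"max |binomial - normal| = {max_gap:.4f}")
```

Increasing n shrinks the gap further, which is the convergence the plots illustrate.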

Central Limit Theorem

If we take samples from a population distribution and plot their means, the resulting graph tends to a normal distribution once we have taken a sufficiently large number of sufficiently large samples.

Let us consider a normal distribution of students’ marks out of a hundred, and plot the means of samples for different sample sizes.

Here the x-axis represents the value of the sample mean and the y-axis represents the probability. The y-axis can also represent frequency if we plot counts instead of probabilities.

There is another observation: the standard deviation of the sample means is smaller than the standard deviation of the population. In fact, it equals σ/√n, where σ is the population standard deviation and n is the sample size.
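A small simulation makes both observations concrete. This is a sketch with assumed values: a hypothetical marks population with mean 75 and standard deviation 10.

```python
import random
import statistics

random.seed(42)

# Hypothetical population of marks: mean 75, SD 10 (assumed for illustration).
population = [random.gauss(75, 10) for _ in range(50_000)]

sample_size = 40
num_samples = 1_000

# Repeatedly draw a sample and record its mean.
sample_means = [
    statistics.mean(random.sample(population, sample_size))
    for _ in range(num_samples)
]

pop_sd = statistics.pstdev(population)
means_sd = statistics.stdev(sample_means)

print(f"population SD      = {pop_sd:.2f}")    # ≈ 10
print(f"SD of sample means = {means_sd:.2f}")  # ≈ 10 / sqrt(40) ≈ 1.6
```

A histogram of `sample_means` would show the narrow, bell-shaped curve the Central Limit Theorem predicts.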

Probability Density Function

A continuous random variable can take any value within a range of numbers; the function that describes its probability distribution is called a probability density function.

Area Under Curve

When frequency has been replaced by probability, the area under the curve of a normal distribution, or of any probability density function, is 1 (unity).

Suppose we want the probability that a student scores between 70 and 90 marks. We can find it as the area under the probability density function between x = 70 and x = 90.
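A sketch of this calculation, using the closed form of the normal CDF via the error function (the mean of 75 and standard deviation of 10 are assumed values, since the text leaves them unspecified):

```python
from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    """P(X <= x) for a normal random variable, via the error function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

mu, sigma = 75, 10  # assumed mean and standard deviation of the marks

# Area under the density curve between 70 and 90:
p_70_to_90 = normal_cdf(90, mu, sigma) - normal_cdf(70, mu, sigma)
print(f"P(70 <= marks <= 90) = {p_70_to_90:.4f}")
```

The difference of the two CDF values is exactly the area between the two limits on the x-axis.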

Z-Scores

Suppose we want to find the percentage of students scoring less than 80 marks given the mean and standard deviation of the population.

The Z-score is defined as

z = (x − μ) / σ

where μ is the population mean and σ is the population standard deviation. Looking up z in a standard normal table (or evaluating the standard normal CDF at z) gives the fraction of students scoring below x.
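As a short sketch of the calculation (again, the mean of 75 and standard deviation of 10 are assumed values for illustration):

```python
from math import erf, sqrt

mu, sigma = 75, 10  # assumed population mean and SD of the marks

x = 80
z = (x - mu) / sigma                      # z-score: distance from the mean in SD units
pct_below = 0.5 * (1 + erf(z / sqrt(2)))  # standard normal CDF evaluated at z

print(f"z = {z:.2f}")
print(f"{pct_below:.1%} of students score below {x}")
```

The CDF expression plays the role of the Z-table lookup: it converts the z-score into the percentage of students below the given mark.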

Summing Up

The branch of probability and statistics concerned with drawing conclusions about a population from samples is called inferential statistics; it differs from the descriptive statistics we have discussed above, which simply summarize the data at hand.

This was a basic introduction to probability, and these fundamentals of descriptive statistics are enough to get started with machine learning algorithms.
