Probability for Data Science

Sahil Mankad
Published in Analytics Vidhya · 6 min read · Feb 24, 2020

Introduction:

Probability is a core mathematical concept for data science: it underpins hypothesis testing, Bayes' theorem, and the interpretation of machine learning outputs, among other things. We will cover some basic concepts of probability in this blog. Let's begin.

Frequency Tables:

It is a way to represent the count of each category in a distribution. Let's consider a distribution of 5 coloured balls: Red, Red, Green, Blue, Red. The frequency table for this distribution will be:

Colour | Frequency
Red    | 3
Green  | 1
Blue   | 1
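A frequency table like this can be computed directly in code. A minimal sketch using only the Python standard library, with the five-ball example above:

```python
from collections import Counter

# The five-ball distribution from the example above
balls = ["Red", "Red", "Green", "Blue", "Red"]

# Counter tallies how often each category appears
freq = Counter(balls)
print(freq["Red"], freq["Green"], freq["Blue"])  # 3 1 1
```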

Histograms and Bar Plots:

A histogram is used to describe the distribution of a continuous variable. Wikipedia describes a histogram as an estimate of the probability distribution of a continuous variable. We will see what continuous variables and probability distributions are later in this article. Below is an example of a histogram, plotting the frequency of the MRP (maximum retail price) of items in a store.

Reference: https://bit.ly/2v53J8Y

Similarly, bar plots can be used to plot categorical variables. Below is an example of a bar plot, showing the average trip duration for each cab vendor.

Reference: https://bit.ly/2v53J8Y

Probability:

Let us first have a look at some of the terms associated with probability:

Experiment: a trial with a well-defined set of possible outcomes

Outcome: a possible result of an experiment

Event: a set of outcomes of an experiment

Probability can be defined as the likelihood of an event happening. This probability value lies between 0 and 1.

The probabilities of all possible outcomes of an experiment sum to 1.

The formula for probability is:

Probability(event) = Number of desired outcomes / Number of total outcomes.

Example: for the experiment of tossing a fair coin, the set {heads, tails} contains the outcomes. There are 2 possible events: one where we get heads, and one where we get tails. Since the coin is fair, unlike the one used in Sholay (which all Bollywood aficionados would be aware of 😝), the probabilities of getting heads and tails are equal, i.e. 0.5 each.

Let the desired outcome be heads. Hence the number of desired outcomes in this case is 1. There are 2 possible outcomes: heads and tails. Using the above formula:

P(heads) = ½.

The probability for the event of getting tails can be calculated in a similar manner.
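The formula above can be checked empirically: if we simulate many tosses of a fair coin, the observed frequency of heads should approach the theoretical probability of 0.5. A minimal sketch using the Python standard library (the number of tosses is an arbitrary choice):

```python
import random

# Probability(event) = number of desired outcomes / number of total outcomes
outcomes = ["heads", "tails"]
p_heads = 1 / len(outcomes)  # 0.5 for a fair coin

# Sanity check by simulation: the observed frequency of heads
# should approach 0.5 as the number of tosses grows
random.seed(0)
tosses = [random.choice(outcomes) for _ in range(100_000)]
observed = tosses.count("heads") / len(tosses)
print(p_heads, round(observed, 3))
```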

Bernoulli Trials:

Bernoulli Trials are experiments with exactly two outcomes. Examples are:

  1. Tossing a fair coin, outcome can be heads/tails.
  2. Outcome of a sports game, win or lose.
  3. Outcome of a test: students pass or fail in the exam.

Binomial Distribution:

The Binomial Distribution describes the number of successes in n Bernoulli trials. Let p be the probability of success and q the probability of failure of a Bernoulli trial, and let x be the number of successes in the n trials. The number of failures will then be n - x. The probability distribution formula is:

P(X = x) = C(n, x) * p^x * q^(n - x)

Now we know that the probability of failure = 1 - probability of success. Hence, we can also write q as 1 - p:

P(X = x) = C(n, x) * p^x * (1 - p)^(n - x)
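This formula translates directly to code. A minimal sketch using the standard library's `math.comb` for C(n, x), applied to the coin example (5 tosses, fair coin):

```python
from math import comb

def binomial_pmf(n, x, p):
    """P(X = x) = C(n, x) * p^x * (1 - p)^(n - x)"""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Probability of exactly 3 heads in 5 tosses of a fair coin
print(binomial_pmf(5, 3, 0.5))  # 0.3125
```

Summing `binomial_pmf(n, x, p)` over all x from 0 to n gives 1, as expected for a probability distribution.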

We can plot the values of this Binomial Distribution as a probability mass function.

Probability Mass function:

Wikipedia defines a probability mass function as a function that gives the probability that a discrete random variable is exactly equal to some value. A discrete random variable is one that takes only separate, countable values (such as whole numbers), rather than any value in a continuous range.

The probability mass function for tossing a fair coin 5 times will be as shown below:

Reference: https://bit.ly/2v53J8Y

For a large number of trials (let's assume this number approaches infinity), the probability mass function approaches a continuous normal curve (we'll see more on the normal distribution later in the article), described by a probability density function. Below is an example.

Reference: https://bit.ly/2v53J8Y

Continuous Random Variable:

Continuous random variables are variables which can take any value in a given range. For example, the amount of water in a jug can have any value between 0 and the holding capacity of the jug, including decimal values. A continuous random variable can be plotted graphically as a probability density function, which we have seen earlier.

Skewness of distributions:

Data can be distributed in various ways. We can check the skewness of a distribution by plotting it as a histogram or density curve, as done below.

  1. Right Skewed Distribution:

A distribution having a longer tail towards the right side of the graph is a right-skewed distribution. For a right-skewed distribution, Mode < Median < Mean. Below is what a right-skewed curve looks like.

Reference: https://bit.ly/2vU9rLa

2. Left Skewed Distribution:

A distribution having a longer tail towards the left side of the graph is a left-skewed distribution. For a left-skewed distribution, Mode > Median > Mean. Below is what a left-skewed curve looks like:

Reference: https://bit.ly/2vU9rLa

3. Normal Distribution:

A distribution with a symmetric structure, i.e. one that skews neither right nor left, is a normal distribution. It is also known as a bell curve because of its bell shape. For a normal distribution, Mode = Median = Mean. Below is the plot for a normal distribution.

Reference: https://bit.ly/2PiGqj8

Some important points to remember for a normal distribution are:

  1. It is symmetric around the mean.
  2. The empirical rule for the normal distribution is that 68% of values fall within 1 standard deviation of the mean, 95% fall within 2 standard deviations, and 99.7% fall within 3 standard deviations. We consider both directions around the mean.
  3. When we standardize a normal variable (subtract the mean and divide by the standard deviation), we convert the Normal Distribution to the Standard Normal Distribution. A standard normal distribution has mean = 0 and standard deviation = 1, and the area under the curve is 1.
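The empirical rule can be verified by simulation. A minimal sketch drawing from a standard normal distribution with the Python standard library (the sample size is an arbitrary choice):

```python
import random

random.seed(0)
# Draw samples from a standard normal distribution (mean 0, sd 1)
samples = [random.gauss(0, 1) for _ in range(100_000)]

# Fraction of values within 1 and 2 standard deviations of the mean
within_1sd = sum(abs(x) <= 1 for x in samples) / len(samples)
within_2sd = sum(abs(x) <= 2 for x in samples) / len(samples)
print(round(within_1sd, 2), round(within_2sd, 2))  # roughly 0.68 and 0.95
```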

Central Limit Theorem:

Consider a large population. If we repeatedly draw samples from it and plot the means of those samples, the distribution of the sample means approaches a normal distribution as the sample size grows, regardless of the shape of the original population. Moreover, according to this theorem, the mean of this sampling distribution is approximately equal to the mean of the population.
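This can be demonstrated by simulation. A minimal sketch using the Python standard library, with a deliberately non-normal population (uniform die rolls); the population size, sample size, and number of samples are arbitrary choices:

```python
import random
import statistics

random.seed(0)
# A decidedly non-normal population: uniform integers 1..6 (a die)
population = [random.randint(1, 6) for _ in range(100_000)]

# Draw many samples and record each sample's mean
sample_means = [
    statistics.mean(random.sample(population, 50)) for _ in range(2_000)
]

# The sampling distribution of the mean centres on the population mean
print(round(statistics.mean(population), 2),
      round(statistics.mean(sample_means), 2))
```

Plotting `sample_means` as a histogram would show the familiar bell shape, even though the population itself is flat.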

Z-Score :

The Z-score is defined as the number of standard deviations the observed value lies away from the mean.

Z-score formula: z = (x - μ) / σ

Where

x: some value in normal distribution

μ: mean of the normal distribution

σ: standard deviation of the normal distribution

The distribution for Z-score is as shown below:

Reference: https://bit.ly/32i17kl

A positive Z-score indicates that the observed value is Z standard deviations to the right of the mean; a negative Z-score indicates that the value is to the left of the mean. Around 99.7% of Z-scores lie between -3 and 3, and anything outside this range can be considered highly unusual. Z-scores are widely used in statistical hypothesis testing.
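The formula is a one-liner in code. A minimal sketch, with the exam scores below being made-up illustrative numbers:

```python
def z_score(x, mu, sigma):
    """Number of standard deviations x lies from the mean mu."""
    return (x - mu) / sigma

# An exam score of 85 in a class with mean 70 and sd 10
print(z_score(85, 70, 10))  # 1.5 -> 1.5 sd to the right of the mean
print(z_score(55, 70, 10))  # -1.5 -> 1.5 sd to the left of the mean
```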

Conclusion :

We have covered a brief overview of concepts related to probability. I hope you all enjoyed it. You can also read my article on descriptive statistics here. See you next time!
