Probability Distributions in Data Science and Machine Learning | Part 1

Abhishek Barai
Nov 24, 2020 · 7 min read
Image credit: analyticsindiamag.com

For an aspiring data scientist, statistics is a must-learn subject. It helps tackle complex and challenging real-world problems so that data scientists can mine useful trends, changes, and data behavior and fit an appropriate model that yields the best results. Every time we get a new dataset, we must understand the data pattern and the underlying probability distribution for further optimization and treatment during Exploratory Data Analysis (EDA). During EDA, we try to find out the behavior of the data using different probability distributions. If the data follows or resembles one of these distributions, we treat it accordingly for a better result.

Data scientists deal with many kinds of data, such as categorical, numerical, text, image, voice, and many more. Each of them has its own way of analysis and representation. Here we are going to consider numerical data for further analysis. Numerical data can be of two types.

  1. Discrete: It can only take specific, separate values. For example, the number of employees in a company, or the result of rolling a die, where the possible outcomes are the integers from 1 to 6.

  2. Continuous: It can take any value within a range.

We can plot this numerical data, visualize it, and draw conclusions based on its pattern, its behavior, and the type of probability distribution it follows. Before going deeper, let's get familiar with some terminology.

Q. What is a Probability Distribution?

Ans: A probability distribution is a function that describes all the possible values a random variable can take within a range and how likely each of them is.

Q. What is a Random Variable?

Ans: A variable whose value depends on the outcome of a chance experiment is called a random variable. Its value is unknown in advance and is obtained by performing the experiment. It can be discrete (when the outcome takes specific values) or continuous (when the outcome falls anywhere within a particular range).

Q. What is a Probability Mass Function (PMF)?

Ans: The distribution of a discrete random variable is described by its probability mass function (PMF). The PMF of a discrete random variable X gives the probability of each value x that it can take,

$$p(x) = P(X = x), \qquad p(x) \ge 0, \qquad \sum_{x} p(x) = 1$$
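As a minimal sketch of what a PMF looks like in code, the snippet below stores the PMF of a fair six-sided die in a dictionary and checks its two defining properties: no probability is negative, and all of them sum to 1. The names `die_pmf` and `pmf` are illustrative choices, not part of any library.

```python
from fractions import Fraction

# PMF of a fair six-sided die: each face has probability 1/6.
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}

def pmf(x):
    """Return P(X = x); values outside the support have probability 0."""
    return die_pmf.get(x, Fraction(0))

# The two defining properties of a PMF.
all_non_negative = all(p >= 0 for p in die_pmf.values())
total_probability = sum(die_pmf.values())

print(pmf(3))             # probability of rolling a 3
print(pmf(7))             # 7 is not a possible outcome
print(total_probability)  # the probabilities add up to 1
```

Exact fractions are used here only to avoid floating-point rounding in the sum.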

Q. What is a Probability Density Function (PDF)?

Ans: The distribution of a continuous random variable is described by its probability density function (PDF). For a variable X with density f(x), the probability that its value falls in an interval of numbers from a to b is defined as,

$$P(a \le X \le b) = \int_{a}^{b} f(x)\,dx, \qquad f(x) \ge 0, \qquad \int_{-\infty}^{\infty} f(x)\,dx = 1$$
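For a continuous variable, probabilities are areas under the density curve, P(a ≤ X ≤ b) = ∫ f(x) dx taken from a to b. The sketch below approximates such an area numerically. The density f(x) = 2x on [0, 1] is a hypothetical example picked only because its exact answer, b² − a², is easy to check; the function names are assumptions too.

```python
def density(x):
    """Hypothetical density f(x) = 2x on [0, 1], and 0 elsewhere."""
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0

def prob_between(a, b, steps=10_000):
    """Approximate P(a <= X <= b) as an area, using the midpoint rule."""
    width = (b - a) / steps
    return sum(density(a + (i + 0.5) * width) for i in range(steps)) * width

p_interval = prob_between(0.2, 0.5)  # exact value: 0.5**2 - 0.2**2 = 0.21
p_total = prob_between(0.0, 1.0)     # the total area under a density is 1

print(p_interval, p_total)
```

Note that the density itself is not a probability: f(0.9) = 1.8 here, which is above 1, yet every area computed from it stays between 0 and 1.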

Discrete Probability Distributions:

There are several discrete probability distributions commonly used in statistics and data science. The ones covered here are,

  1. Bernoulli
  2. Binomial
  3. Uniform
  4. Poisson
  5. Geometric

Bernoulli Distribution:

The Bernoulli distribution describes a Bernoulli trial, which has only two possible outcomes: success or failure. For example, tossing a coin can only yield two outcomes: heads or tails.

Let the probability of success be p; then the probability of failure is (1-p). So the function can be defined as,

$$P(X = x) = p^{x}(1-p)^{1-x}, \qquad x \in \{0, 1\}$$

where x = 1 denotes success and x = 0 denotes failure.

For a single toss of an unbiased coin, the probability of getting heads is p = 0.5, as both outcomes are equally likely. Then (1-p) = 0.5. So,

P(X=1) = p(1) = p = 1/2
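The coin example can be sketched in a few lines, assuming the usual form of the Bernoulli PMF, P(X = x) = p^x (1-p)^(1-x) for x in {0, 1}. The helper names `bernoulli_pmf` and `bernoulli_trial` are made up for this illustration, and the simulation uses a fixed seed only so that the run is repeatable.

```python
import random

def bernoulli_pmf(x, p):
    """P(X = x) = p**x * (1 - p)**(1 - x) for x in {0, 1}, else 0."""
    if x not in (0, 1):
        return 0.0
    return p**x * (1 - p) ** (1 - x)

def bernoulli_trial(p, rng):
    """Simulate one trial: 1 (success) with probability p, otherwise 0."""
    return 1 if rng.random() < p else 0

rng = random.Random(42)  # fixed seed, an arbitrary choice
tosses = [bernoulli_trial(0.5, rng) for _ in range(10_000)]
head_share = sum(tosses) / len(tosses)

print(bernoulli_pmf(1, 0.5))  # 0.5
print(bernoulli_pmf(0, 0.5))  # 0.5
print(head_share)             # share of heads in the simulated tosses
```

With many simulated tosses, the share of heads should settle near the theoretical p.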

Distribution Plot:

[Figure: Bernoulli distribution with probability of success = 0.3 and failure = 0.7 for a single trial]

Binomial Distribution:

As we saw, the Bernoulli distribution is based on the outcome of a single experiment. Suppose an unbiased coin is tossed 10 times. What is the probability of getting heads at least 7 times? This is where the binomial distribution comes into the picture.

A binomial distribution can be thought of as simply the probability of getting a given number of successes when an experiment or survey with a success-or-failure outcome is repeated multiple times.

Q. Under which conditions does a binomial distribution become a Bernoulli distribution?

  1. The number of trials(n) should be 1.

Assumptions:

  • The experiment is performed under the same set of conditions in every trial. For example, if the probability of success (p) is 0.5, it stays 0.5 throughout the trials.

Definition:

A random variable X is said to follow a binomial distribution if it takes only the non-negative integer values 0, 1, ..., n and its probability mass function is given by,

$$P(X = x) = \binom{n}{x} p^{x} q^{n-x}, \qquad x = 0, 1, \dots, n$$

where n is the number of trials, p is the probability of success in each trial, and q = 1 - p.

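Now the probability of getting at least 7 heads in 10 tosses of an unbiased coin (n = 10, p = q = 1/2) would be,

$$P(X \ge 7) = \sum_{x=7}^{10} \binom{10}{x} \left(\tfrac{1}{2}\right)^{10} = \frac{120 + 45 + 10 + 1}{1024} = \frac{176}{1024} \approx 0.172$$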
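The "at least 7 heads in 10 tosses" question can be checked numerically. This is a small sketch that assumes the standard binomial PMF, P(X = x) = C(n, x) p^x (1-p)^(n-x), and sums it over x = 7, 8, 9, 10; `binomial_pmf` is a helper written for this example.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) = C(n, x) * p**x * (1 - p)**(n - x)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 10, 0.5  # 10 tosses of an unbiased coin
p_at_least_7 = sum(binomial_pmf(x, n, p) for x in range(7, n + 1))

print(p_at_least_7)  # 0.171875
```

So an unbiased coin shows 7 or more heads in 10 tosses about 17% of the time. Setting n = 1 in the same function gives back the Bernoulli probabilities p and 1 - p.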

Parameters:

$$\text{mean} = np, \qquad \text{variance} = npq$$
n = number of independent trials, p = probability of success in each trial, q = 1 - p
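The standard results for a binomial distribution are mean = np and variance = npq with q = 1 - p. As a hedged check, the sketch below computes both directly from the PMF and compares them with those closed forms. The values n = 10 and p = 0.3 are arbitrary choices for illustration.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) = C(n, x) * p**x * (1 - p)**(n - x)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 10, 0.3  # arbitrary illustrative values
q = 1 - p

# Mean and variance computed from the definition of an expectation.
mean = sum(x * binomial_pmf(x, n, p) for x in range(n + 1))
variance = sum((x - mean) ** 2 * binomial_pmf(x, n, p) for x in range(n + 1))

print(mean, n * p)          # both close to 3.0
print(variance, n * p * q)  # both close to 2.1
```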

Distribution Plot:

[Figure: binomial distribution for different probabilities of success]

Properties:

  • A binomial distribution is skewed unless p=q=1/2.

Q. Under which conditions can a binomial distribution be approximated by a normal distribution?

  1. The number of independent trials should be indefinitely large, n → ∞.
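To see this limit numerically, the sketch below compares the binomial PMF with a normal density that has the same mean np and variance np(1-p), and records the largest gap between the two as n grows. The choice p = 0.5 and the helper names are assumptions made for the illustration. The approximation is usually described as working best when p is not too close to 0 or 1.

```python
from math import comb, exp, pi, sqrt

def binomial_pmf(x, n, p):
    """P(X = x) = C(n, x) * p**x * (1 - p)**(n - x)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std. dev. sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

p = 0.5
gaps = {}
for n in (10, 100, 1000):
    mu, sigma = n * p, sqrt(n * p * (1 - p))
    # Largest pointwise difference between the two curves.
    gaps[n] = max(
        abs(binomial_pmf(x, n, p) - normal_pdf(x, mu, sigma))
        for x in range(n + 1)
    )
    print(n, gaps[n])
```

The printed gap shrinks as n increases, which is the behavior the condition n → ∞ describes.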

Uniform Distribution:

The uniform distribution for a discrete random variable is a symmetric probability distribution in which each of a finite number of values is equally likely to be observed. For example, when we roll a die or toss an unbiased coin, all the possible outcomes are equally likely.

For a random variable X that takes n equally likely values, the uniform distribution function can be defined as,

$$f(x) = P(X = x) = \frac{1}{n}$$

For example, by rolling an unbiased die, we get 6 possible values: {1,2,3,4,5,6}. Each of these values is equally likely.

So, f(x) = P(X = x) = 1/6 (the probability of getting any one value)

Parameters:

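$$\text{mean} = \frac{a + b}{2}, \qquad \text{variance} = \frac{n^{2} - 1}{12}$$
mean and variance for a discrete uniform distribution over the integers a, a+1, ..., b, where n = b - a + 1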
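For a discrete uniform distribution over the integers a, a+1, ..., b (so n = b - a + 1 values), the standard closed forms are mean = (a + b)/2 and variance = (n² − 1)/12. The sketch below computes both from the definition for a fair die and compares them with these formulas. The function name is an assumption, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def uniform_mean_variance(a, b):
    """Mean and variance of a discrete uniform distribution on a..b."""
    values = range(a, b + 1)
    prob = Fraction(1, len(values))  # every value is equally likely
    mean = sum(x * prob for x in values)
    variance = sum((x - mean) ** 2 * prob for x in values)
    return mean, variance

a, b = 1, 6  # a fair die
n = b - a + 1
mean, variance = uniform_mean_variance(a, b)

print(mean, variance)                             # 7/2 35/12
print(Fraction(a + b, 2), Fraction(n * n - 1, 12))  # closed forms
```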

Distribution Plot:

[Figure: plot of the discrete uniform distribution]

Poisson Distribution:

The Poisson distribution is a discrete distribution that was derived by the French mathematician Siméon Denis Poisson. He developed it in 1830 to describe the number of times a gambler would win a rarely won game of chance in a large number of tries.

Basically, it shows how often an event is likely to occur within a specified period of time. Because the random variable is discrete, each event is counted either as occurring or as not occurring, so the count is always a whole number.

Definition:

A random variable X is said to follow a Poisson distribution when it takes only non-negative integer values and its probability function is given by,

$$P(X = x) = \frac{e^{-\lambda} \lambda^{x}}{x!}, \qquad x = 0, 1, 2, \dots$$
λ = Poisson parameter (λ > 0)
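As a rough sketch of how the Poisson probability function is used, assume the standard form P(X = x) = e^(-λ) λ^x / x! and a hypothetical rate of λ = 3 events per month (the rate is invented for illustration, not taken from any dataset).

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) = exp(-lam) * lam**x / x! for x = 0, 1, 2, ..."""
    return exp(-lam) * lam**x / factorial(x)

lam = 3.0  # hypothetical average number of events per month

p_none = poisson_pmf(0, lam)                                # no event at all
p_at_most_two = sum(poisson_pmf(x, lam) for x in range(3))  # 0, 1 or 2 events
p_more_than_two = 1 - p_at_most_two                         # 3 or more events

print(p_none, p_at_most_two, p_more_than_two)
```

The last value uses the complement rule, which avoids summing the infinite tail of the distribution.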

It is a uni-parameter and univariate distribution. It is also a limiting case of the binomial distribution.

Q. Under which conditions does a binomial distribution approach a Poisson distribution?

  1. The number of trials (n) should be very large, n → ∞.
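The limiting behavior can be illustrated numerically. In the standard statement of this limit, n grows while the probability of success p shrinks so that the product np stays fixed at λ. The sketch below keeps np = 3 (an arbitrary choice) and measures the largest gap between the binomial and Poisson probabilities for x = 0, ..., 10. Both helper functions are written for this example.

```python
from math import comb, exp, factorial

def binomial_pmf(x, n, p):
    """P(X = x) = C(n, x) * p**x * (1 - p)**(n - x)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def poisson_pmf(x, lam):
    """P(X = x) = exp(-lam) * lam**x / x!."""
    return exp(-lam) * lam**x / factorial(x)

lam = 3.0
gaps = {}
for n in (10, 100, 1000, 10000):
    p = lam / n  # keep n * p fixed at lam while n grows
    gaps[n] = max(
        abs(binomial_pmf(x, n, p) - poisson_pmf(x, lam)) for x in range(11)
    )
    print(n, gaps[n])
```

The gap shrinks steadily as n increases, so for large n and small p the simpler Poisson formula can stand in for the binomial one.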

Examples:

Many real-life datasets that we encounter as data scientists follow the Poisson distribution. For example,

  1. The number of fraudulent transactions that happen in a month at a particular bank.

It is important to have an idea of what kind of distribution our dataset follows. That way, we can draw sound conclusions about data modeling.

Parameters:

$$\text{mean} = \text{variance} = \lambda$$
For the Poisson distribution, the mean and the variance are the same: both equal the Poisson parameter λ.
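One practical way to use the "mean equals variance" property is as a quick check on count data. The sketch below simulates Poisson counts and compares the sample mean with the sample variance. The sampler is Knuth's multiplication method, swapped in here so the example needs only the standard library, and λ = 4 with a fixed seed are arbitrary choices.

```python
import random
from math import exp

def poisson_sample(lam, rng):
    """Draw one Poisson(lam) value with Knuth's multiplication method."""
    threshold = exp(-lam)
    count, product = 0, rng.random()
    # Keep multiplying uniform draws until the product falls below e^-lam.
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

rng = random.Random(0)  # fixed seed so the run is repeatable
lam = 4.0
samples = [poisson_sample(lam, rng) for _ in range(20_000)]

sample_mean = sum(samples) / len(samples)
sample_variance = sum((s - sample_mean) ** 2 for s in samples) / len(samples)

print(sample_mean, sample_variance)  # both should be near lam
```

If the variance of real count data is much larger than its mean, a plain Poisson model is probably a poor fit.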

Distribution Plot:

[Figure: plot of the Poisson distribution]

Geometric Distribution:

Suppose that, after an election, we want to survey how many votes an independent candidate received. Standing outside a polling booth, we start asking people whom they voted for, and each time we hear the name of another candidate. Finally, we meet a person who says that he/she voted for the independent candidate. Here the geometric distribution describes the number of people we had to poll before finding someone who voted for our candidate.

Basically, it represents the number of failures before the first success in a series of independent Bernoulli trials (each of which always has two outcomes).

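We can define the function as,

$$P(X = x) = (1 - p)^{x} p, \qquad x = 0, 1, 2, \dots$$

where p is the probability of success in each trial and x is the number of failures before the first success.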
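For the polling example, a minimal sketch assuming the "failures before the first success" form of the geometric PMF, P(X = x) = (1-p)^x p, might look like this. The 20% vote share is a hypothetical number chosen for illustration.

```python
def geometric_pmf(x, p):
    """P(X = x) = (1 - p)**x * p: x failures, then the first success."""
    return (1 - p) ** x * p

p = 0.2  # hypothetical share of voters who chose the candidate

# Exactly three people name someone else, then the fourth names our candidate.
p_three_failures = geometric_pmf(3, p)

# P(X <= 4): the first supporter is found among the first five people polled.
p_within_five = sum(geometric_pmf(x, p) for x in range(5))

print(p_three_failures, p_within_five)
```

The second value can also be obtained as 1 - (1-p)^5, the complement of five failures in a row.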

Assumptions:

  • There are two possible outcomes for each trial (success or failure).

Parameters:

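$$\text{mean} = \frac{1 - p}{p}, \qquad \text{variance} = \frac{1 - p}{p^{2}}$$
mean and variance of the number of failures before the first success, where p is the probability of success in each trial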
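When the geometric variable counts the failures before the first success, its mean is (1-p)/p and its variance is (1-p)/p². The simulation below is a hedged check of those two values. The choice p = 0.25, the seed, and the function name are all assumptions made for the example.

```python
import random

def failures_before_success(p, rng):
    """Run Bernoulli trials until the first success; return the failure count."""
    failures = 0
    while rng.random() >= p:  # this trial failed, so try again
        failures += 1
    return failures

rng = random.Random(7)  # fixed seed so the run is repeatable
p = 0.25
q = 1 - p
samples = [failures_before_success(p, rng) for _ in range(50_000)]

sample_mean = sum(samples) / len(samples)
sample_variance = sum((s - sample_mean) ** 2 for s in samples) / len(samples)

print(sample_mean, q / p)         # theory: 3.0
print(sample_variance, q / p**2)  # theory: 12.0
```

Under the alternative convention that counts the successful trial as well, the mean becomes 1/p while the variance stays the same.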

Distribution Plot:

[Figure: plot of the geometric distribution]

Note: There are many other discrete probability distributions, such as the multinomial, negative binomial, and hypergeometric distributions. These distributions also matter a great deal in statistics, and it's good to be familiar with them from a data science perspective. But I will conclude the discrete part here with the above 5 distributions.


Written by Abhishek Barai

Data Scientist | NLP Engineer | Quantitative Researcher | Blogger

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com
