12 Most asked Probability Questions in a Data Science Interview…

22 min read · Aug 31, 2023

In this exciting blog installment, I’m carrying on with our series where I’ll be breaking down topics just like I’m chatting with an awesome 10-year-old student! Let’s dive in and explore these cool concepts together!

Probability and Statistics are the backbone of Machine Learning. If you are interviewing at any product-based company, you can expect a separate round on probability and stats.

But first, we need to understand why Probability and Statistics are so important.

Let me take an example,

Picture this: Zomato, our food-loving superhero, is on a mission to predict the number of orders they’ll get by the end of this year. But hold on, there’s a twist! Zomato has two super teams: Team A and Team B.

Team A, the mighty data wizards, used mind-bending Machine Learning spells and optimization magic to predict there will be a whopping 12 lakh more orders this year compared to the last. But here’s the thing: the big bosses making decisions might not speak the same tech language as Team A.

How can they trust this mystical number?

It’s like reading a secret code!

Now, let’s zoom into Team B. These legends cooked up a different potion. They boldly declared that there’s a stunning 95% chance the order count will dance somewhere between 10 and 14 lakhs.

This magical range is called the Confidence Interval — Team B is basically saying, “We’re 95% sure that the orders will land right in this sweet spot!”

So, in this enchanting tale, Team A and Team B have conjured up their own predictions, but Team B added an extra sprinkle of magic with that Confidence Interval. It’s like Zomato’s crystal ball showing a fascinating range of possibilities, lighting up the path for the big decision-makers!

Probability and statistics matter in machine learning in many different ways, because they show up in so many use cases.

Get ready to embark on an epic journey through the dazzling world of Machine Learning. Imagine each topic is like a powerful tool in a wizard’s toolkit, helping us solve the most fascinating challenges.

Hypothesis testing, A/B testing, Normal Distribution, Conditional Probability, and the legendary Bayes Theorem — they’re all here to play their crucial roles in unraveling Machine Learning mysteries. 🕵️

Hold onto your seats because in this blog, we’re about to dive into the juiciest stuff! We’re taking the trickiest interview questions and turning them into enchanting tales.

So get ready to unlock the secrets behind these mind-bending concepts, and impress even the toughest interview dragons with your newfound knowledge!

Your journey to Machine Learning mastery starts now!

Ques 1. Define Bayes’ Theorem.

Bayes Theorem

Bayes’ Theorem is a way to figure out how likely something is to happen when you have some old information and new clues. It’s like mixing your favorite flavors of ice cream to make the tastiest treat!

Here’s the special recipe for Bayes’ Theorem:

P(A|B) = (P(B|A) * P(A)) / P(B)

Don’t worry if it looks a little fancy. Let’s break it down like a puzzle:

P(A|B) means “the chance that A happened when you know that B happened.”

P(B|A) is “the chance of finding clue B when you know A happened.”

P(A) is the chance of A happening before you see any clues, just like your first guess.

P(B) is the overall chance of finding clue B, whether or not A happened.

Now, let’s put it all together with a fun story:

Imagine you’re trying to figure out if your friend Sammy ate your cookies. You start with a guess (P(A)) that Sammy loves cookies and might have eaten them.

Then, you find a clue: there are crumbs on Sammy’s shirt (B). You also know that when Sammy eats cookies, crumbs usually show up (P(B|A)).

But you also remember that sometimes Sammy gets crumbs from other snacks too (P(B|not A)). Bayes’ Theorem helps you mix all of this information, your first guess (P(A)), the chance of seeing crumbs if Sammy ate the cookies (P(B|A)), and the chance of seeing crumbs from other snacks (P(B|not A)).

When you put the numbers in the formula and mix it all up, you get a better idea of how likely it is that Sammy actually ate your cookies (P(A|B)). It’s like making a special blend of flavors to find out the tastiest result! So, Bayes’ Theorem is like using a cool math trick to help you solve mysteries and make better guesses when you have some old knowledge and exciting new clues!
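To see the recipe with actual numbers, here is a small Python sketch of the cookie mystery. The three probabilities are made-up values chosen only for illustration, and the function name `posterior` is hypothetical. It assumes P(B) is built from the two ways crumbs can appear: P(B) = P(B|A) * P(A) + P(B|not A) * P(not A).

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Return P(A|B) using Bayes' Theorem.

    P(B) is expanded with the law of total probability:
    P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)
    """
    p_b = p_b_given_a * prior_a + p_b_given_not_a * (1 - prior_a)
    return p_b_given_a * prior_a / p_b


# Hypothetical numbers for the cookie mystery (not given in the story).
p_sammy_ate = 0.5        # P(A): first guess that Sammy ate the cookies
p_crumbs_if_ate = 0.9    # P(B|A): crumbs show up when Sammy eats cookies
p_crumbs_if_not = 0.2    # P(B|not A): crumbs that came from other snacks

p_ate_given_crumbs = posterior(p_sammy_ate, p_crumbs_if_ate, p_crumbs_if_not)
print(f"P(Sammy ate the cookies | crumbs) = {p_ate_given_crumbs:.3f}")
```

With these assumed numbers, the crumbs push the guess from 50% up to roughly 82%, which is exactly the "old knowledge plus new clue" update described above.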

Ques 2. What is the Poisson Distribution?

Imagine you’re counting how many times something special happens in a certain time or space, like how many shooting stars you see in an hour.

The Poisson distribution is quite special. In this magical world of probability, it’s one of the few distributions where the mean (average) and the variance (a measure of how much the data spreads out) are equal.

The Poisson Distribution is like a magical tool that helps you understand how often these rare events might happen. It’s like having a superpower to make guesses about things that don’t happen all the time!

Now, here’s the special formula for the Poisson Distribution:

P(X; λ) = (e^(-λ) * λ^X) / X!

I know it looks a little bit tricky, but let’s break it down:

P(X; λ): This is the chance of the event happening X times in a certain time or space, and λ is a special number that helps us figure this out.

e^(-λ): This is a special number (it’s like 2.71828…) raised to the power of negative λ.

λ^X: This is λ raised to the power of X, which means you’re multiplying λ by itself X times.

X!: This is the factorial of X. It’s like multiplying all the numbers from X down to 1. So, if X is 3, then X! is 3 × 2 × 1 = 6.

When you put all these things together, you get a number that tells you how likely it is for a certain number of events to happen.

Imagine you’re counting shooting stars in an hour, and on average, you see 2 shooting stars per hour (that’s λ = 2). You can use the Poisson Distribution to guess how likely it is to see, let’s say, 3 shooting stars in an hour (that’s X = 3). You plug in the numbers into the formula, do the math, and you get a chance!
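Here is a minimal Python sketch that plugs the shooting-star numbers (λ = 2, X = 3) into the formula above. The function name `poisson_pmf` is just a label chosen for this example.

```python
import math


def poisson_pmf(x, lam):
    """P(X; lam) = (e^(-lam) * lam^X) / X!"""
    return math.exp(-lam) * lam ** x / math.factorial(x)


# On average 2 shooting stars per hour; how likely is it to see exactly 3?
p_three_stars = poisson_pmf(3, 2)
print(f"P(3 shooting stars | lambda = 2) = {p_three_stars:.4f}")
```

So with an average of 2 per hour, seeing exactly 3 shooting stars has a chance of about 18%.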

So, the Poisson Distribution is like your special helper for guessing how often rare things happen. It’s a bit like magic math that makes predictions about special events!

Ques 3. What are the properties of a normal distribution?

Imagine you have a bunch of superhero friends, each with their own unique heights. If you were to make a chart of their heights, something really interesting would happen. The chart would form a special shape that looks like a gentle hill!

Now, this hill has some cool features:

Symmetry: Imagine folding the hill in half vertically. The left side and the right side would match up perfectly, like a mirror. This means that a friend who is a certain amount taller than the middle is just as common as a friend who is the same amount shorter than the middle.

Bell Shape: The hill is smooth and curved like a friendly bell. It’s the tallest in the middle and gets shorter as you move away from the middle. Just like a hill in a playground, but a bit more magical!

Mean, Median, and Mode: The “mean” is the average of all your superhero friends’ heights. The “median” is the height right in the middle when you line everyone up from shortest to tallest, and the “mode” is the most common height, which is the very top of the hill’s bump. In a normal distribution, all three of these sit at the same spot: the center of the hill.

Spread Out: How wide the hill is tells us how spread out your superhero friends’ heights are. If the hill is wide, it means they have a range of heights. If it’s narrow, they are more similar in height.

Area Under the Curve: The whole hill covers all your friends’ heights. If you were to color in the hill, the total colored area would be equal to 1. It’s like a superhero cape that wraps around all the heights.

Now, here’s the formula that’s like the magical code for understanding this hill:

f(x) = (1 / (σ * √(2π))) * e^(-((x - μ)²) / (2σ²))

f(x): This is like the height of the hill at a certain point.

μ: This is where the highest point of the hill (the mean) is.

σ: This is like how wide the hill is (the standard deviation).

π: It’s a special number, about 3.14159…

e: This is another special number, about 2.71828…

x: This is the value (for example, one friend’s height) whose spot on the hill you’re looking at.

So, a normal distribution is like a friendly hill that helps us understand how things in nature often spread out.

The formula is like a treasure map that lets us explore this hill and learn more about the heights of our superhero friends in a super cool way!
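As a rough sketch, the formula can be typed into Python almost symbol by symbol. The mean of 170 cm and standard deviation of 10 cm are assumed values for the superhero heights, used only for illustration, and `normal_pdf` is a name chosen for this example.

```python
import math


def normal_pdf(x, mu, sigma):
    """f(x) = (1 / (sigma * sqrt(2*pi))) * e^(-((x - mu)^2) / (2 * sigma^2))"""
    coefficient = 1 / (sigma * math.sqrt(2 * math.pi))
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return coefficient * math.exp(exponent)


# Hypothetical superhero heights: mean 170 cm, standard deviation 10 cm.
mu, sigma = 170, 10
for height in (150, 160, 170, 180, 190):
    print(height, round(normal_pdf(height, mu, sigma), 5))
```

The printed values are largest at 170 and match for 160 and 180 (and for 150 and 190), which is the symmetry and bell shape described above.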

Ques 4. What is the Empirical Rule in a Normal Distribution?

Imagine you have a basket of colorful balls, and you want to see how they’re spread out in terms of size. The Empirical Rule is like a special trick that helps you make good guesses about where most of the balls are and how they’re arranged.

Here’s how the Empirical Rule works:

68–95–99.7 Rule: This is like a secret code that tells you where the balls are most likely to be. Imagine the sizes of the balls form a hill (a normal distribution). About 68% of the balls will be within one standard deviation of the middle (the mean). Around 95% will be within two standard deviations of the middle. And about 99.7%, almost all of them, will be within three standard deviations of the middle.

Now, for the formula part:

The Empirical Rule doesn’t have a single formula like some other things do, but it’s more like a set of rules based on patterns. However, you can use a basic understanding of percentages and measurements to get a sense of where things are likely to be.

If you want to be a bit more specific, you can use the mean (μ) and the standard deviation (σ). Remember, the standard deviation measures how spread out the balls are. About 68% of the balls fall between μ - σ and μ + σ, about 95% fall between μ - 2σ and μ + 2σ, and about 99.7% fall between μ - 3σ and μ + 3σ.

So, while there’s no one-size-fits-all formula for the Empirical Rule, it’s like a treasure map that helps you make really good guesses about how things are arranged. It’s a bit like magic math that gives you hints about where the colourful balls in your basket are most likely to be!
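If you want to check the three percentages yourself, here is a short sketch using `NormalDist` from Python’s standard `statistics` module. The helper name `share_within` is made up for this example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


coverage = {k: share_within(k) for k in (1, 2, 3)}
for k, share in coverage.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
```

The three shares come out at roughly 68%, 95%, and 99.7%, which is where the rule gets its name.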

Ques 5. Can you tell me the difference between Standard Deviation and Variance?

Imagine you and your friends are collecting colourful marbles. You all have different amounts of marbles, and you want to know how spread out the numbers of marbles are in your group.

Variance is like a way to measure how much the numbers vary, or spread out, from the average. It’s like calculating how far each friend’s marble count is from the group’s average marble count and then squaring those differences, adding them up, and finding the average of those squared differences. The result tells you how much the numbers “vary” from each other. But this number might be a bit tricky to understand because it’s in squared units, not the same units as your original marbles.

Standard deviation, on the other hand, is a more friendly measurement. It is the square root of the variance. It’s in the same units as your original marbles, so it’s easier to understand. It shows you the typical distance between each friend’s marble count and the group’s average marble count. A higher standard deviation means the numbers are more spread out, and a lower standard deviation means the numbers are closer together.

So, to sum it up, variance tells you how much the numbers vary by looking at their squared differences from the average, while standard deviation is a more understandable measure that gives you the “average” difference between each number and the average. Just remember, when you’re dealing with marbles, data, or anything else, these tools help you understand how spread out or close together the numbers are in a group!
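Here is a small sketch of both calculations, following the steps described above. The marble counts are invented for illustration. It uses the "population" version (dividing by the number of friends); the sample version divides by one less than that, which is an alternative not discussed here.

```python
from statistics import mean, pstdev, pvariance

# Hypothetical marble counts for eight friends.
marbles = [2, 4, 4, 4, 5, 5, 7, 9]

average = mean(marbles)

# Variance: average of the squared differences from the mean.
squared_diffs = [(count - average) ** 2 for count in marbles]
variance_by_hand = sum(squared_diffs) / len(marbles)

# Standard deviation: square root of the variance (back in "marbles").
std_dev_by_hand = variance_by_hand ** 0.5

print("variance:", variance_by_hand, "| library:", pvariance(marbles))
print("standard deviation:", std_dev_by_hand, "| library:", pstdev(marbles))
```

For these counts the variance is 4 (in "squared marbles") and the standard deviation is 2 marbles, so the second number is the one you can read directly against the original data.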

Ques 6. What do you mean by the Central Limit Theorem?

Hey there! Imagine you’re at a pizza party with your friends, and you all love different kinds of pizza. Some like cheese, some like pepperoni, and some like veggies. Now, let’s say you want to know how many slices of pizza everyone usually eats.

The central limit theorem is like a magic rule about averages. It says that if you take lots of groups of friends and work out the average number of slices eaten in each group, those group averages will pile up in a bell-shaped hill, even if the slices eaten by individual friends don’t look like a hill at all. The bigger each group is, the closer the pattern gets to a perfect bell.

Here’s the simple formula behind each point on the graph:

Average Slices in a Group = Total Pizza Slices Eaten by the Group / Number of Friends in the Group

Now, let’s say you work out this average for many groups of friends and make a special graph of the averages. This graph will look like a hill that is tallest in the middle. Most group averages will be close to the top of the hill, and only a few will be far away.

So, even if your friends have different favorite pizzas and eat very differently, the averages of the groups (and the totals, when the groups are the same size) will follow that special hill on the graph. This is the central limit theorem at work, and it is why the normal distribution shows up so often when we look at lots of samples.

Remember, it’s like magic: no matter how unevenly individual friends eat, the averages of big enough groups follow that special hill shape!
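A quick simulation can make this easier to believe. The sketch below is built on assumptions: individual appetites are drawn from a lopsided (exponential) distribution with an average of 3 slices, each group has 50 friends, and there are 2000 groups. None of these numbers come from the story.

```python
import random
from statistics import mean, pstdev

random.seed(42)  # fixed seed so the run is repeatable

GROUP_SIZE = 50
NUM_GROUPS = 2000


def group_average(group_size):
    """Average slices in one group; individual appetites are skewed, not bell-shaped."""
    return mean(random.expovariate(1 / 3) for _ in range(group_size))


averages = [group_average(GROUP_SIZE) for _ in range(NUM_GROUPS)]

center = mean(averages)
spread = pstdev(averages)
share_within_one_sd = sum(abs(a - center) <= spread for a in averages) / NUM_GROUPS

print(f"mean of group averages: {center:.2f}")
print(f"spread of group averages: {spread:.2f}")
print(f"share within one standard deviation: {share_within_one_sd:.1%}")
```

Even though single appetites are very uneven, the group averages cluster around 3, and the share within one standard deviation lands near the 68% that a bell curve predicts.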

Ques 7. How will you handle missing data in statistics?

Are you dealing with missing data?

Imagine you’re playing a game with your friends, and you all have a score. But sometimes, one friend might not be able to finish the game, so their score is missing. How do we figure out who the winner is when we’re missing some scores?

Handling missing data is like solving this puzzle! There are a few ways to do it, but let’s talk about one way called “averaging.”

Let’s say you and your friends played the game, and these are the scores:

Your score: 10
Friend A’s score: 15
Friend B’s score: (missing)
Friend C’s score: 8

To find the average (a fancy word for the “typical” score), we add up all the scores we have and divide by how many scores we have. The formula looks like this:

Average Score = (Your Score + Friend A’s Score + Friend C’s Score) / Number of Scores

So, the average score would be: Average Score = (10 + 15 + 8) / 3 = 33 / 3 = 11

Now, we can use this average score to guess what Friend B’s score might be. Since we don’t know their score, we can pretend they got the average score. In this case, Friend B’s score would be 11. This is one way to handle missing data and keep playing the game!

Remember, it’s like making a good guess based on the scores you do have. Just like how you might guess what a missing puzzle piece looks like by looking at the ones around it!

Another Way:

Imagine you have a bunch of flowers, but some of them are a bit wilted. To make the bouquet look nicer, you might decide to remove the wilted flowers. Handling missing data through deletion is kind of like that — if you have missing information, you just remove the entire piece of data.

In our game score example:

Your score: 10
Friend A’s score: 15
Friend B’s score: (missing)
Friend C’s score: 8

With deletion, you might decide to just ignore Friend B’s score and only use the scores of those friends whose scores you know. So, you would only work with your score, Friend A’s score, and Friend C’s score. This way, you’re not guessing or filling in the gaps, but you’re giving up on using the missing data altogether.

Deletion: remove the records that contain missing (null) values.
Imputation: replace missing values with an estimated value, such as the average.

With imputation, you might look at the scores you do have and think about how your friends usually play. Maybe you know Friend B is really good at the game and often scores high. So, you could guess that their score might be higher than the others, like 20. This way, you’re filling in the missing data with a guess that’s based on what you know about your friends’ playing habits.

Modelling: use a statistical model (for example, a regression on the other information you have) to predict the missing values.
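The first two approaches can be sketched in a few lines of plain Python using the game scores from the example. Representing the missing score as `None` is a choice made for this sketch.

```python
from statistics import mean

# Friend B did not finish the game, so their score is missing (None).
scores = {"You": 10, "Friend A": 15, "Friend B": None, "Friend C": 8}

# Deletion: keep only the friends whose scores we actually know.
known_scores = {name: s for name, s in scores.items() if s is not None}

# Imputation: fill the gap with the average of the known scores.
average_score = mean(known_scores.values())
imputed_scores = {
    name: (average_score if s is None else s) for name, s in scores.items()
}

print("after deletion:", known_scores)
print("after mean imputation:", imputed_scores)
```

Deletion leaves three players, while mean imputation keeps all four and gives Friend B the average score of 11. Which one is better depends on how much data is missing and why.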

Ques 8. What do you mean by an Outlier?

Imagine you’re in a classroom with all your friends, and you’re all about the same age. You’re all between 7 and 9 years old. But one day, a new student comes in who is 15 years old! That new student is an outlier!

An outlier is something or someone that’s very different from the others in a group. It stands out because it’s not like the usual things or people you have around. In math and statistics, we use numbers to talk about this idea.

Let’s use some numbers to explain:
Think about the ages of the students in your classroom:

Student 1: 8 years old
Student 2: 7 years old
Student 3: 9 years old
Student 4: 8 years old
New Student: 15 years old

See how the new student’s age is much higher than the ages of the other students? That’s an outlier! It’s way different from the ages of your other classmates.

Now, here’s a simple idea to spot an outlier:

Outlier = A Number That’s Much Bigger or Smaller Than the Others

So, just like that much older new student stands out in your classroom, an outlier is something that stands out in a group of things or numbers because it’s really different from the rest!
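"Much bigger or smaller" needs a yardstick. One common convention, which is an assumption added here and not the only option, is the 1.5 × IQR rule: anything more than 1.5 times the interquartile range below the first quartile or above the third quartile is flagged. Here is a sketch using the classroom ages.

```python
from statistics import quantiles

ages = [8, 7, 9, 8, 15]

# Quartiles of the ages; "inclusive" treats the list as the whole population.
q1, _median, q3 = quantiles(ages, n=4, method="inclusive")
iqr = q3 - q1

# Fences: values outside these limits are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [age for age in ages if age < lower_fence or age > upper_fence]
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```

With these ages only the 15-year-old falls outside the fences. On such a tiny list the result depends on how the quartiles are computed, so treat this as an illustration of the idea.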

Ques 9. Define Bayes’ Theorem.

The Bayes theorem determines the probability of an event A occurring given that another event B has occurred. It is most useful when the two events are not independent, because if they were independent, knowing about B would tell us nothing new about A. The following Bayes theorem formula represents it:

P(A|B) = (P(B|A) * P(A)) / P(B)

P(A|B) is the probability of event A occurring given that event B has occurred.
P(B|A) is the probability of event B occurring given that event A has occurred.
P(A) is the probability of event A occurring.
P(B) is the probability of event B occurring.

The different terms associated with the Bayes theorem are as follows:

Conditional Probability: The probability of an event A occurring given that another event B has already occurred.
Posterior Probability: The updated probability of an event after new information has been taken into account.
Prior Probability: The probability of an event’s occurrence based on previous information, before the new evidence is seen.
Joint Probability: The probability of two or more events taking place together.
Random Variables: Variables whose values are the numerical outcomes of random experiments. They can be discrete or continuous.

Calculation Example

Let us work through a Bayes theorem calculation. Assume that there are two investment options, A and B. The probability of generating positive returns from A is 74%, and the probability of generating positive returns from B is 45%. Also, the probability of investment B providing a positive return when investment A also provides a positive return is 13%.

Based on the given data, determine the probability of investment A providing a positive return when investment B also provides a positive return.

Solution: Given:
P (A) = 0.74
P (B) = 0.45
P (B│A) = 0.13

P (A│B) = [P (B│A) × P (A)] / P (B) = [(0.13 × 0.74) / 0.45] ≈ 0.21

Thus, the probability of generating positive returns from investment A when investment B also generates positive returns is 0.21.
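The same arithmetic can be checked with a tiny Python function. The function name `bayes` is just a label for this sketch; the three inputs are the values given in the example.

```python
def bayes(p_b_given_a, p_a, p_b):
    """P(A|B) = (P(B|A) * P(A)) / P(B)"""
    return p_b_given_a * p_a / p_b


p_a = 0.74          # P(A): investment A gives a positive return
p_b = 0.45          # P(B): investment B gives a positive return
p_b_given_a = 0.13  # P(B|A): B is positive when A is positive

p_a_given_b = bayes(p_b_given_a, p_a, p_b)
print(f"P(A|B) = {p_a_given_b:.2f}")
```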

Ques 10. What are the applications of Bayes Theorem?

The Bayes law is the base of Bayesian statistics. It is applied to various fields to determine the probability of an event using past experiences and evidence. Such analyses can help predict unfavorable outcomes. Also, once an unfavorable outcome is predicted, an organization can prepare itself with corrective measures.

Thus, it is used in statistics, medicine, machine learning, engineering, philosophy, sports, finance, humanities, and law. Now, let us go through some real-life Bayes theorem examples to understand the application of the Bayes rule:

In finance, Bayes law determines the risks and returns of an investment.
It is also used for determining credit ratings. Lenders analyze the uncertainty associated with debt recovery. Every potential lender is screened before sanctioning funds.
In medical science, this tool is used for determining how reliable a medical test result is, for example the chance that a patient actually has a disease given a positive test.
In machine learning and artificial intelligence, Bayesian statistics help detect spam and credit card fraud.
Many sports betting (prediction) algorithms are based on the results of the Bayes theorem.

Ques 11. What do you mean by Null Hypothesis and Alternative Hypothesis?

Null Hypothesis

We can define a null hypothesis as a general statement or a default position that says there is no relationship between two measured phenomena or there is no association among groups.

Why is Null Hypothesis Important?

A hypothesis helps the researcher translate a given problem into a clear, testable statement about the expected outcome of the study.
Testing the null hypothesis, and deciding whether or not there are grounds for believing that a relationship exists between two phenomena, is a central task in the modern practice of science and in statistics in particular.

To be more specific, hypothesis testing gives precise criteria for rejecting or failing to reject a null hypothesis at a chosen significance level (equivalently, a chosen confidence level).

Null Hypothesis Symbol

A Null Hypothesis is denoted by the symbol H0 in statistics. It is usually pronounced as “H-nought” or “H-null”. The subscript on H is the digit 0.

Null Hypothesis Principle and When is A Null Hypothesis Rejected?

The principle followed in null hypothesis testing is to collect data from a random sample and then work out how likely data like this would be, assuming that the null hypothesis is true.

If the observed data would be very unlikely under the null hypothesis, the data provide strong evidence against it, and the null hypothesis is rejected. If the data are reasonably consistent with the null hypothesis, the evidence against it is insufficient and we fail to reject it.

In practice, “very unlikely” means that the p-value falls below the chosen significance level.

What is Alternative Hypothesis?

The alternative hypothesis, often denoted as “Ha” or “H1,” is a statement in hypothesis testing that suggests there is a significant difference, effect, or relationship between groups or variables being studied. It contrasts with the null hypothesis (H0), which assumes that any observed differences or effects are due to random chance. The alternative hypothesis is what researchers aim to support with their data when conducting hypothesis tests.

Type 1 Error and Type 2 Error

Type I error

A Type I error means rejecting the null hypothesis when it’s actually true. It means concluding that results are statistically significant when, in reality, they came about purely by chance or because of unrelated factors.

The risk of committing this error is the significance level (alpha or α) you choose. That’s a value that you set at the beginning of your study to assess the statistical probability of obtaining your results (p value).

The significance level is usually set at 0.05 or 5%. This means that your results only have a 5% chance of occurring, or less, if the null hypothesis is actually true.

If the p value of your test is lower than the significance level, it means your results are statistically significant and consistent with the alternative hypothesis. If your p value is higher than the significance level, then your results are considered statistically non-significant.

α = probability of a Type I error = P(Type I error) = probability of rejecting the null hypothesis when the null hypothesis is true: rejecting a good null.

Type II error

A Type II error means not rejecting the null hypothesis when it’s actually false. This is not quite the same as “accepting” the null hypothesis, because hypothesis testing can only tell you whether to reject the null hypothesis.

Instead, a Type II error means failing to conclude there was an effect when there actually was. In reality, your study may not have had enough statistical power to detect an effect of a certain size.

Power is the extent to which a test can correctly detect a real effect when there is one. A power level of 80% or higher is usually considered acceptable.

The risk of a Type II error is inversely related to the statistical power of a study. The higher the statistical power, the lower the probability of making a Type II error.

β = probability of a Type II error = P(Type II error) = probability of not rejecting the null hypothesis when the null hypothesis is false. (1 − β) is called the Power of the Test.

EXAMPLE

Suppose the null hypothesis, H0, is: Frank’s rock climbing equipment is safe.

Type I error: Frank thinks that his rock climbing equipment may not be safe when, in fact, it really is safe.

Type II error: Frank thinks that his rock climbing equipment may be safe when, in fact, it is not safe.

α = probability that Frank thinks his rock climbing equipment may not be safe when, in fact, it really is safe.

β = probability that Frank thinks his rock climbing equipment may be safe when, in fact, it is not safe.

Notice that, in this case, the error with the greater consequence is the Type II error. (If Frank thinks his rock climbing equipment is safe, he will go ahead and use it.) This is a situation described as “accepting a false null”.
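To see α, β, and power as actual frequencies, here is a simulation sketch. Everything in it is an assumption made for illustration: a two-sided z test of "the mean is 0" with known standard deviation 1, samples of 25 values, 4000 repeated experiments, and a true effect of 0.5 when the null hypothesis is false.

```python
import random
from statistics import NormalDist

random.seed(0)  # fixed seed so the run is repeatable

ALPHA = 0.05
SAMPLE_SIZE = 25
TRIALS = 4000
critical_z = NormalDist().inv_cdf(1 - ALPHA / 2)  # about 1.96


def rejects_null(true_mean):
    """Run one two-sided z test of H0: mean = 0 on data with known sigma = 1."""
    sample = [random.gauss(true_mean, 1) for _ in range(SAMPLE_SIZE)]
    z = (sum(sample) / SAMPLE_SIZE) * SAMPLE_SIZE ** 0.5
    return abs(z) > critical_z


# H0 is true (mean really is 0): every rejection is a Type I error.
type_1_rate = sum(rejects_null(0.0) for _ in range(TRIALS)) / TRIALS

# H0 is false (mean is really 0.5): every rejection is a correct detection.
power = sum(rejects_null(0.5) for _ in range(TRIALS)) / TRIALS
type_2_rate = 1 - power

print(f"Type I error rate: {type_1_rate:.3f}")
print(f"Power: {power:.3f}")
print(f"Type II error rate: {type_2_rate:.3f}")
```

The Type I error rate comes out close to the chosen α of 0.05, while the Type II error rate depends on the sample size and on how big the real effect is.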

Ques 12. What do you mean by T-test and P-value?

T test

A t test can only be used when comparing the means of two groups (a.k.a. pairwise comparison). If you want to compare more than two groups, or if you want to do multiple pairwise comparisons, use an ANOVA test or a post-hoc test.

The t test is a parametric test of difference, meaning that it makes the same assumptions about your data as other parametric tests. The t test assumes your data:

1. are independent
2. are (approximately) normally distributed
3. have a similar amount of variance within each group being compared (a.k.a. homogeneity of variance)

If your data do not fit these assumptions, you can try a nonparametric alternative to the t test, such as the Mann-Whitney U test for independent groups or the Wilcoxon Signed-Rank test for paired data. If only the equal-variance assumption fails, Welch’s t test is a common choice.

Performing a t test

The t test estimates the true difference between two group means using the ratio of the difference in group means over the pooled standard error of both groups. You can calculate it manually using a formula, or use statistical analysis software.

A t test is a statistical test that is used to compare the means of two groups. It is often used in hypothesis testing to determine whether a process or treatment actually has an effect on the population of interest, or whether two groups are different from one another.

T test formula

The formula for the two-sample t test (a.k.a. the Student’s t-test) is:

t = (x1 - x2) / √(s² * (1/n1 + 1/n2))

In this formula, t is the t value, x1 and x2 are the means of the two groups being compared, s² is the pooled variance of the two groups (the whole square root in the denominator is the pooled standard error), and n1 and n2 are the number of observations in each of the groups.

A larger t value shows that the difference between group means is greater than the pooled standard error, indicating a more significant difference between the groups.

You can compare your calculated t value against the values in a critical value chart (e.g., Student’s t table) to determine whether your t value is greater than what would be expected by chance. If so, you can reject the null hypothesis and conclude that the two groups are in fact different.
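Here is a minimal sketch of the two-sample t value computed by hand in Python. The two groups are invented data, and the pooled variance is computed with the usual definition, s² = ((n1 - 1) * s1² + (n2 - 1) * s2²) / (n1 + n2 - 2), which is an assumption added here because the text does not spell it out.

```python
import math
from statistics import mean, variance


def pooled_t(group1, group2):
    """Two-sample t value using a pooled variance (assumes similar variances)."""
    n1, n2 = len(group1), len(group2)
    # Pooled variance: weighted average of the two sample variances.
    s2 = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    standard_error = math.sqrt(s2 * (1 / n1 + 1 / n2))
    return (mean(group1) - mean(group2)) / standard_error


# Hypothetical measurements for two groups.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 3, 4, 5, 6]

t_value = pooled_t(group_a, group_b)
degrees_of_freedom = len(group_a) + len(group_b) - 2
print(f"t = {t_value:.3f} with {degrees_of_freedom} degrees of freedom")
```

For these made-up groups the t value has a magnitude of 1, which is well below the two-sided 5% critical value of about 2.306 for 8 degrees of freedom, so the null hypothesis would not be rejected.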

What Is P-Value?

In statistics, the p-value is the probability of obtaining results at least as extreme as the observed results of a statistical hypothesis test, assuming that the null hypothesis is correct.

The p-value serves as an alternative to rejection points to provide the smallest level of significance at which the null hypothesis would be rejected. A smaller p-value means that there is stronger evidence in favor of the alternative hypothesis.

The P-Value Approach to Hypothesis Testing

The p-value approach to hypothesis testing uses the calculated probability to determine whether there is evidence to reject the null hypothesis. The null hypothesis, also known as the conjecture, is the initial claim about a population (or data-generating process). The alternative hypothesis states whether the population parameter differs from the value of the population parameter stated in the conjecture.

In practice, the significance level is stated in advance to determine how small the p-value must be to reject the null hypothesis. Because different researchers use different levels of significance when examining a question, a reader may sometimes have difficulty comparing results from two different tests. P-values provide a solution to this problem.

For example, suppose a study comparing returns from two particular assets was undertaken by different researchers who used the same data but different significance levels. The researchers might come to opposite conclusions regarding whether the assets differ.

If one researcher used a confidence level of 90% and the other required a confidence level of 95% to reject the null hypothesis, and if the p-value of the observed difference between the two returns was 0.08 (corresponding to a confidence level of 92%), then the first researcher would find that the two assets have a difference that is statistically significant, while the second would find no statistically significant difference between the returns.

To avoid this problem, the researchers could report the p-value of the hypothesis test and allow readers to interpret the statistical significance themselves. This is called a p-value approach to hypothesis testing. Independent observers could note the p-value and decide for themselves whether that represents a statistically significant difference or not.
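The two-researcher story can be sketched in Python. The z statistic of 1.75 is a hypothetical value chosen because it gives a two-sided p-value of about 0.08 under a standard normal distribution; the original example does not say which test was used.

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """Chance of a result at least as extreme as z if the null hypothesis is true."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


z_statistic = 1.75  # hypothetical test statistic
p_value = two_sided_p_value(z_statistic)

# Researcher 1 uses a 90% confidence level (alpha = 0.10),
# researcher 2 uses a 95% confidence level (alpha = 0.05).
decisions = {alpha: p_value < alpha for alpha in (0.10, 0.05)}

print(f"p-value = {p_value:.3f}")
for alpha, significant in decisions.items():
    print(f"alpha = {alpha}: {'significant' if significant else 'not significant'}")
```

The same p-value is significant at the 0.10 level but not at the 0.05 level, which is why reporting the p-value itself lets each reader apply their own threshold.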

P-value Formula

We know that the p-value is a statistical measure that helps decide whether to reject the null hypothesis. The p-value is a number that lies between 0 and 1. The level of significance (α) is a predefined threshold that should be set by the researcher. It is generally fixed at 0.05. The formula for the calculation for P-value is