Ace Probability and Statistics in Interviews

Manish Todi · Published in Analytics Vidhya · Oct 4, 2019 · 11 min read

Hello folks,

Are you struggling to answer probability and statistics questions in data science interviews?

This is the right place!!

Here I have compiled the main questions from the domain of probability and statistics!!

  1. What is a random variable?

Ans: A random variable is a variable whose value is unknown or a function that assigns values to each of an experiment’s outcomes. Random variables are often designated by letters and can be classified as discrete, which are variables that have specific values, or continuous, which are variables that can have any values within a continuous range.

Random variables are often used in econometric or regression analysis to determine the statistical relationships between variables.

KEY TAKEAWAYS

  • A random variable is a variable whose value is unknown or a function that assigns values to each of an experiment’s outcomes.
  • Random variables appear in all sorts of econometric and financial analyses.
  • A random variable can be either discrete or continuous.
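As a quick illustration (a minimal sketch with numpy; the distributions and parameters are arbitrary assumptions, not part of any standard definition), a die roll is a discrete random variable while a measured height is a continuous one:

import numpy as np

rng = np.random.default_rng(42)

# Discrete random variable: the outcome of a fair six-sided die
die_rolls = rng.integers(1, 7, size=10)          # values in {1, ..., 6}

# Continuous random variable: heights drawn from a normal distribution
# (mean 170 cm and standard deviation 10 cm are illustrative assumptions)
heights = rng.normal(loc=170, scale=10, size=10)

print(die_rolls)
print(heights)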

2. What are the conditions for a function to be a probability mass function?

Ans: The probability mass function, f(x) = P(X = x), of a discrete random variable X has the following properties:

  1. All probabilities are non-negative: f(x) ≥ 0 for every value x that X can take.
  2. Any event in the distribution (e.g. “scoring between 20 and 30”) has a probability of happening between 0 and 1 (i.e. between 0% and 100%).
  3. The sum of all probabilities is 100% (i.e. 1 as a decimal): Σ f(x) = 1.
  4. The probability of an event A is found by adding up the probabilities of the x-values in A: P(X ∈ A) = Σ f(x), summed over all x in A. (See the sketch below.)
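A minimal sketch in plain Python (the probabilities are made up for illustration) that checks these conditions for a small discrete distribution:

# A made-up PMF for a discrete random variable X taking values 0, 1, 2, 3
pmf = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}

# Condition 1: every probability is non-negative
assert all(p >= 0 for p in pmf.values())

# Condition 3: the probabilities sum to 1
assert abs(sum(pmf.values()) - 1.0) < 1e-9

# Condition 4: the probability of an event A (here A = {X >= 2})
# is the sum of the probabilities of the outcomes in A
p_A = sum(prob for x, prob in pmf.items() if x >= 2)
print(p_A)  # 0.7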

3. What are the conditions for a function to be a probability density function?

Ans: The probability density function is the probability function which is defined for the continuous random variable. The probability density function is also called the probability distribution function or probability function. It is denoted by f (x).

Conditions for a valid probability density function:

Let X be a continuous random variable with density function f(x). Then f(x) must satisfy:

  1. f(x) ≥ 0 for all x, and
  2. ∫ f(x) dx = 1, where the integral is taken over the entire range of X (from −∞ to ∞).

The probability that X falls in an interval [a, b] is then P(a ≤ X ≤ b) = the integral of f(x) from a to b.
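A small sketch using scipy.integrate.quad (the exponential density is just an example choice, not part of the definition) that verifies these conditions numerically:

import numpy as np
from scipy.integrate import quad

# Example density: exponential with rate 1, f(x) = exp(-x) for x >= 0
f = lambda x: np.exp(-x)

# The density is non-negative and integrates to 1 over its support
total, _ = quad(f, 0, np.inf)
print(total)                 # ~1.0

# P(1 <= X <= 2) is the integral of f over [1, 2]
p, _ = quad(f, 1, 2)
print(p)                     # ~0.2325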

4. What is conditional probability?

Ans: Conditional probability is the probability of one event occurring with some relationship to one or more other events. For example:

  • Event A is that it is raining outside, and it has a 0.3 (30%) chance of raining today.
  • Event B is that you will need to go outside, and that has a probability of 0.5 (50%).

A conditional probability would look at these two events in relationship with one another, such as the probability that it is both raining and you will need to go outside.

The formula for conditional probability is:
P(B|A) = P(A and B) / P(A)
which you can also rewrite as:
P(B|A) = P(A∩B) / P(A)
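A tiny numeric sketch of the formula (the joint probability 0.12 is an assumed value, not given in the rain example above):

# Assumed numbers: P(A) = 0.3 (it rains) and P(A and B) = 0.12
# (it rains AND you need to go outside); the joint value is illustrative only
p_A = 0.3
p_A_and_B = 0.12

# Conditional probability of needing to go outside given that it is raining
p_B_given_A = p_A_and_B / p_A
print(p_B_given_A)  # 0.4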

5. What are the conditions for independence and conditional independence of two random variables?

Ans: Two random variables X and Y are independent if their joint probability factorizes: P(X ∈ A, Y ∈ B) = P(X ∈ A) · P(Y ∈ B) for all events A and B. They are conditionally independent given a third variable Z if the same factorization holds after conditioning on Z: P(X ∈ A, Y ∈ B | Z) = P(X ∈ A | Z) · P(Y ∈ B | Z). Neither condition implies the other.

Independence does not imply conditional independence: for instance, independent random variables are rarely independent conditionally on their sum or on their maximum (see the simulation sketch below).

Conditional independence does not imply independence: for instance, random variables that are conditionally independent and uniform on (0, U), where U is itself uniform on (0, 1), are not independent.
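A small simulation sketch (numpy, with an arbitrary sample size) of the first point: two independent coin flips stop being independent once you condition on their sum:

import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# X and Y are independent fair coin flips
X = rng.integers(0, 2, size=n)
Y = rng.integers(0, 2, size=n)

# Unconditionally, P(X = 1) is ~0.5 whether or not we look at Y
print(X.mean(), X[Y == 1].mean())        # both ~0.5

# Condition on the sum S = X + Y being 1: knowing Y now pins down X exactly,
# so X and Y are no longer independent given S
S = X + Y
print(X[(S == 1) & (Y == 1)].mean())     # ~0.0
print(X[(S == 1) & (Y == 0)].mean())     # ~1.0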

6. What is a Bernoulli distribution?

Ans: A Bernoulli distribution is a discrete probability distribution for a Bernoulli trial, a random experiment that has only two outcomes (usually called a “Success” or a “Failure”). For example, the probability of getting heads (a “success”) while flipping a coin is 0.5. The probability of “failure” is 1 - p (one minus the probability of success, which also equals 0.5 for a coin toss). It is a special case of the binomial distribution for n = 1. In other words, it is a binomial distribution with a single trial (e.g. a single coin toss).

The probability of a failure is labeled on the x-axis as 0 and success is labeled as 1. For example, if the probability of success (1) is 0.4, then the probability of failure (0) is 0.6.

The probability mass function (pmf) for this distribution is f(x) = p^x (1 - p)^(1 - x) for x in {0, 1}.

The expected value for a random variable X from a Bernoulli distribution is:
E[X] = p.
For example, if p = 0.4, then E[X] = 0.4.

The variance of a Bernoulli random variable is:
Var[X] = p(1 - p).
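A minimal sketch with scipy.stats.bernoulli confirming the pmf, mean, and variance for p = 0.4:

from scipy.stats import bernoulli

p = 0.4  # probability of success, as in the example above

print(bernoulli.pmf(1, p))   # P(X = 1) = 0.4
print(bernoulli.pmf(0, p))   # P(X = 0) = 0.6

# Mean and variance match E[X] = p and Var[X] = p(1 - p)
print(bernoulli.mean(p))     # 0.4
print(bernoulli.var(p))      # 0.24

# A few simulated Bernoulli trials
print(bernoulli.rvs(p, size=10, random_state=0))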

7. What is a normal distribution?

Ans: A normal distribution, sometimes called the bell curve, is a distribution that occurs naturally in many situations. For example, the bell curve is seen in tests like the SAT and GRE. The bulk of students will score the average (C), while smaller numbers of students will score a B or D. An even smaller percentage of students score an F or an A. This creates a distribution that resembles a bell (hence the nickname). The bell curve is symmetrical. Half of the data will fall to the left of the mean; half will fall to the right.
Many groups follow this type of pattern. That’s why it’s widely used in business, statistics and in government bodies like the FDA:

  • Heights of people.
  • Measurement errors.
  • Blood pressure.
  • Points on a test.
  • IQ scores.
  • Salaries.

The empirical rule tells you what percentage of your data falls within a certain number of standard deviations from the mean:
  • 68% of the data falls within one standard deviation of the mean.
  • 95% of the data falls within two standard deviations of the mean.
  • 99.7% of the data falls within three standard deviations of the mean.

The standard deviation controls the spread of the distribution. A smaller standard deviation indicates that the data is tightly clustered around the mean; the normal distribution will be taller. A larger standard deviation indicates that the data is spread out around the mean; the normal distribution will be flatter and wider.
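As a quick check of the empirical rule (a minimal sketch using scipy.stats.norm), the standard normal CDF reproduces the 68-95-99.7 percentages:

from scipy.stats import norm

# Probability mass of a normal distribution within k standard deviations of the mean
for k in (1, 2, 3):
    print(k, round(norm.cdf(k) - norm.cdf(-k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973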

8. What is the central limit theorem?

Ans: In the study of probability theory, the central limit theorem (CLT) states that the distribution of sample means approximates a normal distribution (also known as a “bell curve”), as the sample size becomes larger, assuming that all samples are identical in size, and regardless of the population distribution shape.

Said another way, CLT is a statistical theory stating that given a sufficiently large sample size from a population with a finite level of variance, the mean of all samples from the same population will be approximately equal to the mean of the population. Furthermore, all the samples will follow an approximate normal distribution pattern, with all variances being approximately equal to the variance of the population, divided by each sample’s size.

Although this concept was first developed by Abraham de Moivre in 1733, it wasn’t formally named until 1930, when noted Hungarian mathematician George Polya officially dubbed it the Central Limit Theorem.

Understanding the Central Limit Theorem (CLT)

According to the central limit theorem, the mean of a sample of data will be closer to the mean of the overall population in question as the sample size increases, notwithstanding the actual distribution of the data. In other words, the result holds whether the underlying population distribution is normal or skewed.

As a general rule, sample sizes equal to or greater than 30 are deemed sufficient for the CLT to hold, meaning that the distribution of the sample means is fairly normally distributed. Therefore, the more samples one takes, the more the graphed results take the shape of a normal distribution.

The central limit theorem exhibits a phenomenon where the average of the sample means and sample standard deviations approximates the population mean and standard deviation, which is extremely useful in accurately predicting the characteristics of populations.

KEY TAKEAWAYS

  • The central limit theorem (CLT) states that the distribution of sample means approximates a normal distribution as the sample size gets larger.
  • Sample sizes equal to or greater than 30 are considered sufficient for the CLT to hold.
  • A key aspect of the CLT is that the average of the sample means and standard deviations will approximate the population mean and standard deviation.
  • A sufficiently large sample size can predict the characteristics of a population accurately.
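A small simulation sketch (numpy, with an exponential population and arbitrary sample sizes as assumptions) illustrates the theorem: even though the population is skewed, the sample means cluster around the population mean with spread close to σ/√n and a roughly bell-shaped histogram:

import numpy as np

rng = np.random.default_rng(1)

# Population: exponential with mean 1 and standard deviation 1 (clearly not normal)
sample_size, n_samples = 50, 10_000
samples = rng.exponential(scale=1.0, size=(n_samples, sample_size))
sample_means = samples.mean(axis=1)

print(sample_means.mean())                              # ~1.0, the population mean
print(sample_means.std(), 1.0 / np.sqrt(sample_size))   # ~0.14 vs sigma / sqrt(n)
# A histogram of sample_means would look approximately bell-shaped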

9. What is the difference between covariance and correlation?

Ans: The following points are noteworthy so far as the difference between covariance and correlation is concerned:

  1. Covariance is a measure of the extent to which two random variables change in tandem. Correlation is a measure of how strongly two random variables are related.
  2. Covariance is an unscaled measure of co-movement; correlation is the scaled (normalized) form of covariance.
  3. The value of correlation lies between -1 and +1. Conversely, the value of covariance lies between -∞ and +∞.
  4. Covariance is affected by a change in scale: if all the values of one variable are multiplied by a constant, and all the values of the other variable are multiplied by the same or a different constant, then the covariance changes. Correlation, in contrast, is not influenced by a change in scale.
  5. Correlation is dimensionless, i.e. it is a unit-free measure of the relationship between variables, whereas covariance is expressed in the product of the units of the two variables.

Conclusion

Both measures capture only the linear relationship between two variables; when the correlation coefficient is zero, the covariance is also zero. Further, both measures are unaffected by a change in location.

Correlation can be viewed as the covariance of standardized data: corr(X, Y) = cov(X, Y) / (σ_X · σ_Y). When it comes to choosing which is a better measure of the relationship between two variables, correlation is preferred over covariance, because it remains unaffected by changes in location and scale and can also be used to compare two pairs of variables.
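A minimal sketch (numpy, with synthetic data as an assumption) showing that rescaling a variable changes the covariance but leaves the correlation untouched:

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=1_000)
y = 2 * x + rng.normal(size=1_000)       # y depends linearly on x, plus noise

print(np.cov(x, y)[0, 1])                # covariance, in the units of x times y
print(np.corrcoef(x, y)[0, 1])           # correlation, unit-free, close to 0.9

# Rescale x (e.g. change its units): covariance changes, correlation does not
x_scaled = 100 * x
print(np.cov(x_scaled, y)[0, 1])         # roughly 100 times larger
print(np.corrcoef(x_scaled, y)[0, 1])    # unchanged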

10. What is the Box-Cox transformation used for?

Ans: A Box Cox transformation is a way to transform non-normal dependent variables into a normal shape. Normality is an important assumption for many statistical techniques; if your data isn’t normal, applying a Box-Cox transformation means you are able to run a broader range of tests.

The Box Cox transformation is named after statisticians George Box and Sir David Roxbee Cox who collaborated on a 1964 paper and developed the technique.

Running the Test

At the core of the Box Cox transformation is an exponent, lambda (λ), which typically varies from -5 to 5. All values of λ are considered and the optimal value for your data is selected; the “optimal value” is the one which results in the best approximation of a normal distribution curve. The transformation of y has the form:

y(λ) = (y^λ - 1) / λ, if λ ≠ 0
y(λ) = log(y), if λ = 0

This transformation only works for positive data. However, Box and Cox did propose a second formula, with a shift parameter, that can be used when y takes negative values:

y(λ) = ((y + λ2)^λ1 - 1) / λ1, if λ1 ≠ 0
y(λ) = log(y + λ2), if λ1 = 0

where λ2 is chosen so that y + λ2 is always positive.

The formulae are deceptively simple. Testing all possible values by hand is unnecessarily labor intensive; most software packages will include an option for a Box Cox transformation, including:

  • R: use the command boxcox(object, …) from the MASS package.
  • Minitab: click the Options box (for example, while fitting a regression model) and then click Box-Cox Transformations/Optimal λ.
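In Python, a minimal sketch using scipy.stats.boxcox (the synthetic lognormal data is an assumption made for illustration) looks like this:

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Skewed, strictly positive data (Box-Cox requires y > 0)
y = rng.lognormal(mean=0.0, sigma=1.0, size=500)

# scipy searches over lambda and returns the transformed data together with
# the lambda that maximises the log-likelihood
y_transformed, best_lambda = stats.boxcox(y)
print(best_lambda)                                   # near 0 for lognormal data
print(stats.skew(y), stats.skew(y_transformed))      # skewness drops sharply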

11. What do you understand by Hypothesis in the context of Machine Learning?

Ans: Hypothesis testing is the process of drawing inferences or conclusions about an overall population by running statistical tests on a sample. The same kind of inference can be drawn for machine learning models, for example through a t-test, which is discussed below.

For drawing some inferences, we have to make some assumptions that lead to two terms that are used in the hypothesis testing.

  • Null hypothesis: the assumption that there is no effect or unusual pattern, i.e. that the observations follow the assumption made.
  • Alternate hypothesis: contrary to the null hypothesis, it states that the observation is the result of a real effect.

P value

The p-value quantifies the evidence against the null hypothesis; in machine learning models it is often read as the significance of a predictor with respect to the target.

Generally, we select a significance level of 5%, but this is a topic of discussion in some cases. If you have strong prior knowledge about your data, you can choose a different significance level.

Conversely, if the p-value for an independent variable is less than 0.05, the variable is retained: its behavior varies systematically with the target, which is useful information that machine learning algorithms can learn from.

The steps involved in the hypothesis testing are as follow:

  • Assume a null hypothesis; in machine learning settings we usually assume there is no relationship between the target and the independent variable.
  • Collect a sample.
  • Calculate the test statistic.
  • Decide whether to accept or reject the null hypothesis.

Calculating test or T statistics

For calculating the t statistic, we create a scenario.

Suppose there is a shipping container manufacturing company which claims that each container weighs exactly 1000 kg, not less, not more. Such a claim looks shady, so we proceed with gathering data and creating a sample.

After gathering a sample of 30 containers, we find that the average weight of a container is 990 kg, with a standard deviation of 12.5 kg.

So, calculating the test statistic:

t = (sample mean - claimed mean) / (standard deviation / √sample size)

Putting in the numbers, t = (990 - 1000) / (12.5 / √30) ≈ -4.3818.

Now we find the critical t value for a 0.05 significance level and the appropriate degrees of freedom.

Note: Degrees of Freedom = Sample Size - 1, so here df = 30 - 1 = 29.

From the t table, the critical value at the 0.05 level (one-tailed, df = 29) is -1.699.

Comparing the two, the calculated statistic (-4.3818) is less than the critical value (-1.699), so we reject the null hypothesis, i.e. the claim made by the company.

You can calculate the critical t value using the scipy.stats.t.ppf() function, as in the sketch below.
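A minimal sketch of the calculation above in scipy (the numbers are the ones from the container example):

import numpy as np
from scipy import stats

sample_mean, claimed_mean = 990.0, 1000.0
sample_std, n = 12.5, 30

# Test statistic: t = (sample mean - claimed mean) / (s / sqrt(n))
t_stat = (sample_mean - claimed_mean) / (sample_std / np.sqrt(n))
print(t_stat)                      # about -4.3818

# Critical value for a one-tailed test at the 0.05 level, df = n - 1 = 29
t_critical = stats.t.ppf(0.05, df=n - 1)
print(t_critical)                  # about -1.699

# The statistic falls below the critical value, so we reject the null hypothesis
print(t_stat < t_critical)         # True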

Errors

Hypothesis testing is performed on a sample of data rather than the entire population, usually because the full population is unavailable. Because inferences are drawn from sample data, hypothesis testing can lead to errors, which can be classified into two types:

  • Type I Error: In this error, we reject the null hypothesis when it is true.
  • Type II Error: In this error, we accept the null hypothesis when it is false.

Other Approaches

Many different approaches to hypothesis testing exist. For example, to test the significance of individual features we can train two models on the available data: one model with all the features and one with a single feature removed. However, interdependence between features can affect such simple methods.

In regression problems, we generally follow the p-value rule: features that violate the significance level are removed, iteratively improving the model. A sketch of this is shown below.
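Here is a sketch of reading per-feature p-values from a fitted regression using statsmodels (assumed available; the features x1 and x2 and the data are made up for illustration):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 200

# Synthetic data: x1 truly drives y, x2 is pure noise
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 3.0 * x1 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
model = sm.OLS(y, X).fit()

# Per-coefficient p-values: x1 should be far below 0.05, x2 should not be
print(model.pvalues)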

Different approaches are present for each algorithm to test the hypothesis on different features.
