12 Most Common Probability and Statistics Questions for Data Science Interviews

Nishesh Gogia
Published in Analytics Vidhya
Sep 14, 2020

Here are the most common data science interview questions on Probability and Statistics.

QUES 1- WHAT DO THE MEAN AND STANDARD DEVIATION TELL YOU ABOUT A DISTRIBUTION?

Ans- The mean tells us where the central value of the distribution lies. The standard deviation tells us about the spread of the distribution, that is, how far the values typically fall from the mean.

QUES 2- WHAT ARE KURTOSIS AND SKEWNESS?

Ans- Skewness is a measure of asymmetry: how far the distribution departs from being symmetric about its mean. A symmetric distribution such as the Normal Distribution has zero skewness.

RIGHT SKEW- It simply means the long tail (and any outliers) lies to the right. Here, typically, mean>median>mode

LEFT SKEW- It simply means the long tail (and any outliers) lies to the left. Here, typically, mode>median>mean

Kurtosis- It measures how heavy the tails of a distribution are compared with the tails of a normal distribution.
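A quick way to get a feel for these two measures is to compute them on simulated data. The sketch below is only an illustration: the sample sizes, the seed and the choice of an exponential distribution as the right-skewed example are assumptions, not something from the questions above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

symmetric = rng.normal(loc=0.0, scale=1.0, size=100_000)   # roughly symmetric
right_skewed = rng.exponential(scale=1.0, size=100_000)    # long tail to the right

# Skewness: close to 0 for symmetric data, positive for a right tail.
skew_symmetric = stats.skew(symmetric)
skew_right = stats.skew(right_skewed)

# scipy reports *excess* kurtosis by default, so a normal sample gives about 0.
kurt_symmetric = stats.kurtosis(symmetric)
kurt_right = stats.kurtosis(right_skewed)

print(f"normal:      skew={skew_symmetric:.2f}, excess kurtosis={kurt_symmetric:.2f}")
print(f"exponential: skew={skew_right:.2f}, excess kurtosis={kurt_right:.2f}")

# For the right-skewed sample the mean sits above the median.
mean_above_median = np.mean(right_skewed) > np.median(right_skewed)
print("mean > median for the right-skewed sample:", mean_above_median)
```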

QUES 3- WHAT IS THE STANDARD NORMAL VARIATE (z) AND HOW DO WE DO STANDARDIZATION?

Ans- Let's see what z is: z is a Gaussian-distributed random variable with mean 0 and standard deviation 1.

Now let's say we have a Gaussian-distributed random variable X with mean μ and standard deviation σ, and X takes the observed values (x1, x2, x3, x4, x5).

If we subtract the mean μ from every observation of X and divide by σ, we get z: zi = (xi - μ)/σ.

BUT WHY DO WE DO STANDARDIZATION?

The main reason is that the moment we standardize, we know that about 68% of the data lies between -1 and 1, about 95% lies between -2 and 2, and about 99.7% lies between -3 and 3.

Because we know these properties of z, it is always convenient to convert to a standard normal distribution.
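The standardization step and the percentages can be checked numerically. This is a minimal sketch on simulated data; the mean of 150, the standard deviation of 10 and the sample size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=150.0, scale=10.0, size=100_000)  # illustrative Gaussian data

# Standardize: subtract the mean and divide by the standard deviation.
z = (x - x.mean()) / x.std()

# Fraction of standardized values within 1, 2 and 3 units of zero.
within_1 = np.mean(np.abs(z) <= 1)
within_2 = np.mean(np.abs(z) <= 2)
within_3 = np.mean(np.abs(z) <= 3)

print(f"mean of z = {z.mean():.3f}, std of z = {z.std():.3f}")
print(f"within 1: {within_1:.3f}, within 2: {within_2:.3f}, within 3: {within_3:.3f}")
```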

QUES 4- WHAT IS KERNEL DENSITY ESTIMATION?

Ans- It is the process of smoothing a histogram into a probability density function (PDF).

In the figure, a histogram is converted to a density function. So how do we do it? At every data point, let's say "5", we build a Gaussian kernel (the red lines in the density function) with that point as its mean, and we repeat this for every data point. In the end we add up the values of all the kernels at each position: for example, if 3 Gaussian kernels overlap at "5", we add up the values of these kernels at 5 to get the PDF there (scaled so that the total area under the curve is 1).

Now what about the standard deviation of every Gaussian kernel (the mean is already fixed)? Here the standard deviation is also called the bandwidth.

If we make the bandwidth too small, the estimate is jagged, like the red line in the figure. If we make the bandwidth too big, it is very flat, like the green line. With a moderate bandwidth it looks like the black line.
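The kernel-summing idea can be sketched in a few lines of NumPy. The function name gaussian_kde_manual, the simulated data and the three bandwidth values are hypothetical choices for illustration; the kernels are averaged (summed and divided by the number of points) so that the result integrates to 1.

```python
import numpy as np

def gaussian_kde_manual(data, grid, bandwidth):
    """Average of Gaussian kernels, one centred on each data point."""
    data = np.asarray(data, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # diffs[i, j] = (grid[i] - data[j]) / bandwidth
    diffs = (grid[:, None] - data[None, :]) / bandwidth
    kernels = np.exp(-0.5 * diffs**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

rng = np.random.default_rng(2)
data = rng.normal(loc=5.0, scale=1.5, size=200)   # illustrative sample
grid = np.linspace(-5.0, 15.0, 2001)

density_narrow = gaussian_kde_manual(data, grid, bandwidth=0.05)  # jagged
density_medium = gaussian_kde_manual(data, grid, bandwidth=0.5)   # reasonable
density_wide = gaussian_kde_manual(data, grid, bandwidth=5.0)     # over-smoothed

# A density should integrate to about 1 over a wide enough grid.
dx = grid[1] - grid[0]
area_medium = density_medium.sum() * dx
print(f"area under the medium-bandwidth estimate: {area_medium:.3f}")
print(f"peak heights: narrow={density_narrow.max():.3f}, "
      f"medium={density_medium.max():.3f}, wide={density_wide.max():.3f}")
```

A larger bandwidth spreads each kernel out, so the peak of the estimate drops and the curve flattens, which is the behaviour described above.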

QUES 5- IMPORTANCE OF SAMPLING THEOREM AND CENTRAL LIMIT THEOREM?

Ans-

Sampling Theorem

Let's say X is a random variable with any distribution, not necessarily Gaussian. We take a random sample of size n, let's say 30, and call it (s1). We take another random sample of size n and call it (s2). Let's say we take m samples like this, so the last sample is (sm).

We then take the mean of each sample: for sample s1, x1' is the mean, for s2, x2' is the mean, and so on.

x1', x2', x3', …, xm' (mean values of all the samples)

The distribution of these xi' values is called the sampling distribution of the sample mean.

Suppose you have a random variable that has a population mean, μ, and a population standard deviation, σ. If a sample of size n is taken, then the sample mean, x̄, has mean μx̄ = μ and standard deviation σx̄ = σ/√n. The standard deviation of x̄ is lower than σ because by taking the mean you are averaging out the extreme values that make the original random variable spread out.

We will use this in the Central Limit Theorem.
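The σ/√n relationship is easy to check by simulation. The sketch below assumes an exponential population with scale 2 (so its mean and standard deviation are both 2) and the values n = 30 and m = 10,000; all of these are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 30, 10_000          # sample size and number of samples
population_mean = 2.0      # an exponential with scale 2 has mean 2 ...
population_sd = 2.0        # ... and standard deviation 2

samples = rng.exponential(scale=2.0, size=(m, n))   # row i is sample s_i
sample_means = samples.mean(axis=1)                 # x1', x2', ..., xm'

observed_se = sample_means.std(ddof=1)
theoretical_se = population_sd / np.sqrt(n)

print(f"mean of sample means: {sample_means.mean():.3f} (population mean {population_mean})")
print(f"std of sample means:  {observed_se:.3f} (sigma/sqrt(n) = {theoretical_se:.3f})")
```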

CENTRAL LIMIT THEOREM

The theorem states: suppose a random variable X (the population distribution) has any distribution with a finite mean and standard deviation. If samples of size n are taken, the distribution of the sample mean, x̄, approaches a normal distribution as n increases.

Let's take a random variable X (population distribution) with any distribution that has a finite mean and standard deviation (a Pareto distribution can have an infinite mean or standard deviation for some parameter values, so it does not always qualify), and take m samples of size n, let's say

s1, s2, s3, …, sm. We then take the mean of each sample, x1', x2', x3', …, xm'. The central limit theorem says that if we plot the distribution of these means, it tends to a NORMAL DISTRIBUTION with mean equal to the population mean and variance equal to (variance of population/n).

Let's say m=1000 and n=30. By looking at just 30k data points we are able to estimate the population mean and population variance, which is what makes this such a fundamental theorem.
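One way to see the theorem at work is to watch the skewness of the sample means shrink towards 0 (the value for a normal distribution) as n grows. This sketch assumes an exponential population with mean 1 and variance 1; the helper name sample_means and the chosen values of n and m are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
m = 20_000   # number of samples drawn for each sample size

def sample_means(n):
    """Means of m samples of size n from an exponential population."""
    return rng.exponential(scale=1.0, size=(m, n)).mean(axis=1)

# Skewness of the distribution of sample means for growing n.
skew_by_n = {n: stats.skew(sample_means(n)) for n in (1, 5, 30, 200)}
for n, s in skew_by_n.items():
    print(f"n={n:>3}: skewness of sample means = {s:.3f}")

# The variance of the sample means should be close to (population variance / n).
var_30 = sample_means(30).var()
print(f"variance of means for n=30: {var_30:.4f} (expected about {1 / 30:.4f})")
```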

QUES 6- WHAT IS THE IMPORTANCE OF THE Q-Q PLOT?

Ans 6- There are various ways to check whether your distribution is Gaussian or not. THE TWO MOST USED TECHNIQUES ARE

  1. Q-Q PLOT (QUANTILE-QUANTILE PLOT)
  2. KS TEST (WE WILL SEE LATER)

So how do we draw a Q-Q plot?

Let's assume we have a random variable X and we take 500 observations of it, let's say x1, x2, …, x500.

HERE WE DO NOT KNOW THE DISTRIBUTION OF X. AT THE END OF THE Q-Q PLOT WE SHOULD KNOW WHETHER IT IS NORMALLY DISTRIBUTED OR NOT.

STEPS TO FOLLOW

1. Sort the xi's in ascending order and find the percentiles

(If you don't know how to find a percentile or what exactly a percentile is, let's assume I have 100 values and I sort them in ascending order.

X={x1, x2, x3, …, x100}, here x1<x2<…<x100

In this set, let's say I rank each value from 1 to 100, so the first value gets rank 1 and the last value gets rank 100.

I can say that about 10% of the values lie at or below x10 (the value of 10th rank, or the 10th percentile), and about 90% of the values lie above it.

That is the meaning of percentile.)

So we will get 100 percentile values for the original 500 observations

x5, x10, x15, …, x500 {these are the percentile values}

x5 is the value at or below which only 1% of the values lie (because here the sample size is 500, not 100)

x10 is the value at or below which only 2% of the values lie

x25 is the value at or below which only 5% of the values lie

2. The second step is to create a random variable Y which has a Normal Distribution with mean=0 and standard deviation=1.

Again we take 500 observations, sort them and find their percentiles

so let's say we have the 100 percentile values y1, y2, y3, …, y100 (the same as we did with our original distribution X)

LET ME REMIND YOU THAT WE DON'T KNOW THE DISTRIBUTION OF X. THAT'S WHY WE ARE USING THE Q-Q PLOT TO DETERMINE WHETHER THE DISTRIBUTION OF X IS GAUSSIAN/NORMAL OR NOT.

3. The third step is to draw the Q-Q plot between X and Y

so we pair the percentile values: {x1,y1}, {x2,y2}, {x3,y3}, …, {x100,y100}

We plot these pairs, and if all the points lie (approximately) on a straight line, it means X is NORMALLY DISTRIBUTED, although it need not have mean=0 and standard deviation=1.

If the points do not lie on a straight line, it means X is not NORMALLY DISTRIBUTED.

In the picture below the points deviate at the ends, which means the tails of the sample differ from those of a normal distribution, so the sample is not normally distributed.

If the number of observations is small, it is hard to interpret a Q-Q plot.

Q-Q plots are also used to check whether two random variables X and Y have the same distribution, by the same method.
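The three steps can be sketched without any plotting by measuring how close the paired percentiles are to a straight line. Note one swap: instead of simulating 500 observations of Y, this sketch uses the exact quantiles of the standard normal from scipy.stats.norm.ppf. The two samples, the helper name qq_correlation and the use of a correlation coefficient as a "straightness" score are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x_normal = rng.normal(loc=50.0, scale=8.0, size=500)   # should look normal
x_skewed = rng.exponential(scale=8.0, size=500)        # should not

percent = np.arange(1, 100)                     # 1st to 99th percentile
theoretical = stats.norm.ppf(percent / 100.0)   # standard normal quantiles

def qq_correlation(sample):
    """Correlation between sample percentiles and normal quantiles."""
    sample_q = np.percentile(sample, percent)
    return np.corrcoef(theoretical, sample_q)[0, 1]

r_normal = qq_correlation(x_normal)
r_skewed = qq_correlation(x_skewed)

# A value very close to 1 means the Q-Q points lie almost on a straight line.
print(f"normal sample:      r = {r_normal:.4f}")
print(f"exponential sample: r = {r_skewed:.4f}")
```

The normal sample has mean 50 and standard deviation 8, not 0 and 1, yet its points still fall on a straight line; only the slope and intercept of that line change.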

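CODE

import numpy as np
import scipy.stats as stats
import pylab

# X holds the observations to check; an illustrative normal sample is used here
X = np.random.normal(loc=0, scale=1, size=500)
stats.probplot(X, dist='norm', plot=pylab)
pylab.show()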

QUES 7- WHAT IS A UNIFORM DISTRIBUTION?

Ans 7- In statistics, a uniform distribution is a type of probability distribution in which all outcomes are equally likely. A deck of cards has uniform distributions within it because drawing a heart, a club, a diamond or a spade is equally likely. A fair coin also has a uniform distribution because the probability of getting heads or tails in a coin toss is the same.

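QUES 8- WHAT ARE DISCRETE AND CONTINUOUS UNIFORM DISTRIBUTIONS?

Ans 8-

1. DISCRETE- A finite number of outcomes, each with the same probability. For example, a fair die takes each of the values 1 to 6 with probability 1/6.

2. CONTINUOUS- Every value in an interval [a, b] is equally likely, so the density is constant at 1/(b-a) on that interval and 0 outside it.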
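Both kinds of uniform distribution can be simulated with NumPy. The die example, the interval [2, 5] and the sample sizes below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)

# Discrete uniform: a fair six-sided die (the upper bound is exclusive).
die_rolls = rng.integers(low=1, high=7, size=60_000)
face_freq = np.bincount(die_rolls, minlength=7)[1:] / die_rolls.size
print("relative frequency of faces 1..6:", np.round(face_freq, 3))

# Continuous uniform on [a, b): mean (a + b) / 2, variance (b - a)^2 / 12.
a, b = 2.0, 5.0
u = rng.uniform(low=a, high=b, size=60_000)
print(f"sample mean {u.mean():.3f} (expected {(a + b) / 2})")
print(f"sample variance {u.var():.3f} (expected {(b - a) ** 2 / 12})")
```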

QUES 9- HOW TO RANDOMLY SAMPLE DATA POINTS?

Ans 9- Simple random sampling is the most basic and common type of sampling method used in quantitative social science research and in scientific research generally. The main benefit of the simple random sample is that each member of the population has an equal chance of being chosen for the study. This means the sample is selected in an unbiased way and, if it is large enough, is likely to be representative of the population. In turn, the statistical conclusions drawn from the analysis of the sample are more likely to be valid.

There are multiple ways of creating a simple random sample. These include the lottery method, using a random number table, using a computer, and sampling with or without replacement.

Lottery Method of Sampling

The lottery method of creating a simple random sample is exactly what it sounds like. A researcher randomly picks numbers, with each number corresponding to a subject or item, in order to create the sample. To create a sample this way, the researcher must ensure that the numbers are well mixed before selecting the sample population.

Sampling With Replacement

Sampling with replacement is a method of random sampling in which members or items of the population can be chosen more than once for inclusion in the sample. Let’s say we have 100 names each written on a piece of paper. All of those pieces of paper are put into a bowl and mixed up. The researcher picks a name from the bowl, records the information to include that person in the sample, then puts the name back in the bowl, mixes up the names, and selects another piece of paper. The person that was just sampled has the same chance of being selected again. This is known as sampling with replacement.

Sampling Without Replacement

Sampling without replacement is a method of random sampling in which members or items of the population can only be selected one time for inclusion in the sample. Using the same example above, let’s say we put the 100 pieces of paper in a bowl, mix them up, and randomly select one name to include in the sample. This time, however, we record the information to include that person in the sample and then set that piece of paper aside rather than putting it back into the bowl. Here, each element of the population can only be selected one time.
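On a computer, both schemes are one call to NumPy's random generator. The population of 100 numbered "names" mirrors the bowl example above; the sample sizes and seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
population = np.arange(1, 101)   # 100 numbered pieces of paper

# With replacement: the same name can be drawn more than once.
with_repl = rng.choice(population, size=100, replace=True)

# Without replacement: each name can be drawn at most once.
without_repl = rng.choice(population, size=20, replace=False)

n_unique_with = np.unique(with_repl).size
n_unique_without = np.unique(without_repl).size

print(f"with replacement:    {n_unique_with} distinct names in 100 draws")
print(f"without replacement: {n_unique_without} distinct names in 20 draws")
```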

QUES 10- EXPLAIN THE BERNOULLI AND BINOMIAL DISTRIBUTIONS

Ans 10-

BERNOULLI DISTRIBUTION- This distribution is used when a single trial has two outcomes: the probability of one outcome is p and the probability of the other is 1-p. It is a discrete distribution.

BINOMIAL DISTRIBUTION- A binomial distribution describes the number of SUCCESS outcomes when an experiment with only two possible outcomes, SUCCESS or FAILURE, is repeated independently multiple times (the prefix "bi" means two, or twice). For example, a coin toss has only two possible outcomes, heads or tails, and taking a test could have two possible outcomes, pass or fail. The probability of exactly k successes in n trials is given by the binomial formula P(X = k) = C(n, k) · p^k · (1-p)^(n-k).

  • The first variable in the binomial formula, n, stands for the number of times the experiment runs.
  • The second variable, p, represents the probability of one specific outcome.
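The link between the two distributions can be sketched by building binomial counts out of Bernoulli trials and comparing with scipy.stats.binom.pmf. The values n = 10, p = 0.3, k = 3 and the number of simulated experiments are illustrative assumptions.

```python
import numpy as np
from scipy import stats

n, p = 10, 0.3   # number of trials and probability of success
rng = np.random.default_rng(8)

# Each row is one experiment made of n Bernoulli(p) trials (True = success).
trials = rng.random(size=(50_000, n)) < p
successes = trials.sum(axis=1)   # number of successes per experiment

k = 3
empirical = np.mean(successes == k)   # simulated P(X = k)
exact = stats.binom.pmf(k, n, p)      # C(n, k) * p^k * (1 - p)^(n - k)

print(f"P(X = {k}): simulated {empirical:.4f}, exact {exact:.4f}")
print(f"mean number of successes: {successes.mean():.3f} (n * p = {n * p})")
```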

QUES 11- WHAT IS CHEBYSHEV'S INEQUALITY?

Ans- This is a very interesting topic. Let's say I have a random variable X which describes the heights of all the students in a school (or the people in an office, or anywhere).

CASE 1- Let's say we know the distribution of X, and it is Gaussian.

When we know it is Gaussian, we know it follows the 68%, 95% and 99.7% rule: 68% of the data lies within one standard deviation of the mean, 95% lies within two standard deviations and 99.7% lies within three standard deviations.

We can easily plot a CDF of the data and answer any question. For example, let's say we know the mean=150cm and standard deviation=10cm, so by this rule 95% of the data lies within two standard deviations of the mean.

P(μ-2σ<X<μ+2σ)=95%, which means 95% of people's heights lie in the range [130<X<170].

CASE 2- What if we do not know the distribution? Let's take an example: a random variable X which describes the salaries of all the people in a country. We do not know its distribution, but we have estimated its mean and standard deviation from samples. We only need the mean to be finite and the standard deviation to be finite and non-zero.

Now the question is: can we know what % of salaries lie within two standard deviations of the mean, which is P(μ-2σ<X<μ+2σ)?

Let's say μ=40k and σ=10k. Can we know what % of individuals have a salary in the range [20k, 60k], which is exactly within two standard deviations of the mean?

Here comes CHEBYSHEV'S INEQUALITY. IT SAYS,

P(|X-μ|≥kσ)≤1/k²

where k is a positive constant (the bound is only informative for k>1)

P(X≥μ+kσ OR X≤μ-kσ)≤1/k²

It can be written as

P(μ-kσ<X<μ+kσ) ≥ 1-(1/k²)

Now we can easily answer the salary question because μ-2σ=20k and μ+2σ=60k, so k=2, and from this

P(20k<X<60k)≥1-(1/4)

P(20k<X<60k)≥0.75, which means at least 75% of people's salaries lie in this range.
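The bound can be checked on data whose distribution is clearly not Gaussian. The lognormal "salaries" below are a hypothetical stand-in chosen for illustration; the inequality itself makes no assumption about the shape of the distribution.

```python
import numpy as np

rng = np.random.default_rng(9)
# Hypothetical right-skewed salaries; the lognormal choice is an assumption.
salaries = rng.lognormal(mean=10.5, sigma=0.5, size=200_000)

mu = salaries.mean()
sigma = salaries.std()   # population-style standard deviation (ddof=0)

results = {}
for k in (1.5, 2.0, 3.0):
    # Observed fraction strictly within k standard deviations of the mean.
    inside = np.mean(np.abs(salaries - mu) < k * sigma)
    bound = 1 - 1 / k**2          # Chebyshev's lower bound
    results[k] = (inside, bound)
    print(f"k={k}: observed {inside:.3f} >= Chebyshev bound {bound:.3f}")
```

The observed fractions are usually well above the bound, because Chebyshev's inequality has to hold for every distribution with a finite mean and standard deviation, including the worst case.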

QUES 12- EXPLAIN THE BOX-COX TRANSFORMATION

Ans 12-

The Box-Cox transformation is a mathematical trick (a power transform) that makes strictly positive, skewed data, such as data from a Pareto distribution, look more like a Gaussian distribution.

Let's understand how Box-Cox works.

Let's say we have Pareto-distributed data (X) and its data points are x1, x2, x3, …, xn

STEP 1

boxcox(X) = lambda

So basically you give the "n" observations of X to Box-Cox and it gives you lambda.

How Box-Cox finds lambda involves a lot of mathematics (it is chosen by maximum likelihood), and it is not necessary to get into that maths here.

So let's assume Box-Cox is a box into which n observations go and out of which lambda comes.

STEP 2

If the lambda we got in step 1 is 0, then just taking log(xi) gives the transformed data; otherwise we use the formula yi = (xi^lambda - 1)/lambda. The transformed values are approximately Gaussian.

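CODE FOR BOX COX

import scipy.stats
X_transformed, lamda = scipy.stats.boxcox(X)  # X must be strictly positive; lambda is estimated when lmbda is not given

It is literally one line of code, but it is an important step to understand.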
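A slightly fuller sketch shows both steps on simulated data and re-applies the formula by hand. The lognormal sample is an assumption made for illustration (it is strictly positive and right-skewed, and its log is exactly Gaussian, so the estimated lambda should come out near 0); the variable names are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
X = rng.lognormal(mean=0.0, sigma=1.0, size=5_000)   # positive, right-skewed

# Step 1 and step 2 in one call: scipy estimates lambda and transforms X.
X_transformed, lamda = stats.boxcox(X)

# Step 2 by hand, using the estimated lambda.
if abs(lamda) < 1e-12:
    manual = np.log(X)
else:
    manual = (X**lamda - 1) / lamda

skew_before = stats.skew(X)
skew_after = stats.skew(X_transformed)

print(f"estimated lambda: {lamda:.3f}")
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
print("manual formula matches scipy:", np.allclose(manual, X_transformed))
```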

THANKS FOR READING, ENJOY LEARNING MACHINE LEARNING…
