Important Questions on Statistics

In this blog, you’ll find Statistics Questions that were asked in interviews (Part 1). For part 2 click here.

Sumeet Agrawal
8 min read · Nov 20, 2021

Ques 1) What is the difference between inferential statistics and descriptive statistics?

Ans) Inferential statistics allows you to make predictions (“inferences”) from data. With inferential statistics, you take data from samples and make generalizations about the larger population the samples came from.

There are two main areas of inferential statistics:

  1. Estimating parameters. This means taking a statistic from your sample data (for example the sample mean) and using it to say something about a population parameter (i.e. the population mean).
  2. Hypothesis tests. This is where you can use sample data to answer research questions. For example, you might be interested in knowing if a new cancer drug is effective.

Descriptive statistics describe data (for example, a chart or graph). It provides exact and accurate information. Let’s say you have some sample data about a potential new cancer drug. You could use descriptive statistics to describe your sample, including:

  • Sample mean
  • Sample standard deviation
  • Making a bar chart or boxplot
  • Describing the shape of the sample probability distribution
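The sample summaries above can be sketched with Python’s standard `statistics` module. The drug-trial numbers below are made up purely for illustration:

```python
import statistics

# Hypothetical sample: tumor-size reduction (mm) for 8 trial patients
sample = [4.1, 3.8, 5.0, 4.4, 3.9, 4.7, 4.2, 4.5]

mean = statistics.mean(sample)    # sample mean
stdev = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)

print(f"mean = {mean:.3f}, stdev = {stdev:.3f}")  # mean = 4.325
```

Note that `statistics.stdev` uses the n − 1 (sample) denominator, while `statistics.pstdev` uses n (population).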

Ques 2) Most common characteristics used in descriptive statistics?

Ans) Center — the middle of the data. Mean / Median / Mode are the most commonly used measures.

  • Mean — Average of all the numbers
  • Median — the number in the middle
  • Mode — the number that occurs the most. The disadvantage of using Mode is that there may be more than one mode.

Spread — how the data is dispersed. Range / IQR / Standard Deviation / Variance are the most commonly used measures.

  • Range = Max − Min
  • Interquartile Range (IQR) = Q3 − Q1
  • Standard Deviation (σ) = √(∑(x − µ)² / n)
  • Variance = σ²
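All of these center and spread measures are available in the standard library. The data set below is hypothetical:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data

mean = statistics.mean(data)            # 5
median = statistics.median(data)        # 4.5
mode = statistics.mode(data)            # 4 (note: data may be multimodal)
data_range = max(data) - min(data)      # 7

q1, _, q3 = statistics.quantiles(data, n=4)  # quartiles
iqr = q3 - q1                           # 6.5 - 4.0 = 2.5

pvar = statistics.pvariance(data)       # population variance sigma^2 = 4
pstd = statistics.pstdev(data)          # population sigma = 2.0
```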

Shape — the shape of the data can be symmetric or skewed

  • Symmetric — the part of the distribution that is on the left side of the median is the same as the part of the distribution that is on the right side of the median
  • Left skewed — the left tail is longer than the right side
  • Right skewed — the right tail is longer than the left side

Outlier — An outlier is an abnormal value that lies far from the rest of the data

  • Keep the outlier if it appears to be a genuine extreme value (based on judgment)
  • Remove the outlier if it is likely a measurement or data-entry error (based on judgment)

Ques 3) What are left-skewed distribution and right-skewed distribution?

Ans) Left skewed

  • The left tail is longer than the right side
  • Mean < median < mode

Right skewed

  • The right tail is longer than the left side
  • Mode < median < mean
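The mean–median ordering above gives a quick way to judge skew direction in code. A minimal sketch with made-up numbers:

```python
import statistics

# Hypothetical data with one large value pulling the right tail out
right_skewed = [1, 2, 2, 3, 3, 4, 15]

mean = statistics.mean(right_skewed)      # 30/7, about 4.29
median = statistics.median(right_skewed)  # 3

# mean > median suggests a right (positive) skew;
# mean < median would suggest a left (negative) skew
print("right skewed" if mean > median else "not right skewed")
```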

Ques 4) Explain the Statistical test.

Ans) Statistical/Hypothesis tests are used to draw conclusions about a population from a sample. These tests check whether there is enough evidence in the sample data to conclude/deduce that a particular condition is also true for the entire population. They enable us to make decisions on the basis of observed patterns in the data.

There is a wide range of statistical tests. The choice of which statistical test to utilize relies upon the structure of data, the distribution of the data, and variable type. There are many different types of tests in statistics like t-test, Z-test, chi-square test, ANOVA test, binomial test, one sample median test, etc.

Choosing a Statistical test-

Parametric tests are used if the data is normally distributed. A parametric statistical test makes an assumption about the population parameters and the distributions that the data came from. These types of tests include t-tests, z-tests, and ANOVA tests, which assume data is from a normal distribution.

Non-parametric statistical test- Non-parametric tests are used when data is not normally distributed. Non-parametric tests include the chi-square test.

Ques 5) When should you use a t-test vs a z-test?

Ans) Z-test- A z-test is a statistical test used to determine whether two population means are different when the variances are known and the sample size is large. In a z-test, the sample mean is compared with the population mean. The parameters used are the population mean and population standard deviation. A z-test is used to validate the hypothesis that the sample drawn belongs to the same population.

Ho: Sample mean is same as the population mean(Null hypothesis)

Ha: Sample mean is not same as the population mean(Alternate hypothesis)

z = (x̄ − μ) / (σ / √n)

where x̄ = sample mean, μ = population mean, σ = population standard deviation, and n = sample size (so σ / √n is the standard error of the mean).
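The z statistic and its two-tailed p-value can be computed with nothing but the standard library (the population and sample figures below are invented for illustration):

```python
import math

# Hypothetical numbers: population mean 100, sigma 15, sample of 36 with mean 105
mu, sigma, n, x_bar = 100, 15, 36, 105

z = (x_bar - mu) / (sigma / math.sqrt(n))  # 5 / 2.5 = 2.0

# Two-tailed p-value from the standard normal CDF, via math.erf
p = 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))
# z = 2.0 gives p of roughly 0.0455, below 0.05, so reject Ho at the 5% level
print(f"z = {z}, p = {p:.4f}")
```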

T-test- In a t-test, the means of the given samples are compared. A t-test is used when the population parameters (mean and standard deviation) are not known and the sample size is small (typically below 30).

Types of t-test-

a) One sample t-test — The mean of a single group is compared with a given mean. For example, to check for an increase or decrease in sales when the average sales figure is given. Here’s the formula to calculate this:

t = (m − µ) / (s / √n)

where,

  • t = t-statistic
  • m = mean of the group
  • µ = theoretical value or population mean
  • s = standard deviation of the group
  • n = group size or sample size
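A one-sample t statistic is easy to compute by hand; the sales figures below are hypothetical:

```python
import math
import statistics

# Hypothetical monthly sales figures; the claimed average is 50
sales = [52, 48, 55, 51, 49, 53, 50, 54]
mu = 50

m = statistics.mean(sales)   # 51.5
s = statistics.stdev(sales)  # sample standard deviation
n = len(sales)

t = (m - mu) / (s / math.sqrt(n))  # about 1.73
print(f"t = {t:.3f}")
```

The t value would then be compared against the critical value from a t-distribution with n − 1 = 7 degrees of freedom.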

b) Independent sample t-test — The independent t-test which is also called the two-sample t-test or student’s t-test, is a statistical test that determines whether there is a statistically significant difference between the means in two unrelated groups.

For example, let’s say we want to compare the average height of the male employees to the average height of the females. The two groups do not need to be the same size for this comparison. This is where a two-sample t-test is used.

Here’s the formula to calculate the t-statistic for a two-sample t-test:

t = (mA − mB) / √(S² (1/nA + 1/nB))

where,

  • mA and mB are the means of the two different samples
  • nA and nB are the sample sizes
  • S² is an estimator of the common (pooled) variance of the two samples, such as:

S² = (∑(x − mA)² + ∑(x − mB)²) / (nA + nB − 2)
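A pooled-variance two-sample t statistic, computed step by step on made-up height data:

```python
import math
import statistics

# Hypothetical heights (cm); note the groups need not be the same size
group_a = [175, 178, 172, 180, 176]
group_b = [165, 168, 170, 167]

m_a, m_b = statistics.mean(group_a), statistics.mean(group_b)
n_a, n_b = len(group_a), len(group_b)

# Pooled (common) variance S^2
s2 = (sum((x - m_a) ** 2 for x in group_a) +
      sum((x - m_b) ** 2 for x in group_b)) / (n_a + n_b - 2)

t = (m_a - m_b) / math.sqrt(s2 * (1 / n_a + 1 / n_b))
print(f"t = {t:.3f}")
```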

c) Paired sample t-test — Tests for the difference between two measurements of the same variable from the same population (e.g., pre- and post-test scores). For example, in a training program, the performance score of each trainee before and after completing the program.

The formula to calculate the t-statistic for a paired t-test is:

t = m / (s / √n)

where,

  • t = t-statistic
  • m = mean of the differences between the paired values
  • s = standard deviation of the differences
  • n = number of pairs
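The paired test reduces to a one-sample test on the per-pair differences. A sketch with invented pre/post training scores:

```python
import math
import statistics

# Hypothetical pre/post training scores for the same 6 trainees
pre  = [70, 65, 80, 75, 60, 72]
post = [75, 70, 85, 78, 68, 74]

diffs = [b - a for a, b in zip(pre, post)]  # per-trainee improvement

m = statistics.mean(diffs)   # mean of the differences
s = statistics.stdev(diffs)  # standard deviation of the differences
n = len(diffs)

t = m / (s / math.sqrt(n))
print(f"t = {t:.3f}")
```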

Ques 6) Explain ANOVA Test and Chi-Square Test.

Ans) ANOVA Test- Analysis of variance (ANOVA) is a statistical technique that is used to check if the means of two or more groups are significantly different from each other. ANOVA checks the impact of one or more factors by comparing the means of different samples. Using repeated t-tests instead of ANOVA when there are more than two groups is unreliable, because running many pairwise tests inflates the chance of a false positive (Type I error).

The hypothesis being tested in ANOVA is

Ho: All pairs of samples are the same i.e. all sample means are equal

Ha: At least one pair of samples is significantly different

In an ANOVA test we calculate the F value and compare it with the critical value. For a one-way ANOVA:

F = (SSB / (k − 1)) / (SSW / (n − k))

where

SSB = between-group sum of squares

SSW = within-group (residual) sum of squares

k = number of groups

n = total number of observations
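A one-way ANOVA F value can be computed by hand as a sketch (the three groups of scores below are made up):

```python
import statistics

# Hypothetical scores for three groups
groups = [[85, 86, 88, 75], [91, 92, 93, 85], [79, 78, 88, 94]]

k = len(groups)                   # number of groups
n = sum(len(g) for g in groups)   # total observations
grand_mean = statistics.mean(x for g in groups for x in g)

# Between-group sum of squares (SSB) and within-group sum of squares (SSW)
ssb = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
ssw = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)

f = (ssb / (k - 1)) / (ssw / (n - k))
print(f"F = {f:.3f}")
```

The F value would then be compared with the critical value of an F-distribution with (k − 1, n − k) degrees of freedom.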

Chi-square test (χ² test)- The chi-square test is used to check for an association between two categorical variables. Calculating the chi-square statistic and comparing it against a critical value from the chi-square distribution tells us whether the observed frequencies are significantly different from the expected frequencies.

The hypothesis being tested for chi-square is-

Ho: Variable x and Variable y are independent

Ha: Variable x and Variable y are not independent.

Chi-square formula:

χ² = ∑ (o − e)² / e

where o = observed frequency, e = expected frequency.
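For a contingency table, each expected frequency is (row total × column total) / grand total. A sketch on a made-up 2×2 table:

```python
# Hypothetical 2x2 contingency table: two categorical variables,
# e.g. gender (rows) vs product preference (columns)
observed = [[30, 10],
            [20, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / total  # expected frequency
        chi2 += (o - e) ** 2 / e

print(f"chi2 = {chi2:.3f}")
```

For a 2×2 table the statistic is compared with a chi-square critical value at (rows − 1)(cols − 1) = 1 degree of freedom.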

Ques 7) Explain Central Limit Theorem with an example.

Ans) The Central Limit Theorem (CLT) states that, given a sufficiently large sample size drawn from a population with a finite variance, the distribution of the sample means will be approximately normal, with its mean equal to the population mean, regardless of the shape of the population’s own distribution.

Sample sizes equal to or greater than 30 are often considered sufficient for the CLT to hold. A key consequence is that the average of the sample means approaches the population mean, while the standard deviation of the sample means equals the population standard deviation divided by √n (the standard error). A sufficiently large sample size can therefore predict the characteristics of a population quite accurately.

Let’s understand the central limit theorem with the help of an example.

Consider that there are 15 sections in the engineering department of a university and each section hosts around 100 students. Our task is to calculate the average weight of students in the engineering department.

The basic approach is to simply calculate the average:

  • Measure the weights of all the students in the engineering department
  • Add all the weights
  • Finally, divide the total sum of weights by a total number of students to get the average

But what if the size of the data is humongous? Does this approach make sense? Not really — measuring the weight of all the students will be a very tiresome and long process. So, what can we do instead? Let’s look at an alternate approach.

  • First, draw groups of students at random from the class. We will call this a sample. We’ll draw multiple samples, each consisting of 30 students.
  • Calculate the individual mean of these samples
  • Calculate the mean of these sample means
  • This value will give us the approximate mean weight of the students in the engineering department
  • Additionally, the histogram of the sample mean weights of students will resemble a bell curve (or normal distribution)

This, in a nutshell, is what the central limit theorem is all about.
