3 Weeks Beginners Guide to Ace Data Science Interview: #Day 6

Statistical Tests to have a deeper understanding of data

Vinay Vikram
Accredian
8 min read · Feb 15, 2020


About the Series

Data Science is an exciting career choice, and the field is seeing a lot of hiring across fresher, lateral and experienced positions. It's one thing to know the concepts and quite another to crack the rigorous interviews for data science roles. A candidate who knows the kinds of questions asked and how the interview process works is on the right path to an excellent career in this evolving field.

This 3-week beginners guide to Ace Data Science Interview will be a useful asset for individuals who are preparing for the Data Science interviews. Every day for the next 21 days, we will talk about the different areas of the Data Science field and cover them elaborately. So sit back and start reading the article to get a finer understanding of the Data Science field and go prepared for the interviews.

In our previous Day 5 blog, the focus was on descriptive statistics; in this blog, we'll look at interview questions on inferential statistics. From an interview perspective this blog is really important, because the concepts that fall under inferential statistics are among the more complex ones.

Inferential Statistics

Inferential statistics uses statistical models to help you compare your sample data to other samples or to previous research, and to draw conclusions about a population from a sample. Most research uses models from the General Linear Model family, which includes Student's t-tests, ANOVA (Analysis of Variance), regression analysis and various other models that assume a straight-line ("linear") relationship.


Question 1: When to use t distribution and when to use z distribution?

The Z-distribution is appropriate when the following conditions are satisfied:

  • Population variance is known.
  • Sample size > 30

Otherwise we should use the t-distribution, i.e. when:

  • Population variance is unknown (it has to be estimated from the sample).
  • Sample size < 30
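To see why the choice matters, here is a minimal sketch that compares the two-sided 95% critical values of the two distributions. It assumes SciPy is available; the sample sizes (10 and 1000) are illustrative and not taken from the article.

```python
from scipy import stats

confidence = 0.95
upper_quantile = 1 - (1 - confidence) / 2  # 0.975 for a two-sided 95% interval

# Critical value from the standard normal (Z) distribution
z_crit = stats.norm.ppf(upper_quantile)

# Critical values from the t-distribution for a small and a large sample
t_crit_small = stats.t.ppf(upper_quantile, df=10 - 1)    # n = 10
t_crit_large = stats.t.ppf(upper_quantile, df=1000 - 1)  # n = 1000

print(f"z critical value:        {z_crit:.3f}")
print(f"t critical value, n=10:   {t_crit_small:.3f}")
print(f"t critical value, n=1000: {t_crit_large:.3f}")
```

For small samples the t critical value is noticeably larger than the z value, which reflects the extra uncertainty from estimating the variance. As the sample grows, the two values converge.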

Question 2: What do you mean by the degree of freedom?

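DF (degrees of freedom) is the number of values in a calculation that are free to vary.

DF is used with the t-distribution and not with the Z-distribution.

For a series, DF = n-1 (where n is the number of observations in the series), because once the mean is fixed only n-1 of the values can vary freely.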

Question 3: What are the effects of the width of confidence interval?

  • The confidence interval is used for decision making.
  • As the confidence level increases, the width of the confidence interval also increases.
  • A wider interval is more likely to contain the true value but is less precise, so it tells us less.
  • Wide CI: low risk of missing the true value, but little useful information
  • Narrow CI: more precise, but higher risk of missing the true value

Question 4: What is the difference between 95% confidence level and 99% confidence level?

The confidence interval gets wider as we move from a 95% confidence level to a 99% confidence level, because a higher confidence level requires a larger critical value.
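A small sketch of how much wider the interval gets, using only the Python standard library. It assumes a z-based interval (known population standard deviation); the helper name `z_critical` is made up for this illustration.

```python
from statistics import NormalDist


def z_critical(confidence):
    """Two-sided critical value of the standard normal for a confidence level."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


z95 = z_critical(0.95)
z99 = z_critical(0.99)

# The interval width is 2 * z * sigma / sqrt(n), so for the same data
# the ratio of the two widths is simply the ratio of the critical values.
width_ratio = z99 / z95

print(f"z for 95%: {z95:.3f}")
print(f"z for 99%: {z99:.3f}")
print(f"99% interval is {width_ratio:.2f}x as wide as the 95% interval")
```

Under these assumptions the 99% interval is roughly 31% wider than the 95% interval built from the same sample.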

Question 5: What are H0 and H1? What is H0 and H1 for a two-tail test?

H0 is known as the null hypothesis. It is the normal case / default case.

  • For a one-tail test (for example): x <= µ
  • For a two-tail test: x = µ

H1 is known as the alternate hypothesis. It is the other case, the one we are looking for evidence of.

  • For a one-tail test (for example): x > µ
  • For a two-tail test: x ≠ µ

Question 6: What is p-value in hypothesis testing?

The p-value is compared with the significance level α (0.05 in the examples below).

If the p-value is more than the significance level, then we fail to reject H0.

  • If p-value = 0.055 (α = 0.05): weak evidence against H0, so we fail to reject it

If the p-value is less than the significance level, then we reject H0.

  • If p-value = 0.015 (α = 0.05): evidence against H0
  • If p-value = 0.005 (α = 0.05): strong evidence against H0

Question 7: What do we mean by making a decision based on comparing the p-value with the significance level?

  • If the p-value is more than the significance level, then we fail to reject H0
  • If the p-value is less than the significance level, then we reject H0
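The decision rule above can be sketched as a tiny function. The name `decide` is hypothetical, and treating a p-value exactly equal to the significance level as a rejection is a common convention that this sketch assumes, not something stated in the rule itself.

```python
def decide(p_value, alpha=0.05):
    """Compare a p-value with the significance level and return the decision."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be between 0 and 1")
    # Assumption: a p-value equal to alpha counts as a rejection.
    if p_value <= alpha:
        return "reject H0"
    return "fail to reject H0"


for p in (0.005, 0.015, 0.055):
    print(f"p-value = {p}: {decide(p)}")
```

Note that "fail to reject H0" is not the same as "accept H0": the test only says the sample did not provide enough evidence against the null hypothesis.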

Question 8: What is the difference between one tail and two tail hypothesis testing?

2-tail test: Critical region is on both sides of the distribution

  • H0: x = µ
  • H1: x ≠ µ

1-tail test: Critical region is on one side of the distribution

  • H0: x <= µ
  • H1: x > µ

Question 9: Why is the t-value the same for 90% two tail and 95% one tail test?

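  • A 90% two-tail test splits α = 0.10 into two critical regions of 0.05 each, while a 95% one-tail test puts α = 0.05 in a single critical region.
  • In both cases the cut-off leaves 5% in one tail, so the critical t-value is the same.
  • For the same reason, p-value of 1-tail = p-value of 2-tail / 2 (when the effect is in the hypothesized direction).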
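This equality is easy to check numerically. The sketch below assumes SciPy is available and uses 20 degrees of freedom purely as an illustrative value.

```python
from scipy import stats

df = 20  # illustrative degrees of freedom (an assumption for this sketch)

alpha_two_tail = 0.10  # 90% confidence, split across two tails
alpha_one_tail = 0.05  # 95% confidence, all in one tail

# Two-tail: each tail holds alpha / 2, so the cut-off is the 1 - alpha/2 quantile
t_two_tail = stats.t.ppf(1 - alpha_two_tail / 2, df)

# One-tail: the single tail holds alpha, so the cut-off is the 1 - alpha quantile
t_one_tail = stats.t.ppf(1 - alpha_one_tail, df)

print(f"t critical value, 90% two-tail: {t_two_tail:.4f}")
print(f"t critical value, 95% one-tail: {t_one_tail:.4f}")
```

Both calls ask for the 0.95 quantile of the same t-distribution, which is why the printed values match.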

Question 10: What is the difference between the sample standard deviation and population standard deviation?
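One standard way to state the difference is through the divisor: the population standard deviation divides the sum of squared deviations by n, while the sample standard deviation divides by n-1 (Bessel's correction) because the mean itself is estimated from the sample. A minimal sketch with the Python standard library, using made-up numbers:

```python
from statistics import pstdev, stdev

# Illustrative values, not taken from the article
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Population standard deviation: divides by n
population_std = pstdev(data)

# Sample standard deviation: divides by n - 1
sample_std = stdev(data)

print(f"population std (divide by n):   {population_std:.3f}")
print(f"sample std     (divide by n-1): {sample_std:.3f}")
```

The sample standard deviation is always slightly larger for the same data, and the gap shrinks as n grows.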

“When the Going Gets Tough, the Tough Get Going”

Question 11: What are different types of Hypothesis Testing?

The different types of hypothesis testing are as follows:

  • T-test: The t-test is used when the population standard deviation is unknown and the sample size is comparatively small (sample size < 30).
  • Z-test: The z-test is used when the population standard deviation is known and the sample size is large (sample size > 30).
  • Chi-Square Test for Independence: This test is used to find out whether there is a significant association between two categorical variables in the population.
  • Analysis of Variance (ANOVA): This kind of hypothesis test is used to analyze differences between the means of several groups. It plays the same role as a t-test but is used when there are more than two groups.

Note: Both the t-test and the z-test are used to check "What is the probability that two samples come from the same population?"

Z-distribution (known variance & bigger sample size)

t-distribution (unknown variance & smaller sample size)

Question 12 : What is p-value?

A p-value is used in hypothesis testing to help you decide whether to reject the null hypothesis. It is the probability of observing data at least as extreme as your sample, assuming the null hypothesis is true. The smaller the p-value, the stronger the evidence that you should reject the null hypothesis.

Question 13: You are given the following data on data scientist salaries.

+--+------------------+
| | scientist salary |
+--+------------------+
| | set |
| | 17,313 |
| | 04,002 |
| | 13,038 |
| | 01,936 |
| | 4,560 |
| | 13,136 |
| | 0,740 |
| | 00,536 |
| | 05,052 |
| | 7,201 |
| | 1,986 |
| | 4,868 |
| | 0,745 |
| | 02,848 |
| | 5,927 |
| | 12,276 |
| | 08,637 |
| | 6,818 |
| | 2,307 |
| | 14,564 |
| | 09,714 |
| | 08,833 |
| | 15,295 |
| | 9,279 |
| | 1,720 |
| | 9,344 |
| | 14,426 |
| | 0,410 |
| | 5,118 |
| | 13,382 |
+--+------------------+

With the information

  • Sample mean $ 100,200
  • Population std $ 15,000

Find the confidence interval for a confidence level of 95%.

Solution: For a complete solution check the following sheet
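As a rough sketch of how the calculation could go: since the population standard deviation is given, a z-based interval is a reasonable choice. The sample size of 30 is an assumption taken from the number of rows in the table; the mean and standard deviation are the ones stated in the question.

```python
from math import sqrt
from statistics import NormalDist

sample_mean = 100_200     # given in the question
population_std = 15_000   # given in the question
n = 30                    # assumption: number of rows in the table
confidence = 0.95

# Two-sided critical value of the standard normal distribution
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

# Standard error of the mean and margin of error
standard_error = population_std / sqrt(n)
margin = z * standard_error

ci_low = sample_mean - margin
ci_high = sample_mean + margin

print(f"z = {z:.2f}, standard error = {standard_error:,.0f}")
print(f"95% CI: ({ci_low:,.0f}, {ci_high:,.0f})")
```

Under these assumptions the interval comes out at roughly $94,832 to $105,568.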

Question 14: You are given the following data

+-------------+
| Dataset     |
+-------------+
| $ 78,000    |
| $ 90,000    |
| $ 75,000    |
| $ 117,000   |
| $ 105,000   |
| $ 96,000    |
| $ 89,500    |
| $ 102,300   |
| $ 80,000    |
+-------------+

  • Mean $ 92,533
  • St. deviation $ 13,932

Find the confidence interval for a confidence level of 95%.

Solution: For a complete solution check the following sheet for help
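One possible way to work this out: with only 9 observations and a standard deviation computed from the sample, a t-based interval with n-1 = 8 degrees of freedom is the natural choice. In the sketch below the critical value 2.306 is read from a standard t-table (0.975 quantile, 8 degrees of freedom) rather than computed.

```python
from math import sqrt
from statistics import mean, stdev

salaries = [78_000, 90_000, 75_000, 117_000, 105_000,
            96_000, 89_500, 102_300, 80_000]

n = len(salaries)
sample_mean = mean(salaries)
sample_std = stdev(salaries)  # sample standard deviation (divides by n - 1)

# Two-sided 95% critical value of the t-distribution with n - 1 = 8 degrees
# of freedom, taken from a standard t-table.
t_crit = 2.306

standard_error = sample_std / sqrt(n)
margin = t_crit * standard_error

ci_low = sample_mean - margin
ci_high = sample_mean + margin

print(f"mean = {sample_mean:,.0f}, std = {sample_std:,.0f}")
print(f"95% CI: ({ci_low:,.0f}, {ci_high:,.0f})")
```

The computed mean and standard deviation agree with the values given in the question, and the interval is roughly $81,824 to $103,242.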

Question 15: Explain the chi-squared test?

A chi-square test is used to determine the probability of an observed frequency of events given an expected frequency.

E.g.: If we flip a coin 18 times and observe that it comes up heads 12 times, can we say that this is due to chance, or should we assume that our coin is biased?

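The chi-square statistic can be expressed as:

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

where O_i is the observed frequency and E_i is the expected frequency of category i.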

In simple terms, a chi-square test for goodness of fit is used to verify whether an observed frequency distribution differs from a theoretical distribution or not.
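The coin example can be worked through in a few lines. This is a sketch using only the standard library: with two categories there is 1 degree of freedom, and for that special case the chi-square p-value can be written with the complementary error function.

```python
from math import erfc, sqrt

# Coin example: 18 flips, 12 heads and 6 tails observed
observed = [12, 6]
expected = [9, 9]  # a fair coin would give 9 heads and 9 tails on average

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Degrees of freedom = number of categories - 1 = 1.
# For 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
p_value = erfc(sqrt(chi2 / 2))

print(f"chi-square = {chi2:.2f}, p-value = {p_value:.3f}")
```

The statistic is 2.0 and the p-value is about 0.157, which is above 0.05, so on this evidence alone we would fail to reject the hypothesis that the coin is fair.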

Question 16: An automobile manufacturer believes that, out of every 100 cars sold, on average 25 are white, 20 are silver, 15 are black, and 40 are other colors like blue, red and green. To test their assumption, they gather data on 100 recent sales. Perform a chi-square test to determine if the observed sales match the manufacturer’s expectations, with a 95% level of confidence.

With α = 0.05, what conclusion can be drawn?

Question 17: In the game of Rock-paper-scissors, Abhi expects to win, tie and lose with equal frequency. Abhi plays R-P-S often, but he suspected his own games were not following that pattern, so he took a random sample of 24 games and recorded their outcomes. Here are his results.

+---------+-----+------+-----+
| Outcome | Win | Loss | Tie |
+---------+-----+------+-----+
| Games   | 4   | 13   | 7   |
+---------+-----+------+-----+

He wants to use these results to carry out a chi-square (goodness of fit) test to determine whether the distribution of his outcomes disagrees with an even distribution.

Choose the correct option


(a) χ2=5.25 and 0.05 < p-value < 0.10 | (b) χ2=21.875 and p-value < 0.0005
(c) χ2=5.25 and 0.15 < p-value < 0.2 | (d) χ2=21.875 and 0.0005 < p-value < 0.001

Solution:
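A sketch of the calculation using only the standard library. With 24 games and three equally likely outcomes, the expected count is 8 per outcome, and there are 3 - 1 = 2 degrees of freedom. For exactly 2 degrees of freedom the chi-square p-value has the simple closed form exp(-x / 2). The helper name `chi_square_statistic` is made up for this example.

```python
from math import exp


def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [4, 13, 7]                        # win, loss, tie
total_games = sum(observed)                  # 24
expected = [total_games / len(observed)] * len(observed)  # 8 each

chi2 = chi_square_statistic(observed, expected)

# Valid only for 2 degrees of freedom: P(chi-square > x) = exp(-x / 2)
p_value = exp(-chi2 / 2)

print(f"chi-square = {chi2:.2f}, p-value = {p_value:.4f}")
```

The statistic works out to 5.25 with a p-value of about 0.072, which lies between 0.05 and 0.10.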


If this blog helped you in any way, then do Follow and Clap👏, because your encouragement catalyzes inspiration and helps to create more cool stuff like this. As always, I welcome feedback and constructive criticism, love to hear from your end.

Check what’s on Day1, Day2, Day3, Day4, Day5

Final Thoughts and Closing Comments

There are some vital points many people fail to understand while they pursue their Data Science or AI journey. If you are one of them and are looking for a way to close these gaps, check out the certification programs provided by INSAID on their website. If you liked this story, I recommend the Global Certificate in Data Science & AI, because it covers the foundations, machine learning algorithms, and deep neural networks (basic to advanced).


Vinay Vikram
Accredian

Artificial Intelligence Researcher at @MOTHERSON | Check My Data Science Portfolio: https://vikramvinay.github.io/