MATHS FOR DATA SCIENCE

Diving deep into Statistics for Data Science

Confidence Intervals and Hypothesis Testing

Vishal Sharma
The Startup


Photo by Alex Chambers on Unsplash

Learning statistics is crucial if you want to make it big in a data science career. In my previous two articles, I covered the concepts of Binomial Distribution, Conditional Probability, Bayes Rule, Normal Distribution Theory, Sampling Distribution, and Central Limit Theorem.

In this article, I will cover the concepts of Confidence Intervals and Hypothesis Testing.

Confidence Intervals

In the last article, I talked about how a sampling distribution helps us understand the possible values of a statistic.

From Sampling Distribution to Confidence Intervals

It turns out we can use the sampling distribution to find the most likely values of our parameter. Imagine we have a sampling distribution for some statistic in the shape of a bell curve.

We can use this sampling distribution to build a confidence interval for the parameter of interest. In the image above, we have chosen a 95% confidence interval, which cuts off 2.5% from each tail.

In simple terms, we can be 95% confident that our parameter lies in this shaded area. More precisely, if we repeated the sampling many times, about 95% of the intervals built this way would contain the true parameter.
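To make the "cut off 2.5% from each side" idea concrete, here is a minimal sketch of a percentile bootstrap interval for a mean. The data, the `bootstrap_ci` helper, and its parameters are assumptions made up for illustration; they do not come from the original analysis.

```python
import random
import statistics

def bootstrap_ci(sample, n_boot=5000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `sample`."""
    rng = random.Random(seed)
    n = len(sample)
    # Approximate the sampling distribution by resampling with replacement.
    boot_means = sorted(
        statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_boot)
    )
    # Cut off (1 - level) / 2 from each tail, i.e. 2.5% per side for 95%.
    tail = (1 - level) / 2
    lower = boot_means[int(tail * n_boot)]
    upper = boot_means[int((1 - tail) * n_boot) - 1]
    return lower, upper

# Hypothetical data: 200 heights (cm) drawn from a made-up population.
data_rng = random.Random(42)
heights = [data_rng.gauss(170, 10) for _ in range(200)]

low, high = bootstrap_ci(heights)
print(f"95% CI for the mean height: ({low:.2f}, {high:.2f})")
```

The sorted bootstrap means play the role of the bell-shaped sampling distribution, and the two index lookups are the cuts at each tail.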

But what are the applications of confidence intervals? We could assess the effectiveness of different drugs by comparing two groups that each take a different drug. Or we could compare a group that takes a particular drug to a group that takes none at all.

A/B Testing in the picture!

One of the most common use cases for comparing two groups is known as A/B testing. For example, you compare one web page to another and determine which page layout drives more traffic.
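As a rough sketch of how such a comparison might look, the snippet below builds a confidence interval for the difference in conversion rates between two pages, using the normal approximation for proportions. The visitor counts and the `diff_in_rates_ci` helper are hypothetical.

```python
import math
from statistics import NormalDist

def diff_in_rates_ci(conv_a, n_a, conv_b, n_b, level=0.95):
    """Normal-approximation CI for (rate of B) - (rate of A)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Standard error of the difference between two independent proportions.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    # Critical value that leaves (1 - level) / 2 in each tail.
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# Hypothetical experiment: 1000 visitors per page,
# 120 conversions on page A and 160 on page B.
ci_low, ci_high = diff_in_rates_ci(120, 1000, 160, 1000)
print(f"95% CI for the lift of B over A: ({ci_low:.4f}, {ci_high:.4f})")
```

If the whole interval sits above zero, the data suggest that page B converts better than page A. If it straddles zero, the experiment cannot tell the two pages apart at this confidence level.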

Practical vs Statistical Significance

Imagine you own a cookie business and want to run ads in the newspaper. You have designed two ads for the audience: one super interactive and interesting, the other mediocre.

People will surely love the first one, and a confidence interval for the difference in traffic will likely show that it draws more traffic than the second.

What about practicality?

Let’s say both ads generate more traffic than the threshold you set. Now, which ad will you choose?

Considering the time, space, and money spent on making each ad, the second one is clearly cheaper and takes less effort than the first.

Even though the first ad was statistically better, you will still go with the second one. It generates enough traffic while costing far less time and money than the first.

Practical Significance over Statistical Significance!
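One way to separate the two ideas in code is to compare a confidence interval against a minimum effect that would justify the extra cost. This is only a sketch: the interval bounds, the `min_worthwhile_lift` threshold, and the `assess_lift` helper are all invented for illustration.

```python
def assess_lift(ci_low, ci_high, min_worthwhile_lift):
    """Classify a CI for the traffic lift of the expensive ad over the cheap ad.

    Returns (statistically_significant, practically_significant).
    """
    # Statistically significant: the interval excludes zero.
    statistically_significant = ci_low > 0 or ci_high < 0
    # Practically significant: even the lower bound clears the lift
    # needed to pay for the extra time and money.
    practically_significant = ci_low >= min_worthwhile_lift
    return statistically_significant, practically_significant

# Hypothetical numbers: the fancy ad lifts traffic by 0.2% to 1.0%,
# but it would need at least a 2% lift to cover its extra cost.
stat_sig, prac_sig = assess_lift(0.002, 0.010, min_worthwhile_lift=0.02)
print(f"statistically significant: {stat_sig}, practically significant: {prac_sig}")
```

With these made-up numbers the difference is real but too small to matter, which is the situation in the cookie ad example.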

Hypothesis Testing

There are two possible outcomes: if the result confirms the hypothesis, then you’ve made a measurement. If the result is contrary to the hypothesis, then you’ve made a discovery. It is no good to try to stop knowledge from going forward. — Enrico Fermi

Every industry has its own questions about business and platform growth. As data scientists or data analysts, we have to answer those questions.

But first, we have to translate those questions into what is known as a hypothesis. Then, we have to collect data to determine which hypothesis the evidence supports.

Let’s take an example! You and your friend are having a debate about which ice cream flavor is the best in the world. You say it’s chocolate and your friend chooses vanilla.

But, what is the best ice cream flavor?

Chocolate and vanilla are just hypotheses created by you and your friend. But, does the data support your hypothesis?

H0: Chocolate and H1: Vanilla

You can use hypothesis testing or confidence intervals on the given sample of the population to draw a conclusion — which is the best ice cream flavor?
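As an illustration, the sketch below runs an exact binomial test on a made-up survey in which each person picks one of the two flavors. The survey numbers, the 5% significance level, and the reading of H0 as "chocolate is at least as popular as vanilla" are assumptions added here, not part of the original example.

```python
from math import comb

def binomial_lower_tail(successes, n, p=0.5):
    """P(X <= successes) for X ~ Binomial(n, p)."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(successes + 1)
    )

# Hypothetical survey: 100 people each pick chocolate or vanilla.
n_people = 100
chose_chocolate = 38

# H0: chocolate is at least as popular as vanilla (p >= 0.5).
# H1: vanilla is more popular (p < 0.5).
# The p-value is the chance of seeing this few chocolate votes
# (or fewer) if the true share were exactly 0.5.
p_value = binomial_lower_tail(chose_chocolate, n_people)

alpha = 0.05  # assumed significance level
reject_h0 = p_value < alpha
print(f"p-value = {p_value:.4f}, reject H0: {reject_h0}")
```

A small p-value means the survey would be surprising if chocolate really were at least as popular, so the data would favor vanilla.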

When performing hypothesis testing, the first thing we have to do is to translate the question into two hypotheses.

Null (H0) and Alternative (H1) Hypotheses

Setting up these hypotheses is subjective, but some general rules should be followed:

  1. H0 is what is believed to be true before you collect any data.
  2. H0 usually states that there is no effect or that two groups are equal.
  3. H0 and H1 are competing, non-overlapping hypotheses.
  4. H1 is what we would like to prove to be true.
  5. H0 contains an equal sign of some kind: either =, ≤, or ≥.
  6. H1 contains the opposite of the null: either ≠, >, or <.
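As a worked illustration of rules 5 and 6, suppose we want to show that a new web page converts better than the old one. The symbols p_new and p_old (the two conversion rates) are assumptions introduced for this example. One possible way to write the pair of hypotheses is:

```latex
H_0 : p_{\text{new}} - p_{\text{old}} \le 0
\qquad \text{versus} \qquad
H_1 : p_{\text{new}} - p_{\text{old}} > 0
```

The null carries the equality (≤), the alternative carries the strict inequality (>), and together they cover every possible value of the difference without overlapping.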

In the US judicial system, the principle is “innocent until proven guilty”, which mirrors our null and alternative hypotheses. For every defendant, the null hypothesis is innocence!

What are One-tailed and Two-tailed Tests?

A one-tailed test, also known as a directional hypothesis or directional test, is a hypothesis test in which the critical region is one-sided: it lies entirely above or entirely below a certain value. If the sample statistic falls in this one-sided critical region, the null hypothesis is rejected in favor of the alternative.

A two-tailed test is a hypothesis test in which the critical region is two-sided, covering values both greater than and less than a certain range. If the sample statistic falls in either part of the critical region, the null hypothesis is rejected.
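The difference between the two kinds of test shows up directly in the p-value. Here is a minimal sketch, assuming the test statistic is a z-score that follows a standard normal distribution under H0. The value z = 1.8, the 5% significance level, and the `tail_p_values` helper are made up for illustration.

```python
from statistics import NormalDist

def tail_p_values(z):
    """Return (upper-tail, lower-tail, two-tailed) p-values for a z statistic."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    upper = 1 - std_normal.cdf(z)       # H1: parameter > null value
    lower = std_normal.cdf(z)           # H1: parameter < null value
    two_tailed = 2 * min(upper, lower)  # H1: parameter != null value
    return upper, lower, two_tailed

alpha = 0.05  # assumed significance level
upper_p, lower_p, two_tailed_p = tail_p_values(1.8)

print(f"one-tailed (upper) p = {upper_p:.4f}, reject H0: {upper_p < alpha}")
print(f"two-tailed p         = {two_tailed_p:.4f}, reject H0: {two_tailed_p < alpha}")
```

With this z-score, the upper-tailed test rejects H0 at the 5% level while the two-tailed test does not, because the two-tailed test splits the 5% across both sides of the distribution.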

Summary

I have talked about a few statistical concepts in the article above:

  1. Confidence Intervals
  2. Practical vs Statistical Significance
  3. Hypothesis Testing

The next article will cover regression: linear and logistic regression techniques.

Peace!
