Mathematics & Statistics

Shreyal Gajare · Published in Omni Data Science · Jul 26, 2019

Part 2

So, I hope you are all well acquainted with Part 1 of this series on descriptive statistics. Let’s proceed further with the next part, i.e. inferential statistics.

B. Inferential Statistics:

Inferential statistics deals with drawing inferences regarding the population from the sample data collected. This is useful when it is impossible to study every member of the population.

1. Distribution:

A distribution is a function that shows the possible values of a variable and how frequently they occur. A categorical data distribution can be described by the number or percentage of observations in each group. A numerical data distribution can be arranged from smallest to largest and broken into reasonably sized groups (bins).
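Here is a minimal sketch of how the two kinds of distributions can be summarised, assuming NumPy is available; all the values below are made up for illustration.

```python
# A minimal sketch, assuming NumPy; all values are made up for illustration.
import numpy as np

# Categorical data: count the observations in each group.
colors = np.array(["red", "blue", "blue", "green", "red", "blue"])
groups, counts = np.unique(colors, return_counts=True)
print(dict(zip(groups, counts)))   # e.g. {'blue': 3, 'green': 1, 'red': 2}

# Numerical data: sort from smallest to largest and break into bins.
ages = np.array([23, 31, 35, 38, 41, 44, 52, 58, 63, 67])
bin_counts, bin_edges = np.histogram(ages, bins=3)
print(bin_counts, bin_edges)       # how many ages fall into each bin
```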

Types of Distributions

According to Wikipedia, a distribution “can be thought of as providing the probabilities of occurrence of different possible outcomes in an experiment”. So each probability distribution is associated with a graph describing the likelihood of occurrence of every outcome.

Normal Distribution:

The normal distribution is also known as the Gaussian distribution or the bell curve. It is the most commonly used distribution because it has the following properties:

  1. Distributions of sample means with large enough sample sizes can be approximated by a normal distribution.
  2. Many statistics computed from it are mathematically elegant and well understood.
  3. It can approximate a wide variety of random variables.
  4. It has a good track record in practice.
  5. It is heavily used in regression analysis.

Examples: stock market analysis, IQ tests, etc.
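As a quick illustration, here is a minimal NumPy sketch that draws samples from a normal distribution and checks that the sample mean and standard deviation come out close to the chosen parameters; the parameter values are arbitrary.

```python
# A minimal sketch, assuming NumPy; the parameters are arbitrary choices.
import numpy as np

rng = np.random.default_rng(seed=42)
samples = rng.normal(loc=100, scale=15, size=10_000)  # e.g. IQ-like scores

print(samples.mean())  # close to 100, the chosen mean
print(samples.std())   # close to 15, the chosen standard deviation
```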

Standard Normal Distribution (z):

The Standard Normal Distribution is a particular case of the normal distribution where the mean is 0 and the standard deviation is 1. Every normal distribution can be standardized.

Standardization is the process of turning a normally distributed variable into one with a standard normal distribution (a short sketch follows the list below). Standardization is useful because:

  1. it makes it easier to check for normality
  2. it helps detect outliers
  3. it is used in hypothesis testing
  4. it allows comparing different normally distributed datasets
  5. it is used in regression analysis
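Here is a minimal standardization (z-scoring) sketch, assuming the data sits in a NumPy array; the values are made up for illustration.

```python
# A minimal standardization sketch, assuming NumPy; values are made up.
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
z = (x - x.mean()) / x.std()   # z = (x - mean) / standard deviation

print(z.mean())  # ~0 after standardization
print(z.std())   # ~1 after standardization
```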

2. Central Limit Theorem:

It states that “no matter the underlying distribution of the data, the sampling distribution of the mean will approximate a normal distribution”. The mean of the sampling distribution equals the mean of the original distribution, while its variance is n times smaller, where n is the sample size. The distribution of sample means becomes increasingly normal as the sample size increases.
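A rough simulation sketch of this idea, assuming NumPy and an arbitrary sample size of 50: even though individual exponential draws are heavily skewed, their sample means behave approximately normally.

```python
# A rough Central Limit Theorem simulation, assuming NumPy; values are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
n = 50                                   # sample size (assumed value)
# 10,000 samples of size n from a skewed exponential distribution (mean 2.0)
sample_means = rng.exponential(scale=2.0, size=(10_000, n)).mean(axis=1)

print(sample_means.mean())  # close to the population mean, 2.0
print(sample_means.var())   # close to population variance / n = 4.0 / 50
```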

3. Estimators:

An estimator is a mathematical function that depends solely on sample information and approximates a population parameter. For example, the sample mean x̄ is an estimator of the population mean μ, and the sample variance s² is an estimator of the population variance σ².

Two important properties of estimators:

  1. Bias: The estimator should be unbiased. An unbiased estimator has an expected value equal to the population parameter. If the expected value of an estimator is (parameter + b), then the bias is b (see the sketch below).
  2. Efficiency: The most efficient estimator is the one with the least variance.
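A rough sketch of the bias idea, assuming NumPy and an arbitrary true variance of 9: the variance computed with denominator n underestimates the population variance, while the one with denominator n − 1 does not.

```python
# A rough sketch of estimator bias, assuming NumPy; parameters are arbitrary.
import numpy as np

rng = np.random.default_rng(1)
pop_var = 9.0                                         # true variance (sigma = 3)
samples = rng.normal(loc=0, scale=3, size=(100_000, 5))  # many samples of size 5

print(samples.var(axis=1, ddof=0).mean())  # noticeably below 9.0 (biased)
print(samples.var(axis=1, ddof=1).mean())  # close to 9.0 (unbiased)
```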

4. Estimates:

An estimate is the output that we get from the estimator. There are two types of estimates: point estimates and confidence interval estimates.

A point estimate is a single number (for example, the sample mean); for a symmetric confidence interval it sits exactly in the middle of the interval.

Confidence Interval: A confidence interval is a range of values within which we are confident, at a chosen confidence level (e.g. 95%), that the population parameter falls.
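Here is a minimal sketch of a 95% confidence interval for a mean, using the normal (z) approximation and made-up data; NumPy is assumed.

```python
# A minimal 95% confidence interval sketch, assuming NumPy; data is made up.
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(loc=50, scale=10, size=200)

mean = data.mean()                                  # point estimate
se = data.std(ddof=1) / np.sqrt(len(data))          # standard error of the mean
lower, upper = mean - 1.96 * se, mean + 1.96 * se   # 1.96 ~ 95% z-critical value

print(f"point estimate: {mean:.2f}")
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```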

5. Hypothesis Testing:

Hypothesis testing is a technique that involves asking a question, collecting data, and checking what the data tells us and how to proceed further. In short, “a hypothesis is an idea that can be tested”. The steps involved in data-driven decision making are:

  1. Formulate a hypothesis
  2. Find the right test
  3. Execute the test
  4. Make a decision

The hypothesis to be tested, called the null hypothesis, assumes that the observations are the result of chance alone. It is denoted by Ho.

The alternative hypothesis states that the observations are the result of a real effect. It is denoted by Ha.
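As a rough sketch of these steps, assuming SciPy is available, here is a one-sample t-test on made-up data, with Ho: the population mean is 50, and Ha: it is not.

```python
# A minimal hypothesis-testing sketch, assuming SciPy; the data is made up.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
data = rng.normal(loc=52, scale=10, size=100)   # made-up sample

t_stat, p_value = stats.ttest_1samp(data, popmean=50)
alpha = 0.05                                    # chosen significance level

print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
print("reject Ho" if p_value < alpha else "fail to reject Ho")
```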

6. Type I Error vs Type II Error:

A Type I error is also known as a false positive. It occurs when we reject a true null hypothesis. Its probability is alpha, which is the significance level chosen for the hypothesis test.

A Type II error is also known as a false negative. It occurs when we fail to reject a null hypothesis that is false. Its probability is denoted by beta, and the probability of avoiding a Type II error (1 − beta) is called the power of the test.
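A rough simulation of the Type I error rate, assuming SciPy: when the null hypothesis is actually true, a test at alpha = 0.05 should reject it in roughly 5% of repeated experiments.

```python
# A rough Type I error simulation, assuming SciPy; parameters are arbitrary.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
alpha, rejections, trials = 0.05, 0, 5_000

for _ in range(trials):
    data = rng.normal(loc=50, scale=10, size=30)   # Ho (mean = 50) is true
    _, p = stats.ttest_1samp(data, popmean=50)
    rejections += p < alpha                        # count false positives

print(rejections / trials)   # close to 0.05, the Type I error rate
```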

7. P-value:

The p-value is a number between 0 and 1 that helps determine the strength of the evidence against the null hypothesis. After choosing a significance level alpha, we compare the p-value to it: we reject the null hypothesis if the p-value is lower than the significance level.

A low p-value (≤ 0.05) indicates strong evidence against the null hypothesis, i.e. we can reject it. A high p-value (> 0.05) indicates weak evidence against the null hypothesis, i.e. we fail to reject it.

The p-value is a universal concept that works with every distribution.
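Here is a minimal sketch of turning a test statistic into a p-value, assuming a standard normal (z) test statistic and SciPy; the z value is made up.

```python
# A minimal p-value sketch, assuming SciPy; the z-statistic is made up.
from scipy import stats

z = 2.1                                    # hypothetical observed z-statistic
p_two_tailed = 2 * stats.norm.sf(abs(z))   # sf = 1 - cdf (upper tail), doubled

print(round(p_two_tailed, 4))              # ~0.0357, below 0.05 in this example
```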

8. Correlation:

Correlation is a measure of association between two quantitative variables. It helps in predicting one variable from the other. For example, height and weight are correlated: taller people tend to weigh more than shorter people.

The main result of a correlation analysis is known as the correlation coefficient.

Its value lies between -1 and 1, where:

  1. -1 denotes a perfect negative correlation
  2. 0 denotes no correlation
  3. 1 denotes a perfect positive correlation

(Figure: correlation measures, illustrating the range from negative, through zero, to positive correlation.)
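A minimal correlation sketch, assuming NumPy: the height and weight values are made up, but taller people are assigned higher weights, so the coefficient comes out strongly positive.

```python
# A minimal Pearson correlation sketch, assuming NumPy; values are made up.
import numpy as np

height_cm = np.array([150, 160, 165, 170, 175, 180, 185, 190])
weight_kg = np.array([52, 58, 63, 68, 72, 77, 82, 90])

r = np.corrcoef(height_cm, weight_kg)[0, 1]   # Pearson correlation coefficient
print(round(r, 3))                             # close to +1 (strong positive)
```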

So this is the end of this section. I hope you found it useful. For more content, stay tuned with Omni Data Science! Happy reading.
