
Part 2: Statistics and Probability in Data Science | Data Science 2020

Aman Kapri
Published in Analytics Vidhya · Feb 13, 2020


This post is the continuation of Part 1, which covered the basics of Statistics. Please refer to the link → Statistics Basics.

Before starting with the probability basics, let’s try to understand the distance of the data points from the mean.

Calculating the Distance of the Data Points from the Mean

There are many approaches to calculating the distance of the data points from the mean. Let us check a few of them; a short code sketch after the list shows how each is computed.

· Average Distance to Mean: The average signed distance of all the points from the mean is always zero (the negative and positive deviations cancel each other out).

· Mean Absolute Deviation (MAD): The mean absolute deviation of a dataset is the average distance between each data point and the mean, where only the absolute value of each deviation is considered. It gives us an idea of the variability in a dataset and gives equal importance to all the errors.

· Mean Squared Distance (MSE): The average of the squared distances from the mean gives the variance (σ²). It is used in scenarios where we want larger errors to be penalized more heavily and smaller errors to be shrunk.

· Root Mean Squared Distance (RMSE): The square root of the mean squared distance (the variance) gives the standard deviation (σ).
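As a quick illustration of how these quantities relate, here is a minimal NumPy sketch; the array `data` is a made-up sample, not taken from the article.

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])  # illustrative sample
deviations = data - data.mean()

# Average (signed) distance to the mean: always ~0
avg_distance = deviations.mean()

# Mean Absolute Deviation (MAD): average of |x - mean|
mad = np.abs(deviations).mean()

# Variance: average of the squared deviations
variance = np.mean(deviations ** 2)

# Standard deviation: square root of the variance
std_dev = np.sqrt(variance)

print(avg_distance, mad, variance, std_dev)  # 0.0, 1.5, 4.0, 2.0
```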

Probability Basics

What is probability?

Probability is a measure of the likelihood of a future event, i.e. the extent to which something is likely to happen.

Sample Space (S): Set of all the possible outcomes.

Event: A subset of the sample space.

Mutually Exclusive Events: When the occurrence of one event prevents the occurrence of the other, the events are said to be mutually exclusive.

P(A or B) = P(A) + P(B)

Independent Events: When the occurrence of one event does not affect the occurrence of the other, the events are said to be independent.

P(A and B) = P(A) × P(B)
P(A or B) = P(A) + P(B) − P(A) × P(B)
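As a quick numeric check of the rules for independent events (a sketch only; the two-coin setup and sample size are my own choice, not from the article), we can simulate two independent fair coins and compare the empirical frequencies to the formulas:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Two independent fair coin flips: 1 = heads (the event), 0 = tails
a = rng.integers(0, 2, n)
b = rng.integers(0, 2, n)

p_a = a.mean()                              # ~0.5
p_b = b.mean()                              # ~0.5
p_a_and_b = np.mean((a == 1) & (b == 1))    # ~0.25 = P(A) × P(B)
p_a_or_b = np.mean((a == 1) | (b == 1))     # ~0.75 = P(A) + P(B) − P(A) × P(B)

print(p_a, p_b, p_a_and_b, p_a_or_b)
```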

Types of Probabilities:

- Marginal probability

- Joint probability

- Union probability

- Conditional probability

We can calculate the above-mentioned probabilities with the help of a contingency table; a short pandas sketch after the examples below shows the calculations.

Marginal probability: Probability of a single attribute.

E.g. P(Yes) = 0.184, P(Young) = 0.301, etc.

Joint probability: Probability describing the combination of attributes.

E.g. P(Young and Yes) = 0.077, P(Middle and No) = 0.567

Union probability: The probability that at least one of two events occurs; it is computed from the marginal and joint probabilities.

E.g. P(Yes or Young) = P(Yes) + P(Young) − P(Yes and Young)

Conditional probability: The probability of A occurring given that B has already occurred. In this scenario, the sample space is restricted to a single row or column of the contingency table, and the rest of the table becomes irrelevant.

E.g. P(Yes | Young) = P(Yes and Young) / P(Young)
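Since the article's contingency table itself is not reproduced here, the sketch below uses a small made-up dataset and pandas to show how the marginal, joint, union, and conditional probabilities would be computed from a crosstab; the column names and values are purely illustrative.

```python
import pandas as pd

# Hypothetical raw data: each row is one person with an age group and a yes/no response
data = pd.DataFrame({
    "age":      ["Young", "Young", "Middle", "Middle", "Old", "Young", "Middle", "Old"],
    "response": ["Yes",   "No",    "No",     "No",     "Yes", "Yes",   "No",     "No"],
})

# Joint probability table: P(age, response)
joint = pd.crosstab(data["age"], data["response"], normalize="all")

p_young     = joint.loc["Young"].sum()    # marginal P(Young)
p_yes       = joint["Yes"].sum()          # marginal P(Yes)
p_young_yes = joint.loc["Young", "Yes"]   # joint P(Young and Yes)

# Union: P(Yes or Young) = P(Yes) + P(Young) − P(Yes and Young)
p_union = p_yes + p_young - p_young_yes

# Conditional: P(Yes | Young) = P(Yes and Young) / P(Young)
p_yes_given_young = p_young_yes / p_young

print(joint)
print(p_young, p_yes, p_young_yes, p_union, p_yes_given_young)
```

The same pattern works for any number of categories: normalize the crosstab to get joint probabilities, sum rows or columns for marginals, and divide a joint probability by a marginal for a conditional.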

Bayes Theorem

Bayes' theorem helps us find the conditional probability of an event based on the prior probability and the likelihood of the event occurring.

P(A | B) = P(B | A) × P(A) / P(B)
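A minimal worked example of Bayes' theorem, assuming made-up numbers for a diagnostic-test scenario (the prior, likelihood, and false-positive rate below are illustrative, not from the article):

```python
# Hypothetical numbers for a diagnostic test
p_disease = 0.01            # prior: P(Disease)
p_pos_given_disease = 0.95  # likelihood: P(Positive | Disease)
p_pos_given_healthy = 0.05  # false-positive rate: P(Positive | No Disease)

# Total probability of a positive test (the evidence, P(Positive))
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior: P(Disease | Positive) = P(Positive | Disease) × P(Disease) / P(Positive)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(round(p_disease_given_pos, 3))  # ≈ 0.161
```

Even with a fairly accurate test, the posterior probability stays low here because the prior P(Disease) is small.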

Random Variable

A random variable is a real-valued function whose domain is the sample space of the random experiment. It can take multiple values with different probabilities.

E.g. Let X be the random variable defined as the sum of two fair dice. Then the probabilities of the sum being 2, 3, 4, or 5 are as follows (a short enumeration script appears after them):

P(X = 2) = P({(1,1)}) = 1/36
P(X = 3) = P({(1,2), (2,1)}) = 2/36
P(X = 4) = P({(1,3), (2,2), (3,1)}) = 3/36
P(X = 5) = P({(1,4), (2,3), (3,2), (4,1)}) = 4/36
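These probabilities can be checked by enumerating all 36 equally likely outcomes of two fair dice; the sketch below is one way to do that in Python:

```python
from itertools import product
from collections import Counter
from fractions import Fraction

# Count how many of the 36 equally likely (die1, die2) outcomes give each sum
sums = Counter(d1 + d2 for d1, d2 in product(range(1, 7), repeat=2))

# P(X = k) for k = 2..5, as exact fractions
for k in range(2, 6):
    print(k, Fraction(sums[k], 36))
# prints 1/36, 1/18 (= 2/36), 1/12 (= 3/36), 1/9 (= 4/36)
```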

Probability Distribution

The mathematical function or rule that assigns probabilities to the values of a random variable is known as a probability distribution.


Normal Distribution

The normal distribution, also known as the Gaussian distribution, is a probability distribution that is symmetric about the mean, showing that data near the mean are more frequent in occurrence than data far from the mean. In graph form, the normal distribution appears as a bell curve.


Equation:

f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))

In a normal distribution, Mean = Median = Mode

The normal distribution curve describes the spread of the data around the mean: 50% of the data lies to the left of the mean and 50% to the right.

The percentage of data within a given distance of the mean can be expressed in terms of the standard deviation (σ).


In summary,
around 68% of the data falls within ±1σ of the mean
around 95% of the data falls within ±2σ of the mean
around 99.7% of the data falls within ±3σ of the mean
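A quick way to sanity-check the empirical rule is to draw a large number of samples from a standard normal distribution and count how many fall within 1, 2, and 3 standard deviations; the sample size and seed in this sketch are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=0.0, scale=1.0, size=1_000_000)  # standard normal samples

for k in (1, 2, 3):
    within = np.mean(np.abs(x) < k)  # fraction within k standard deviations
    print(f"within ±{k}σ: {within:.4f}")
# roughly 0.6827, 0.9545, 0.9973
```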

Detecting outliers with the Z-score

Any z-score greater than 3 or less than -3 is considered an outlier. This rule of thumb is based on the empirical rule. From this rule, we see that almost all of the data (99.7%) should be within three standard deviations from the mean.

By calculating the z-score we are standardizing the observations, so the mean becomes 0 and the standard deviation becomes 1. Thus, from the empirical rule, we expect 99.7% of the z-scores to fall between −3 and 3.
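A minimal sketch of z-score-based outlier detection, using a made-up sample with one deliberately planted extreme value (the data and the ±3 threshold are illustrative of the rule above, not code from the article):

```python
import numpy as np

rng = np.random.default_rng(7)
# 1,000 values around 50 with standard deviation 5, plus one planted outlier at 95
data = np.append(rng.normal(loc=50, scale=5, size=1000), [95.0])

z_scores = (data - data.mean()) / data.std()
outliers = data[np.abs(z_scores) > 3]

print(outliers)  # contains the planted 95.0 (and possibly a few chance extremes)
```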

Hypothesis Testing

Hypothesis tests give us a way of using sample data to check whether a statistical claim about a population is likely to be true.

They allow us to make a decision about such claims based on the evidence in the samples.

Steps of Hypothesis Testing (a small worked example follows the list):

1) Select the hypotheses.

H0: the null hypothesis (the given condition)
H1: the alternative hypothesis (the condition that contradicts H0)

2) Choose the test statistic based on the sample size.
E.g. Z-test, t-test, etc.

3) Define the significance level (α),
where α = 1 − confidence level.
E.g. for a 95% confidence level, α = 1 − 0.95 = 0.05

4) Define the critical region

5) Find the P-value

6) Check whether the test statistic falls in the critical region.

7) If it falls in the critical region (equivalently, if the p-value is less than α), reject the null hypothesis (make a decision).
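As a minimal sketch of these steps, assume a hypothetical sample of delivery times and the claim H0: μ = 30 versus H1: μ ≠ 30; a one-sample t-test with SciPy might look like this (the data and the choice of test are my own illustration, not from the article):

```python
import numpy as np
from scipy import stats

# Step 1: H0: mean delivery time = 30 minutes, H1: mean ≠ 30 (hypothetical data)
sample = np.array([31.2, 29.8, 33.5, 30.9, 32.1, 34.0, 29.5, 31.8, 32.6, 30.4])

# Step 3: significance level for a 95% confidence level
alpha = 0.05

# Step 2: small sample, unknown population variance -> one-sample t-test
t_stat, p_value = stats.ttest_1samp(sample, popmean=30)

# Steps 5-7: find the p-value and make a decision
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value < alpha:
    print("Reject H0: the sample mean differs significantly from 30.")
else:
    print("Fail to reject H0: no significant difference detected.")
```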

Null Hypothesis

A null hypothesis proposes that no significant difference exists in a set of given observations.
Null: the two population means are equal
Alternative: the two population means are not equal

Alternative Hypothesis

The statement that is opposite of the null hypothesis is called an alternative hypothesis.

P-value

In statistics, the p-value is the probability of obtaining results at least as extreme as the observed results of a test, assuming that the null hypothesis is correct.

The p-value is used as an alternative to rejection points to provide the smallest level of significance at which the null hypothesis would be rejected. A smaller p-value means that there is stronger evidence in favor of the alternative hypothesis.

Critical region

The critical region is the region of values that corresponds to the rejection of the null hypothesis at some chosen probability level.

That’s all for this article.

I will post the other topics related to statistical tests and the differences between them in the upcoming article.

Thanks for reading :)
