Quantile, Percentile (one tail and two tail distribution), Confidence Interval, Box Plot
Well, the more I read about probabilistic models in deep learning, the more I realize how important and confusing this basic topic can be, so here I am trying to decipher it.
Quantiles, Quartiles and Percentiles
Quantiles are values that split sorted data into equal parts. In general terms, the q-quantiles divide sorted data into q equal-sized parts. The most commonly used quantiles have special names:
- Quartiles (4-quantiles): Three quartiles split the data into four parts.
- Deciles (10-quantiles): Nine deciles split the data into 10 parts.
- Percentiles (100-quantiles): 99 percentiles split the data into 100 parts.
We are going to use quartiles and percentiles.
We will see later that quartiles are special cases of percentiles.
Let's take a dataset of 15 samples:
data = [10,20,30,40,50,60,70,80,90,100,110,120,130,140,150]
When we ask for the 20th percentile, we are asking for the value in the above data below which 20% of the data lies.
Since we have 15 samples, 20% of 15 is 3. So which value has 3 samples below it? In our case that value is 40 (10, 20 and 30 lie below it).
So, under this counting definition, the 20th percentile (the value below which we have 20% of the data) for the above dataset is 40.
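The counting argument above can be checked with a few lines of Python. This is only a sketch: `fraction_below` is a helper name made up for illustration, not a library function. One thing worth knowing is that pandas interpolates linearly between data points by default, so its 20th percentile for this data comes out as 38 rather than 40. Passing `interpolation="higher"` makes it snap up to an actual data point, which matches the counting definition here.

```python
import pandas as pd

data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150]

def fraction_below(values, threshold):
    """Fraction of values strictly below threshold (hypothetical helper)."""
    return sum(v < threshold for v in values) / len(values)

# 3 of the 15 samples (10, 20, 30) lie below 40
frac = fraction_below(data, 40)

series = pd.Series(data)
# Default behaviour: linear interpolation between 30 and 40
p20_linear = series.quantile(0.2)
# Snap up to the next actual data point instead of interpolating
p20_higher = series.quantile(0.2, interpolation="higher")

print(frac, p20_linear, p20_higher)
```

So the exact number you get for a percentile depends on the convention used, especially on small datasets.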
The quartiles are three special percentiles that divide the data into 4 parts:
Q1: First Quartile or Lower Quartile (25th percentile)
Q2: Second Quartile or Median (50th percentile)
Q3: Third Quartile or Upper Quartile (75th percentile)
To compute the percentile value we can use the Pandas quantile function
import pandas as pd
data = pd.DataFrame([10,20,30,40,50,60,70,80,90,100,110,120,130,140,150])
To get Q1, Q2 and Q3:
data.quantile(0.25) ==> 45
data.quantile(0.5) ==> 80
data.quantile(0.75) ==> 115
You can use various other parameters in the quantile function: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.quantile.html
To get any other percentile, say the 40th or 60th percentile:
data.quantile(0.4) ==> 66
data.quantile(0.6) ==> 94
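As a small runnable sketch of the calls above: `quantile` also accepts a list of probabilities, so all three quartiles can be computed in one go. I am using a `pd.Series` here instead of a DataFrame only so that a single probability returns a plain number; the variable names are just illustrative.

```python
import pandas as pd

values = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150])

# Passing a list returns a Series indexed by the requested probabilities
quartiles = values.quantile([0.25, 0.5, 0.75])
q1, q2, q3 = quartiles.tolist()

# A single probability returns a single number
p40 = values.quantile(0.4)
p60 = values.quantile(0.6)

print(q1, q2, q3)
print(p40, p60)
```

These values rely on the default linear interpolation; the `interpolation` parameter documented at the link above changes how pandas handles positions that fall between two data points.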
- In a sample or dataset, the quartiles divide the data into four groups with equal numbers of observations.
- In a probability distribution, the quartiles divide the distribution’s range into four intervals with equal probability.
For a Gaussian distribution, we can compute the percentile values using the scipy.stats.norm.ppf function:
from scipy import stats
norm = stats.norm(loc=0, scale=1)
PPF: percent point function (the inverse of the CDF, i.e. it returns percentiles)
To get the quartile values Q1, Q2, Q3:
norm.ppf(0.25) → -0.67448
norm.ppf(0.5) → 0.0
norm.ppf(0.75) → 0.67448
To get any other percentile:
norm.ppf(.95) → 1.644853
norm.ppf(0.99) → 2.32634
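To make "inverse of the CDF" concrete, here is a minimal sketch for the standard normal. It checks two things: feeding the ppf output back into the cdf recovers the original probabilities, and the three quartiles cut the distribution into four intervals of equal probability. Variable names are my own.

```python
import numpy as np
from scipy import stats

norm = stats.norm(loc=0, scale=1)  # standard normal

probs = np.array([0.25, 0.5, 0.75])
quartiles = norm.ppf(probs)       # Q1, Q2, Q3 of the distribution
round_trip = norm.cdf(quartiles)  # cdf undoes ppf, giving back the probabilities

# Probability mass in each of the four intervals cut by the quartiles
edges = np.concatenate(([-np.inf], quartiles, [np.inf]))
interval_probs = np.diff(norm.cdf(edges))

print(quartiles)
print(round_trip)
print(interval_probs)
```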
Percentile for one-tail and two-tail distribution
The value passed to the ppf function is a cumulative probability: the fraction of the distribution that lies below the value ppf returns. It is often loosely called alpha (α), but strictly speaking the significance level α is the probability left in the tail(s), which matters for the two-tail case.
There is one more thing to understand: the one-tail distribution and the two-tail distribution.
For a one-tail distribution it is pretty straightforward, as can be seen from the plots below for the 25th, 50th, or 80th percentiles. To compute the value for each of these percentiles we use norm.ppf directly.
The 25th percentile is the value below which 25% of the data lies, so we pass 0.25 ==> norm.ppf(0.25) → -0.67448
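A one-tail cut can also sit on the right side. As a hedged sketch: SciPy's `isf` (inverse survival function) returns the value above which a given fraction of the distribution lies, which is the same point as the ppf of one minus that fraction. The 5% figure below is just an example I picked.

```python
from scipy import stats

norm = stats.norm(loc=0, scale=1)

lower_cut = norm.ppf(0.25)        # 25% of the distribution lies below this value
upper_cut = norm.isf(0.05)        # 5% of the distribution lies above this value
same_as_ppf = norm.ppf(1 - 0.05)  # same point expressed as the 95th percentile

print(lower_cut, upper_cut, same_as_ppf)
```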
For a two-tail distribution, the data is taken about the mean. So when we ask for 50% of the data, we compute the values that enclose the central 50% around the mean: 25% to the left of the mean and 25% to the right. The remaining probability is the significance level alpha, and it is split equally between the two tails to get the lower and upper values:
data_percent = 0.5
# Significance level (alpha) for a two-tailed test
alpha = 1 - data_percent ==> 0.5
# Calculate the critical values for the tails
critical_value_left = norm.ppf(alpha / 2) ==> norm.ppf(0.25)
critical_value_right = norm.ppf(1 - alpha / 2) ==> norm.ppf(0.75)
Similarly, for 95% of the data, the remaining 5% is divided between the left tail (2.5%) and the right tail (2.5%):
data_percent = 0.95
# Significance level (alpha) for a two-tailed test
alpha = 1 - data_percent ==> 0.05
# Calculate the critical values for the tails
critical_value_left = norm.ppf(alpha / 2) ==> norm.ppf(0.025)
critical_value_right = norm.ppf(1 - alpha / 2) ==> norm.ppf(0.975)
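The two-tail recipe can be wrapped into one small function. This is a sketch, not a definitive implementation: `central_interval` is a name I made up, and it assumes an equal split of alpha between the two tails. For comparison, SciPy's frozen distributions also offer an `interval` method, which should return the same bounds.

```python
from scipy import stats

norm = stats.norm(loc=0, scale=1)

def central_interval(data_percent, dist=norm):
    """Bounds enclosing the central data_percent of dist (hypothetical helper).

    Assumes the leftover probability alpha is split equally between both tails.
    """
    alpha = 1 - data_percent
    return dist.ppf(alpha / 2), dist.ppf(1 - alpha / 2)

left_50, right_50 = central_interval(0.5)
left_95, right_95 = central_interval(0.95)

# SciPy's own equal-tailed interval, for comparison
builtin_95 = norm.interval(0.95)

print(left_50, right_50)
print(left_95, right_95)
print(builtin_95)
```

For the standard normal, the 50% bounds are just Q1 and Q3, and the 95% bounds are the familiar values near ±1.96.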
To Dos
Confidence Interval & Box Plot
References