Statistics for ML
Probability
Set theory
Bayes Theorem
- Given the likelihood P(Evidence | Hypothesis) and the prior P(Hypothesis), find the posterior P(Hypothesis | Evidence)
Basic Stats
- central tendency / typical value: central tendency is the middle point of a data's distribution. Measures: mean, median, mode. Mode is used for categorical variables; it cannot be calculated for continuous numerical variables, as the values are too scattered for any single value to repeat
- dispersion: dispersion is the spread of data in the distribution, i.e. how different the actual values are from the mean. Common measures: SD, SD² = Var, range, percentiles
- skewness
- kurtosis
Curves A and B are skewed towards the left and right respectively, rather than being symmetrical around their centre. Skewness is a numerical metric representing this deviation from symmetry.
Curves C and D share the same centre; however, C is more spread out while D is more peaked. Kurtosis is a measure of the peakedness (and tail heaviness) of the data distribution.
problem: whether a new counter is required at an ice cream parlour
solution:
step 1: find out if people are waiting in queue
Observe 10 samples at random times — noting the number of customers in queue.
Then we can calculate the mean and SD. Based on these, we can decide whether we need a new counter
Population Mean is denoted by μ while sample mean is denoted by x bar.
μ= Σ x / N, where N is the number of elements in population
x bar = Σ x / n, where n is the number of elements in sample
Calculating Mean from Grouped Data
If individual observations are not available, but their frequency distribution is available for each class
Saving Account Balance(Class) | Frequency
0-50k | 30
50k - 100k | 20
100k - 200k | 15
In this case, we can still derive approximate mean of this sample using
x bar = Σ (f * x) / n
where, f is frequency, x is mid point of each class, n is total observations in sample
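A quick sketch of the grouped-data mean for the table above (using the class midpoints, as the formula requires):

```python
# Approximate mean from grouped data: x_bar = sum(f * x) / n,
# using the class midpoints of the savings-account table.
mids = [25_000, 75_000, 150_000]   # midpoints of 0-50k, 50k-100k, 100k-200k
freqs = [30, 20, 15]

n = sum(freqs)
grouped_mean = sum(f * x for f, x in zip(freqs, mids)) / n
print(round(grouped_mean, 2))  # → 69230.77
```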
Weighted Mean
If the observations in a sample have different levels of importance, we should use a weighted mean instead of a simple arithmetic mean. Likewise, if the values in the sample do not occur with the same frequency, a weighted mean should be used.
Geometric means should be used when the metric changes over time and has a multiplicative effect, e.g. the annual interest rate in a multi-year account balance calculation. In such cases we take the nth root of (the product of all n individual sample values).
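A minimal sketch of the geometric mean for multi-year growth (the rates are illustrative):

```python
import math

rates = [0.10, 0.20, -0.05]        # +10%, +20%, -5% over three years (illustrative)
factors = [1 + r for r in rates]   # growth factors: 1.10, 1.20, 0.95

# nth root of the product of the n values
geo_mean = math.prod(factors) ** (1 / len(factors))
# the arithmetic mean of the factors would overstate the average growth
```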
Distributions
A distribution is chart between P(X) on y-axis and X on x-axis.
We match the shape of our data to one of the 18-20 standard / common distributions. Then we can simply use the equation of that standard distribution instead of calculating the equation of our own custom distribution. The matching doesn't have to be done visually; it can be done with statistical goodness-of-fit tests.
In practice, after applying transformations like log, many non-normal variables start showing a normal distribution
Std distributions have some properties also, which can be used for simplification of calculations.
There are discrete and continuous distributions. we are going to focus only on continuous distributions.
Examples of Distributions
Discrete Distributions
- Bernoulli
- Binomial
- Poisson
- Multinomial
- Hypergeometric
Continuous Distributions
- Normal
- Uniform distribution
- Gamma
Normal Distributions and “Law of Large numbers”
Normal Distribution
It is a function of Mean and SD. The values of these affect the shape of the normal distribution.
Properties of Normal Distribution
- If X ~ N(mu, sd) then (X - mu) / sd ~ N(0, 1)
This is called the "standard normal" distribution, with mean = 0 and sd = 1
mu represents the mean; sd represents the standard deviation
TODO: practical application of std normal using Z Table
E.g. we have a normal distribution of salary, with mu = 20K and sd = 500
if we want to calculate the probability that a person earns at most 20.3K:
z = (20,300 - 20,000) / 500 = 0.6
Look up 0.6 in the N(0,1) table to get the probability.
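The Z-table lookup can be done in code with scipy (the salary numbers here are illustrative):

```python
from scipy.stats import norm

mu, sd, x = 20_000, 500, 20_300   # illustrative salary distribution
z = (x - mu) / sd                 # standardize: z = 0.6
p = norm.cdf(z)                   # P(X <= 20_300), i.e. the Z-table value for 0.6
```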
- A normal distribution is symmetric about the mean. It has some implications:
a) P(x ≤ Mean) = P(X ≥ Mean)
b) Mean = Median=Mode
c) If X ~ N(mu, sd) THEN
the density is symmetric: f(mu - delta) = f(mu + delta)
P(X ≤ mu - delta) = P(X ≥ mu + delta)
d) IF X ~ N(mu, sd) THEN
P(mu - sd ≤ X ≤ mu + sd) ≈ 68%
P(mu - 2 * sd ≤ X ≤ mu + 2 * sd) ≈ 95%
P(mu - 3 * sd ≤ X ≤ mu + 3 * sd) ≈ 99.7%
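These empirical-rule percentages can be verified from the standard normal CDF:

```python
from scipy.stats import norm

# P(mu - k*sd <= X <= mu + k*sd) reduces to P(-k <= Z <= k) for Z ~ N(0,1)
for k in (1, 2, 3):
    p = norm.cdf(k) - norm.cdf(-k)
    print(k, round(p, 4))   # → 0.6827, 0.9545, 0.9973
```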
In stock markets, we use a “Log Normal” distribution
Sources of data: investing.com
- stock market prices
- macro variables like inflation
Why do we do Sampling?
- we will never have data of the entire population. It is too time consuming to capture such data
- Incremental benefit is not enough to justify the cost of capturing and analyzing entire population data
- To separate data, to avoid overfitting
There are 3 types of sampling
- SRSWR: simple random sampling with replacement
- SRSWOR: simple random sampling without replacement
- Stratified: we separate out the population based on one or more variables. This creates multiple pools, each being representative of one population type. Within these pools, we do either SRSWR or SRSWOR
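The three schemes can be sketched with the standard library (stratifying here on parity, an arbitrary illustrative variable):

```python
import random

random.seed(0)
population = list(range(100))

srswr = random.choices(population, k=10)   # with replacement: duplicates possible
srswor = random.sample(population, k=10)   # without replacement: no duplicates

# stratified: split the population into pools, then sample within each pool
strata = {"even": [x for x in population if x % 2 == 0],
          "odd":  [x for x in population if x % 2 == 1]}
stratified = {name: random.sample(pool, k=5) for name, pool in strata.items()}
```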
Estimation
- Point Estimation
- Interval Estimation
Hypothesis testing techniques
- testing for the mean — T-Test / Z-Test, Anova
Estimations
Degrees of freedom
Lets assume we have 4 variables: Z1, Z2, Z3, Z4
and Z1+ Z2 + Z3 + Z4 = 10
How many variables in the above equation can take any random value? It is 3. Because once you select the values of 3 variables freely, the 4th one gets fixed.
Chi Square distribution
If X1, X2, X3.. Xn ~ N(0,1)
Then Y = X1² + X2² + … + Xn²
Then Y follows a Chi Square distribution; for n such variables, the degrees of freedom of Y = n
Shape of Chi Square distributions depend on n
For n = 1, it is half the normal distribution, with only the right half of the shape of N(0,1). The probabilities get doubled because the left-side probabilities get folded onto the positive side
e.g. the density at -1 is reflected onto +1 and added to the existing density at +1
What is the use of Chi Square distributions and what statistics follow this distribution? Variance follows Chi-Sq.
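The folding for n = 1 can be checked numerically: if X ~ N(0,1) then P(X² ≤ y) = P(-√y ≤ X ≤ √y) = 2Φ(√y) - 1:

```python
from math import sqrt
from scipy.stats import chi2, norm

y = 1.5                            # arbitrary positive point
lhs = chi2.cdf(y, df=1)            # chi-square CDF with 1 df
rhs = 2 * norm.cdf(sqrt(y)) - 1    # folded standard normal
print(round(lhs, 6), round(rhs, 6))
```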
Heisenberg's Uncertainty Principle (as an analogy)
The probability of a continuous variable taking any one exact value is zero; only intervals of values have non-zero probability.
T Distribution
It is another derived distribution, built from the ratio of a Normal variable and (the square root of a scaled) Chi Square variable.
If X ~ N(0,1)
and Y ~ Chi Square distribution, with df = k
then T = X / Sqrt(Y/k) ~ t-dist with k df
Which statistic follows the T distribution? The standardized sample mean, when the population SD is unknown and estimated by the sample SD.
Eg.
The sample mean itself follows a Normal distribution
n: number of observations in the sample
(mean(X) - mu) / sqrt(Var(x)/n) ~ T-dist
= (mean(X) - mu) / (sd(x) / sqrt(n)) ~ T-dist with df = n - 1
The t-dist looks similar to N, but is a more spread out, pressed-down version of N, i.e. it has heavier tails.
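The heavier tails, and the convergence to N(0,1) as df grows, can be seen from the densities:

```python
from scipy.stats import norm, t

x = 3.0
tail_t = t.pdf(x, df=5)          # t density out in the tail
tail_n = norm.pdf(x)             # standard normal density at the same point
tail_t_big = t.pdf(x, df=1000)   # large df: nearly indistinguishable from normal
```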
TODO: Model some data in excel and plot these values
F-Distribution
IF
Y is a variable which follows Chi Sq, with df = k1
Z is another variable which follows Chi Sq, with df = k2
Then
(Y/k1)/(Z/k2) ~ F-dist with df=k1,k2
(F-dist has two degrees of freedom)
Which statistic follows the F-dist? The ratio of two sample variances.
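A quick numeric check of an F-dist property that follows directly from the definition: if X ~ F(k1, k2) then 1/X ~ F(k2, k1), so P(X ≤ x) = 1 - P(1/X ≤ 1/x):

```python
from scipy.stats import f

k1, k2, x = 3, 5, 2.0            # arbitrary dfs and evaluation point
lhs = f.cdf(x, k1, k2)           # P(X <= x) for X ~ F(k1, k2)
rhs = 1 - f.cdf(1 / x, k2, k1)   # same probability via the reciprocal
```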
Estimations
Guessing the population mean and sd using a small sample
Point Estimation
Depending on the situation, mean / mode / median can be used to calculate the central tendency of a variable.
The unbiased estimator of the population mean is the sample mean (average)
The best estimator of the population variance is the sample variance, which uses the (n - 1) correction:
- Biased (population-formula) variance = 1/n * sum( ( x - mean(x) )^2 )
- Sample variance = 1/(n - 1) * sum( ( x - mean(x) )^2 )
The correction is needed because the 1/n formula, computed on a sample, is an unbiased estimator of (n-1)/n times the population variance, i.e. it underestimates it
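Python's statistics module implements both formulas, which makes the bias factor easy to see (illustrative sample):

```python
import statistics

sample = [2, 4, 4, 4, 5, 5, 7, 9]        # illustrative data, mean = 5
biased = statistics.pvariance(sample)    # 1/n formula (population formula)
unbiased = statistics.variance(sample)   # 1/(n-1) formula (sample variance)
n = len(sample)
# the 1/n formula equals (n-1)/n times the unbiased estimate
print(biased, unbiased)
```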
Central Limit Theorem, and deciding the optimal sample size to ensure a certain % probability / confidence in the estimate
- extract multiple samples from a large set of observation
- calculate their average
- The averages will follow normal distribution
sample mean(x) ~ N(population mean, population sd / sqrt(n))
where n = sample size
- x is a random variable and can have a distribution. It can be any type of distribution: normal, chi sq or any other
- sample mean(x) is also a random variable and has its own distribution. By the CLT, this distribution approaches a Normal distribution
- Now, if we observe the values of the sample mean, they also come mostly from the places where the population is concentrated. Therefore, this distribution is centred on the population mean
- Now let's observe the dispersion of this distribution. If the samples are small, the dispersion of the sample means is high. As the samples get larger, the dispersion decreases. Therefore the SD of this distribution is "Population SD / Sqrt(Sample Size)"
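A small simulation sketch of the CLT using a uniform (clearly non-normal) population; the seed and sizes are arbitrary:

```python
import random
import statistics

random.seed(42)
n, trials = 50, 2000

# means of repeated samples drawn from Uniform(0, 1)
sample_means = [statistics.mean(random.uniform(0, 1) for _ in range(n))
                for _ in range(trials)]

pop_mean, pop_sd = 0.5, (1 / 12) ** 0.5   # exact moments of Uniform(0, 1)
se_theory = pop_sd / n ** 0.5             # predicted SD of the sample means
se_empirical = statistics.stdev(sample_means)
```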
For a particular sample,
(mean(x) - population mean) / Std Error ~ N(0,1)
where SE = Popln SD / sqrt(sample size)
Also, (mean(x) - population mean) / Std Error lies between +/- 3 with 99.7% confidence. This is because mean = 0 and SD = 1 in N(0,1), so "mean +/- 3 SD" = +/- 3
or in other words
- -3 ≤ (mean(x) - popln mean) / Std Err ≤ 3 with 99.7% confidence
- -3 * SE - mean(x) ≤ -popln mean ≤ 3 * SE - mean(x)
- mean(x) - 3 SE ≤ popln mean ≤ mean(x) + 3 SE
population mean lies between sample mean +/- 3 SE with 99.7% confidence
Population Mean can be calculated using above equation:
- Sample mean can be computed from Sample
- SE needs the population SD, which is collected from secondary research. For macroeconomic variables, the dispersion / disparity remains roughly constant over a period of time (say the last 10-20 years), although the mean keeps changing. It does change over very long periods, like 50 years. If you don't have this statistic, just use the sample SD as a proxy for the population SD.
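Putting the interval together for an illustrative sample, using the sample SD as the proxy:

```python
import statistics

data = [48, 52, 50, 47, 53, 51, 49, 50, 46, 54]   # illustrative observations
n = len(data)
x_bar = statistics.mean(data)
se = statistics.stdev(data) / n ** 0.5            # sample SD as proxy for popln SD
interval = (x_bar - 3 * se, x_bar + 3 * se)       # ~99.7% interval for popln mean
```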
Interval Estimation
Hypothesis Testing
E.g. Test the elasticity of demand for a particular product.
- Null Hypothesis (represented by H0): the default statement that is assumed true until the data provides evidence against it. Eg, under the Indian Penal Code, a "person is innocent" until proven guilty
H0: statistic = value
Null Hypothesis always has an equals sign
2. Alternate Hypothesis (H1): Eg. that the person is guilty
H1: can be
- statistic < value (left-tailed test)
- statistic > value (right-tailed test)
- statistic <> value (2-tailed test)
3. Test Statistic
- The statistics / calculation which forms the basis of experiment
- Designed in such a way that it follows a known distribution
h0: mean = 20
h1:
test stat: (sample mean - 20) / (sample SD / sqrt(n))
4. Each Test Stat ~ A Known Dist
5. Errors
- Reject the null hypothesis when it is actually true | Type 1 Error | False Positive
- Accept the null hypothesis when it is actually false | Type 2 Error | False Negative
6. Significance of Error
- confidence should be high
- probability of error should be low
Steps of doing Hypothesis Testing
- Define H0
- Define H1 (<, >, <>)
- calculate test stat
- assume H0 is true
- calculate the P value. The P value is the probability of finding a test stat as extreme (as high or as low) as the one you found, assuming H0 is true
- fail to reject the null (i.e. reject the alternate) if P value ≥ significance level / alpha
- accept the alternate (reject the null) if P value < sig level / alpha
Selecting the Test based on problem type
1. Testing mean for a particular population
H0: mean = value
we can possibly do 2 types of tests
- Z-test: used when the sample size > 30 and the population SD is known. It is called a Z-test because the test statistic follows a normal distribution.
- T-test: every other case. It is called a t-test because the test statistic follows a T distribution.
The t-test gives a P value: the probability of observing a sample mean this extreme if H0 were true for the entire population. If P > alpha, we fail to reject H0.
If it is not, the sample mean tells the direction of where the population mean lies: below or above the H0 postulated mean.
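A one-sample t-test sketch with scipy (the data and H0 value are illustrative):

```python
from scipy import stats

sample = [25, 26, 24, 27, 25, 23, 26]                     # illustrative observations
t_stat, p_value = stats.ttest_1samp(sample, popmean=20)   # H0: mean = 20
# small p => reject H0; the positive t (sample mean > 20) gives the direction
```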
2. Testing the mean of one sample with mean of another sample, from different populations
Eg. testing whether the retail price of petrol in Australia and India is the same in dollar-equivalent terms.
H0: mean 1 = mean 2
Test Statistics
- 2 sample t-test
The t-test takes the means of both samples and gives a P value: the probability of observing a difference this large if H0 were true. If P > alpha, we fail to reject H0.
If it is not, the two sample means tell the direction of the difference.
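A two-sample (independent) t-test sketch; the prices are made-up illustrative numbers:

```python
from scipy import stats

prices_a = [1.10, 1.15, 1.08, 1.12, 1.11]   # sample from population 1
prices_b = [1.30, 1.28, 1.35, 1.27, 1.32]   # sample from population 2
t_stat, p_value = stats.ttest_ind(prices_a, prices_b)   # H0: mean1 = mean2
# small p => the two population means likely differ
```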
3. Testing mean of 2 samples drawn from the same population
Eg. No of people voting for BJP has increased / decreased after the BJP Govt came to power.
H0: mean 1 = mean 2
- Ask the same set of people, before and after the Govt change.
- The test is called “pair wise t-test”
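A pair-wise (paired) t-test sketch: the same units measured before and after (illustrative counts):

```python
from scipy import stats

before = [10, 12, 9, 11, 13, 10]
after  = [12, 14, 10, 13, 15, 12]
t_stat, p_value = stats.ttest_rel(before, after)   # H0: no change in the mean
# small p => the before/after means differ; sign of t gives the direction
```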
4. Testing multiple means from same population
H0: mean 1 = mean 2 = mean 3 = mean 4
ANOVA (Analysis of Variance). Its test statistic follows an F distribution, which has 2 degrees-of-freedom parameters
H1: at least one of the means is not equal to the others
Eg.
While doing Regression Analysis, Y = F(X), where X can have 4 categories. To test if X and Y are related, we compare the following means using ANOVA
- mean 1 = Mean(Y) When X=X1
- mean 2 = Mean(Y) When X=X2
- mean 3 = Mean(Y) When X=X3
- mean 4 = Mean(Y) When X=X4
Hypothesis Testing — Practice
Chi Square
# Chi Square Test
# ---------------
t = pd.crosstab(data['var1'], data['var2'])  # contingency table (pd = pandas, not scipy.stats; column names illustrative)
stats.chi2_contingency(observed=t)  # does the chi square test
# returns chi square test stat, P value, degrees of freedom, expected values
# the farther apart the expected and observed values are, the stronger the relationship
# the lower the P value, the stronger the relationship
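A runnable version on a hardcoded contingency table (the counts are illustrative):

```python
from scipy import stats

observed = [[30, 20],    # e.g. rows = group, columns = outcome (illustrative)
            [15, 35]]
chi2, p, dof, expected = stats.chi2_contingency(observed)
# expected = the counts we would see if the variables were independent;
# low p => the row and column variables are likely related
```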
Anova
# Anova
# ------
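A one-way ANOVA sketch with scipy's f_oneway (the groups are illustrative; the third is deliberately shifted):

```python
from scipy import stats

g1 = [20, 21, 19, 20, 22]
g2 = [21, 20, 22, 19, 21]
g3 = [30, 31, 29, 32, 30]   # clearly different mean
f_stat, p_value = stats.f_oneway(g1, g2, g3)   # H0: all group means equal
# low p => at least one group mean differs from the others
```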
Optimization Problems
- Ant Colony optimization algorithm for Travelling Salesman type problem