Statistics for ML
Probability
Set theory
Bayes Theorem
- Given the likelihood P(Evidence | Hypothesis) and the prior P(Hypothesis), find the posterior P(Hypothesis | Evidence)
Basic Stats
- central tendency / typical value: central tendency is the middle point of a data's distribution. Measures: mean, median, mode. Mode is used for categorical variables; it cannot be calculated for continuous numerical variables, as the values are too scattered for any single value to repeat
- dispersion: dispersion is the spread of data in the distribution, i.e. how different the actual values are from the mean. Common measures: SD, SD² = Var, range, percentiles
- skewness
- kurtosis
Curves A and B are skewed towards the left and right respectively, rather than being symmetrical around their centre. Skewness is a numerical metric representing this deviation from symmetry.
Curves C and D share the same centre; however, C is more spread out while D is more peaked. Kurtosis is a measure of the peakedness (and tail heaviness) of the data distribution.
problem: whether a new counter is required at an ice cream parlour
solution:
step 1: find out if people are waiting in queue
Observe 10 samples at random times — noting the number of customers in queue.
Then we can calculate the mean and SD. Based on these, we can decide whether we need a new counter
Population Mean is denoted by μ while sample mean is denoted by x bar.
μ= Σ x / N, where N is the number of elements in population
x bar = Σ x / n, where n is the number of elements in sample
Calculating Mean from Grouped Data
If individual observations are not available, but their frequency distribution is available for each class
Saving Account Balance(Class) | Frequency
0-50k | 30
50k - 100k | 20
100k - 200k | 15
In this case, we can still derive approximate mean of this sample using
x bar = Σ (f * x) / n
where, f is frequency, x is mid point of each class, n is total observations in sample
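A quick sketch of the grouped-data mean for the table above (using the class midpoints, as the formula requires):

```python
# Approximate mean from grouped data: x_bar = sum(f * x) / n,
# using the class midpoints of the savings-account table.
mids = [25_000, 75_000, 150_000]   # midpoints of 0-50k, 50k-100k, 100k-200k
freqs = [30, 20, 15]

n = sum(freqs)
grouped_mean = sum(f * x for f, x in zip(freqs, mids)) / n
print(round(grouped_mean, 2))  # → 69230.77
```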
Weighted Mean
If the observations in a sample have different levels of importance, we should use a weighted mean instead of a simple arithmetic mean. Likewise, if the values in the sample do not occur with the same frequency, a weighted mean should be used.
Geometric means should be used when the metric changes over time and has a multiplicative effect, e.g. the annual interest rate in a multi-year account balance calculation. In such cases we take the nth root of (the product of all n individual sample values).
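A minimal sketch of the geometric mean for multi-year growth (the rates are illustrative):

```python
import math

rates = [0.10, 0.20, -0.05]        # +10%, +20%, -5% over three years (illustrative)
factors = [1 + r for r in rates]   # growth factors: 1.10, 1.20, 0.95

# nth root of the product of the n values
geo_mean = math.prod(factors) ** (1 / len(factors))
# the arithmetic mean of the factors would overstate the average growth
```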
Distributions
A distribution is chart between P(X) on y-axis and X on x-axis.
We match the shape of our data to one of the 18-20 standard / common distributions. Then we can simply use the equation of that standard distribution instead of calculating the equation of our own custom distribution. The matching doesn't have to be done visually; it can be done with statistical goodness-of-fit tests.
In practice, after applying transformations like log, many non-normal variables start showing a normal distribution
Std distributions have some properties also, which can be used for simplification of calculations.
There are discrete and continuous distributions. we are going to focus only on continuous distributions.
Examples of Distributions
Discrete Distributions
- Bernoulli
- Binomial
- Poisson
- Multinomial
- Hypergeometric
Continuous Distributions
- Normal
- Uniform distribution
- Gamma
Normal Distributions and “Law of Large numbers”
Normal Distribution
It is a function of Mean and SD. The values of these affect the shape of the normal distribution.
Properties of Normal Distribution
- If X ~ N(mu, sd) then (X - mu) / sd ~ N(0, 1)
This is called the "standard normal" distribution, with mean = 0 and sd = 1
mu represents the mean; sd represents the standard deviation
TODO: practical application of std normal using Z Table
E.g. we have a normal distribution of salary, with mu = 20K and sd = 500
if we want to calculate the probability that a person earns at most 20.3K:
z = (20,300 - 20,000) / 500 = 0.6
Look up 0.6 in the N(0,1) table to get the probability.
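The Z-table lookup can be done in code with scipy (the salary numbers here are illustrative):

```python
from scipy.stats import norm

mu, sd, x = 20_000, 500, 20_300   # illustrative salary distribution
z = (x - mu) / sd                 # standardize: z = 0.6
p = norm.cdf(z)                   # P(X <= 20_300), i.e. the Z-table value for 0.6
```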
- A normal distribution is symmetric about the mean. It has some implications:
a) P(x ≤ Mean) = P(X ≥ Mean)
b) Mean = Median=Mode
c) If X ~ N(mu, sd) THEN
the density is symmetric: f(mu - delta) = f(mu + delta)
P(X ≤ mu - delta) = P(X ≥ mu + delta)
d) IF X ~ N(mu, sd) THEN
P(mu - sd ≤ X ≤ mu + sd) ≈ 68%
P(mu - 2 * sd ≤ X ≤ mu + 2 * sd) ≈ 95%
P(mu - 3 * sd ≤ X ≤ mu + 3 * sd) ≈ 99.7%
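These empirical-rule percentages can be verified from the standard normal CDF:

```python
from scipy.stats import norm

# P(mu - k*sd <= X <= mu + k*sd) reduces to P(-k <= Z <= k) for Z ~ N(0,1)
for k in (1, 2, 3):
    p = norm.cdf(k) - norm.cdf(-k)
    print(k, round(p, 4))   # → 0.6827, 0.9545, 0.9973
```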
In stock markets, we use a “Log Normal” distribution
Sources of data: investing.com
- stock market prices
- macro variables like inflation
Why do we do Sampling?
- we will never have data of the entire population. It is too time consuming to capture such data
- Incremental benefit is not enough to justify the cost of capturing and analyzing entire population data
- To separate data, to avoid overfitting
There are 3 types of sampling
- SRSWR: simple random sampling with replacement
- SRSWOR: simple random sampling without replacement
- Stratified: we separate out the population based on one or more variables. This creates multiple pools, each being representative of one population type. Within these pools, we do either SRSWR or SRSWOR
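The three schemes can be sketched with the standard library (stratifying here on parity, an arbitrary illustrative variable):

```python
import random

random.seed(0)
population = list(range(100))

srswr = random.choices(population, k=10)   # with replacement: duplicates possible
srswor = random.sample(population, k=10)   # without replacement: no duplicates

# stratified: split the population into pools, then sample within each pool
strata = {"even": [x for x in population if x % 2 == 0],
          "odd":  [x for x in population if x % 2 == 1]}
stratified = {name: random.sample(pool, k=5) for name, pool in strata.items()}
```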
Estimation
- Point Estimation
- Interval Estimation
Hypothesis testing techniques
- testing for the mean — T-Test / Z-Test, Anova
Estimations
Degrees of freedom
Lets assume we have 4 variables: Z1, Z2, Z3, Z4
and Z1+ Z2 + Z3 + Z4 = 10
How many variables in the above equation can take any random value? It is 3. Because once you select the values of 3 variables freely, the 4th one gets fixed.
Chi Square distribution
If X1, X2, X3.. Xn ~ N(0,1)
Then Y = X1² + X2² + … + Xn²
Then Y follows a Chi Square distribution; for n such variables, the degrees of freedom of Y = n
Shape of Chi Square distributions depend on n
For n = 1, it is half the normal distribution, with only the right half of the shape of N(0,1). The probabilities get doubled because the left-side probabilities get folded onto the positive side
e.g. the density at -1 is reflected onto +1 and added to the existing density at +1
What is the use of Chi Square distributions and what statistics follow this distribution? Variance follows Chi-Sq.
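The folding for n = 1 can be checked numerically: if X ~ N(0,1) then P(X² ≤ y) = P(-√y ≤ X ≤ √y) = 2Φ(√y) - 1:

```python
from math import sqrt
from scipy.stats import chi2, norm

y = 1.5                            # arbitrary positive point
lhs = chi2.cdf(y, df=1)            # chi-square CDF with 1 df
rhs = 2 * norm.cdf(sqrt(y)) - 1    # folded standard normal
print(round(lhs, 6), round(rhs, 6))
```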
Heisenberg's Uncertainty Principle (as an analogy)
The probability of a continuous variable taking any one exact value is zero; only intervals of values have non-zero probability.
T Distribution
It is another derived distribution, built from the ratio of a Normal variable and (the square root of a scaled) Chi Square variable.
If X ~ N(0,1)
and Y ~ Chi Square distribution, with df = k
then T = X / Sqrt(Y/k) ~ t-dist with k df
Which statistic follows the T distribution? The standardized sample mean, when the population SD is unknown and estimated by the sample SD.
Eg.
The sample mean itself follows a Normal distribution
n: number of observations in the sample
(mean(X) - mu) / sqrt(Var(x)/n) ~ T-dist
= (mean(X) - mu) / (sd(x) / sqrt(n)) ~ T-dist with df = n - 1
The t-dist looks similar to N, but is a more spread out, pressed-down version of N, i.e. it has heavier tails.
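The heavier tails, and the convergence to N(0,1) as df grows, can be seen from the densities:

```python
from scipy.stats import norm, t

x = 3.0
tail_t = t.pdf(x, df=5)          # t density out in the tail
tail_n = norm.pdf(x)             # standard normal density at the same point
tail_t_big = t.pdf(x, df=1000)   # large df: nearly indistinguishable from normal
```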
TODO: Model some data in excel and plot these values
F-Distribution
IF
Y is a variable which follows Chi Sq, with df = k1
Z is another variable which follows Chi Sq, with df = k2
Then
(Y/k1)/(Z/k2) ~ F-dist with df=k1,k2
(F-dist has two degrees of freedom)
Which statistic follows the F-dist? The ratio of two sample variances.
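A quick numeric check of an F-dist property that follows directly from the definition: if X ~ F(k1, k2) then 1/X ~ F(k2, k1), so P(X ≤ x) = 1 - P(1/X ≤ 1/x):

```python
from scipy.stats import f

k1, k2, x = 3, 5, 2.0            # arbitrary dfs and evaluation point
lhs = f.cdf(x, k1, k2)           # P(X <= x) for X ~ F(k1, k2)
rhs = 1 - f.cdf(1 / x, k2, k1)   # same probability via the reciprocal
```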
Estimations
Guessing the population mean and sd using a small sample
Point Estimation
Depending on the situation, mean / mode / median can be used to calculate the central tendency of a variable.
The unbiased estimator of the population mean is the sample mean (average)
The best estimator of the population variance is the sample variance, which uses the (n - 1) correction:
- Biased (population-formula) variance = 1/n * sum( ( x - mean(x) )^2 )
- Sample variance = 1/(n - 1) * sum( ( x - mean(x) )^2 )
The correction is needed because the 1/n formula, computed on a sample, is an unbiased estimator of (n-1)/n times the population variance, i.e. it underestimates it
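Python's statistics module implements both formulas, which makes the bias factor easy to see (illustrative sample):

```python
import statistics

sample = [2, 4, 4, 4, 5, 5, 7, 9]        # illustrative data, mean = 5
biased = statistics.pvariance(sample)    # 1/n formula (population formula)
unbiased = statistics.variance(sample)   # 1/(n-1) formula (sample variance)
n = len(sample)
# the 1/n formula equals (n-1)/n times the unbiased estimate
print(biased, unbiased)
```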
Central Limit Theorem, and deciding the optimal sample size to ensure a certain % probability / confidence in the estimate
- extract multiple samples from a large set of observation
- calculate their average
- The averages will follow normal distribution
sample mean(x) ~ N(population mean, population sd / sqrt(n))
where n = sample size
- x is a random variable and can have a distribution. It can be any type of distribution: normal, chi sq or any other
- sample mean(x) is also a random variable and has its own distribution. By the CLT, this distribution approaches a Normal distribution
- Now, if we observe the values of the sample mean, they also come mostly from the places where the population is concentrated. Therefore, this distribution is centred on the population mean
- Now let's observe the dispersion of this distribution. If the samples are small, the dispersion of the sample means is high. As the samples get larger, the dispersion decreases. Therefore the SD of this distribution is "Population SD / Sqrt(Sample Size)"
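A small simulation sketch of the CLT using a uniform (clearly non-normal) population; the seed and sizes are arbitrary:

```python
import random
import statistics

random.seed(42)
n, trials = 50, 2000

# means of repeated samples drawn from Uniform(0, 1)
sample_means = [statistics.mean(random.uniform(0, 1) for _ in range(n))
                for _ in range(trials)]

pop_mean, pop_sd = 0.5, (1 / 12) ** 0.5   # exact moments of Uniform(0, 1)
se_theory = pop_sd / n ** 0.5             # predicted SD of the sample means
se_empirical = statistics.stdev(sample_means)
```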
For a particular sample,
(mean(x) - population mean) / Std Error ~ N(0,1)
where SE = Popln SD / sqrt(sample size)
Also, (mean(x) - population mean) / Std Error lies between +/- 3 with 99.7% confidence. This is because mean = 0 and SD = 1 in N(0,1), so "mean +/- 3 SD" = +/- 3
or in other words
- -3 ≤ (mean(x) - popln mean) / Std Err ≤ 3 with 99.7% confidence
- -3 * SE - mean(x) ≤ -popln mean ≤ 3 * SE - mean(x)
- mean(x) - 3 SE ≤ popln mean ≤ mean(x) + 3 SE
population mean lies between sample mean +/- 3 SE with 99.7% confidence
Population Mean can be calculated using above equation:
- Sample mean can be computed from Sample
- SE needs the population SD, which is collected from secondary research. For macroeconomic variables, the dispersion / disparity remains roughly constant over a period of time (say the last 10-20 years), although the mean keeps changing. It does change over very long periods, like 50 years. If you don't have this statistic, just use the sample SD as a proxy for the population SD.
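Putting the interval together for an illustrative sample, using the sample SD as the proxy:

```python
import statistics

data = [48, 52, 50, 47, 53, 51, 49, 50, 46, 54]   # illustrative observations
n = len(data)
x_bar = statistics.mean(data)
se = statistics.stdev(data) / n ** 0.5            # sample SD as proxy for popln SD
interval = (x_bar - 3 * se, x_bar + 3 * se)       # ~99.7% interval for popln mean
```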
Interval Estimation
Hypothesis Testing
E.g. Test the elasticity of demand for a particular product.
- Null Hypothesis (represented by H0): the default statement that is assumed true until the data provides evidence against it. Eg, under the Indian Penal Code, a "person is innocent" until proven guilty
H0: statistic = value
Null Hypothesis always has an equals sign
2. Alternate Hypothesis (H1): Eg. that the person is guilty
H1: can be
- statistic < value (left-tailed test)
- statistic > value (right-tailed test)
- statistic <> value (2-tailed test)
3. Test Statistic
- The statistics / calculation which forms the basis of experiment
- Designed in such a way that it follows a known distribution
h0: mean = 20
h1:
test stat: (sample mean - 20) / (sample SD / sqrt(n))
4. Each Test Stat ~ A Known Dist
5. Errors
- Reject the null hypothesis when it is actually true | Type 1 Error | False Positive
- Accept the null hypothesis when it is actually false | Type 2 Error | False Negative
6. Significance of Error
- confidence should be high
- probability of error should be low
Steps of doing Hypothesis Testing
- Define H0
- Define H1 (<, >, <>)
- calculate test stat
- assume H0 is true
- calculate the P value. The P value is the probability of finding a test stat as extreme (as high or as low) as the one you found, assuming H0 is true
- fail to reject the null (i.e. reject the alternate) if P value ≥ significance level / alpha
- accept the alternate (reject the null) if P value < sig level / alpha
Selecting the Test based on problem type
1. Testing mean for a particular population
H0: mean = value
we can possibly do 2 types of tests
- Z-test: used when the sample size > 30 and the population SD is known. It is called a Z-test because the test statistic follows a normal distribution.
- T-test: every other case. It is called a t-test because the test statistic follows a T distribution.
The t-test gives a P value: the probability of observing a sample mean this extreme if H0 were true for the entire population. If P > alpha, we fail to reject H0.
If it is not, the sample mean tells the direction of where the population mean lies: below or above the H0 postulated mean.
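A one-sample t-test sketch with scipy (the data and H0 value are illustrative):

```python
from scipy import stats

sample = [25, 26, 24, 27, 25, 23, 26]                     # illustrative observations
t_stat, p_value = stats.ttest_1samp(sample, popmean=20)   # H0: mean = 20
# small p => reject H0; the positive t (sample mean > 20) gives the direction
```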
2. Testing the mean of one sample with mean of another sample, from different populations
Eg. testing whether the retail price of petrol in Australia and India is the same in dollar-equivalent terms.
H0: mean 1 = mean 2
Test Statistics
- 2 sample t-test
The t-test takes the means of both samples and gives a P value: the probability of observing a difference this large if H0 were true. If P > alpha, we fail to reject H0.
If it is not, the two sample means tell the direction of the difference.
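A two-sample (independent) t-test sketch; the prices are made-up illustrative numbers:

```python
from scipy import stats

prices_a = [1.10, 1.15, 1.08, 1.12, 1.11]   # sample from population 1
prices_b = [1.30, 1.28, 1.35, 1.27, 1.32]   # sample from population 2
t_stat, p_value = stats.ttest_ind(prices_a, prices_b)   # H0: mean1 = mean2
# small p => the two population means likely differ
```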
3. Testing mean of 2 samples drawn from the same population
Eg. No of people voting for BJP has increased / decreased after the BJP Govt came to power.
H0: mean 1 = mean 2
- Ask the same set of people, before and after the Govt change.
- The test is called “pair wise t-test”
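A pair-wise (paired) t-test sketch: the same units measured before and after (illustrative counts):

```python
from scipy import stats

before = [10, 12, 9, 11, 13, 10]
after  = [12, 14, 10, 13, 15, 12]
t_stat, p_value = stats.ttest_rel(before, after)   # H0: no change in the mean
# small p => the before/after means differ; sign of t gives the direction
```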
4. Testing multiple means from same population
H0: mean 1 = mean 2 = mean 3 = mean 4
ANOVA (Analysis of Variance). Its test statistic follows an F distribution, which has 2 degrees-of-freedom parameters
H1: at least one of the means is not equal to the others
Eg.
While doing Regression Analysis, Y = F(X), where X can have 4 categories. To test if X and Y are related, we compare the following means using ANOVA
- mean 1 = Mean(Y) When X=X1
- mean 2 = Mean(Y) When X=X2
- mean 3 = Mean(Y) When X=X3
- mean 4 = Mean(Y) When X=X4
Hypothesis Testing — Practice
Chi Square
# Chi Square Test
# ---------------
t = pd.crosstab(data['var1'], data['var2'])  # contingency table (pd = pandas, not scipy.stats; column names illustrative)
stats.chi2_contingency(observed=t)  # does the chi square test
# returns chi square test stat, P value, degrees of freedom, expected values
# the farther apart the expected and observed values are, the stronger the relationship
# the lower the P value, the stronger the relationship
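A runnable version on a hardcoded contingency table (the counts are illustrative):

```python
from scipy import stats

observed = [[30, 20],    # e.g. rows = group, columns = outcome (illustrative)
            [15, 35]]
chi2, p, dof, expected = stats.chi2_contingency(observed)
# expected = the counts we would see if the variables were independent;
# low p => the row and column variables are likely related
```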
Anova
# Anova
# ------
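A one-way ANOVA sketch with scipy's f_oneway (the groups are illustrative; the third is deliberately shifted):

```python
from scipy import stats

g1 = [20, 21, 19, 20, 22]
g2 = [21, 20, 22, 19, 21]
g3 = [30, 31, 29, 32, 30]   # clearly different mean
f_stat, p_value = stats.f_oneway(g1, g2, g3)   # H0: all group means equal
# low p => at least one group mean differs from the others
```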
Optimization Problems
- Ant Colony optimization algorithm for Travelling Salesman type problem