# Week 3: t-distribution, comparing two populations, ANOVA test

### T distribution

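When σ is unknown or the sample is small (n < 30), use the t-distribution instead of the normal. Its thicker tails account for the extra uncertainty in the standard error, which is estimated from the sample standard deviation s rather than from σ.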

Z follows the standard normal distribution.
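To see how much wider the t-distribution is for small samples, here is a minimal sketch comparing two-sided 95% critical values. The sample sizes (10 and 30) and the variable names are chosen for illustration only.

```r
# Two-sided 95% critical values: standard normal vs. t-distribution
z_crit       <- qnorm(0.975)         # standard normal
t_crit_small <- qt(0.975, df = 9)    # t with n = 10, so df = 9
t_crit_large <- qt(0.975, df = 29)   # t with n = 30, so df = 29

# The t critical values are larger and shrink toward z as df grows
round(c(z = z_crit, t_df9 = t_crit_small, t_df29 = t_crit_large), 3)
# z = 1.960, t_df9 = 2.262, t_df29 = 2.045
```

A larger critical value means a wider confidence interval, which is how the t-distribution pays for not knowing σ.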

Comparing two populations:

CI: use the new point estimate and its SE directly
HT: involves a new population and a new sample standard deviation
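For reference, these are the standard quantities for comparing the means of two independent samples. The symbols (sample means, standard deviations and sizes with subscripts 1 and 2) are notation assumed here and do not come from the notes.

$$
\text{point estimate} = \bar{x}_1 - \bar{x}_2, \qquad
SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}
$$

$$
\text{CI: } (\bar{x}_1 - \bar{x}_2) \pm t^{*}_{df} \cdot SE, \qquad
\text{HT: } T = \frac{(\bar{x}_1 - \bar{x}_2) - \text{null value}}{SE}
$$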

(I still need to study this part.)

### ANOVA testing

• ANOVA compares three or more population means to determine whether they could all be equal.
• It asks whether the variability in the sample means is so large that it is unlikely to come from chance alone, by considering all groups simultaneously.
• Suppose we want to compare the means of A, B, C and D. With t-tests we would need 6 separate tests, one per pair. Assume a significance level of 0.05 for each test and treat the tests as independent:
P(all correct) = (0.95)⁶ = 0.735
The accumulated Type I error = 1 - 0.735 = 0.265
• The H0 in ANOVA sets all means equal to each other:
H0: µ1 = µ2 = ... = µk
H1: at least one mean is different
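The Type I error build-up from running every pairwise t-test can be checked with a few lines of R. This is a sketch that, like the calculation above, treats the tests as independent; the variable names are illustrative.

```r
k       <- 4              # number of groups (A, B, C, D)
n_tests <- choose(k, 2)   # number of pairwise t-tests
alpha   <- 0.05           # significance level of each individual test

p_all_correct    <- (1 - alpha)^n_tests   # no Type I error in any test
familywise_error <- 1 - p_all_correct     # at least one Type I error

round(c(tests = n_tests,
        all_correct = p_all_correct,
        familywise_error = familywise_error), 3)
# tests = 6, all_correct = 0.735, familywise_error = 0.265
```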

#### ANOVA only answers whether there is evidence of a difference in at least one pair of groups; it doesn't tell us which groups are different.

• The F statistic is calculated as the ratio
MSG (variability between groups) /
MSE (variability within groups)
• Under H0, the F statistic follows an F distribution, which is right skewed and has two degrees-of-freedom parameters (k = number of groups, n = total sample size):
Numerator (dfG = k - 1)
Denominator (dfE = n - k)

Sum of Squares Total (SST)
- measures the total variability in the response variable.

Sum of Squares Group (SSG)
- measures the variability between groups.

Sum of Squares Error (SSE)
- measures the variability within groups.

SST = SSG + SSE
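Written out, the decomposition and the F statistic look as follows. The indexing is an assumption made here for illustration: $y_{ij}$ is observation $i$ in group $j$, $\bar{y}_j$ and $n_j$ are the mean and size of group $j$, and $\bar{y}$ is the grand mean.

$$
SST = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (y_{ij} - \bar{y})^2, \qquad
SSG = \sum_{j=1}^{k} n_j (\bar{y}_j - \bar{y})^2, \qquad
SSE = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (y_{ij} - \bar{y}_j)^2
$$

$$
MSG = \frac{SSG}{df_G} = \frac{SSG}{k - 1}, \qquad
MSE = \frac{SSE}{df_E} = \frac{SSE}{n - k}, \qquad
F = \frac{MSG}{MSE}
$$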

In R: `pf(F_value, dfG, dfE, lower.tail = FALSE)` gives the p-value, i.e. the upper-tail area of the F distribution.

If the p-value is small (below the significance level), reject H0:

The data provide convincing evidence that at least one pair of population means differ from each other.

### Conditions for ANOVA

1. Independence: sampled observations must be independent both within and between groups (non-paired).
2. Approximate normality: the response should be nearly normal within each group.
3. Equal variance: groups should have roughly equal variability.
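A rough way to look at conditions 2 and 3 in R is sketched below. It uses `PlantGrowth`, a dataset that ships with R, rather than data from these notes. The Shapiro-Wilk and Bartlett tests are swapped in here as common checks and are not part of the course material. Independence (condition 1) depends on how the data were collected and cannot be checked this way.

```r
# PlantGrowth: 30 plant weights in 3 groups (ctrl, trt1, trt2)
data(PlantGrowth)

# Equal variance: compare the group standard deviations
group_sds <- tapply(PlantGrowth$weight, PlantGrowth$group, sd)
sd_ratio  <- max(group_sds) / min(group_sds)   # near 1 means similar spread

# Approximate normality: test the residuals of the one-way model
fit       <- aov(weight ~ group, data = PlantGrowth)
normality <- shapiro.test(residuals(fit))

# A formal check of equal variances across groups
equal_var <- bartlett.test(weight ~ group, data = PlantGrowth)

round(group_sds, 3)
c(sd_ratio = sd_ratio,
  shapiro_p = normality$p.value,
  bartlett_p = equal_var$p.value)
```

Small p-values from either test would be a warning sign; side-by-side boxplots (`boxplot(weight ~ group, data = PlantGrowth)`) give the same information visually.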

Here is how I use R to solve this kind of problem:

`# list all data; pad the shorter groups with NA so every column has the same length`
`worker1 <- c(6.13, 6.06, 6.11, 6.09, 6.20, NA, NA)`
`worker2 <- c(6.06, 6.16, 6.03, 6.07, 5.99, 6.30, 6.02)`
`worker3 <- c(6.24, 6.18, 6.38, 6.34, 6.36, 6.14, NA)`
`worker4 <- c(6.09, 6.03, 5.99, 6.01, NA, NA, NA)`
`# combine the columns into one data frame`
`data_t <- data.frame(cbind(worker1, worker2, worker3, worker4))`
`# stack into long format: a values column and an ind column naming the group`
`Stack_t <- stack(data_t)`
`# one-way ANOVA; rows with NA are dropped by default`
`Aov_t <- aov(values ~ ind, data = Stack_t)`
`summary(Aov_t)`
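The `aov` result can be cross-checked by computing the sums of squares by hand, following SST = SSG + SSE and F = MSG / MSE. This is a sketch using the same worker data without the NA padding; the variable names are illustrative.

```r
# Same measurements as above, without the NA padding
worker1 <- c(6.13, 6.06, 6.11, 6.09, 6.20)
worker2 <- c(6.06, 6.16, 6.03, 6.07, 5.99, 6.30, 6.02)
worker3 <- c(6.24, 6.18, 6.38, 6.34, 6.36, 6.14)
worker4 <- c(6.09, 6.03, 5.99, 6.01)

values <- c(worker1, worker2, worker3, worker4)
group  <- factor(rep(c("worker1", "worker2", "worker3", "worker4"),
                     times = c(5, 7, 6, 4)))

n <- length(values)   # total sample size
k <- nlevels(group)   # number of groups

grand_mean  <- mean(values)
group_means <- tapply(values, group, mean)
group_sizes <- tapply(values, group, length)

SST <- sum((values - grand_mean)^2)                     # total variability
SSG <- sum(group_sizes * (group_means - grand_mean)^2)  # between groups
SSE <- sum((values - ave(values, group))^2)             # within groups

dfG <- k - 1
dfE <- n - k
F_value <- (SSG / dfG) / (SSE / dfE)                    # MSG / MSE
p_value <- pf(F_value, dfG, dfE, lower.tail = FALSE)

c(SST = SST, SSG = SSG, SSE = SSE, F = F_value, p = p_value)
```

`ave(values, group)` returns each observation's own group mean, which keeps the SSE line short. The F value and p-value should match the first row of `summary(aov(values ~ group))`.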

### Multiple Comparisons

A more stringent significance level is appropriate for the pairwise tests that follow an ANOVA. Adjust α with the Bonferroni correction: α* = α/K

K = k(k-1)/2 (k = number of groups, K = number of pairwise comparisons)

Ex. The social class variable has 4 levels. If α = 0.05 for the original ANOVA, what should the modified significance level be for the two-sample t-tests that determine which pairs of groups have significantly different means?

k = 4, K = 4*3/2 = 6

α* = 0.05/6 = 0.0083
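The adjustment above (commonly called the Bonferroni correction) is easy to wrap in a small helper. `bonferroni_alpha` is a hypothetical function name used only for this sketch.

```r
# Adjusted significance level for all pairwise comparisons among k groups
bonferroni_alpha <- function(alpha, k) {
  K <- choose(k, 2)   # number of pairwise comparisons, k(k-1)/2
  alpha / K
}

alpha_star <- bonferroni_alpha(0.05, 4)
round(alpha_star, 4)   # 0.05 / 6, about 0.0083
```

Base R's `pairwise.t.test(x, g, p.adjust.method = "bonferroni")` takes the equivalent route from the other side: instead of shrinking the significance level, it multiplies each pairwise p-value by the number of comparisons.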

Slides source: http://web.ntpu.edu.tw/~wtp/statpdf/Ch_11.pdf

More on using R for ANOVA:

Quick-R: ANOVA/MANOVA

`datafilename="http://personality-project.org/R/datasets/R.appendix1.data"`
`data.ex1=read.table(datafilename,header=T)   #read the data into a table`
`aov.ex1 = aov(Alertness~Dosage,data=data.ex1)  #do the analysis of variance`
`summary(aov.ex1)                                    #show the summary table`
`print(model.tables(aov.ex1,"means"),digits=3)       #report the means and the number of subjects/cell`
`boxplot(Alertness~Dosage,data=data.ex1)        #graphical summary`

Update: I then tried using R to solve the practice questions. Here is one of them:

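1. First, create a data frame named `produce`. The later steps assume the dplyr package (for `%>%` and `mutate`) and the statsr package (for `inference`) are loaded:

`library(dplyr)`
`library(statsr)`
`produce <- data.frame(xa = c(6.7,7.8,6.6,6.2,5.9,4.8,6.6,5.0,6.5,7.1,7.8,7.4,6.1,5.2,6.1,4.5,7.5,6.2,6.0,5.0,7.3,7.0,6.4,5.9,4.4), xb = c(5.8,7.6,6.0,6.4,5.3,5.5,7.9,4.8,7.0,6.5,4.5,5.8,7.1,5.6,4.4,7.0,4.5,6.7,6.6,5.9,5.3,5.2,6.7,6.9,4.9))`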

2. Set the hypotheses. H0: µ1 = µ2; H1: µ1 > µ2 (one-sided, to match `alternative = "greater"` in step 4)

3. Create a new column `xc` holding the difference xa - xb, because we want the mean of the differences:

`produce <- produce %>% mutate(xc = xa - xb)`

4. Use `inference` to run the hypothesis test:

`inference(y = xc, data = produce, statistic = "mean", type = "ht", null = 0, alternative = "greater", method = "theoretical")`

Single numerical variable
n = 25, y-bar = 0.244, s = 1.3799
H0: mu = 0
HA: mu > 0
t = 0.8841, df = 24
p_value = 0.1927
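The same test can be reproduced with base R alone, which is a useful cross-check when the `inference` function is not available. This is a sketch: it runs a one-sample t-test on the differences with `t.test`, which is a swapped-in equivalent and not the function used above.

```r
xa <- c(6.7, 7.8, 6.6, 6.2, 5.9, 4.8, 6.6, 5.0, 6.5, 7.1, 7.8, 7.4, 6.1,
        5.2, 6.1, 4.5, 7.5, 6.2, 6.0, 5.0, 7.3, 7.0, 6.4, 5.9, 4.4)
xb <- c(5.8, 7.6, 6.0, 6.4, 5.3, 5.5, 7.9, 4.8, 7.0, 6.5, 4.5, 5.8, 7.1,
        5.6, 4.4, 7.0, 4.5, 6.7, 6.6, 5.9, 5.3, 5.2, 6.7, 6.9, 4.9)

xc <- xa - xb   # differences, as in step 3

# One-sided, one-sample t-test of H0: mu = 0 against HA: mu > 0
res <- t.test(xc, mu = 0, alternative = "greater")

c(n = length(xc), y_bar = mean(xc), s = sd(xc))   # 25, 0.244, about 1.3799
res$statistic   # t, about 0.8841
res$parameter   # df = 24
res$p.value     # compare with the p_value reported above
```

With a p-value this large, H0 is not rejected at the 0.05 level.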

The main arguments of `inference`:

y: the response variable we are interested in

x: the explanatory variable that splits the data into two groups

statistic: the sample statistic we're using, which determines the population parameter we're estimating ("mean", "median" or "proportion")

type = "ht": hypothesis test (null = 0; alternative = "twosided", "greater" or "less")

type = "ci": confidence interval

method: "theoretical" or "simulation"

These are my notes from an online course I took through Coursera. Thanks for reading.
