Week 3: t-distribution, Comparing two populations, ANOVA test

T distribution

When σ is unknown, and especially when the sample is small (n < 30), use the t-distribution to account for the extra uncertainty in the standard error estimate.

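With a large sample (n > 30), the sampling distribution of the sample mean is approximately normal whether or not the population itself is normal. A small sample gives no such guarantee.

In a small sample, if the population is not normal, the sampling distribution will not be normal either; its shape depends on the population.

The actual percentile points also change with the degrees of freedom: the larger the degrees of freedom (that is, the larger n), the closer the t-distribution is to the normal distribution.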

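Z = (x̄ - µ) / (σ / √n) has a standard normal distribution.

However, σ is the population standard deviation. When n is large, estimating σ with S is acceptable, but when the sample is small, S is an imprecise estimate of σ, so the statistic computed with S no longer follows the standard normal exactly.

How is this problem handled? By introducing the concept of degrees of freedom (df = n - 1).

The higher the degrees of freedom, the closer the t-distribution is to the normal distribution.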
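To see this convergence numerically, here is a small R sketch comparing the 97.5th percentile of the t-distribution with that of the standard normal. The degrees of freedom listed are arbitrary values chosen for illustration.

```r
# Degrees of freedom to compare (arbitrary illustrative values)
df <- c(2, 5, 10, 30, 100, 1000)

# 97.5th percentile of the t-distribution for each df
t_crit <- qt(0.975, df = df)

# 97.5th percentile of the standard normal
z_crit <- qnorm(0.975)

# The t critical value is larger than z and shrinks toward it as df grows
print(data.frame(df = df, t_crit = round(t_crit, 3), z_crit = round(z_crit, 3)))
```

The wider t critical values at low df are what make confidence intervals from small samples wider.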


Comparing two populations

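Inference for the means of two independent populations: large samples (Z)
Inference for the means of two independent populations: small samples (t)
Inference for the mean difference of paired populations (paired/dependent samples)
Inference for the difference of two population proportions -> will be covered next week together with proportions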

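Inference for the means of two independent populations: large samples (Z)

First check whether the two population standard deviations are known, then combine them into a new SE for the difference: SE = √(σ1²/n1 + σ2²/n2), using s1 and s2 in place of σ1 and σ2 when they are unknown.

CI: use the new point estimate and SE directly: (x̄1 - x̄2) ± z* × SE
HT: test the difference against its null value (usually 0) with the same SE: Z = ((x̄1 - x̄2) - null) / SE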
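As a minimal sketch of the large-sample CI and HT for a difference in means, the R code below works from summary statistics. All the numbers are hypothetical, made up only to show the arithmetic.

```r
# Hypothetical summary statistics (made up for illustration only)
xbar1 <- 52.1; s1 <- 8.2; n1 <- 60
xbar2 <- 49.3; s2 <- 7.5; n2 <- 55

# Point estimate and standard error of the difference in means
diff_hat <- xbar1 - xbar2
se <- sqrt(s1^2 / n1 + s2^2 / n2)

# 95% confidence interval for mu1 - mu2
z_star <- qnorm(0.975)
ci <- diff_hat + c(-1, 1) * z_star * se

# Two-sided test of H0: mu1 - mu2 = 0
z <- (diff_hat - 0) / se
p_value <- 2 * pnorm(abs(z), lower.tail = FALSE)

print(ci)
print(c(z = z, p_value = p_value))
```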

Inference for the means of two independent populations: small samples (t)


Inference for the mean difference of paired populations (paired)

(still needs to be studied)


ANOVA testing

  • ANOVA is used to compare three or more population means to determine whether they could be equal.
  • It determines whether the variability in the sample means is so large that it seems unlikely to be from chance alone (by considering many groups at once).
  • Why not pairwise t tests? Comparing the means of A, B, C and D would need 6 different t tests. Assume a significance level of 0.05 for each test and, for simplicity, that the tests are independent:
P(all correct) = (0.95)⁶ = 0.735
The accumulated Type I error = 1 - 0.735 = 0.265
  • Recognize that H0 in ANOVA sets all means equal to each other:
H0: µ1 = µ2 = ... = µk
H1: At least one mean is different
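The accumulated Type I error from running pairwise t tests can be reproduced in a few lines of R. This is only a sketch, and it assumes the tests are independent, which is a simplification.

```r
alpha <- 0.05
k <- 4                     # number of groups (A, B, C, D)
n_tests <- choose(k, 2)    # number of pairwise comparisons

# Probability that no test makes a Type I error, assuming independent tests
p_all_correct <- (1 - alpha)^n_tests

# Probability of at least one Type I error across all tests
fwer <- 1 - p_all_correct

print(round(c(n_tests = n_tests, p_all_correct = p_all_correct, fwer = fwer), 3))
```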

ANOVA only answers whether there is evidence of a difference in at least one pair of groups; it doesn't tell us which groups are different.

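Typical applications include the relationship between a product's shelf location and its sales, the effect of advertising on different products, or whether different teaching methods are equally effective.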

  • The F statistic is calculated as the ratio:
MSG (variability between groups) /
MSE (variability within groups)
  • The F statistic follows a right-skewed F distribution with two different degrees of freedom (k = number of groups, n = total sample size):
Numerator (dfG = k - 1)
Denominator (dfE = n - k)

Sum of Squares Total (SST)
- measures the total variability in the response.

Sum of Squares Group (SSG)
- measures the variability between groups.

Sum of Squares Error (SSE)
- measures the variability within groups.

SST = SSG + SSE
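Written out, the quantities above take the following standard form. The notation is introduced here for illustration and is not from the slides: x_ij is observation j in group i, n_i the size of group i, x̄_i the mean of group i, x̄ the grand mean, k the number of groups and n the total sample size.

```latex
\begin{aligned}
SSG &= \sum_{i=1}^{k} n_i \,(\bar{x}_i - \bar{x})^2 \\
SSE &= \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2 \\
SST &= \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2 = SSG + SSE \\
MSG &= \frac{SSG}{k - 1}, \qquad MSE = \frac{SSE}{n - k}, \qquad F = \frac{MSG}{MSE}
\end{aligned}
```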

In R: pf(F_value, dfG, dfE, lower.tail = FALSE)

If the p-value is small (below α), reject H0: the data provide convincing evidence that at least one pair of population means is different from each other.

Conditions for ANOVA

  1. Independence: sampled observations within and between groups must be independent (non-paired).
  2. Approximate normality: the distribution within each group should be nearly normal.
  3. Equal variance: groups should have roughly equal variability.

How I use R to solve this problem:

# Measurements for each worker; shorter groups are padded with NA
# because every column of a data frame must have the same length
worker1 <- c(6.13,6.06,6.11,6.09,6.20,NA,NA)
worker2 <- c(6.06,6.16,6.03,6.07,5.99,6.30,6.02)
worker3 <- c(6.24,6.18,6.38,6.34,6.36,6.14,NA)
worker4 <- c(6.09,6.03,5.99,6.01,NA,NA,NA)
# Combine the vectors into one data frame
data_t <- data.frame(worker1,worker2,worker3,worker4)
# Stack into long format: 'values' (measurement) and 'ind' (worker)
Stack_t <- stack(data_t)
# One-way ANOVA; rows with NA are dropped by the default na.action
Aov_t <- aov(values~ind,data=Stack_t)
summary(Aov_t)
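To connect the aov() output with the sums of squares, the sketch below recomputes SSG, SSE, SST, the F statistic and the p-value by hand on the same worker data, then cross-checks against aov(). The variable names (values, group, F_value and so on) are chosen here for illustration.

```r
# Same worker measurements as above, in long format (no NA padding needed)
values <- c(6.13, 6.06, 6.11, 6.09, 6.20,
            6.06, 6.16, 6.03, 6.07, 5.99, 6.30, 6.02,
            6.24, 6.18, 6.38, 6.34, 6.36, 6.14,
            6.09, 6.03, 5.99, 6.01)
group <- factor(rep(c("worker1", "worker2", "worker3", "worker4"),
                    times = c(5, 7, 6, 4)))

n <- length(values)     # total number of observations
k <- nlevels(group)     # number of groups
grand_mean <- mean(values)
group_means <- tapply(values, group, mean)
group_sizes <- tapply(values, group, length)

# Sums of squares
SSG <- sum(group_sizes * (group_means - grand_mean)^2)
SSE <- sum((values - group_means[as.character(group)])^2)
SST <- sum((values - grand_mean)^2)

# Mean squares, F statistic and p-value
dfG <- k - 1
dfE <- n - k
F_value <- (SSG / dfG) / (SSE / dfE)
p_value <- pf(F_value, dfG, dfE, lower.tail = FALSE)

print(c(SSG = SSG, SSE = SSE, SST = SST, F = F_value, p = p_value))

# Cross-check against aov()
print(summary(aov(values ~ group)))
```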

Multiple Comparison

For each pairwise two-sample t test: df = min(n1 - 1, n2 - 1)

A more stringent significance level is more appropriate for these tests (adjust α, the Bonferroni correction): α* = α / K

K = k(k - 1) / 2 (k = number of groups, K = number of comparisons)

Ex. The social class variable has 4 levels. If α = 0.05 for the original ANOVA, what should the modified significance level be for the two-sample t tests that determine which pairs of groups have significantly different means?

k = 4, K = 4 × 3 / 2 = 6

α* = 0.05 / 6 ≈ 0.0083
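The adjusted significance level can be checked in R. The second half of this sketch is an addition not taken from the slides: it runs base R's pairwise.t.test() on the worker data from the ANOVA example. That function adjusts the p-values (multiplying by the number of comparisons) instead of dividing α, and by default it uses a pooled standard deviation, so its degrees of freedom differ from the min(n1 - 1, n2 - 1) rule.

```r
alpha <- 0.05
k <- 4                     # number of groups
K <- k * (k - 1) / 2       # number of pairwise comparisons
alpha_star <- alpha / K    # adjusted significance level
print(c(K = K, alpha_star = round(alpha_star, 4)))

# Worker data from the ANOVA example, in long format
values <- c(6.13, 6.06, 6.11, 6.09, 6.20,
            6.06, 6.16, 6.03, 6.07, 5.99, 6.30, 6.02,
            6.24, 6.18, 6.38, 6.34, 6.36, 6.14,
            6.09, 6.03, 5.99, 6.01)
group <- factor(rep(c("worker1", "worker2", "worker3", "worker4"),
                    times = c(5, 7, 6, 4)))

# Pairwise t tests with Bonferroni-adjusted p-values;
# compare the adjusted p-values with the original alpha (0.05)
pw <- pairwise.t.test(values, group, p.adjust.method = "bonferroni")
print(pw)
```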


Slides source: http://web.ntpu.edu.tw/~wtp/statpdf/Ch_11.pdf

More about how to use R for ANOVA:

Quick-R: ANOVA/MANOVA

datafilename="http://personality-project.org/R/datasets/R.appendix1.data"
# read the data into a table
data.ex1=read.table(datafilename,header=TRUE)

# do the analysis of variance
aov.ex1 = aov(Alertness~Dosage,data=data.ex1)
# show the summary table
summary(aov.ex1)
# report the means and the number of subjects/cell
print(model.tables(aov.ex1,"means"),digits=3)
# graphical summary
boxplot(Alertness~Dosage,data=data.ex1)

Update: I then tried using R to solve the practice questions. Here is one of them:

  1. First, create a data frame named 'produce':

produce<-data.frame(xa=c(6.7,7.8,6.6,6.2,5.9,4.8,6.6,5.0,6.5,7.1,7.8,7.4,6.1,5.2,6.1,4.5,7.5,6.2,6.0,5.0,7.3,7.0,6.4,5.9,4.4),xb=c(5.8,7.6,6.0,6.4,5.3,5.5,7.9,4.8,7.0,6.5,4.5,5.8,7.1,5.6,4.4,7.0,4.5,6.7,6.6,5.9,5.3,5.2,6.7,6.9,4.9))

2. Set the hypotheses: H0: µ1 = µ2; H1: µ1 > µ2 (one-sided, matching alternative = "greater" in step 4)

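3. Create a new column xc holding xa - xb, since we want the mean of the differences:

library(dplyr)  # provides %>% and mutate()
produce <- produce %>% mutate(xc = xa - xb)

4. Use 'inference' (from the statsr package) to run the hypothesis test:

library(statsr)
inference(y = xc, data = produce, statistic = "mean", type = "ht", null = 0, alternative = "greater", method = "theoretical")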

Result:

Single numerical variable
n = 25, y-bar = 0.244, s = 1.3799
H0: mu = 0
HA: mu > 0
t = 0.8841, df = 24
p_value = 0.1927

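Since the p-value (0.1927) > α (0.05), we fail to reject H0: the data do not show a significant difference in part assembly time between the two production lines.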
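As a cross-check that does not depend on the course package, the same test can be sketched with base R's t.test(). This is an alternative route added here, not the method used in the course; it treats the two columns as paired, as in the steps above.

```r
xa <- c(6.7, 7.8, 6.6, 6.2, 5.9, 4.8, 6.6, 5.0, 6.5, 7.1, 7.8, 7.4, 6.1,
        5.2, 6.1, 4.5, 7.5, 6.2, 6.0, 5.0, 7.3, 7.0, 6.4, 5.9, 4.4)
xb <- c(5.8, 7.6, 6.0, 6.4, 5.3, 5.5, 7.9, 4.8, 7.0, 6.5, 4.5, 5.8, 7.1,
        5.6, 4.4, 7.0, 4.5, 6.7, 6.6, 5.9, 5.3, 5.2, 6.7, 6.9, 4.9)

# Paired one-sided t test of H0: mean difference = 0 vs HA: mean difference > 0
res <- t.test(xa, xb, paired = TRUE, alternative = "greater")
print(res)

# Equivalent one-sample t test on the differences
res_diff <- t.test(xa - xb, mu = 0, alternative = "greater")
print(res_diff$statistic)
```

Both calls should agree with each other and with the t statistic and degrees of freedom reported by inference().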


A brief introduction to the 'inference' function

y: the response variable that we are interested in

x: the variable that splits the data into two groups

statistic: the sample statistic we're using and the population parameter we're estimating ("mean", "median" or "proportion")

type = "ht": hypothesis test (null = 0; alternative = "twosided", "greater" or "less")

type = "ci": confidence interval

method : "theoretical" & "simulation"


These are my notes from an online course on Coursera. If you are interested in the whole course, click here for more information. Thanks.