Week 3: t-distribution, comparing two populations, ANOVA test

t-distribution

When σ is unknown or the sample is small (n < 30), use the t-distribution to account for the extra uncertainty in the standard error estimate.





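σ is the population standard deviation. When n is large, using the sample standard deviation S to estimate σ is acceptable. When the sample is small, S is an imprecise estimate of σ, and the two can differ noticeably.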
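To see how much wider the t-distribution makes the cutoffs, the following R sketch compares the critical value for a 95% two-sided interval under the t-distribution with the one under the standard normal. The sample sizes are arbitrary values chosen for illustration.

```r
# Critical values for a 95% two-sided interval.
# The sample sizes below are illustrative, not from the course.
n <- c(5, 10, 30, 100)
t_crit <- qt(0.975, df = n - 1)   # t critical value with n - 1 degrees of freedom
z_crit <- qnorm(0.975)            # normal critical value (about 1.96)

# t_crit is always larger than z_crit and approaches it as n grows
data.frame(n = n, t_crit = round(t_crit, 3), z_crit = round(z_crit, 3))
```

The gap between the two columns is what the t-distribution adds to cover the uncertainty in S when n is small.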



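Comparing two populations

Inference for the means of two independent populations: large samples (Z)
Inference for the means of two independent populations: small samples (t)
Inference for the mean difference of paired populations (paired/dependent samples)
Inference for the difference between two population proportions -> merged into next week's proportion discussion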

Inference for the means of two independent populations: large samples (Z)


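CI: use the new point estimate and SE directly.
HT: the difference gives rise to a new population and a new sample standard deviation.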
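As a rough sketch of the large-sample case: the point estimate is the difference of the sample means, and the standard error combines both sample variances, SE = sqrt(s1²/n1 + s2²/n2). All summary statistics below are made-up numbers for illustration only.

```r
# Hypothetical summary statistics for two independent large samples
n1 <- 50; xbar1 <- 10.2; s1 <- 2.1
n2 <- 60; xbar2 <- 9.5;  s2 <- 1.8

point_est <- xbar1 - xbar2              # estimate of mu1 - mu2
se <- sqrt(s1^2 / n1 + s2^2 / n2)       # standard error of the difference

# 95% confidence interval for mu1 - mu2
ci <- point_est + c(-1, 1) * qnorm(0.975) * se

# Hypothesis test of H0: mu1 - mu2 = 0 against a two-sided alternative
z <- (point_est - 0) / se
p_value <- 2 * pnorm(abs(z), lower.tail = FALSE)
```

Both the interval and the test reuse the same point estimate and SE; only the reference value (0 under H0) is added for the test.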

Inference for the means of two independent populations: small samples (t)

Inference for the mean difference of paired populations (paired)

(still needs to be studied)

ANOVA testing

  • ANOVA is used to compare three or more population means to determine whether they could be equal.
  • It is used to determine whether the variability in the sample means is so large that it seems unlikely to be due to chance alone (by considering many groups simultaneously).
  • Suppose we compare the means of A, B, C and D. Using t-tests, we would need 6 separate tests. Assume each uses a significance level of 0.05 and the tests are independent.
P(all correct) = (0.95)⁶ = 0.735
The accumulated Type I error = 1 - 0.735 = 0.265
  • Recognize that H0 in ANOVA sets all means equal to each other:
H0: µ1 = µ2 = ... = µk
H1: At least one mean is different

ANOVA only answers whether there is evidence of a difference in at least one pair of groups; it doesn't tell us which groups are different.
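The error buildup from running many pairwise t-tests can be reproduced in a few lines of R. This sketch assumes, as the calculation above does, that the individual tests are independent.

```r
k <- 4                      # number of groups (A, B, C, D)
alpha <- 0.05               # significance level of each t-test
n_tests <- choose(k, 2)     # number of pairwise comparisons

p_all_correct <- (1 - alpha)^n_tests    # probability of no Type I error in any test
familywise_error <- 1 - p_all_correct   # probability of at least one Type I error

round(c(p_all_correct, familywise_error), 3)
```

Changing k shows how quickly the accumulated error grows as more groups are compared.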


  • The F statistic is calculated as the ratio
MSG (variability between groups) /
MSE (variability within groups)
  • The F statistic follows a right-skewed F distribution with two degrees-of-freedom parameters:
Numerator (dfG = k - 1, where k is the number of groups)
Denominator (dfE = n - k, where n is the total sample size)

Sum of Squares Total (SST)
- measures the total variability in the response.

Sum of Squares Group (SSG)
- measures the variability between groups.

Sum of Squares Error (SSE)
- measures the variability within groups.
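The three sums of squares can be computed by hand to see how they fit together (SST = SSG + SSE) and how they lead to the F statistic. The data set below is made up for illustration, and the variable names are my own.

```r
# Small made-up data set: three groups of three observations
values <- c(4, 5, 6,  6, 7, 8,  9, 10, 11)
group  <- factor(rep(c("A", "B", "C"), each = 3))

n <- length(values)                  # total sample size
k <- nlevels(group)                  # number of groups
grand_mean  <- mean(values)
group_means <- tapply(values, group, mean)
group_sizes <- tapply(values, group, length)

SST <- sum((values - grand_mean)^2)                      # total variability
SSG <- sum(group_sizes * (group_means - grand_mean)^2)   # between groups
SSE <- sum((values - group_means[group])^2)              # within groups

MSG <- SSG / (k - 1)     # mean square between groups, dfG = k - 1
MSE <- SSE / (n - k)     # mean square error, dfE = n - k
F_value <- MSG / MSE
```

The same F value should come out of `summary(aov(values ~ group))`, which is a handy way to check the hand calculation.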


In R: pf(F_value, dfG, dfE, lower.tail = FALSE)

If the p-value is small (less than α), reject H0.

The data then provide convincing evidence that at least one pair of population means differ from each other.
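As a quick illustration of the pf() call and the decision rule, here is a sketch with made-up values for the F statistic and the degrees of freedom.

```r
# Hypothetical ANOVA results (not from the course data)
F_value <- 4.2
dfG <- 3      # k - 1 with k = 4 groups
dfE <- 36     # n - k with n = 40 observations

# Upper-tail area of the F distribution = p-value of the ANOVA F-test
p_value <- pf(F_value, dfG, dfE, lower.tail = FALSE)

alpha <- 0.05
reject_H0 <- p_value < alpha
```

`lower.tail = FALSE` matters here: the p-value is the area to the right of the observed F, because only large F values count as evidence against H0.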

Conditions for ANOVA

  1. Independence: sampled observations within and between groups must be independent (non-paired).
  2. Approximate normality: the response within each group should be nearly normally distributed.
  3. Equal variance: groups should have roughly equal variability.

How I use R to solve this problem:

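# list all data; pad with NA so every vector has the same length
worker1 <- c(6.13, 6.06, 6.11, 6.09, 6.20, NA, NA)
worker2 <- c(6.06, 6.16, 6.03, 6.07, 5.99, 6.30, 6.02)
worker3 <- c(6.24, 6.18, 6.38, 6.34, 6.36, 6.14, NA)
worker4 <- c(6.09, 6.03, 5.99, 6.01, NA, NA, NA)
# combine into a data frame (all columns must have the same length)
data_t <- data.frame(worker1, worker2, worker3, worker4)
# stack into long format: 'values' holds the measurements, 'ind' the worker
Stack_t <- stack(data_t)
# fit the one-way ANOVA (rows with NA are dropped by default)
Aov_t <- aov(values ~ ind, data = Stack_t)
# show the ANOVA table with the F statistic and p-value
summary(Aov_t)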
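Once the model is fitted, the pieces of the ANOVA table can be pulled out and checked against pf(). The sketch below repeats the worker data so that it runs on its own. Extracting the values by position from summary() is just one way to do it, and the per-group standard deviations at the end are only a rough look at the equal-variance condition.

```r
worker1 <- c(6.13, 6.06, 6.11, 6.09, 6.20, NA, NA)
worker2 <- c(6.06, 6.16, 6.03, 6.07, 5.99, 6.30, 6.02)
worker3 <- c(6.24, 6.18, 6.38, 6.34, 6.36, 6.14, NA)
worker4 <- c(6.09, 6.03, 5.99, 6.01, NA, NA, NA)
Stack_t <- stack(data.frame(worker1, worker2, worker3, worker4))
Aov_t <- aov(values ~ ind, data = Stack_t)

tab <- summary(Aov_t)[[1]]        # ANOVA table as a data frame
dfG <- tab[["Df"]][1]             # between-group degrees of freedom
dfE <- tab[["Df"]][2]             # within-group (residual) degrees of freedom
F_value <- tab[["F value"]][1]
p_value <- tab[["Pr(>F)"]][1]

# Same p-value computed directly from the F distribution
p_check <- pf(F_value, dfG, dfE, lower.tail = FALSE)

# Rough check of the equal-variance condition: per-group standard deviations
tapply(Stack_t$values, Stack_t$ind, sd, na.rm = TRUE)
```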

Multiple Comparison

df = min(n1 - 1, n2 - 1)

A more stringent significance level is appropriate for these tests (adjust α): α* = α / K

K = k(k - 1) / 2 (K = number of comparisons, k = number of groups)

e.g. The social class variable has 4 levels. If α = 0.05 for the original ANOVA, what should the modified significance level be for the two-sample t-tests that determine which pairs of groups have significantly different means?

k = 4, K = 4 × 3 / 2 = 6

α* = 0.05 / 6 ≈ 0.0083
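The same correction takes only a few lines in R. This is a minimal sketch of the arithmetic in the example; the variable names are my own.

```r
k <- 4                      # number of groups (levels of social class)
alpha <- 0.05               # significance level of the original ANOVA
K <- k * (k - 1) / 2        # number of pairwise comparisons
alpha_star <- alpha / K     # corrected significance level for each t-test

round(alpha_star, 4)
```

Base R's `pairwise.t.test(x, g, p.adjust.method = "bonferroni")` applies the equivalent correction from the other side: it multiplies each p-value by the number of comparisons, so the results are compared against the original α.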

Slide source: http://web.ntpu.edu.tw/~wtp/statpdf/Ch_11.pdf

More about how to use R for ANOVA:


# read the data into a table named data.ex1 (with columns Alertness and Dosage) first

# do the analysis of variance
aov.ex1 <- aov(Alertness ~ Dosage, data = data.ex1)
# show the summary table
summary(aov.ex1)
# report the means and the number of subjects per cell
print(model.tables(aov.ex1, "means"), digits = 3)
# graphical summary
boxplot(Alertness ~ Dosage, data = data.ex1)

Update: I then tried using R to solve the practice questions. Here is one of them:

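  1. First, create a data set named 'produce'.


  2. Set H0: µ1 = µ2; H1: µ1 ≠ µ2

  3. Create a new column xc holding xa - xb, because we want the mean of the differences between the two:

library(dplyr)
produce <- produce %>% mutate(xc = xa - xb)

  4. Use 'inference' to run the hypothesis test:

library(statsr)
inference(y = xc, data = produce, statistic = "mean", type = "ht", null = 0, alternative = "greater", method = "theoretical")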


Single numerical variable
n = 25, y-bar = 0.244, s = 1.3799
H0: mu = 0
HA: mu > 0
t = 0.8841, df = 24
p_value = 0.1927

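Since the p-value (0.1927) > α (0.05), we fail to reject H0: the data do not provide convincing evidence of a difference in parts assembly time between the two production lines.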
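The test statistic and p-value in the output can be re-derived from the reported summary statistics alone. The sketch below uses only the numbers printed above (n, y-bar, s) and base R; it follows the one-sided alternative shown in the output (HA: mu > 0).

```r
# Summary statistics reported in the output above
n <- 25
y_bar <- 0.244
s <- 1.3799
null_value <- 0

se <- s / sqrt(n)                        # standard error of the mean difference
t_stat <- (y_bar - null_value) / se      # test statistic
df <- n - 1                              # degrees of freedom
p_value <- pt(t_stat, df, lower.tail = FALSE)   # one-sided: HA is mu > 0

round(c(t = t_stat, p_value = p_value), 4)
```

Because the inputs are rounded, the last digit may differ slightly from what 'inference' prints.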

A brief introduction to the 'inference' function

y : the response variable that we are interested in

x : the explanatory variable that splits the data into two groups

statistic : the sample statistic we're using and the population parameter we're estimating ("mean", "median" or "proportion")

type = "ht" : hypothesis test (null = 0; alternative = "twosided", "greater" or "less")

type = "ci" : confidence interval

method : "theoretical" & "simulation"

These are notes from an online course I took through Coursera. If you are interested in the full set of courses, see Coursera for more information. Thanks.