Week 3: t-distribution, comparing two populations, ANOVA test

T distribution

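When σ is unknown or the sample is small (n < 30), use the t-distribution to account for the extra uncertainty in the standard error estimate.

σ is the population standard deviation. When n is large, it is acceptable to estimate σ with the sample standard deviation S. When the sample is small, S is an imprecise estimate of σ, and the resulting differences are large enough that they cannot be ignored.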
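To see how much the t-distribution widens an interval for small samples, here is a minimal R sketch comparing 95% critical values from qt() with the normal critical value from qnorm(). The degrees of freedom chosen here are illustrative assumptions, not values from the course.

```r
# Compare 95% critical values: t-distribution vs. standard normal.
# The df values below are illustrative choices, not from the course notes.
conf <- 0.95
upper_tail <- 1 - (1 - conf) / 2          # 0.975 for a two-sided 95% interval

z_star <- qnorm(upper_tail)               # normal critical value
dfs    <- c(4, 9, 29, 99)                 # df = n - 1 for n = 5, 10, 30, 100
t_star <- qt(upper_tail, df = dfs)        # t critical values, one per df

# t* is always larger than z*, and shrinks toward z* as df grows
print(data.frame(df = dfs, t_star = t_star, z_star = z_star))
```

The smaller the sample, the larger t* is relative to z*, which is how the t-distribution compensates for estimating σ with S.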



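Comparing two populations:

Inference for the means of two independent populations: large samples (Z)
Inference for the means of two independent populations: small samples (t)
Inference for the mean difference of paired populations (paired/dependent samples)
Inference for the difference between two population proportions -> to be covered next week together with proportions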

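Inference for the means of two independent populations: large samples (Z)

CI: use the new point estimate and its SE directly.
HT: the difference gets its own population parameter and its own sample standard deviation.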
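As a rough sketch of the large-sample case, the point estimate is the difference of sample means and its SE combines both sample variances. The helper name two_sample_z and the summary statistics passed to it are hypothetical, chosen only to illustrate the arithmetic.

```r
# Large-sample inference for mu1 - mu2 from summary statistics.
# two_sample_z is a hypothetical helper; the inputs below are made-up numbers.
two_sample_z <- function(xbar1, s1, n1, xbar2, s2, n2, null = 0, conf = 0.95) {
  estimate <- xbar1 - xbar2                       # new point estimate
  se       <- sqrt(s1^2 / n1 + s2^2 / n2)         # SE of the difference
  z_star   <- qnorm(1 - (1 - conf) / 2)           # critical value for the CI
  z        <- (estimate - null) / se              # test statistic for the HT
  list(
    estimate    = estimate,
    se          = se,
    ci          = c(estimate - z_star * se, estimate + z_star * se),
    z           = z,
    p_two_sided = 2 * pnorm(-abs(z))
  )
}

# Illustrative call with hypothetical summary statistics
res <- two_sample_z(xbar1 = 52, s1 = 10, n1 = 100,
                    xbar2 = 50, s2 = 10, n2 = 100)
print(res)
```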

Inference for the means of two independent populations: small samples (t)

Inference for the mean difference of paired populations (paired)

(still to be studied)

ANOVA testing

  • ANOVA is used to compare three or more population means to determine whether they could all be equal.
  • It determines whether the variability in the sample means is so large that it seems unlikely to be from chance alone, by considering many groups simultaneously.
  • Suppose we compare the means of A, B, C and D. With t-tests we would need 6 different tests. Assume a significance level of 0.05 and treat the tests as independent:
P(all correct) = (0.95)⁶ = 0.735
The accumulated Type I error rate = 1 - 0.735 = 0.265
  • Recognize that H0 in ANOVA sets all means equal to each other:
H0: µ1 = µ2 = ... = µk
H1: at least one mean is different

ANOVA only answers whether there is evidence of a difference in at least one pair of groups; it doesn't tell us which groups are different.
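The Type I error buildup for the four-group example can be reproduced in a few lines of R. This is a small sketch that, like the calculation above, assumes the six tests are independent.

```r
# Type I error buildup when running every pairwise t-test among k groups.
# Assumes the tests are independent, as in the calculation in the notes.
k     <- 4                                 # groups A, B, C, D
alpha <- 0.05                              # significance level per test

n_tests          <- choose(k, 2)           # number of pairwise comparisons
p_all_correct    <- (1 - alpha)^n_tests    # no false rejection in any test
familywise_error <- 1 - p_all_correct      # at least one false rejection

print(c(n_tests = n_tests,
        p_all_correct = round(p_all_correct, 3),
        familywise_error = round(familywise_error, 3)))
```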


  • The F statistic is calculated as the ratio:
F = MSG (variability between groups) / MSE (variability within groups)
  • The F statistic follows a right-skewed F distribution with two degrees-of-freedom parameters (k = number of groups, n = total number of observations):
Numerator: dfG = k - 1
Denominator: dfE = n - k

Sum of Squares Total (SST)
- measures the total variability in the response; SST = SSG + SSE.

Sum of Squares Group (SSG)
- measures the variability between groups; MSG = SSG / dfG.

Sum of Squares Error (SSE)
- measures the variability within groups; MSE = SSE / dfE.


In R: pf(F_value, dfG, dfE, lower.tail = FALSE)

If the p-value is small, reject H0.

The data provide convincing evidence that at least one pair of population means is different from each other.
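To make the sums of squares and the F statistic concrete, here is a sketch that computes them by hand. It uses the worker measurements that appear later in these notes, with the NA padding dropped; the variable names are my own.

```r
# One-way ANOVA by hand, using the worker measurements from these notes.
groups <- list(
  worker1 = c(6.13, 6.06, 6.11, 6.09, 6.20),
  worker2 = c(6.06, 6.16, 6.03, 6.07, 5.99, 6.30, 6.02),
  worker3 = c(6.24, 6.18, 6.38, 6.34, 6.36, 6.14),
  worker4 = c(6.09, 6.03, 5.99, 6.01)
)

values      <- unlist(groups)
n           <- length(values)              # total number of observations
k           <- length(groups)              # number of groups
grand_mean  <- mean(values)
group_means <- sapply(groups, mean)
group_sizes <- sapply(groups, length)

SSG <- sum(group_sizes * (group_means - grand_mean)^2)        # between groups
SSE <- sum(sapply(groups, function(g) sum((g - mean(g))^2)))  # within groups
SST <- sum((values - grand_mean)^2)                           # total

dfG <- k - 1
dfE <- n - k
MSG <- SSG / dfG
MSE <- SSE / dfE

F_value <- MSG / MSE
p_value <- pf(F_value, dfG, dfE, lower.tail = FALSE)

print(c(SSG = SSG, SSE = SSE, SST = SST, F = F_value, p = p_value))
```

The same F value should come out of aov() on the stacked data, which is a handy cross-check.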

Conditions for ANOVA

  1. Independence: sampled observations within and between groups must be independent (non-paired).
  2. Approximate normality: the response within each group should be nearly normal.
  3. Equal variance: groups should have roughly equal variability.
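One informal way to check the last two conditions is to look at per-group summaries and side-by-side boxplots. The sketch below does this for the worker measurements from these notes. Comparing the largest and smallest standard deviations is only a rough rule of thumb that I am assuming here, not a formal test.

```r
# Informal check of the ANOVA conditions on the worker measurements.
workers <- list(
  worker1 = c(6.13, 6.06, 6.11, 6.09, 6.20),
  worker2 = c(6.06, 6.16, 6.03, 6.07, 5.99, 6.30, 6.02),
  worker3 = c(6.24, 6.18, 6.38, 6.34, 6.36, 6.14),
  worker4 = c(6.09, 6.03, 5.99, 6.01)
)

group_n    <- sapply(workers, length)
group_mean <- sapply(workers, mean)
group_sd   <- sapply(workers, sd)

# Equal variance: compare the spread across groups (rough heuristic only)
sd_ratio <- max(group_sd) / min(group_sd)
print(data.frame(n = group_n, mean = group_mean, sd = group_sd))
print(sd_ratio)

# Approximate normality: look for strong skew or outliers in each group
boxplot(workers, ylab = "measurement", main = "Measurements by worker")
```

With groups this small, the plots can only flag gross violations, so the independence condition still has to be argued from how the data were collected.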

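How I used R to solve this problem:

# Measurements per worker; pad with NA so every vector has the same length
worker1 <- c(6.13, 6.06, 6.11, 6.09, 6.20, NA, NA)
worker2 <- c(6.06, 6.16, 6.03, 6.07, 5.99, 6.30, 6.02)
worker3 <- c(6.24, 6.18, 6.38, 6.34, 6.36, 6.14, NA)
worker4 <- c(6.09, 6.03, 5.99, 6.01, NA, NA, NA)
# Combine into a data frame (all columns must have the same length)
data_t <- data.frame(worker1, worker2, worker3, worker4)
# Stack into long format: 'values' holds the measurements, 'ind' the worker
Stack_t <- stack(data_t)
# Fit the one-way ANOVA; rows with NA are dropped by default
Aov_t <- aov(values ~ ind, data = Stack_t)
# Show the ANOVA table (Df, Sum Sq, Mean Sq, F value, p-value)
summary(Aov_t)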

Multiple Comparison

df = min(n1 - 1, n2 - 1)

A more stringent significance level is appropriate for these tests, so adjust α (the Bonferroni correction): α* = α / K

K = k(k - 1) / 2 (k = number of groups, K = number of pairwise comparisons)

Example: the social class variable has 4 levels. If α = 0.05 for the original ANOVA, what should the modified significance level be for the two-sample t-tests that determine which pairs of groups have significantly different means?

k = 4, K = 4 * 3 / 2 = 6

α* = 0.05 / 6 = 0.0083
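The adjusted significance level and the pairwise tests can be sketched in R as follows. This uses the worker measurements from the ANOVA example and the conservative df = min(n1 - 1, n2 - 1) rule from these notes; the helper pair_test is hypothetical and written only for illustration.

```r
# Pairwise two-sample t-tests with a Bonferroni-adjusted significance level.
groups <- list(
  worker1 = c(6.13, 6.06, 6.11, 6.09, 6.20),
  worker2 = c(6.06, 6.16, 6.03, 6.07, 5.99, 6.30, 6.02),
  worker3 = c(6.24, 6.18, 6.38, 6.34, 6.36, 6.14),
  worker4 = c(6.09, 6.03, 5.99, 6.01)
)

alpha      <- 0.05
k          <- length(groups)          # number of groups
K          <- choose(k, 2)            # number of comparisons = k(k-1)/2
alpha_star <- alpha / K               # adjusted significance level

# Hypothetical helper: two-sided t-test with df = min(n1 - 1, n2 - 1)
pair_test <- function(x1, x2) {
  se     <- sqrt(var(x1) / length(x1) + var(x2) / length(x2))
  t_stat <- (mean(x1) - mean(x2)) / se
  df     <- min(length(x1) - 1, length(x2) - 1)
  c(t = t_stat, df = df, p = 2 * pt(-abs(t_stat), df))
}

pairs   <- combn(names(groups), 2)    # every pair of group names
results <- apply(pairs, 2, function(pr) pair_test(groups[[pr[1]]], groups[[pr[2]]]))
colnames(results) <- apply(pairs, 2, paste, collapse = " vs ")

# A pair differs significantly only if its p-value is below alpha*
significant <- results["p", ] < alpha_star
print(round(t(results), 4))
print(significant)
```

Base R also offers pairwise.t.test() with p.adjust.method = "bonferroni", which adjusts the p-values instead of the significance level and uses a different df rule, so its numbers will not match this sketch exactly.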

Slides source: http://web.ntpu.edu.tw/~wtp/statpdf/Ch_11.pdf

More about how to use R for ANOVA:

# read the data into a table (data.ex1 needs the columns Alertness and Dosage)

# do the analysis of variance
aov.ex1 <- aov(Alertness ~ Dosage, data = data.ex1)
# show the summary table
summary(aov.ex1)
# report the means and the number of subjects per cell
print(model.tables(aov.ex1, "means"), digits = 3)
# graphical summary
boxplot(Alertness ~ Dosage, data = data.ex1)

Update: I then used R to solve the practice questions. Here is one of them:

  1. First, create a data frame named 'produce'.

  2. Set H0: µ1 = µ2; H1: µ1 ≠ µ2

  3. Create a new column xc holding xa - xb, because we want the mean of the differences:

library(dplyr)  # provides %>% and mutate()
produce <- produce %>% mutate(xc = xa - xb)

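  4. Use 'inference' to run the hypothesis test:

library(statsr)  # inference() comes from the statsr package used in the course
inference(y = xc, data = produce, statistic = "mean", type = "ht", null = 0, alternative = "greater", method = "theoretical")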


Single numerical variable
n = 25, y-bar = 0.244, s = 1.3799
H0: mu = 0
HA: mu > 0
t = 0.8841, df = 24
p_value = 0.1927

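Because the p-value (0.1927) is greater than α (0.05), we fail to reject H0. The data do not provide convincing evidence that the part assembly times of the two production lines differ.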
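The t statistic and p-value in the output above can be re-derived from the reported summary statistics alone. This is a minimal check in base R; the variable names are my own.

```r
# Re-derive the paired test result from the reported summary statistics.
n     <- 25        # number of paired differences
y_bar <- 0.244     # mean of the differences xc = xa - xb
s     <- 1.3799    # standard deviation of the differences
null  <- 0         # H0: mu = 0

se      <- s / sqrt(n)                       # standard error of the mean difference
t_stat  <- (y_bar - null) / se               # test statistic
df      <- n - 1                             # degrees of freedom
p_value <- pt(t_stat, df, lower.tail = FALSE)  # one-sided, HA: mu > 0

print(c(t = round(t_stat, 4), df = df, p_value = round(p_value, 4)))

# With the raw data available, base R gives the same test via:
# t.test(produce$xa, produce$xb, paired = TRUE, alternative = "greater")
```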

A brief introduction to the 'inference' function

y: the response variable we are interested in

x: the variable that splits the data into two groups

statistic: the sample statistic we are using and the population parameter we are estimating ("mean", "median" or "proportion")

type = "ht": hypothesis test (null = 0; alternative = "twosided", "greater" or "less")

type = "ci": confidence interval

method: "theoretical" or "simulation"

These are notes from an online course I took through Coursera. If you are interested in the whole course, click here for more information. Thanks.
