Case Study: Tips 數據集 Speedrun

Bananish

Published in

The whispers of a data analyst

15 min readMar 12, 2023

2022 開始在東吳上資料科學碩士在職專班的課程
這是來自多變量分析課程中老師用來介紹 rcommander GUI 的範例
其實，用按的並不比直接寫還快～

案例介紹

被收錄在 Practical data analysis : case studies in business statistics 中的 Tips 數據集中記錄了每個餐會的小費、帳單等資訊。

Practical Data Analysis: Case Studies in Business Statistics

Practical Data Analysis: Case Studies in Business Statistics is a collection of 75 class tested case studies for use in…

www.amazon.com

一位服務生記錄了他在一家餐廳工作的幾個月間，收到的小費與相關的訊息

這個研究的目標在於探索小費多寡與其他因素之間的關係。

其中的資訊包含：

tip in dollars
bill in dollars
sex of the bill payer
whether there were smokers in the party
day of the week
time of day
size of the party

Step 1：引入資源

主要使用資源

data.table、dplyr 用於資料的取用與整理
ggplot2、plotly 用於繪圖
abn 用於結構探索

library(data.table)
library(dplyr)
library(ggplot2)
library(plotly)
library(abn)

次要使用的資源

未被在一開始載入至環境中會零散的出現在各個角落

skimr 摘要與描述資料集
scales 格式化或美化數字的呈現
colorspace 調整 ggplot2 圖中的顏色
ggthemes 調整 ggplot2 圖中的版型
forcats 處理 factor
jtools 展示回歸分析的結果
mediation 評估中介效應

Step 2：資料獲取

var_path <- "data/20230311_tips.csv"

data <- var_path %>% fread()

Step 3：預整理

文獻多、資料定義清楚，等先備資訊充足時可以馬上開始整理。

如果先備資訊不足可能會先進行探索在整理。

data <- data %>% mutate(
  sex = sex %>% factor(., c("Female", "Male")),
  smoker = smoker %>% factor(., c("No", "Yes")),
  day = day %>% factor(., c("Thur", "Fri", "Sat", "Sun")),
  time = time %>% factor(., c("Lunch", "Dinner"))
)

Step 4：資料探索

描述性統計

此階段注意的重點包含：

資料大小
資料正確性，是否有奇怪的資訊
類別與連續變項有哪些、大概的分佈
類別變量有沒有可能要重新分組
連續變量是否有需要轉換以符合常態分佈，顯然有幾個可能需要

data %>% skimr::skim()

可視化

以 3D 散點圖表示三個連續變項 tip、bill、size 之間的關係

data %>%
  plot_ly(x = ~size, y = ~total_bill, z = ~tip, type = "scatter3d", mode = "markers

用 plotly 可以旋轉 3D 圖，顯然是有某種相關

以 3D 散點圖表示三個連續變項 tip、bill、size 之間的關係，並加上以顏色表示是否為 smoker

data %>%
  plot_ly(x = ~size, y = ~total_bill, z = ~tip, color = ~smoker, type = "scatter3d", mode = "markers")

**3D 散點圖： tip、bill、size 間的關係，以及是否為 smoker**

smoker 可能不會對三個連續變量的關係有所影響

更複雜一點，想知道不同 day、smoker、sex 中 total_bill 與 tip 的關係

在 R 中用 ggplot 做 sub plot 好像比較快一點，調整顏色與版型也比較快

ggplotly(
  data %>%
    mutate(
      day = day %>% paste("day =", .),
      smoker = smoker %>% paste("smoker =", .)
    ) %>%
    ggplot(aes(x = total_bill, y = tip, color = sex)) +
    geom_smooth(method = "lm", se = F, size = .5, linetype = 2, formula = "y ~ x") +
    geom_point(alpha = .8) +
    facet_wrap(smoker ~ day, ncol = 4, scales = "free") +
    scale_x_continuous(labels = scales::label_comma()) +
    scale_y_continuous(labels = scales::label_comma()) +
    colorspace::scale_color_discrete_diverging("Green-Orange") +
    colorspace::scale_fill_discrete_diverging("Green-Orange") +
    ggthemes::theme_fivethirtyeight() +
    theme(legend.position = "bottom")
)

**散點圖 + 線性趨勢： tip、bill 間的關係在不同 smoker、day、sex 組合中的關係的差異**

在這樣多層的狀況下，少數的 smoker X day 的組合好像有顯示出 Female、Male 的 bill vs tips 的斜率差異，但可能要檢定才會知道

用性別畫個盒狀圖看看

ggplotly(data %>%
  mutate(
    day = day %>% paste("day =", .),
    smoker = smoker %>% paste("smoker =", .)
  ) %>%
  ggplot(aes(x = sex, y = tip, color = sex)) +
  geom_boxplot() +
  facet_wrap(~day, ncol = 4, scales = "free") +
  scale_y_continuous(labels = scales::label_comma()) +
  colorspace::scale_color_discrete_diverging("Green-Orange") +
  colorspace::scale_fill_discrete_diverging("Green-Orange") +
  ggthemes::theme_fivethirtyeight() +
  theme(legend.position = "bottom"))

好像也很難說不同性別（結帳的人）tips 有什麼不一樣

Step 5：假設檢定

研究的環境限制

因為沒有這個主題（bills vs tips）的 data context、historical conclusion。

此外，這顯然是一個類似 Cohort study 的觀察性研究（exposure：bills、outcome：tip）。

所以很難確定：

Bias：有沒有錯誤分類的可能、有沒有可能這個服務生都被分配到某種特徵的客人、有沒有可能這是一間沒有代表性的餐廳
Confounder：不太確定實際 casual DAG 長什麼樣子，但很理性的可以想像 bills 應該某種程度會影響 tips，這也是要檢查的主要項目，但其他的變因會不會對這兩者有影響而成為混淆因子、又或者是有沒有其他沒被觀察的混淆，例如吃飯的人今天心情與身體好不好、對這間店的態度如何

所以，不管怎麼做，這篇研究就算可以屏除 random error 的影響，在資訊很不充足的情形之下，對魚研究最後發現所謂有顯著的成果，可能都很難上升到因果效應的這種層次。也就是在研就的內部有效性上應該存在著許多未解之謎。

結構探索

既然，都不太確定 DAG 是什麼，所以用了 abn 中的方法自動找到合理的 DAG。

此時，把 day 進行重新分組，bill、tip 進行 log transformation

data_abn <- data %>%
  mutate(day = forcats::fct_recode(day, "Weekday" = "Thur", "Weekday" = "Fri", "Weekend" = "Sat", "Weekend" = "Sun")) %>%
  mutate_at(.vars = c("total_bill", "tip"), ~ log(.)) %>%
  data.frame()

data.dists <- list(
  "total_bill" = "gaussian",
  "tip" = "gaussian",
  "sex" = "binomial",
  "smoker" = "binomial",
  "day" = "binomial",
  "time" = "binomial",
  "size" = "gaussian"
)


dag.banned <- matrix(0, ncol(data_abn), ncol(data_abn), dimnames = list(names(data.dists), names(data.dists)))

dag.banned["size", "total_bill"] <- 1
dag.banned["total_bill", "tip"] <- 1

data_abn %>%
  buildScoreCache(data.df = ., data.dists = data.dists, max.parents = 1, dag.banned = dag.banned) %>%
  mostProbable() %>%
  fitAbn() %>%
  plot()

透過比較不同的 max.parent 設置下，可以找到 mlik 比較優秀的 max.parent 參數，這邊設定為 1（沒有寫在上面）。

另外，原則上常規的決策路徑中：

bill 影響 size：像是因為不想付太多錢所以不要找太多人
tip 影響 bill：像是不想付太多小費所以不要點太多

這兩種因果路徑可能都有一點合理又有一點不合理，但不像是常態，所以用矩陣 dag.banned 確保這個路徑被移除在外。

這是最後 abn 判斷認為比較合理的 causal DAG。

性別決定哪天吃飯（似乎可以接受）、選擇哪天吃飯後決定中午還是晚上（免搶接受）

smoker 被晾在一旁，表示跟其他變量沒關係，跟前面看到的差不多

size 透過中介 bill 進而影響 tip，感覺這個好像非常合理

越多人吃飯總金額越高、總金額越高對 tip 影響越大

回歸分析

雖然已經有路徑係數了，但想再用回歸檢查一次

主要想檢查的是：

透過可視化跟 abn ，已經對 sex、time、day、smoker 沒啥興趣，原則上應該檢查
但應該先確定 size、bill、tip 的 chain 關係

list_model <- c("total_bill", "size", "total_bill + size") %>%
  paste("tip ~", .) %>%
  c(., "total_bill ~ size") %>%
  lapply(., as.formula) %>%
  lapply(., function(x) {
    glm(data = data_abn, formula = x)
  })

suppressMessages({
  list_model %>%
    jtools::export_summs(.,
      error_format = "[{conf.low}, {conf.high}]",
      model.names = c("total bill", "size", "both", "size by total bill")
    )
})

tip ~ bill：有顯著差異，r2 也不差，把 r2、ai、bic 先當一個標準看其他的模型
tip ~ size：r2、aic、bic 好像表現在比 model (1) 差一些
tip ~ size + bill：如果有一個是中介，原則上期待兩個變量的 variance 應該會有一個變差一些，畢竟兩個相依，確實從 size 的 se 或 p 上可以看到類似的反應。r2 只有微乎其微的增加一點，表示多加另一個變量對模型的解釋力沒有太多幫助，雖然 aic 變少但 bic 變多；這些都是存在中介可能會引發的徵兆（但也有可能不是）
bill ~ size：如預期的關聯

順便看了殘差

par(mfrow = c(2, 2))
list_model[[1]] %>% plot()

par(mfrow = c(2, 2))
list_model[[3]] %>% plot()

勉強還行（殘差散佈沒有太明顯詭譎的趨勢）

中介分析

其實差不多意思，但可以試一次

model_mediate = mediation::mediate(model.m = list_model[[4]], model.y = list_model[[3]],
                                   treat = "size", mediator = "total_bill")
model_mediate%>%summary
model_mediate%>%plot

ACME 平均中介效應算是對驗證中介的一項資訊，但 ADE 平均直接效應顯示 size 對 tips 有點影響，也就是在說為什麼 total bill 加上 size 成為 both 模型後 r2 還是有一點點增加。

參考

以上資訊同步分享在 github 中

2023–03–11 Tips

Edit description

seimwiwa.github.io

資料集相關資訊

tips: Tipping data in reshape2: Flexibly Reshape Data: A Reboot of the Reshape Package

One waiter recorded information about each tip he received over a period of a few months working in one restaurant. He…

rdrr.io

GitHub - mwaskom/seaborn-data: Data repository for seaborn examples

Data repository for seaborn examples. This repository exists only to provide a convenient target for the…

github.com

其他高手的 case study

Tips Dataset

Explore and run machine learning code with Kaggle Notebooks | Using data from A Waiter's Tips

www.kaggle.com

Part3 - angela1c.com

Part 3 Analysis: Analyse the relationship between the variables within the dataset. Analyse: analyse the relationship…

www.angela1c.com

中介分析

Doing and reporting your first mediation analysis in R

A quick tutorial on how to perform and report a mediation analysis

towardsdatascience.com

Case Study: Tips 數據集 Speedrun

案例介紹

Practical Data Analysis: Case Studies in Business Statistics

Practical Data Analysis: Case Studies in Business Statistics is a collection of 75 class tested case studies for use in…

Step 1：引入資源

主要使用資源

次要使用的資源

Step 2：資料獲取

Step 3：預整理

Step 4：資料探索

描述性統計

可視化

Step 5：假設檢定

研究的環境限制

結構探索

回歸分析

中介分析

參考

2023–03–11 Tips

Edit description

tips: Tipping data in reshape2: Flexibly Reshape Data: A Reboot of the Reshape Package

One waiter recorded information about each tip he received over a period of a few months working in one restaurant. He…

GitHub - mwaskom/seaborn-data: Data repository for seaborn examples

Data repository for seaborn examples. This repository exists only to provide a convenient target for the…

Tips Dataset

Explore and run machine learning code with Kaggle Notebooks | Using data from A Waiter's Tips

Part3 - angela1c.com

Part 3 Analysis: Analyse the relationship between the variables within the dataset. Analyse: analyse the relationship…

Doing and reporting your first mediation analysis in R

A quick tutorial on how to perform and report a mediation analysis

Written by Bananish