Case study: Apple watch 健康數據分析 — 你睡了嗎？

Bananish

Published in

The whispers of a data analyst

15 min readMar 13, 2023

這會是一個非常基本簡單的探索教學，目的是在能從攜帶式裝置中獨立進行資料的取得與使用。不會涉及太複雜的 R 語言操作、資料處理、模型調適、與模型驗證。

嘗試使用 Apple watch 記錄到的心律資料
推敲每個小時的活動狀態

會分幾個部分說明：

導出 iPhone 中的健康資料
在 R 中使用導出的健康資料
看看心律資料
簡單的用隱碼可夫模型推敲活動狀態

Step 1：導出 iPhone 中的健康資料

因為長年配戴 Apple Watch，該裝置會再把數據同步至手機中，因此導出數據十分簡單：

進入手機健康 APP 中，點擊右上角的頭像
在畫面最下方，找到並點擊「輸出所有健康資料」
在元件中，點擊輸出並等候，需要一陣子
接著可以在畫面中點擊要輸出的位置

這裡的健康數據大概有 3 年以上，約 40 MB

Step 2：在 R 中使用導出的健康資料

導出的資料是一個「輸出.zip」的檔案
解壓縮後得到「apple_health_export」資料夾
找到資料夾中的「輸出.xml」這個就是主要的健康資料來源
此處已經將「輸出.xml」改為「export.xml」

Step 3：引入資源

主要使用資源

data.table、dplyr 用於資料的取用與整理
xml2 用於取用導出的 xml 文件
ggplot2 用於繪圖
ggridges 用於繪圖

library(xml2)
library(dplyr)
library(data.table)
library(ggplot2)
library(ggridges)

次要使用的資源

未被在一開始載入至環境中會零散的出現在各個角落

tidyr：用於解開 data table 中巢狀的 list 數據
skimr 摘要與描述資料集
ggthemes 調整 ggplot2 圖中的版型
colorspace 調整 ggplot2 圖中的顏色
feather 存取 Feather 格式資料
depmixS4 使用 HMM 模型
imputeTS 使用 moving average 進行插補

Step 4：資料獲取

因為是 xml 檔案，不算最常用，特別說明；以下方法缺乏效率，但簡單好懂，依序是：

指定路徑
讀取 xml 檔案
在檔案中找到根目錄中的 Record 紀錄
取得紀錄中的所有屬性值
轉為 data.table
將 data.table 中的巢狀資料解開

data <- "data/20230311/apple_health_export/export.xml" %>%
  read_xml() %>%
  xml_find_all(".//Record") %>%
  xml_attrs() %>%
  data.table() %>%
  tidyr::unnest_wider(".")

因為很慢，可能會想先把讀取好的資料存下來

（feather 不一定是最適合用於存取 ft 的 package）

儲存成 ft

data%>%
  feather::write_feather("data/20230311_data.ft")

讀取 ft

data = eather::read_feather("data/20230311_data.ft")

Step 5：預整理

此處取得 type 為 HKQuantityTypeIdentifierHeartRate 的心律資料。

因為只會選擇 startDate 與 value 使用，只需將時間轉換為 POSIXct 格式，value 轉為整數

剩下就是一些日期時間的處理

data_hr_raw <- data %>%
  filter(
    type == "HKQuantityTypeIdentifierHeartRate"
  ) %>%
  mutate_at(.vars = c("startDate"), ~ as.POSIXct(.)) %>%
  mutate(value = value %>% as.integer()) %>%
  select(startDate, value) %>%
  rename_all(~ c("start_date", "value")) %>%
  filter(!is.na(value)) %>%
  mutate(
    year = start_date %>% year() %>% as.integer() %>% factor(., 2020:2023),
    month = start_date %>% month() %>% as.integer() %>% factor(., 1:12),
    weekdays = start_date %>% weekdays() %>% tolower() %>% factor(., c("sunday", "monday", "tuesday", "wednesday", "thursday", "friday", "saturday")),
    weekday_end = ifelse(weekdays %in% c("sunday", "saturday"), "weekend", "weekday") %>% factor(),
    hour = start_date %>% hour() %>% as.integer() %>% factor(., 0:23)
  )

如果想知道還有哪些 type 或那些 type 什麼意思可以參考 Apple 開發者文件

Data types | Apple Developer Documentation

Specify the kind of data used in HealthKit.

developer.apple.com

稍微確定一下資料

data_hr_raw %>%
  skimr::skim()

總共有將近 30 萬筆，看起來 HR 數據大部分應該不會太怪；就是 min 28 有點低；時間大概是 2020 ~ 2023，幾乎都是 unique 的。

Step 6：可視化

接下來，想看看心律的分佈

心律分佈與 quantile

data_hr_raw %>%
  ggplot(aes(x = value, y = "Heart rate")) +
  geom_density_ridges(
    quantile_lines = TRUE, scale = 10, alpha = 0.7,
    vline_size = 1, vline_color = "red",
    point_size = 0.4, point_alpha = 1
  ) +
  ggthemes::theme_fivethirtyeight()

心律分佈 by hour

data_hr_raw %>%
  ggplot(aes(x = value, y = hour, fill = stat(x))) +
  geom_density_ridges_gradient(scale = 6, color = "white", rel_min_height = 0.02) +
  colorspace::scale_fill_continuous_sequential("sunset") +
  ggthemes::theme_fivethirtyeight() +
  labs(fill = "Heart\nRate") +
  theme(
    legend.position = "right",
    legend.direction = "vertical"
  )

心律分佈 by day of week

data_hr_raw %>%
  ggplot(aes(x = value, y = weekdays, fill = weekdays)) +
  geom_density_ridges(scale = 3, color = "white", rel_min_height = 0.02, alpha = .5) +
  colorspace::scale_fill_discrete_qualitative("Dark 2") +
  ggthemes::theme_fivethirtyeight() +
  labs(fill = "DOW") +
  theme(
    legend.position = "right",
    legend.direction = "vertical"
  )

心律分佈 by hour and weekday/weekend

data_hr_raw %>%
  mutate(
    hour = start_date %>% hour() %>% as.integer() %>% factor(., 0:23),
    weekdays = start_date %>% weekdays(),
    weekdays = ifelse(weekdays %in% c("Sunday", "Saturday"), "Weekend", "Weekday") %>% factor()
  ) %>%
  ggplot(aes(x = value, y = hour, fill = weekdays)) +
  geom_density_ridges(scale = 5, color = "white", rel_min_height = 0.02, alpha = .5) +
  colorspace::scale_fill_discrete_qualitative("Dark 2") +
  ggthemes::theme_fivethirtyeight() +
  labs(fill = "Weekday/Weekend") +
  theme(
    legend.position = "right",
    legend.direction = "vertical"
  )

心律分佈 by year and month

data_hr_raw %>%
  mutate(
    month = start_date %>% month() %>% as.integer() %>% factor(., 1:12),
    year = start_date %>% year() %>% as.integer() %>% factor(.),
  ) %>%
  ggplot(aes(x = value, y = month, fill = year)) +
  geom_density_ridges(scale = 3, color = "white", rel_min_height = 0.02, alpha = .5) +
  colorspace::scale_fill_discrete_qualitative("Dark 2") +
  ggthemes::theme_fivethirtyeight() +
  labs(fill = "Year") +
  theme(
    legend.position = "right",
    legend.direction = "vertical"
  )

數據特徵

假設對 data context 一無所知，觀察數據後有以下發現：

所有 HR 資料的分佈不常態，可能是兩個或多個常態態再一起，也可以考慮 log transform
HR 分佈有小時差異（可能跟睡不睡覺有關）
有明顯的 dow 差異，也就是 weekday/weekend 不太一樣（可能要上班有差）
weekday/weekend 每小時的 HR 分佈也不太一樣（可能與作息有關）
year / month 似乎也有不太一樣（可能天氣或作息的長期變化）

Step 7：用 HMM 猜睡眠狀態

希望盡可能的可以只用每小時心律的狀態來猜當下的狀態，想象中應該只有睡了、醒著兩種。

彙整資料

這邊選了今年 1 月的歷史資料來嘗試，因為每小時 apple watch 會有數筆的 HR 數據，首先得把資料都彙整成小時的單位，並計算中位數。

data_hr_summary <- data_hr_raw %>%
  filter(start_date >= "2023-01-01", start_date < "2023-02-01") %>%
  mutate(start_date = start_date %>% round(., "hour") %>% as.POSIXct()) %>%
  group_by(start_date) %>%
  summarise(
    q2 = value %>% quantile(., probs = .50)
  ) %>%
  data.frame()

因為不太可能每個小時都有數據（手表要充電），所以補一下缺失的數據

針對缺漏的數據，用 imputeTS 的 na_ma 以線性的移動平均進行補值

（這不是一定是最適合補遺漏值的方法）

data_hr_summary <- seq.POSIXt(
  from = "2023-01-01 00:00:00" %>% as.POSIXlt(),
  to = "2023-01-31 23:00:00" %>% as.POSIXlt(),
  by = "hour"
) %>%
  data.frame(start_date = .) %>%
  merge(x = ., y = data_hr_summary, by = "start_date", all.x = T) %>%
  arrange(start_date)%>%
  mutate_if(is.numeric, ~ imputeTS::na_ma(., k = 3, weighting = "linear")) %>%
  mutate(
    date = start_date %>% as.Date(),
    hour = start_date %>% hour()
  )

原則上上面這些在嘗試的時候要反覆確認數據的狀態，因為怕太冗長就不列出了。

HMM 模型擬和

這邊用了 depmixS4 的 depmix 一階（只有一個隱藏狀態）的隱碼可夫模型

不做任何調適，就是簡單的試試

model_hr_status <- model_hr <- depmixS4::depmix(
  data = data_hr_summary, q2 ~ 1, ns = 2,
  family = depmixS4::multinomial(link = "identity")
) %>%
  depmixS4::fit()

整理一下模型的輸出

data_model_predict <- data_hr_summary %>%
  cbind(
    .,
    model_hr_status@posterior %>% data.table() %>% select(state)
  ) %>%
  data.table()

看一下模型的結果

圖中標記為 status = 1 的確實蠻像睡覺的時間

蠻像一回事的，以後可以嘗試用 Apple watch 的睡眠紀錄交叉驗證！！

（請用 type = “HKCategoryTypeIdentifierSleepAnalysis”）

參考

有點累不想寫太多、而且又怕篇幅太長，所以上面做得很淺，就別見怪

script 要整理整理在放上來 XD

下面有些資源可以看看～

Analyzing Apple Health Data with Python

Our smartphones are incredible mini-trackers that can be used for self-improvement and self-transformation.

medium.com

Wearable Device Heart Rate and Activity Data in an Unsupervised Approach to Personalized Sleep…

Monitoring Editor: Gunther Eysenbach Reviewed by Qi Huang, Xinyue Li, and Tien Lung Sun Jiaxing Liu, MSc, 1 Yang Zhao…

www.ncbi.nlm.nih.gov

A Two-Stage HMM Model for Sleep/Wake Identification via Commercial Wearable Device

Good sleep habit is essential to maintain a quality life. Sensor-based wearable devices have been increasingly deployed…

ieeexplore.ieee.org

GitHub - ziyujia/Physiological-Signal-Classification-Papers: A list of papers for physiological…

If you have any suggested papers, please contact me ziyujia{at}bjtu.edu.cn We conduct the overall statistical analysis…

github.com

A Hidden Markov Model Based Unsupervised Algorithm for Sleep/Wake Identification Using Actigraphy

Actigraphy is widely used in sleep studies but lacks a universal unsupervised algorithm for sleep/wake identification…

arxiv.org