Writing Functions in R: Learn from the Wickhams

Yao-Jen Kuo

Published in

數聚點文摘

13 min read · Nov 24, 2017

To understand computations in R: everything that happens is a function call.
John Chambers

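R is an open-source project rooted in the S language. John Chambers is both the creator of S and a core member of the R development team, so his description of how central function calls are to R makes a fitting first impression.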

R, at its heart, is a functional programming (FP) language.
Hadley Wickham

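Or we can quote Hadley Wickham, who helped push R to the forefront of data science. In his well-known book Advanced R he writes that R is, at its heart, a functional programming language. Writing functions is therefore a stop that cannot be skipped on the journey of learning R.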


Writing Functions in R

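DataCamp invited the siblings Hadley and Charlotte Wickham to co-teach the Writing Functions in R course. It is also the only DataCamp course taught by Hadley himself, which makes it the first-choice pilgrimage for fans. The course takes roughly four hours, and after completing it you will have a deeper understanding of user-defined functions in R. It is not suitable for beginners; it is aimed at users who already understand Base R concepts such as variable types, data structures, control flow and loops.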

R Functions

This course will teach you the fundamentals of writing functions in R so that you can make your code more readable…

www.datacamp.com

If functions are this important, when exactly does writing one pay off? Hadley gives a very clear guideline in the course:

If you have copy-and-pasted twice, it’s time to write a function.

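If you catch yourself copying and pasting your own code twice, you should stop and rewrite it as a function (the rule is so elegant that my fan mode switched on by itself!)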

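The example used in the course is rescaling numeric data, a common preprocessing step when variables are measured in different units. Suppose we want to rescale the four numeric variables of the built-in iris data with the Min Max Scale method: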

iris_std <- iris
iris_std$Sepal.Length <- (iris_std$Sepal.Length - min(iris_std$Sepal.Length, na.rm = TRUE)) / (max(iris_std$Sepal.Length, na.rm = TRUE) - min(iris_std$Sepal.Length, na.rm = TRUE))
# First copy-and-paste
iris_std$Sepal.Width <- (iris_std$Sepal.Width - min(iris_std$Sepal.Width, na.rm = TRUE)) / (max(iris_std$Sepal.Width, na.rm = TRUE) - min(iris_std$Sepal.Width, na.rm = TRUE))
# Second copy-and-paste
iris_std$Petal.Length <- (iris_std$Petal.Length - min(iris_std$Petal.Length, na.rm = TRUE)) / (max(iris_std$Petal.Length, na.rm = TRUE) - min(iris_std$Petal.Length, na.rm = TRUE))
# Time to stop...

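After copying, pasting and editing the variable names twice, we realize it is time to stop and write a function. First we write min_max_scale(), a function that rescales an input vector: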

# Declare the function
min_max_scale <- function(x) {
  x_min <- min(x, na.rm = TRUE)
  x_max <- max(x, na.rm = TRUE)
  output <- (x - x_min) / (x_max - x_min)
  return(output)
}
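
A quick check on a small vector can confirm that the function behaves as intended. This is only a minimal sketch: the `scores` vector is made up for illustration, and the function is repeated so the snippet runs on its own.

```r
# Same function as above, repeated so this snippet is self-contained
min_max_scale <- function(x) {
  x_min <- min(x, na.rm = TRUE)
  x_max <- max(x, na.rm = TRUE)
  output <- (x - x_min) / (x_max - x_min)
  return(output)
}

# Hypothetical sample values, including one missing value
scores <- c(2, 4, 6, NA)
scaled <- min_max_scale(scores)
print(scaled)  # 0.0 0.5 1.0 NA
```

The smallest value maps to 0, the largest to 1, and the missing value stays missing because na.rm = TRUE only affects how the minimum and maximum are computed.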

Next we call the function to rescale the first two numeric variables, Sepal.Length and Sepal.Width:

# Declare the function
min_max_scale <- function(x) {
  x_min <- min(x, na.rm = TRUE)
  x_max <- max(x, na.rm = TRUE)
  output <- (x - x_min) / (x_max - x_min)
  return(output)
}

# Call the function
iris_std <- iris
iris_std$Sepal.Length <- min_max_scale(iris_std$Sepal.Length)
iris_std$Sepal.Width <- min_max_scale(iris_std$Sepal.Width)
View(iris_std)

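There are four numeric variables to rescale, and copying and pasting three times is out of the question, so we rewrite the calls as a loop: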

# Declare the function
min_max_scale <- function(x) {
  x_min <- min(x, na.rm = TRUE)
  x_max <- max(x, na.rm = TRUE)
  output <- (x - x_min) / (x_max - x_min)
  return(output)
}

# Call the function
iris_std <- iris
for (i in 1:4) {
  iris_std[, i] <- min_max_scale(iris_std[, i])
}

Besides applying min_max_scale() to iris_std with a loop, is there another way to get this done? We can use lapply() or map() from the purrr package; both return the rescaled vectors stored in a list:

library(purrr)

# Declare the function
min_max_scale <- function(x) {
  x_min <- min(x, na.rm = TRUE)
  x_max <- max(x, na.rm = TRUE)
  output <- (x - x_min) / (x_max - x_min)
  return(output)
}

# Call the function
lapply(iris[, 1:4], FUN = min_max_scale)
map(iris[, 1:4], .f = min_max_scale)
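
Because both calls return a list, the result still has to be put back into a data frame if we want the same outcome as the loop. One possible way, sketched here with base R only, is to assign the list to the matching columns; this is an illustration rather than the approach taken in the course.

```r
# Same function as above, repeated so this snippet is self-contained
min_max_scale <- function(x) {
  x_min <- min(x, na.rm = TRUE)
  x_max <- max(x, na.rm = TRUE)
  output <- (x - x_min) / (x_max - x_min)
  return(output)
}

iris_std <- iris
# Assigning a list of equal-length vectors replaces the selected columns
iris_std[1:4] <- lapply(iris[1:4], min_max_scale)

# Each rescaled column should now run from 0 to 1
sapply(iris_std[1:4], range)
```

The Species column is left untouched because only the first four columns are selected on both sides of the assignment.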

The purrr package

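purrr is a package developed by Hadley. Its goal is to reduce for loops as much as possible and to give functional programming fuller support, while also offering an alternative to the apply family of functions. Besides map(), which we just used, its introductory functions include walk(), which is meant for functions called for their side effects, such as drawing plots. Suppose we want to draw histograms of the four numeric variables in the built-in iris data to inspect their distributions. With a loop this can be implemented as follows: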

par(mfrow = c(2, 2))
plot_title <- "Hist of"
iris_vars <- names(iris)[-5]
plot_titles <- paste(plot_title, iris_vars)
for (i in 1:4) {
  hist(iris[, i], xlab = "", col = rgb(1, 0, 0, 0.4), main = plot_titles[i])
}

With walk() the same task is written like this:

library(purrr)
walk(iris[, 1:4], .f = hist, col = rgb(1, 0, 0, 0.4), xlab = "")
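
To see what walk() does without opening a graphics device, the sketch below swaps hist() for a small printing helper. `report_mean` is a hypothetical function introduced only for this illustration. walk() calls it once per column for its side effect and then returns its input invisibly.

```r
library(purrr)

# Hypothetical helper: prints the mean of a vector as a side effect
report_mean <- function(x) {
  cat("mean:", round(mean(x), 2), "\n")
}

# walk() runs report_mean on each column and discards its return values
returned <- walk(iris[, 1:4], report_mean)

# The value of walk() itself is the original input
identical(returned, iris[, 1:4])
```

This is the practical difference from map(): map() collects what the function returns, while walk() is used when only the side effect matters.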

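Next we use pwalk() to fix the unsightly titles and give each variable its own color. Note that the data, the title argument and the color argument must be wrapped together in a single list: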

plot_title <- "Hist of"
iris_vars <- names(iris)[-5]
plot_titles <- paste(plot_title, iris_vars)
hist_colors <- c(
  rgb(1, 0, 0, 0.4),
  rgb(0, 1, 0, 0.4),
  rgb(0, 0, 1, 0.4),
  rgb(1, 0.5, 0, 0.4)
)
pwalk(list(iris[, 1:4], main = plot_titles, col = hist_colors), .f = hist, xlab = "")
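
The way pwalk() pairs up the elements of that list can be sketched without plotting. In the example below, `describe_var` and `plot_args` are hypothetical names used only for illustration; the names inside the list are matched to the arguments of the function, and the i-th element of each entry is passed together in the i-th call.

```r
library(purrr)

# Hypothetical stand-in for hist(): reports the arguments it receives
describe_var <- function(x, main, col) {
  cat(main, "| color:", col, "| n =", length(x), "\n")
}

# One entry per argument; every entry has the same length (two here)
plot_args <- list(
  x = as.list(iris[, 1:2]),
  main = c("Hist of Sepal.Length", "Hist of Sepal.Width"),
  col = c("red", "blue")
)

# Equivalent to calling describe_var() twice, once per position
pwalk(plot_args, .f = describe_var)
```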

Exception handling

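The min_max_scale() function can only handle numeric vectors as input. If we had not excluded the Species variable of the built-in iris data in the earlier examples, we would get the error 'min' not meaningful for factors, because computing a minimum makes no sense for a factor: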

# Declare the function
min_max_scale <- function(x) {
  x_min <- min(x, na.rm = TRUE)
  x_max <- max(x, na.rm = TRUE)
  output <- (x - x_min) / (x_max - x_min)
  return(output)
}

# Call the function
lapply(iris, FUN = min_max_scale)
map(iris, .f = min_max_scale)

We can handle this situation in two ways. The first is to add tryCatch() to the body of the function declaration:

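# Declare the function
min_max_scale <- function(x) {
  tryCatch({
    x_min <- min(x, na.rm = TRUE)
    x_max <- max(x, na.rm = TRUE)
    output <- (x - x_min) / (x_max - x_min)
    return(output)
  }, error = function(e){
    # Return a message instead of stopping when x cannot be rescaled
    return("x must be a numeric variable")
  })
}

# Call the function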
lapply(iris, FUN = min_max_scale)
map(iris, .f = min_max_scale)

The other approach is the safely() function provided by purrr, which wraps our original function with exception handling:

library(purrr)

# Declare the function
min_max_scale <- function(x) {
  x_min <- min(x, na.rm = TRUE)
  x_max <- max(x, na.rm = TRUE)
  output <- (x - x_min) / (x_max - x_min)
  return(output)
}

# Call safely() to add exception handling to the function
safe_min_max_scale <- safely(min_max_scale)

# Call the function
lapply(iris, FUN = safe_min_max_scale)
map(iris, .f = safe_min_max_scale)
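
A function wrapped by safely() always returns a list with two elements, result and error, and one of them is NULL. The sketch below shows one possible way to inspect that structure and keep only the columns that were rescaled successfully; the variable names `results`, `succeeded` and `scaled` are chosen for illustration.

```r
library(purrr)

min_max_scale <- function(x) {
  x_min <- min(x, na.rm = TRUE)
  x_max <- max(x, na.rm = TRUE)
  output <- (x - x_min) / (x_max - x_min)
  return(output)
}

safe_min_max_scale <- safely(min_max_scale)
results <- map(iris, safe_min_max_scale)

# Every element has the same two slots
names(results$Species)

# TRUE where no error was captured
succeeded <- map_lgl(results, function(r) is.null(r$error))

# Pull the result slot out of the successful elements only
scaled <- map(results[succeeded], "result")
names(scaled)
```

Compared with the tryCatch() version, the error is kept as a condition object next to the result instead of being replaced by a string, so successes and failures can be separated afterwards.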

A few things to keep in mind

In the course, Hadley and Charlotte also stress a few points to keep in mind when writing functions:

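Solve the problem the usual way first, then solve it again by writing a function
Name functions with verbs and separate multiple words with underscores, so that users can understand them easily
Name arguments with nouns; put data arguments first, then detail arguments, and remember to give detail arguments default values
Get to know R's coding style
The primary goal of writing a function is to solve our problem, not to produce pretty, concise code; don't feel bad about using a for loop
In the short term, try to solve the easier 80% of the problem with functions first; at this stage it will feel laborious and slow
In the long term, you will be able to solve 99% of the problem with functions; by then it will feel easy and fast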
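
The naming advice above can be illustrated with a variant of the rescaling function. This is only a sketch: the name `scale_min_max` and the argument `na_rm` are hypothetical choices made for this example, not names used in the course.

```r
# Verb-first, underscore-separated function name.
# x is the data argument and comes first; na_rm is a detail argument
# and therefore gets a default value.
scale_min_max <- function(x, na_rm = TRUE) {
  x_range <- range(x, na.rm = na_rm)
  (x - x_range[1]) / (x_range[2] - x_range[1])
}

# The default covers the common case
scale_min_max(c(1, 3, 5))

# The detail argument can still be overridden when needed
scale_min_max(c(1, NA, 5), na_rm = FALSE)
```

With na_rm = FALSE the missing value propagates into the range, so every element of the result is NA; keeping the default TRUE reproduces the behavior of min_max_scale().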

Further reading

R Functions

This course will teach you the fundamentals of writing functions in R so that you can make your code more readable…

www.datacamp.com

Hadley Wickham

Hi! I'm Hadley Wickham, Chief Scientist at RStudio, and an Adjunct Professor of Statistics at the University of…

hadley.nz

Charlotte Wickham

Part-time Assistant Professor of Statistics at Oregon State University, specialist in R training and course developer…

cwick.co.nz

Functions · Advanced R.

Functions are a fundamental building block of R: to master many of the more advanced techniques in this book, you need…

adv-r.had.co.nz

Style guide · Advanced R.

Good style is important because while your code only has one author, it'll usually have multiple readers. This is…

adv-r.had.co.nz
