R on Hadoop streaming example

Deon
Nov 21, 2015 · 5 min read

Version info
R 3.2.1
Hadoop 2.6.0

Before starting, confirm that every node is working correctly (for example, by running the standard Hadoop word count example).

Hadoop streaming passes records between stages through stdin and stdout, so the mapper and reducer can be implemented in almost any language. For example, the mapper can be a Java class or any executable (e.g. an R or Python script).
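Conceptually, a streaming job behaves much like a Unix pipeline: the framework feeds input lines to the mapper's stdin, sorts the mapper's output by key, and feeds the sorted stream to the reducer's stdin. As a rough local simulation of that data flow (not the real job, just a sketch, with input.txt standing in for your data):

cat input.txt | Rscript mapper.R | sort | Rscript reducer.R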

1. First prepare mapper.R. Note that the first line must be a shebang telling the system how to execute the script; which Rscript will locate the interpreter path.
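For example (the path below is only an illustration; it varies by system):

$ which Rscript
/usr/bin/Rscript

With that you could write #!/usr/bin/Rscript as the first line, or use the portable form #!/usr/bin/env Rscript as in the listing below.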

#! /usr/bin/env Rscript
# mapper.R - Wordcount program in R
# script for Mapper (R-Hadoop integration)

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))

# read one line at a time from stdin until EOF
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  line <- trimWhiteSpace(line)
  words <- splitIntoWords(line)
  # emit "word<TAB>1" for every word on the line
  for (w in words)
    cat(w, "\t1\n", sep = "")
}
close(con)
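A quick local check of the mapper alone (the sample line is arbitrary):

$ echo "hello hadoop hello" | Rscript mapper.R
hello	1
hadoop	1
hello	1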

2. Next, reducer.R is as follows:

#! /usr/bin/env Rscript
# reducer.R - Wordcount program in R
# script for Reducer (R-Hadoop integration)
trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
# split a "word<TAB>count" record emitted by the mapper
splitLine <- function(line) {
  val <- unlist(strsplit(line, "\t"))
  list(word = val[1], count = as.integer(val[2]))
}
env <- new.env(hash = TRUE)  # environment used as a hash map of word -> count
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  line <- trimWhiteSpace(line)
  split <- splitLine(line)
  word <- split$word
  count <- split$count
  if (exists(word, envir = env, inherits = FALSE)) {
    oldcount <- get(word, envir = env)
    assign(word, oldcount + count, envir = env)
  } else
    assign(word, count, envir = env)
}
close(con)
# emit the final counts, one "word<TAB>count" pair per line
for (w in ls(env, all = TRUE))
  cat(w, "\t", get(w, envir = env), "\n", sep = "")

3. First test that the scripts run correctly on their own:
echo "test hadoop streaming for hadoop" | Rscript mapper.R | Rscript reducer.R
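Given the scripts above, this pipeline should print (keys come out alphabetically because ls() returns the environment's names sorted):

for	1
hadoop	2
streaming	1
test	1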

4. Upload a test txt file to HDFS (on Hadoop 2.x, hdfs dfs is the preferred spelling of these commands; hadoop dfs still works but is deprecated):
hadoop dfs -put /home/stream/input.txt /input

5. Make sure the output path has already been deleted (the job fails if the output directory already exists):
hadoop dfs -rm -r /output

Things to note when running Hadoop streaming:
1. File paths and permissions: make sure the scripts are readable and executable (see the chmod example after this list).
2. Use -file (or -files) to ship the scripts to every node.
3. -input and -output refer to HDFS paths.
4. Watch the scripts' character encoding and line endings.
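For item 1, a typical fix looks like this (the paths match the job command below and are otherwise illustrative):

chmod +x /home/hduser/stream/mapper.R /home/hduser/stream/reducer.R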

hadoop jar /opt/hadoop-2.6.0/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
-file /home/hduser/stream/mapper.R \
-mapper mapper.R \
-file /home/hduser/stream/reducer.R \
-reducer reducer.R \
-input /input \
-output /output

You can then inspect the result with hadoop dfs -cat /output/part-00000

Common errors
1. Permission denied: use chmod/chown to grant the proper permissions on the files.
2. No such file Rscript: use which Rscript to find the interpreter path and put it in the script's first line (the shebang).
3. cannot run mapper.R or No such file ./mapper.R…: a script saved on Windows has different line endings than Linux; the stray carriage returns are read through stdin and break the script (see the fix below).
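A common remedy for item 3 is to strip the Windows carriage returns before submitting the job, for example with dos2unix, or with sed if dos2unix is not installed:

dos2unix mapper.R reducer.R
# or, equivalently:
sed -i 's/\r$//' mapper.R reducer.R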
