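Exploring Household Electricity Meter Data
Week 1 assignment of Johns Hopkins Exploratory Data Analysis: Electric Power Consumption
The data for this assignment is one household's electricity usage over about four years, recorded once per minute. We have two problems to solve: reading the data and plotting it. The plots are what the assignment mainly grades.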
Assignment source
Many people have gotten jobs in machine learning just by completing that MOOC. There’re other similar online courses that help; for example the John Hopkins Data Science specialization.
Andrew Ng answering How should you start a career in Machine Learning? on Quora
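The Data Science Specialization that Johns Hopkins University offers on Coursera is thorough and high in quality, but finishing all ten of its courses on your own is not easy for a beginner. The programming assignments are considerably harder than the lectures or the swirl exercises, so it is easy to lose your initial enthusiasm as soon as one comes due.
Let's work through the course projects and assignments together with DataInPoint. Today we tackle the Week 1 assignment of the fourth course, Exploratory Data Analysis: Electric Power Consumption.
The assignment must be submitted as a GitHub repository; see this example link: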
Overview
Measurements of electric power consumption in one household with a one-minute sampling rate over a period of almost 4 years. Different electrical quantities and some sub-metering values are available.
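Reading the data
A few things to keep in mind when reading the data:
- The dataset has 2,075,259 observations and 9 variables
- Only the data for 2007-02-01 and 2007-02-02 need to be read
- The dataset has a date variable and a time variable, which can be converted with the strptime() and as.Date() functions
- Missing values are recorded as ?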
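To make the conversion notes above concrete, here is a minimal sketch on a two-row sample. The sample text and its values are made up for illustration; only the column layout, the dd/mm/yyyy and hh:mm:ss formats, and the ? marker come from the dataset description. The tz = "UTC" setting is an assumption, chosen only so the result does not depend on the machine's time zone.

```r
# Made-up two-row sample in the file's layout (not real readings)
sample_txt <- "Date;Time;Global_active_power
1/2/2007;00:00:00;0.326
1/2/2007;00:01:00;?"

# na.strings = "?" turns the ? marker into NA, so the column stays numeric
sample_df <- read.table(text = sample_txt, sep = ";", header = TRUE,
                        na.strings = "?", stringsAsFactors = FALSE)

# as.Date() handles the date alone; strptime() parses date and time together
sample_date <- as.Date(sample_df$Date, format = "%d/%m/%Y")
sample_datetime <- strptime(paste(sample_df$Date, sample_df$Time),
                            format = "%d/%m/%Y %H:%M:%S", tz = "UTC")

str(sample_df$Global_active_power)  # numeric, second value is NA
format(sample_date)
format(sample_datetime)
```

Without na.strings = "?", the power column would be read as character and would need a separate conversion step.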
The descriptions of the 9 variables are also useful:
- Date: Date in format dd/mm/yyyy
- Time: time in format hh:mm:ss
- Global_active_power: household global minute-averaged active power (in kilowatt)
- Global_reactive_power: household global minute-averaged reactive power (in kilowatt)
- Voltage: minute-averaged voltage (in volt)
- Global_intensity: household global minute-averaged current intensity (in ampere)
- Sub_metering_1: energy sub-metering №1 (in watt-hour of active energy). It corresponds to the kitchen, containing mainly a dishwasher, an oven and a microwave (hot plates are not electric but gas powered).
- Sub_metering_2: energy sub-metering №2 (in watt-hour of active energy). It corresponds to the laundry room, containing a washing-machine, a tumble-drier, a refrigerator and a light.
- Sub_metering_3: energy sub-metering №3 (in watt-hour of active energy). It corresponds to an electric water-heater and an air-conditioner.
Next, look at the contents of the TXT file in a text editor:
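From this we can see that the data for 2007-02-01 and 2007-02-02 begin on line 66,638 of the file (the header is line 1). The two days contain 60 x 24 x 2 = 2,880 minutes, so we should read 2,880 rows. A few other details also matter:
- Variables are separated by ;
- Missing values are recorded as ?
- Variable names are read separately from the first line
- Parse the date first, then combine it with the time and convert the type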
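The row arithmetic above can be checked with a short calculation. This sketch assumes the file's first record is stamped 16/12/2006 17:24:00 (an assumption; confirm it against the first data line of your copy) and that no minutes are missing from the file before February 2007.

```r
# Assumed timestamp of the first record in the file (verify in your copy)
first_record <- as.POSIXct("2006-12-16 17:24:00", tz = "UTC")
window_start <- as.POSIXct("2007-02-01 00:00:00", tz = "UTC")

# One row per minute, so rows before the window = minutes elapsed
rows_before <- as.numeric(difftime(window_start, first_record, units = "mins"))

lines_to_skip <- rows_before + 1  # plus the header line
rows_to_read <- 60 * 24 * 2       # two full days of minutes

c(skip = lines_to_skip, nrows = rows_to_read)
```

Under those assumptions the two values come out to 66637 and 2880, which are the skip and nrows arguments passed to read.table().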
With that, we can define a get_data() function that reads these 2,880 rows into R:
library(magrittr)

# get_data
get_data <- function(dest_file, ex_dir) {
  data_url <- "https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2Fhousehold_power_consumption.zip"
  download.file(data_url, destfile = dest_file) # download the zip archive
  unzip(dest_file, exdir = ex_dir) # extract it
  txt_path <- paste0(ex_dir, "/household_power_consumption.txt")
  df_header <- txt_path %>%
    readLines(n = 1) %>%
    strsplit(split = ";") %>%
    unlist() # get the variable names from the first line
  # Skip the header plus every row before 2007-02-01, then read two days of rows
  df <- read.table(txt_path, sep = ";", na.strings = "?",
                   skip = 66637, nrows = 2880,
                   stringsAsFactors = FALSE, col.names = df_header)
  df$Date <- as.Date(df$Date, format = "%d/%m/%Y") # convert to Date
  df$DateTime <- paste(df$Date, df$Time) %>%
    as.POSIXct() # convert to date-time
  return(df)
}

household_power_consumption <- get_data(dest_file = "~/Downloads/household_power_consumption.zip", ex_dir = "~/ExData_Plotting1")
View(household_power_consumption)
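Hard-coded offsets are fast but brittle if the file ever changes. As an alternative sketch (not the approach used above), one could read the rows and keep the two days by value. The sample text here is invented for illustration; on the real file you would pass the path instead of text =, at the cost of reading every row into memory first.

```r
# Invented four-row sample spanning the edges of the two-day window
edge_txt <- "Date;Time;Voltage
31/1/2007;23:59:00;241.1
1/2/2007;00:00:00;243.2
2/2/2007;23:59:00;240.4
3/2/2007;00:00:00;239.9"

edge_df <- read.table(text = edge_txt, sep = ";", header = TRUE,
                      na.strings = "?", stringsAsFactors = FALSE)
edge_df$Date <- as.Date(edge_df$Date, format = "%d/%m/%Y")

# Keep only the rows whose date falls on one of the two wanted days
wanted_days <- as.Date(c("2007-02-01", "2007-02-02"))
two_days <- edge_df[edge_df$Date %in% wanted_days, ]

two_days
```

The same filter also works as a sanity check after a skip/nrows read: every Date in the result should be one of the two wanted days.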
Plotting
Our overall goal here is simply to examine how household energy usage varies over a 2-day period in February, 2007. Your task is to reconstruct the following plots below, all of which were constructed using the base plotting system.
The plotting part has four sub-tasks. Our goal is to use the base plotting system to reproduce the plots shown in the assignment and save each one as a 480 x 480 PNG file.
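Plot 1
A few things to note for Plot 1:
- plot1.R must include the code that reads the data
- Use the hist() function to draw the histogram
- Set the col argument to make the bars red
- Set the main argument to add a chart title
- Set the xlab argument to change the x-axis label
- Use the png() function to write the plot to a file; its default width and height are 480 pixels, so the defaults are fine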
# plot1.R
library(magrittr)

# get_data
get_data <- function(dest_file, ex_dir) {
  data_url <- "https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2Fhousehold_power_consumption.zip"
  download.file(data_url, destfile = dest_file) # download the zip archive
  unzip(dest_file, exdir = ex_dir) # extract it
  txt_path <- paste0(ex_dir, "/household_power_consumption.txt")
  df_header <- txt_path %>%
    readLines(n = 1) %>%
    strsplit(split = ";") %>%
    unlist() # get the variable names from the first line
  # Skip the header plus every row before 2007-02-01, then read two days of rows
  df <- read.table(txt_path, sep = ";", na.strings = "?",
                   skip = 66637, nrows = 2880,
                   stringsAsFactors = FALSE, col.names = df_header)
  df$Date <- as.Date(df$Date, format = "%d/%m/%Y") # convert to Date
  df$DateTime <- paste(df$Date, df$Time) %>%
    as.POSIXct() # convert to date-time
  return(df)
}

plot1 <- function(df) {
  png(filename = "~/ExData_Plotting1/plot1.png") # defaults to 480 x 480 pixels
  par(bg = NA) # transparent background
  hist(df$Global_active_power, col = "red", main = "Global Active Power",
       xlab = "Global Active Power (kilowatts)", bg = "transparent")
  dev.off()
}

household_power_consumption <- get_data(dest_file = "~/Downloads/household_power_consumption.zip", ex_dir = "~/ExData_Plotting1")
plot1(household_power_consumption)
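To see what hist() computes before it draws anything, a small sketch with plot = FALSE returns the bins without opening a graphics device. The values below are invented stand-ins for Global_active_power, including one NA to show that missing values are dropped from the counts.

```r
# Invented stand-in values (not real readings), with one missing value
power <- c(0.3, 0.4, 1.2, 1.4, 2.6, NA, 3.1, 0.2)

# plot = FALSE returns the histogram object instead of drawing it
h <- hist(power, plot = FALSE)

h$breaks  # bin edges chosen by hist()
h$counts  # number of values falling in each bin

# Every non-missing value lands in exactly one bin
sum(h$counts) == sum(!is.na(power))
```

Inspecting breaks and counts this way is a quick check that the bars in the PNG reflect the data you expect.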
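Plot 2
A few things to note for Plot 2:
- Use the plot() function to draw a line chart
- Set type = "l" to draw lines
- Set the main argument if a chart title is wanted (the code below leaves it unset)
- Set the xlab argument to change the x-axis label and the ylab argument to change the y-axis label
- Use the png() function to write the plot to a file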
# plot2.R
plot2 <- function(df) {
  png(filename = "~/ExData_Plotting1/plot2.png")
  par(bg = NA)
  plot(x = df$DateTime, y = df$Global_active_power, type = "l",
       xlab = "", ylab = "Global Active Power (kilowatts)", bg = "transparent")
  dev.off()
}

plot2(household_power_consumption)
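One detail worth checking on a non-English system: when the x-axis of a date-time plot shows weekday abbreviations, they follow the LC_TIME locale. The sketch below temporarily switches LC_TIME to "C" so the weekday names come out in English. Whether this is needed for your submission is an assumption; it depends on your system locale.

```r
# Remember the current setting so it can be restored afterwards
old_lc_time <- Sys.getlocale("LC_TIME")
Sys.setlocale("LC_TIME", "C")

# Midnight at the start of the two days, plus the day after
day_starts <- as.POSIXct("2007-02-01 00:00:00", tz = "UTC") + 86400 * 0:2

# Weekday abbreviations as they would be formatted for axis labels
weekday_labels <- format(day_starts, "%a")
weekday_labels

# Restore the original locale
Sys.setlocale("LC_TIME", old_lc_time)
```

In practice the Sys.setlocale() call would go before the plot() call, with the restore after dev.off().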
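Plot 3
A few things to note for Plot 3:
- Use the plot() function to draw Sub_metering_1 in black
- Use the lines() function to add Sub_metering_2 in red and Sub_metering_3 in blue
- Use the legend() function to add a legend, adjusting the cex argument so the legend is not too large and placing it in the top-right corner with "topright"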
# plot3.R
plot3 <- function(df) {
  png(filename = "~/ExData_Plotting1/plot3.png")
  par(bg = NA)
  plot(x = df$DateTime, y = df$Sub_metering_1, type = "l", col = "black",
       xlab = "", ylab = "Energy sub metering", bg = "transparent")
  lines(x = df$DateTime, y = df$Sub_metering_2, col = "red")
  lines(x = df$DateTime, y = df$Sub_metering_3, col = "blue")
  legend("topright",
         legend = c("Sub_metering_1", "Sub_metering_2", "Sub_metering_3"),
         col = c("black", "red", "blue"), lty = 1, cex = 0.9)
  dev.off()
}

plot3(household_power_consumption)
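Because plot() sets the y-axis from the first series only, a later series with a higher peak would be clipped. The code above may well be fine for these two days; the following is only a defensive sketch that computes a shared ylim first. The data frame is invented for illustration, and pdf(file = NULL) is used so nothing is written to disk.

```r
# Invented values where Sub_metering_3 peaks higher than Sub_metering_1
sub_df <- data.frame(
  Sub_metering_1 = c(0, 1, 2, 1),
  Sub_metering_2 = c(0, 5, 1, 0),
  Sub_metering_3 = c(17, 18, NA, 19)
)
series <- c("Sub_metering_1", "Sub_metering_2", "Sub_metering_3")

# Shared y range across all three series, ignoring missing values
y_range <- range(unlist(sub_df[series]), na.rm = TRUE)

pdf(file = NULL)  # null device, for demonstration only
plot(sub_df$Sub_metering_1, type = "l", col = "black", ylim = y_range,
     xlab = "", ylab = "Energy sub metering")
lines(sub_df$Sub_metering_2, col = "red")
lines(sub_df$Sub_metering_3, col = "blue")
invisible(dev.off())

y_range
```

Passing ylim = y_range to the first plot() call is the only change needed; the lines() calls stay as they are.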
Plot 4
A few things to note for Plot 4:
- Use par(mfrow = c(2, 2)) to split the canvas into a 2 x 2 grid
- Place the plots into the four cells in order
# plot4.R
plot4 <- function(df) {
  png(filename = "~/ExData_Plotting1/plot4.png")
  par(bg = NA)
  par(mfrow = c(2, 2))
  plot(x = df$DateTime, y = df$Global_active_power, ylab = "Global Active Power",
       xlab = "", type = "l", bg = "transparent")
  plot(x = df$DateTime, y = df$Voltage, ylab = "Voltage",
       xlab = "datetime", type = "l", bg = "transparent")
  plot(x = df$DateTime, y = df$Sub_metering_1, type = "l", col = "black",
       xlab = "", ylab = "Energy sub metering", bg = "transparent")
  lines(x = df$DateTime, y = df$Sub_metering_2, col = "red")
  lines(x = df$DateTime, y = df$Sub_metering_3, col = "blue")
  legend("topright",
         legend = c("Sub_metering_1", "Sub_metering_2", "Sub_metering_3"),
         col = c("black", "red", "blue"), lty = 1, cex = 0.9, bty = "n")
  plot(x = df$DateTime, y = df$Global_reactive_power, ylab = "Global_reactive_power",
       xlab = "datetime", type = "l", bg = "transparent")
  dev.off()
}

plot4(household_power_consumption)
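The order in which the four panels fill the grid follows from mfrow, which fills row by row. As a small sketch, the loop below opens four empty panels on a null device and records the grid cell each one lands in, using par("mfg"). The positions matrix is a name introduced here for illustration.

```r
pdf(file = NULL)  # null device: nothing is written to disk
par(mfrow = c(2, 2))

# Record the (row, col) cell of each panel as it is started
positions <- matrix(NA_real_, nrow = 4, ncol = 2,
                    dimnames = list(NULL, c("row", "col")))
for (i in 1:4) {
  plot.new()                          # start the next panel
  positions[i, ] <- par("mfg")[1:2]   # cell currently being drawn
}
invisible(dev.off())

positions
```

With mfrow the second panel lands in the top-right cell; par(mfcol = c(2, 2)) would instead fill column by column and put it in the bottom-left cell.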
Further reading
If the R code in this article felt difficult, we recommend DataCamp's R data visualization course: