Exploring Household Electricity Meter Data

The Week 1 assignment of Johns Hopkins Exploratory Data Analysis: Electric Power Consumption

Yao-Jen Kuo
數聚點文摘


Photo by Yung Chang on Unsplash

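The data for this assignment is four years of electricity usage records from a single household, one record per minute. We have two problems to solve: reading the data and plotting it. Plotting is the part of the assignment that is mainly graded.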

Assignment source

Johns Hopkins University Data Science Specialization: Exploratory Data Analysis

Many people have gotten jobs in machine learning just by completing that MOOC. There’re other similar online courses that help; for example the John Hopkins Data Science specialization.

Andrew Ng answering How should you start a career in Machine Learning? on Quora

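The Data Science Specialization that Johns Hopkins University offers on Coursera is a very complete, high-quality program, but finishing all ten of its courses on your own is not easy for a beginner. The Programming Assignments are considerably harder than the lectures or the swirl exercises, so the enthusiasm you started with can easily fizzle out when one of them comes due.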

Let's work through the course projects and assignments together with DataInPoint. Today we tackle the Week 1 assignment of the fourth course, Exploratory Data Analysis: Electric Power Consumption.

The assignment must be submitted as a GitHub repository; see this example link:

Overview

Measurements of electric power consumption in one household with a one-minute sampling rate over a period of almost 4 years. Different electrical quantities and some sub-metering values are available.

Reading the data

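There are a few things to keep in mind when reading the data:

  • The dataset has 2,075,259 observations and 9 variables
  • Only the data for the two days 2007-02-01 and 2007-02-02 needs to be read
  • The dataset has separate date and time variables, which can be converted with the strptime() and as.Date() functions
  • Missing values are recorded as ?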
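
The points about ? and the date conversion can be tried on a tiny example before touching the full file. The sketch below parses two made-up rows laid out like the real file; the values, the reduced column set, the names sample_txt and sample_df, and the choice of tz = "UTC" are all illustrative assumptions.

```r
# Two made-up rows in the same layout as the real file (illustrative values only)
sample_txt <- "Date;Time;Global_active_power
1/2/2007;00:00:00;0.326
1/2/2007;00:01:00;?"

# na.strings = "?" turns the question mark into NA, so the column stays numeric
sample_df <- read.table(text = sample_txt, sep = ";", header = TRUE,
                        na.strings = "?", stringsAsFactors = FALSE)

# as.Date() converts the dd/mm/yyyy date column on its own
sample_df$Date <- as.Date(sample_df$Date, format = "%d/%m/%Y")

# strptime() parses date and time together; as.POSIXct() stores the result
# in a form that fits in a data frame column
sample_df$DateTime <- as.POSIXct(
  strptime(paste(sample_df$Date, sample_df$Time),
           format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
)

str(sample_df)
```

If the ? were not declared as a missing value, read.table() would read the whole column as character instead of numeric.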

The following information about the 9 variables is also a useful reference:

  1. Date: Date in format dd/mm/yyyy
  2. Time: time in format hh:mm:ss
  3. Global_active_power: household global minute-averaged active power (in kilowatt)
  4. Global_reactive_power: household global minute-averaged reactive power (in kilowatt)
  5. Voltage: minute-averaged voltage (in volt)
  6. Global_intensity: household global minute-averaged current intensity (in ampere)
  7. Sub_metering_1: energy sub-metering №1 (in watt-hour of active energy). It corresponds to the kitchen, containing mainly a dishwasher, an oven and a microwave (hot plates are not electric but gas powered).
  8. Sub_metering_2: energy sub-metering №2 (in watt-hour of active energy). It corresponds to the laundry room, containing a washing-machine, a tumble-drier, a refrigerator and a light.
  9. Sub_metering_3: energy sub-metering №3 (in watt-hour of active energy). It corresponds to an electric water-heater and an air-conditioner.

Next, open the TXT file in a text editor and take a look at its contents:

Contents of the TXT file

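Looking at the file, we can see that the data for 2007-02-01 and 2007-02-02 begins on line 66,638 of the file (counting the header as line 1). The two days contain 60 x 24 x 2 = 2,880 minutes, so we should read 2,880 rows. A few other details also need attention:

  • The field separator is ;
  • Missing values are ?
  • The variable names are read separately from the first line
  • Parse the date first, then combine it with the time and convert the result to a date-time type

We can then define a get_data() function that reads these 2,880 rows into R: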
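
The line and row counts above can be reproduced with a little date arithmetic. This sketch assumes that the first observation in the file is stamped 16/12/2006 17:24:00 and that no minute is missing from the file. Neither is stated in the text, so treat both as assumptions to check against your own copy; the variable names are made up for illustration.

```r
# Assumption: first observation in the file (not stated in the article)
first_obs <- as.POSIXct("2006-12-16 17:24:00", tz = "UTC")
target_start <- as.POSIXct("2007-02-01 00:00:00", tz = "UTC")

# One row per minute, so the minutes elapsed equal the observations to skip
obs_before <- as.numeric(difftime(target_start, first_obs, units = "mins"))

# The header occupies one more line at the top of the file
lines_to_skip <- obs_before + 1

# Two full days at one row per minute
rows_to_read <- 60 * 24 * 2

c(skip = lines_to_skip, nrows = rows_to_read)
```

Under these assumptions the result matches the skip and nrows values passed to read.table() in get_data().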

library(magrittr)

# get_data: download, unzip, and read the two days of data
get_data <- function(dest_file, ex_dir) {
  data_url <- "https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2Fhousehold_power_consumption.zip"
  download.file(data_url, destfile = dest_file) # download the zip archive
  unzip(dest_file, exdir = ex_dir) # extract it
  txt_path <- file.path(ex_dir, "household_power_consumption.txt")
  df_header <- txt_path %>%
    readLines(n = 1) %>%
    strsplit(split = ";") %>%
    unlist() # get the variable names
  df <- read.table(txt_path, sep = ";", na.strings = "?", skip = 66637, nrows = 2880,
                   stringsAsFactors = FALSE, col.names = df_header) # read the data
  df$Date <- as.Date(df$Date, format = "%d/%m/%Y") # convert to Date
  df$DateTime <- paste(df$Date, df$Time) %>%
    as.POSIXct() # convert to date-time (POSIXct)
  return(df)
}
household_power_consumption <- get_data(dest_file = "~/Downloads/household_power_consumption.zip", ex_dir = "~/ExData_Plotting1")
View(household_power_consumption)

The TXT data successfully read into a data frame
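
Because the starting line is hard-coded, it may be worth confirming that the result really covers the two target days. The helper below, check_two_days(), is a hypothetical addition and not part of the assignment. It is demonstrated on a synthetic stand-in so that it runs without downloading anything.

```r
# Hypothetical helper: stop with an error unless df holds exactly the
# 2,880 one-minute rows of 2007-02-01 and 2007-02-02
check_two_days <- function(df) {
  stopifnot(nrow(df) == 2880)
  stopifnot(setequal(format(unique(df$Date)), c("2007-02-01", "2007-02-02")))
  invisible(TRUE)
}

# Synthetic stand-in for the real data frame: one timestamp per minute
demo_time <- seq(as.POSIXct("2007-02-01 00:00:00", tz = "UTC"),
                 by = "min", length.out = 2880)
demo_df <- data.frame(Date = as.Date(demo_time), DateTime = demo_time)

check_ok <- check_two_days(demo_df)
```

With the real data, the call would be check_two_days(household_power_consumption).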

Plotting

Our overall goal here is simply to examine how household energy usage varies over a 2-day period in February, 2007. Your task is to reconstruct the following plots below, all of which were constructed using the base plotting system.

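The plotting part consists of four tasks. Our goal is to reproduce the plots in the assignment with the base plotting system and save each one as a 480 x 480 PNG file.

Plot 1

A few things to note for Plot 1:

  • plot1.R must include the code that reads the data
  • Use the hist() function to draw the histogram
  • Set the col argument to make the bars red
  • Set the main argument to add the plot title
  • Set the xlab argument to change the x-axis label
  • Use the png() function to write the plot to a file; its default width and height are 480 pixels, so the defaults are fine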
# plot1.R
library(magrittr)

# get_data: download, unzip, and read the two days of data
get_data <- function(dest_file, ex_dir) {
  data_url <- "https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2Fhousehold_power_consumption.zip"
  download.file(data_url, destfile = dest_file) # download the zip archive
  unzip(dest_file, exdir = ex_dir) # extract it
  txt_path <- file.path(ex_dir, "household_power_consumption.txt")
  df_header <- txt_path %>%
    readLines(n = 1) %>%
    strsplit(split = ";") %>%
    unlist() # get the variable names
  df <- read.table(txt_path, sep = ";", na.strings = "?", skip = 66637, nrows = 2880,
                   stringsAsFactors = FALSE, col.names = df_header) # read the data
  df$Date <- as.Date(df$Date, format = "%d/%m/%Y") # convert to Date
  df$DateTime <- paste(df$Date, df$Time) %>%
    as.POSIXct() # convert to date-time (POSIXct)
  return(df)
}

# plot1: red histogram of Global_active_power, written to plot1.png
plot1 <- function(df) {
  png(filename = "~/ExData_Plotting1/plot1.png")
  par(bg = NA) # transparent background
  hist(df$Global_active_power, col = "red", main = "Global Active Power",
       xlab = "Global Active Power (kilowatts)", bg = "transparent")
  dev.off()
}
household_power_consumption <- get_data(dest_file = "~/Downloads/household_power_consumption.zip", ex_dir = "~/ExData_Plotting1")
plot1(household_power_consumption)
plot1.png
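
The statement that png() defaults to 480 x 480 pixels can be checked from the function's own argument list. The sketch below does that, then writes a histogram of made-up values to a temporary file. The data (fake_power) and the output path are placeholders, and the plotting step is skipped if the R build has no PNG support.

```r
# Default width and height of png(), read from its argument list
png_defaults <- formals(grDevices::png)[c("width", "height")]
print(png_defaults)

# Made-up stand-in for Global_active_power
set.seed(1)
fake_power <- rexp(200, rate = 1)

out_file <- tempfile(fileext = ".png")
if (capabilities("png")) {
  png(filename = out_file) # no width or height given, so the defaults apply
  hist(fake_power, col = "red", main = "Global Active Power",
       xlab = "Global Active Power (kilowatts)")
  invisible(dev.off())
}
```

Passing width = 480, height = 480 explicitly would give the same file and makes the intent visible to a reviewer.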

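Plot 2

A few things to note for Plot 2:

  • Use the plot() function to draw the line chart
  • Set type = "l" to draw lines
  • Leave main unset, since the code below adds no title to this plot
  • Set xlab to change the x-axis label and ylab to change the y-axis label
  • Use the png() function to write the plot to a file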
# plot2.R
# plot2: line chart of Global_active_power over time, written to plot2.png
plot2 <- function(df) {
  png(filename = "~/ExData_Plotting1/plot2.png")
  par(bg = NA) # transparent background
  plot(x = df$DateTime, y = df$Global_active_power, type = "l", xlab = "",
       ylab = "Global Active Power (kilowatts)", bg = "transparent")
  dev.off()
}
plot2(household_power_consumption)
plot2.png
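
One detail that can differ from the reference plot is the language of the x-axis labels. When plot() labels a short POSIXct axis with abbreviated weekday names, those names follow the session's LC_TIME locale. A minimal sketch, assuming English labels are wanted, is to switch that locale to "C" around the plotting call. Here format() stands in for the plot so the effect can be seen without a graphics device.

```r
# Remember the current setting so it can be restored afterwards
old_locale <- Sys.getlocale("LC_TIME")

# "C" gives English day and month names on every platform
invisible(Sys.setlocale("LC_TIME", "C"))

# Weekday abbreviations for the two target days and the day after
day_labels <- format(
  as.POSIXct(c("2007-02-01", "2007-02-02", "2007-02-03"), tz = "UTC"),
  "%a"
)
print(day_labels)

# Put the original locale back
invisible(Sys.setlocale("LC_TIME", old_locale))
```

In a plotting script, the png(), plot(), and dev.off() calls would sit between the two Sys.setlocale() calls.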

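Plot 3

A few things to note for Plot 3:

  • Use plot() to draw Sub_metering_1 in black
  • Use lines() to add Sub_metering_2 in red and Sub_metering_3 in blue
  • Use legend() to add the legend, adjusting cex so the legend is not too large and placing it in the top-right corner with "topright"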
# plot3.R
# plot3: the three sub-metering series on one chart, written to plot3.png
plot3 <- function(df) {
  png(filename = "~/ExData_Plotting1/plot3.png")
  par(bg = NA) # transparent background
  plot(x = df$DateTime, y = df$Sub_metering_1, type = "l", col = "black", xlab = "",
       ylab = "Energy sub metering", bg = "transparent")
  lines(x = df$DateTime, y = df$Sub_metering_2, col = "red")
  lines(x = df$DateTime, y = df$Sub_metering_3, col = "blue")
  legend("topright", legend = c("Sub_metering_1", "Sub_metering_2", "Sub_metering_3"),
         col = c("black", "red", "blue"), lty = 1, cex = 0.9)
  dev.off()
}
plot3(household_power_consumption)
plot3.png

Plot 4

A few things to note for Plot 4:

  • Use par(mfrow = c(2, 2)) to split the canvas into a 2 x 2 grid
  • Place the plots into the four grid cells in order
# plot4.R
# plot4: four panels in a 2 x 2 grid, written to plot4.png
plot4 <- function(df) {
  png(filename = "~/ExData_Plotting1/plot4.png")
  par(bg = NA) # transparent background
  par(mfrow = c(2, 2)) # 2 x 2 grid, filled row by row
  # top left
  plot(x = df$DateTime, y = df$Global_active_power, ylab = "Global Active Power",
       xlab = "", type = "l", bg = "transparent")
  # top right
  plot(x = df$DateTime, y = df$Voltage, ylab = "Voltage", xlab = "datetime",
       type = "l", bg = "transparent")
  # bottom left
  plot(x = df$DateTime, y = df$Sub_metering_1, type = "l", col = "black", xlab = "",
       ylab = "Energy sub metering", bg = "transparent")
  lines(x = df$DateTime, y = df$Sub_metering_2, col = "red")
  lines(x = df$DateTime, y = df$Sub_metering_3, col = "blue")
  legend("topright", legend = c("Sub_metering_1", "Sub_metering_2", "Sub_metering_3"),
         col = c("black", "red", "blue"), lty = 1, cex = 0.9, bty = "n")
  # bottom right
  plot(x = df$DateTime, y = df$Global_reactive_power, ylab = "Global_reactive_power",
       xlab = "datetime", type = "l", bg = "transparent")
  dev.off()
}
plot4(household_power_consumption)
plot4.png
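
"In order" here means row by row: with mfrow, each new plot() call takes the next cell from left to right, then top to bottom. The sketch below illustrates this on a null PDF device with placeholder data, querying par("mfg") after each plot to see which cell was just used. The variable names are made up for illustration.

```r
pdf(NULL) # null device: nothing is written to disk
par(mfrow = c(2, 2))

plot(1:10, main = "first")
cell_after_first <- par("mfg")[1:2] # row and column of the cell just drawn

plot(10:1, main = "second")
cell_after_second <- par("mfg")[1:2]

plot(1:10, main = "third")
cell_after_third <- par("mfg")[1:2]

invisible(dev.off())

# One row per plot() call: which grid row and column it landed in
rbind(first = cell_after_first, second = cell_after_second, third = cell_after_third)
```

Using par(mfcol = c(2, 2)) instead would fill the grid column by column, which would put the panels in different positions from the reference figure.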