Exploring Household Electricity Meter Data

Week 1 assignment of Johns Hopkins Exploratory Data Analysis: Electric Power Consumption

Yao-Jen Kuo

Published in 數聚點文摘, Mar 4, 2018

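The data for this assignment is one household's electricity usage record over nearly four years, with one reading per minute. We have two problems to solve: reading the data and plotting it. Plotting is the main part of the assignment that gets graded.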

Assignment source

Johns Hopkins University Data Science Specialization: Exploratory Data Analysis

Many people have gotten jobs in machine learning just by completing that MOOC. There’re other similar online courses that help; for example the John Hopkins Data Science specialization.
Andrew Ng answering How should you start a career in Machine Learning? on Quora

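The Data Science Specialization that Johns Hopkins University offers on Coursera is a very complete, high-quality program, but finishing all ten of its courses on your own is not easy for a beginner. The Programming Assignments are much harder than the lectures or the small swirl exercises, so the moment one is due is often when an ambitious learner's enthusiasm gets doused.

Let's work through the course projects and assignments together with DataInPoint. Today we tackle the assignment due in week 1 of the fourth course, Exploratory Data Analysis: Electric Power Consumption.

The assignment must be submitted as a GitHub repository; this example can serve as a reference: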

yaojenkuo/ExData_Plotting1 (github.com): Johns Hopkins Exploratory Data Analysis Course Project 1

Overview

Measurements of electric power consumption in one household with a one-minute sampling rate over a period of almost 4 years. Different electrical quantities and some sub-metering values are available.

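Reading the data

There are a few things to watch for when reading the data:

The dataset has 2,075,259 observations and 9 variables
Only the two days 2007-02-01 and 2007-02-02 need to be read
The dataset has a date variable and a time variable, which can be converted with strptime() and as.Date()
Missing values are recorded as ?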
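
The date and time conversion mentioned in the list can be sketched as follows. The sample values are synthetic, and `tz = "UTC"` is an assumption made here so that the result does not depend on the machine's time zone; the original code relies on the local time zone instead.

```r
# Synthetic sample values in the file's layout (day/month/year, 24-hour time).
raw_date <- c("1/2/2007", "2/2/2007")
raw_time <- c("00:00:00", "23:59:00")

# as.Date() needs an explicit format because the day comes before the month.
parsed_date <- as.Date(raw_date, format = "%d/%m/%Y")

# strptime() parses the date and the time in one step; wrapping it in
# as.POSIXct() gives a type that is convenient to store in a data frame.
parsed_datetime <- as.POSIXct(
  strptime(paste(raw_date, raw_time), format = "%d/%m/%Y %H:%M:%S", tz = "UTC")
)

print(parsed_date)
print(parsed_datetime)
```

Without the format string, as.Date() would not know that "1/2/2007" means 1 February rather than 2 January.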

The descriptions of the 9 variables are also worth a look:

Date: Date in format dd/mm/yyyy
Time: time in format hh:mm:ss
Global_active_power: household global minute-averaged active power (in kilowatt)
Global_reactive_power: household global minute-averaged reactive power (in kilowatt)
Voltage: minute-averaged voltage (in volt)
Global_intensity: household global minute-averaged current intensity (in ampere)
Sub_metering_1: energy sub-metering №1 (in watt-hour of active energy). It corresponds to the kitchen, containing mainly a dishwasher, an oven and a microwave (hot plates are not electric but gas powered).
Sub_metering_2: energy sub-metering №2 (in watt-hour of active energy). It corresponds to the laundry room, containing a washing-machine, a tumble-drier, a refrigerator and a light.
Sub_metering_3: energy sub-metering №3 (in watt-hour of active energy). It corresponds to an electric water-heater and an air-conditioner.

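Next, open the TXT file in a text editor and inspect its contents.

From this inspection we learn that the records for 2007-02-01 and 2007-02-02 begin on line 66,638 of the file (line 1 is the header, so this is the 66,637th observation). The two days span 60 x 24 x 2 = 2,880 minutes, so we should read 2,880 rows. There are also a few details to keep in mind:

Variables are separated by ;
Missing values are recorded as ?
Variable names are read separately from the first row
Parse the date first, then combine it with the time and convert the result to a date-time type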
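
The line offset can also be checked with a little arithmetic instead of scrolling through the file. This is only a sketch: it assumes the first record in the file is stamped 16/12/2006 17:24:00 and that no minute is missing from the file (missing readings appear as ? but still occupy a row).

```r
# Assumption: the first record in the file is stamped 16/12/2006 17:24:00.
first_record <- as.POSIXct("2006-12-16 17:24:00", tz = "UTC")
window_start <- as.POSIXct("2007-02-01 00:00:00", tz = "UTC")
window_end   <- as.POSIXct("2007-02-03 00:00:00", tz = "UTC")

# One record per minute, so the elapsed minutes equal the rows before the window.
rows_before <- as.numeric(difftime(window_start, first_record, units = "mins"))

# Add 1 for the header line to get the number of lines to skip.
lines_to_skip <- rows_before + 1

# The window itself covers two full days of one-minute records.
rows_to_read <- as.numeric(difftime(window_end, window_start, units = "mins"))

print(lines_to_skip)  # 66637
print(rows_to_read)   # 2880
```

These two numbers are the values a call to read.table() would take as skip and nrows.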

Now we can define a get_data() function that reads these 2,880 rows into R:

library(magrittr)
# get_data
get_data <- function(dest_file, ex_dir) {
  data_url <- "https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2Fhousehold_power_consumption.zip"
  download.file(data_url, destfile = dest_file) # download the zip archive
  unzip(dest_file, exdir = ex_dir) # extract it
  txt_path <- paste0(ex_dir, "/household_power_consumption.txt")
  df_header <- txt_path %>%
    readLines(n = 1) %>%
    strsplit(split = ";") %>%
    unlist() # get the variable names
  # skip the header line plus the rows before 2007-02-01, then read two days of rows
  df <- read.table(txt_path, sep = ";", na.strings = "?", skip = 66637, nrows = 2880,
                   stringsAsFactors = FALSE, col.names = df_header)
  df$Date <- as.Date(df$Date, format = "%d/%m/%Y") # convert to Date
  df$DateTime <- paste(df$Date, df$Time) %>%
    as.POSIXct() # convert to date-time (local time zone)
  return(df)
}
household_power_consumption <- get_data(dest_file = "~/Downloads/household_power_consumption.zip", ex_dir = "~/ExData_Plotting1")
View(household_power_consumption)
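
As an alternative to a hard-coded line offset, the rows can be picked by matching the date at the start of each line. This is a different technique from the one in get_data(), shown only as a sketch: `read_two_days` is a hypothetical helper, and the sample lines are synthetic stand-ins with made-up values and only four of the nine columns.

```r
# Synthetic stand-in for lines of the TXT file (values are made up).
sample_lines <- c(
  "Date;Time;Global_active_power;Voltage",
  "31/1/2007;23:59:00;1.500;240.10",
  "1/2/2007;00:00:00;0.326;243.15",
  "2/2/2007;23:59:00;?;240.00",
  "3/2/2007;00:00:00;3.700;239.50"
)

# Hypothetical helper: keep the header plus the rows dated 1 or 2 February 2007.
read_two_days <- function(lines) {
  # The pattern accepts the day and month with or without a leading zero.
  keep <- grepl("^0?[12]/0?2/2007;", lines)
  read.table(
    text = c(lines[1], lines[keep]),
    sep = ";", header = TRUE, na.strings = "?", stringsAsFactors = FALSE
  )
}

two_days <- read_two_days(sample_lines)
print(two_days)
```

With the real file, the lines would come from readLines() on the extracted TXT path. That reads every line into memory, so it is slower than skip and nrows, but it does not break if the file's starting point differs.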

Plotting

Our overall goal here is simply to examine how household energy usage varies over a 2-day period in February, 2007. Your task is to reconstruct the following plots below, all of which were constructed using the base plotting system.

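The plotting part has four tasks. Our goal is to use the base plotting system to reproduce the plots given in the assignment and export each one as a 480 x 480 PNG file.

Plot 1

Points to note for Plot 1:

plot1.R must include the code that reads the data
Use hist() to draw a histogram
Set col to make the bars red
Set main to add the plot title
Set xlab to change the x-axis title
Use png() to write the plot to a file; its default width and height are 480 pixels, so the defaults are fine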

# plot1.R
library(magrittr)
# get_data
get_data <- function(dest_file, ex_dir) {
  data_url <- "https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2Fhousehold_power_consumption.zip"
  download.file(data_url, destfile = dest_file) # download the zip archive
  unzip(dest_file, exdir = ex_dir) # extract it
  txt_path <- paste0(ex_dir, "/household_power_consumption.txt")
  df_header <- txt_path %>%
    readLines(n = 1) %>%
    strsplit(split = ";") %>%
    unlist() # get the variable names
  # skip the header line plus the rows before 2007-02-01, then read two days of rows
  df <- read.table(txt_path, sep = ";", na.strings = "?", skip = 66637, nrows = 2880,
                   stringsAsFactors = FALSE, col.names = df_header)
  df$Date <- as.Date(df$Date, format = "%d/%m/%Y") # convert to Date
  df$DateTime <- paste(df$Date, df$Time) %>%
    as.POSIXct() # convert to date-time (local time zone)
  return(df)
}
# plot1: histogram of Global_active_power written to plot1.png
plot1 <- function(df) {
  png(filename = "~/ExData_Plotting1/plot1.png") # default size is 480 x 480 pixels
  par(bg = NA) # transparent background
  hist(df$Global_active_power, col = "red", main = "Global Active Power", xlab = "Global Active Power (kilowatts)", bg = "transparent")
  dev.off() # close the device so the file is written
}
household_power_consumption <- get_data(dest_file = "~/Downloads/household_power_consumption.zip", ex_dir = "~/ExData_Plotting1")
plot1(household_power_consumption)
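
All four plots follow the same open device, draw, close device pattern, so it can be factored into a small wrapper. The sketch below is not part of the original solution: `with_png` is a hypothetical helper, the data is synthetic, and the output goes to a temporary file rather than the repository folder.

```r
# Hypothetical helper: open a PNG device, run the drawing code, always close the device.
with_png <- function(path, draw, width = 480, height = 480) {
  png(filename = path, width = width, height = height)
  on.exit(dev.off(), add = TRUE)  # runs even if draw() stops with an error
  par(bg = NA)                    # transparent background, as in plot1()
  draw()
  invisible(path)
}

# Usage with synthetic data so the sketch runs without the downloaded file.
set.seed(1)
fake_power <- rexp(200, rate = 1)
out_file <- with_png(tempfile(fileext = ".png"), function() {
  hist(fake_power, col = "red", main = "Global Active Power",
       xlab = "Global Active Power (kilowatts)")
})
print(file.exists(out_file))
```

Using on.exit() means a failed plot does not leave a PNG device open, which would otherwise swallow later plots in the same session.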

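Plot 2

Points to note for Plot 2:

Use plot() to draw a line chart
Set type = "l" to get lines instead of points
Leave main unset; the plot2() code below does not add a title
Set xlab = "" to blank the x-axis title, and set ylab to change the y-axis title
Use png() to write the plot to a file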

# plot2.R
plot2 <- function(df) {
  png(filename = "~/ExData_Plotting1/plot2.png")
  par(bg = NA)
  plot(x = df$DateTime, y = df$Global_active_power, type = "l", xlab = "", ylab = "Global Active Power (kilowatts)", bg = "transparent")
  dev.off()
}
plot2(household_power_consumption)
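
The x-axis of this plot is labelled with weekday abbreviations, and those come from the session's time locale. On a machine with a non-English locale the labels may therefore appear in the local language. A minimal sketch of switching to English labels and restoring the previous setting, assuming the portable "C" locale is acceptable:

```r
# Remember the current time locale so it can be restored afterwards.
old_locale <- Sys.getlocale("LC_TIME")

# "C" is the portable default locale and yields English day names.
invisible(Sys.setlocale("LC_TIME", "C"))

# The two days in the data, plus the day after as the right-hand axis label.
first_day <- as.POSIXct("2007-02-01 00:00:00", tz = "UTC")
day_labels <- format(first_day + 86400 * 0:2, "%a")
print(day_labels)  # "Thu" "Fri" "Sat"

# Restore the previous setting.
invisible(Sys.setlocale("LC_TIME", old_locale))
```

Calling Sys.setlocale() before plot2() should make the axis labels match the English ones regardless of the system language.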

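Plot 3

Points to note for Plot 3:

Use plot() to draw Sub_metering_1 in black
Use lines() to add Sub_metering_2 in red and Sub_metering_3 in blue
Use legend() to add a legend; adjust cex so the legend is not too large, and place it at the top right with "topright"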

# plot3.R
plot3 <- function(df) {
  png(filename = "~/ExData_Plotting1/plot3.png")
  par(bg = NA)
  plot(x = df$DateTime, y = df$Sub_metering_1, type = "l", col = "black", xlab = "", ylab = "Energy sub metering", bg = "transparent")
  lines(x = df$DateTime, y = df$Sub_metering_2, col = "red")
  lines(x = df$DateTime, y = df$Sub_metering_3, col = "blue")
  legend("topright", legend = c("Sub_metering_1", "Sub_metering_2", "Sub_metering_3"), col = c("black", "red", "blue"), lty = 1, cex = 0.9)
  dev.off()
}
plot3(household_power_consumption)
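
One detail of this approach: plot() fixes the y-axis range from the first series only, so a series added later with lines() would be cut off if it had larger values. A hedged sketch of computing limits that cover all three series; `shared_ylim` is a hypothetical helper and the values are made up for illustration.

```r
# Hypothetical helper: y-axis limits that cover every series, ignoring NAs.
shared_ylim <- function(...) {
  range(c(...), na.rm = TRUE)
}

# Synthetic sub-metering values, made up for illustration.
sub1 <- c(0, 1, 38, NA)
sub2 <- c(0, 2, 1, 0)
sub3 <- c(17, 18, 0, 19)

limits <- shared_ylim(sub1, sub2, sub3)
print(limits)  # 0 38
```

Passing the result as ylim = limits in the first plot() call keeps every line inside the plotting region, whichever series is drawn first.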

Plot 4

Points to note for Plot 4:

Use par(mfrow = c(2, 2)) to split the canvas into a 2 x 2 grid
Place the plots into the four grid cells in order

# plot4.R
plot4 <- function(df) {
  png(filename = "~/ExData_Plotting1/plot4.png")
  par(bg = NA)
  par(mfrow = c(2, 2))
  plot(x = df$DateTime, y = df$Global_active_power, ylab = "Global Active Power", xlab = "", type = "l", bg = "transparent")
  plot(x = df$DateTime, y = df$Voltage, ylab = "Voltage", xlab = "datetime", type = "l", bg = "transparent")
  plot(x = df$DateTime, y = df$Sub_metering_1, type = "l", col = "black", xlab = "", ylab = "Energy sub metering", bg = "transparent")
  lines(x = df$DateTime, y = df$Sub_metering_2, col = "red")
  lines(x = df$DateTime, y = df$Sub_metering_3, col = "blue")
  legend("topright", legend = c("Sub_metering_1", "Sub_metering_2", "Sub_metering_3"), col = c("black", "red", "blue"), lty = 1, cex = 0.9, bty = "n")
  plot(x = df$DateTime, y = df$Global_reactive_power, ylab = "Global_reactive_power", xlab = "datetime", type = "l", bg = "transparent")
  dev.off()
}
plot4(household_power_consumption)
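
The order of the plot() calls matters because mfrow fills the grid row by row, while the related mfcol setting fills it column by column. The sketch below records which cell each new plot lands in; `fill_order` is a hypothetical helper, and it draws on a null PDF device so that no file is created.

```r
# Hypothetical helper: record which grid cell each new plot lands in.
fill_order <- function(layout = c("mfrow", "mfcol")) {
  layout <- match.arg(layout)
  pdf(NULL)                        # null device: nothing is written to disk
  on.exit(dev.off(), add = TRUE)   # close the device when the function returns
  if (layout == "mfrow") par(mfrow = c(2, 2)) else par(mfcol = c(2, 2))
  cells <- character(4)
  for (i in 1:4) {
    plot.new()                     # start the next figure in the grid
    cells[i] <- paste(par("mfg")[1:2], collapse = ",")  # "row,column"
  }
  cells
}

print(fill_order("mfrow"))
print(fill_order("mfcol"))
```

With mfrow, the second plot() call in plot4() lands in the top-right cell and the third in the bottom-left cell, which is why the Voltage plot comes before the sub-metering plot in the code.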

Further reading

If the R code in this article feels a bit difficult, DataCamp's R visualization courses are a recommended reference:

Data Visualization in R | DataCamp (www.datacamp.com): This course provides a comprehensive introduction to working with base graphics in R.
Data Visualization with ggplot2 | DataCamp (www.datacamp.com): Recommended by ggplot2 author Hadley Wickham, this online course teaches you how to create meaningful data…
R ggplot2 Tutorial For Data Visualizations | DataCamp (www.datacamp.com): Ggplot2 author Hadley Wickham recommends this tutorial that teaches you how to create data visualizations with ggplot2…
Data Visualization with ggplot2 (Part 3) | DataCamp (www.datacamp.com): This course covers some advanced topics including strategies for handling large data sets and specialty plots.
