Extracting Metadata from Dates for Better Summarization

A few days ago I needed to extract metadata from a column of dates in order to summarize my data in a more meaningful way. In my original data set, I had 5 columns one of which was a date — the other columns were metric aggregations for each day. In order to get the data set into a format that could be summarized by the day of the week, I needed to use several packages for R: dplyr, tidy, lubridate, and ggplot2.

I’ve recreated a new data set with three columns: Date, Downloads (randomly generated), and Revenue. Below, I’ll show the code I used to separate a date into 3 columns (month, day, and year), add the day of week to the data set, summarize the data, and visualize them in a chart.

To begin, I read in the CSV file, which stores the data in a data frame, and assigned it to the variable dat.

dat <- read.csv('downloadRev.csv', stringsAsFactors = F)

Then, in order to preserve the date column, I created a new column that will be separated into three columns a little later. The code below creates a new column in the data frame called dateSep and sets the value to what is currently in the Date column (essentially copying the data from Date into the new dateSep column).

dat$dateSep <- dat$Date

This is where tidyr’s separate function comes in handy especially when combined with dplyr/magrittr’s piping (%>%). The separate function does exactly what it says — it separates a given column into new column(s). The dates in my data frame are formatted as “1/1/2016”, so I’ll be splitting the dateSep column into month, day, and year using the “/” as a separator. This is helpful when grouping data by month or year if it’s not provided in the data set. The resulting data frame is stored in a new variable, datSep.

datSep <- dat %>% separate(dateSep, into = c("Month", "Day", "Year"), sep = "/")

Next up, I’ll use the lubridate package to extract the weekday from each of the dates. Again, this is helpful when summarizing the data by day of the week to identify trends. The code below wraps the date in a mdy() function that tells lubridate the date pattern it should be expecting — in this case MonthDayYear. It also converts the date into a format instead of just a character string. Then, it is wrapped in the wday() function with the label parameter set to TRUE. When the label parameter is set to TRUE, it returns the word values corresponding to the day of the week (i.e., Sun, Mon, Tue, etc.) If label was set to FALSE, it would return the numbers 1–7 corresponding to the day of the week. The returned values are then stored in a new column, DOW, in the datSep data frame.

datSep$DOW <- wday(mdy(datSep$Date), label = TRUE)

To summarize the data by day of week, we’ll use dplyr functions to select, group, and summarize . The resulting data frame, revDOW, has also been shown.

revDOW <- datSep %>% select(DOW, Month, Downloads, Revenue) %>% group_by(DOW) %>% summarise(TOTALREV = sum(Revenue))
data frame showing total revenue summarized by day of week

And finally, to put a visual to the data, we’ll use ggplot2 to produce a chart showing the total revenue by the day of week. The chart below is the result of the following ggplot command.

revDOW %>% ggplot(aes(x = DOW, y = TOTALREV)) + geom_bar(stat = "identity")
ggplot2 chart showing total revenue by day of week

Full code:


# Read in the csv file
dat <- read.csv('downloadRev.csv', stringsAsFactors = F)

# Create a new column containing the date column that will be separated later
dat$dateSep <- dat$Date

# Use tidyr's separate command to separate the date into three columns and store it
# in a new column
datSep <- dat %>% separate(dateSep, into = c("Month", "Day", "Year"), sep = "/")

# Extract the day of the week from the date column
datSep$DOW <- wday(mdy(datSep$Date), label = T)

# use dplyr to select, group, and summarize the data
revDOW <- datSep %>% select(DOW, Month, Downloads, Revenue) %>% group_by(DOW) %>% summarise(TOTALREV = sum(Revenue))

#use ggplot2 to create a chart of TOTALREV by DOW
revDOW %>% ggplot(aes(x = DOW, y = TOTALREV)) + geom_bar(stat="identity")

Originally published at what do the data say?.