R4DS Week 3: Just Breathe

Jesse Maegan
Sep 20, 2017

It’s all going to be OK!

Your gentle reminder that you don’t have to learn all the things right away

We’ve covered a lot of ground in our first three weeks, and we’ve covered it very quickly. If you’re feeling lost, that’s OK and completely normal. Instead of focusing on mastering every last detail of the text, become a tourist. Look for the main points in the text, and stick to the big ideas that are presented. It’s more important to know that you have the ability to make a graph than to memorize what every last command in ggplot2 does.

Wait, we’re in Week 3? What happened to Weeks 1 & 2?!?!

Check out Week 1 here, and Week 2 over here.

Questions? Go ahead and ask them in the weekly Slack channels — we’d love to hear from you!

Week 3 is all about data transformations

Week 3 has been dedicated to Section 5: Data transformation.

So wait, we spent a week graphing, and now we’re going to mess around with our data?

Lana knows what’s up

From the authors:

Visualisation is an important tool for insight generation, but it is rare that you get the data in exactly the right form you need.

R4DS Week 3 walk-through:

Our methodology

First we’re going to get to know our data, and then start changing things to see what happens, much like we did in our Week 2 catch-up guide.

This is how we do it

Why you should take the time to get to know your data

You can absolutely run an analysis on data that you’re unfamiliar with, but there are 8 million reasons why this is a really bad idea. When you don’t understand the data and variables that you’re working with, it can be difficult to spot outliers, you may misread the relationships between variables, and you can easily end up running the wrong types of analyses.

For example, I ran an analysis two weeks ago on data that I was completely unfamiliar with, and all I did was create hours of extra work for myself by not spending the time up front to understand what exactly I was working with. Don’t do this to yourself. Spend some time getting to know your data first!

the more you know, the easier your analysis should (theoretically) be

Set up your workspace

Open RStudio, create a new script (ideally within your R4DS Project), and run the following:

library(nycflights13)
library(tidyverse)
# if you get an error on nycflights13, you may need to install it
# first by running install.packages("nycflights13")

Understanding the nycflights13 dataset

Pull up the documentation for flights by typing and running the following:

?flights

If this is too small for you to work with comfortably, you can “pop out” the Help pane into its own window using the “Show in new window” button in the pane’s toolbar.

Using the document you pulled up as reference, run each of the following:

flights
View(flights)

Write a brief summary of the nycflights13 dataset in your R script, using “#” to comment out your summary.

If you want to get really fancy, go ahead and type out a several-sentence summary in your R script, highlight it, and then press Cmd/Ctrl+Shift+C to comment out all of the highlighted lines at once.
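If you’d like a few more ways to peek at the data before writing your summary, here is a minimal sketch. None of these calls are part of the walk-through above, so treat them as optional extras; glimpse() comes from dplyr, which is loaded with the tidyverse.

```r
library(nycflights13)
library(dplyr)

# How big is the data? Rows are flights, columns are variables.
flight_dims <- dim(flights)
flight_dims

# Column names, types, and the first few values of each column
glimpse(flights)

# Quick numeric summary of one column, including how many values are missing
summary(flights$dep_delay)
```

Checking the number of rows and columns against the ?flights documentation is a quick way to confirm you are looking at the data you think you are.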

Using filter()

filter() is a fantastic way to keep or exclude rows of your data that meet (or don’t meet!) specific criteria. Run the following:

filter(flights, month == 1, day == 1)

We can translate this by saying “Filter the flights data set, and return rows where the month equals 1 and the day equals 1.”

Use your flights documentation to figure out what we’re looking at in the data frame we’ve just created, and write a one-sentence summary in your script.

Now try this one, which saves the result as jan1 instead of printing it:

jan1 <- filter(flights, month == 1, day == 1)

And this one, which wraps the same line in parentheses so the result is both saved as jan1 and printed:

(jan1 <- filter(flights, month == 1, day == 1))

Optional stretch exercises:

  • Create a filtered data frame for December 25th, 2013
  • Create a filtered data frame that includes every month except for June
  • Create a filtered data frame that includes all flights for September or October
  • Create a filtered data frame that excludes flights with an arrival delay of less than or equal to 350 minutes (re-wording this sentence may help you arrive at the solution)
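If you get stuck on the stretch exercises, here is one possible approach, sketched with the comparison and logical operators that filter() understands. The object names (dec25, not_june, and so on) are just labels chosen for this example, and there are other valid ways to write each line.

```r
library(nycflights13)
library(dplyr)

# Comma-separated conditions are combined with AND
dec25 <- filter(flights, month == 12, day == 25)

# != means "not equal to"
not_june <- filter(flights, month != 6)

# | means OR; %in% is shorthand for several ORs on the same column
sep_oct <- filter(flights, month == 9 | month == 10)
sep_oct_in <- filter(flights, month %in% c(9, 10))

# "Excluding delays of 350 or less" is the same as "keeping delays over 350".
# filter() only keeps rows where the condition is TRUE, so flights with a
# missing arr_delay are dropped here as well.
very_late <- filter(flights, arr_delay > 350)
```

The two September/October versions return the same rows; %in% tends to be easier to read once you are matching more than two values.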

Using arrange()

arrange() lets you change the order of rows in your data. Run each of the following lines of code separately, and note the differences in output by paying close attention to the “dep_time” column.

flights
arrange(flights, dep_time)
arrange(flights, desc(dep_time))

We can translate the second line of code by saying “We’re going to arrange the flights data frame in ascending order of departure time”.

The third line of code reads “We’re going to arrange the flights data frame in descending order, based on departure time.”

Optional stretch exercises:

  • Find the fastest flights
  • Find the slowest flights
  • Find the most delayed flights
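Here is a hedged sketch of how the stretch exercises could be tackled. It assumes that “fastest” means the shortest time in the air, which is only one reading of the question (you could also sort by a speed variable), and most_delayed is a name made up for this example.

```r
library(nycflights13)
library(dplyr)

# Shortest time in the air first (one reading of "fastest")
arrange(flights, air_time)

# Longest time in the air first (one reading of "slowest")
arrange(flights, desc(air_time))

# Largest departure delay first
most_delayed <- arrange(flights, desc(dep_delay))
head(most_delayed, 3)
```

Whichever direction you sort in, arrange() puts rows with missing values at the end, so cancelled flights will not crowd the top of your results.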

Using select()

select() is going to allow you to choose which variables (columns) you’re interested in analyzing, which can be exceptionally helpful when you have thousands of columns’ worth of data.

Run each of the following lines of code separately, and note the differences in output.

flights
select(flights, year, month, day)
select(flights, year:day)
select(flights, -(year:day))
select(flights, time_hour, air_time, everything())

Optional stretch exercise:

  • Translate each of the lines of code in the above section into a sentence that explains what’s being accomplished in each line.
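Beyond naming columns one at a time, select() also works with helper functions that match column names by pattern. These helpers are not used in the walk-through above, so consider this a small optional sketch; the object names are made up for the example.

```r
library(nycflights13)
library(dplyr)

# Columns whose names start with "dep"
dep_cols <- select(flights, starts_with("dep"))

# Columns whose names end with "delay"
delay_cols <- select(flights, ends_with("delay"))

# Columns whose names contain "time" anywhere
time_cols <- select(flights, contains("time"))

names(dep_cols)
names(delay_cols)
names(time_cols)
```

Helpers like these save a lot of typing when related columns share a naming pattern.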

Using mutate()

mutate() allows us to create new variables in our data frame. Run the following lines of code to get an idea of what mutate() can accomplish:

colnames(flights)
flights_new <- mutate(flights, speed = distance / air_time * 60)
colnames(flights_new)
summary(flights_new$speed)

Optional stretch exercises:

  • Create a variable called gain, which is the departure delay subtracted from the arrival delay.
  • Create new variables using the flights dataset. It’s OK at this point if the variables don’t quite make sense — the goal is to get comfortable using the mutate() function.
  • Combine the select() function with mutate() to create a data frame with fewer variables to work with.

Don’t give up. We’re rooting for you!
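As a rough sketch of the first and third stretch exercises together, you could narrow the data with select() first and then add new columns with mutate(). The name flights_small is made up for this example, and gain follows the wording of the exercise above (arrival delay minus departure delay).

```r
library(nycflights13)
library(dplyr)

# Start from a narrower data frame so the new columns are easy to spot
flights_small <- select(flights, year:day, ends_with("delay"), distance, air_time)

flights_small <- mutate(
  flights_small,
  gain = arr_delay - dep_delay,     # departure delay subtracted from arrival delay
  speed = distance / air_time * 60  # miles per hour, since air_time is in minutes
)

colnames(flights_small)
```

New columns are added at the end of the data frame, which is why trimming the columns first makes the result easier to read in the console.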

Using summarise()

summarise() is hands-down the function I use the most in my day-to-day analysis. It’s awesome because it can collapse an entire data frame into a single row.

As we did previously, run the following lines of code separately, and note what changes:

summarise(flights, delay = mean(dep_delay))
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
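The difference between those two lines comes down to missing values. Here is a tiny sketch using a made-up vector (delays_example is not part of the flights data) to show what na.rm = TRUE does.

```r
# A made-up vector with one missing value
delays_example <- c(5, NA, 10)

# Any missing value makes the mean missing too
mean_with_na <- mean(delays_example)
mean_with_na

# na.rm = TRUE drops the missing values before averaging
mean_without_na <- mean(delays_example, na.rm = TRUE)
mean_without_na
```

The same thing happens inside summarise(): dep_delay has missing values, so the first version returns NA and the second returns an actual average.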

Optional stretch exercises:

Type out and run the two following sets of code. Translate each into a descriptive sentence that explains what each chunk of code does.

Which chunk of code do you prefer? Why?

by_dest <- group_by(flights, dest)
delay <- summarise(by_dest,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE)
)
delay <- filter(delay, count > 20, dest != "HNL")
summary(delay)

In the next chunk of code you can use the keyboard shortcut Cmd/Ctrl+Shift+M to quickly type the pipe (%>%).

delays <- flights %>%
  group_by(dest) %>%
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  filter(count > 20, dest != "HNL")
summary(delays)

Why don’t these ever close?!
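If the pipe still feels mysterious, it may help to see it on something much smaller than flights. This is a minimal sketch with a made-up vector: x %>% f() means the same thing as f(x), so a chain of pipes is a nested function call written in reading order.

```r
library(dplyr)  # dplyr makes the %>% pipe available

x <- c(1, 4, 9)

# Nested calls: you have to read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped: read top to bottom as "take x, then sqrt, then mean, then round"
piped <- x %>%
  sqrt() %>%
  mean() %>%
  round(1)

identical(nested, piped)
```

Both versions do exactly the same work, so choosing between them is mostly a question of which one you find easier to read.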

One step further

We’ve skipped a lot of the more in-depth content, and that’s OK. The goal is to get you caught up with the main points of the section, not re-hash the entire section in a blog post.

One of the best resources is the set of RStudio cheat sheets. The one related to what we’ve done this week is the Data Wrangling Cheat Sheet.

And as always, ask questions in our Slack channels! Our mentors are friendly, helpful, and genuinely committed to helping you learn.

Jesse Maegan

molecular biologist turned public school teacher before falling in ❤️ with non-profit data science. perpetual #rstats noob.