R4DS Week 6: Tidying up your data

Introduction

If this week’s topics were new to you, some of the biggest challenges may have come from conceptualizing what you’re being asked to do with a given dataset. To help with the concepts, this recap relies heavily on illustrations to explain what’s happening in our ever-expanding tidyverse. We’ll be using fictional data inspired by pigeon racing, a hobby that has real datasets of its own if you’re interested in exploring it!

gather()

gather() allows us to take two or more columns and gather them into two new columns: one that holds the old column names (the key), and one that holds the values that sat underneath them (the value). Clear as mud, right?

You’ll know you need to use gather when you see column names that are values, not variables. Think of variables like a cardboard box that we’ve labeled, while the items that we put inside of this box are values. We can change the label of the box as much as we want, but that’s not going to change the values that are inside the box.

Here’s what a tibble (which I’ve named dataset) that we could use gather() on looks like:

a tibble named: dataset

Notice how the two column headers 2016 and 2017 are values that we could assign to the variable year? This is a clear sign that we should use gather()!

Knowing that our tibble is named “dataset”, we could run the following in R:

gather(dataset, `2016`, `2017`, key = "year", value = "speed_mps")

To break this down a bit further, here’s an annotated explanation of the code:

annotated explanation of running gather() on our dataset

Once we’ve run gather() on our dataset, our tibble should look like this:
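If you’d like to try this yourself, here’s a minimal sketch that builds a small stand-in for dataset and runs the same gather() call. The pigeon column, the pigeon names, and the speeds are all made up for illustration; they are not the values from the original illustration.

```r
library(tidyr)
library(tibble)

# Hypothetical stand-in for `dataset`: the names and speeds are invented.
# The backticks are needed because 2016 and 2017 are not syntactic column names.
dataset <- tibble(
  pigeon = c("Ace", "Bolt", "Comet"),
  `2016` = c(20.1, 18.4, 21.7),
  `2017` = c(20.9, 19.2, 21.0)
)

# Move the two year columns into a "year" key column and a "speed_mps" value column.
tidy_speeds <- gather(dataset, `2016`, `2017`, key = "year", value = "speed_mps")

print(tidy_speeds)
```

Each pigeon now has one row per year, so three rows have become six. In newer releases of tidyr, gather() is superseded by pivot_longer(), but gather() still works as shown here.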



spread()

spread() allows us to take variables that have been spread into rows and move them into their own columns.

You’ll know you need to use spread when you see variable names listed as values.

Here’s what a dataset that we could use spread() on looks like:

notice how the column “type” has two variables in it: speed and distance

Notice how the column named “type” has two variables in it, speed and distance? This is a clear sign that we should use spread()!

Knowing that our tibble is named “dataset”, we could run the following in R:

spread(dataset, key = type, value = count)

To break this down a bit further, here’s an annotated explanation of the code:

annotated explanation of running spread() on our dataset

Once we’ve run spread() on our dataset, our tibble should look like the table below. Notice that the column headers “type” and “count” have disappeared. That’s because we’ve removed them when we created the “distance” and “speed” column headers (variables). “Type” was redundant — think of it as a variable that was storing the additional variables of “distance” and “speed.”
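Here’s a minimal sketch of the same spread() call on a small stand-in for dataset. Only the type and count column names come from the example above; the pigeon column and every number in it are assumptions made up for illustration.

```r
library(tidyr)
library(tibble)

# Hypothetical stand-in for `dataset`: the pigeon names and counts are invented.
dataset <- tibble(
  pigeon = c("Ace", "Ace", "Bolt", "Bolt"),
  type   = c("distance", "speed", "distance", "speed"),
  count  = c(150, 20.1, 120, 18.4)
)

# Each distinct value in `type` becomes its own column, filled from `count`.
wide <- spread(dataset, key = type, value = count)

print(wide)
```

The result has one row per pigeon, with “distance” and “speed” as columns and no “type” or “count” column left over.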



separate()

separate() allows us to take the values of two variables that are in the same column, and give them each their own respective columns.

You’ll know you need to use separate() when a single column holds the values of two variables at once. This often happens when a column stores a rate, or a change over time.

Here’s what a dataset that we could use separate() on looks like:

our tibble before using separate()

Notice how every entry in the rate column holds two values! We can assume that our two variables are related to distance and time.

Knowing that our tibble is named “dataset”, we could run the following in R:

separate(dataset, rate, into = c("distance", "time"), convert = TRUE)

To break this down a bit further, here’s an annotated explanation of the code:

annotated explanation of running separate() on our tibble

Once we’ve run separate() on our table, we’ve removed the “rate” column and added in two new columns: “distance” and “time.”

By default, separate() splits the values in a column wherever it comes across a character that isn’t a number or a letter. In our case it was the forward slash, or “/”, that told separate() where to split the column in two. If you want to be explicit, you can name the separator yourself with the sep argument.

Our “separated” tibble will look like this:

our tibble after we’ve run separate() to split up the “rate” column into “distance” and “time”
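To see the splitting in action, here’s a minimal sketch on a small stand-in for dataset. The pigeon column and the rate values are invented for illustration; only the rate, distance, and time names come from the example above.

```r
library(tidyr)
library(tibble)

# Hypothetical stand-in for `dataset`: the pigeon names and rates are invented.
dataset <- tibble(
  pigeon = c("Ace", "Bolt"),
  rate   = c("150/8", "120/7")
)

# Default behaviour: split wherever a non-alphanumeric character appears.
# convert = TRUE turns the resulting text pieces into numbers where possible.
separated <- separate(dataset, rate, into = c("distance", "time"), convert = TRUE)

# Same split, but with the separator spelled out explicitly.
explicit <- separate(dataset, rate, into = c("distance", "time"),
                     sep = "/", convert = TRUE)

print(separated)
```

Without convert = TRUE, the new “distance” and “time” columns would stay as character columns, since they were cut out of a text column.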



Still catching up? Here’s what we’ve done so far:

Week 1: Setting up your RStudio environment

Week 2: Data visualization

Week 3: Data transformations

Week 4: Exploratory data analysis

Week 5: Tibbles

Like what you read? Give Jesse Maegan a round of applause.
