R4DS Week 8: factors & forcats

Introduction

The temptation to make this week’s recap nothing but a collection of my favorite cat gifs was strong, but I resisted. Sort of.

The text does a fantastic job of walking you through a multitude of examples, and so this recap is largely a series of images that are annotated versions of some of the code in the text.

If you’re looking for resources to supplement the content in “R for Data Science”, check out this walkthrough from Jenny Bryan.


Factors and Forcats

What is a factor?

Factors are how categorical information gets stored in R. Remember that a categorical variable is a variable whose values are names or labels rather than measured quantities. Examples of categorical variables include:

For the purposes of this recap, we are not going to differentiate between ordinal and nominal categorical variables.

This cat is 0% worried about ordinal and nominal right now

Factors and levels

Factors can trip you up pretty quickly, because they often look like strings but behave more like numbers. This is because R stores each level of a factor as an integer. By default, when you create a factor from strings, R assigns those integers to the levels in alphabetical order.

In other words:
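As a small sketch (the continent names below are made-up example data, not taken from the text), here is what a factor looks like under the hood:

```r
# Hypothetical example data: a character vector of continent names
continents <- c("Europe", "Asia", "Africa", "Asia",
                "South America", "North America")

# factor() sorts the unique values to create the levels
continent_fct <- factor(continents)

levels(continent_fct)      # the labels, in alphabetical order
as.integer(continent_fct)  # the integer codes R actually stores
```

Each integer is the position of that value's label in `levels()`, which is why a factor can print like a string and still behave like a number.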

Yes, I know that Antarctica and Australia were left out. No hard feelings, I hope!
This cat is from Australia, and clearly unhappy at being left out

Annotated General Social Survey

To get the data shown below, you can run:

> library(tidyverse)
> library(forcats)
> gss_cat
output of running gss_cat after we’ve loaded the tidyverse and forcats

The annotations point out the dataset, the marital factor, and its levels. Using that output, try to answer the following questions:

  1. How many total observations (rows) are present in the dataset?
  2. How many total factors are present in the dataset?
  3. How many levels are there of each factor? How do you know?
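If you would rather check your answers to the first two questions with code than by reading the printed output, one possible sketch is below. The variable names (`n_rows`, `is_fct`, `n_factors`) are my own and not from the text.

```r
library(tidyverse)
library(forcats)

# Question 1: total observations (rows)
n_rows <- nrow(gss_cat)
n_rows

# Question 2: which columns are factors, and how many are there?
is_fct <- map_lgl(gss_cat, is.factor)  # TRUE for each factor column
n_factors <- sum(is_fct)
names(gss_cat)[is_fct]
n_factors
```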

If you’re not sure how to answer number three, run the following:

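gss_cat %>%
  count(race)
### %>% is called a pipe, and the keyboard shortcut is Cmd/Ctrl + Shift + M
using the count() function from dplyr

The count() function, which comes from dplyr (loaded as part of the tidyverse) rather than from forcats, gives you the number of observations within each level that appears in the data. By default, levels with zero observations are not listed, so use levels() or nlevels() if you want to see every level of a factor.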
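To compare what `count()` reports with the levels the factor actually defines, here is a short sketch. It assumes a version of dplyr recent enough to support the `.drop` argument of `count()`.

```r
library(tidyverse)
library(forcats)

levels(gss_cat$race)   # every level defined for the factor
nlevels(gss_cat$race)  # how many levels that is

# .drop = FALSE keeps levels that have zero observations
race_counts <- gss_cat %>%
  count(race, .drop = FALSE)
race_counts
```

If the number of rows from a plain `count(race)` is smaller than `nlevels(gss_cat$race)`, at least one level never occurs in the data.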

Now that you know that there are six factors within the gss_cat dataset, determine the following:

  1. The total number of levels within each factor
  2. The total number of observations within each level
You can do it!
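If you get stuck, or want to check your work, here is one way to do it in a single pass. This is a sketch that assumes a version of dplyr that supports `where()` inside `select()`; `levels_per_factor` is a name I made up.

```r
library(tidyverse)
library(forcats)

# Number of levels for every factor column at once
levels_per_factor <- gss_cat %>%
  select(where(is.factor)) %>%
  map_int(nlevels)
levels_per_factor

# Observations per level for one factor; swap in any other factor column
marital_counts <- gss_cat %>%
  count(marital, sort = TRUE)
marital_counts
```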

Annotated modifying factor order

It’s all well and good to count levels and get the number of observations, but we often need to do slightly more complex work with our data. The example in the book is great, and ultimately produces two graphs that highlight how putting your factors in a specific order can be helpful.

The original code provided was:

I’ve also created an annotated version of the code:
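For reference, here is a sketch of that code as I recall it from the factors chapter of the text, with my own comments added as annotations. Treat the comments as my reading of the code, and check the book if anything differs.

```r
library(tidyverse)
library(forcats)

relig_summary <- gss_cat %>%
  group_by(relig) %>%                       # one group per religion level
  summarise(
    age = mean(age, na.rm = TRUE),          # average age, ignoring missing values
    tvhours = mean(tvhours, na.rm = TRUE),  # average daily TV hours, ignoring missing values
    n = n()                                 # number of observations in the group
  )

# Plot average TV hours (x) against religion (y), one point per religion
p_unordered <- ggplot(relig_summary, aes(tvhours, relig)) +
  geom_point()
p_unordered
```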

When we plot the data points, we get the following graph:

This graph isn’t awful, but it is difficult to pick out patterns

We can’t really tell from the above graph if our data follows a particular trend. To make it easier for anyone looking at our graph (ourselves included!), we can reorder the levels of our factor using fct_reorder() within our ggplot() code, as such:
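A sketch of that change is below. It rebuilds the `relig_summary` table from the book's example so the snippet runs on its own; `p_ordered` is just my name for the plot object.

```r
library(tidyverse)
library(forcats)

# Same summary table as in the book's example
relig_summary <- gss_cat %>%
  group_by(relig) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )

# fct_reorder(relig, tvhours) reorders the levels of relig by tvhours,
# so the y axis runs from the fewest TV hours to the most
p_ordered <- ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) +
  geom_point()
p_ordered
```

Note that `fct_reorder()` only changes the order of the levels, and therefore the order on the axis. The underlying values are untouched.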

In the graph below, which shows the exact same data, just ordered from the fewest tvhours to the most, we can see how the tvhours variable changes by religion:

Stretch exercise

Create graphs for other factors within the gss_cat dataset. You can use the code provided above as a template for creating new variables, or you can work with the existing variables within the dataset.
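As one possible starting point, here is the same pattern applied to a different factor. The choice of marital and age is only an illustration, and `marital_summary` is a name I made up; swap in whichever variables interest you.

```r
library(tidyverse)
library(forcats)

# Summarise a numeric variable by the levels of a factor
marital_summary <- gss_cat %>%
  group_by(marital) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    n = n()
  )

# Order the factor by the summarised value before plotting
p_marital <- ggplot(marital_summary, aes(age, fct_reorder(marital, age))) +
  geom_point() +
  labs(x = "Mean age", y = "Marital status")
p_marital
```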

Be sure to share your results in our Week 8 channel!


If you’ve made it this far…

You should feel fairly confident in your understanding of factors. Sure, you might not quite know what to do with them every time you encounter them, but you know what they are, you know that you can use the tidyverse and forcats to work with them, and you have a supportive group of people who would be more than happy to answer any questions you might have about factors!

So what are you waiting for? Go wrangle some data!
