# R4DS Week 8: factors & forcats

### Introduction

The temptation to make this week’s recap nothing but a collection of my favorite cat gifs was strong, but I resisted. Sort of.

The text does a fantastic job of walking you through a multitude of examples, and so this recap is largely a series of images that are annotated versions of some of the code in the text.

If you’re looking for resources to supplement the content in “R for Data Science”, check out this walkthrough from Jenny Bryan.

### Factors and Forcats

#### What is a factor?

Factors are how categorical information gets stored in R. Remember that a categorical variable is one whose values are names or labels rather than measurements. Examples of categorical variables include:

- Countries
- Hair color
- Car make
- A Likert scale

For the purposes of this recap, we are **not** going to differentiate between ordinal and nominal categorical variables.

#### Factors and levels

Factors can trip you up pretty quickly, because they often look like strings, but behave more like numbers. This is because R stores each value in a factor as an integer that points to one of the factor’s levels. By default, R sorts the levels alphabetically and hands out the integers in that order.

*In other words:*
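A tiny sketch of the idea in base R. The hair colors are made-up values, and `hair` and `hair_fct` are names invented for this illustration:

```r
# A character vector of hair colors (made-up values for illustration)
hair <- c("brown", "black", "red", "brown", "blonde")

# Turn it into a factor; by default the levels are sorted alphabetically
hair_fct <- factor(hair)

levels(hair_fct)      # "black" "blonde" "brown" "red"
as.integer(hair_fct)  # 3 1 4 3 2  (the integer stored for each value)
```

So "brown" prints like a string, but underneath it is stored as 3, because it is the third level alphabetically.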

### Annotated General Social Survey

To get the data shown below, you can run:

    library(tidyverse)
    library(forcats)
    gss_cat

The annotations mark the **dataset**, the *marital* **factor**, and its **levels**. With those in mind, try to answer the following questions:

- How many total observations (rows) are present in the dataset?
- How many total factors are present in the dataset?
- How many levels are there of each factor? How do you know?

If you’re not sure how to answer the third question, run the following:

    gss_cat %>%
      count(race)
    # %>% is called a pipe, and the keyboard shortcut is Cmd/Ctrl + Shift + M

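The **count()** function, which comes from **dplyr** (loaded as part of the tidyverse) rather than **forcats**, gives you the number of observations within each **level**. The number of rows in its output tells you how many **levels** of the factor actually appear in the data; levels with no observations are not listed.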

Now that you know that there are six **factors** within our gss **dataset**, determine the following:

- The total number of **levels** within each **factor**
- The total number of observations within each **level**
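If you’d rather check every factor at once instead of one at a time, here is one possible sketch. It assumes a version of dplyr that exports `where()` (1.0.0 or later), and `gss_factors` and `level_counts` are names made up for this example:

```r
library(dplyr)
library(forcats)  # provides the gss_cat dataset

# Keep only the factor columns of gss_cat
gss_factors <- gss_cat %>%
  select(where(is.factor))

# Number of levels defined for each factor (one named integer per factor)
level_counts <- sapply(gss_factors, nlevels)
level_counts

# Number of observations within each level of one factor, here marital
gss_cat %>%
  count(marital)
```

Note that `nlevels()` counts every level the factor defines, while `count()` only lists levels that have at least one observation, so the two numbers can differ.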

### Annotated modifying factor order

It’s all well and good to count levels and get the number of observations, but we often need to do slightly more complex work with our data. The example in the book is great, and ultimately produces two graphs that highlight how putting your factors in a specific order can be helpful.

The original code provided was:
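Here is a sketch of that example as I understand it from the factors chapter of the book: it averages `age` and `tvhours` for each religion and then plots the result. The `relig_summary` name follows the book, while `p_unordered` is a name added here so the plot can be stored and printed:

```r
library(dplyr)
library(ggplot2)
library(forcats)  # provides the gss_cat dataset

# One row per religion, with average age, average tv hours, and group size
relig_summary <- gss_cat %>%
  group_by(relig) %>%
  summarise(
    age = mean(age, na.rm = TRUE),         # na.rm drops missing values
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )

# Scatterplot of average tv hours per religion, in the default level order
p_unordered <- ggplot(relig_summary, aes(tvhours, relig)) +
  geom_point()
p_unordered
```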

I’ve also created an annotated version of the code:

When we plot the data points, we get the following graph:

We can’t really tell from the above graph if our data follows a particular trend. To make it easier for anyone looking at our graph (ourselves included!), we can reorder the levels of our factor using **fct_reorder()** within our **ggplot()** code, as such:
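A sketch of the reordered plot, again following the book’s example as I understand it. The summary step is repeated so the block runs on its own, and `p_ordered` is a name added here for illustration:

```r
library(dplyr)
library(ggplot2)
library(forcats)

# Same per-religion summary as in the book's example
relig_summary <- gss_cat %>%
  group_by(relig) %>%
  summarise(
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )

# fct_reorder(relig, tvhours) sorts the levels of relig by their tvhours value,
# so the points run from least to most tv hours up the y axis
p_ordered <- ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) +
  geom_point()
p_ordered
```

The reordering happens only inside the plot call here; `relig_summary` itself keeps its original level order.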

In the graph below, which has the exact same data, just put in order from the fewest tvhours to the most, we can see how the tvhours variable changes by religion:

**Stretch exercise**

Create graphs for other factors within the gss_cat dataset. You can use the code provided above as a template for creating new variables, or you can work with the existing variables within the dataset.

Be sure to share your results in our Week 8 channel!

### If you’ve made it this far…

You should feel fairly confident in your understanding of factors. Sure, you might not quite know what to do with them every time you encounter them, but you know what they are, you know that you can use the `tidyverse` and `forcats` to work with them, and you have a supportive group of people who would be more than happy to answer any questions you might have about factors!

So what are you waiting for? Go wrangle some data!