R4DS Week 8: factors & forcats
The temptation to make this week’s recap nothing but a collection of my favorite cat gifs was strong, but I resisted. Sort of.
The text does a fantastic job of walking you through a multitude of examples, and so this recap is largely a series of images that are annotated versions of some of the code in the text.
Factors and Forcats
What is a factor?
Factors are how R stores categorical information. Remember that a categorical variable takes a name or label as its value, rather than a measured quantity. Examples of categorical variables include:
- Hair color
- Car make
- A Likert scale
For the purposes of this recap, we are not going to differentiate between ordinal and nominal categorical variables.
Factors and levels
Factors can trip you up pretty quickly, because they often look like strings but behave more like numbers. This is because R stores each level within a factor as an integer. By default, when you create a factor from text, R assigns those integers to the levels in alphabetical order.
In other words:
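Here is a minimal sketch in base R. The hair color values are made up for illustration and are not from the text:

```r
# Hypothetical hair color values, made up for illustration
hair <- c("brown", "black", "red", "brown", "blonde")
hair_fct <- factor(hair)

levels(hair_fct)      # "black" "blonde" "brown" "red" (alphabetical by default)
as.integer(hair_fct)  # 3 1 4 3 2 (the integer code stored for each value)
```

The labels are what you see when you print the factor, but the integers are what R actually keeps track of.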
Annotated General Social Survey
To get the data shown below, you can run:
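A minimal sketch, assuming the forcats package is installed (the gss_cat dataset ships with it):

```r
library(forcats)  # gss_cat is a sample dataset bundled with forcats

# Typing the name of the dataset prints the first rows and the column types
gss_cat
```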
The annotations point out the dataset, the marital factor, and its levels. With those in mind, try to answer the following questions:
- How many total observations (rows) are present in the dataset?
- How many total factors are present in the dataset?
- How many levels are there of each factor? How do you know?
If you’re not sure how to answer the third question, run the following:
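A minimal sketch using the marital factor, assuming dplyr and forcats are installed. The name marital_counts is just an illustrative choice:

```r
library(dplyr)    # provides count() and the %>% pipe
library(forcats)  # provides the gss_cat dataset

# One row per level of marital, with the number of observations in column n
marital_counts <- gss_cat %>%
  count(marital)

marital_counts
```

Swap marital for any other factor in the dataset to get the same summary for that factor.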
Note: %>% is called a pipe, and the keyboard shortcut for it in RStudio is Cmd/Ctrl + Shift + M.
The count() function, which comes from dplyr rather than forcats, gives you the total number of observations within each level. The number of rows it returns also tells you how many levels of the factor actually appear in the data.
Now that you know that there are six factors within the gss_cat dataset, determine the following:
- The total number of levels within each factor
- The total number of observations within each level
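If you want to check your answer to the first question, one possible approach (a sketch in base R, not something from the text) is to pick out the factor columns and ask each one for its number of levels. The names is_fct and n_levels are illustrative:

```r
library(forcats)  # provides the gss_cat dataset

# TRUE for each column that is a factor, FALSE otherwise
is_fct <- sapply(gss_cat, is.factor)

# Number of levels defined for each factor column
n_levels <- sapply(gss_cat[is_fct], nlevels)
n_levels
```

Keep in mind that nlevels() counts every level defined for a factor, including any level with no observations, so its answer can differ from the number of rows that count() returns.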
Annotated modifying factor order
It’s all well and good to count levels and get the number of observations, but we often need to do slightly more complex work with our data. The example in the book is great, and ultimately produces two graphs that highlight how putting the levels of a factor in a specific order can be helpful.
The original code provided was:
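A sketch along the lines of the book's religion example, assuming dplyr, ggplot2, and forcats are installed. Treat the name relig_summary and the exact columns as assumptions and check them against the text:

```r
library(dplyr)    # group_by(), summarise(), n(), and the %>% pipe
library(ggplot2)  # ggplot(), aes(), geom_point()
library(forcats)  # provides the gss_cat dataset

# One row per religion, with average age, average tvhours, and a row count
relig_summary <- gss_cat %>%
  group_by(relig) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )

# Average tvhours on the x-axis, religion on the y-axis
ggplot(relig_summary, aes(tvhours, relig)) +
  geom_point()
```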
I’ve also created an annotated version of the code:
When we plot the data points, we get the following graph:
We can’t really tell from the above graph if our data follows a particular trend. To make it easier for anyone looking at our graph (ourselves included!), we can reorder the levels of the relig factor using fct_reorder() within our ggplot() code, like so:
In the graph below, which has the exact same data, just ordered from the fewest tvhours to the most, we can see how the tvhours variable changes by religion:
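A sketch of the reordered plot, assuming dplyr, ggplot2, and forcats are installed. The summary table is rebuilt here so the snippet runs on its own, and the names relig_summary and p are illustrative:

```r
library(dplyr)
library(ggplot2)
library(forcats)

# Average tvhours per religion
relig_summary <- gss_cat %>%
  group_by(relig) %>%
  summarise(tvhours = mean(tvhours, na.rm = TRUE))

# fct_reorder() sorts the levels of relig by tvhours, for this plot only
p <- ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) +
  geom_point()

p
```

Because the reordering happens inside aes(), the relig column in the data itself is left unchanged.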
Create graphs for other factors within the gss_cat dataset. You can use the code provided above as a template for creating new variables, or you can work with the existing variables within the dataset.
Be sure to share your results in our Week 8 channel!
If you’ve made it this far…
You should feel fairly confident in your understanding of factors. Sure, you might not quite know what to do with them every time you encounter them, but you know what they are, you know that you can use the forcats package to work with them, and you have a supportive group of people who would be more than happy to answer any questions you might have about factors!
So what are you waiting for? Go wrangle some data!