7 Levels of Data Manipulation with the tidyverse: Part 1

Ryan Harrington
CompassRed Data Blog
9 min read · Jul 16, 2020

Previously, we explored ways to “level up” a ggplot2 graph from basic to beautiful. This allowed us to tell the story about our data more easily by using progressively more advanced concepts from the ggplot2 package (and a few others).

One of the first “levels” that we explored was focused on data cleaning. In the previous use case, this was a relatively simple exercise. In practice, though, there are many tools that we need to utilize in order to effectively clean, shape, and manipulate our data. The tidyverse (which ggplot2 is part of) offers several packages that simplify data manipulation.

In this article, I'll explore different tools for data manipulation using tidyverse functions. It assumes that you have a beginner's understanding of R and of how the pipe operator (%>%) works. Think of it as scaffolding as you grow your data manipulation toolkit: there are many more patterns worth learning, but starting with these techniques puts you in a position for success.

The article will be split into two parts:

  • Part 1 will focus on the basic dplyr and tidyr verbs
  • Part 2 will explore a few more advanced data manipulation concepts

Data to Explore

StackOverflow recently released the results of their Annual Developer Survey. We’ll be taking a look at the anonymized responses. In the code, this dataset will be called so. In addition, we’ll be augmenting the StackOverflow data with 2020 World Happiness Report data. This dataset will be called happiness. To facilitate looking at the different “levels” of data manipulation, we’ll be considering a simple model to predict whether or not a developer is satisfied with their role and what the differences are by region of the world.

Also, as we’re exploring the data, you can follow along with all of the code for it here:

Level 1: Subsetting

One of the simplest and most essential ways to explore data is to subset it. This means that we can remove rows or columns from the dataset in order to focus on the parts of the data that we care most about. In order to do that in the tidyverse, we take advantage of the filter() and select() functions from dplyr. Each of these functions does exactly what it sounds like it does. The filter() function filters rows based upon a logical condition in a column. The select() function allows us to choose columns to work with by name or characteristic.

In our use case, we’re interested in answering the question of whether or not a developer will be satisfied with their role. This is represented by the JobSat field in the StackOverflow dataset.

Not every person answered the JobSat question. We'll want to remove the respondents who did not answer this question from the dataset. In order to do that, we can take advantage of filter(). Specifically, we only want to keep values that are not NA. Here's how we can do that:

so %>% 
  filter(! is.na(JobSat))

Inside of the filter() function we pass ! is.na(JobSat), which identifies NA values in the JobSat field and evaluates to TRUE or FALSE for every row in the dataset. Only rows that evaluate to TRUE will be kept. In general, that is all that is needed for filter() — a statement that evaluates to TRUE or FALSE for every row in the dataset. We can use any logical operator that we want to accomplish this: ==, >, <, amongst many others. In addition, filter() can accept multiple logical conditions at once by separating the conditions with a comma.
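For example, here is a quick sketch of filter() with several conditions at once. This is illustrative only; it reuses the Country and WorkWeekHrs columns that appear later in this article, and the specific values are placeholders:

so %>% 
  filter(! is.na(JobSat),              # keep only answered rows
         Country == "United States",   # respondents from a single country
         WorkWeekHrs > 40)             # who report more than 40 hours per week

Each comma-separated condition must evaluate to TRUE for a row to be kept, which is equivalent to combining the conditions with &.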

Next, we can use select() to consider the columns that we want to keep for modeling. We can directly name columns in the order that we would like them to appear within the select() function. We want to track the respondents, so we will keep the Respondent column. The JobSat field is also important, so we want to keep that as well.

This dataset includes a large number of columns, so it would be very time-consuming to name all of them in this way. In this case, there are several “tidy select helpers” that we can make use of to reduce that overhead. Here are a few that I use frequently in practice:

  • The : operator allows you to select a range of consecutive variables.
  • everything() matches all variables. Great for when you have many variables to include.
  • starts_with() matches columns that start with a prefix. Great for when you have many columns that begin with the same phrase. For example, this dataset has several variables that begin with the prefix NEW (see the sketch after this list).
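Here's a brief sketch of how these helpers read in practice. The particular selection is illustrative; only Respondent, JobSat, and the NEW prefix are taken from this dataset:

so %>% 
  select(Respondent, JobSat,    # name columns directly
         starts_with("NEW"),    # every column whose name begins with "NEW"
         everything())          # then all remaining columns, in their original order

The : helper works the same way inside select(); for example, select(so, Respondent:JobSat) would keep every column from Respondent through JobSat, assuming those columns span a sensible range in your data.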

In this case, we have many variables that we want to include, so the everything() helper function would be very useful. Finally, there are a few columns that we don't want to include. These are columns about demographics and about the survey itself. We can remove those by placing the - operator in front of the name of the column.

Putting it all together, the subsetting step looks like this:

so_subset <- 
  so %>%
  filter(! is.na(JobSat)) %>%
  select(Respondent, JobSat, everything(),
         -Ethnicity, -Gender, -Sexuality,
         -Trans, -SurveyEase, -SurveyLength)

Level 2: Transforming

After subsetting our data, we typically would begin to transform key variables. Functions such as arrange(), mutate(), summarize(), and group_by() help to make this possible.

Each of these functions helps to simplify the process of creating and modifying columns. Like the subsetting functions, the actions that they perform are what you would expect from their names.

The arrange() function allows you to arrange a data frame based upon the values in a selected column (or columns). For example, perhaps we want to organize our data frame so that the different levels in the JobSat variable are together. arrange() allows us to accomplish this easily:

so_subset %>% 
  arrange(JobSat)
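By default, arrange() sorts in ascending order. Wrapping a column in desc() reverses that; for example (illustrative):

so_subset %>% 
  arrange(desc(WorkWeekHrs))   # respondents with the longest work weeks first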

The mutate() function allows you to add new variables to your data frame. The resulting data frame will have the same number of observations as your original data frame.

We can use this in our example in order to transform the JobSat variable into a logical field. Our goal is for any observation where JobSat indicates satisfaction (such as “Very satisfied” or “Slightly satisfied”) to be labeled as TRUE and for all other observations to be labeled as FALSE. Here's how that would look:

so_subset %>% 
  mutate(JobSat = ! str_detect(JobSat, "dissatisfied"))

When we write JobSat = ! str_detect(JobSat, "dissatisfied"), str_detect() looks for the string “dissatisfied” in the JobSat field and returns TRUE for every row where it is found. The ! then negates that result, so dissatisfied respondents become FALSE and everyone else becomes TRUE, which is what we want. These logical values are then stored back into JobSat, overwriting the original data.
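Here is how those two steps behave on a couple of toy values (str_detect() comes from stringr, which loads with the tidyverse; output shown as comments):

str_detect(c("Very satisfied", "Slightly dissatisfied"), "dissatisfied")
#> [1] FALSE  TRUE
! str_detect(c("Very satisfied", "Slightly dissatisfied"), "dissatisfied")
#> [1]  TRUE FALSE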

The summarize() function tends to work in tandem with the group_by() function. It creates a new data frame with one summarized row per group, where the groups are defined by the variable passed to group_by().

Because we’re looking to preserve all of the records, the summarize function wouldn’t be particularly helpful for us here. However, it would be very useful when doing exploratory analysis for this dataset. Perhaps we think that a useful feature might be how many hours a respondent works in comparison to the rest of the respondents in their country. We would want to first understand how much differentiation there is between respondents in different countries. The summarize() function is extremely useful for this. Here is how that code would look:

so_subset %>% 
  group_by(Country) %>%
  summarize(Avg_Hours_Work = mean(WorkWeekHrs, na.rm = T)) %>%
  ungroup()

In this case, we pass the Country variable to group_by(), indicating that we want to reduce the number of observations we have to be equal to the unique number of levels in the Country variable. There are 170 countries in our subset data, so we should end up with 170 rows after using summarize(). When we write Avg_Hours_Work = mean(WorkWeekHrs, na.rm = T), we are creating a new column in the summarized data frame that is the mean() of WorkWeekHrs (there are plenty of NA values in this field, necessitating the use of the na.rm parameter). Finally, we need to remove the grouping variable by using the ungroup() function. The functions group_by() and ungroup() should always be used as bookends.
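If you want to check how many groups to expect before summarizing, a quick illustrative one-liner counts the distinct countries in the subset:

so_subset %>% 
  distinct(Country) %>% 
  nrow()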

The group_by() function does not have to just be used in conjunction with summarize(). If we want to implement this feature into our data frame, we would want to preserve the number of observations, so we would use mutate() instead. Here is how we might implement this feature:

so_transform <- 
  so_subset %>%
  arrange(JobSat) %>%
  mutate(JobSat = ! str_detect(JobSat, "dissatisfied")) %>%
  group_by(Country) %>%
  mutate(Avg_Hours_Work = mean(WorkWeekHrs, na.rm = T)) %>%
  ungroup() %>%
  mutate(Hours_Work_Diff = WorkWeekHrs - Avg_Hours_Work)
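A quick peek (illustrative) confirms that the two new columns now sit alongside the originals:

so_transform %>% 
  select(Respondent, Country, WorkWeekHrs, Avg_Hours_Work, Hours_Work_Diff) %>% 
  head()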

Level 3: Pivoting

Frequently data is in the wrong shape when we begin working with it. A common issue is the difference between data in a wide versus a long format. Both formats are valuable, but are useful for different applications. Wide data tends to be more useful for modeling, whereas long data tends to be more useful for graphing (like all rules, these will be broken).

Credit: Garrick Aden-Buie’s (@grrrck) Tidy Animated Verbs

The tidyverse provides functions to help make the transition between these formats easier: pivot_wider() and pivot_longer(). Once again, these functions do exactly what they sound like they would do.
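As a quick sketch with a made-up two-row data frame (purely to show the shapes, not real happiness data), pivot_longer() stacks columns into name/value pairs and pivot_wider() reverses it:

# Wide format: one column per year
wide <- tibble(country = c("A", "B"),
               score_2019 = c(7.1, 6.4),
               score_2020 = c(7.0, 6.5))

# Wide to long: one row per country-year combination
long <- wide %>% 
  pivot_longer(cols = starts_with("score"),
               names_to = "year",
               values_to = "score")

# Long back to wide
long %>% 
  pivot_wider(names_from = year, values_from = score)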

A great use case for pivot_longer() is to quickly visualize the distributions of many variables. For our use case, we might want to augment our StackOverflow dataset with information about how happy people in other countries are. The World Happiness Index provides that information. Before we augment the StackOverflow data with the World Happiness Index data, it would be helpful to understand the distributions of each variable. This would be much simpler to accomplish using a long data format, but the data is in a wide format. Here's how we could accomplish that:

happiness %>% 
  pivot_longer(cols = ladder_score:dystopia_residual,
               names_to = "field",
               values_to = "value")

The pivot_longer() function accepts several parameters, but the most important are cols, names_to, and values_to. The cols parameter tells the function which columns from the happiness data frame we would like to pivot (in this case, all of the columns from ladder_score to dystopia_residual). The names_to parameter tells the function the name of the field that we would like for the original column names. The values_to parameter tells the function the name of the field that we would like for the data stored in the cell values.

We can simply pass this to ggplot2 and quickly understand the distribution of each field:

happiness %>% 
  pivot_longer(cols = ladder_score:dystopia_residual,
               names_to = "field",
               values_to = "value") %>%
  ggplot(aes(x = "",
             y = value)) +
  geom_jitter(color = CR_cols("CR_green"),   # CR_cols() is a custom color helper, not part of the tidyverse
              alpha = 0.5) +
  facet_wrap(~field, ncol = 6, scales = "free") +
  labs(x = "",
       y = "")

Level 4: Joining

One of the most important concepts for transforming data is the idea of “joining”. Joins are a concept from SQL, based upon relational algebra. Joins allow us to combine data from multiple tables into a single table. This is exactly what we will need to augment our StackOverflow data with the World Happiness Index data.

There are several types of joins to consider: inner_join(), left_join(), right_join(), and full_join() amongst a few other variants.

Credit: Garrick Aden-Buie’s (@grrrck) Tidy Animated Verbs

In our particular use case, we want to add country level data from the World Happiness Index dataset into the respondent level data from the StackOverflow dataset. To do this, we are best served by a left_join(), allowing us to preserve all of the data from the data frame on the “left hand side” of the join (StackOverflow) and adding in data from the “right hand side” of the join (World Happiness Index). Every StackOverflow respondent is kept, even if their country does not appear in the World Happiness Index data (those rows simply get NA values for the happiness fields); the only data dropped entirely is happiness data for countries with no StackOverflow respondents.

Here’s how that looks:

so_joined <- 
  so_transform %>%
  left_join(happiness, by = c("Country" = "country_name"))

In this case, the “left hand side” of the join is so_transform, our transformed StackOverflow dataset. The “right hand side” of the join is happiness, our original World Happiness Index dataset. The by parameter in left_join() allows us to dictate what field is common between the two datasets. We intend to join the two datasets based upon common countries. Those are represented by the Country field in so_transform and the country_name field in happiness. The so_joined data frame will now include all of the fields from so_transform in addition to the fields from happiness.
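As a hedged follow-up sketch, anti_join() (one of the other join variants mentioned above) is a handy way to spot countries whose names don't line up between the two datasets, since those respondents will end up with NA values in the happiness fields:

so_transform %>% 
  anti_join(happiness, by = c("Country" = "country_name")) %>% 
  distinct(Country)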
