7 Levels of Data Manipulation with the tidyverse: Part 2

Ryan Harrington
CompassRed Data Blog
7 min read · Jul 22, 2020

In the first part of our exploration of data manipulation tools in the tidyverse, we concentrated on some of the basic dplyr and tidyr verbs. These tools allowed us to subset, transform, pivot, and join data in our quest to understand Stack Overflow's Annual Developer Survey and the 2020 World Happiness Report.

In this article, we will continue exploring data manipulation tools in the tidyverse, building on our learnings from Part 1.

You can continue to find all of the code here:

Level 5: Colwise Transforming

When preparing data for modeling, it is frequently necessary to transform several variables in the same way. For example, when working with continuous variables, we may want to scale or center them. Most of the World Happiness Report variables that we have augmented our data with are continuous and would benefit from this treatment. We can accomplish this with mutate(), but it would be tedious. Here's how we might do this:

so_joined %>% 
  mutate(ladder_score = scale(ladder_score),
         social_support = scale(social_support),
         generosity = scale(generosity))

We apply the same transformation to each variable within mutate(): ladder_score, social_support, and generosity. For each, we transform the raw continuous values into z-scores using the scale() function. This covers only 3 of the 18 variables that need the transformation. Besides being tedious, this approach is error prone.

The solution is a colwise transformation using the across() function. The across() function can be used inside mutate() or summarize() to apply the same transformation to multiple columns simultaneously. Here is how the same code as above can be applied to all 18 of the continuous variables from the World Happiness Report.

so_scoped <- 
  so_joined %>%
  mutate(across(.cols = ladder_score:dystopia_residual,
                .fns = scale))

The across() function is placed inside of mutate(). In its .cols parameter, across() accepts the same tidyselect helpers as the select() function. In this case, we selected all of the columns from ladder_score to dystopia_residual by writing ladder_score:dystopia_residual. The .fns parameter accepts a function (or a list of functions) to apply to each of the selected columns. This solution is simpler and significantly less error prone.
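The .cols parameter accepts other tidyselect helpers as well. As a quick sketch, we could scale every numeric column with the where() helper instead of naming an explicit range:

# A sketch using the where() helper: scale every numeric column
# rather than naming an explicit range
so_joined %>%
  mutate(across(.cols = where(is.numeric),
                .fns = scale))

Note that this version would also scale any other numeric columns in the data frame, such as identifiers, so the explicit ladder_score:dystopia_residual range is the safer choice here.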

Level 6: Nesting

If we are aiming to predict whether or not a respondent is satisfied with their job, we might also be interested in understanding what drives that decision across different regions of the world. In a typical modeling process, we would accomplish this by creating an individual model for each world region. This might involve subsetting the data as an additional layer of data prep work. That might look something like this:

so_western_europe <- 
  so_scoped %>%
  filter(regional_indicator == "Western Europe")

so_na_anz <-
  so_scoped %>%
  filter(regional_indicator == "North America and ANZ")

so_s_asia <-
  so_scoped %>%
  filter(regional_indicator == "South Asia")

If we want to build a model for each region of the world in this way, then we would have to individually subset the data for each region and then go through the modeling process for every subset. This presents a similar problem to transforming multiple columns: it is tedious and error prone.

In order to solve this problem, we must introduce the concept of "list columns". Instead of columns that store atomic data types (integers, doubles, strings, or logicals), we create a column of type list. This can be accomplished with the nest() function (and undone later with the unnest() function). This dramatically simplifies the process of subsetting the data for each region. Here is what this looks like in practice:

so_nested <- 
  so_scoped %>%
  group_by(regional_indicator) %>%
  nest()

The output is a data frame reduced to one row per regional_indicator, with an additional column called data, which is of type list. Each observation of data is a <tibble>. If we explored each <tibble>, we would find that it is extremely similar to our original data frame, with two key differences:

  1. It is subset to only the observations from the respective regional_indicator.
  2. It contains every original column with the exception of regional_indicator.

We can prove this to ourselves using the pluck() function from the purrr package, which provides us with a convenient way of “plucking” data from a nested list structure.

# Column 2 (the data list column), element 1: the first region's tibble
so_nested %>% 
  pluck(2, 1)
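And, as mentioned above, nest() can be reversed. A minimal sketch of round-tripping back to the original structure:

# Undo the nesting: each <tibble> expands back into rows, and
# regional_indicator is restored to every observation
so_nested %>%
  unnest(cols = data) %>%
  ungroup()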

Manipulating the data in this way gives us an excellent framework for building many models at once. We now just need a set of tools to build models (or anything else!) from this framework.

Level 7: Mapping

To take full advantage of the nest() paradigm, which relies on lists, we need better tools for working with lists. Those tools are provided by the purrr package, which expands on R's natural strength in functional programming. The star of the package is map(), which allows us to apply the same function to each element of a list. There are many variants of the map() function, such as map_lgl(), map_int(), map_dbl(), and map_chr(). These are particularly useful when it is important to coerce whatever a function returns to a particular type (logical, integer, double, and character, respectively).
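As a toy illustration of the difference between map() and its typed variants (not from the survey data):

library(purrr)

scores <- list(a = 1:3, b = 4:10)

map(scores, length)      # a list: list(a = 3L, b = 7L)
map_int(scores, length)  # an integer vector: c(a = 3L, b = 7L)
map_chr(scores, ~ paste(.x, collapse = "-"))  # a character vector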

Here is a simple example of map() being paired with mutate(). Pretend that we want to get a count of how many records are in each of our subset tibbles after using nest(). The map_dbl() function makes this trivial.

so_nested %>% 
  mutate(row_count = map_dbl(.x = data, .f = nrow))

The output of this is a data frame with an additional column called row_count. It was created by iterating over the tibbles in the data list column and applying the function nrow() to each tibble. These are represented by the parameters .x and .f respectively. Because we used map_dbl() rather than map(), row_count is a column of type double instead of a list column (which is what map() would have returned).

We can use the power of map() for much less trivial exercises. In preparation for modeling, we may want to create train-test splits for our data. Here's how we would do that:

so_nested %>% 
  mutate(train = map(.x = data,
                     .f = slice_sample,
                     prop = .75))

This is very similar to our last example. We paired map() with mutate() and applied the slice_sample() function to every tibble in the list column called data. The key difference is the additional parameter, prop = .75. The map() function and its variants can all accept additional parameters, which are passed along to the function being applied across the list. In this case, prop = .75 indicates that we want to randomly sample 75% of the observations for the newly created train column.
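The same call can also be written with purrr's formula shorthand, which we will see again in a moment. This sketch is equivalent:

# Equivalent formula (lambda) syntax: .x stands in for each tibble
# in the data list column
so_nested %>%
  mutate(train = map(.x = data,
                     .f = ~ slice_sample(.x, prop = .75)))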

Putting it all together, we can also create list columns for our test data and even for the model. We’ll use other variants of map() in order to facilitate this:

so_mapped <- 
  so_nested %>%
  mutate(train = map(.x = data,
                     .f = slice_sample,
                     prop = .75),
         test = map2(.x = data,
                     .y = train,
                     .f = anti_join,
                     by = "Respondent"),
         model = map(.x = train,
                     ~ rpart::rpart(formula = JobSat ~ .,
                                    data = .x,
                                    minbucket = 50)))

The ways that test and model are created each demonstrate a slightly different way to use the map() family of functions.

test is created in a similar way to train, but utilizes map2(). This function is one of several that allow us to iterate over multiple list columns at the same time. Notice that map2() includes parameters for .x and .y. This is necessary because the anti_join() function requires two arguments, x and y. Here, anti_join() returns every row of data that does not appear in train (matching on Respondent), which is exactly the held-out test set.

model is built slightly differently, using a formula instead. This is indicated by the ~. Notice how this looks more like a traditional function call, with parameters for rpart() listed in parentheses: formula, data, and minbucket. Also notice the .x inside the rpart() call: data = .x. This directly references the column passed in the first part of the map() call, .x = train, so each model is fit only on its region's training split. After applying these steps, each row of our data frame carries the region's full data, its train and test splits, and a fitted model.

In particular, it is interesting to notice the model column. While it is still a list column, instead of containing a <tibble>, it contains the output of our model as an <S3: rpart> object. In fact, a list column can contain any type of object, from a modeling object to a ggplot2 object. This unlocks many powerful ways for us to interact with our data moving forward.
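For example, here is a minimal sketch of one possible next step, pairing each region's model with that region's test set to generate predictions (assuming the rpart models built above):

# A sketch: apply each region's model to that region's test set.
# predict() dispatches to predict.rpart(); the second argument is
# the newdata for that region
so_mapped %>%
  mutate(predictions = map2(.x = model,
                            .y = test,
                            .f = predict))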

Wrapping Up

In these two articles, we explored how we can use a variety of tools from the tidyverse in order to manipulate our data. We went through a sample analysis that allowed us to use simple techniques like select() and filter() and more complex techniques like map() and nest(). Every data analysis will require different tools, but understanding each of the 7 levels presented here will help you to learn and use those techniques more effectively.
