Eight R Tidyverse tips for everyday data engineering

Tomaz Kastrun
6 min readJul 14, 2022

--

Tidyverse is a collection of R packages, primarily for data engineering and analytics. These packages are ggplot2, purrr, tibble, dplyr, tidyr, stringr, readr, and forcats. And all combine the same language, design and “grammar” structures.

Collection of Tidyverse resources. Source: Tidyverse

1. Use pipe “%>%” for nesting functions within the pipe

Piping (or chaining) is a great way to link the data manipulation functions without storing intermediate results.

# inner piping inside mutate
airlines %>%
mutate(name_short = name %>%
str_to_upper() %>%
str_replace_all (" (INC|CO)\\.?$", "") %>%
str_replace_all (" AIR ?(LINES|WAYS)?( CORPORATION)?$", "") %>%
str_to_title() %>%
str_replace_all("\\bUs\b", "US")
) %>%
mutate(FullName_length = nchar(name)) %>%
select (name_short, FullName_length) %>%
arrange(desc(FullName_length))

This code is clearly readable, understandable and easy to curate, despite the heavy use of REGEX and str_replace_all function. If we would replace the inner piping with normal function wrapping, the mutate for “name_short” would be fairly unreadable.

# without inner piping
airlines %>%
mutate(name_short = str_replace_all( str_to_title( str_replace_all(str_replace_all(str_to_upper(name)," (INC|CO)\\.?$", "")," AIR ?(LINES|WAYS)?( CORPORATION)?$", "")),"\\bUs\b","US")) %>%
mutate(FullName_length = nchar(name)) %>%
select (name_short, FullName_length) %>%
arrange(desc(FullName_length))

Both code snippets return the same result and use essentially the same functions. Whereas, the first one intelligently does this, using pipes.

2. Operations across multiple columns at once

Tidyverse provides a handful of great functions for operating across multiple columns simultaneously. Across is a function, that makes it easy to apply the same transformation over numerous columns in summarise() and mutate() functions.

Across accepts two arguments; a) array of columns and b) function or list of functions to be applied to selected columns.

# Simple across with two selected columns
flights %>%
mutate(across(c(dep_delay, arr_delay), abs))
# Using function where to select columns
flights %>%
summarise(across(where(is.integer), n_distinct))
# Using where and purr-style lamba function
flights %>%
summarise(across(where(is.integer), ~ sum(.x, na.rm = TRUE)))
# using list of functions, defining column names and removing NA
flights %>%
group_by(carrier) %>%
summarise(across(ends_with("time"), list(AVG = mean, SD = sd, GrandTotal= ~ sum(is.na(.x))), na.rm=TRUE, .names = "{.col}.{.fn}")) %>%
ungroup()

3. Case statement to create a column based on a condition

Creating case statements is often a required task and case_when() function enables simple, fast transformation. A new column is simply added on top of other tidyverse functions; such as group_by and count.

flights %>%
group_by(carrier) %>%
mutate(new_classification = case_when(
(origin == "EWR") & (dep_delay <= 0) ~ "EWR with negative delay",
(origin == "EWR") & (dep_delay > 0) ~ "EWR with positive delay",
(origin == "JFK") ~ "Stats for JFK Airport",
(origin == "LGA") & (air_time <= 220) ~ "La Guardia under 6 hours flights",
TRUE ~ "La Guardia above 6 hours flights"
)) %>%
count(new_classification) %>%
ungroup()

4. Using transmute

This function executes same functions as mutate — create new column. But transmute() adds new variable(s) and drop existing ones, where mutate() preserves the existing ones.

library(lubridate)# Combination of mutate and select
flights %>%
mutate(date = make_date(year, month, day), carrier, tailnum) %>%
select(date, carrier, tailnum)
# using transmute
flights %>%
transmute(date = make_date(year, month, day), carrier, tailnum)

The example above returns essentially the same result set, whereas transmute() is cleaner and easier to read. If one would remove the select() function from the first example, one would get the complete dataset, along with a new column.

5. Lumping levels

Lumping together levels for a given factor variable is an extremely powerful function when there are many levels but one is interested in only the top most frequent levels.

Given the vector of 15 letters and their frequency, one can immediately see, that out of 15 different letters, roughly 80% of the observations fall into 5 letters.

# using transmute
x <- factor(rep(LETTERS[1:15], times = c(20,15,23,2,4,3,1,1,1,5,2,8,3,1,1)))
x %>% table()

And to achieve this, one can use fct_lump_n() function or fct_lump_min().

x %>% 
fct_lump_n(5) %>%
table()
# or alternatively
# x %>% fct_lump_min(5) %>% table()
## result:
> A B C J L Other
> 20 15 23 5 8 19

Now displaying the long tail of different levels with low frequency is grouped into “other” bucket, great for visualising data.

# using ggplot to show top 5 frequent carriers and "other" group
flights %>%
mutate(name = fct_lump_n(carrier, 5)) %>%
count(name) %>%
mutate(name = fct_reorder(name, n)) %>%
ggplot(aes(x=name, y=n)) +
geom_col()
Quick overview graph using fct_lump_n() function

6. Generating all possible combinations

Generating all the possible combinations out of a set of variables is not your everyday scenario. But Tidyverse offers the function called crossing() or expand() to do the job.

Following example will create all possible combinations for variables age, status, values and temperature.

# pseudo sample
crossing(
age = c(30,40,50,60,70),
status = c("New", "Used"),
values = c("0-100EUR", "101-200E", "201-300", "301-400"),
temperature = c(30,35,34)
)

Working with flights dataset, let’s create all possible combinations for couple of selected variables. One would use function expand()

# generating all possible combinations
flights %>% expand(origin, dest, dep_time, carrier)

But finding all existing and unique combinations

# getting all possible combinations that are present in dataset using crossing
flights %>%
select(origin, dest, dep_time, carrier) %>%
crossing()
# > 97,946 more rows
# getting all possible combinations using expand and nesting
flights %>% expand(nesting(origin, dest, dep_time,carrier))
# > 97,946 more rows

7. Reshaping data with pivot and spread

Another important task that will usually occur with data preparation is data reshaping.

First function one must (!) understand and be comfortable work with is pivot_wider(). This function takes the values from one variable (in this case variable: origin) and transpose (reshape / pivot) the data (dep_time and arr_time) for given statistics (average of values).

#pivot_wider
flights %>%
group_by(carrier) %>%
select(origin, dep_time, arr_time) %>%
pivot_wider(
names_from = origin,
values_from = c(dep_time,arr_time),
values_fn = ~mean(.x, na.rm = TRUE),
names_glue = "{origin}_{.value}"
)
#check calculation for carrier UA and origin EWR
flights %>%
filter(carrier == 'UA' & origin == 'EWR') %>%
group_by(carrier) %>%
summarise(
avg_dep_time = mean(dep_time, na.rm = TRUE)
)

Contra to pivot_wider() is the function pivot_longer(). This function does the exact opposite. Wider function adds more columns and reduces the number of rows, longer function reduces the number of columns and creates new rows of data.

Let’s persist the dataframe from pivot_wider example and name the dataframe flights_wider.

#create and persist dataframe called: flights_wider
flights_wider <- flights %>%
group_by(carrier) %>%
select(origin, dep_time, arr_time) %>%
pivot_wider(
names_from = origin,
values_from = c(dep_time,arr_time),
values_fn = ~mean(.x, na.rm = TRUE),
names_glue = "{origin}_{.value}"
)
# excluding carrier
flights_wider %>%
pivot_longer(
!carrier,
names_to = "origin",
values_to = "time",
values_drop_na = TRUE
)
#or with cols parameter and defining the pattern for column selection
flights_wider %>%
pivot_longer(
cols = ends_with("time"),
names_to = "origin",
values_to = "time",
values_drop_na = TRUE
)

8. Adding a running ID to your dataframe

The last tip is the easiest and yet can save tons of time. Adding a running ID to each row of the dataframe. This ID will create “uniqueness” overall attributes (columns) for a given row. This can save you a lot of filtering and writing code, whereas, one can say, give me a row with ID = 42.

# Associate ID with every row in dataset
flights %>%
mutate(running_id = row_number())
# or using tibble row_to_column function
flights2 <- tibble::rowid_to_column(flights, ID)

This list is by no means the ultimate list and one could add many honourable mentions. But despite all that, the list has been curated over a longer period of time and this cherry-picking is based on work and the use of Tidyverse.

The complete code is available on Github. Follow this link.

Tomaž Kaštrun is a data geek, in data mining and data science, and enjoys working with data. Community is core to technology development. Microsoft data platform MVP. Github: http://www.github.com/tomaztk

--

--

Tomaz Kastrun
Tomaz Kastrun

Written by Tomaz Kastrun

Data Platform MVP, Data scientist, Geek. Community is core to technology development.