Replacing case_when() with joins in R

Lara Southard, PhD
Nerd For Tech
Published in
3 min readNov 15, 2021

--

I used to find myself writing hefty ifelse() or case_when() statement to recode values or create new columns — until someone told me to I could use a join instead.

Conceptual map for case_when()

Example of how to replace a case_when() with a join.

Let’s use a common dataset in R:

USArrests using Assault and Urban population columns

By running a quick min/max aggregate function, we can see the minimum assault is 45 and the maximum is 337.

Let’s pretend we wanted to create a new column for buckets of assaults as: high assaults, medium-high, medium, medium-low, and low assaults?

We could do a case_when()

USArrests %>%
mutate(buckets = case_when(Assault >= 300 ~ "High",
Assault < 300 & Assult >= 250 ~ "High-Medium",
Assault < 250 & Assult >=200 ~ "Medium",
Assault < 200 & Assult >=150 ~ "Medium-Low",
Assault < 150 ~ "Low"))

It works, but it quickly becomes cumbersome and I wouldn’t call this optimized. Instead, we can just create a reference table, then use a fuzzy_join(). You could also do this with a left_join(), but I think the fuzzy_join is the best option so I’ll show that one.

Note: If I have a few categories, I’ll just do a case_when() or if_else(). I’m using an example with a few categories just to be brief. I also really like to use the joins over other functions to create reference tables I would use for multiple datasets instead of copying large case_when() or if_else() statements.

Generic example of fuzzy_join()

fuzzyjoin::fuzzy_join()

Create a reference dataframe:

ref.df <- data.frame(
bucket = c(“High”, “Medium-High”, “Medium-Low”, “Low”),
value.high = c(max(USArrests$Assault), 249, 199, 149),
value.low = c(250, 200, 150, min(USArrests$Assault)))

Above, I knew I wanted to high and low buckets to include the maximum and minimum assaults, repsectively. I just created random buckets for the sake of the example using hardcoded values.

Next, use fuzzy_join() to match values that are between your high and low values:

USArrests %>% 
fuzzy_join(ref.df,
by = c("Assault"="value.low",
"Assault" = 'value.high'),
match_fun = c(`>=`,`<=`)) %>%
select(-c(value.high, value.low))

fuzzy_join() Line-by-Line:

  1. Pipe in your dataframe you want to join (master df)
  2. Call the function and enter your reference dataframe
  3. We want to match against the low ( value.low ) and against high ( value.high).
  4. Next, match_fun() will tell R how you want to match each. The first argument in match_fun()goes with the first argument in by = c(), the second argument goes with the second argument in by = c(). What I’ve done here is said: match if Assaultis greater than or equal ( ) to value.low and if Assault is less than or equal ( ) to value.high. match_fun() takes any logical operators. If this is confusing, you can think of it like between(). .
  5. Lastly, I removed unnecessary columns from my final dataframe using select(-c).

The result:

All methods can get you to the same result and have their place. For many categories (unlike the example) or code I would like to reuse, I tend to lean on the fuzzy_join() option because more scalable and prone to fewer errors in my opinion.

I have other examples of how to use this with dates on my Github here.

--

--

Lara Southard, PhD
Nerd For Tech

trained neuroscientist | professional research scientist | lifelong feminist