Replacing case_when() with joins in R
I used to find myself writing hefty
case_when() statement to recode values or create new columns — until someone told me to I could use a join instead.
Example of how to replace a case_when() with a join.
Let’s use a common dataset in R:
By running a quick min/max aggregate function, we can see the minimum assault is
45 and the maximum is
Let’s pretend we wanted to create a new column for buckets of assaults as: high assaults, medium-high, medium, medium-low, and low assaults?
We could do a
mutate(buckets = case_when(Assault >= 300 ~ "High",
Assault < 300 & Assult >= 250 ~ "High-Medium",
Assault < 250 & Assult >=200 ~ "Medium",
Assault < 200 & Assult >=150 ~ "Medium-Low",
Assault < 150 ~ "Low"))
It works, but it quickly becomes cumbersome and I wouldn’t call this optimized. Instead, we can just create a reference table, then use a
fuzzy_join(). You could also do this with a
left_join(), but I think the
fuzzy_join is the best option so I’ll show that one.
Note: If I have a few categories, I’ll just do a
if_else(). I’m using an example with a few categories just to be brief. I also really like to use the joins over other functions to create reference tables I would use for multiple datasets instead of copying large
Create a reference dataframe:
ref.df <- data.frame(
bucket = c(“High”, “Medium-High”, “Medium-Low”, “Low”),
value.high = c(max(USArrests$Assault), 249, 199, 149),
value.low = c(250, 200, 150, min(USArrests$Assault)))
Above, I knew I wanted to high and low buckets to include the maximum and minimum assaults, repsectively. I just created random buckets for the sake of the example using hardcoded values.
fuzzy_join() to match values that are between your high and low values:
by = c("Assault"="value.low",
"Assault" = 'value.high'),
match_fun = c(`>=`,`<=`)) %>%
- Pipe in your dataframe you want to join (master df)
- Call the function and enter your reference dataframe
- We want to match against the low (
value.low) and against high (
match_fun()will tell R how you want to match each. The first argument in
match_fun()goes with the first argument in
by = c(), the second argument goes with the second argument in
by = c(). What I’ve done here is said: match if
Assaultis greater than or equal (
Assaultis less than or equal (
match_fun()takes any logical operators. If this is confusing, you can think of it like
- Lastly, I removed unnecessary columns from my final dataframe using
All methods can get you to the same result and have their place. For many categories (unlike the example) or code I would like to reuse, I tend to lean on the
fuzzy_join() option because more scalable and prone to fewer errors in my opinion.
I have other examples of how to use this with dates on my Github here.