Replacing case_when() with joins in R
--
I used to find myself writing hefty ifelse()
or case_when()
statement to recode values or create new columns — until someone told me to I could use a join instead.
Example of how to replace a case_when() with a join.
Let’s use a common dataset in R:
By running a quick min/max aggregate function, we can see the minimum assault is 45
and the maximum is 337.
Let’s pretend we wanted to create a new column for buckets of assaults as: high assaults, medium-high, medium, medium-low, and low assaults?
We could do a case_when()
…
USArrests %>%
mutate(buckets = case_when(Assault >= 300 ~ "High",
Assault < 300 & Assult >= 250 ~ "High-Medium",
Assault < 250 & Assult >=200 ~ "Medium",
Assault < 200 & Assult >=150 ~ "Medium-Low",
Assault < 150 ~ "Low"))
It works, but it quickly becomes cumbersome and I wouldn’t call this optimized. Instead, we can just create a reference table, then use a fuzzy_join()
. You could also do this with a left_join()
, but I think the fuzzy_join
is the best option so I’ll show that one.
Note: If I have a few categories, I’ll just do a case_when()
or if_else()
. I’m using an example with a few categories just to be brief. I also really like to use the joins over other functions to create reference tables I would use for multiple datasets instead of copying large case_when()
or if_else()
statements.
fuzzyjoin::fuzzy_join()
Create a reference dataframe:
ref.df <- data.frame(
bucket = c(“High”, “Medium-High”, “Medium-Low”, “Low”),
value.high = c(max(USArrests$Assault), 249, 199, 149),
value.low = c(250, 200, 150, min(USArrests$Assault)))
Above, I knew I wanted to high and low buckets to include the maximum and minimum assaults, repsectively. I just created random buckets for the sake of the example using hardcoded values.
Next, use fuzzy_join()
to match values that are between your high and low values:
USArrests %>%
fuzzy_join(ref.df,
by = c("Assault"="value.low",
"Assault" = 'value.high'),
match_fun = c(`>=`,`<=`)) %>%
select(-c(value.high, value.low))
fuzzy_join() Line-by-Line:
- Pipe in your dataframe you want to join (master df)
- Call the function and enter your reference dataframe
- We want to match against the low (
value.low
) and against high (value.high
). - Next,
match_fun()
will tell R how you want to match each. The first argument inmatch_fun()
goes with the first argument inby = c()
, the second argument goes with the second argument inby = c()
. What I’ve done here is said: match ifAssault
is greater than or equal (≥
) tovalue.low
and ifAssault
is less than or equal (≤
) tovalue.high
.match_fun()
takes any logical operators. If this is confusing, you can think of it likebetween()
. . - Lastly, I removed unnecessary columns from my final dataframe using
select(-c)
.
The result:
All methods can get you to the same result and have their place. For many categories (unlike the example) or code I would like to reuse, I tend to lean on the fuzzy_join()
option because more scalable and prone to fewer errors in my opinion.
I have other examples of how to use this with dates on my Github here.