Writing your own dplyr functions

Saher El-Neklawy
Published in Optima . Blog
4 min read · Dec 17, 2015

dplyr is awesome, like really awesome. The thing I like most about it is how readable it makes data processing code. In short, there are two primary aspects that make dplyr great for readability (in addition to its great performance, data back-end agnosticism, and more):

  • The pipe operator %>%
  • Using column names directly, without quoting them as a string

Let’s take an example of a grouping followed by an averaging operation in standard R on the popular mtcars dataset:

tapply(mtcars[['mpg']], mtcars[['cyl']], mean)

After looking at this line for some time, you will see that it groups the cars by the number of cylinders in their engines (cyl) and then takes the average of miles per gallon (mpg) for each group. To read this line, you need to remember the argument order of tapply, and writing it means repeating the data frame variable twice. Luckily, there is an alternative! :)
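
For reference, on the built-in mtcars data this returns a named vector with one mean per cylinder count, roughly:

       4        6        8
26.66364 19.74286 15.10000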

Looking at the functions offered by dplyr, a better alternative to the previous line is:

summarize(group_by(mtcars, cyl), mean_mpg = mean(mpg))

There are several things to note here:

  • The data frame variable is only mentioned once
  • The column names are used directly
  • The grouping is handled by the group_by function, which passes its output to the summarize function.

But this isn't exactly intuitive: it has to be read inside out, starting with the group_by, then the summarize.

The key to solving this is the pipe operator %>%, provided through the magrittr package and available as soon as you load dplyr. The simplest way to understand it is directly from the library's documentation:

x %>% f is equivalent to f(x)

x %>% f(y) is equivalent to f(x, y)

x %>% f %>% g %>% h is equivalent to h(g(f(x)))
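
A quick way to convince yourself (head is just a stand-in function here, nothing dplyr-specific):

# these two lines do exactly the same thing: the pipe feeds mtcars
# in as the first argument of head
head(mtcars, 3)
mtcars %>% head(3)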

Given this magic, we can rewrite the grouped-averaging line from before as follows:

mtcars %>% group_by(cyl) %>% summarize(mean_mpg = mean(mpg))

This is more like it. It's easy to read how the data flows: starting from mtcars, which is then grouped by cyl, and then the mean is taken within each group. The reason the %>% operator plays so nicely with dplyr is that the first argument of every dplyr function is the data frame to operate on.
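
Because every verb follows this convention, longer pipelines read top to bottom just as naturally. Here is a small sketch (the hp > 100 filter is arbitrary, purely for illustration):

mtcars %>%
  filter(hp > 100) %>%                  # keep cars with more than 100 horsepower
  group_by(cyl) %>%                     # group what is left by cylinder count
  summarize(mean_mpg = mean(mpg)) %>%   # average mpg within each group
  arrange(mean_mpg)                     # sort the summary by the new column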

A new function

Now, let’s group by a different column, for example the number of gears (gear).

We could do the same and just change the parameter sent to group_by, but then we would have to write the whole line again. This is usually a sign that the code should be pulled into a function. The final call to this new function should look something like this:

mtcars %>% mean_mpg(cyl)
mtcars %>% mean_mpg(gear)

This means we need to create a function with the following signature (remember that the first argument is the data):

mean_mpg = function(data, group_col)

On first impulse, one may write a function like this:

mean_mpg = function(data, group_col) {
  data %>% group_by(group_col) %>% summarize(mean_mpg = mean(mpg))
}

But, when calling this function, we get this error:

mtcars %>% mean_mpg(gear)
Error: unknown column 'group_col'

HUH? Why doesn’t it pass the name of the column???

This brings us to another piece of magic dplyr does, through the lazyeval package. It is what allows you to use column names without quotes, and it is known as non-standard evaluation: group_by does not evaluate group_col to get the value we passed in, it captures the expression itself, so it ends up looking for a column literally named group_col.

Non-standard Evaluation

As an example, our dplyr line can be written as follows in standard evaluation:

mtcars %>% group_by_('cyl') %>% summarize(mean_mpg = mean(mpg))

Note the underscore at the end of group_by_: it means we are using the standard evaluation version of group_by, so we need to pass the column name in quotes.
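
This standard evaluation version is also what you reach for when the column name lives in a plain string, for example one built elsewhere in your code (col_name below is just an illustrative variable):

# the grouping column arrives as a string, e.g. read from a config file
col_name = 'cyl'
mtcars %>% group_by_(col_name) %>% summarize(mean_mpg = mean(mpg))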

Looking at the manual of any dplyr function, you will see there is always a standard evaluation version of the usual functions, suffixed with an underscore. Furthermore, when you print a function like group_by, it looks like this:

> print(group_by)
function (.data, ..., add = FALSE)
{
group_by_(.data, .dots = lazyeval::lazy_dots(...), add = add)
}
<environment: namespace:dplyr>

So how do we use lazyeval to fix our mean_mpg function? The key is the .dots parameter of the standard evaluation dplyr functions (the underscore versions). This is what usually takes in the column names you will work with.

The two functions we can use from lazyeval are (see the short sketch after this list):

  • lazyeval::lazy, which lazily captures a single argument without evaluating it
  • lazyeval::lazy_dots, which does the same for any number of arguments passed through ...
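
To see what these capture, here is a tiny sketch (f and g are throwaway names, just for illustration):

# lazy() grabs the unevaluated expression (plus its environment) instead of
# the value, so nothing complains that cyl doesn't exist at this point
f = function(x) lazyeval::lazy(x)
f(cyl)

# lazy_dots() does the same for any number of arguments passed through ...
g = function(...) lazyeval::lazy_dots(...)
g(cyl, gear)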

Thus our function can be written as:

mean_mpg = function(data, group_col) {
  data %>%
    group_by_(.dots = lazyeval::lazy(group_col)) %>%
    summarize(mean_mpg = mean(mpg))
}

which allows us to run:

mtcars %>% mean_mpg(cyl)
mtcars %>% mean_mpg(gear)

or

mean_mpg = function(data, ...) {
  data %>%
    group_by_(.dots = lazyeval::lazy_dots(...)) %>%
    summarize(mean_mpg = mean(mpg))
}

which allows us to run:

mtcars %>% mean_mpg(cyl)
mtcars %>% mean_mpg(gear)
mtcars %>% mean_mpg(cyl, gear)

... and we just created our own dplyr functions! :)

For more reading into non-standard evaluation, take a look at the following links:

and for more information on dplyr in general check the following talks by the man himself, Hadley Wickham:

For a quick summary of using dplyr, check out this awesome data wrangling cheat sheet by RStudio.

How did this post come to be? Every week at Optima, everyone on the team gets five minutes or so to share a “nugget” of data science, algorithms or related knowledge. The only rule is that it can be explained and grasped in 5 to 10 minutes. Lately we decided to share these nuggets with the world. So here we are.

