# R Function Of The Week: apply() and sapply()

• A brief overview of the `apply()` concept and general use case
• `apply()` vs `sapply()` with examples

This week at Statistics Without Borders, we go over the `apply()` concept and discuss when it is most helpful to use. While programming, you may come across situations that require you to iterate over a set of values and apply some kind of function or calculation to each value. Traditionally, this is accomplished using a loop construct.

## Overview of `apply()`

The `apply()` function can be used in lieu of explicit loops. It will not necessarily run faster than a well-written loop, since it still iterates internally, but it usually makes code shorter and easier to read. `apply()` takes matrix-like structures as input and returns results as a vector, array, or list. Because it handles the iteration and collects the output for you, it eliminates the need to write lengthy loops.

In the example code below, we will perform some simple operations on a built-in R dataset called `Orange`. This dataset records growth of orange trees.

• First, we will load the data set into our environment using the `data()` function.
```r
## Load in data
data('Orange')
head(Orange)
```

Output

```r
#   Tree  age circumference
# 1    1  118            30
# 2    1  484            58
# 3    1  664            87
# 4    1 1004           115
# 5    1 1231           120
# 6    1 1372           142
```

• Next, we will engineer a new feature. This new feature will calculate each tree’s `circumference` value divided by the largest `circumference` value in the data.
• We will first perform this calculation using a `for` loop and then compute the same values using `apply()`.

Let’s write a `for` loop to express each circumference value as a percentage of the largest value in the circumference column:

```r
max_circ <- max(Orange$circumference)  # maximum circumference
pct_circ <- list()  ## create an empty list to store values
for (i in 1:nrow(Orange)) {  # loop over each row
  # Divide each value by the max and multiply by 100
  pct_circ[i] <- (Orange$circumference[i] / max_circ) * 100
}  # End for loop
pct_circ
```

As you can see, we were able to iterate over each row and compute each value as a percentage of the maximum. However, we can do this in one line using the `apply()` function. The general syntax of the function is as follows:

`apply(X, MARGIN, FUN)`

The first argument expects an array or matrix (a data frame is coerced to a matrix). The second argument, `MARGIN`, can be `1`, `2`, or a vector of such indices. It tells the function whether the operation should be applied across each row (`1`), down each column (`2`), or to every individual cell (`c(1, 2)`). The last argument expects the function that is going to be applied. In our case, since there is no built-in function that expresses a given number as a percentage of another number, we will write a user-defined function:

```r
pct_func <- function(x) {
  result <- x / max_circ   # x is a vector/matrix
  result <- result * 100   # convert from decimal to %
  return(result)
}
```
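Before plugging this function in, it may help to see the `MARGIN` argument in isolation. The following is a minimal sketch on a small made-up matrix; the names `m`, `row_totals`, `col_totals`, and `doubled` are illustrative and not part of the original example.

```r
# A small 2 x 3 matrix, made up purely for illustration
m <- matrix(1:6, nrow = 2)
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

row_totals <- apply(m, 1, sum)  # MARGIN = 1: one result per row
col_totals <- apply(m, 2, sum)  # MARGIN = 2: one result per column

# MARGIN = c(1, 2): the function is called on every individual cell,
# so the result keeps the 2 x 3 shape of the input
doubled <- apply(m, c(1, 2), function(x) x * 2)

print(row_totals)  # 9 12
print(col_totals)  # 3 7 11
print(doubled)
```

With `MARGIN = 1` the function sees one row at a time, which is the setting used for the `Orange` example in this post.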

Now that we have our function ready, we just need to specify the correct inputs to the `apply()` function. Remember that it expects the data to be a matrix or array; a data frame works too because it is coerced to a matrix. We will supply the circumference column (column 3) as our input. When indexing column 3 of the `Orange` data frame, we have to specify the additional argument `drop = FALSE` so that the result stays a one-column data frame with dimensions instead of being simplified to a plain vector, which `apply()` would reject.

`apply(Orange[, 3, drop = FALSE], 1, pct_func)`

Output

```r
# output from console
#         1         2         3         4         5         6         7         8         9        10
#  14.01869  27.10280  40.65421  53.73832  56.07477  66.35514  67.75701  15.42056  32.24299  51.86916
#        11        12        13        14        15        16        17        18        19        20
#  72.89720  80.37383  94.85981  94.85981  14.01869  23.83178  35.04673  50.46729  53.73832  64.95327
#        21        22        23        24        25        26        27        28        29        30
#  65.42056  14.95327  28.97196  52.33645  78.03738  83.64486  97.66355 100.00000  14.01869  22.89720
#        31        32        33        34        35
#  37.85047  58.41121  66.35514  81.30841  82.71028
```

As you can see, we were able to achieve the same results in one line. Using `apply()`, we can quickly apply any function to a set of data. In this case we wrote a named, user-defined function, but you can also define an anonymous function (commonly referred to as a “lambda expression”) directly inside the `apply()` call. We will not cover lambda expressions in detail in this article, but this blog post does a good job of walking through how they work in both R and Python.
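For a quick taste only, here is a minimal sketch of the same percentage calculation with the function written inline instead of being defined separately as `pct_func`. The variable name `pct_inline` is an illustrative choice, not something from the original example.

```r
# Largest circumference in the built-in Orange dataset
max_circ <- max(Orange$circumference)

# Same calculation as pct_func, but written as an anonymous function
# directly inside the apply() call
pct_inline <- apply(
  Orange[, 3, drop = FALSE],      # circumference column, kept as a one-column data frame
  1,                              # MARGIN = 1: work row by row
  function(x) x / max_circ * 100  # anonymous function
)

head(pct_inline)
```

This should produce the same values as the `pct_func` version; whether a named or an anonymous function reads better is largely a matter of taste and of how often the function is reused.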

## apply() vs. sapply()

Like `apply()`, `sapply()` applies a function repeatedly, but it accepts more flexible input types and is ideal for vector operations. `sapply()` takes a list, vector, or data frame as input and, where possible, simplifies the result to a vector or matrix with one entry (or one column) per input element. The general syntax of the `sapply()` function is:

`sapply(X, FUN)`

Notice that there is no `MARGIN` argument. This is because `sapply()` applies the function to each element of the input by default: each value of a vector, each item of a list, or each column of a data frame. Due to this default behavior, the `MARGIN` argument is unnecessary.
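A minimal sketch of this element-by-element behavior, using small made-up inputs (the names `squares` and `item_counts` are illustrative only):

```r
# On a vector, the function is called once per value
squares <- sapply(1:5, function(x) x^2)
print(squares)  # 1 4 9 16 25

# On a list, the function is called once per list item;
# the results are simplified to a named vector
item_counts <- sapply(list(a = 1:3, b = 1:10), length)
print(item_counts)
```

In both cases the output has one entry per input element, which is why no margin needs to be specified.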

Following the same example as above, let’s compute circumference as a percentage of the largest circumference in our data set using `sapply()`. Note that since the input does not have to be a matrix with usable dimensions, we can just supply the “circumference” column by name, and we get back a numeric vector of the same length.

```r
sapply(Orange$circumference, pct_func)
# Output:
#  [1]  14.01869  27.10280  40.65421  53.73832  56.07477  66.35514  67.75701  15.42056  32.24299
# [10]  51.86916  72.89720  80.37383  94.85981  94.85981  14.01869  23.83178  35.04673  50.46729
# [19]  53.73832  64.95327  65.42056  14.95327  28.97196  52.33645  78.03738  83.64486  97.66355
# [28] 100.00000  14.01869  22.89720  37.85047  58.41121  66.35514  81.30841  82.71028
```
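As a sanity check, the three approaches from this post can be compared directly. The sketch below restates `max_circ` and `pct_func` so it runs on its own; the variable names `loop_result`, `apply_result`, and `sapply_result` are illustrative, and the loop stores its values in a numeric vector rather than a list to make the comparison simpler.

```r
max_circ <- max(Orange$circumference)
pct_func <- function(x) x / max_circ * 100

# 1. for loop, storing results in a pre-allocated numeric vector
loop_result <- numeric(nrow(Orange))
for (i in seq_len(nrow(Orange))) {
  loop_result[i] <- pct_func(Orange$circumference[i])
}

# 2. apply() over the rows of the one-column data frame
apply_result <- apply(Orange[, 3, drop = FALSE], 1, pct_func)

# 3. sapply() over the column itself
sapply_result <- sapply(Orange$circumference, pct_func)

# unname() drops any names apply() attaches so only the values are compared
same_as_apply  <- isTRUE(all.equal(loop_result, unname(apply_result)))
same_as_sapply <- isTRUE(all.equal(loop_result, sapply_result))
print(c(same_as_apply, same_as_sapply))

is.list(sapply_result)  # FALSE: the result is a plain numeric vector
```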

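If we wanted to, we could also apply `pct_func` to the `age` column in our dataset, like so. Keep in mind that `pct_func` always divides by `max_circ`, so the `age` values below are scaled by the largest circumference rather than by the largest age: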

```r
sapply(Orange[, c('circumference', 'age')], pct_func)
# Output:
#       circumference       age
#  [1,]      14.01869  55.14019
#  [2,]      27.10280 226.16822
#  [3,]      40.65421 310.28037
#  [4,]      53.73832 469.15888
#  [5,]      56.07477 575.23364
#  [6,]      66.35514 641.12150
#  [7,]      67.75701 739.25234
#  [8,]      15.42056  55.14019
#  [9,]      32.24299 226.16822
# [10,]      51.86916 310.28037
# [11,]      72.89720 469.15888
# [12,]      80.37383 575.23364
# [13,]      94.85981 641.12150
# [14,]      94.85981 739.25234
# [15,]      14.01869  55.14019
# [16,]      23.83178 226.16822
# [17,]      35.04673 310.28037
# [18,]      50.46729 469.15888
# [19,]      53.73832 575.23364
# [20,]      64.95327 641.12150
# [21,]      65.42056 739.25234
# [22,]      14.95327  55.14019
# [23,]      28.97196 226.16822
# [24,]      52.33645 310.28037
# [25,]      78.03738 469.15888
# [26,]      83.64486 575.23364
# [27,]      97.66355 641.12150
# [28,]     100.00000 739.25234
# [29,]      14.01869  55.14019
# [30,]      22.89720 226.16822
# [31,]      37.85047 310.28037
# [32,]      58.41121 469.15888
# [33,]      66.35514 575.23364
# [34,]      81.30841 641.12150
# [35,]      82.71028 739.25234
```

As a result, we get a matrix with one row per row of our input data frame and one column per input column. Super convenient! I tend to use this function daily to perform data manipulations. `sapply()` and `apply()` are two of the functions in the larger apply family of functions. I find that these tend to cover most of the use cases I come across on a day-to-day basis, but if you are interested in learning more, this post covers the entire apply family.
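If the goal is for each column to be expressed as a percentage of its own maximum, one option is to compute the maximum inside the function instead of relying on a fixed value such as `max_circ`. The following is a minimal sketch of that variation, not part of the original example; the names `cols` and `pct_of_own_max` are illustrative.

```r
# Select the two numeric columns of the built-in Orange dataset
cols <- Orange[, c("circumference", "age")]

# sapply() passes one whole column at a time to the function,
# so max(col) is the maximum of that particular column
pct_of_own_max <- sapply(cols, function(col) col / max(col) * 100)

is.matrix(pct_of_own_max)  # TRUE: the result is a matrix, not a data frame
dim(pct_of_own_max)        # 35 rows, 2 columns
head(pct_of_own_max)
```

If a data frame is needed downstream, the matrix can be wrapped in `as.data.frame()`.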

# Meet the Author

Neha Anwer

Neha is a data science professional with domain experience in the financial advisory field. She has been a volunteer at Statistics Without Borders for over a year and most recently joined the SWB Marketing & Communications team. In her spare time, Neha enjoys traveling, curling up with a fiction read, and spending time outdoors.

--

Statistics Without Borders (SWB) is an apolitical pro bono organization under the auspices of the American Statistical Association.