R Function Of The Week: apply() and sapply()

Statistics Without Borders
Apr 13, 2022


Author: Neha Anwer, Statistics Without Borders

This Article Contains

  • A brief overview of the apply() concept and general use case
  • apply() vs sapply() with examples

This week at Statistics Without Borders, we go over the apply() concept and discuss when it is most helpful to use. While programming, you may come across situations that require you to iterate over a set of values and apply some kind of function or calculation to each value. Traditionally, this is accomplished using a loop construct.

Overview of apply()

The apply() function can be used in lieu of explicit loops and often makes your code shorter and easier to read. It is not necessarily faster than a loop, since it still iterates internally. apply() takes matrix-like structures as input and returns a vector, array, or list. Because it accepts multi-dimensional inputs and produces similarly structured outputs, it eliminates the need to write lengthy loops.

In the example code below, we will perform some simple operations on a built-in R dataset called Orange. This dataset records growth of orange trees.

  • First, we will load the data set into our environment using the data command.
## Load in data
data('Orange')
head(Orange)

Output

#   Tree  age circumference
# 1    1  118            30
# 2    1  484            58
# 3    1  664            87
# 4    1 1004           115
# 5    1 1231           120
# 6    1 1372           142
  • Next, we will engineer a new feature. This new feature will calculate each tree’s circumference value divided by the largest circumference value in the data.
  • We will first perform this calculation using a for loop and then compute the same values using apply().

Let’s write a for loop to express each circumference value as a percentage of the largest value in the circumference column:

max_circ <- max(Orange$circumference)  # maximum circumference
pct_circ <- numeric(nrow(Orange))      # pre-allocate a numeric vector to store values
for (i in seq_len(nrow(Orange))) {     # loop over each row
  # Divide each value by the max and multiply by 100
  pct_circ[i] <- (Orange$circumference[i] / max_circ) * 100
} # End for loop
pct_circ
[Screenshot: for loop results in the R console]

As you can see, we were able to iterate over each row and compute each value as a percentage of the maximum. However, we can do this in one line using the apply() function. The general syntax of the function is as follows:

apply(X, MARGIN, FUN)

The first argument expects an array or matrix (a data frame is coerced to a matrix). The second argument, the margin, can be 1, 2, or a vector such as c(1, 2). It tells the function whether the operation should be applied to each row (1), to each column (2), or to each individual cell (c(1, 2)). The last argument is the function to be applied. In our case, since there is no built-in function that expresses a number as a percentage of another number, we will write a user-defined function:

pct_func <- function(x) {
  result <- x / max_circ  # x is a vector/matrix
  result <- result * 100  # convert from decimal to %
  return(result)
}
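To make the margin argument described above more concrete, here is a minimal sketch on a tiny matrix. The matrix m and the variable names are made up for illustration and are not part of the Orange example.

```r
# A small 2 x 3 matrix used only for illustration (not from the Orange data)
m <- matrix(1:6, nrow = 2)
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

row_sums <- apply(m, 1, sum)                       # margin 1: one result per row
col_sums <- apply(m, 2, sum)                       # margin 2: one result per column
doubled  <- apply(m, c(1, 2), function(v) v * 2)   # margin c(1, 2): each cell in turn

row_sums  # 9 12
col_sums  # 3 7 11
doubled   # same 2 x 3 shape as m, with every value doubled
```

For simple row and column sums, base R also provides rowSums() and colSums(); apply() is shown here only because it generalizes to any function.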

Now that we have our function ready, we just need to specify the correct inputs to the apply() function. Remember that it expects its first argument to be a matrix or something that can be coerced to one, such as a data frame. We will supply the circumference column (column 3) as our input. When indexing column 3 of the Orange data frame, we have to specify the additional argument drop = FALSE so that the result keeps its two dimensions instead of collapsing to a plain vector.

apply(Orange[, 3, drop = FALSE], 1, pct_func)

Output

# output from console
#         1         2         3         4         5         6         7         8         9        10
#  14.01869  27.10280  40.65421  53.73832  56.07477  66.35514  67.75701  15.42056  32.24299  51.86916
#        11        12        13        14        15        16        17        18        19        20
#  72.89720  80.37383  94.85981  94.85981  14.01869  23.83178  35.04673  50.46729  53.73832  64.95327
#        21        22        23        24        25        26        27        28        29        30
#  65.42056  14.95327  28.97196  52.33645  78.03738  83.64486  97.66355 100.00000  14.01869  22.89720
#        31        32        33        34        35
#  37.85047  58.41121  66.35514  81.30841  82.71028

As you can see, we were able to achieve the same results in one line. Using apply(), we can quickly apply any function to a set of data. In this case we defined our function separately, but you can also write the function inline as an anonymous function (commonly referred to as a “lambda expression”) directly inside the apply() call. While we will not be discussing lambda expressions in detail within this article, this blog post does a good job of walking through how they work in both R and Python.
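For readers curious what the anonymous-function form looks like, here is a minimal sketch under the same setup as above. The names pct_inline and pct_vectorized are chosen here for illustration. It also compares the result against plain vectorized arithmetic, which base R supports for this kind of calculation.

```r
data("Orange")
max_circ <- max(Orange$circumference)

# Same calculation as pct_func, written inline as an anonymous function
pct_inline <- apply(Orange[, 3, drop = FALSE], 1,
                    function(x) x / max_circ * 100)

# Arithmetic on vectors is element-wise in R, so no loop is strictly needed here
pct_vectorized <- Orange$circumference / max_circ * 100

# Compare the two results, ignoring the row names that apply() attaches
all.equal(unname(pct_inline), pct_vectorized)
```

For arithmetic this simple, the vectorized form is usually the most direct choice; apply() becomes more useful when the function has to work on a whole row or column at once.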

apply() vs. sapply()

Building on the same idea as apply(), sapply() accepts more flexible input types and is ideal for vector operations. sapply() takes a list, vector, or data frame as input and, where possible, simplifies the result to a vector or matrix (otherwise it returns a list). The general syntax of the sapply() function is:

sapply(X, FUN)

Notice that there is no margin argument. This is because the sapply() function will apply the function to each element of the data by default. Due to this default behavior, the margin argument is unnecessary.
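What counts as an "element" depends on the input type. The sketch below uses small made-up inputs (not the Orange data) to show this for a vector, a list, and a data frame. The variable names are hypothetical.

```r
# Elements of a vector are its individual values
roots <- sapply(c(1, 4, 9), sqrt)              # 1 2 3

# Elements of a list are its components
totals <- sapply(list(a = 1:3, b = 4:6), sum)  # a = 6, b = 15

# A data frame is a list of columns, so the function receives one column at a time
df <- data.frame(x = 1:3, y = c(10, 20, 30))   # hypothetical toy data
col_max <- sapply(df, max)                     # x = 3, y = 30

roots
totals
col_max
```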

Following the same example as above, let’s compute circumference as a percentage of the largest circumference in our data set using sapply(). Note that since the input does not have to be a matrix with usable dimensions, we can just supply the “circumference” column by name, and we get back a numeric vector of the same length.

sapply(Orange$circumference, pct_func)

Output

#  [1]  14.01869  27.10280  40.65421  53.73832  56.07477  66.35514  67.75701  15.42056  32.24299
# [10]  51.86916  72.89720  80.37383  94.85981  94.85981  14.01869  23.83178  35.04673  50.46729
# [19]  53.73832  64.95327  65.42056  14.95327  28.97196  52.33645  78.03738  83.64486  97.66355
# [28] 100.00000  14.01869  22.89720  37.85047  58.41121  66.35514  81.30841  82.71028

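If we wanted to apply pct_func to the age column in our dataset as well, we could do so like this. Note that pct_func still divides by max_circ, the largest circumference, so the age values in this output are scaled by that number (which is why they exceed 100) and serve only to illustrate the mechanics: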

sapply(Orange[, c('circumference', 'age')], pct_func)

Output

# circumference age
# [1,] 14.01869 55.14019
# [2,] 27.10280 226.16822
# [3,] 40.65421 310.28037
# [4,] 53.73832 469.15888
# [5,] 56.07477 575.23364
# [6,] 66.35514 641.12150
# [7,] 67.75701 739.25234
# [8,] 15.42056 55.14019
# [9,] 32.24299 226.16822
# [10,] 51.86916 310.28037
# [11,] 72.89720 469.15888
# [12,] 80.37383 575.23364
# [13,] 94.85981 641.12150
# [14,] 94.85981 739.25234
# [15,] 14.01869 55.14019
# [16,] 23.83178 226.16822
# [17,] 35.04673 310.28037
# [18,] 50.46729 469.15888
# [19,] 53.73832 575.23364
# [20,] 64.95327 641.12150
# [21,] 65.42056 739.25234
# [22,] 14.95327 55.14019
# [23,] 28.97196 226.16822
# [24,] 52.33645 310.28037
# [25,] 78.03738 469.15888
# [26,] 83.64486 575.23364
# [27,] 97.66355 641.12150
# [28,] 100.00000 739.25234
# [29,] 14.01869 55.14019
# [30,] 22.89720 226.16822
# [31,] 37.85047 310.28037
# [32,] 58.41121 469.15888
# [33,] 66.35514 575.23364
# [34,] 81.30841 641.12150
# [35,] 82.71028 739.25234

As a result, we get a matrix with the same number of rows as our input data frame and one column per input column. Super convenient! I tend to use this function on the daily to perform data manipulations. sapply() and apply() are two of the functions in the larger apply family of functions. I find that these tend to cover most of the use cases I come across on a day-to-day basis, but if you are interested in learning more, this post covers the entire apply family.
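As a follow-up sketch, the example below checks the type of object that sapply() returns for a multi-column input and converts it back to a data frame. It uses a hypothetical helper, pct_of_own_max, that scales each column by its own maximum rather than by max_circ, so that the age percentages are also meaningful. That helper is an assumption added here and is not part of the original example.

```r
data("Orange")

# Hypothetical helper: express each value as a percentage of its own column's maximum
pct_of_own_max <- function(x) x / max(x) * 100

res <- sapply(Orange[, c("circumference", "age")], pct_of_own_max)

is.matrix(res)   # sapply() simplified the per-column results into a matrix
dim(res)         # 35 rows, 2 columns

# Convert back to a data frame if one is needed downstream
res_df <- as.data.frame(res)
head(res_df, 3)
```

Wrapping the call in as.data.frame() is one simple way to get a data frame back; lapply() combined with as.data.frame() would be another option if you want to avoid the matrix step.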


Meet the Author


Neha Anwer

Neha is a data science professional with domain experience in the financial advisory field. She has been a volunteer at Statistics Without Borders for over a year and most recently joined the SWB Marketing & Communications team. In her spare time, Neha enjoys traveling, curling up with a fiction read, and spending time outdoors.

Statistics Without Borders

Statistics Without Borders (SWB) is an apolitical pro bono organization under the auspices of the American Statistical Association.