# R Function Of The Week: apply() and sapply()

Author: Neha Anwer, Statistics Without Borders

## This Article Contains

- A brief overview of the `apply()` concept and general use case
- `apply()` vs `sapply()` with examples

This week at Statistics Without Borders, we go over the `apply()` concept and discuss when it is most helpful to use. While programming, you may come across situations that require you to iterate over a set of values and *apply* some kind of function or calculation to each value. Traditionally, this is accomplished using a loop construct.

## Overview of `apply()`

The `apply()` function can be used in lieu of explicit loops and can make your code considerably more compact. `apply()` takes matrix-like structures as input and returns its results as a vector, array, or list. Because it accepts multi-dimensional inputs and produces similarly structured output, using this function eliminates the need to write lengthy loops.
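To make the idea concrete, here is a minimal sketch on a small made-up matrix (`m` is a hypothetical example, not part of the data used in this article). It computes row totals once with a `for` loop and once with `apply()`; the `1` passed as the second argument asks `apply()` to work row by row.

```r
# Hypothetical 2 x 3 matrix used only for illustration
m <- matrix(1:6, nrow = 2)
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

# Loop version: pre-allocate a result vector, then fill it row by row
row_totals_loop <- numeric(nrow(m))
for (i in seq_len(nrow(m))) {
  row_totals_loop[i] <- sum(m[i, ])
}

# apply() version: the same calculation in a single call
row_totals_apply <- apply(m, 1, sum)

row_totals_loop   # 9 12
row_totals_apply  # 9 12
```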

In the example code below, we will perform some simple operations on a built-in R dataset called `Orange`. This dataset records the growth of orange trees.

- First, we will load the data set into our environment using the `data` command.

```r
## Load in data
data('Orange')
head(Orange)
```

**Output**

```
#   Tree  age circumference
# 1    1  118            30
# 2    1  484            58
# 3    1  664            87
# 4    1 1004           115
# 5    1 1231           120
# 6    1 1372           142
```

- Next, we will engineer a new feature. This new feature will calculate each tree’s `circumference` value divided by the largest `circumference` value in the data.
- We will first perform this calculation using a `for` loop and then compute the same values using `apply()`.

Let’s write a `for` loop to express each circumference value as a percentage of the largest value in the circumference column:

```r
max_circ <- max(Orange$circumference)  # maximum circumference
pct_circ <- list()                     # create an empty list to store values

for (i in 1:nrow(Orange)) {  # loop over each row
  # Divide each value by the max and multiply by 100
  pct_circ[i] <- (Orange$circumference[i] / max_circ) * 100
}  # End for loop

pct_circ
```

As you can see, we were able to iterate over each row and express each circumference as a percentage of the maximum. However, we can do this in one line using the `apply()` function. The general syntax of the function is as follows:

`apply(data, Margin, Function)`

The first argument (`X` in R's documentation) expects an array or matrix. The second argument (`MARGIN`) can be `1`, `2`, or a vector of indices such as `c(1, 2)`. This argument tells the function whether the operation should be applied across rows (`1`), down columns (`2`), or to each individual cell (`c(1, 2)`). The last argument (`FUN`) expects the function that is going to be applied. In our case, since there is no built-in function that computes a given number as a percentage of another number, we will write our own function:

```r
pct_func <- function(x) {
  result <- x / max_circ   # x is a vector/matrix
  result <- result * 100   # convert from decimal to %
  return(result)
}
```
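Before applying `pct_func`, it may help to see how the margin argument changes the result. The sketch below uses a small made-up matrix (`m` is hypothetical, not taken from the `Orange` data) together with the built-in `sum()` and `sqrt()` functions.

```r
# Hypothetical 2 x 3 matrix used only for illustration
m <- matrix(c(1, 4, 9, 16, 25, 36), nrow = 2)
#      [,1] [,2] [,3]
# [1,]    1    9   25
# [2,]    4   16   36

by_row  <- apply(m, 1, sum)         # one result per row:    35 56
by_col  <- apply(m, 2, sum)         # one result per column:  5 25 61
by_cell <- apply(m, c(1, 2), sqrt)  # one result per cell, same shape as m

dim(by_cell)                        # 2 3
```

With a margin of `1` or `2` the output has one entry per row or per column, while `c(1, 2)` visits every cell and keeps the original dimensions.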

Now that we have our function ready, we just need to specify the correct inputs to the `apply()` function. Remember that it expects an array or matrix; a data frame also works because `apply()` coerces it to a matrix. We will supply the circumference column (column 3) as our input. When indexing column 3 of the `Orange` data frame, we have to specify an additional argument, `drop = FALSE` (often abbreviated to `drop=F`), so that the result keeps its two dimensions instead of collapsing to a plain vector.
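As a quick check of what `drop` changes, the sketch below compares the two ways of selecting column 3. The variable names `as_vector` and `as_column` are hypothetical and used only for illustration.

```r
data("Orange")

# Default behaviour: selecting a single column collapses it to a plain vector
as_vector <- Orange[, 3]
dim(as_vector)   # NULL, so apply() has no rows or columns to work with

# With drop = FALSE the result is still a data frame with one column
as_column <- Orange[, 3, drop = FALSE]
dim(as_column)   # 35 1
```

The one-column form is the one that can be handed to `apply()`.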

```r
apply(Orange[, 3, drop = FALSE], 1, pct_func)
```

**Output**

```
# output from console
#         1         2         3         4         5         6         7         8         9        10
#  14.01869  27.10280  40.65421  53.73832  56.07477  66.35514  67.75701  15.42056  32.24299  51.86916
#        11        12        13        14        15        16        17        18        19        20
#  72.89720  80.37383  94.85981  94.85981  14.01869  23.83178  35.04673  50.46729  53.73832  64.95327
#        21        22        23        24        25        26        27        28        29        30
#  65.42056  14.95327  28.97196  52.33645  78.03738  83.64486  97.66355 100.00000  14.01869  22.89720
#        31        32        33        34        35
#  37.85047  58.41121  66.35514  81.30841  82.71028
```

As you can see, we were able to achieve the same results in one line. Using the `apply()` function, we can quickly *apply* *any function* to a set of data. In this case we wrote a named, user-defined function, but `apply()` also accepts an anonymous function (commonly referred to as a “lambda expression”), which lets users define the function inside the `apply()` call itself. While we will not be discussing lambda expressions within this article, this blog post does a good job of walking through how they work in both R and Python.

## apply() vs. sapply()

A close relative of the `apply()` function, `sapply()` takes in more flexible input types and is ideal for vector operations. `sapply()` takes a list, vector, or data frame as input and returns a vector or matrix whenever the individual results can be simplified into one (otherwise it returns a list). The general syntax of the `sapply()` function is:

`sapply(data, function)`

Notice that there is no `margin` argument. This is because the `sapply()` function applies the function to each *element* of the data by default: each value of a vector, each item of a list, or each column of a data frame. Due to this default behavior, the `margin` argument is unnecessary.
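The element-by-element behavior can be seen in a short sketch. The helper `square` and the small inputs below are hypothetical and exist only for illustration.

```r
# Hypothetical helper used only for illustration
square <- function(x) x^2

# Vector input: the function is called once per value
from_vector <- sapply(c(1, 2, 3), square)
from_vector   # 1 4 9

# List input: one call per list item, simplified to a named vector
from_list <- sapply(list(a = 1:3, b = 4:6), sum)
from_list     # a = 6, b = 15

# Data frame input: each "element" is a whole column
from_df <- sapply(data.frame(x = c(1, 2), y = c(10, 20)), max)
from_df       # x = 2, y = 20
```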

Following the same example as above, let’s compute circumference as a percentage of the largest circumference in our data set using `sapply()`. Note that since the input does not *have* to be a matrix with usable dimensions, we can just supply the `circumference` column by name and will get a numeric vector of the same length as an output.

```r
sapply(Orange$circumference, pct_func)

# Output:
#  [1]  14.01869  27.10280  40.65421  53.73832  56.07477  66.35514  67.75701  15.42056  32.24299
# [10]  51.86916  72.89720  80.37383  94.85981  94.85981  14.01869  23.83178  35.04673  50.46729
# [19]  53.73832  64.95327  65.42056  14.95327  28.97196  52.33645  78.03738  83.64486  97.66355
# [28] 100.00000  14.01869  22.89720  37.85047  58.41121  66.35514  81.30841  82.71028
```

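If we wanted to apply `pct_func` to the `age` column in our dataset as well, we could do so as shown here. Note that `pct_func` still divides by `max_circ`, so the `age` values are expressed relative to the largest circumference, which is why they exceed 100: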

```r
sapply(Orange[, c('circumference', 'age')], pct_func)

# Output:
#       circumference       age
#  [1,]      14.01869  55.14019
#  [2,]      27.10280 226.16822
#  [3,]      40.65421 310.28037
#  [4,]      53.73832 469.15888
#  [5,]      56.07477 575.23364
#  [6,]      66.35514 641.12150
#  [7,]      67.75701 739.25234
#  [8,]      15.42056  55.14019
#  [9,]      32.24299 226.16822
# [10,]      51.86916 310.28037
# [11,]      72.89720 469.15888
# [12,]      80.37383 575.23364
# [13,]      94.85981 641.12150
# [14,]      94.85981 739.25234
# [15,]      14.01869  55.14019
# [16,]      23.83178 226.16822
# [17,]      35.04673 310.28037
# [18,]      50.46729 469.15888
# [19,]      53.73832 575.23364
# [20,]      64.95327 641.12150
# [21,]      65.42056 739.25234
# [22,]      14.95327  55.14019
# [23,]      28.97196 226.16822
# [24,]      52.33645 310.28037
# [25,]      78.03738 469.15888
# [26,]      83.64486 575.23364
# [27,]      97.66355 641.12150
# [28,]     100.00000 739.25234
# [29,]      14.01869  55.14019
# [30,]      22.89720 226.16822
# [31,]      37.85047 310.28037
# [32,]      58.41121 469.15888
# [33,]      66.35514 575.23364
# [34,]      81.30841 641.12150
# [35,]      82.71028 739.25234
```
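The call above scales both columns by `max_circ`. If the goal were instead to express each column relative to its own maximum, one possible sketch is the variant below. `pct_of_own_max` and `pct_both` are hypothetical names introduced here, not part of the original example.

```r
data("Orange")

# Hypothetical helper: scale a vector by its own maximum instead of max_circ
pct_of_own_max <- function(x) {
  (x / max(x)) * 100
}

# sapply() hands each column to the helper as a whole vector
pct_both <- sapply(Orange[, c("circumference", "age")], pct_of_own_max)

head(pct_both, 3)         # first few rows of the resulting matrix
apply(pct_both, 2, max)   # each column now tops out at 100
```

Because `sapply()` passes an entire column at a time, `max(x)` inside the helper is the maximum of that column only.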

The result of `sapply()` here is a matrix with one row per input row and one column per selected column. Super convenient! I tend to use this function daily to perform data manipulations. `sapply()` and `apply()` are two of the functions in the larger apply family of functions. I find that these tend to cover most of the use cases I come across on a day-to-day basis, but if you are interested in learning more, this post covers the entire apply family.

**Want to learn more about Statistics Without Borders?** Follow us on Twitter and LinkedIn, and check out our website. **Want to volunteer on projects or contribute to this blog?** Send us an email at statisticswithoutborders@gmail.com.

# Meet the Author

Neha is a data science professional with domain experience in the financial advisory field. She has been a volunteer at Statistics Without Borders for over a year and most recently joined the SWB Marketing & Communications team. In her spare time, Neha enjoys traveling, curling up with a fiction read, and spending time outdoors.