Group Manipulation In R — 3

Vivekanandan Srinivasan
Analytics Vidhya
Published in
6 min readOct 3, 2019

If you have not read the part 2 of R data analysis series kindly go through the following article where we discussed about Statistical Visualization In R — 2.

The contents in the article are gist from couple of books which I got introduced during my IIM-B days.

R for Everyone — Jared P. Lander

Practical Data Science with R — Nina Zumel & John Mount

All the code blocks discussed in the article are present in form of R markdown in the Github link.

A general rule of thumb for data analysis is that manipulating the data or data munging consumes 80 % of the effort. This often requires repeated operations on different sections of the data — split-apply-combine. That is, we split the data into discrete sections based on some metric, apply a transformation of some kind to each section, and then combine all the sections together. There are many ways to iterate over data in R, and we will see some of the most convenient methods of doing it.

Apply Family

R has built-in apply function and all of its relatives such as tapply, lapply, sapply and mapply. Let’s see how each function has its own usage while manipulating the data.

apply

apply is the first member of this family that users usually learn and it is also the most restrictive in nature. It must be used on the matrix, meaning all of the elements must be of the same type whether they are character, numeric or logical. If used on some other object, such as data.frame, it will be converted to a matrix first.

The first argument to apply is the object we are working with. The second argument is the margin to apply the function over, with 1 meaning to operate over the rows and 2 meaning operating over the columns. The third argument is the function we want to apply. Any following argument will be passed on to the function.

To illustrate its use we start with a trivial example, summing the rows or columns of a matrix. Notice that this could alternatively be accomplished using the built-in rowSums and colSums, yielding the same results.

theMatrix <- matrix(1:9, nrow=3)
apply(theMatrix,1,sum) ## Row Sum
apply(theMatrix,2,sum) ## Column Sum

Similar to most of the R functions where we have an argument na.rmto handle missing values NA in the matrix or any other data type. Let’s add some NA to the theMatrix.

theMatrix[2,1] <- NA
apply(theMatrix,1,sum)

By adding na.rm argument to the apply function, it will ignore the missing values and calculate the sum over rows and columns.

apply(theMatrix,1,sum,na.rm=TRUE)

lapply and sapply

lapply works similar to apply but it applies the function to each element of the list and returning the results as list as well.

theList <- list(A=matrix(1:9,3), B=1:5,C=matrix(1:4,2), D=2)
lapply(theList,sum)

Dealing with lists feels a bit cumbersome sometimes, so to return the result as vector instead, sapply can be put into use in the same way as lapply. And a vector is technically a form of list, so lapply and sapply can also take vector as their input

sapply(theList,sum)## Counting no of characters in each word
theNames <- c("Jared","Deb","Paul")
sapply(theNames,nchar)

mapply

Perhaps the most overlooked but so useful member of the apply family is mapply , which applies a function to each element of multiple lists. Often when confronted with this scenario, people will resort to using a loop, which is certainly not necessary. Let’s build two lists to understand the usage of the mapply with an example. We use built-in identical function in R to see whether two lists are identical by comparing element-to-element.

## build two lists
firstList <- list(A=matrix(1:16,4),B=matrix(1:16,2),c(1:5))
secondList <- list(A=matrix(1:16,4),B=matrix(1:16,8),c(15:1))
## test element by element if they are identical
mapply(identical,firstList,secondList)

mapply can also take user-defined function in place of built-in function in R. Let’s build a simple function that adds the number of rows of each corresponding element in a lists.

simpleFunc <- function(x,y) {
NROW(x) + NROW(y)
}
mapply(simpleFunc,firstList,secondList)

There are many other members of the apply family that either do not get used much or have been superseded by functions in the plyr family. They include

  • tapply
  • rapply
  • eapply
  • vapply
  • by

aggregate

People who got used to SQL terminology generally wants to run a group by and aggregation as their first R task. The way to do this is to use the aptly named aggregate function. We have multiple ways to call, aggregate and we will see the most convenient ways of calling it using formula notation.

formulas consist of a left side and right side separated by a tilde (~). The usage of formula methodology is similar to how we created graphics using ggplot2 in our previous article. The left side represents the variable that we want to make a calculation on and the right side represents one or more variables that we want to group the calculation by. To demonstrate the usage of aggregate we once resort to diamonds data in ggplot2.

require(ggplot2)
data(diamonds)
head(diamonds)

As a first example, we will calculate the average price for each type of cut in the diamonds data. The first argument aggregate is the formula specifying that the price should be broken by cut. The second argument is the data to use, in this case, diamonds. The third argument is the function to apply to each subset of the data.

aggregate(price~cut, diamonds,mean)

Notice that we only specified the column name and did not have to identify the data because that is given in the second argument. After the third argument specifying the function, additional named arguments to that function can be passed as follows.

aggregate(price~cut, diamonds,mean,na.rm=TRUE)

To group data by more than one variable, add the additional variable to the right side of the formula separating it with a plus sign(+).

aggregate(price~cut + color, diamonds,mean)

To aggregate two variables, they must be combined using cbind on the left side of the formula.

aggregate(cbind(price,carat)~ cut + color,diamonds,mean)

It is important to note from the above example only one function can be supplied, and hence applied to the variables. To apply more than one function, it easier to use the dplyr or data.table packages which extend and enhances the capability of data.frames.

Aggregating data is a very important step in the analysis process. Sometimes it is the end goal and other times it is the preparation for applying more advanced methods. In this article we have seen common methodologies to perform group manipulation in R. In our next article we will compare and contrast advanced group manipulation techniques using two versatile packages dplyr and data.table.

Advanced-Data Wrangling in R — 4

Do share your thoughts and support by commenting and sharing the article among your peer groups.

--

--

Vivekanandan Srinivasan
Analytics Vidhya

An analytics professional with over six years of experience spanning across predictive modelling, statistical analysis and big data technologies.