Advanced-Data Structures In R — 2

Vivekanandan Srinivasan
Published in Analytics Vidhya
8 min read · Nov 26, 2019

If you have not read part 1 of the R basics series, kindly go through the following article, Getting Started With R — 1. The series covers the fundamentals of R, including data types, control structures, loops, functions, and advanced data structures.

If you are already familiar with these topics and are looking for a comprehensive introduction to the important topics in statistics and machine learning using R, kindly start with the following series, which discusses all the necessary topics related to data science.

Many Ways of Reading Data Into R — 1

The contents of this article are the gist of a couple of books that I was introduced to during my IIM-B days.

R for Everyone — Jared P. Lander

Practical Data Science with R — Nina Zumel & John Mount

All the code blocks discussed in the article are available as R Markdown at the GitHub link.

To see all the articles written by me kindly use the link, Vivek Srinivasan.

Sometimes data require more complex storage than simple vectors and thankfully R provides a host of data structures. The most common are the data.frame, matrix and list, followed by the array. Of these, the data.frame will be most familiar to anyone who has used a spreadsheet, the matrix to people familiar with matrix math and the list to programmers.

data.frames

Perhaps one of the most useful features of R is the data.frame. It is one of the most often cited reasons for R’s ease of use.

On the surface a data.frame is just like an Excel spreadsheet in that it has columns and rows. In statistical terms, each column is a variable and each row is an observation. In terms of how R organizes data.frames, each column is actually a vector, each of which has the same length. That is very important because it lets each column hold different types of data. This also implies that within a column each element must be of the same type, just like with vectors.

There are numerous ways to construct a data.frame; the simplest is to use the data.frame function. Let’s create a basic data.frame using some of the vectors we have already introduced, namely x, y and q.

x <- 10:1
y <- -4:5
q <- c("Hockey", "Football", "Baseball", "Curling", "Rugby", "Lacrosse", "Basketball", "Tennis", "Cricket", "Soccer")
theDF <- data.frame(x, y, q)
theDF

This creates a 10x3 data.frame consisting of those three vectors. Notice the names of theDF are simply the variables. We could have assigned names during the creation process, which is generally a good idea.

theDF <- data.frame(First=x, Second=y, Sport=q)
theDF

data.frames are complex objects with many attributes. The most frequently checked attributes are the number of rows and columns. Of course, there are functions to do this for us: nrow and ncol. And in case both are wanted at the same time, there is the dim function.

nrow(theDF)
ncol(theDF)
dim(theDF)

Checking the column names of a data.frame is as simple as using the names function. This returns a character vector listing the columns. Since it is a vector, we can access individual elements of it just like any other vector. We can also check and assign the row names of a data.frame using the rownames function.

names(theDF)
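Because names returns an ordinary character vector, single names can be read by position, and rownames can be used in the same way for rows. The following is a minimal sketch; the small data.frame df is a hypothetical stand-in for theDF, and the row labels are made up for illustration.

```r
# Small illustrative data.frame (hypothetical; stands in for theDF)
df <- data.frame(First = 3:1, Second = c(-1, 0, 1),
                 Sport = c("Hockey", "Curling", "Rugby"))

# names() returns a character vector, so it can be indexed
names(df)[3]

# assign custom row names, then read them back
rownames(df) <- c("One", "Two", "Three")
rownames(df)

# setting the row names to NULL restores the default numbering
rownames(df) <- NULL
rownames(df)
```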

Usually, a data.frame has far too many rows to print them all to the screen, so thankfully the head function prints out only the first few rows. Similarly, the tail function prints the last few rows.

head(theDF)
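Both head and tail accept an n argument that controls how many rows are returned. A short sketch, assuming a hypothetical 20-row data.frame named df:

```r
# Hypothetical 20-row data.frame used only for illustration
df <- data.frame(id = 1:20, value = seq(0.5, 10, by = 0.5))

head(df)           # first 6 rows (the default)
first3 <- head(df, n = 3)   # first 3 rows
tail(df)           # last 6 rows
last2 <- tail(df, n = 2)    # last 2 rows
last2
```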

Since each column of the data.frame is an individual vector, it can be accessed individually and each has its own class. Like many other aspects of R, there are multiple ways to access an individual column. There is the $ operator and also the square brackets. Running theDF$Sport will give the third column in theDF. That allows us to specify one particular column by name.

theDF$Sport
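The square-bracket forms mentioned above can be sketched as follows. The data.frame df is a hypothetical, smaller version of theDF; note that double brackets return the column itself, while single brackets with one index return a one-column data.frame.

```r
# Hypothetical data.frame used only for illustration
df <- data.frame(First = 3:1, Sport = c("Hockey", "Curling", "Rugby"))

df$Sport           # column by name with $
df[["Sport"]]      # same column with double square brackets
df["Sport"]        # single brackets: a one-column data.frame

# each column is a vector with its own class
class(df$First)    # "integer"
sapply(df, class)  # class of every column at once
```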

Similar to vectors, data.frames allow us to access individual elements by their position using square brackets, but instead of having one position, two are specified. The first is the row number and the second is the column number. So to get the third row from the second column we use theDF[3, 2].

theDF[3, 2]

To specify more than one row or column, use a vector of indices. The following accesses rows 3 and 5 of columns 2 through 3.

theDF[c(3, 5), 2:3]

To access an entire row, specify that row while not specifying any column. Likewise, to access an entire column, specify that column while not specifying any row.

theDF[2, ]
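The column case works the same way with the row position left empty. A minimal sketch on a hypothetical data.frame df; the drop = FALSE argument is shown as one way to keep the data.frame structure when only a single column is selected.

```r
# Hypothetical data.frame used only for illustration
df <- data.frame(First = 3:1, Second = c(-1, 0, 1),
                 Sport = c("Hockey", "Curling", "Rugby"))

df[2, ]                     # entire second row (a one-row data.frame)
df[, 2]                     # entire second column (a plain vector)
df[, c("First", "Sport")]   # several columns selected by name
df[, 2, drop = FALSE]       # one column, kept as a data.frame
```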

We see that factors are stored in a special way. To see how they would be represented in data.frame form, use model.matrix to create a set of indicator (or dummy) variables: one column for each level of a factor, with a 1 if a row contains that level and a 0 otherwise. Note that since R 4.0.0, data.frame stores a character vector such as q as a character column rather than a factor unless stringsAsFactors = TRUE is passed; model.matrix converts such a column to a factor either way.

model.matrix(~ theDF$Sport - 1)
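With ten sports the indicator matrix is wide, so the idea is easier to see on a tiny example. The sketch below uses a hypothetical three-element factor named sport; the "- 1" in the formula drops the intercept so that every level gets its own column.

```r
# Hypothetical factor with two levels
sport <- factor(c("Hockey", "Lacrosse", "Hockey"))

levels(sport)       # "Hockey" "Lacrosse"
as.integer(sport)   # 1 2 1: the integer codes stored behind the labels

# one indicator column per level
mm <- model.matrix(~ sport - 1)
mm
colnames(mm)        # "sportHockey" "sportLacrosse"
```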

Lists

Often a container is needed to hold arbitrary objects of either the same type or varying types. R accomplishes this through lists. They store any number of items of any type. A list can contain all numerics or characters or a mix of the two or data.frames or, recursively, other lists.

Lists are created with the list function where each argument to the function becomes an element of the list.

list(1, 2, 3)
list(theDF, 1:10)

Like data.frames, lists can have names. Each element can have a name, which can be either viewed or assigned using names.

a <- list(TheDataFrame=theDF, TheVector=1:10)
a
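Names can also be attached after a list has been created. A small sketch with hypothetical lists b and d, showing that naming afterwards gives the same result as naming at creation:

```r
# Unnamed list (hypothetical example data)
b <- list(1:3, "text")
names(b)                            # NULL: no names assigned yet

# assign names after creation
names(b) <- c("Numbers", "Label")
names(b)

# the same list built with names from the start
d <- list(Numbers = 1:3, Label = "text")
identical(b, d)
```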

To access an individual element of a list, use double square brackets, specifying either the element number or name. Note that this allows access to only one element at a time. Once an element is accessed it can be treated as if that actual element is being used, allowing nested indexing of elements.

a[["TheDataFrame"]]
a[[1]]
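Nested indexing simply chains further indexing onto the extracted element. The sketch below rebuilds a smaller, hypothetical version of the list a so that it runs on its own; it also contrasts double brackets with single brackets, which return a sub-list rather than the element.

```r
# Hypothetical, smaller version of the list a
a <- list(TheDataFrame = data.frame(First = 3:1,
                                    Sport = c("Hockey", "Curling", "Rugby")),
          TheVector = 1:10)

a[["TheDataFrame"]]$Sport   # a column of the stored data.frame
a[[1]][2, "First"]          # row 2 of its First column
a[["TheVector"]][4]         # fourth element of the stored vector

class(a["TheVector"])       # "list": single brackets keep the list wrapper
class(a[["TheVector"]])     # "integer": double brackets return the element
```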

It is possible to append elements to a list simply by using an index (either numeric or named) that does not exist.

length(a)
## adding new element to list
a[[3]] <- "4"
length(a)

Occasionally appending to a list — or vector or data.frame for that matter — is fine, but doing so repeatedly is computationally expensive. So it is best to create a list as long as its final desired size and then fill it in using the appropriate indices.
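One way to follow this advice is to create the list at its final length with vector and then fill it by index. A minimal sketch; the names n and squares are hypothetical and the squaring is just placeholder work.

```r
n <- 5

# preallocate a list of n empty (NULL) slots
squares <- vector(mode = "list", length = n)

# fill each slot by index instead of appending
for (i in seq_len(n)) {
  squares[[i]] <- i^2
}

length(squares)
unlist(squares)
```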

Matrices

A very common mathematical structure that is essential to statistics is a matrix. This is similar to a data.frame in that it is rectangular with rows and columns, except that every single element, regardless of column, must be the same type, most commonly all numerics. They also act similarly to vectors with element-by-element addition, multiplication, subtraction, division, and equality. The nrow, ncol and dim functions work just like they do for data.frames.

# create a 5x2 matrix
A <- matrix(1:10, nrow=5)
# create another 5x2 matrix
B <- matrix(21:30, nrow=5)
# create a 2x10 matrix (20 values in 2 rows)
C <- matrix(21:40, nrow=2)
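The element-by-element behaviour described above can be checked directly on A and B, which are redefined here so the sketch runs on its own:

```r
A <- matrix(1:10, nrow = 5)
B <- matrix(21:30, nrow = 5)

dim(A)    # 5 2
A + B     # element-by-element addition
A * B     # element-by-element multiplication (not matrix multiplication)
A == B    # element-by-element comparison, returns a logical matrix
```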

Matrix multiplication is a commonly used operation in mathematics, requiring the number of columns of the left-hand matrix to be the same as the number of rows of the right-hand matrix. Both A and B are 5x2, so we will transpose B so it can be used on the right-hand side.

A %*% t(B)
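The dimension rule can be verified with dim: a 5x2 matrix times a 2x5 matrix gives a 5x5 result, and each entry is a dot product of a row from the left and a column from the right. A short sketch, with A and B redefined so it is self-contained and P as a hypothetical name for the product:

```r
A <- matrix(1:10, nrow = 5)
B <- matrix(21:30, nrow = 5)

dim(t(B))         # 2 5
P <- A %*% t(B)
dim(P)            # 5 5

# entry [1, 1] equals the dot product of row 1 of A and row 1 of B
P[1, 1]
sum(A[1, ] * B[1, ])
```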

Another similarity with data.frames is that matrices can also have row and column names.

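rownames(A) <- c("1st", "2nd", "3rd", "4th", "5th")
colnames(C) <- LETTERS[1:10]
# transposing moves the row names to the columns
t(A)
# the product keeps A's row names and C's column names
A %*% C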

Notice the effect when transposing a matrix and multiplying matrices. Transposing naturally flips the row and column names. Matrix multiplication keeps the row names from the left matrix and the column names from the right matrix.

Arrays

An array is essentially a multidimensional vector. All of its elements must be of the same type, and individual elements are accessed in a similar fashion using square brackets. The first index is the row, the second is the column and the remaining indices are for outer dimensions.

## Creating 3D array
theArray <- array(1:12, dim=c(2, 3, 2))
theArray

The main difference between an array and a matrix is that matrices are restricted to two dimensions, while arrays can have an arbitrary number.

## Accessing different dimension of array
theArray[1, , ]
theArray[1, , 1]
theArray[, , 1]
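Checking the dimensions of each result shows what the empty positions do: every index left blank keeps that whole dimension. The sketch below recreates theArray so it runs on its own.

```r
theArray <- array(1:12, dim = c(2, 3, 2))

dim(theArray)           # 2 3 2
theArray[2, 3, 1]       # single element: row 2, column 3, first slice
theArray[1, , 1]        # row 1 of the first slice: a vector of length 3
dim(theArray[1, , ])    # 3 2: row 1 across both slices
dim(theArray[, , 1])    # 2 3: the whole first slice as a matrix
```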

Data come in many types and structures, which can pose a problem for some analysis environments, but R handles them with aplomb. The most common data structure is the one-dimensional vector, which forms the basis of everything in R.

The most powerful structure is the data.frame, something special in R that most other languages do not have, which handles mixed data types in a spreadsheet-like format. Lists are useful for storing collections of items, like a hash in Perl. In the next article, we will discuss how to create custom functions in R and the nuances associated with them.

Writing R Functions — 3

Do share your thoughts and support by commenting and sharing the article among your peer groups.

Vivekanandan Srinivasan
Analytics Vidhya

An analytics professional with over six years of experience spanning across predictive modelling, statistical analysis and big data technologies.