Advanced-Data Structures In R — 2
If you have not read part 1 of the R basics series kindly go through the following article where we discussed Getting Started With R — 1. The series covers fundamentals of R that include data types, control structures, loops, functions, and advanced data structures.
If you are already familiar with these topics and are looking for a comprehensive introduction to the important topics in statistics and machine learning using R, kindly start with the following series, which discusses the necessary topics related to data science.
Many Ways of Reading Data Into R — 1
The contents of this article are the gist of a couple of books that I was introduced to during my IIM-B days.
R for Everyone — Jared P. Lander
Practical Data Science with R — Nina Zumel & John Mount
All the code blocks discussed in the article are available as R Markdown at the GitHub link.
To see all the articles written by me kindly use the link, Vivek Srinivasan.
Sometimes data require more complex storage than simple vectors, and thankfully R provides a host of data structures. The most common are the data.frame, matrix and list, followed by the array. Of these, the data.frame will be most familiar to anyone who has used a spreadsheet, the matrix to people familiar with matrix math and the list to programmers.
data.frames
Perhaps one of the most useful features of R is the data.frame. It is one of the most often cited reasons for R’s ease of use.
On the surface a data.frame is just like an Excel spreadsheet in that it has columns and rows. In statistical terms, each column is a variable and each row is an observation. In terms of how R organizes data.frames, each column is actually a vector, and all of these vectors have the same length. That is very important because it lets each column hold a different type of data. It also implies that within a column every element must be of the same type, just like with vectors.
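To make the "columns are vectors" idea concrete, here is a minimal sketch. The data.frame df and its columns (id, name, score) are hypothetical and only serve as an illustration; stringsAsFactors = FALSE is set explicitly so the character column stays a character vector regardless of R version.

```r
# Hypothetical example data, not taken from the article
df <- data.frame(
  id    = 1:3,
  name  = c("a", "b", "c"),
  score = c(2.5, 3.0, 4.5),
  stringsAsFactors = FALSE
)

# Each column is a vector with its own class ...
col_classes <- sapply(df, class)
print(col_classes)

# ... and every column has the same length (the number of rows)
col_lengths <- sapply(df, length)
print(col_lengths)
```

Each column reports its own class, while all columns share the same length.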
There are numerous ways to construct a data.frame; the simplest is to use the data.frame function. Let’s create a basic data.frame using some of the vectors we have already introduced, namely x, y and q.
x <- 10:1
y <- -4:5
q <- c("Hockey", "Football", "Baseball", "Curling", "Rugby", "Lacrosse", "Basketball", "Tennis", "Cricket", "Soccer")
theDF <- data.frame(x, y, q)
theDF
This creates a 10x3 data.frame consisting of those three vectors. Notice that the column names of theDF are simply the variable names. We could have assigned names during the creation process, which is generally a good idea.
theDF <- data.frame(First=x, Second=y, Sport=q)
theDF
data.frames are complex objects with many attributes. The most frequently checked attributes are the number of rows and columns. Of course, there are functions to do this for us: nrow and ncol. And in case both are wanted at the same time, there is the dim function.
nrow(theDF)
ncol(theDF)
dim(theDF)
Checking the column names of a data.frame is as simple as using the names function. This returns a character vector listing the columns. Since it is a vector, we can access individual elements of it just like any other vector. We can also check and assign the row names of a data.frame using the rownames function.
names(theDF)
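The following is a small sketch of indexing into the result of names and of checking, assigning and resetting row names. It rebuilds a two-column version of theDF so that it runs on its own; the row labels ("One" to "Ten") are just illustrative values.

```r
# Rebuild a small data.frame so the example is self-contained
theDF <- data.frame(First = 10:1, Second = -4:5)

# names() returns a character vector, so it can be indexed like one
col_names   <- names(theDF)
second_name <- col_names[2]

# By default the row names are simply the row numbers (as characters)
default_rows <- rownames(theDF)

# Assign custom row names (illustrative labels)
rownames(theDF) <- c("One", "Two", "Three", "Four", "Five",
                     "Six", "Seven", "Eight", "Nine", "Ten")
named_rows <- rownames(theDF)

# Setting the row names to NULL restores the automatic numbering
rownames(theDF) <- NULL
reset_rows <- rownames(theDF)
```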
Usually, a data.frame has far too many rows to print them all to the screen, so thankfully the head function prints out only the first few rows. Similarly, the tail function prints the last few rows.
head(theDF)
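Both head and tail accept an n argument that controls how many rows are returned (the default is six). A minimal sketch, using a self-contained two-column version of theDF:

```r
# Rebuild a small data.frame so the example is self-contained
theDF <- data.frame(First = 10:1, Second = -4:5)

# Default: first six rows
default_head <- head(theDF)

# Ask for a specific number of rows from the top or the bottom
first_three <- head(theDF, n = 3)
last_two    <- tail(theDF, n = 2)

print(first_three)
print(last_two)
```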
Since each column of the data.frame is an individual vector, it can be accessed individually and each has its own class. Like many other aspects of R, there are multiple ways to access an individual column. There is the $ operator and also the square brackets. Running theDF$Sport will give the third column in theDF. That allows us to specify one particular column by name.
theDF$Sport
Similar to vectors, data.frames allow us to access individual elements by their position using square brackets, but instead of having one position, two are specified. The first is the row number and the second is the column number. So to get the element in the third row of the second column we use theDF[3, 2].
theDF[3, 2]
To specify more than one row or column, use a vector of indices. The following accesses rows 3 and 5, columns 2 through 3.
theDF[c(3, 5), 2:3]
To access an entire row, specify that row while not specifying any column. Likewise, to access an entire column, specify that column while not specifying any row.
theDF[2, ]
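For completeness, here is a sketch of the column-oriented counterparts of the row access above. It rebuilds theDF so it can run on its own, and sets stringsAsFactors = FALSE explicitly (an assumption made here so the Sport column is a character vector on any R version). The drop = FALSE argument is a standard option of square-bracket indexing that keeps a single column as a data.frame instead of simplifying it to a vector.

```r
# Rebuild theDF so the example is self-contained
q <- c("Hockey", "Football", "Baseball", "Curling", "Rugby",
       "Lacrosse", "Basketball", "Tennis", "Cricket", "Soccer")
theDF <- data.frame(First = 10:1, Second = -4:5, Sport = q,
                    stringsAsFactors = FALSE)

# An entire column: leave the row position empty
col_by_pos  <- theDF[, 3]          # by position, returns a vector
col_by_name <- theDF[, "Sport"]    # by name, returns a vector
col_double  <- theDF[["Sport"]]    # double brackets, also a vector

# Keep the result as a one-column data.frame
one_col_df <- theDF[, "Sport", drop = FALSE]

# Several columns by name return a data.frame
two_cols <- theDF[, c("First", "Sport")]
```

Whether a vector or a one-column data.frame is more convenient depends on what the next step expects, so it helps to be explicit with drop = FALSE when the structure matters.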
We see that factors are stored in a special way. To see how they would be represented in data.frame form, use model.matrix to create a set of indicator (or dummy) variables. That is one column for each level of a factor, with a 1 if a row contains that level or a 0 otherwise. It is equivalent to creating dummy variables for factor columns.
model.matrix(~ theDF$Sport - 1)
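A smaller, hypothetical factor makes the structure of the indicator matrix easier to see. The vector sport below is made up for illustration; the - 1 in the formula drops the intercept so that every level gets its own column.

```r
# Hypothetical factor with three levels and four observations
sport <- factor(c("Hockey", "Curling", "Hockey", "Rugby"))

# One indicator column per level, no intercept column
dummies <- model.matrix(~ sport - 1)
print(dummies)

# One row per observation, one column per level
dim(dummies)

# Each row has exactly one 1, marking the level of that observation
rowSums(dummies)
```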
Lists
Often a container is needed to hold arbitrary objects of either the same type or varying types. R accomplishes this through lists. They store any number of items of any type. A list can contain all numerics or characters or a mix of the two or data.frames or, recursively, other lists.
Lists are created with the list function, where each argument to the function becomes an element of the list.
list(1, 2, 3)
list(theDF, 1:10)
Like data.frames, lists can have names. Each element can have a name, which can be viewed or assigned using names.
a <- list(TheDataFrame=theDF, TheVector=1:10)
a
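As a quick sketch of viewing and assigning list names after creation: the list b and its labels ("Numbers", "Label") are hypothetical, and a is rebuilt with a tiny data.frame so the snippet runs on its own.

```r
# A list created with names
a <- list(TheDataFrame = data.frame(x = 1:3), TheVector = 1:10)
a_names <- names(a)

# A list created without names has NULL names ...
b <- list(1:3, "text")
was_unnamed <- is.null(names(b))

# ... which can be assigned afterwards
names(b) <- c("Numbers", "Label")
b_names <- names(b)
print(b)
```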
To access an individual element of a list, use double square brackets, specifying either the element number or name. Note that this allows access to only one element at a time. Once an element is accessed, it can be treated as if that actual element is being used, allowing nested indexing of elements.
a[["TheDataFrame"]]   # access by name
a[[1]]                # access by position
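Nested indexing can be sketched as follows. The list a is rebuilt here with a two-column data.frame so the snippet is self-contained. The last lines contrast double brackets with single brackets, which return a (shorter) list rather than the element itself.

```r
# Rebuild the list so the example is self-contained
theDF <- data.frame(First = 10:1, Second = -4:5)
a <- list(TheDataFrame = theDF, TheVector = 1:10)

# Access the element, then index into it as usual
second_col <- a[["TheDataFrame"]]$Second   # a column of the data.frame
one_cell   <- a[[1]][3, 2]                 # row 3, column 2 of the data.frame
one_value  <- a[["TheVector"]][4]          # fourth element of the vector

# Single brackets keep the list wrapper
still_a_list <- a[1]
class(still_a_list)   # "list"
class(a[[1]])         # "data.frame"
```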
It is possible to append elements to a list simply by using an index (either numeric or named) that does not exist.
length(a)
## adding new element to list
a[[3]] <- "4"
length(a)
Occasionally appending to a list (or vector or data.frame for that matter) is fine, but doing so repeatedly is computationally expensive. So it is best to create a list as long as its final desired size and then fill it in using the appropriate indices.
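One way to preallocate is the vector function with mode = "list", which creates a list of a given length whose elements are all NULL. The sketch below fills such a list in a loop; the names n and squares and the choice of storing squares are purely illustrative.

```r
# Desired final size (illustrative)
n <- 5

# Create an empty list of the final length up front
squares <- vector(mode = "list", length = n)
length(squares)   # already n, every element is NULL

# Fill it in by index instead of growing it one element at a time
for (i in seq_len(n)) {
  squares[[i]] <- i^2
}
print(squares)
```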
Matrices
A very common mathematical structure that is essential to statistics is a matrix. This is similar to a data.frame in that it is rectangular with rows and columns, except that every single element, regardless of column, must be the same type, most commonly all numerics. Matrices also act similarly to vectors, with element-by-element addition, multiplication, subtraction, division, and equality. The nrow, ncol and dim functions work just like they do for data.frames.
# create a 5x2 matrix
A <- matrix(1:10, nrow=5)
# create another 5x2 matrix
B <- matrix(21:30, nrow=5)
# create a 2x10 matrix
C <- matrix(21:40, nrow=2)
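The element-by-element behaviour described above can be sketched with two matrices of the same shape. A and B are recreated here so the snippet runs on its own.

```r
# Two 5x2 matrices of the same shape
A <- matrix(1:10, nrow = 5)
B <- matrix(21:30, nrow = 5)

# Dimensions work as they do for data.frames
nrow(A)
ncol(A)
dim(A)

# Arithmetic and comparison are applied element by element
sum_AB  <- A + B
prod_AB <- A * B    # element-wise product, not matrix multiplication
eq_AB   <- A == B   # logical matrix of the same shape

print(sum_AB)
print(eq_AB)
```

Note that * is the element-wise product; matrix multiplication uses the %*% operator.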
Matrix multiplication is a commonly used operation in mathematics, requiring the number of columns of the left-hand matrix to be the same as the number of rows of the right-hand matrix. Both A and B are 5x2, so we will transpose B so it can be used on the right-hand side.
A %*% t(B)
Another similarity with data.frames is that matrices can also have row and column names.
rownames(A) <- c("1st", "2nd", "3rd", "4th", "5th")
colnames(C) <- LETTERS[1:10]
A %*% C
Notice the effect when transposing a matrix and multiplying matrices. Transposing naturally flips the row and column names. Matrix multiplication keeps the row names from the left matrix and the column names from the right matrix.
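The behaviour of the names can be checked directly with dimnames. In this sketch A and C are recreated so it runs on its own, and the column names given to A ("X" and "Y") are an assumption added only to make the flip visible.

```r
# Recreate the matrices so the example is self-contained
A <- matrix(1:10, nrow = 5)
C <- matrix(21:40, nrow = 2)

rownames(A) <- c("1st", "2nd", "3rd", "4th", "5th")
colnames(A) <- c("X", "Y")          # illustrative names, not in the article
colnames(C) <- LETTERS[1:10]

# Transposing swaps row and column names
tA <- t(A)
dimnames(tA)

# The product takes row names from the left matrix
# and column names from the right matrix
AC <- A %*% C
dimnames(AC)
dim(AC)
```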
Arrays
An array is essentially a multidimensional vector. Its elements must all be of the same type, and individual elements are accessed in a similar fashion using square brackets. The first index is the row index, the second is the column index and the remaining indices are for outer dimensions.
## Creating 3D array
theArray <- array(1:12, dim=c(2, 3, 2))
theArray
The main difference between an array and a matrix is that matrices are restricted to two dimensions, while arrays can have an arbitrary number.
## Accessing different dimension of array
theArray[1, , ]
theArray[1, , 1]
theArray[, , 1]
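To see what these subsets return, here is a short sketch that checks the shape of each result. The array is recreated so the snippet is self-contained; R fills arrays in column-major order, so the first index varies fastest.

```r
# 2 rows, 3 columns, 2 outer slices
theArray <- array(1:12, dim = c(2, 3, 2))
dim(theArray)

# A single element: row 2, column 3, first slice
single <- theArray[2, 3, 1]

# Fixing only the row leaves a 3x2 matrix (columns by slices)
row_one <- theArray[1, , ]
dim(row_one)

# Fixing the row and the slice leaves a plain vector
row_one_slice_one <- theArray[1, , 1]

# Fixing only the slice leaves a 2x3 matrix
slice_one <- theArray[, , 1]
dim(slice_one)
```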
Data come in many types and structures, which can pose a problem for some analysis environments, but R handles them with aplomb. The most common data structure is the one-dimensional vector, which forms the basis of everything in R.
The most powerful structure is the data.frame, something special in R that most other languages do not have, which handles mixed data types in a spreadsheet-like format. Lists are useful for storing collections of items, like a hash in Perl. In the next article, we will discuss how to create custom functions in R and the nuances associated with them.