Data Structure in R programming

Sharon Chetia
Published in Analytics Vidhya
12 min read · Jun 9, 2020

Bad programmers worry about the code. Good programmers worry about data structures and their relationships. — Linus Torvalds

Starting something can be as daunting as continuously nurturing the will and drive to keep it going and seeing it through till the very end. Learning a new programming language is quite similar.

When starting to learn a new programming language, one of the first things we need to do is understand its data structures. Why? Because when you know the data structures, you know the best ways to store, organize, manage, access and modify your data, saving you time and memory space. This article is all about the data structures in the R programming language.

Before we start with the Basic R data structures, we need to know about two concepts in Data Structure:

  1. Homogeneity: Tells us whether the data structure is ‘homogeneous’, i.e. it contains only one type of data (for instance all numeric or all character), or ‘heterogeneous’, i.e. it can hold a combination of multiple data types.
  2. Dimension: Tells us how the data is laid out: linear (1 D), tabular (2 D), and so on.

1. Vector: It is a linear or One-Dimensional collection of data of the same data type. The data is indexed, i.e. every element in a vector is numbered, and in R the count starts from 1. Even a single element is stored as a vector of length 1.

What happens when you try to store elements of different data types in the same vector?

It results in implicit conversion (also called coercion). This means that R converts the elements to the one data type that can represent all the elements in the vector. Example: Suppose a vector holds character strings except for the element at index 2, which is numeric. When R reads this, it implicitly converts that element to the character data type so that it aligns with the rest of the elements in the vector.
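
To make the conversion concrete, here is a minimal sketch. The vector name `mixed` and its values are made up for illustration.

```r
# The element at index 2 is numeric; the others are character strings.
mixed <- c("a", 2, "c", "d")

class(mixed)   # "character"
mixed[2]       # "2" (now a string, not a number)

# Coercion moves towards the more general type:
# logical < integer < double < complex < character
class(c(TRUE, 1L))   # "integer"
class(c(1L, 2.5))    # "numeric"
```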

Ways to create a vector in R:

a) Using the combine function c() and the assign() function

Syntax for assign(): assign("variable_name", c(element1, element2, ...))

Syntax for c(): variable_name <- c(element1, element2, ...)

Note that we can also use the ‘=’ sign for assignment, though ‘<-’ is the recommended style.

e.g. x = c(1, 2, 3)
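
A short sketch of both approaches side by side; the variable names `x` and `y` are arbitrary.

```r
# Create a vector with c() and the assignment arrow.
x <- c(1, 2, 3)

# assign() takes the variable name as a string and does the same job.
assign("y", c(1, 2, 3))

identical(x, y)   # TRUE
```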

b) Using the Sequence function and the colon ‘:’ operator

In case we need to generate a vector containing a sequence of numbers, we can do so in either of the below two ways:

Syntax for seq(): seq(first_number, last_number, by = step_size)

Syntax for ‘:’: first_number:last_number

An advantage of seq() over ‘:’ is that we can also specify the step size, i.e. the increment between consecutive numbers in our sequence. Below is an example:
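
A minimal sketch comparing the two; the variable names are only for illustration.

```r
# ':' always moves in steps of 1.
one_to_five <- 1:5                   # 1 2 3 4 5

# seq() lets us choose the step size with 'by'.
odds     <- seq(1, 10, by = 2)       # 1 3 5 7 9
quarters <- seq(0, 1, by = 0.25)     # 0.00 0.25 0.50 0.75 1.00
```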

c) Using the Replicate Function

We can use the rep() function when we want to create a vector containing the same element multiple times or when we want to replicate an existing vector multiple times.

Syntax — rep(x,times) where x is a vector and times is the number of times we want to replicate x.

Note the difference between the ‘times’ and ‘each’ arguments: ‘times’ repeats the whole vector, while ‘each’ repeats every element in turn before moving on to the next one.
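
The difference is easiest to see in a small sketch (values chosen only for illustration):

```r
fives   <- rep(5, times = 3)             # 5 5 5
by_time <- rep(c("a", "b"), times = 2)   # "a" "b" "a" "b"
by_each <- rep(c("a", "b"), each = 2)    # "a" "a" "b" "b"
```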

d) Using the vector() function

The vector() function creates a vector of mode and length as specified by the programmer. It initializes the vector to default values for each mode.

Syntax for vector(): vector(mode = "mode_name", length = n)

The code below shows the output for some of the basic modes:
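
A minimal sketch for a few common modes; the variable names are arbitrary.

```r
v_num  <- vector(mode = "numeric",   length = 3)   # 0 0 0
v_chr  <- vector(mode = "character", length = 2)   # "" ""
v_lgl  <- vector(mode = "logical",   length = 2)   # FALSE FALSE
v_list <- vector(mode = "list",      length = 2)   # a list holding two NULLs
```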

e) Using the as.vector() function

We can use the as.vector() function to explicitly convert its argument into a vector of a specified mode. This is also known as ‘Typecasting’.

Syntax: as.vector(x, mode = "mode_name")

where x is an R object.

We will get a little ahead of ourselves in this next example, since we are yet to discuss matrices. But to understand how as.vector() works, all we need to know here is that we already have an R object called my_matrix, which has a Matrix data structure, and we want to convert it into a vector. It can be done as shown below:

Note that we can use the is.vector() function to confirm that typecasting has actually converted the matrix into a vector.
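
As a sketch, assume my_matrix is a small 2 x 3 matrix (its contents are made up here):

```r
my_matrix <- matrix(1:6, nrow = 2)
is.vector(my_matrix)                 # FALSE, it carries a 'dim' attribute

my_vector <- as.vector(my_matrix)    # elements are read column by column
my_vector                            # 1 2 3 4 5 6
is.vector(my_vector)                 # TRUE

# The 'mode' argument converts the type at the same time.
my_chars <- as.vector(my_matrix, mode = "character")
```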

Accessing elements in a vector - Using [] and dimension names

We use the square brackets [] to access elements of a vector by referencing their respective index numbers or dimension names. Example below:

Note that instead of indices, we can also access the elements by using the dimension names as reference i.e. [dim_name] instead of [index_number]. This will be discussed shortly while we learn how to name vector dimensions.
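
A minimal sketch of index-based access, using a hypothetical vector called student:

```r
student <- c("Asha", "Ravi", "Meena", "John")

second    <- student[2]          # "Ravi"
first_3rd <- student[c(1, 3)]    # "Asha" "Meena"
last_two  <- student[3:4]        # "Meena" "John"
not_first <- student[-1]         # a negative index drops that element
```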

Naming vector Indices - Using names() function

As already discussed earlier, vectors are one dimensional and their elements are indexed with numbers starting from 1, but we can also choose to name a vector index instead of using the index numbers to reference them.

Syntax: names(vector_name) <- c("dim_name1", "dim_name2", ...)

Example:
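
Here is a small sketch with a hypothetical vector student; the names roll_1 to roll_3 are invented for illustration.

```r
student <- c("Asha", "Ravi", "Meena")
names(student) <- c("roll_1", "roll_2", "roll_3")

names(student)       # "roll_1" "roll_2" "roll_3"
student["roll_2"]    # the element named roll_2, i.e. "Ravi"
student[["roll_2"]]  # the same value without its name attached
```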

Deleting index names is quite simple.

Syntax — names(vector_name)<-NULL

For example: names(student)<-NULL

2. Matrix: It is a tabular or Two-Dimensional collection of data of the same type (homogeneous). It has rows and columns. Note that, as in the case of vectors, if we try to build a matrix containing elements of different data types, R will implicitly convert the elements to the simplest data type that can represent all the information.

Creating a Matrix in R:

a) Using the function matrix()

Syntax — matrix(data=, nrow=, ncol=, byrow= )

Here, data is a data vector, nrow is the number of rows and ncol is the number of columns. Also byrow=TRUE will fill data row wise and byrow=FALSE will fill data column-wise.

Example: Below is a matrix containing 3 rows and 4 columns of character data, filled row-wise (byrow=TRUE).
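
A minimal sketch of such a matrix; the letters used as data are only placeholders.

```r
# 12 character elements arranged in 3 rows and 4 columns, filled row by row.
A <- matrix(letters[1:12], nrow = 3, ncol = 4, byrow = TRUE)
A[1, ]   # "a" "b" "c" "d"

# The same data filled column by column (the default, byrow = FALSE).
B <- matrix(letters[1:12], nrow = 3, ncol = 4)
B[, 1]   # "a" "b" "c"
```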

b) Using the function rbind()

The rbind() function takes a sequence of vectors, matrices or dataframes and combines them row-wise.

Syntax: rbind(x1, x2, ...) where x1, x2, ... can be vectors, matrices or dataframes.

Example 1: Considering the same matrix A, we can use the below code to rearrange the elements row-wise.

Example 2: Combining vectors to form a Matrix
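
A sketch with three made-up numeric vectors; each vector becomes one row of the matrix.

```r
x <- c(1, 2, 3)
y <- c(4, 5, 6)
z <- c(7, 8, 9)

M <- rbind(x, y, z)
dim(M)        # 3 3
rownames(M)   # "x" "y" "z" (taken from the vector names)
M[2, ]        # 4 5 6
```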

c) Using the function cbind()

The cbind() function takes a sequence of vectors, matrices or dataframes and combines them column-wise.

Syntax: cbind(x1, x2, ...) where x1, x2, ... can be vectors, matrices or dataframes.

Example: Below we use cbind to combine 3 vectors x,y,z into a single matrix A
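
A sketch of this, with made-up values for x, y and z; each vector becomes one column.

```r
x <- c(1, 2, 3)
y <- c(4, 5, 6)
z <- c(7, 8, 9)

A <- cbind(x, y, z)
dim(A)        # 3 3
colnames(A)   # "x" "y" "z"
A[, 2]        # 4 5 6
```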

Naming Matrix dimensions

Below are the various ways in which we can name Matrix dimensions

a) We can name the dimensions while creating the matrix with cbind() or rbind(), by giving each vector a name.

However, note that in this case we can add only one set of dimension names (i.e. only column names if we are using cbind() and only row names if we are using rbind()).
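
A minimal sketch; the names col_a, col_b, row_a and row_b are invented for illustration.

```r
# Named arguments to cbind() become column names; rows stay unnamed.
A <- cbind(col_a = c(1, 2), col_b = c(3, 4))
colnames(A)   # "col_a" "col_b"
rownames(A)   # NULL

# Named arguments to rbind() become row names; columns stay unnamed.
B <- rbind(row_a = c(1, 2), row_b = c(3, 4))
rownames(B)   # "row_a" "row_b"
colnames(B)   # NULL
```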

b) Using the colnames() and rownames() functions

Say we have a matrix called ‘my_matrix’ and we want to name both its dimensions, i.e. both rows and columns.

Below is how we do it:
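
As a sketch, assume my_matrix is a 2 x 3 numeric matrix; its contents and the names r1, r2, c1, c2, c3 are made up.

```r
my_matrix <- matrix(1:6, nrow = 2)

rownames(my_matrix) <- c("r1", "r2")
colnames(my_matrix) <- c("c1", "c2", "c3")

my_matrix["r2", "c3"]   # 6
```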

c) Using the matrix() function along with its ‘dimnames’ argument. Note that dimnames only accepts a list as its value. (Lists are discussed later in this article.)

Syntax — matrix(data=, nrow=, ncol=, byrow=, dimnames=list(x,y))

where x is the vector containing row names and y is the vector containing the column names.
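
A minimal sketch with invented row and column names:

```r
my_matrix <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE,
                    dimnames = list(c("r1", "r2"),          # row names
                                    c("c1", "c2", "c3")))   # column names

my_matrix["r2", "c1"]   # 4
```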

Accessing elements in a Matrix

There are various ways in which we can access a Matrix by either using indices or dimension names as reference. The image below illustrates 4 ways in which we can access the element “000” highlighted in yellow from the matrix ‘A’.
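
As a minimal sketch, assume a small character matrix A in which "000" sits in row 2, column 2. The contents and the dimension names are made up here, and the four forms shown are one plausible set of ways to mix indices and names.

```r
A <- matrix(c("a", "b", "c",
              "d", "000", "f"),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("r1", "r2"), c("c1", "c2", "c3")))

A[2, 2]         # by row index and column index
A["r2", "c2"]   # by row name and column name
A[2, "c2"]      # by row index and column name
A["r2", 2]      # by row name and column index
```

All four expressions return "000".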

3. Arrays: Arrays, like matrices, contain elements of the same data type only. But unlike matrices, arrays can have more than 2 dimensions, i.e. arrays are homogeneous Multi-Dimensional data structures. A matrix is in fact a 2-dimensional array.

Creating an Array in R:

We use the array() function to create arrays.

Syntax: array(data= , dim=c(no_of_rows, no_of_columns, no_of_matrices))

for instance: w<-array(c(1:18),dim=c(3,3,2))

Example: Say we collected the pass and fail percentages of 4 colleges in a city for the last 3 years. The record for each college can be stored in a separate matrix of 2 rows (Pass % and Fail %) and 3 columns (one per year), and then we can combine the data for all 4 colleges in a 3-dimensional array that stacks the 4 matrices.

The array structure can be imagined as cards stacked one above the other.

Below is the R code for the same:
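
The sketch below uses invented percentages, laid out as 2 rows (Pass, Fail), 3 columns (years) and 4 stacked matrices (colleges). The numbers are placeholders, chosen so that the Fail % of year 3 for college 4 is 35.

```r
# Pass % per year: three values per college, four colleges (made-up numbers).
pass <- c(70, 75, 80,    # college 1, years 1 to 3
          60, 62, 68,    # college 2
          55, 58, 61,    # college 3
          72, 70, 65)    # college 4

# Interleave Pass and Fail so that each pair fills one column of a card.
values <- as.vector(rbind(pass, 100 - pass))

my_array <- array(values, dim = c(2, 3, 4),
                  dimnames = list(c("Pass", "Fail"),
                                  c("Year1", "Year2", "Year3"),
                                  paste0("College", 1:4)))

my_array[, , 1]     # the first card: the 2 x 3 matrix for college 1
my_array[2, 3, 4]   # 35
```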

Accessing elements in an Array

Syntax: array_name[row, column, matrix_number]

We can think of the third index as the card number. Say we want to know the Fail % of year 3 for college 4, our code will be as shown below:

my_array[2,3,4] #Output will be 35

4. List:

Lists are 1-Dimensional Heterogeneous data structures. A list can contain any of the data types: numeric, logical, character, complex. It can also contain vectors, matrices, dataframes, arrays and even other lists inside it. The data is stored in a linear fashion, i.e. 1 D.

Creating a List in R:

By using the list() function

Syntax: list(data) where data is one or more R objects

Example:
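
As a sketch, assume the list holds a character vector x and a numeric matrix y; the exact contents are chosen here for illustration.

```r
x <- c("a", "b", "c")
y <- matrix(1:8, nrow = 2)   # 2 rows, 4 columns

my_list <- list(x, y)

length(my_list)   # 2
str(my_list)      # shows a character vector and an integer matrix
```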

Naming List Indices - Using names() function

Similar to vectors, indices in lists too can be named using the function names()

Example: For the same list that was created earlier, we can name it as below:

names(my_list) <- c("ABC", "Numbers")

my_list

Accessing list elements- Using Index numbers and index names

As with vectors, we can access list elements using the [] symbol. We just need to be a little careful in selecting the index numbers. Let us look at two examples below.

a) Using index numbers to access elements

Let us consider the same list ‘my_list’ that we created earlier. It has two elements, a vector ‘x’ and a matrix ‘y’.

Take a look at the indexing in the output and try and understand it.

[[1]] here means it is the first element (which is a vector) inside the list.

[[2]] here means it is the second element (which is a matrix) inside the list.

Now look at the red arrows in the picture below. We can see that [1] here represents the first element “a” inside the vector. Similarly “b” and “c” have indices [2] and [3] respectively. The way it appears in R is that we see only the first index in every new line as shown by the arrow in blue.

So now if we want to access say “b” from the list we have to follow the indexing path:

my_list → first element inside the list, i.e. the vector x → second element inside the vector x, i.e. “b”

Which can be written as:
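
A sketch of that path in code, assuming my_list holds the vector c("a", "b", "c") and a 2 x 4 matrix:

```r
my_list <- list(c("a", "b", "c"), matrix(1:8, nrow = 2))

# [[1]] picks the vector out of the list, [2] picks its second element.
my_list[[1]][2]   # "b"
```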

Similarly say we want to access the number 8 from the matrix

my_list → second element inside the list, i.e. the matrix y → element in the cell [2,4] inside the matrix y, i.e. 8

Which can be written as:
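
Again as a sketch, assuming the matrix inside my_list is a 2 x 4 matrix filled with 1 to 8 column by column:

```r
my_list <- list(c("a", "b", "c"), matrix(1:8, nrow = 2))

# [[2]] picks the matrix out of the list, [2, 4] picks row 2, column 4.
my_list[[2]][2, 4]   # 8
```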

b) Accessing named elements- Using the $ sign

Now say our list indices are named. The vector inside the list is named “ABC” and the matrix is named “Numbers”, and we want to access the element “b” from the vector and the element 8 from the matrix inside the list. We use the $ sign here, which is used to access named elements in R.

Below is how we do it:
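
A minimal sketch, with the list contents assumed as before:

```r
my_list <- list(c("a", "b", "c"), matrix(1:8, nrow = 2))
names(my_list) <- c("ABC", "Numbers")

my_list$ABC[2]          # "b"
my_list$Numbers[2, 4]   # 8

# The same elements can also be reached with [[ ]] and the name as a string.
my_list[["ABC"]][2]     # "b"
```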

5. Dataframe:

Dataframes are 2-Dimensional Heterogeneous data structures. You can imagine a dataframe as something similar to what we see in MS-Excel. It has rows and columns, and every column holds elements of the same data type. So we can think of columns as vectors (1 D and homogeneous). Rows can have elements of different data types. Refer to the image below.

Creating a Dataframe in R- Using the data.frame() function

Syntax: data.frame(vector1, vector2, ...)

where vector1, vector2, ... are all of the same length.

Example:
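
A minimal sketch with two made-up vectors of equal length; the name letters_df is arbitrary.

```r
alphabets <- LETTERS[1:4]     # character column
numbers   <- c(1, 2, 3, 4)    # numeric column

letters_df <- data.frame(alphabets, numbers)

dim(letters_df)     # 4 2
names(letters_df)   # "alphabets" "numbers" (taken from the variable names)
str(letters_df)     # shows the type of each column
```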

Naming Dataframe Dimensions - Using names() function

a) We can name the dimensions while creating the dataframes like below:

data.frame(Alphabets=LETTERS[1:4],Numbers=c(1,2,3,4))

b) Using colnames() and rownames() functions
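
A sketch of this approach; the row names first to fourth are invented for illustration.

```r
df <- data.frame(LETTERS[1:4], c(1, 2, 3, 4))

colnames(df) <- c("Alphabets", "Numbers")
rownames(df) <- c("first", "second", "third", "fourth")

df["third", "Numbers"]   # 3
```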

Accessing Dataframe elements - Using Index numbers and names

We can access Dataframe elements in the same way we access Matrix elements, using the row and column number of a particular element. Say we want to access the Population of Bangalore, which is 8436675, from my_df. We can do it in two ways as shown below:
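
As a sketch, assume my_df has a City column and a Population column with Bangalore in row 2. The other rows (CityA, CityB) and their values are placeholders, not real data.

```r
my_df <- data.frame(City       = c("CityA", "Bangalore", "CityB"),
                    Population = c(1000000, 8436675, 2000000))

# Way 1: row number and column number
my_df[2, 2]              # 8436675

# Way 2: row number and column name
my_df[2, "Population"]   # 8436675

# The $ sign also works, since every column is a named vector.
my_df$Population[my_df$City == "Bangalore"]
```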

6. Factor:

When working with categorical data, we need to define it properly for better data modelling. We can do so by storing it as a factor, which in other words means storing the values as different ‘levels’. Factors can store both strings and integers. Some examples of categorical data are the different human blood group types, or the types of payment transaction such as ATM card, cash, cheque, internet banking etc.

Consider the character vector ‘tt’:

tt <- c("ATMcard", "Cash", "Cheque", "NetBanking", "ATMcard", "NetBanking",
        "Cheque", "Cash", "Cheque", "NetBanking", "Cash")

We can convert it into a factor since its data can be meaningfully categorized.

Converting a character vector to a Factor

Note that we can now see the unique levels at the bottom of our output. We can also use the function levels(tt_factored) if we want to see the different levels in our factor ‘tt_factored’.
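
The conversion described above can be sketched with the factor() function:

```r
tt <- c("ATMcard", "Cash", "Cheque", "NetBanking", "ATMcard", "NetBanking",
        "Cheque", "Cash", "Cheque", "NetBanking", "Cash")

tt_factored <- factor(tt)

tt_factored           # prints the values, followed by a 'Levels:' line
levels(tt_factored)   # "ATMcard" "Cash" "Cheque" "NetBanking"
nlevels(tt_factored)  # 4
```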

How are these levels stored internally in R ?

An important thing to understand here is that these levels are stored as integers internally. To confirm this we can use the typeof() and str() functions. Each level is assigned an integer number starting from 1.

If the levels are of character type, like in our example, they will be arranged in alphabetical order and assigned a number starting from 1. So, ATMcard is level 1, Cash is level 2, Cheque is level 3 and NetBanking is level 4.

If the levels are numeric then they will be arranged in ascending order and then assigned a level starting from 1.
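
A short sketch that confirms the internal integer codes; the numeric vector in the second part is made up.

```r
tt <- c("ATMcard", "Cash", "Cheque", "NetBanking", "ATMcard", "NetBanking",
        "Cheque", "Cash", "Cheque", "NetBanking", "Cash")
tt_factored <- factor(tt)

typeof(tt_factored)              # "integer"
codes <- as.integer(tt_factored)
codes                            # 1 2 3 4 1 4 3 2 3 4 2

# Numeric values are sorted in ascending order before being numbered.
num_factored <- factor(c(30, 10, 20, 10))
levels(num_factored)             # "10" "20" "30"
num_codes <- as.integer(num_factored)
num_codes                        # 3 1 2 1
```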

Can we change the level numbering instead of accepting the default ascending/alphabetical order ?

The answer is yes! We can pass the desired order of levels to the levels parameter as shown below.
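
A minimal sketch; the custom order used here is arbitrary.

```r
tt <- c("ATMcard", "Cash", "Cheque", "NetBanking", "ATMcard", "NetBanking",
        "Cheque", "Cash", "Cheque", "NetBanking", "Cash")

# Levels are numbered in the order we list them: Cash = 1, Cheque = 2, ...
tt_custom <- factor(tt, levels = c("Cash", "Cheque", "ATMcard", "NetBanking"))

levels(tt_custom)                       # "Cash" "Cheque" "ATMcard" "NetBanking"
custom_codes <- as.integer(tt_custom)
custom_codes                            # 3 1 2 4 3 4 2 1 2 4 1
```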

We have finally reached the end. Wooo Hooo!!!

There is of course plenty of other interesting stuff about data structures. I have tried to include the basic concepts, and I hope this helps anyone with a newfound love for R programming and anything related to data.

Take care and continue learning!
