Getting Started With R — 1

Vivekanandan Srinivasan
Published in Analytics Vidhya
13 min read · Nov 26, 2019

This series is intended for R beginners who are looking for a quick introduction to the basics of R. It covers the fundamentals of R, including data types, control structures, loops, functions, and advanced data structures.

If you are already familiar with these topics and are looking for a comprehensive introduction to the important topics in statistics and machine learning using R, start with the following series, which discusses the necessary topics related to data science.

Many Ways of Reading Data Into R — 1

The contents are inspired by a couple of books that I was introduced to during my IIM-B days.

R for Everyone — Jared P. Lander

Practical Data Science with R — Nina Zumel & John Mount

All the code blocks discussed in the article are available as R Markdown on GitHub.

I hope it helps and let’s get started !!!!!!!

R is a powerful tool for all manner of calculations, data manipulation, and scientific computations. Before getting to the complex operations possible in R we must start with the basics. Like most languages, R has its share of mathematical capability, variables, functions and data types.

Basic Math

Being a statistical programming language, R can certainly be used to do basic math and that is where we will start. The “hello, world!” of basic math is 1 + 1. In the console, there is a right angle bracket (>) where code should be entered. Let us go straight to a slightly more complex operation as an example.

(4 * 6) + 5

These follow the basic order of operations: Parentheses, Exponents, Multiplication, Division, Addition and Subtraction (PEMDAS). This means operations inside parentheses take priority over other operations. Next on the priority list is exponentiation. After that, multiplication and division are performed, followed by addition and subtraction.
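A few more expressions make the precedence rules concrete. This is only a small sketch; the numbers are chosen for illustration and are not from the original article.

```r
# Multiplication binds tighter than addition
4 * 6 + 5      # 29

# Parentheses override the default order
4 * (6 + 5)    # 44

# Exponentiation is evaluated before multiplication
2 * 3 ^ 2      # 18

# Exponentiation is also evaluated before unary minus
-2 ^ 2         # -4
(-2) ^ 2       # 4
```

When in doubt, adding parentheses costs nothing and makes the intended order explicit.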

So far we have put white space around each operator, such as * and +. This is not necessary but is encouraged as good coding practice.

Variables

Variables are an integral part of any programming language and R offers a great deal of flexibility. Unlike statically typed languages such as C++, R does not require variable types to be declared. A variable can take on any available data type. It can also hold any R object such as a function, the result of an analysis or a plot. A single variable can at one point hold a number, then later hold a character and then later a number again.

There are a number of ways to assign a value to a variable, and again, this does not depend on the type of value being assigned. The valid assignment operators are <- and =, with the first being preferred. For example, let’s save 2 to the variable x and 5 to the variable y. The arrow operator can also point in the other direction.

x <- 2
y = 5
3 -> z
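Because assignment does not depend on the type of the value, the same variable can be reassigned to something entirely different. The sketch below uses a hypothetical variable named item to show this.

```r
# The same name can hold values of different types over time
item <- 2          # a number
class(item)        # "numeric"

item <- "two"      # now a character string
class(item)        # "character"

item <- sqrt       # a variable can even hold a function
class(item)        # "function"
item(16)           # 4
```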

The assignment operator can be chained to assign the same value to multiple variables at once.

a <- b <- 7

A more laborious, though sometimes necessary, way to assign variables is to use the assign function.

assign("j", 4)
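One situation where assign can be useful is when the variable name is only known at run time. The following is a minimal sketch; the names suffix and score_total are made up for illustration. The base R function get is the counterpart that looks a variable up by its name.

```r
# Build a variable name from a string, then assign to it
suffix <- "total"
assign(paste0("score_", suffix), 42)

# The variable now exists under the constructed name
score_total            # 42

# get() retrieves a variable by name
get("score_total")     # 42
```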

Variable names can contain any combination of alphanumeric characters along with periods (.) and underscores (_). However, they cannot start with a number or an underscore.

The most common form of assignment in the R community is the left arrow (<-), which may seem awkward to use at first but eventually becomes second nature. It even seems to make sense, as the variable is sort of pointing to its value. There is also a particularly nice benefit for people coming from languages like SQL, where a single equal sign (=) tests for equality.

It is generally considered best practice to use descriptive names, usually nouns, for variables instead of single letters. This provides more information to the person reading the code.

For various reasons, a variable may need to be removed. This is easily done using remove or its shortcut rm.

rm(j)

This frees up memory so that R can store more objects, although it does not necessarily free up memory for the operating system. Calling gc triggers garbage collection, which may prompt R to return unused memory to the operating system. R automatically does garbage collection periodically, so this function is not essential.
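Whether a variable has really been removed can be checked with the base R function exists. This is a small sketch using a hypothetical variable temp_value.

```r
temp_value <- 4
exists("temp_value")    # TRUE

# Remove the variable from the workspace
rm(temp_value)
exists("temp_value")    # FALSE

# Optionally trigger garbage collection; invisible() hides the memory report
invisible(gc())
```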

Data Types

There are numerous data types in R that store various kinds of data. The four main types of data most likely to be used are numeric, character (string), Date/POSIXct (time-based) and logical (TRUE/FALSE). The type of data contained in a variable is checked with the class function.

class(x)

Numeric Data

As expected, R excels at running numbers, so numeric data is the most common type in R. The most commonly used numeric data is numeric. This is similar to a float or double in other languages. It handles integers and decimals, both positive and negative, and of course, zero. A numeric value stored in a variable is automatically assumed to be numeric. Testing whether a variable is numeric is done with the function is.numeric.

is.numeric(x)

Another important, if less frequently used, type is integer. As the name implies, this is for whole numbers only, no decimals. To assign an integer to a variable, append an L to the value. As with checking for a numeric, the is.integer function is used.

i <- 5L
is.integer(i)

Do note that, even though i is an integer, it will also pass a numeric check. R nicely promotes integers to numeric when needed. This is obvious when multiplying an integer by a numeric, but importantly it works when dividing an integer by another integer, resulting in a decimal number.
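The promotion rules described above can be verified directly in the console. The values here are chosen only for illustration.

```r
i <- 5L
is.integer(i)      # TRUE
is.numeric(i)      # TRUE, integers also pass the numeric check

# Integer times numeric gives numeric
class(4L * 2.8)    # "numeric"

# Integer divided by integer is promoted to numeric
5L / 2L            # 2.5
class(5L / 2L)     # "numeric"

# Integer times integer stays integer
class(5L * 2L)     # "integer"
```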

Character Data

Even though it is not explicitly mathematical, the character (string) data type is very common in statistical analysis and must be handled with care. R has two primary ways of handling character data: character and factor. While they may seem similar on the surface, they are treated quite differently.

x <- "data"
y <- factor("data")

When printed, x shows the word “data” encapsulated in quotes, while y shows the word “data” without quotes, followed by a second line of information about the levels of y. We will see more about this when we discuss Vectors.

Characters are case sensitive, so “Data” is different from “data” or “DATA”. To find the length of a character (or numeric) use the nchar function. But this will not work for factor data as seen below.

nchar(x)
nchar(y)
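Since nchar stops with an error on a factor, it can help to see the failure and a workaround side by side. The sketch below uses hypothetical names word and word_factor, wraps the failing call in tryCatch so the script keeps running, and converts the factor with as.character as one possible workaround.

```r
word <- "data"
word_factor <- factor("data")

nchar(word)     # 4
nchar(452)      # 3, numbers are converted to text first

# nchar() raises an error for factors; capture the message instead of stopping
result <- tryCatch(
  nchar(word_factor),
  error = function(e) conditionMessage(e)
)
result

# Converting the factor to character first works
nchar(as.character(word_factor))    # 4
```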

Date Type

Dealing with dates and times can be difficult in any language, and to further complicate matters R has numerous different types of dates. The most useful are Date and POSIXct. Date stores just a date while POSIXct stores a date and time. Both objects are actually represented as the number of days (Date) or seconds (POSIXct) since January 1, 1970.

date1 <- as.Date("2012-06-28")
date1
date2 <- as.POSIXct("2012-06-28 17:42")
date2

Easier manipulation of date and time objects can be accomplished using the lubridate and chron packages. Using functions such as as.numeric or as.Date does not merely change the formatting of an object but actually changes the underlying type.
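The underlying numbers can be inspected with as.numeric. In this sketch the time zone is set to UTC explicitly, which is an assumption added here so that the result does not depend on the local machine; the original example uses the default time zone.

```r
date1 <- as.Date("2012-06-28")
class(date1)                # "Date"
as.numeric(date1)           # 15519 days since 1970-01-01

# tz = "UTC" is an assumption made for reproducibility
date2 <- as.POSIXct("2012-06-28 17:42", tz = "UTC")
class(date2)                # "POSIXct" "POSIXt"
as.numeric(date2)           # 1340905320 seconds since 1970-01-01 00:00 UTC

# The converted value is a plain number, not a date any more
class(as.numeric(date1))    # "numeric"
```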

Logical Type

Logicals are a way of representing data that can be either TRUE or FALSE. Numerically, TRUE is the same as 1 and FALSE is the same as 0. So TRUE * 5 equals 5 while FALSE * 5 equals 0.

Similar to other types, logicals have their own test, using the is.logical function.

k <- TRUE
class(k)
is.logical(k)

R provides T and F as shortcuts for TRUE and FALSE, respectively, but it is best practice not to use them, as they are simply variables storing the values TRUE and FALSE and can be overwritten, which can cause a great deal of frustration. Logicals can result from comparing two numbers, or characters.
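The points above can be tried out in a few lines. This is an illustrative sketch; the compared values are arbitrary.

```r
# Logicals behave like 1 and 0 in arithmetic
TRUE * 5             # 5
FALSE * 5            # 0

# Comparisons produce logicals
2 == 3               # FALSE
2 != 3               # TRUE
2 < 3                # TRUE
"data" == "stats"    # FALSE

# T is an ordinary variable and can be overwritten, unlike TRUE
T <- FALSE
isTRUE(T)            # FALSE
rm(T)                # remove our copy so that T refers to TRUE again
isTRUE(T)            # TRUE
```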

Vectors

A vector is a collection of elements, all of the same type. For instance, c(1, 3, 2, 1, 5) is a vector consisting of the numbers 1, 3, 2, 1, 5, in that order. Similarly, c("R", "Excel", "SAS", "Excel") is a vector of the character elements "R", "Excel", "SAS", and "Excel". A vector cannot be of mixed type.
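If values of different types are combined anyway, R does not raise an error. It converts every element to a single common type. The sketch below, with a hypothetical vector named mixed, shows this coercion.

```r
class(c(1, 3, 2, 1, 5))    # "numeric"

# Mixing types: everything is converted to character
mixed <- c(1, "R", TRUE)
mixed                      # "1" "R" "TRUE"
class(mixed)               # "character"

# Mixing numbers and logicals: TRUE becomes 1
c(1, TRUE)                 # 1 1
class(c(1, TRUE))          # "numeric"
```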

Vectors play a crucial, and helpful, role in R. More than being simple containers, vectors in R are special in that R is a vectorized language. That means operations are applied to each element of the vector automatically, without the need to loop through the vector. This is a powerful concept that may seem foreign to people coming from other languages, but it is one of the greatest things about R.

Vectors do not have a dimension, meaning there is no such thing as a column vector or row vector. These vectors are not like the mathematical vector, where there is a difference between row and column orientation.

The most common way to create a vector is with c. The “c” stands for combine because multiple elements are being combined into a vector.

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

Now that we have a vector of the first ten numbers, we might want to multiply each element by 3. In R this is a simple operation using just the multiplication operator (*).

x*3

No loops are necessary. Addition, subtraction and division are just as easy. This also works for any number of operations.
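A short sketch of the other arithmetic operations on the same vector, including several operations combined in one expression:

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

x + 2              # 3 4 5 ... 12
x - 3              # -2 -1 0 ... 7
x / 4              # 0.25 0.50 0.75 ... 2.50
x ^ 2              # 1 4 9 ... 100
sqrt(x)            # functions are applied element by element as well

# Several operations combined in one expression
(x * 3 + 1) / 2    # 2.0 3.5 5.0 ... 15.5
```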

Earlier we created a vector of the first ten numbers by typing each one inside the c function. A shortcut is the : operator, which generates a sequence of consecutive numbers, in either direction.

1:10
10:1

Vector operations can be extended even further. Let’s say we have two vectors of equal length. Each of the corresponding elements can be operated on together.

x <- 1:10
y <- -5:4
## Add them
x+y

Things get a little more complicated when operating on two vectors of unequal length. The shorter vector gets recycled: its elements are repeated, in order, until they have been matched up with every element of the longer vector. If the length of the longer vector is not a multiple of the length of the shorter one, a warning is given.

x + c(1, 2)
x + c(1, 2, 3)
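To see exactly what recycling does, the following sketch captures the results and the warning message. The helper variable names (even_fit, uneven_fit, warning_text) are made up for illustration.

```r
x <- 1:10

# Length 2 divides length 10 evenly: c(1, 2) is repeated five times
even_fit <- x + c(1, 2)
even_fit       # 2 4 4 6 6 8 8 10 10 12

# Length 3 does not divide length 10: the result is still computed,
# but R issues a warning (suppressed here to keep the output clean)
uneven_fit <- suppressWarnings(x + c(1, 2, 3))
uneven_fit     # 2 4 6 5 7 9 8 10 12 11

# Capture the text of the warning instead of the result
warning_text <- tryCatch(
  x + c(1, 2, 3),
  warning = function(w) conditionMessage(w)
)
warning_text
```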

Comparisons also work on vectors. Here the result is a vector of the same length containing TRUE or FALSE for each element.

x <= 5

To test whether all the resulting elements are TRUE, use the all function. Similarly, the any function checks whether any element is TRUE.

x <- 10:1
y <- -4:5
## Checking any and all
any(x < y)
all(x < y)

Accessing individual elements of a vector is done using square brackets ([ ]). The first element of x is retrieved by typing x[1], the first two elements by x[1:2] and nonconsecutive elements by x[c(1, 4)].

x[c(1, 4)]

This works for all types of vectors whether they are numeric, logical, character and so forth. It is possible to give names to a vector either during creation or after the fact.

c(One="a", Two="y", Last="r")
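Naming a vector after the fact is done with the names function. This is a minimal sketch with a hypothetical vector w.

```r
w <- 1:3

# Assign names after the vector has been created
names(w) <- c("a", "b", "c")
w

# Elements can then be accessed by name as well as by position
w["b"]        # the element named "b", which is 2
w[2]          # same element, by position

names(w)      # "a" "b" "c"
```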

Factor Vectors

Factors are an important concept in R, especially when building models. Let’s create a simple vector of text data that has a few repeats.

q2 <- c("Hockey", "Lacrosse", "Hockey", "Water Polo", "Hockey", "Lacrosse")
q2Factor <- as.factor(q2)
q2Factor

Notice that after printing out every element of q2Factor, R also prints the levels of q2Factor. The levels of a factor are the unique values of that factor variable. Technically, R is giving each unique value of a factor a unique integer, tying it back to the character representation. This can be seen with as.numeric.

as.numeric(q2Factor)
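The link between the integer codes and the character labels can be made explicit with the levels function. The sketch rebuilds the same factor so that it runs on its own.

```r
q2 <- c("Hockey", "Lacrosse", "Hockey", "Water Polo", "Hockey", "Lacrosse")
q2Factor <- as.factor(q2)

# The unique values, sorted alphabetically by default
levels(q2Factor)        # "Hockey" "Lacrosse" "Water Polo"
nlevels(q2Factor)       # 3

# Each element is stored as the position of its level
as.numeric(q2Factor)    # 1 2 1 3 1 2

# Indexing the levels by those codes recovers the original text
levels(q2Factor)[as.numeric(q2Factor)]
```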

In ordinary factors the order of the levels does not matter and one level is no different from another. Sometimes, however, it is important to understand the order of a factor, such as when coding education levels. Setting the ordered argument to TRUE creates an ordered factor with the order given in the levels argument.

factor(x = c("High School", "College", "Masters", "Doctorate"),
       levels = c("High School", "College", "Masters", "Doctorate"),
       ordered = TRUE)
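The practical benefit of an ordered factor is that its elements can be compared and sorted by rank rather than alphabetically. The sketch below stores a shuffled set of values in a hypothetical variable named education.

```r
education <- factor(
  c("Masters", "High School", "Doctorate", "College"),
  levels = c("High School", "College", "Masters", "Doctorate"),
  ordered = TRUE
)

# Comparisons follow the order of the levels
education[1] > education[4]    # TRUE, Masters ranks above College
education >= "College"         # TRUE FALSE TRUE TRUE

# Sorting and max() also respect the level order
sort(education)
max(education)                 # Doctorate
```

With an ordinary, unordered factor these comparisons are not meaningful; R returns NA with a warning.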

Factors can drastically reduce the size of the variable because they are storing only the unique values, but they can cause headaches if not used properly.

Missing Data — NA vs NULL

Missing data plays a critical role in both statistics and computing, and R has two types of missing data, NA and NULL. While they are similar, they behave differently and that difference needs attention.

Often we will have data that has missing values for any number of reasons. Statistical programs use various techniques to represent missing data such as a dash, a period or even the number 99. R uses NA. NA will often be seen as just another element of a vector. is.na tests each element of a vector for missingness.

z <- c(1, 2, NA, 8, 3, NA, 3)
is.na(z)

NA is entered simply by typing the letters “N” and “A” as if they were normal text. This works for any kind of vector.
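The sketch below shows NA in a character vector and how NA propagates through calculations. The vector zChar is made up for illustration; na.rm is the standard argument many base R summary functions offer for dropping missing values.

```r
z <- c(1, 2, NA, 8, 3, NA, 3)
zChar <- c("Hockey", NA, "Lacrosse")

# NA works in character vectors too
is.na(zChar)             # FALSE TRUE FALSE

# Any calculation involving NA yields NA
1 + NA                   # NA
mean(z)                  # NA

# Drop the missing values explicitly to get a result
mean(z, na.rm = TRUE)    # 3.4

# Count the missing values
sum(is.na(z))            # 2
```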

Handling missing data is an important part of statistical analysis. There are many techniques depending on field and preference. One popular technique is multiple imputation, which is discussed in detail in Chapter 25 of Andrew Gelman and Jennifer Hill’s book Data Analysis Using Regression and Multilevel/Hierarchical Models, and is implemented in the mi, mice and Amelia packages.

NULL is the absence of anything. It is not exactly missingness, it is nothingness. Functions can sometimes return NULL and their arguments can be NULL. An important difference between NA and NULL is that NULL is atomical and cannot exist within a vector. If used inside a vector, it simply disappears.

z <- c(1, NULL, 3)
length(z)
z

Even though it was entered into the vector z, it did not get stored in z. In fact, z is only two elements long. The test for a NULL value is is.null. Since NULL cannot be a part of a vector, is.null is appropriately not vectorized.
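A short sketch of is.null in use, contrasted with NA. The variable d is hypothetical.

```r
d <- NULL
is.null(d)                # TRUE
is.null(7)                # FALSE

# NULL has no length, while NA counts as one element
length(NULL)              # 0
length(NA)                # 1

# NULL vanishes inside a vector, so the vector itself is not NULL
length(c(1, NULL, 3))     # 2
is.null(c(1, NULL, 3))    # FALSE, a single value rather than one per element
```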

Pipes

A new paradigm for calling functions in R is the pipe. The pipe from the magrittr package works by taking the value or object on the left-hand side of the pipe and inserting it into the first argument of the function that is on the right-hand side of the pipe. A simple example would be using a pipe to feed x to the mean function.

library(magrittr)
x <- 1:10
x %>% mean

Pipes are most useful when used in a pipeline to chain together a series of function calls. Given a vector z that contains numbers and NAs, we want to find out how many NAs are present. Traditionally, this would be done by nesting functions.

z <- c(1, 2, NA, 8, 3, NA, 3) 
sum(is.na(z))
## Using Pipes
z %>% is.na %>% sum

Pipes read more naturally in a left-to-right fashion, making the code easier to comprehend. Using pipes is negligibly slower than nesting function calls, though as Hadley Wickham notes, pipes will not be a major bottleneck in code.

When piping an object into a function and not setting any additional arguments, no parentheses are needed. However, if additional arguments are used, then they should be named and included inside the parentheses after the function call. The first argument is not used, as the pipe already inserted the left-hand object into the first argument.

z %>% mean(na.rm=TRUE)

Data come in many types, and R is well equipped to handle them. In addition to basic calculations, R can handle numeric, character and time-based data. One of the nicer parts of working with R, although one that requires a different way of thinking about programming, is vectorization. This allows operating on multiple elements in a vector simultaneously, which leads to faster and more mathematical code. In the next article, we will discuss advanced data structures available in R.

Advanced-Data Structures In R — 2

Do share your thoughts and support by commenting and sharing the article among your peer groups.


Vivekanandan Srinivasan
Analytics Vidhya

An analytics professional with over six years of experience spanning across predictive modelling, statistical analysis and big data technologies.