Programming in R: A Case Study

Khushee Kapoor
Published in Hashtag by IECSE
Oct 5, 2021 · 14 min read

This introductory tutorial walks you through the basic statistical features available in R and how to use them. The tutorial gives you hands-on learning using the Palmer Penguins dataset.

What is R?

R is a programming language used for the statistical and numerical analysis of complex datasets. It provides many libraries that make data analysis and graphical representation remarkably easy to achieve.

Software programmers, statisticians, data scientists, and data miners are using R to develop statistical software.

Prerequisites for the Tutorial

There are no strict prerequisites, but basic programming knowledge in any language will help you follow the Exploratory Data Analysis (EDA) and Data Visualization sections. And if you have worked with Python before, a lot of things will look familiar.

How is R different from Python?

R ships with a wealth of data analysis and visualization libraries, many of which are easier to pick up and use than their Python counterparts.

R was built by statisticians and leans heavily into statistical models and specialized analytics. Data scientists use R for deep statistical analysis, supported by just a few lines of code and beautiful data visualizations. For example, you might use R for customer behavior analysis or genomics research.

The visualizations in R are much more pleasing to the eye, making it easier to comprehend the dataset. It has excellent libraries like dplyr and dbplyr for data wrangling, ggplot2 for data visualization, mlr3 and caret for machine learning, the tidyverse for general data science tasks, and so on. All these libraries and many more enable unmatched data exploration and experimentation.

Even though Python is more widely used because of its broad range of applications, including production and deployment, R scores higher from a statistical analysis point of view.

What will I learn at the end of the Tutorial?

By the end, you will have a good understanding of packages and libraries in R, and you will be able to analyze your own custom datasets using it.

Palmer Penguins: A Case Study

Palmer Penguins

Now that we are familiar with R, let’s try coding some basic commands on one of R’s most popular datasets — Palmer Penguins. We urge you to code along with us to experience the power of R.

To begin, we need to load the Palmer Penguins dataset in the environment.

To install any package in an R environment, we follow the syntax:

install.packages("package name")

This is a one-time step. Once the package is installed on your system, you only need to load it in the environment before using it. To load a package, we use the syntax:

library(package name)

Take note that the # symbol denotes a comment, which we insert in the code to make it more readable.
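As a concrete example, here is the full two-step setup for the Palmer Penguins data, which ships in the palmerpenguins package:

# one-time step: install the package on your system
install.packages("palmerpenguins")

# load the package into the environment (do this every session)
library(palmerpenguins)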

Basic Exploratory Data Analysis

Now that the Palmer Penguins package is all loaded and set to be manipulated, let’s get to know it! Remember that the name of the dataset in this package is ‘penguins’.

The conventional way to start any dataset analysis is to view its first few rows. This gives us an idea of how the dataset is built without overloading the environment with an enormous number of rows. To do this, we use the function:

head(name of the dataset)

Default Head Function

By default, the head() function displays the first 6 rows of the dataset. However, if we want to view more or fewer rows, we can specify the number of rows in the parentheses:

head(name of the dataset, number of rows)

Custom Head Function

Similarly, we can also view the last few rows of the dataset using the function:

tail(name of the dataset)

Default Tail Function

Just like the head() function, the tail() function displays the last 6 rows by default. However, we can tailor it to our needs and display as many rows as we want by adding a parameter:

tail(name of the dataset, number of rows)

Custom Tail Function
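Putting the four variants together on our dataset, a quick sketch:

# first 6 rows (the default)
head(penguins)

# first 10 rows
head(penguins, 10)

# last 6 rows (the default)
tail(penguins)

# last 3 rows
tail(penguins, 3)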

To know the dimensions of the dataset, we use the dim() function. The syntax of the function is:

dim(name of the dataset)

Dimensions
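For our dataset, the call and its output look like this:

dim(penguins)
# [1] 344   8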

Here we can see that our dataset has 344 data points and 8 features: species, island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex, and year. From the column headers in the tibble preview shown earlier, we can also see that species, island, and sex have the data type factor <fct> (categorical data); bill_length_mm and bill_depth_mm have the data type double <dbl> (numbers with decimals); and flipper_length_mm, body_mass_g, and year have the data type integer <int> (numbers without a decimal point).

A simpler way to learn all these details at once is the str() (structure) function. The syntax of the str() function is as follows:

str(name of the dataset)

Structure Function

Using the str() function, we get to know the dimensions of the dataset, the data type of the columns, and the first few values as well.

Here, we also get deeper insights, such as the different ‘levels’ (also called classes or categories) in the columns with the factor data type. For example, the column sex has two levels (female and male), and the column species has three levels (Adelie, Chinstrap, and Gentoo).

Awesome! Now that we are introduced to the Palmer Penguins dataset, let’s apply some functions that will help us measure some stats about the dataset.

We are using the library ‘tidyverse’. Like any other package, we first install the package in our system (reminder: one-time step only) and then load it in the environment.

Before we begin, let’s meet a handy operator: %>%, the pipe operator. It builds a pipeline of functions that are performed one after the other, each using the result of the previous one. Note that the pipe operator must sit at the end of every line except the last, so R knows the pipeline continues.

To start, we will use the count() function. This function is generally applied to columns having the factor data type. It returns the count of each level in a column. The syntax is as follows:

dataset name %>%
  count(name of the column)

Count Function
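A sketch of the pipeline, with the counts it reports (described below):

penguins %>%
  count(species)

# species       n
# Adelie      152
# Chinstrap    68
# Gentoo      124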

As we can see, the count() function returns the count of each species in the column. There are 152 penguins of the ‘Adelie’ species, 68 penguins of the ‘Chinstrap’ species, and 124 penguins of the ‘Gentoo’ species.

Next, we use the summarize() function, which returns a single value after performing an operation on a column. The syntax is:

dataset name %>%
  summarize(new column = function(column name))

Summarize Function
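A sketch of the call that produces the result described below:

penguins %>%
  summarize(avg_body_mass = mean(body_mass_g, na.rm = TRUE))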

This returns the average body mass of the penguins as a new column called avg_body_mass. We use the mean() function to calculate the average; similarly, median() gives the column’s median. (Base R has no built-in mode function, so a custom helper, often named getmode(), is typically written for that.) The argument na.rm = TRUE tells the function to ignore any null values in the column, which would otherwise make the result NA.

Let’s add another function to the pipe. Now we are going to find the average body mass group-wise. The summarize() function remains the same, but before it, we add the group_by() function.

The group_by() function groups the dataset by the levels of a factor column and then applies the next functions group/factor-wise. The syntax of the group_by() function:

dataset name %>%
  group_by(column name) %>%
  summarize(new column = function(column name))

Group By Function
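With summarize() completing the pipeline, the group-wise version looks like this:

penguins %>%
  group_by(species) %>%
  summarize(avg_body_mass = mean(body_mass_g, na.rm = TRUE))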

Unlike the previous result, here we see that the summarize() function has calculated the mean for each level of the species separately and stored it in the avg_body_mass column.

Now, what if we want to filter out only those penguins whose body mass is less than the average for their species? The filter() function comes to our rescue!

In the pipe, we add in the filter() function after the group_by() function. The syntax for the filter() function is as follows:

dataset name %>%
  filter(condition)

Filter Function
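A sketch of the full pipeline for this example:

penguins %>%
  group_by(species) %>%
  filter(body_mass_g < mean(body_mass_g, na.rm = TRUE))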

Here, we see the penguins whose body mass is less than the average body mass of their species. Recollect that we used the mean() function with the na.rm = TRUE argument for the calculation. We use the relational operator ‘<’ (less than) to build the condition. We can also use ‘>’ (greater than), ‘>=’ (greater than or equal to), ‘<=’ (less than or equal to), and ‘==’ (equal to) to build conditions. The ‘!’ operator represents not and reverses a condition.

Moving on, if we want to arrange the penguins whose body mass is less than their species’ average in a particular order, we use the arrange() function, with the syntax:

dataset name %>%
  arrange(column name)

Arrange Function
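Extending the previous pipeline with a descending sort:

penguins %>%
  group_by(species) %>%
  filter(body_mass_g < mean(body_mass_g, na.rm = TRUE)) %>%
  arrange(desc(body_mass_g))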

Here, we see that the penguins having body mass less than their species’ average are sorted in descending order of body mass. By default, the arrange() function sorts the dataset in ascending order; because we have wrapped the column in desc(), our dataset is arranged in descending order.

Yay! We have explored our dataset and various statistical measures of it.

Let’s proceed to visualize our data using the library ‘ggplot2’. Like before, we first install the package on our system (reminder: this is a one-time step) and then load it in the environment.

The basic syntax to make any plot using ggplot2 is as follows:

ggplot(dataset, aes(x = .., y = ..)) +
  geom_shape()

Here, aes stands for the aesthetics of the plot, and geom stands for the geometric object. Note that the shape after geom_ changes for every kind of plot we make. Also, take care that the ‘+’ operator is placed at the end of every line except the last.

To begin with, we will make a basic scatter plot. A scatter plot plots every individual data point in the dataset using shapes on a graph. The syntax for it is as follows:

ggplot(dataset, aes(x = .., y = ..)) +
  geom_point()

Scatter Plot
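A sketch of the call behind the plot described below:

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()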

The above plot represents the variation of body mass with respect to the flipper length of the penguin. Note that passing the values of both x and y is compulsory in a scatter plot.

Next, we will plot a line graph. A line graph plots every point in the dataset on the graph and joins each point to the ones before and after it with a line. The syntax to plot a line graph is as follows:

ggplot(dataset, aes(x = .., y = ..)) +
  geom_line()

Line Graph
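The corresponding call, assuming the same two columns as the scatter plot:

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_line()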

The line graph above represents the variation of body mass with respect to the flipper length of a penguin. Like a scatter plot, passing both the values of x and y is compulsory in a line graph.

Moving on, we will plot a bar graph. In a bar graph, every data point is represented by a bar whose length is proportional to the value it represents. To build a bar graph, we use the following syntax:

ggplot(dataset, aes(x = ..)) +
  geom_bar()

Bar Plot
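A sketch of the call, with body mass on the x-axis as described below:

ggplot(penguins, aes(x = body_mass_g)) +
  geom_bar()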

As we can see, the bar graph we have plotted represents the number of penguins having a particular body mass. Unlike the scatter and line plots, we have to pass only x to the bar graph function, and by default, the y-axis will represent the count of the x-axis variable.

Similar to the bar graph, a histogram plots bars but for data points grouped in bins. The bins are continuous number ranges.

To make a histogram, we use the syntax:

ggplot(dataset, aes(x = ..)) +
  geom_histogram()

Histogram
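The corresponding sketch for the histogram:

ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram()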

Referring to the plot above, we can see that the system has grouped the penguins’ body mass into ranges and plotted bars whose lengths are proportional to the count of penguins with body mass in each range. Like a bar graph, we only have to pass the x variable; by default, the y-axis measures the count of the x-axis variable.

The difference between a bar plot and a histogram is that bar plots work with both categorical and numeric data types, whereas histograms work only with numeric data.

All these plots convey a lot of information, but they look a little boring, don’t they? Let’s make them more visually appealing!

To add a dash of colour, we include the colour parameter inside the aes() parentheses. The colour parameter highlights the points based on the levels of a factor column.

Colour Parameter
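The plot above likely corresponds to a call along these lines (assuming the same flipper-length and body-mass axes as before):

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g, colour = species)) +
  geom_point()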

Here, the points are highlighted based on the species of penguins. R also provides a legend to make sense of the colours used. In this case, the data related to the Adelie penguins are orange in colour, the Chinstrap are green and the Gentoo are blue.

Similarly, we can change the size of the points on the basis of the levels of a factor column. Just add the size parameter inside the aes() parentheses.

Size Parameter

In this case, the different species of penguins can also be separated by the size of their points: the Adelie penguins have small points, the Chinstrap penguins have medium points, and the Gentoo penguins have big points.

If we want to go even further, we can change the shape of the points on the basis of the levels of a factor column. To do this, we add the shape parameter inside the aes() parentheses.

Shape Parameter
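Mapping all three aesthetics at once, a sketch of such a plot (note that ggplot2 warns when size is mapped to a discrete variable, though the plot still renders):

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g,
                     colour = species, size = species, shape = species)) +
  geom_point()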

Here, the three species of penguins are very clearly distinguished on the basis of colour, size and shape. Notice, the legend updates itself accordingly.

To make the graph more readable, we will add in some text. The different ways we can insert text are:

  • Title: Appears at the top left of the plot in a big font size.
  • Subtitle: Appears at the top left of the plot beneath the Title in a small font size.
  • Caption: Appears as a note at the bottom right of the plot in a small font size.
  • X: Appears as the label of the x-axis, in a medium font size.
  • Y: Appears as the label of the y-axis, in a medium font size.

The text is added in the labs() function as follows:

ggplot(dataset, aes(x = .., y = ..)) +
  geom_point() +
  labs(title = .., subtitle = .., caption = .., x = .., y = ..)

Labelled Visualization
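A sketch with hypothetical label text (the exact wording in the original plot may differ):

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g, colour = species)) +
  geom_point() +
  labs(title = "Body Mass vs Flipper Length",
       subtitle = "Palmer Penguins dataset",
       caption = "Source: palmerpenguins package",
       x = "Flipper Length (mm)",
       y = "Body Mass (g)")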

A Sneak Peek into Advanced Exploratory Data Analysis

Dimensionality is a curse. Often we need to drop some columns to avoid overfitting our models, which hurts their accuracy.

To do this, there are two ways:

  • By indices
  • By labels

To select columns on the basis of indices, we use [..,..] to access them in the dataset. The syntax is as follows:

dataset[row indices, column indices]

If we want to include all of the rows or all of the columns, we leave the space blank. If we want to access only specific columns, we put in the index/ range of indices.

Dropping Column by Indices
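The subsetting shown in the screenshot corresponds to:

# all rows, columns 3 to 6 only
penguins[, 3:6]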

Here, we have left the row indices empty, meaning we want to include all rows in the new dataset. We want the columns with indices 3, 4, 5, and 6, so we have used the expression 3:6, which generates the required sequence.

Caution: the changes we have made are not permanent. These operations return a copy of the dataset with the filters applied. To make the changes stick, we assign the result back to the original dataset variable.

Similarly, if we want to drop rows, we put the indices before the comma. Here, we have also used the ‘-’ operator, which omits the mentioned rows and retains the others. Also notice that we have kept the column indices empty; hence, we retain all the columns.

Dropping Rows by Indices
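For instance, assuming the first three rows are the ones being dropped (the exact indices may differ from the original screenshot):

# omit rows 1 to 3, keep all columns
penguins[-c(1:3), ]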

We can see that the number of rows has been reduced by 3 (from 344 to 341).

If we want to access the columns by their names, we use the select() function. This comes in the ‘tidyverse’ package. The syntax is as follows:

select(dataset name, column names)

Dropping Columns by Name
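In code:

# drop the sex and year columns by name
select(penguins, -c(sex, year))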

In this case, we do not want the columns sex and year. To drop them, we make a vector of column names using the c() (combine) function and put a ‘-’ sign in front of the vector to omit those columns.

Another major problem in datasets is the presence of null values. To count the total number of null values, we use sum(is.na(..)) as follows:
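A sketch of the call and the total it reports:

sum(is.na(penguins))
# [1] 19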

From above, we can see that the dataset has 19 null values in total.

To remove these values, we use the na.omit(..) function as shown below:

Dropping Missing Values
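A sketch of the call:

# returns a copy of the dataset with all incomplete rows removed
na.omit(penguins)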

Here, we can see that 11 rows were removed (as the number of rows dropped from 344 to 333). This means that some rows had multiple null values.

In the real world, we remove columns on the basis of correlation. To know which columns are highly correlated, we will plot a correlation heatmap. However, correlation can only be calculated between numeric columns. So, let’s select only the numeric columns from the dataset and make the change permanent.

To do this, we have used the select_if() and is.numeric functions. The select_if() function is a part of the ‘tidyverse’ package and subsets the column based on a condition. The is.numeric function checks whether a column is of the numeric data type. The syntax of the select_if() function is:

select_if(dataset name, condition)

Select If Function
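With the assignment that makes the change permanent, the step looks like this:

# keep only the numeric columns, overwriting the original dataset
penguins <- select_if(penguins, is.numeric)

# preview the result
penguins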

As we can see, the penguins dataset now has only numeric columns, viz. bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, and year.

Note: just by typing the name of the dataset, we can preview it.

To avoid any calculation error, let’s drop rows with missing values, and make it permanent.

Dropped Missing Values
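A sketch of the permanent version:

# overwrite the dataset with only its complete rows
penguins <- na.omit(penguins)

# confirm there are no missing values left
sum(is.na(penguins))
# [1] 0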

As we can see, there are no missing values in the dataset.

All set now! Let’s plot the correlation heatmap.

We will be using the cor() function to calculate the correlation between the columns in the dataset. Let’s store this in a variable called corr.

To plot the correlation heatmap, we need to install the ‘corrplot’ library. Before you see the next code snippet, write your own code! Let’s test your learning.

Now let’s plot the heatmap. The syntax is rather simple:

corrplot(correlation between columns)

Correlation Heatmap
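Putting the last few steps together, a sketch of the whole sequence:

# correlation matrix of the numeric columns
corr <- cor(penguins)

# one-time installation, then load
install.packages("corrplot")
library(corrplot)

# plot the correlation heatmap
corrplot(corr)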

Ta-da! As we can see, flipper_length_mm is highly correlated with bill_length_mm and body_mass_g. That means we will have to drop it to improve the accuracy of our model. Nuh-uh, you’ll have to write the code for this one!

Psst. scroll up for the cheat code!

Congratulations! You can now perform basic statistical operations on datasets and make beautiful visualizations using R. At this point, we encourage you to explore further and move forward in your journey to become an R expert. Here are some resources that might help:

https://cran.r-project.org/doc/manuals/R-intro.pdf

https://www.tidyverse.org/

https://ggplot2.tidyverse.org/reference/

Special thanks to Lajith Puthuchery and Nishant Dash for their valuable contributions to the article.
