Programming in R: A Case Study
This introductory tutorial walks you through the basic statistical features available in R and how to use them. The tutorial gives you hands-on learning using the Palmer Penguins dataset.
What is R?
R is a programming language that is used in the statistical and numerical analysis of complex datasets. R provides multiple libraries, which makes data analysis and graphical representation amazingly easy to achieve.
Software programmers, statisticians, data scientists, and data miners are using R to develop statistical software.
Prerequisites for the Tutorial
There are no strict prerequisites, but basic programming knowledge in any language will help you follow the Exploratory Data Analysis (EDA) and Data Visualization sections. If you have worked with Python before, a lot of this will look familiar.
How is R different from Python?
R offers an unusually deep set of data analysis and visualization libraries, and for statistical work they are often easier to pick up and apply than their Python counterparts.
R is built by statisticians and leans heavily into statistical models and specialized analytics. Data scientists use R for deep statistical analysis, supported by just a few lines of code and beautiful data visualizations. For example, you might use R for customer behavior analysis or genomics research.
The visualizations in R are pleasing to the eye, making it easier to comprehend the dataset. It has excellent libraries such as dplyr and dbplyr for data wrangling, ggplot2 for data visualization, mlr3 and caret for machine learning, and the tidyverse for general data science tasks. All of these libraries, and many more, support rich data exploration and experimentation.
Even though Python is more widely used because it spans everything from analysis to production and deployment, R scores higher from a statistical analysis point of view.
What will I learn from this tutorial?
By the end, you should have a good understanding of the packages and libraries in R, and you will be able to analyze your own custom datasets using R.
Palmer Penguins: A Case Study
Now that we are familiar with R, let’s try coding some basic commands on one of R’s most popular datasets — Palmer Penguins. We urge you to code along with us to experience the power of R.
To begin, we need to load the Palmer Penguins dataset in the environment.
To install any package in an R environment, we follow the syntax:
install.packages("package name")
This is a one-time step. Once the package is installed on your system, you only need to load it in the environment before using it. To load a package, we use the syntax:
library(package name)
Take note that the # symbol denotes a comment, which we insert in the code to make it more readable.
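For this tutorial, the Palmer Penguins data lives in the palmerpenguins package, so the two steps look like this:

```r
# One-time installation of the palmerpenguins package
install.packages("palmerpenguins")

# Load the package into the environment (do this every session)
library(palmerpenguins)
```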
Basic Exploratory Data Analysis
Now that the Palmer Penguins package is all loaded and set to be manipulated, let’s get to know it! Remember that the name of the dataset in this package is ‘penguins’.
The conventional way to start any dataset analysis is to view its first few rows. This gives us an idea of how the dataset is built without overloading the environment with an enormous number of rows. To do this, we use the function:
head(name of the dataset)
By default, the head() function displays the first 6 rows of the dataset. However, if we want to view more or fewer rows, we can specify the number of rows as a second argument:
head(name of the dataset, number of rows)
Similarly, we can also view the last few rows of the dataset using the function:
tail(name of the dataset)
Just like the head() function, the tail() function displays the last 6 rows by default. However, we can display as many rows as we want by adding a second argument:
tail(name of the dataset, number of rows)
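With our dataset, whose name is 'penguins', these calls look like:

```r
head(penguins)      # first 6 rows
head(penguins, 10)  # first 10 rows
tail(penguins)      # last 6 rows
tail(penguins, 10)  # last 10 rows
```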
To know the dimensions of the dataset, we use the dim() function. The syntax of the function is:
dim(name of the dataset)
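For the penguins dataset, this returns the number of rows and columns:

```r
dim(penguins)
# [1] 344   8
```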
Here we can see that our dataset has 344 data points and 8 features: species, island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex, and year. We can also see the data types: species, island, and sex are factors <fct> (categorical data); bill_length_mm and bill_depth_mm are doubles <dbl> (numbers with decimals); and flipper_length_mm, body_mass_g, and year are integers <int> (numbers without a decimal point).
A simpler way to get all these details at once is the str() (structure) function. Its syntax is as follows:
str(name of the dataset)
Using the str() function, we get to know the dimensions of the dataset, the data type of the columns, and the first few values as well.
Here, we also get deeper insights, such as the different 'levels' (classes or categories) in the columns with the factor data type. For example, the column sex has two levels, female and male, and the column species has three levels: Adelie, Chinstrap, and Gentoo.
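For our dataset, the call is simply:

```r
# Prints the dimensions, each column's data type, factor levels,
# and the first few values of every column
str(penguins)
```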
Awesome! Now that we are introduced to the Palmer Penguins dataset, let’s apply some functions that will help us measure some stats about the dataset.
We are using the library ‘tidyverse’. Like any other package, we first install the package in our system (reminder: one-time step only) and then load it in the environment.
Before we begin, let's look at a handy operator: %>%, the pipe operator. It chains functions together so that each step runs on the result of the previous one. Note that the pipe operator goes at the end of every line except the last.
To start, we will use the count() function. This function is generally applied to columns having the factor data type. It returns the count of each level in a column. The syntax is as follows:
dataset name %>%
count(name of the column)
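Applied to the species column of our dataset, this gives:

```r
penguins %>%
  count(species)
# species       n
# Adelie      152
# Chinstrap    68
# Gentoo      124
```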
As we can see, the count() function returns the count of each species in the column. There are 152 penguins of the ‘Adelie’ species, 68 penguins of the ‘Chinstrap’ species, and 124 penguins of the ‘Gentoo’ species.
Next, we use the summarize() function that returns a particular value after performing an operation. The syntax is:
dataset name %>%
summarize(new column = function(column name))
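For example, computing the average body mass as a new column called avg_body_mass:

```r
penguins %>%
  summarize(avg_body_mass = mean(body_mass_g, na.rm = TRUE))
# avg_body_mass is roughly 4202 (grams)
```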
This function returns the average of the body mass of the penguins as a new column called avg_body_mass. We use the mean() function to calculate the average; median() gives the column's median (base R has no built-in statistical mode function, though a custom one, often named getmode(), is easy to write). The argument na.rm=TRUE ignores any null values in the column, which would otherwise make the result NA.
Let’s add another function to the pipe. Now, we are going to find the average of the body mass group-wise. The summarize() function remains the same, but before it, we add the group_by() function.
The group_by() function groups the dataset by the levels of a factor column and then applies the next functions group/factor-wise. The syntax of the group_by() function:
dataset name %>%
group_by(column name) %>%
summarize(new column = function(column name))
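Putting it together for species-wise averages:

```r
penguins %>%
  group_by(species) %>%
  summarize(avg_body_mass = mean(body_mass_g, na.rm = TRUE))
# Adelie ~3701, Chinstrap ~3733, Gentoo ~5076 (grams)
```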
Unlike the previous result, here we see that the summarize() function has calculated the mean for each level of the species separately and stored it in the avg_body_mass column.
Now, what if we want to filter out only those penguins who have body mass less than the average of their species? The filter() function comes to our rescue!
In the pipe, we add in the filter() function after the group_by() function. The syntax for the filter() function is as follows:
dataset name %>%
filter(condition)
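For our question about below-average penguins, the condition compares each penguin against its group mean:

```r
penguins %>%
  group_by(species) %>%
  filter(body_mass_g < mean(body_mass_g, na.rm = TRUE))
```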
Here, we see the penguins whose body mass is less than the average body mass of their species. Recall that we used the mean() function with na.rm=TRUE for the calculation. We use the relational operator ‘<’ (less than) to build the condition. We can also build conditions with ‘>’ (greater than), ‘>=’ (greater than or equal to), ‘<=’ (less than or equal to), and ‘==’ (equal to). The ‘!’ operator represents not and reverses a condition.
Moving on, if we want to arrange the penguins whose body mass is below their species average in a particular order, we use the arrange() function, with the syntax:
dataset name %>%
arrange(column name)
Here, we see that the penguins with body mass below their species average are sorted in descending order of body mass. By default, the arrange() function sorts the dataset in ascending order; because we have wrapped the column in desc(), our dataset is arranged in descending order.
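The full pipe looks like this:

```r
penguins %>%
  group_by(species) %>%
  filter(body_mass_g < mean(body_mass_g, na.rm = TRUE)) %>%
  arrange(desc(body_mass_g))
```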
Yay! We have explored our dataset and various statistical measures of it.
Let’s proceed to visualize our data using the library ‘ggplot2’. Like before, we first install the package on our system (reminder: this is a one-time step) and then load the package in the environment.
The basic syntax to make any plot using ggplot2 is as follows:
ggplot(dataset, aes(x=.., y=..)) +
geom_shape()
Here, aes stands for the aesthetics of the plot, and geom stands for the geometric object. Note that the shape after geom changes for every kind of plot we make. Also, take care that the ‘+’ operator is placed at the end of every line except the last.
To begin with, we will make a basic scatter plot. A scatter plot plots every individual data point in the dataset using shapes on a graph. The syntax for it is as follows:
ggplot(dataset, aes(x=.., y=..)) +
geom_point()
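For example, plotting body mass against flipper length:

```r
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()
```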
The above plot represents the variation of body mass with respect to the flipper length of the penguin. Note that passing the values of both x and y is compulsory in a scatter plot.
Next, we will plot a line graph. A line graph plots every point in the dataset and connects consecutive data points with a line. The syntax to plot a line graph is as follows:
ggplot(dataset, aes(x=.., y=..)) +
geom_line()
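Using the same two columns:

```r
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_line()
```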
The line graph above represents the variation of body mass with respect to the flipper length of a penguin. Like a scatter plot, passing both the values of x and y is compulsory in a line graph.
Moving on, we will plot a bar graph. In a bar graph, every data point is represented by a bar whose length is proportional to the value they represent. To build a bar graph, we use the following syntax:
ggplot(dataset, aes(x=..)) +
geom_bar ()
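For example, counting penguins by body mass:

```r
ggplot(penguins, aes(x = body_mass_g)) +
  geom_bar()
```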
As we can see, the bar graph we have plotted represents the number of penguins having a particular body mass. Unlike the scatter and line plots, we have to pass only x to the bar graph function, and by default, the y-axis will represent the count of the x-axis variable.
Similar to the bar graph, a histogram plots bars but for data points grouped in bins. The bins are continuous number ranges.
To make a histogram, we use the syntax:
ggplot(dataset, aes(x=..)) +
geom_histogram()
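For the body mass column:

```r
ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram()
```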
Referring to the above plot, we can see that the system has grouped the body mass of penguins into ranges and then plotted bars with length proportional to the count of penguins having body mass in that range. Like a bar graph, we only have to pass the x variable, and by default the y axis measures the count of the x axis variables.
The difference between a bar plot and a histogram is that bar plots work with both categorical and numeric data types, whereas histograms work only with the numeric data type.
All these plots convey a lot of information, but they look a little boring, don’t they? Let’s make them more visually appealing!
To add a dash of colour, we include the colour parameter inside aes(). The colour parameter highlights the points based on the levels of a factor column.
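For example, colouring the scatter plot by species:

```r
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g, colour = species)) +
  geom_point()
```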
Here, the points are highlighted based on the species of penguins. R also provides a legend to make sense of the colours used. In this case, the data related to the Adelie penguins are orange in colour, the Chinstrap are green and the Gentoo are blue.
Similarly, we can change the size of the points based on the levels of a factor column. Just add the size parameter inside aes().
In this case, the different species of penguins can also be separated by point size: the Adelie penguins get small points, the Chinstrap penguins medium points, and the Gentoo penguins large points.
If we want to go even further, we can change the shape of the points based on the levels of the factor column. To do this, we add the shape parameter inside aes().
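Combining all three mappings (note that ggplot2 will warn that mapping size to a discrete variable is not advised, but it still draws the plot):

```r
ggplot(penguins,
       aes(x = flipper_length_mm, y = body_mass_g,
           colour = species, size = species, shape = species)) +
  geom_point()
```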
Here, the three species of penguins are very clearly distinguished on the basis of colour, size and shape. Notice, the legend updates itself accordingly.
To make the graph more readable, we will add in some text. The different ways we can insert text are:
- Title: Appears at the top left of the plot in a big font size.
- Subtitle: Appears at the top left of the plot beneath the Title in a small font size.
- Caption: Appears as a note at the bottom right of the plot in a small font size.
- X: Appears as the label of the x-axis, in a medium font size.
- Y: Appears as the label of the y-axis, in a medium font size.
The text is added in the labs() function as follows:
ggplot(dataset, aes(x=.., y=..)) +
geom_point() +
labs(title=.., subtitle=.., caption=.., x=.., y=..)
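A filled-in example (the label strings here are just illustrative):

```r
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g, colour = species)) +
  geom_point() +
  labs(title = "Body mass vs flipper length",
       subtitle = "Palmer Penguins dataset",
       caption = "Source: palmerpenguins package",
       x = "Flipper length (mm)",
       y = "Body mass (g)")
```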
A sneak into Advanced Exploratory Data Analysis
Dimensionality can be a curse: we often need to drop some columns to avoid overfitting, which hurts a model’s accuracy on unseen data.
To do this, there are two ways:
- By indices
- By labels
To select columns on the basis of indices, we use [..,..] to access them in the dataset. The syntax is as follows:
dataset [row indices, column indices]
If we want to include all of the rows or all of the columns, we leave that position blank. If we want to access only specific columns, we put in an index or a range of indices.
Here, we have left the indices of rows empty, meaning we want to include all rows in the new dataset. Also, we want columns with indices 3, 4, 5, and 6. So we have put the expression 3:6 which generates the required series.
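In code:

```r
# All rows, columns 3 to 6 (the four measurement columns)
penguins[, 3:6]
```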
Caution: these changes are not permanent. These operations return a copy of the dataset with the filters applied. To change the original dataset, we assign the result back to the original dataset’s name.
Similarly, if we want to drop rows, we put the indices before the comma. Here, we have also used the ‘-’ operator, which omits the listed rows and retains the others. Notice that we have left the column indices empty, so all the columns are retained.
We can see that the number of rows has decreased by 3 (from 344 to 341).
If we want to access the columns by their names, we use the select() function. This comes in the ‘tidyverse’ package. The syntax is as follows:
select(dataset name, column names)
In this case, we do not want the columns sex and year. To drop them, we have made a vector of column names using the c() (combine) function and put a ‘-’ sign in front of the vector to omit those columns.
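In code:

```r
# Keep every column except sex and year
select(penguins, -c(sex, year))
```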
Another major problem in datasets is the presence of null values. To count the number of null values, we use the function sum(is.na(..)) as follows:
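For our dataset:

```r
sum(is.na(penguins))
# [1] 19
```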
From above, we can see that the dataset has 19 null values in total.
To remove these values, we use the na.omit(..) function as shown below:
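For example:

```r
nrow(na.omit(penguins))
# [1] 333
```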
Here, we can see that 11 rows were removed (the number of rows dropped from 344 to 333). Since the dataset has 19 null values spread over only 11 rows, some rows must have contained multiple null values.
In the real world, we remove columns on the basis of correlation. To know which columns are highly correlated, we will plot a correlation heatmap. However, correlation can only be calculated between numeric columns. So, let’s select only the numeric columns from the dataset and make the change permanent.
To do this, we use the select_if() and is.numeric functions. The select_if() function is part of the ‘tidyverse’ package and subsets the columns based on a condition. The is.numeric function checks whether a column is of the numeric data type. The syntax of the select_if() function is:
select_if(dataset name, condition)
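Assigning the result back to penguins makes the change permanent:

```r
# Keep only the numeric columns, overwriting the working copy
penguins <- select_if(penguins, is.numeric)
```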
As we can see, the penguins dataset now has only the numeric columns: bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, and year.
Note: just typing the name of the dataset prints a preview of it.
To avoid any calculation error, let’s drop rows with missing values, and make it permanent.
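As before, na.omit() does the job, and assigning the result back makes it permanent:

```r
penguins <- na.omit(penguins)
sum(is.na(penguins))  # now 0
```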
As we can see, there are no missing values in the dataset.
All set now! Let’s plot the correlation heatmap.
We will be using the cor() function to calculate the correlation between the columns in the dataset. Let’s store this in a variable called corr.
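Assuming the dataset has already been reduced to its numeric columns with missing rows removed, as we did above, the call is:

```r
# Pairwise correlations between all numeric columns
corr <- cor(penguins)
```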
To plot the correlation heatmap, we need to install the ‘corrplot’ library. Before you see the next code snippet, write your own code! Let’s test your learning.
Now let’s plot the heatmap. The syntax is rather simple:
corrplot(correlation between columns)
Ta-da! As we can see, flipper_length_mm is highly correlated with bill_length_mm and body_mass_g. That means we will have to drop it to improve the accuracy of our model. Nuh-uh, you’ll have to write the code for this one!
Psst. scroll up for the cheat code!
Congratulations! You can now perform basic statistical operations on datasets and make beautiful visualizations using R. At this point, we encourage you to explore further and move forward in your journey to become an R expert. Here are some resources that might help:
Special thanks to Lajith Puthuchery and Nishant Dash for their valuable contributions to the article.