Data Analysis with R — The Basics

Nathaniel Motulsky
6 min read · May 31, 2024

Coding, analysis, and important terminology: this article has it all, so use the table of contents to skip to what you need.

Table of Contents:

  1. R Graphing and Code
  2. Data Analysis and Vocalization
    1. Measures of Spread (Standard Deviation, IQR, and Range)
  3. Vocabulary

R Graphing and Code

The first thing you want to do in R is load the mosaic and tidyverse packages so you can use important functions such as favstats, bwplot, histogram, and ggplot. You can do that with the commands below.

library(mosaic)
library(tidyverse)

If you get an error, you might not have those packages installed yet. If so, run the commands below to install them. Here is also a detailed explanation of how to install packages if you need more help with that: https://www.dataquest.io/blog/install-package-r/.

install.packages("mosaic")
install.packages("tidyverse")
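
If you rerun a script often, you may not want to reinstall the packages every time. Here is a minimal sketch of one way to install only what is missing. The helper names missing_packages and install_if_missing are made up for this example; they are not part of mosaic or tidyverse.

```r
# Hypothetical helper: return the names of packages that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Hypothetical helper: install only the packages that are missing.
install_if_missing <- function(pkgs) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  invisible(to_install)
}
```

Calling install_if_missing(c("mosaic", "tidyverse")) before the library() lines should then do nothing when both packages are already there.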

We are going to create a numerical variable to use in our graphs, but you can use any numerical variable in its place.

scores<-c(79,81,80,77,83,74,93,80,73,77,83,86,90,79,86,83,89,97)

We can create a stem and leaf plot of this with the following code.

stem(scores)

This would create an output that looks like this:

We can use the favstats command to see the 5 number summary of our variable.

favstats(~scores)

This should print one row containing the minimum, Q1, median, Q3, maximum, mean, standard deviation, number of values, and number of missing values.
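
If you want to double-check these numbers without mosaic, here is a rough sketch using only base R. It assumes the default quartile method of quantile(), which I believe is what favstats uses as well; the variable name five_num is just for this example.

```r
scores <- c(79, 81, 80, 77, 83, 74, 93, 80, 73, 77, 83, 86, 90, 79, 86, 83, 89, 97)

# Five-number summary: minimum, Q1, median, Q3, maximum
five_num <- quantile(scores)
five_num          # values: 73 79 82 86 97

# The remaining numbers in the summary
mean(scores)      # about 82.78
sd(scores)        # standard deviation
length(scores)    # 18 values
```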

We can create a boxplot with the following code.

bwplot(scores)

That should create an output that looks like this:

We will now create a categorical variable.

grade<-c("Senior","Senior","Senior","Senior","Senior","Senior","Senior","Junior","Junior","Junior","Junior","Junior","Junior","Junior","Junior","Junior","Junior","Junior")

We can use the tally function to see how many juniors and how many seniors there are.

tally(grade)

You should get a count for each grade, here 11 juniors and 7 seniors.
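
As a side note, the same counts can be reproduced in base R with table(). The sketch below also builds the grade variable with rep() instead of typing every value out, which is just a shortcut and gives the same 18 values in the same order.

```r
# Same values as the grade variable above: 7 seniors followed by 11 juniors
grade <- c(rep("Senior", 7), rep("Junior", 11))

# Count how many times each grade appears
counts <- table(grade)
counts

# The same counts as proportions of the whole class
prop.table(counts)
```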

We can assign scores to the juniors and seniors by giving the first score to the first senior, the second score to the second senior, and so on. We can then visualize the scores and the grades with a side-by-side boxplot. See the code below.

test<-data.frame(grade,scores)
bwplot(scores~grade,data=test,xlab="grade")

The first line combines the grade and scores variables into one data frame, pairing the first grade with the first score, the second with the second, and so on. The second line draws a boxplot of scores split by grade: we tell R that the dataset is “test”, which we defined in the first line, and we label the x-axis “grade”. You should get an output like this:

We can also create a bargraph of the scores:

bargraph(~scores)

Output:

We can visualize scores vs grades with a mosaic plot:

mosaicplot(scores~grade,data=test)

The output should look like this:

Another way to visualize this, which I think looks much better, is with a segmented bar graph.

ggplot(data=test, aes(x=scores, fill=grade))+geom_bar(position="fill")

The output should come out like this:

We can also create a histogram of the scores:

histogram(~scores,data=test,ylab="proportion")

Output:

We can separate this by grade using a side by side histogram with either of the two following commands depending on the layout you want.

histogram(~scores|grade,layout=c(1,2),data=test)
histogram(~scores|grade,layout=c(2,1),data=test)

The first line should create an output like this:

The second output should look like this:

Data Analysis and Vocalization

In this section we will talk about data analysis and how to describe data. A useful resource is the Vocabulary section, which you can scroll to if you don’t know a word.

The three main ways we describe data and graphs are spread, center, and shape. Spread includes standard deviation, range, and IQR, and we use words like “more spread out” or “more variable”. Center includes mean and median, and we use words like “greater mean/median” or “greater average”. Shape includes left-skewed, right-skewed, or symmetrical.

We will use an example graph that we coded earlier: the side-by-side boxplot of scores by grade.

Try describing the boxplots.

We would say that juniors have a greater median score than seniors, as we see from the center dots, and that juniors are also more variable, as we see from their greater range and greater IQR.
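
To back up that description with numbers, here is a small sketch that computes the median, IQR, and range for each grade with base R's tapply(). The names group_median, group_iqr, and group_range are only for this example, and the IQR uses R's default quartile method.

```r
scores <- c(79, 81, 80, 77, 83, 74, 93, 80, 73, 77, 83, 86, 90, 79, 86, 83, 89, 97)
grade <- c(rep("Senior", 7), rep("Junior", 11))
test <- data.frame(grade, scores)

# Apply each summary function to the scores within each grade
group_median <- tapply(test$scores, test$grade, median)
group_iqr    <- tapply(test$scores, test$grade, IQR)
group_range  <- tapply(test$scores, test$grade, function(x) max(x) - min(x))

group_median  # Junior 83, Senior 80
group_iqr     # Junior 8,  Senior 4
group_range   # Junior 24, Senior 19
```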

Measures of Spread

Standard deviation measures how far the values typically fall from the mean. The diagram below demonstrates that a small standard deviation means the values are closer together, with less variability, whereas a large standard deviation means the values are more spread out and more variable.
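
To see what the standard deviation is actually doing, here is a short sketch that computes the sample standard deviation step by step and compares two made-up data sets with the same mean. The vectors tight and wide and the function sd_by_hand are invented for illustration; in practice you would just call sd().

```r
# Two made-up data sets, both with a mean of 10
tight <- c(9, 10, 11)
wide  <- c(0, 10, 20)

# Sample standard deviation: square the deviations from the mean,
# add them up, divide by n - 1, then take the square root
sd_by_hand <- function(x) {
  deviations <- x - mean(x)
  sqrt(sum(deviations^2) / (length(x) - 1))
}

sd_by_hand(tight)  # 1
sd_by_hand(wide)   # 10
sd(wide)           # the built-in function gives the same answer
```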

The range is the difference between the largest and smallest values. It tells you how far apart the extremes of the data are. Outliers can distort the range heavily: for the data set (1, 25, 26, 27, 28, 75), the range is 74, but most of the values are much closer together than that. The two outliers, 1 and 75, account for nearly all of it.

The IQR is roughly the range of the middle half of the data, which leaves the extremes out. It is the difference between the third and first quartiles (scroll down to the vocabulary if you don’t know what this means). The IQR is often a better measure of spread than the range because it is much less influenced by outliers.
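
Here is a quick sketch comparing the two measures on the outlier example from above. The variable names are only for this example, and IQR() uses R's default quartile method, so a hand calculation with a different quartile rule may give a slightly different number.

```r
x <- c(1, 25, 26, 27, 28, 75)

# Range: driven entirely by the two extreme values
full_range <- max(x) - min(x)
full_range      # 74

# IQR: spread of the middle half, barely affected by 1 and 75
middle_spread <- IQR(x)
middle_spread   # 2.5
```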

We usually use mean and standard deviation to describe data if it is symmetric. But if it is skewed, we use median and IQR.

Vocabulary

  • Mean: the average of a set of values (add up all the values and divide by the number of values)
  • Median: the middle value (cross off the highest and lowest values, then the second highest and second lowest, and so on, until one or two values remain. If you have two middle values, take the mean of the two.)
  • Minimum: smallest value
  • Maximum: greatest value
  • Range: greatest value − smallest value
  • Upper/Lower Quartile (Q3/Q1): the median of the upper/lower half of the data set (after finding the median, find the median of just that half of the data set)
  • IQR (Interquartile Range): Q3 − Q1
  • 5 Number Summary: Minimum, Q1, Median, Q3, Maximum
  • Mode: the most common number (the number that appears the most)
  • Standard Deviation: a measure of how far values typically fall from the mean
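
Most of the terms above have a built-in R function (mean, median, min, max, IQR, sd), but the mode does not: base R's mode() reports how a value is stored, not the most common number. Here is one possible sketch for finding the mode with table(). The names counts and mode_value are just for this example, and if several values tie for most common, this only returns the first one.

```r
scores <- c(79, 81, 80, 77, 83, 74, 93, 80, 73, 77, 83, 86, 90, 79, 86, 83, 89, 97)

# Count how often each score appears
counts <- table(scores)

# Pick the score with the highest count and turn its label back into a number
mode_value <- as.numeric(names(counts)[which.max(counts)])
mode_value  # 83, which appears three times
```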
