Introduction To Statistics Using R — 5

Vivekanandan Srinivasan · Published in Analytics Vidhya · Nov 13, 2019

If you have not read part 4 of the R data analysis series, kindly go through the following article, where we discussed Advanced Data Wrangling in R — 4.

The contents of this article are a gist of a couple of books I was introduced to during my IIM-B days.

R for Everyone — Jared P. Lander

Practical Data Science with R — Nina Zumel & John Mount

All the code blocks discussed in the article are available as R Markdown at the GitHub link.

To see all the articles written by me kindly use the link, Vivek Srinivasan.

Some of the most common tools used in statistics are means, variances, correlations, t-tests, and ANOVA. These are well represented in R with easy-to-use functions such as mean, var, cor, t.test, and anova. In this article, we focus only on computing basic summary statistics; t-tests and ANOVA are discussed in detail in the next article.

Mean

The first thing many people think of in relation to statistics is the average, or mean, as it is properly called. We start by looking at some simple numbers and later in the article play with bigger datasets. First, we generate a random sample of 100 numbers between 1 and 100.

x <- sample(1:100, size = 100, replace = TRUE)
x

sample uniformly draws size entries from the vector 1:100. Setting replace=TRUE means the same number can be drawn multiple times. Now that we have a vector of data, we can calculate its mean. This is the simplest arithmetic mean. Your value may differ, since we are calculating the mean of a random sample of 100 numbers.

mean(x)
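Because the draw is random, the numbers (and the mean) will change every time the code runs. If you want reproducible results, one option is to fix the random seed first; a small sketch, where the seed value 1234 is an arbitrary choice:

# fixing the seed makes the random sample reproducible across runs
set.seed(1234)
x <- sample(1:100, size = 100, replace = TRUE)
mean(x)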

Simple enough. Because this is statistics, we need to consider cases where some data are missing. To simulate this, we take x and randomly set 20 of its 100 elements (20%) to NA.

# copy x
y <- x
# randomly setting 20 values to NA
y[sample(1:100, size = 20, replace = FALSE)] <- NA
y

Using mean on y will return NA. This is because, by default, if mean encounters even one NA element, it returns NA to avoid giving misleading information. To have the NAs removed before calculating the mean, set na.rm to TRUE.

mean(y, na.rm=TRUE)

Weighted Mean

To calculate the weighted mean of a set of numbers, the function weighted.mean takes a vector of numbers and a vector of weights. It also has an optional argument, na.rm, to remove NAs before calculating; otherwise, a vector with NA values returns NA, as in the case above.

grades <- c(95, 72, 87, 66)
weights <- c(1/2, 1/4, 1/8, 1/8)
weighted.mean(x = grades, w = weights)

The formula for the weighted mean is given in the equation below; it is the same as the expected value of a discrete random variable.
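With values x_i and non-negative weights w_i, the weighted mean is:

\bar{x}_w = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i}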

Variance and SD

Variance and standard deviation are the most commonly used measures of spread. Variance is a measure of how spread out a data set is. It is calculated from the squared deviations of each number from the mean of the data set.
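More precisely, R's var computes the sample variance, which divides by n - 1 rather than n:

s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2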

var(x)

Standard deviation is the square root of the variance and is calculated with sd. Like mean, the sd and var functions also have the na.rm argument to remove NAs before computation.

sd(y, na.rm=TRUE)
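As a quick sanity check, taking the square root of var reproduces sd; both drop the missing values when na.rm=TRUE:

sqrt(var(y, na.rm = TRUE))
sd(y, na.rm = TRUE)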

Summary Statistics

In addition to mean, sd, and var, other commonly used functions for summary statistics are min, max, median, and quantile.

min(x)
max(x)
median(x)

The median, as calculated before, is the middle value of an ordered set of numbers. For instance, the median of 5, 2, 1, 8, and 6 is 5. When there is an even number of values, the median is the mean of the middle two numbers. For 5, 1, 7, 4, 3, 8, 6, and 2, the median is 4.5.
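Both examples can be checked directly in R:

# odd number of values: the middle value of the sorted vector
median(c(5, 2, 1, 8, 6))           # 5
# even number of values: the mean of the two middle values
median(c(5, 1, 7, 4, 3, 8, 6, 2))  # 4.5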

A helpful function that computes the mean, minimum, maximum, and median in one go is summary. There is no need to specify na.rm because if there are NAs, they are automatically removed and their count is included in the results.

summary(y)

summary also displays the first and third quartiles. To calculate other quantile values we can use the quantile function.

quantile(x,probs= c(0.1, 0.25, 0.5, 0.75, 0.99))

Quantiles are numbers in a set below which a certain percentage of the numbers fall. For instance, among the numbers 1 through 200, the 75th percentile (the number that is larger than 75% of the numbers) is 150.25.
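That example can be verified directly, using R's default interpolation between the two nearest values:

quantile(1:200, probs = 0.75)  # 150.25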

Correlation and Covariance

When dealing with more than one variable, we need to examine their relationships with each other. Two simple, straightforward measures are correlation and covariance. To examine these concepts, we look at the economics data from ggplot2.

require(ggplot2)
head(economics)

In the economics dataset, pce is personal consumption expenditure and psavert is the personal savings rate. We calculate their correlation using cor.

cor(economics$pce, economics$psavert)
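By default, cor computes the Pearson correlation coefficient, which for two vectors x and y is:

r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}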

This negative correlation makes sense because spending and saving move in opposite directions.
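One quick way to see this relationship, as an optional sketch not in the original walkthrough, is a simple scatterplot of the two columns:

# scatterplot of consumption against the savings rate
ggplot(economics, aes(x = pce, y = psavert)) +
  geom_point() +
  labs(x = "Personal consumption expenditure", y = "Personal savings rate")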

To compare multiple variables at once, use cor on a matrix (only for numeric variables).

cor(economics[,c(2,4:6)])

Because this is just a table of numbers, it would be helpful to also visualize the information using a plot. For this, we use the ggpairs function from the GGally package (a collection of helpful plots built on ggplot2). It shows a scatterplot of every variable in the data against every other variable.

GGally::ggpairs(economics[, c(2, 4:6)])

This is similar to a small multiples plot, except that each pane has different x- and y-axes. While this shows the original data, it does not actually show the correlation in a visual sense. To show that, we build a heatmap of the correlation numbers. A high positive correlation indicates a positive relationship between variables, a high negative correlation indicates a negative relationship, and near-zero represents no strong relationship.

require(reshape2)
require(scales)
econCor <- cor(economics[, c(2, 4:6)])
## Converting to long format
econMelt <- melt(econCor, varnames = c("x", "y"), value.name = "Correlation")
econMelt

Now we use this data and create our heatmap for the correlation values using ggplot2.

ggplot(econMelt, aes(x = x, y = y)) +
  geom_tile(aes(fill = Correlation)) +
  scale_fill_gradient2(low = muted("red"),
                       mid = "white",
                       high = "steelblue",
                       guide = guide_colorbar(ticks = FALSE, barheight = 10),
                       limits = c(-1, 1)) +
  theme_minimal() +
  labs(x = NULL, y = NULL)

To see ggpairs in all its glory, look at the tips data from the reshape2 package. This shows every pair of variables in relation to each other, building histograms, boxplots, or scatterplots depending on the combination of continuous and discrete variables. While a data dump like this looks nice, it is not always the most informative form of exploratory data analysis.

data(tips,package="reshape2")
head(tips)
GGally::ggpairs(tips)

No discussion of correlation would be complete without the old refrain, "Correlation does not imply causation". In other words, just because two variables are correlated does not mean one has an effect on the other. Running the following code generates a pleasant surprise. Kindly run the code and see it for yourself, since I do not want to spoil the surprise here.

require(RXKCD)
getXKCD(which="552")

Similar to correlation is covariance, which is like a variance between two variables. The cov function works similarly to cor, with the same arguments for dealing with missing data. In fact, ?cov and ?cor pull up the same help page.

cov(economics$pce, economics$psavert)
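Covariance and correlation are tightly linked: dividing the covariance by the product of the two standard deviations recovers the Pearson correlation. A quick check:

# correlation is covariance scaled by the standard deviations
cov(economics$pce, economics$psavert) /
  (sd(economics$pce) * sd(economics$psavert))
# identical to cor(economics$pce, economics$psavert)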

In the next article, we will see some advanced statistical concepts: t-tests (one-sample, two-sample, and paired t-tests) and ANOVA, which form the basis of advanced statistical methods like regression.

Advanced Statistics Using R — 6

Do share your thoughts and support by commenting and sharing the article among your peer groups.

