8 Descriptive Statistics Concepts Explained Using R

Measures of Central Tendency, Dispersion, and Position

Marco Basile · Published in Analytics Vidhya · Sep 25, 2020

Prelude

I decided to split the original article on statistics into two parts.

In this article, the first one, you’ll find the usual descriptive statistics concepts:

  • Measures of Central Tendency: Mean, Median, Mode
  • Measures of Dispersion: Variance and Standard Deviation
  • Measures of Position: Quartiles, Quantiles and Interquartiles

In the next one, I’ll demonstrate how to use R to explain statistical concepts connected to A/B Hypothesis testing, such as:

  • Sampling, sample size, and Population
  • Hypothesis Creation and Null Hypothesis
  • P-Value and Confidence Level
  • Statistical Significance and Power
  • One Sample T-test and Multiple Samples T-test (A/B or multivariate A/B testing)
  • Type I and Type II errors

Before diving in, we’ll need to briefly introduce statistics and R.

What is Statistics?

A statistical way to say who I am

According to Merriam-Webster, statistics is a branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data.

We want to make things easier.

We say that statistics is just a term used to summarize the process that analysts use to characterize a given data set.

Throughout this process, the analyst gathers, reviews, analyses, and draws conclusions from data.

Interpretation happens during the last step, helping to make better-informed decisions, whether it’s in business or in science.

Depending on what you need to do with a given data set, you’ll use different types of statistics, often just called stats.

For example, anything that describes data (charts, graphs, mode, mean) is a descriptive stat.

Test stats, instead, are what we’re gonna talk about in the next article, since they’re used in null hypothesis testing.

We need to differentiate between a statistic and a parameter, though.

A statistic is a piece of data from a portion of your original population, called a sample.

A parameter is a piece of data related to your entire population.

Since for most tests you can’t have all the data, analysts use samples of a given population to make educated guesses about the original population.

We’re gonna use both in the next two articles, as our data sets may be small at times.

Now that you have a small statistics foundation, it’s time to understand who this guy called “R” is.

What is R?

A graphical representation of A/B multivariate testing

As stated on r-project.org, R is a language and environment for statistical computing and graphics.

R is a coding language quite different from most others, as it was developed mainly by statisticians.

In fact, R provides a wide variety of statistical and graphical techniques, such as tests, linear modeling, time-series analysis, and more.

One of R’s strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed.

We can surely say that, together with Python, it’s the best language out there for data manipulation, calculation, and interpretation.

On top of that, the R community is very active and full of data enthusiasts ready to help you.

A quick note before diving in.

If you’ve ever heard “Base R”, it’s the same as “R”.

Analysts often call R “Base R” because R itself includes only the basic packages, while you can add many others to perform more advanced analyses.

Now that you’ve learned what statistics and R are, let’s talk about our concepts using R.

Measures of Central Tendency

Mean

The mean, represented with μ as a parameter of a given population and with x̄ (x-bar) as a statistic of a population’s sample, is often called the average in daily life.

It’s the midpoint or the typical value of a given data set.

In mathematics, we’d use this formula:

μ = (1/N) · Σᵢ₌₁ᴺ Xᵢ

In plain English, you calculate the mean or average by summing all the elements you got and dividing the total sum by the number of elements.

To calculate the mean in R, we use:

mean(my_data)

If you have a data set called NYC_weather and you want to save the average value in Average_NYC_weather, here’s what we’d do:

Average_NYC_weather <- mean(NYC_weather)
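
If you’re curious how mean() maps to the formula, here’s the same calculation spelled out with sum() and length() (a quick sketch, assuming NYC_weather is a numeric vector as above):

# sum of all elements divided by the number of elements
Average_NYC_weather <- sum(NYC_weather) / length(NYC_weather)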

Now let’s talk about the median.

Median

The median is the value that falls right in the middle of your data set, given that the data set is ordered from smallest to largest value.

You can have an even or an odd number of values:

Odd n. of values

five_author_ages <- c(29, 49, 42, 43, 32)

First, we want to sort them out, and then we have:

sorted_author_ages <- c(29, 32, 42, 43, 49)

Now, we can see the median is 42.

Even n. of values

six_author_ages <- c(29, 49, 42, 43, 32, 44)

Let’s sort them out:

sorted_author_ages <- c(29, 32, 42, 43, 44, 49)

Since now we have two values in the middle, we’ll find the median using the mean.

In this case, “(42+43)/2”, therefore 42.5.

In R, these two steps are performed by a single function:

median(my_data)
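
As a quick check on our example above, R handles both steps (sorting, then averaging the middle pair) internally:

median(six_author_ages) # 42.5 — no need to sort first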

If we want to calculate the median of a sample data set called children_ages and save the result to median_children_ages, here’s how it’d look:

median_children_ages <- median(children_ages)

Now let’s go on with the Mode.

Mode

In plain words, the mode is the value that appears most frequently in a given data set.

A data set can have multiple modes if several values share the same highest frequency.

Take this series of numbers:

29,49,42,43,32,38,37,41,27,27

The mode is 27, as it appears twice, more often than any other value.

In R, the package DescTools includes the Mode() function.

First, you want to import the library, together with the standard ones:

# import libraries
library(readr)
library(dplyr)
library(DescTools)

Next, our sample data set:

example_data <- c(24, 16, 12, 10, 12, 28, 38, 12, 28, 24)

Then, our mode:

example_mode <- Mode(example_data)

In this case, that would be 12.
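
If you’d rather avoid installing DescTools, here’s a minimal sketch of a mode function in base R (mode_base is a hypothetical helper, not a built-in; note that base R’s mode() returns a value’s storage type, not the statistical mode):

# count how often each value occurs, then keep the value(s) with the top count
mode_base <- function(x) {
  freq <- table(x)
  as.numeric(names(freq)[freq == max(freq)])
}
mode_base(example_data) # 12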

Now let’s sum up all the concepts we learned so far with a simple project in R.

Playing with R: Central Tendency for NYC Houses

Living in NYC may be expensive, but how much exactly?

In this simple project, we’re gonna find the mean, median, and mode for three of the five NYC boroughs: Brooklyn, Manhattan, and Queens.

Let’s import the libraries first:

# Load libraries
library(readr)
library(dplyr)
library(DescTools)

Then, let’s load the three data sets in R:

# Read in housing data
brooklyn_one_bed <- read_csv('brooklyn-one-bed.csv')
brooklyn_price <- brooklyn_one_bed$rent
manhattan_one_bed <- read_csv('manhattan-one-bed.csv')
manhattan_price <- manhattan_one_bed$rent
queens_one_bed <- read_csv('queens-one-bed.csv')
queens_price <- queens_one_bed$rent

Note: the “$” sign is used to select a column of a given data set.

Note: “read_csv” is used to load a .csv file into an R object.

Let’s calculate the mean:

brooklyn_mean <- mean(brooklyn_price)
print(brooklyn_mean) # 3327.404
manhattan_mean <- mean(manhattan_price)
print(manhattan_mean) # 3993.477
queens_mean <- mean(queens_price)
print(queens_mean) # 2346.254

The average cost is pretty high in Manhattan. Anyone living there?

Then, the Median:

brooklyn_median <- median(brooklyn_price)
print(brooklyn_median) # 3000
manhattan_median <- median(manhattan_price)
print(manhattan_median) # 3800
queens_median <- median(queens_price)
print(queens_median) # 2200

And finally, the mode.

For Brooklyn:

brooklyn_mode <- Mode(brooklyn_price)
print(brooklyn_mode)
#[1] 2500
#attr(,"freq")
#[1] 26

R is telling us the mode is the value 2500, appearing 26 times.

For Manhattan:

manhattan_mode <- Mode(manhattan_price)
print(manhattan_mode)
#[1] 3500
#attr(,"freq")
#[1] 56

R is telling us the mode is the value 3500, appearing 56 times.

For Queens:

queens_mode <- Mode(queens_price)
print(queens_mode)
#[1] 1750
#attr(,"freq")
#[1] 11

R is telling us the mode is the value 1750, appearing 11 times.

To summarize all the values we’ve got:

  • Mean: Brooklyn(3327.404), Manhattan(3993.477), Queens(2346.254)
  • Median: Brooklyn(3000), Manhattan(3800), Queens(2200)
  • Mode: Brooklyn(2500|26), Manhattan(3500|56), Queens(1750|11)

These results tell us that Manhattan is the most expensive place to live, with a mean of about $3,993, while Brooklyn is second, with a mean of about $3,327, and Queens is last, with a mean of about $2,346.

The median and the mode seem to confirm that as well.

If you want to dive deeper, try to make an educated guess by looking at their histograms. Here’s how they look:

hist(brooklyn_price)
Histogram of Brooklyn prices

hist(manhattan_price)
Histogram of Manhattan prices

hist(queens_price)
Histogram of Queens prices

Note: hist() is part of base R, so no extra library is needed here.

Now that we know what mean, median and mode are and how to find them using R, let’s talk about Variance and Standard Deviation.

Measures of Dispersion

Variance

Learning about Mean, Median, and Mode is a good place to start describing our data, but what if we have two different data sets that look like this:

dataset_one <- c(-4, -2, 0, 2, 4) 
dataset_two <- c(-400, -200, 0, 200, 400)

Even though the values are quite different, both data sets have a mean of 0.

That’s not enough to communicate differences.

That’s why we introduce Variance.

Variance tells you how spread out the points in a data set are:

Let’s introduce two sample data sets:

# load libraries
library(readr)
library(dplyr)
library(ggplot2)

teacher_one_grades <- c(83.42, 88.04, 82.12, 85.02, 82.52, 87.47, 84.69, 85.18, 86.29, 85.53, 81.29, 82.54, 83.47, 83.91, 86.83, 88.5, 84.95, 83.79, 84.74, 84.03, 87.62, 81.15, 83.45, 80.24, 82.76, 83.98, 84.95, 83.37, 84.89, 87.29)
teacher_two_grades <- c(85.15, 95.64, 84.73, 71.46, 95.99, 81.61, 86.55, 79.81, 77.06, 92.86, 83.67, 73.63, 90.12, 80.64, 78.46, 76.86, 104.4, 88.53, 74.62, 91.27, 76.53, 94.37, 84.74, 81.84, 97.69, 70.77, 84.44, 88.06, 91.62, 65.82)

Now, let’s represent both with a histogram.

t1_chart <- qplot(teacher_one_grades,
                  geom = 'histogram',
                  binwidth = 0.8,
                  main = 'Teacher One Grades',
                  ylab = 'Grades',
                  fill = I("blue"),
                  col = I("red"),
                  alpha = I(.2)) +
  geom_vline(aes(xintercept = mean(teacher_one_grades), color = "mean"),
             linetype = "solid", size = 1) +
  scale_color_manual(name = "statistics", values = c(mean = "red"))

Let’s explain what we did:

With the qplot function, we create a histogram with the following characteristics:

  • histogram chart type
  • a column width of 0.8
  • a title of ‘Teacher One Grades’
  • a Y-axis called ‘Grades’
  • columns filled with blue and outlined in red
  • a vertical line (parallel to the Y-axis) at the mean of teacher_one_grades
  • a solid line type with size 1
  • the color red for that line

Note: the same syntax applies to the next histogram chart as well.

Now let’s see what we get:

t1_chart
Histogram for the teacher_one_grades data set

Let’s see the other one. Code:

t2_chart <- qplot(teacher_two_grades,
                  geom = 'histogram',
                  binwidth = 0.8,
                  main = 'Teacher Two Grades',
                  ylab = 'Grades',
                  fill = I("blue"),
                  col = I("red"),
                  alpha = I(.2)) +
  geom_vline(aes(xintercept = mean(teacher_two_grades), color = "mean"),
             linetype = "solid", size = 1) +
  scale_color_manual(name = "statistics", values = c(mean = "red"))

Same syntax as above, with the title changed to ‘Teacher Two Grades’.

Now let’s see the chart:

t2_chart
Histogram for the teacher_two_grades data set

Try to answer the following question:

“Which data set is the most spread out?”

Now, if we want to calculate the variance ourselves, here’s the process we’d follow.

First, we calculate the distance of each value from the mean of the data set.

Let’s say we have the following data set:

grades <- c(88, 82, 85, 84, 90)

We calculate the mean and print it out:

grades_mean <- mean(grades) # named grades_mean to avoid shadowing base R's mean()
print(grades_mean) # 85.8

The difference of each value from the mean looks like this:

difference_one <- 88 - grades_mean
difference_two <- 82 - grades_mean
difference_three <- 85 - grades_mean
difference_four <- 84 - grades_mean
difference_five <- 90 - grades_mean

Then, we need the average of the distances.

difference_sum <- difference_one + difference_two + difference_three + difference_four + difference_five
average_difference <- difference_sum / 5
average_difference # 0.000000000000002842171 (essentially 0, up to floating-point rounding)

Now, what happens if we have this data set:

c(-200, 200)

If we find the average, we’d get 0, but that’s not an indicator of how spread out this dataset is in reality.

Hence, the problem is with negative numbers. To get rid of those, we square each difference:

the squared difference between each value and the mean:

(X − μ)²

Where X is the given value and μ is the mean.

Now, let’s sum it up with a single formula:

The formula to calculate variance:

σ² = (1/N) · Σᵢ₌₁ᴺ (Xᵢ − μ)²
  • Σ stands for the total sum over the values, starting from the first one (i = 1) to the last one (N).
  • Xᵢ is the given value and μ the mean.
  • N is the total number of values we have.
  • σ² is the symbol we use to indicate variance.

In R:

# population variance: the mean of the squared deviations from the mean
variance <- function(x) mean((x - mean(x))^2)

Now that we understand every single piece of variance, let’s make it simple and straightforward using R.

teacher_one_variance <- variance(teacher_one_grades)
teacher_two_variance <- variance(teacher_two_grades)
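
One caveat worth knowing: base R ships with var(), but it computes the sample variance (dividing by n − 1), while our function computes the population variance (dividing by n). A quick sketch of the relationship:

n <- length(teacher_one_grades)
var(teacher_one_grades)      # sample variance, divides by n - 1
variance(teacher_one_grades) # population variance, divides by n
# the two differ by a factor of (n - 1) / n:
all.equal(variance(teacher_one_grades), var(teacher_one_grades) * (n - 1) / n) # TRUE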

Variance has a flaw, though: since it’s based on squared distances, it’s expressed in squared units rather than the original units of the data, which makes it less descriptive.

That’s why we can say that Standard Deviation is the son of Variance. In JavaScript we’d say:

class Variance {};
class Standard_Deviation extends Variance {};

Now, let’s find out what sd() is all about.

Standard Deviation

Standard deviation is computed by taking the square root of the variance.

Here’s how it looks:

Standard Deviation Formula:

σ = √( (1/N) · Σᵢ₌₁ᴺ (Xᵢ − μ)² )
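
The same caveat as with variance applies here: base R’s sd() is the square root of var(), so it also uses the sample (n − 1) denominator. A small sketch using the grades vector from above:

sd(grades)             # sample standard deviation
sqrt(var(grades))      # identical to sd(grades)
sqrt(variance(grades)) # population standard deviation, using our function from above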

Now let’s make an example using R:

library(dplyr)
library(tidyr)
load("lesson_data.Rda")
variance <- function(x) mean((x - mean(x))^2)
hist(nba_data, col = rgb(0, 0, 1, 1/4), xlim = c(50, 100), main = "NBA and OkCupid", breaks = 80)
hist(okcupid_data, col = rgb(1, 0, 0, 1/4), add = T, breaks = 80)
legend("topright", c("NBA Data", "OkCupid Data"), fill = c("blue", "red"))
box()

Above we load libraries, define the variance function and display two histograms of two sample data sets.

Here’s what we did:

  • chose an RGB color for both
  • defined an interval for the X-axis → 50 ≤ X ≤ 100
  • set the number of bins with breaks = 80
  • created a legend at the top right, filling the NBA Data box with blue and the OkCupid Data box with red

Here’s what we get:

Histogram for the NBA and OkCupid data sets

These data sets contain samples of the heights of OkCupid users and NBA players.

Here’s how we find Standard Deviation:

nba_standard_deviation <- sd(nba_data)
okcupid_standard_deviation <- sd(okcupid_data)
print(paste("The standard deviation of the NBA dataset is ", nba_standard_deviation))
print(paste("The standard deviation of the OkCupid dataset is ", okcupid_standard_deviation))

Here’s what we get:

"The standard deviation of the NBA dataset is  3.65199686214009""The standard deviation of the OkCupid dataset is  3.92632398306864"

Now that we know how big one unit (one standard deviation) is, we can easily determine how spread out our data really is.

For roughly normal data, you’ll find that about 68% of your values fall within one standard deviation of the mean, as in the chart below:

Standard Deviation chart
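
That 68% figure is a rule of thumb for (roughly) normally distributed data; we can check how closely our height data follows it (a quick sketch, assuming nba_data is loaded as above):

# proportion of NBA players within one standard deviation of the mean
within_one_sd <- abs(nba_data - mean(nba_data)) <= sd(nba_data)
mean(within_one_sd) # should be roughly 0.68 for normal-looking data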

Let’s find out how many standard deviations LeBron James is away from the mean in the OkCupid and NBA data sets.

LeBron is 80 inches tall.

First, we compute each data set’s mean, then LeBron’s difference from it:

nba_mean <- mean(nba_data)
okcupid_mean <- mean(okcupid_data)
nba_difference <- 80 - nba_mean
okcupid_difference <- 80 - okcupid_mean

Now, we divide each difference by the corresponding standard deviation:

num_nba_deviations <- nba_difference / nba_standard_deviation
num_okcupid_deviations <- okcupid_difference / okcupid_standard_deviation
num_nba_deviations # 0.552026761276738
num_okcupid_deviations # 2.95085175089013

As expected, LeBron is much farther from the OkCupid mean than from the NBA mean.
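
By the way, this quantity (the distance from the mean divided by the standard deviation) is commonly called a z-score. A tiny helper makes it reusable (z_score is a hypothetical name, not a built-in):

# how many standard deviations a value lies from the mean of a data set
z_score <- function(value, data) (value - mean(data)) / sd(data)
z_score(80, okcupid_data) # same as num_okcupid_deviations above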

That’s how standard deviation can be extremely helpful for our interpretations.

Now let’s conclude with a simple project using the concepts of Variance and Standard Deviation.

Playing with R: Variance in London Weather

We want to know which is the better month for a holiday in London, June or July, by looking at the temperatures and estimating in which month bad weather is less likely.

First, we load libraries and data:

library(readr)
library(dplyr)
load("project.Rda")

How many rows do we have?

print(nrow(london_data)) # 39106

We filter out June and July and save the results:

june <- london_data %>%
  filter(month == "06")
july <- london_data %>%
  filter(month == "07")

We piped (%>%) london_data into the filter function and saved each result in an object named after the month.
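
The pipe is just a more readable way to chain operations; without it, the same filter reads like this:

# equivalent to london_data %>% filter(month == "06")
june <- filter(london_data, month == "06")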

We select just the temperature columns and save them in two R objects:

temp_june <- june$TemperatureC
temp_july <- july$TemperatureC

We calculate the mean and the standard deviations for both:

average_temp_june <- mean(temp_june)
print(average_temp_june) # 17.04729
average_temp_july <- mean(temp_july)
print(average_temp_july) # 18.77561
sd_temp_june <- sd(temp_june)
print(sd_temp_june) # 4.59863
sd_temp_july <- sd(temp_july)
print(sd_temp_july) # 4.137097

In your opinion, what would be the best month to go on holiday?

Here’s my take on this:

“Based on the analysis above, I’d rather pick July for a vacation in London, since the temperature is higher on average.

On top of that, July even has a lower standard deviation, which means a lower probability of running into an outlier during my vacation.”

Now, let’s talk about measures of Position.

Measures of Position

Quartiles

A quartile is one of the four equal groups in which a data set can be split.

By doing this, we can say whether a data point falls into the 1st, 2nd, 3rd or 4th group.

The quartile cut points that create these groups are represented by the letter “Q”: Q1, Q2, and Q3.

The 2nd quartile, Q2, happens to be the median.

Let’s take a sample data set:

dataset_one <- c(-108, 4, 8, 15, 16, 23, 42)

Since we have an odd number of values here, Q2 is simply the middle value. (If we had an even number of values instead, we’d compute the average of the two values in the middle.)

The Q2 would be:

dataset_one_q2 <- 15

Q1 and Q3 are, respectively, the medians of the values before and after Q2, the overall median.

In this case:

dataset_one_q1 <- 4  # median of (-108, 4, 8)
dataset_one_q3 <- 23 # median of (16, 23, 42)
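
A quick check in R reveals a subtlety: quantile() supports nine different algorithms via its type argument, and the default (type = 7) interpolates between values, so it won’t always match the hand method above:

quantile(dataset_one, 0.25)           # 6 with the default, type = 7
quantile(dataset_one, 0.25, type = 1) # 4, matching our manual calculation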

Now let’s use R to find our quartiles. We load an R object containing the lengths of 9,975 songs.

load("songs.Rda")

Now, we calculate Q1, Q2, Q3:

songs_q1 <- quantile(songs, 0.25)
songs_q2 <- quantile(songs, 0.50)
songs_q3 <- quantile(songs, 0.75)
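
Two handy shortcuts: calling quantile() with no probabilities returns the minimum, Q1, the median, Q3, and the maximum all at once, and summary() adds the mean to that list:

quantile(songs) # 0%, 25%, 50%, 75%, 100%
summary(songs)  # Min., 1st Qu., Median, Mean, 3rd Qu., Max.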

Here’s how the code for our final histogram looks:

hist <- qplot(songs,
              geom = "histogram",
              main = 'Histogram of Song Lengths',
              xlab = 'Song Length (Seconds)',
              ylab = 'Count',
              fill = I("blue"),
              col = I("red"),
              alpha = I(.2)) +
  geom_vline(aes(xintercept = quantile(songs, 0.25), color = I("blue")),
             linetype = "solid", size = 1, show.legend = T) +
  geom_vline(aes(xintercept = quantile(songs, 0.5), color = I("purple")),
             linetype = "solid", size = 1, show.legend = T) +
  geom_vline(aes(xintercept = quantile(songs, 0.75), color = I("yellow")),
             linetype = "solid", size = 1, show.legend = T) +
  scale_colour_manual(name = "", labels = c("Q1", "Q2", "Q3"), values = c("blue", "purple", "yellow"))

We made a histogram with the following characteristics:

  • we plot the songs data loaded from the songs.Rda file
  • the main title is ‘Histogram of Song Lengths’
  • the song’s length is represented by the X-axis
  • the number of songs of a certain length is represented by the Y-axis
  • we filled the columns with blue, outlined in red
  • we created 3 solid lines, size 1, colored blue, purple, and yellow, parallel to the Y-axis
  • these lines represent Q1, Q2, and Q3
  • we created a legend at the right with the matching labels and colors

Here’s the result:

Histogram of Song Lengths, divided into quartiles

A song with a length of 300s would fall in the 4th quartile group, while one with a length of 250s would fall in the 3rd.

Now let’s take a look at the quartiles’ parents, the quantiles.

Quantiles

Quartiles are a specific kind of Quantiles.

Say you want to know whether you’re in the top 10% of your school.

Using quantiles, you can split the data into ten equal groups.

quantile() takes the data set and a number or a vector of numbers between 0 and 1 as input.

ten_percent <- quantile(school_one, 0.10)

This is only part of the picture, though, because it gives a single cut point that splits the bottom 10% of the data from the remaining 90%.

We’d have to use something like this instead:

ten_percent <- quantile(school_one, c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)) # note: the probabilities must be passed as a vector with c()

This perfectly splits the data set into 10 parts with the same amount of values.
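
The same nine cut points can be written more compactly with seq(), which saves typing each probability by hand:

# deciles: the 10th through 90th percentiles
deciles <- quantile(school_one, probs = seq(0.1, 0.9, by = 0.1))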

hist <- qplot(school_one,
              geom = "histogram",
              main = 'School One',
              xlab = 'SAT Score',
              ylab = 'Count',
              fill = I("blue"),
              col = I("red"),
              alpha = I(.2)) +
  geom_vline(aes(xintercept = quantile(school_one, 0.1), color = I("blue")), linetype = "solid", size = 1, show.legend = T) +
  geom_vline(aes(xintercept = quantile(school_one, 0.2), color = I("blue")), linetype = "solid", size = 1, show.legend = T) +
  geom_vline(aes(xintercept = quantile(school_one, 0.3), color = I("blue")), linetype = "solid", size = 1, show.legend = T) +
  geom_vline(aes(xintercept = quantile(school_one, 0.4), color = I("blue")), linetype = "solid", size = 1, show.legend = T) +
  geom_vline(aes(xintercept = quantile(school_one, 0.5), color = I("blue")), linetype = "solid", size = 1, show.legend = T) +
  geom_vline(aes(xintercept = quantile(school_one, 0.6), color = I("blue")), linetype = "solid", size = 1, show.legend = T) +
  geom_vline(aes(xintercept = quantile(school_one, 0.7), color = I("blue")), linetype = "solid", size = 1, show.legend = T) +
  geom_vline(aes(xintercept = quantile(school_one, 0.8), color = I("blue")), linetype = "solid", size = 1, show.legend = T) +
  geom_vline(aes(xintercept = quantile(school_one, 0.9), color = I("blue")), linetype = "solid", size = 1, show.legend = T)

Here’s what we did:

  • we plotted the school_one data on a histogram
  • the main title is ‘School One’
  • the SAT score is represented by the X-axis
  • the number of SAT scores is represented by the Y-axis
  • we filled the columns with blue, outlined in red
  • we created 9 solid lines, size 1, colored blue, parallel to the Y-axis
  • these lines represent the nine deciles, i.e. the 10th through 90th percentiles

Here’s what we get:

School One histogram with quantiles

We’ve finally arrived at the last measure of descriptive statistics, the interquartile range.

Interquartile Range

The interquartile range, or IQR, is commonly used when the extremes of a data set may contain outliers.

It’s the difference between Q3 and Q1, hence “interquartile”.

If we have to do that manually, we compute Q1 and Q3 and find the difference:

q1 <- quantile(dataset, 0.25)
q3 <- quantile(dataset, 0.75)
interquartile <- q3 - q1

But let’s not overcomplicate things. R comes to our help:

dataset <- c(4, 10, 38, 85, 193)
interquartile_range <- IQR(dataset) # 75

Why is the IQR useful?

The IQR doesn’t care about outliers in your data.

The following data sets give the same result:

dataset_one = c(6, 9, 10, 45, 190, 200) #144.5
dataset_two = c(6, 9, 10, 45, 190, 20000000) #144.5
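
The IQR also powers a common rule of thumb for flagging outliers: values more than 1.5 × IQR below Q1 or above Q3 (the same rule boxplot() uses for its whiskers by default). A sketch on dataset_two:

q1 <- quantile(dataset_two, 0.25)
q3 <- quantile(dataset_two, 0.75)
iqr <- IQR(dataset_two)
# keep only the values outside the 1.5 * IQR fences
outliers <- dataset_two[dataset_two < q1 - 1.5 * iqr | dataset_two > q3 + 1.5 * iqr]
outliers # 20000000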

Playing with R: Life Expectancy By Country

If you’ve come this far, it means you really enjoy statistics!

Now, take a deep breath because the last project contains some new stuff.

Ready? Let’s dive in.

We want to know how much our life expectancy varies depending on the country we live in.

The country_data data set contains data from 158 countries.

Let’s find out who lives longer.

We load libraries and data set:

library(ggplot2)
library(readr)
library(dplyr)
data <- read_csv("country_data.csv")

First, we pull the life_expectancy column, then print out the quartiles for the entire data set:

life_expectancy <- data %>% pull(life_expectancy)
life_expectancy_quartiles <- quantile(life_expectancy)
print(life_expectancy_quartiles)
# 0%       25%      50%      75%      100%
# 46.11250 62.32500 72.52500 75.44219 82.53750

Here’s how our histogram looks:

Histogram of life_expectancy

We want to understand the difference between countries with low GDP and high GDP.

First, we pull the column from our table:

gdp <- data %>% pull(GDP)

We split low and high depending on the median GDP.

median_gdp <- median(gdp)

We label countries as low GDP or high GDP by filtering values ≤ median_gdp and > median_gdp.

We save the result in low_gdp and high_gdp:

low_gdp <- data %>%
  filter(GDP <= median_gdp)
high_gdp <- data %>%
  filter(GDP > median_gdp)

Since we need life expectancy data, we also have to pull the life_expectancy column from the filtered data.

Here’s what we do:

low_gdp <- data %>%
  filter(GDP <= median_gdp) %>%
  pull(life_expectancy)
high_gdp <- data %>%
  filter(GDP > median_gdp) %>%
  pull(life_expectancy)

Now we compute the quartiles, both for low and high GDP countries.

low_gdp_quartiles <- quantile(low_gdp, c(0.25, 0.5, 0.75, 1))
print(low_gdp_quartiles)
# 25%      50%      75%      100%
# 56.33750 64.34375 71.73750 75.96875

High GDP countries:

high_gdp_quartiles <- quantile(high_gdp, c(0.25, 0.5, 0.75, 1))
print(high_gdp_quartiles)
# 25%      50%      75%      100%
# 72.96562 75.15625 80.52187 82.53750

See the difference?

In the lower-GDP half of countries, a quarter have a life expectancy below roughly 56 years; in the higher-GDP half, the longest-lived countries exceed 82 years.

Note: these are country-level quartile stats, so they don’t show outliers.

Finally, let’s take a look at the histograms:

Low GDP countries:

Histogram for low GDP countries

High GDP countries:

Histogram for high GDP countries

Awesome, that was the last step.

See how people in higher-GDP countries usually live longer?

Health is Wealth or Wealth is Health?

Feedback

If you enjoyed this article and learned these 8 simple descriptive statistics concepts, then I’m happy: I got the result I was looking for.

If not, feel free to comment below or reach out to me on Twitter or LinkedIn.

For now, enjoy your day.

Marco
