Statistics — Deep dive — Part3

Pavan Ebbadi
Published in Analytics Vidhya
8 min read · Sep 2, 2020

Part 2 covered frequency distributions and histograms.
Part 4 covers sample space, probability, permutations and combinations.

In this part we will look at
1. Measures of central tendency
2. Measures of variation and position
3. Exploratory data analysis

When describing data, measures of central tendency and variation play an important role. Consider the following statements:

A. The average household income is $50,000.
B. More than 40% of the people gathered in the procession were teenagers.
C. 50% of all products bought in cities today are purchased online.
Though these statements are plain, they convey a lot of information. Being able to measure central tendency is an important and basic skill in statistics.

Statistic: a measure obtained by using the data values from a sample.
Eg: Average sales made by a sales agent on 10 randomly chosen days.
Parameter: a measure obtained by using all the data values of a population.
Eg: Average sales per day made by a sales agent throughout his time on the job.

Measures of central tendency

  1. Mean : The mean, also known as the arithmetic average, is found by adding the values of the data and dividing by the total number of values.

Mean height of a family whose 5 members' heights are as below:
5', 4.5', 3', 6', 3.5' is (5+4.5+3+6+3.5)/5 = 22/5 = 4.4 ft
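The arithmetic above can be double-checked with Python's standard statistics module. This is only a small sketch; the variable names are illustrative.

```python
from statistics import mean

# Heights (in feet) of the five family members from the example above
heights_ft = [5, 4.5, 3, 6, 3.5]

# Sum of the values divided by the number of values
family_mean = mean(heights_ft)
print(family_mean)  # 4.4
```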

To find the mean of grouped data, take the midpoint of each class and multiply it by the class frequency.

Given each class and its frequency (f), compute the class midpoint (Xm) and the product f*Xm, which gives the total for that class. Sum these products and divide by the sum of the frequencies to get the mean.
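The grouped-data procedure can be sketched in a few lines of Python. The class limits and frequencies below are hypothetical, made up only to illustrate the steps, and the function name grouped_mean is an assumption of this sketch.

```python
def grouped_mean(classes):
    """Mean of grouped data.

    classes: list of (lower_limit, upper_limit, frequency) tuples.
    """
    total_f = 0
    total_f_xm = 0.0
    for lower, upper, f in classes:
        xm = (lower + upper) / 2   # class midpoint Xm
        total_f += f               # running sum of frequencies
        total_f_xm += f * xm       # running sum of f * Xm
    return total_f_xm / total_f

# Hypothetical classes: 1-5 (f=2), 6-10 (f=5), 11-15 (f=3)
hypothetical_classes = [(1, 5, 2), (6, 10, 5), (11, 15, 3)]
print(grouped_mean(hypothetical_classes))  # (6 + 40 + 39) / 10 = 8.5
```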

2. Median: the midpoint of the data array, also known as the 50th percentile of the data. The data must be arranged in order before the median is found.

Eg: Find the median for the daily vehicle pass charge for five U.S. National Parks. The costs are $25, $15, $15, $20, and $15.
Order the data: 15, 15, 15, 20, 25. The midpoint of the data array is 15.
If the number of values is even, there is no single midpoint, so average the 2 middle numbers.

The midrange is a rough estimate of the middle. It is found by adding the lowest and highest values in the data set and dividing by 2.
2, 3, 6, 8, 4, 1
Midrange = (1+8)/2 = 4.5
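Both the median and the midrange are easy to verify in Python. The sketch below uses the standard statistics module for the median and computes the midrange by hand; the variable names are illustrative.

```python
from statistics import median

# Daily vehicle pass charges from the national parks example
park_fees = [25, 15, 15, 20, 15]
park_median = median(park_fees)   # median() sorts the data internally
print(park_median)                # 15

# Data set from the midrange example (even number of values)
values = [2, 3, 6, 8, 4, 1]
even_median = median(values)      # average of the two middle values, 3 and 4
midrange = (min(values) + max(values)) / 2
print(even_median, midrange)      # 3.5 4.5
```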

3. Weighted mean is a variation of the mean in which each value is multiplied by its weight, and the sum of these products is divided by the sum of the weights.
Eg: If you purchase milk from 3 shops where the price/gal is $2, $3 and $4 respectively, the average price you paid per gallon is not necessarily $3.
Assume you bought 1 gal, 2 gal and 3 gal from the three shops respectively; the average price paid per gallon is (1x2+2x3+3x4)/(1+2+3) = 20/6 ≈ 3.33.
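The milk example can be reproduced with a short weighted-mean helper. This is a minimal sketch; the function name weighted_mean is chosen here for illustration.

```python
def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

prices_per_gal = [2, 3, 4]   # price at each of the three shops
gallons_bought = [1, 2, 3]   # weights: gallons bought at each shop

avg_price = weighted_mean(prices_per_gal, gallons_bought)
print(round(avg_price, 2))   # 3.33
```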

4. Mode is the value that occurs most often, i.e. the value with the highest frequency. A dataset can have more than one mode, or no mode at all.
Eg: Mode of 2,3,3,4 is 3
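A small sketch of finding the mode with a frequency count. It assumes a non-empty list and follows the convention that a dataset in which every value occurs only once has no mode; the helper name modes is illustrative.

```python
from collections import Counter

def modes(values):
    """Return all values that share the highest frequency.

    Returns an empty list when every value occurs exactly once,
    which this sketch treats as "no mode".
    """
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []
    return [value for value, count in counts.items() if count == top]

print(modes([2, 3, 3, 4]))   # [3]
print(modes([1, 1, 2, 2]))   # [1, 2]  (two modes)
print(modes([1, 2, 3]))      # []      (no mode)
```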

Measures of Variation

A coach wants to select 5 tall kids for the athletics squad. He considers the groups below (heights in ft)
Group A: 3.8 , 4.2 , 5 , 5.5 , 7.5 (crazy tall guy )
Group B: 5.1 , 5 , 5.2 , 5.3 , 5.4
Though Groups A and B have the same average height (5.2 ft), the heights in Group B are much more evenly distributed. This is a small data set, so the high variation in Group A is visible at a glance. To measure variation in data we use the range, variance and standard deviation.

  1. Range is the highest value minus the lowest value.
    Range for above examples :
    Group A : 7.5 - 3.8 = 3.7
    Group B : 5.4 - 5 = 0.4
  2. Population variance is the average of the squares of the distance each value is from the mean. It is denoted by the Greek letter sigma squared (σ²): σ² = Σ(X - μ)² / N, where μ is the population mean and N is the number of values in the population.

3. Population standard deviation is the square root of the variance. Its symbol is sigma (σ): σ = √σ².

Problem: Mean, Variance and SD calculation
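As a worked illustration, here is a minimal sketch that computes the population variance and standard deviation for Groups A and B from the coach example, treating each group as a full population. The function name is chosen for this sketch.

```python
from math import sqrt

def population_variance(values):
    """Average squared distance from the mean (divide by N)."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

group_a = [3.8, 4.2, 5, 5.5, 7.5]
group_b = [5.1, 5, 5.2, 5.3, 5.4]

var_a = population_variance(group_a)
var_b = population_variance(group_b)
sd_a = sqrt(var_a)   # standard deviation is the square root of the variance
sd_b = sqrt(var_b)

print(round(var_a, 3), round(sd_a, 3))   # 1.676 1.295
print(round(var_b, 3), round(sd_b, 3))   # 0.02 0.141
```

Both groups have a mean of 5.2 ft, yet Group A's standard deviation is roughly nine times that of Group B.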

4. Sample variance and sample standard deviation: When the population variance formula is applied to a sample, it does not give the best estimate of the population variance; it usually underestimates it. Therefore, instead of dividing by n, compute the sample variance by dividing by n - 1: s² = Σ(X - X̄)² / (n - 1). The sample standard deviation is the square root of this value.
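Python's standard statistics module exposes both versions, which makes the difference between the two divisors easy to see. The sketch below reuses the Group A heights, first treated as a population and then as a sample.

```python
from statistics import pvariance, variance, stdev

group_a = [3.8, 4.2, 5, 5.5, 7.5]

pop_var = pvariance(group_a)     # divides by n
sample_var = variance(group_a)   # divides by n - 1
sample_sd = stdev(group_a)       # square root of the sample variance

print(round(pop_var, 3))      # 1.676
print(round(sample_var, 3))   # 2.095
print(round(sample_sd, 3))    # 1.447
```

The sample variance is larger because the same sum of squared deviations (8.38) is divided by 4 instead of 5.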

Problem: Variance and Standard Deviation for Grouped Data

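Coefficient of variation: a statistic that allows you to compare standard deviations when the units are different. It is the standard deviation divided by the mean, expressed as a percentage: CVar = (s / X̄) x 100%.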
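One way to see why the coefficient of variation works across units is to compute it for the same data in two different units. The sketch below assumes the Group A heights are a sample and converts them from feet to inches; the helper name is illustrative.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return stdev(values) / mean(values) * 100

heights_ft = [3.8, 4.2, 5, 5.5, 7.5]
heights_in = [h * 12 for h in heights_ft]   # same data, different unit

cv_ft = coefficient_of_variation(heights_ft)
cv_in = coefficient_of_variation(heights_in)

print(round(stdev(heights_ft), 2), round(stdev(heights_in), 2))  # SDs differ
print(round(cv_ft, 1), round(cv_in, 1))                          # 27.8 27.8
```

The standard deviation changes with the unit, while the coefficient of variation stays the same, so it can be compared across variables measured on different scales.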

The range can be used to approximate the standard deviation. This approximation is called the range rule of thumb: s ≈ Range/4.
Eg: 5, 8, 9, 11, 18
Range = 18 - 5 = 13
S.D ≈ Range/4 = 13/4 = 3.25
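To get a feel for how rough this rule is, the sketch below compares the range-based estimate with the sample standard deviation computed by Python's statistics module for the same five numbers.

```python
from statistics import stdev

data = [5, 8, 9, 11, 18]

data_range = max(data) - min(data)   # 18 - 5 = 13
rule_of_thumb_sd = data_range / 4    # range rule of thumb
actual_sd = stdev(data)              # sample standard deviation (n - 1)

print(rule_of_thumb_sd)              # 3.25
print(round(actual_sd, 3))           # 4.868
```

For this small data set the estimate is noticeably lower than the computed value, so the rule is best treated as a quick sanity check rather than a substitute for the calculation.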

Chebyshev’s Theorem

The proportion of values from a data set that fall within k standard deviations of the mean is at least 1 - 1/k², where k is a number greater than 1.

For example, the theorem states that at least 75% of the data values fall within 2 standard deviations of the mean of the data set. This result is found by substituting k = 2 into the expression.
1 - 1/k² = 1 - 1/2² = 0.75 or 75%
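Chebyshev's bound is simple to compute and to check against real numbers. The sketch below evaluates the bound for a given k and measures the actual share of values within k standard deviations for the Group A heights, treated as a population. Both function names are assumptions of this sketch.

```python
from statistics import mean, pstdev

def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2

def proportion_within(values, k):
    """Observed proportion of values within k standard deviations."""
    mu, sigma = mean(values), pstdev(values)
    inside = [x for x in values if abs(x - mu) <= k * sigma]
    return len(inside) / len(values)

group_a = [3.8, 4.2, 5, 5.5, 7.5]

print(chebyshev_lower_bound(2))             # 0.75
print(round(chebyshev_lower_bound(3), 3))   # 0.889
print(proportion_within(group_a, 2))        # 1.0
```

The observed proportion (100%) is above the guaranteed minimum (75%), as the theorem requires.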

The Empirical (Normal) Rule

Chebyshev’s theorem applies to any distribution regardless of its shape. However, when a distribution is bell-shaped (or what is called normal), the following statements, which make up the empirical rule, are true.

  1. Approximately 68% of the data values will fall within 1 standard deviation of the mean.
  2. Approximately 95% of the data values will fall within 2 standard deviations of the mean.
  3. Approximately 99.7% of the data values will fall within 3 standard deviations of the mean.
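The three percentages can be recovered from the standard normal distribution. The sketch below uses NormalDist from Python's statistics module (available from Python 3.8); the helper name share_within is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Probability that a normal value lies within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

Compared with Chebyshev's guarantee of at least 75% within 2 standard deviations, the normal distribution gives a much tighter figure of about 95%.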

Measures of Position

“You can’t compare apples and oranges.” But with the use of statistics, it can be done to some extent. Suppose that a student scored 90 on a music test and 45 on an English exam. Direct comparison of raw scores is impossible, since the exams might not be equivalent in terms of number of questions, value of each question, and so on. However, a comparison of a relative standard similar to both can be made. This comparison uses the mean and standard deviation and is called a standard score or z score. (We also use z scores in later chapters.)

A standard score or z score tells how many standard deviations a data value is above or below the mean for a specific distribution of values. If a standard score is zero, then the data value is the same as the mean.

A z score or standard score for a value is obtained by subtracting the mean from the value and dividing the result by the standard deviation. The symbol for a standard score is z. The formula is

z = (value - mean) /standard deviation

For samples, the formula is z = (X - X̄)/s
For populations, the formula is z = (X - μ)/σ

The z score represents the number of standard deviations that a data value falls above or below the mean.
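To make the music versus English comparison concrete, here is a small sketch. The class means and standard deviations below are hypothetical, since the example does not give them; only the two raw scores (90 and 45) come from the text.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations a value lies above or below the mean."""
    return (value - mean) / std_dev

# Hypothetical class statistics, assumed only for illustration
music_z = z_score(90, mean=80, std_dev=10)
english_z = z_score(45, mean=35, std_dev=5)

print(music_z)     # 1.0
print(english_z)   # 2.0
```

Under these assumed figures, the English score is relatively better: it sits 2 standard deviations above its class mean, while the music score sits only 1 above.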

Percentile calculation: calculating a percentile is simple, yet it is an important skill in statistics.
Sort the data in ascending order, then compute the position c = (n x p)/100
Where n = total number of values in the array
p = percentile required
*. If c is a whole number, the percentile is the average of the cth and (c+1)th values
*. If c is not a whole number, round it up to the next whole number and take the value at that position

Eg: Calculate the 30th percentile of the array of 20 numbers below
[3,4,1,7,6,8,12,11,14,18,2,5,13,9,19,16,10,17,15,20]
Step 1: Arrange the numbers in ascending order
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
Step 2: Identify the total count and the percentile required, and substitute them into the formula
n = 20
p = 30
c = (n x p)/100 = 20x30/100 = 6
Step 3: Since c is a whole number, average the 6th and 7th numbers, which are 6 and 7: (6+7)/2 = 6.5
So 6.5 is the 30th percentile
Similarly, 5.5 is the 25th percentile
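The steps above translate directly into code. The sketch below follows this article's position formula; it is not the interpolation method that libraries such as NumPy use by default, so results can differ slightly from theirs. The function name percentile_value is an assumption of this sketch.

```python
def percentile_value(values, p):
    """Value at the p-th percentile using c = n * p / 100.

    Assumes 0 < p < 100 so that the position stays inside the data.
    """
    if not 0 < p < 100:
        raise ValueError("p must be between 0 and 100 (exclusive)")
    data = sorted(values)               # Step 1: ascending order
    n = len(data)
    c, remainder = divmod(n * p, 100)   # Step 2: position c
    c = int(c)
    if remainder == 0:
        # c is a whole number: average the c-th and (c+1)-th values
        return (data[c - 1] + data[c]) / 2
    # c is not whole: round up and take the value at that position
    return data[c]

numbers = [3, 4, 1, 7, 6, 8, 12, 11, 14, 18,
           2, 5, 13, 9, 19, 16, 10, 17, 15, 20]

print(percentile_value(numbers, 30))   # 6.5
print(percentile_value(numbers, 25))   # 5.5
```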

Quartiles, Interquartile Range and Outliers

Quartiles divide the distribution into four groups, separated by Q1, Q2 and Q3, where Q1 is the 25th percentile, Q2 is the 50th percentile and Q3 is the 75th percentile.

Q2 is the same as the median.
Q1 is the median of the values below Q2, and Q3 is the median of the values above Q2.

Interquartile range (IQR) is defined as the difference between Q3 and Q1 (IQR = Q3 - Q1) and is the range of the middle 50% of the data.

Outliers: extremely high or extremely low values in a dataset are called outliers. Any value below Q1 - 1.5*IQR or above Q3 + 1.5*IQR can be treated as an outlier.
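Here is a minimal sketch of the quartile, IQR and outlier steps. It uses the median-of-halves approach for Q1 and Q3 and the common 1.5 x IQR fence rule for outliers. The data set is hypothetical and the function names are illustrative.

```python
from statistics import median

def quartiles(values):
    """Q1, Q2, Q3 using the median of the lower and upper halves.

    Assumes at least two values. For an odd count, the middle value
    is left out of both halves.
    """
    data = sorted(values)
    half = len(data) // 2
    return median(data[:half]), median(data), median(data[-half:])

def find_outliers(values):
    """Values outside Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in values if x < low_fence or x > high_fence]

# Hypothetical data with one unusually large value
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 40]

print(quartiles(sample))       # (3, 5.5, 8)
print(find_outliers(sample))   # [40]
```

Here IQR = 8 - 3 = 5, so the fences are 3 - 7.5 = -4.5 and 8 + 7.5 = 15.5, and only 40 falls outside them.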

Exploratory data analysis (EDA)

Exploratory data analysis, developed by John Tukey, organizes data with a stem and leaf plot. It uses the median as the measure of central tendency, the IQR as the measure of variation, and the boxplot (also called a box and whisker plot) to graphically show the spread of the data.
The popular five-number summary shown in a boxplot consists of
1. Lowest value in the dataset
2. Q1
3. Median
4. Q3
5. Highest value of dataset
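The five values listed above can be collected with a short helper. This sketch computes Q1 and Q3 as the medians of the lower and upper halves, which is one of several quartile conventions, and reuses the 20 numbers from the percentile example. The function name is illustrative.

```python
from statistics import median

def five_number_summary(values):
    """(lowest, Q1, median, Q3, highest); assumes at least two values."""
    data = sorted(values)
    half = len(data) // 2
    return (
        data[0],                # lowest value
        median(data[:half]),    # Q1: median of the lower half
        median(data),           # median (Q2)
        median(data[-half:]),   # Q3: median of the upper half
        data[-1],               # highest value
    )

numbers = [3, 4, 1, 7, 6, 8, 12, 11, 14, 18,
           2, 5, 13, 9, 19, 16, 10, 17, 15, 20]

print(five_number_summary(numbers))   # (1, 5.5, 10.5, 15.5, 20)
```

These are the five values a boxplot draws: the ends of the whiskers, the edges of the box, and the line inside it.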

Summary of concepts covered in this part
• Measures of central tendency
Mean, median, mode

• Measures of variation
Range, variance, standard deviation

• Difference between population and sample standard deviation

• Chebyshev’s theorem for the proportion of data within k standard deviations of the mean

• Measures of position: z score

• Percentiles, IQR, outliers and the five-number summary

Reference: Elementary Statistics — Bluman

Pavan Ebbadi
Analytics Vidhya

Senior Advisor of Analytics at CVS Caremark. Leading a team to build Personalization engine for CVS customers with stats, machine learning and deep learning.