Highlights of Descriptive Statistics

Personal notes and a deep dive into the concepts of descriptive statistics used to summarize and organize data in an easily understood manner.

Julia Jakubczak
29 min read · Jan 23, 2023

Motivation

Learning has always been an important part of my life, above all for the joy it brings me. However, I have noticed that whenever acquired knowledge is not used frequently, it slowly starts to fade, so I believe in refreshing learned concepts. Furthermore, I enjoy diving deep into subjects and understanding them from the ground up by doing research and getting different perspectives. On top of that, certain terms often appear under different names in other sources, and putting them together in one place helps bring clarity. This article is primarily meant to be a form of personal notes; however, I believe it can be beneficial to a broader audience as well.

Disclaimer: I skip the most fundamental concepts like mean, median, mode, standard deviation, and others since those concepts are commonly used. A great article that touches on that and can be read complementary to this one is “Understanding Descriptive Statistics” by Sarang Narkhede.

Legally obtained from Shutterstock: https://www.shutterstock.com/image-vector/futuristic-3d-isometric-data-graphic-business-1090127810

Introduction

Descriptive statistics involves summarizing and organizing the data in an easily understood way. Descriptive statistics, unlike inferential statistics, simply describes the data but does not try to generalize beyond the data nor attempt to make inferences from the sample to the whole population.¹⁻²

An example of using descriptive statistics would be describing starting salaries of college majors by calculating the mean salary and the range for each type of major. An additional metric could be the percentage of college graduates by major who obtain jobs within six months of graduation.¹

Descriptive statistics are broken down into two categories:

  • measures of central tendency (mean or average, median, mode),
  • and measures of variability or spread (standard deviation, z-scores, mean deviation or mean absolute deviation, variance, coefficient of variation, covariance, range, percentile, quartiles, interquartile range, skewness, skewness coefficient, kurtosis, correlation).²

Trimmed (Truncated) Mean

With the arithmetic mean, median, and mode being used commonly, the trimmed (truncated) mean does not get as much spotlight. In a perfect — normally distributed — world, using a traditional mean gives a fairly reasonable picture of the data; however, when outliers are present, things get more difficult. The main advantage of the trimmed mean is robustness and higher efficiency (quality) for mixed distributions and heavy-tailed distributions, at the cost of lower efficiency for some other, less heavily tailed distributions (such as the normal distribution). The trimmed mean method helps in dealing with extreme observations by removing an equal percentage or number of the highest and lowest values in a given dataset before calculating the mean.³

A definition of a trimmed mean according to Investopedia⁴:

A trimmed mean (similar to an adjusted mean) is a method of averaging that removes a small designated percentage of the largest and smallest values before calculating the mean. After removing the specified outlier observations, the trimmed mean is found using a standard arithmetic averaging formula. The use of a trimmed mean helps eliminate the influence of outliers or data points on the tails that may unfairly affect the traditional or arithmetic mean. Trimmed means are used in reporting economic data in order to smooth the results and paint a more realistic picture.

The formula for a trimmed mean, where:

  • p — equals the number of truncated observations on one side (highest or lowest values),
  • n — equals the number of observations,
  • X — equals the value of the particular observation,
  • i — equals the number of the particular observation

is the following:

LaTeX: \bar{X} = \frac{\sum_{i=p+1}^{n-p} X_i}{n-2p}
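As a quick illustration of the formula above, here is a minimal Python sketch on made-up data (the values and the single extreme observation are hypothetical); `scipy.stats.trim_mean` implements the same idea using a proportion to cut from each tail:

```python
import numpy as np
from scipy import stats

data = [2, 4, 5, 5, 6, 7, 8, 9, 10, 95]  # hypothetical sample with one outlier

# Manual trimmed mean: drop p observations from each end of the sorted data
def trimmed_mean(values, p):
    x = np.sort(np.asarray(values, dtype=float))
    n = len(x)
    return x[p:n - p].sum() / (n - 2 * p)

print(np.mean(data))               # arithmetic mean, pulled up by the outlier
print(trimmed_mean(data, p=1))     # drop the single lowest and highest value
print(stats.trim_mean(data, 0.1))  # scipy: trim 10% from each tail (same result here)
```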

Mean of Group Frequencies and Standard deviation of Group Frequencies

Mean of Group Frequencies — Midpoint Method

Using a mean of grouped frequencies is useful to get an approximation of the sample mean in case the original data is missing or instead a frequency table or a histogram is provided. The higher the count or the frequency in each group or bin and the narrower the bin range, the better the approximation will be.⁵

The formula for a mean of group frequencies, where:

  • Mᵢ​ — equals the midpoint of the i-th range of grouped data,
  • Fᵢ — equals the frequency of the i-th range,
  • n — equals the number of samples (sum of all frequencies).⁶
LaTeX: \mu = \frac{\sum (M_iF_i)}{n}

Why does this method work? Each bin can be thought of as its own sample of the overall data, containing only the values that fall in that bin. If we were to take the mean of just the data in that bin, the result would be close to the midpoint, or center, of that bin — that’s why the midpoint can be used as an approximation: the estimate is that, within each bin, the values converge on average around the midpoint. The frequency then weights the midpoint for each bin — bins are weighted according to the number of observations in them — so using the midpoint as a stand-in for the observations and taking the weighted mean of those group midpoints is the way to go.⁵

Standard Deviation of Group Frequencies

Variance and standard deviation measure the dispersion of the data — how much variation there is from the mean.

A few things that help to get grouped data calculations extremely close to the actual data values:

  • a large dataset,
  • a huge sample, so that each bin has adequate representation (weighting) in it,
  • bins that are similar in the number of observations they contain.⁷

After calculating all midpoints and the mean, the variance of grouped data needs to be found and ultimately the standard deviation.⁴

Here is the formula for sample variance of group frequencies, where:

  • Fᵢ — equals the frequency of the i-th range,
  • Mᵢ​​ — equals the midpoint of the i-th range,
  • n — equals the number of samples (sum of all frequencies),
  • μ — represents the mean.⁴
LaTeX: \sigma² = \frac{\sum (F_iM_i²)-(n\mu²)}{n-1}

The standard deviation equals the square root of the variance:

LaTeX: \sigma = \sqrt{\sigma²}
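The three formulas above translate directly into a few lines of Python. The bin edges and frequencies below are hypothetical values for illustration; the midpoints stand in for the unknown raw observations:

```python
import numpy as np

# Hypothetical frequency table: bin edges and the count in each bin
edges = np.array([0, 10, 20, 30, 40])   # four bins: 0-10, 10-20, 20-30, 30-40
freq  = np.array([5, 12, 9, 4])         # F_i, frequency of each bin

mid = (edges[:-1] + edges[1:]) / 2      # M_i, midpoint of each bin
n = freq.sum()                          # total number of observations

mean = (mid * freq).sum() / n                            # grouped mean: sum(M_i * F_i) / n
var  = ((freq * mid**2).sum() - n * mean**2) / (n - 1)   # grouped sample variance
std  = np.sqrt(var)                                      # grouped standard deviation

print(mean, var, std)
```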

Geometric Mean & Geometric Standard Deviation

Geometric mean and standard deviation are extremely useful, especially in business when you’re dealing with rates of return on investments, or other types of financial instruments, but it’s also useful in other disciplines like biology, medicine, agriculture, or any other discipline while dealing with growth rates over periods of time.⁸

Geometric Mean

Given a square with each side of length 4, and thus an area of 16, what is the average length of each side? In this case, we are dealing with two sides to find the area: the two sides sum to 4 + 4 = 8, and 8 divided by 2 gives 4. And obviously, 4 squared equals 16 — the original area of 16 units.⁸

However, what happens when we are not dealing with a perfect square? Take a rectangle where one side is 1.3 units and the other is 2.9 units, giving an area of 3.77. What about the average length of the sides? The arithmetic average of those sides is 2.1, and squaring 2.1 gives 4.41 — which is not the same as the original area of 3.77. In order to represent the original area of this rectangle using a single squared value, we first multiply the sides (2.9 x 1.3 = 3.77), then take the square root of this value (~1.94) and square it. Now we get the area we expect — 3.77.⁸

If we want to create a perfect square of average side length, that is the same area as the rectangle, we have to multiply those sides together, then take the square root, not the average, to get that side length squared, and then we get the area in return.⁸

A similar idea applies in three dimensions. For a three-dimensional box with sides of 4, 6, and 9, the volume is 216 units. Because of the added dimension, we take the cube root of the product of the sides (the cube root of 216 is 6), and cubing that result recovers the original volume of 216.⁸

LaTeX: \bar{x}_g=\sqrt[n]{(x_1)(x_2)…(x_n)}

Using the natural logarithm method will also lead to the same results:

LaTeX: \log(\bar{x}_g)=\frac{\log(x_1)+\log(x_2)+…+\log(x_n)}{n}

The arithmetic mean is only suitable for additive processes, whereas the geometric mean can be applied to multiplicative processes (multiplying in sequence). All values for the geometric mean must be positive — that’s why growth factors are used. While the geometric mean is often used in business for financial growth and investment performance, it is also useful for any measure that records growth. In the case of a finance professional wanting to know the average of a given series of growth rates, a biologist wanting to know the growth rates of bacteria, or a farmer who needs the growth rates of crops — a geometric mean instead of an arithmetic mean must be applied.
While mathematically valid, using unequal periods is not recommended. Keeping the period standard, whether days, weeks, years, trials, or other — the measure of time between each period should be uniform.⁸
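As a minimal sketch of both formulas above, here are a few hypothetical yearly growth factors (a 5% gain, a 2% loss, a 1% gain); `scipy.stats.gmean` gives the same result:

```python
import numpy as np
from scipy import stats

growth = np.array([1.05, 0.98, 1.01])   # hypothetical growth factors, all positive

gm_direct = np.prod(growth) ** (1 / len(growth))   # n-th root of the product
gm_logs   = np.exp(np.mean(np.log(growth)))        # mean of logs, then exponentiate back
print(gm_direct, gm_logs, stats.gmean(growth))     # all three agree
```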

Geometric Standard Deviation

The geometric standard deviation describes the spread out of a set of numbers whose preferred average is the geometric mean. For such data, it may be preferred to the more usual standard deviation. Note that, unlike the usual arithmetic standard deviation, the geometric standard deviation is a multiplicative factor, and thus is dimensionless, rather than having the same dimension as the input values. Thus, the geometric standard deviation may be more appropriately called the geometric SD factor.⁹

When using the geometric SD factor in conjunction with the geometric mean, it should be described as the range from (the geometric mean divided by the geometric SD factor) to (the geometric mean multiplied by the geometric SD factor); one cannot add or subtract the geometric SD factor to or from the geometric mean.¹⁰

LaTeX: \log(\sigma_g)=\sqrt\frac{\sum(\log(X_i)-\log(\mu_g))²}{n}

The natural log of the geometric standard deviation equals the square root of the sum of the squared deviations of the natural log of each observation (Xᵢ) from the natural log of the geometric mean, divided by the number of observations (n).⁸

In the example of financial data, the first growth factor was 1.05, and the geometric mean was 1.0128. First, the difference between the natural log of 1.05 and the natural log of 1.0128 is squared; the same is then done for the remaining observations. As a result, we end up with a natural log of the geometric standard deviation of about 0.033, which exponentiates to 1.0333, or 3.33% — the geometric standard deviation for the given data.⁸
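Continuing the sketch from the geometric mean section, the geometric SD factor can be computed by exponentiating the standard deviation of the logs. The growth factors below are hypothetical, not the data from the cited example:

```python
import numpy as np

growth = np.array([1.05, 0.98, 1.01])   # hypothetical growth factors

log_g = np.log(growth)
gm  = np.exp(log_g.mean())                                    # geometric mean
gsd = np.exp(np.sqrt(((log_g - log_g.mean()) ** 2).mean()))   # geometric SD factor

# The spread is described multiplicatively, from gm / gsd to gm * gsd
print(gm, gsd, gm / gsd, gm * gsd)
```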

Interquartile Range (IQR) and Box (Whisker) Plots

Box (whisker) plots are one of the key techniques used to visualize the shape of data. It is a simple way to visualize the shape of the data (normally distributed or skewed), identify outliers, and compare the characteristics of data between categories.¹¹

In order to describe this in detail, we first need to discuss what we mean by quartiles. First, the values are sorted from smallest to largest, with the minimum value on the left and the maximum value on the right. To find the quartiles, the data is divided into four equal pieces, so that the middle is the 50th percentile (the median), meaning half of the values are below and half of the values are above that point. The quartiles further divide those halves into two parts — the 25th percentile (first quartile) and the 75th percentile (third quartile). Between the first quartile and the third quartile lies the IQR (interquartile range) — the middle 50% of the data. The IQR is used when the median is used as the measure of central tendency, and it gives the range in which the middle 50% of the distribution lies.¹¹

Traditional Box Plot

In a traditional box plot, the box contains the interquartile range (the middle 50% of the data), starting at the first quartile and ending at the third quartile, with the median inside, and with lower and upper limits that are, respectively, the minimum and the maximum value.¹¹ The traditional box plot is constructed from five numbers: the minimum value, the first quartile, the median, the third quartile, and the maximum value; therefore, it is sometimes also called a five-number summary boxplot. The limitation of this type of box plot is that there is no mechanism for identifying outliers, since the whiskers are defined as the minimum and the maximum, and the entire data will be encompassed no matter what.¹²

In the article “How to Compare Two or More Distributions”, Matteo Courthoud also highlights that:

The boxplot is a good trade-off between summary statistics and data visualization. (…) Therefore, the boxplot provides both summary statistics (the box and the whiskers) and direct data visualization (the outliers).¹³

Modified Box Plot

As stated by The Pennsylvania State University:

In a modified box plot, the box is drawn just as in a standard box plot, but the whiskers are defined differently. For a modified box plot, the whiskers are the lines that extend from the left and right of the box to the adjacent values. The adjacent values are defined as the lowest and highest observations that are still inside the region defined by the following limits:

Lower Limit: Q1−1.5×IQR
Upper Limit: Q3+1.5×IQR

(…) [observations less than the lower adjacent value or] greater than the upper adjacent value are thus deemed outliers.¹⁴
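A minimal Python sketch of the quantities behind a modified box plot (quartiles, IQR, whisker limits, flagged outliers), using `numpy.percentile` on a hypothetical sample:

```python
import numpy as np

data = np.array([3, 7, 8, 5, 12, 14, 21, 13, 18, 45])   # hypothetical sample, 45 looks suspect

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

lower_limit = q1 - 1.5 * iqr          # whisker limits of a modified box plot
upper_limit = q3 + 1.5 * iqr

outliers = data[(data < lower_limit) | (data > upper_limit)]
print(q1, median, q3, iqr, lower_limit, upper_limit, outliers)
```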

Kurtosis & Skewness

The main difference between skewness and kurtosis is that skewness refers to the degree of symmetry, whereas kurtosis refers to the degree of presence of outliers in the distribution.¹⁵

Kurtosis

Kurtosis is a measure that helps to get a better understanding of the tails of the distribution (whether the data are heavy-tailed or light-tailed relative to a normal distribution).¹⁶ A positive kurtosis indicates heavy (long) tails, a profusion of outliers, and peakedness relative to the normal distribution, whereas a negative kurtosis indicates light (short) tails, lack of outliers, and flatness.¹⁷

The formula for kurtosis is as follows:

LaTeX: Fisher’s Kurtosis=\frac{\sum_{i=1}^{N}(X_i-\mu)⁴}{N\sigma⁴}-3

Three common errors are: (1) kurtosis is defined solely in terms of peakedness, with no mention of the importance of the tails; (2) the relation between the peak and the tails of a distribution with excess kurtosis is described or illustrated incorrectly; and (3) descriptions and illustrations of kurtosis fail to distinguish between kurtosis and the variance.¹⁸

A number of textbooks do not distinguish between kurtosis and variance. Positive and negative kurtosis is sometimes described as indicating large or small variances, respectively. However, the kurtosis measure (β₂–3) is scaled with respect to the variance, so it is not affected by it (it is scale-free). Kurtosis reflects the shape of a distribution apart from the variance.¹⁸

A normal distribution is a mesokurtic distribution that has a kurtosis of 3, and β₂–3 is often used so that the reference normal distribution has a kurtosis of zero (β₂–3 is sometimes denoted as γ). A positive (leptokurtic) distribution, where β₂–3 > 0, has a higher peak than the normal distribution and has heavier (long) tails. And a negative (platykurtic) distribution, where β₂–3 < 0, has a lower peak than a normal distribution and lighter (short) tails.¹⁹

Edited by the author. Source: https://www.analyticsvidhya.com/blog/2021/05/shape-of-data-skewness-and-kurtosis/
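These three categories can be checked numerically; `scipy.stats.kurtosis` returns the excess (Fisher) kurtosis, β₂ − 3, by default. The samples below are randomly generated purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal  = rng.normal(size=100_000)     # mesokurtic: excess kurtosis near 0
laplace = rng.laplace(size=100_000)    # leptokurtic: heavy tails, excess kurtosis near 3
uniform = rng.uniform(size=100_000)    # platykurtic: light tails, excess kurtosis near -1.2

for name, x in [("normal", normal), ("laplace", laplace), ("uniform", uniform)]:
    print(name, stats.kurtosis(x))     # fisher=True by default, i.e. beta2 - 3
```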

Skewness

The preferred measure of central tendency often depends on the shape of the distribution. Of the three measures of tendency, the mean is most heavily influenced by any outliers or skewness. For distributions that have outliers or are skewed, the median is often the preferred measure of central tendency because the median is more resistant to outliers than the mean.²⁰

In a symmetrical distribution, the mean, median, and mode are all equal. In these cases, the mean is often the preferred measure of central tendency.²⁰

Conceptually, skewness describes which side of a distribution has a longer tail. If the long tail is on the right, then the skewness is rightward or positive; if the long tail is on the left, then the skewness is leftward or negative.²¹

Right skewness is common when a variable is bounded on the left but unbounded on the right. For example, durations (response time, time to failure) typically have right skewness since they cannot take values less than zero; many financial variables (income, wealth, prices) typically have right skewness since they rarely take values less than zero; and adult body weight has right skewness since most people are closer to the lower limit than to the upper limit of viable body weight. Well-known right-skewed distributions include the Poisson, chi-square, exponential, lognormal, and gamma distributions.²¹

Left skewness is less common in practice, but it can occur when a variable tends to be closer to its maximum than its minimum value. For example, scores on an easy exam are likely to have left skewness, with most scores close to 100% and lower scores tailing off to the left. We are not aware of any widely used distributions that always have left skewness.²¹

There are several distributions that can have either right or left skew depending on their parameters; these ambidextrous distributions include the binomial and the beta.²¹

Source: https://www.oreilly.com/library/view/clojure-for-data/9781784397180/ch01s13.html
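As a quick numerical check of the statements above, `scipy.stats.skew` applied to randomly generated samples shows the expected signs (the chosen distributions and sample sizes are for illustration only):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
durations = rng.exponential(scale=2.0, size=100_000)   # right-skewed: skewness > 0
easy_exam = rng.beta(a=5, b=1, size=100_000)           # left-skewed: skewness < 0
symmetric = rng.normal(size=100_000)                   # symmetric: skewness near 0

print(stats.skew(durations), stats.skew(easy_exam), stats.skew(symmetric))
```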

Z-score

Z-scores are closely related to the standard deviation. They help us understand how far any given data point is from the mean in terms of distance. A z-score is a measure of distance — how many standard deviations above or below the mean a given data point is (above the mean would be to the right, below the mean to the left). The standard deviation is a ruler of sorts, becoming the measuring unit. The z-score is derived using the standard deviation and the mean.²²

The University of Texas-Houston highlights that:

The z-score is particularly important because it tells you not only something about the value itself, but also where the value lies in the distribution. Typically, for example, if the value is 3 standard deviations above the mean you know it’s three times the average distance above the mean and represents one of the higher scores in the sample. On the other hand, if the value is one standard deviation below the mean then you typically know it is on the low end of the midrange of the values from the sample. But, there is much more that is important about z-scores. In fact, the z-score opens the door to doing statistical inference for quantitative variables.²³

The formula for z-score:

LaTeX: Z=\frac{x-\mu}{\sigma}
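A minimal sketch using made-up exam scores; `scipy.stats.zscore` standardizes a whole sample at once:

```python
import numpy as np
from scipy import stats

data = np.array([62.0, 70.0, 74.0, 81.0, 93.0])   # hypothetical exam scores

mu = data.mean()
sigma = data.std()                 # population standard deviation (ddof=0)

z = (data - mu) / sigma            # how many standard deviations from the mean
print(z)
print(stats.zscore(data))          # same result (zscore also uses ddof=0 by default)
```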

Coefficient of variation

The coefficient of variation is a relative measure of variability. It measures the standard deviation relative to the mean, answering the question of how large the standard deviation is relative to the mean. Informally, some people use the coefficient of dispersion interchangeably with the coefficient of variation.²⁴ The coefficient of variation is calculated by dividing the standard deviation by the mean (usually expressed as a percentage):

LaTeX: CV=\frac{\sigma}{\mu}*100\%
LaTeX: CV=\frac{s}{\bar{x}}*100\%

Since this metric takes the form of a ratio, it is helpful for comparing data that have different means and standard deviations. Using the coefficient of variation, it can be said how one dataset's standard deviation relates to its mean as compared to another's. The coefficient of variation is also unit independent — different datasets can be compared regardless of their units (e.g. Fahrenheit vs Celsius degrees).²⁵
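A small sketch comparing two hypothetical datasets measured on different scales; `scipy.stats.variation` computes the same ratio (without the percentage):

```python
import numpy as np
from scipy import stats

heights_cm = np.array([160.0, 172.0, 181.0, 168.0, 175.0])   # hypothetical heights
weights_lb = np.array([130.0, 185.0, 150.0, 210.0, 165.0])   # hypothetical weights

def cv(x):
    return x.std() / x.mean() * 100          # coefficient of variation as a percentage

print(cv(heights_cm), cv(weights_lb))        # comparable despite different units
print(stats.variation(weights_lb) * 100)     # scipy's equivalent (ddof=0 by default)
```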

Covariance

Covariance is a descriptive measure of the linear association between two variables (a bivariate relationship). If variance is a measure of how dispersed a set of observations of a single variable is, covariance is the extent to which the variance in one variable depends on another variable. The higher the covariance, the stronger the relationship. Covariances can be positive (both variables move in the same direction), negative (the variables move in opposite directions), or, in the case of no relationship, zero. The direction, or sign, of the covariance (positive or negative) is the key part to focus on when interpreting it.²⁵ ²⁶

As pointed out in Will Monroe’s lecture notes for the CS 109 class at Stanford University, if two random variables are independent, their covariance is 0. The reverse is not true in general: if the covariance of two random variables is 0, they can still be dependent.²⁷

The Cartesian coordinate plane is divided into four quadrants, with the x-axis being the horizontal axis and the y-axis the vertical one. Each quadrant has its own property in terms of the signs of the x and y values — in the first quadrant both values are positive, in quadrant three both values are negative, in quadrant two x is negative but y is positive, and in quadrant four x is positive but y is negative. Any chosen point, aside from points on the axes themselves, can be assigned to a quadrant based on the signs of its coordinates.²⁵

Source: https://tutorme.com/blog/post/coordinate-plane-quadrants/

In quadrant one and quadrant three, the signs of the x and y values are the same — in the case of covariance, this means the variables tend to move in the same direction, the slope is positive, and thus the covariance is positive. A positive slope starts at the lower left (quadrant three) and goes up to the right (quadrant one).
In the opposite case, when x decreases while y increases, or x increases while y decreases, the variables move in opposite directions, the slope is negative, and thus the covariance is negative. The negative slope goes through the upper left and the lower right.
Another case is when the points show no pattern and the variables have no linear relationship — the points aren’t grouped along either a positive or a negative slope. In this case, the covariance will probably be near or equal to zero because the points do not exhibit any sort of pattern.²⁵

The formulas for the population and sample covariance are as follows:

LaTeX: cov(X,Y)=\frac{\sum_{i=1}^{N}(x_i-\mu_x)(y_i-\mu_y)}{N}
LaTeX: cov(X,Y)=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{n-1}
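As a check on the sample formula, here is a minimal sketch on hypothetical paired data; `numpy.cov` uses the sample (n - 1) denominator by default:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # hypothetical variable X
y = np.array([1.0, 3.0, 7.0, 9.0, 12.0])   # hypothetical variable Y

n = len(x)
cov_sample = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)   # sample covariance
print(cov_sample)
print(np.cov(x, y)[0, 1])                  # same value from numpy (ddof=1 by default)
```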

Covariance Matrix, Mean Vector, Confidence Ellipse & Kalman Filter

Covariance Matrix & Mean Vector

To summarize datasets consisting of a single feature, we can use the mean, median, and variance; for datasets containing two or more features, we use the covariance and the correlation coefficient.²⁸ The mean vector consists of the means of each variable, and the variance-covariance matrix consists of the variances of the variables along the main diagonal and the covariances between each pair of variables in the other matrix positions. The diagonal values, which correspond to the row and the column of the same variable, display the variance of that variable. The mean vector is often referred to as the centroid and the variance-covariance matrix as the dispersion or dispersion matrix. Also, the terms variance-covariance matrix and covariance matrix are used interchangeably.²⁹

Source: https://towardsdatascience.com/5-things-you-should-know-about-covariance-26b12a0516f1
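For a dataset stored with observations as rows and features as columns, the centroid and the variance-covariance matrix can each be obtained in one step (the numbers below are made up for illustration):

```python
import numpy as np

# Hypothetical dataset: 5 observations, 3 features (columns)
X = np.array([[4.0, 2.0, 0.60],
              [4.2, 2.1, 0.59],
              [3.9, 2.0, 0.58],
              [4.3, 2.1, 0.62],
              [4.1, 2.2, 0.63]])

mean_vector = X.mean(axis=0)              # centroid: mean of each column
cov_matrix  = np.cov(X, rowvar=False)     # variances on the diagonal, covariances off it

print(mean_vector)
print(cov_matrix)
```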

Confidence Ellipse & Kalman Filter

A Kalman Filter is a technique for taking a sequence of observations about an object or variable and determining the most likely current state of that object.³⁰

The covariance matrix used in the Kalman Filter represents the error of a multidimensional Gaussian-distributed data set. The covariance matrix is best understood by visualizing it as an error ellipse or error ellipsoid. The properties of an ellipse that make this visualization possible are its center, major axis, minor axis, and angle of rotation; an ellipse's other properties (e.g. vertices and foci) can also define its shape.³¹

The center of this ellipse is at coordinates (0,0). The major axis of an ellipse is the larger of the two primary axes and the minor axis of an ellipse is the smaller of the two primary axes. These axes are perpendicular to each other and intersect at the center point. In the figure below, the major axis for the ellipse is equal to 2a and the minor axis is equal to 2b. Therefore, the semi-major axis is equal to a and the semi-minor axis is equal to b. The major and minor axes are aligned with the x’ and y’ coordinate frame. The x’-y’ coordinate frame can be thought of as the same as an x-y coordinate frame but rotated at the center by a rotation angle of θ (theta).³¹

Source: https://thekalmanfilter.com/covariance-matrix-explained/

In order to plot a confidence ellipse based on the covariance matrix, the defining characteristics of a confidence ellipse like confidence interval e.g. 95%, the major axis length, the minor axis length, the angle of rotation, and the center of the ellipse need to be identified. As described above, these can be determined by the covariance matrix.³¹

For the covariance matrix without a correlation between X and Y (D), the ellipse based on a 1-sigma or a 68% confidence interval and the other one based on a 2-sigma or 95% confidence interval will respectively look like the blue and the orange ellipse in the picture below.³¹

  • The variance terms align with the magnitude of the semi-major and semi-minor axes or the spread of the data — a and b.
  • The cross terms, σ (sigma) xy, are zero because there is no correlation between x and y. When these terms are equal to 0, then the angle of rotation of the ellipse is also 0. So, if there is no correlation between the two parameters, there is no rotation of the ellipse.³¹

In the case of the covariance matrix with a correlation between X and Y (P), in the picture below, similar to the previous one, the blue ellipse is based on a 1-sigma or 68% confidence interval while the orange ellipse is based on a 2-sigma or 95% confidence interval.³¹

  • The size of the ellipses for both covariance matrices (D and P) are the same but the covariance matrices look different since in the matrix P ellipses are rotated.
  • Variance terms do not align with the magnitude of the semi-major and semi-minor axes (a and b) as they did when the data was not correlated — this is due to the rotation.
  • The cross terms (i.e. sigma xy) are non-zero because there is a correlation between x and y. The angle of rotation is 45 degrees.³¹

The confidence interval is based on the chi-square value for the plotted confidence interval. The major and minor axis can be determined based on the eigenvalues of the matrix. And the angle of rotation can be determined from the eigenvectors of the covariance matrix.³¹
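A hedged sketch of that recipe: eigenvalues of the covariance matrix give the axis lengths, eigenvectors give the rotation, and the chi-square quantile scales the ellipse to the chosen confidence level. The 2x2 matrix and the 95% level below are arbitrary choices for illustration:

```python
import numpy as np
from scipy.stats import chi2

P = np.array([[4.0, 1.5],      # hypothetical 2x2 covariance matrix
              [1.5, 2.0]])     # non-zero cross term, so the ellipse is rotated

confidence = 0.95
scale = chi2.ppf(confidence, df=2)     # chi-square quantile for 2 dimensions

eigvals, eigvecs = np.linalg.eigh(P)   # ascending eigenvalues, orthonormal eigenvectors
semi_minor, semi_major = np.sqrt(scale * eigvals)
angle = np.degrees(np.arctan2(eigvecs[1, 1], eigvecs[0, 1]))  # rotation of the major axis

print(semi_major, semi_minor, angle)
```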

The Kalman Filtering process seeks to discover an underlying set of state variables {xk} for k ∈ [0, n] given a set of measurements {yk}. The Kalman filter uses the prediction of a current state based on a previous estimate in conjunction with a current measurement to estimate the true current state.³²⁻³³

Correlation

Correlation, also known as the correlation coefficient, is a measure of the direction and strength of the relationship between two variables.³⁴ The correlation between two random variables, r, is the covariance of the two variables normalized by the standard deviations of the variables. This normalization cancels the units out and bounds the measure so that it always lies in the range [−1, 1].²⁷

LaTeX: r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})²}\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})²}}

If r = 0 we say that X and Y are uncorrelated. If two variables are independent, then their correlation will be 0. However, a correlation of 0 does not imply independence.²⁷

When people use the term correlation, they are actually referring to a specific type of correlation called the Pearson correlation. It measures the degree to which there is a linear relationship between the two variables.²⁷ In a sample, we use the symbol r. In a population, we use the Greek letter ρ (“rho”).³⁴

  • For a positive association, r > 0, for a negative association r < 0, if there is no relationship r = 0.
  • The closer r is to 0 the weaker the relationship and the closer to +1 or −1 the stronger the relationship (e.g. r = −0.88 is a stronger relationship than r = +0.6); the sign of the correlation provides direction only.
  • Correlation is unit free; the x and y variables do not need to be on the same scale (e.g. it is possible to compute the correlation between height in centimeters and weight in pounds).
  • It does not matter which variable is labeled as x and which as y. The correlation between x and y is equal to the correlation between y and x.³⁴
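A minimal example on hypothetical paired data; `numpy.corrcoef` and `scipy.stats.pearsonr` both implement the formula above:

```python
import numpy as np
from scipy import stats

height_cm = np.array([155.0, 162.0, 170.0, 176.0, 183.0])   # hypothetical heights
weight_lb = np.array([120.0, 135.0, 150.0, 172.0, 190.0])   # hypothetical weights

r_matrix = np.corrcoef(height_cm, weight_lb)[0, 1]
r_scipy, p_value = stats.pearsonr(height_cm, weight_lb)
print(r_matrix, r_scipy)    # identical values; the correlation is unit free
```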

The following table may serve as a guideline when evaluating correlation coefficients³⁴:

Source: https://online.stat.psu.edu/stat200/lesson/3/3.4/3.4.2
  • It is important to note that correlation does not equal causation. A strong relationship between x and y does not necessarily mean that x causes y. It is possible that y causes x, or that a confounding variable causes both x and y.
  • Pearson’s r or ρ should only be used when there is a linear relationship between x and y. A scatterplot should be constructed before computing Pearson’s r or ρ to confirm that the relationship is not non-linear.
  • Pearson’s r is not resistant to outliers. Influential outliers are points in a data set that increase the correlation coefficient.³⁴

An alternative measure is the Spearman correlation, which has a formula almost identical to the correlation defined above, with the exception that the underlying random variables are first transformed into their rank.³⁵ This comes from the fact that, unlike Pearson’s correlation, there is no requirement of normality and hence it is a nonparametric statistic. Spearman’s correlation coefficient is a statistical measure of the strength of a monotonic relationship between paired data.³⁶ It assesses how well the relationship between two variables can be described using a monotonic function.³⁷ A monotonic function is one that either never increases or never decreases as its independent variable increases.³⁶

Source: https://www.statstutor.ac.uk/resources/uploaded/spearmans.pdf
  • Monotonically increasing — as the x variable increases the y variable never decreases;
  • Monotonically decreasing — as the x variable increases the y variable never increases;
  • Not monotonic — as the x variable increases the y variable sometimes decreases and sometimes increases.³⁶

Its interpretation is similar to that of Pearson’s — the closer rₛ is to +/- 1, the stronger the monotonic relationship. Spearman’s correlation coefficient is a measure of a monotonic relationship, and thus a value of rₛ close to zero does not imply there is no relationship between the variables. For example, in the following scatterplot rₛ = 0, which implies no (monotonic) correlation; however, there is a perfect quadratic relationship³⁶:

Source: https://www.statstutor.ac.uk/resources/uploaded/spearmans.pdf

As highlighted by Juhi Ramzai:

The Spearman correlation can evaluate a monotonic relationship between two variables — continuous or ordinal — and it is based on the ranked values for each variable rather than the raw data.

The fundamental difference between the two correlation coefficients is that the Pearson coefficient works with a linear relationship between the two variables whereas the Spearman coefficient works with monotonic relationships as well.

One more difference is that Pearson works with raw data values of the variables whereas Spearman works with rank-ordered variables.³⁷

Source: https://youtu.be/4EXNedimDMs
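A short illustration of that difference: for a monotonic but non-linear relationship, Spearman's coefficient reaches 1 while Pearson's does not. The data below are constructed purely for illustration:

```python
import numpy as np
from scipy import stats

x = np.arange(1, 11, dtype=float)   # 1, 2, ..., 10
y = x ** 3                          # monotonically increasing, but not linear in x

print(stats.pearsonr(x, y)[0])             # below 1: the relationship is not linear
print(stats.spearmanr(x, y).correlation)   # exactly 1: the ranks agree perfectly
```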

More correlation coefficients

The Pearson product-moment correlation coefficient requires quantitative (interval or ratio) data for both variables, whereas the Spearman (“rho”) correlation coefficient applies to ranked (ordinal) data. It is often the case that the data variables are not at the same level of measurement, or that the data might, instead of being quantitative, be categorical (nominal or ordinal). In addition to correlation coefficients based on the product-moment, and thus related to the Pearson product-moment correlation coefficient, there are coefficients that are instead measures of association, which are also in common use.³⁸

Source: https://www.andrews.edu/~calkins/math/edrm611/edrm13.htm

Point-Biserial correlation

The point-biserial correlation is a special case of the product-moment correlation in which one variable is continuous and the other variable is binary (dichotomous). The point-biserial correlation is mathematically equivalent to the Pearson correlation where one variable is continuous but the other is dichotomous.³⁹

LaTeX: r_{pb}=\frac{M_1-M_0}{s_n}\sqrt{\frac{n_0n_1}{n²}}

For example, you could use a point-biserial correlation to determine whether there is an association between salaries, measured in US dollars, and gender (i.e., your continuous variable would be “salary” and your dichotomous variable would be “gender”, which has two categories: “males” and “females”). Alternately, you could use a point-biserial correlation to determine whether there is an association between cholesterol concentration, measured in mmol/L, and smoking status (i.e., your continuous variable would be “cholesterol concentration”, a marker of heart disease, and your dichotomous variable would be “smoking status”, which has two categories: “smoker” and “non-smoker”).⁴⁰
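`scipy.stats.pointbiserialr` computes this directly. In the sketch below the smoking/cholesterol numbers are invented, and the symbols in the formula above are read as follows (an assumption, since they are not defined in the text): M₁ and M₀ are the group means of the continuous variable, s_n is the population standard deviation of all values, and n₁, n₀ are the group sizes.

```python
import numpy as np
from scipy import stats

smoker      = np.array([0, 0, 0, 1, 1, 1, 0, 1, 1, 0])   # dichotomous variable (0/1)
cholesterol = np.array([4.2, 4.8, 4.5, 5.9, 6.1, 5.4, 4.3, 6.3, 5.8, 4.9])  # mmol/L

r_pb, p_value = stats.pointbiserialr(smoker, cholesterol)
print(r_pb)
# Equivalent: the Pearson correlation with the binary variable coded 0/1
print(stats.pearsonr(smoker, cholesterol)[0])
```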

Phi coefficient (φ or rφ)

The phi-coefficient is actually a variation of Pearson’s product-moment coefficient of correlation in which both variables are required to be dichotomous. It is a measure of association between two binary variables. The phi-coefficient was designed for the comparison of truly dichotomous distributions, i.e., distributions that have only two points on their scale which indicate some unmeasurable attribute. Attributes such as living or dead, black or white, accept or reject, and success or failure are examples. It is also sometimes known as the Yule φ or the mean square contingency coefficient and is denoted by φ or rφ.⁴¹

To simplify the calculation, the phi-coefficient can be related to the 2 × 2 contingency table below:

Source: https://www.andrews.edu/~calkins/math/edrm611/edrm13.htm

If A, B, C, and D represent the frequencies of observation, then φ is determined by the relationship:

LaTeX: \Phi=\frac{(bc-ad)}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}

It bears a relationship to χ² (chi-squared), where:

LaTeX: \Phi²=\frac{\chi²}{N}

or

LaTeX: \chi²=N\Phi²
LaTeX: N=a+b+c+d

The significance of φ may be tested by determining the value of χ² from the above relationship and testing in the usual way.

As an example, 43 persons were asked if they believed that there was any truth in horoscopes or in the existence of UFOs. Applying the above formula, φ = 0.266. This value of φ corresponds to a value of χ² of 43 × (0.266)² = 3.04. This may then be tested against the relevant value of χ² for 1 degree of freedom. An alternative significance test (rarely used) may be performed by considering the standard error of φ. Calculation of this is laborious but if N is not too small, then 1/√N approximates it.⁴¹

Note that this is the Pearson correlation coefficient, just calculated in a simplified manner. However, the extreme values of |r| = 1 can only be realized when the two row totals are equal and the two column totals are equal. There are thus ways of computing the maximal values if desired.³⁸
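A minimal sketch of the calculation, following the cell labelling of the table above; the counts are hypothetical:

```python
import math

# Hypothetical 2x2 contingency table, cells labelled a, b, c, d as in the table above
a, b, c, d = 15, 9, 6, 20

n = a + b + c + d
phi = (b * c - a * d) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
chi_squared = n * phi ** 2    # relationship to chi-squared: chi2 = N * phi^2

print(phi, chi_squared)
```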

Measures of Association: C, V, Lambda

As product-moment correlation coefficients, the point biserial, phi, and Spearman “rho” are all special cases of Pearson. However, there are correlation coefficients that are not. Many of these are more properly called measures of association, although they are usually termed coefficients as well. Three of these are similar to Phi in that they are nominal against nominal data, but these do not require the data to be dichotomous.³⁸

One is called Pearson’s contingency coefficient and is termed C, whereas the second is called Cramer’s V coefficient. Both utilize the chi-square statistic, so their details are not covered here. The Goodman and Kruskal lambda coefficient does not, but it is another commonly used association measure. There are two flavors: one called symmetric, used when the researcher does not specify which variable is the dependent variable, and one called asymmetric, used when such a designation is made. We leave the details to any good statistics book.³⁸

Rank-biserial correlation

The rank-biserial correlation is used to assess the relationship between a dichotomous categorical variable and an ordinal variable. The rank-biserial test is very similar to the non-parametric Mann-Whitney U test that is used to compare two independent groups on an ordinal variable. Mann-Whitney U tests are preferable to rank-biserial correlations when comparing independent groups. Rank-biserial correlations can only be used with dichotomous (two levels) categorical variables. Polychotomous (three or more levels) categorical variables cannot be analyzed using rank-biserial correlations.⁴²

The formula is as follows:

LaTeX: r_{rb}=2(M_1-M_0)/n

where M₁ and M₀ are the mean ranks in the continuous or ordinal variable, in groups 1 and 0, respectively, and n=n₁+n₀ is the total sample size.
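A sketch of that formula, with `scipy.stats.rankdata` supplying the ranks of the pooled ordinal variable; the group labels and ordinal responses below are invented:

```python
import numpy as np
from scipy.stats import rankdata

group   = np.array([1, 1, 1, 0, 0, 1, 0, 0, 1, 0])   # dichotomous variable
ordinal = np.array([3, 5, 4, 2, 1, 5, 2, 3, 4, 1])   # e.g. Likert-style responses

ranks = rankdata(ordinal)          # ranks of the pooled data (ties get average ranks)
m1 = ranks[group == 1].mean()      # mean rank in group 1
m0 = ranks[group == 0].mean()      # mean rank in group 0
n = len(ordinal)

r_rb = 2 * (m1 - m0) / n
print(r_rb)
```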

Biserial correlation

The biserial correlation is an estimate of the original product-moment correlation, constructed from the point-biserial correlation when it is used with an artificial dichotomy. For example, you may want to calculate the correlation between IQ and the score on a certain test, but the only measurement available is whether the test was passed or failed.

Suppose you want to find the correlation between a pair of bivariate normal random variables when one has been dichotomized. Sheskin (2011) states that the biserial correlation can be calculated from the point-biserial correlation rpb using the formula³⁹:

LaTeX: r_b=(\frac{r_{pb}}{h})\sqrt{p_0(1-p_0)}

An example might be test performance vs anxiety, where anxiety is designated as either high or low. Presumably, anxiety can take on any value in between, perhaps beyond, but it may be difficult to measure. We further assume that anxiety is normally distributed.³⁸
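A hedged sketch of that conversion, assuming p₀ is the proportion of observations in one of the two groups and h is the ordinate of the standard normal density at the point of dichotomization (this reading of the symbols is an assumption, since the formula above does not define them):

```python
from scipy.stats import norm

r_pb = 0.44   # hypothetical point-biserial correlation
p0   = 0.30   # hypothetical proportion of observations in group 0

# Assumption: h is the standard normal density evaluated at the quantile
# that cuts off a proportion p0 of the distribution
h = norm.pdf(norm.ppf(p0))

r_b = (r_pb / h) * (p0 * (1 - p0)) ** 0.5
print(r_b)    # the biserial estimate exceeds the point-biserial value
```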

Tetrachoric Correlation

The tetrachoric correlation coefficient, rₜₑₜ, is used when both variables are dichotomous, like the φ (phi), but we need also to be able to assume both variables really are continuous and normally distributed. Thus it is applied to ordinal vs. ordinal data which has this characteristic. Ranks are discrete so in this manner, it differs from the Spearman. The formula involves a trigonometric function called cosine — cos(α). The cosine function, in its simplest form, is the ratio of two side lengths in a right triangle, specifically, the side adjacent to the reference angle divided by the length of the hypotenuse. The formula is:

LaTeX: r_{tet}=\cos(\frac{180}{1+\sqrt{\frac{BC}{AD}}})
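A direct translation of the cosine approximation above, with the 2 × 2 cell counts labelled A, B, C, D as before (hypothetical counts; the 180 is in degrees, and the sign of the result follows the cell labelling of the table):

```python
import math

# Hypothetical 2x2 table of counts for two dichotomized variables
A, B, C, D = 20, 10, 8, 25

r_tet = math.cos(math.radians(180 / (1 + math.sqrt((B * C) / (A * D)))))
print(r_tet)
```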

Coefficient of Nonlinear Relationship (η — eta)

It is often useful to measure a relationship irrespective of whether it is linear or not. The η (eta) correlation ratio, or η (eta) coefficient, gives us that ability. This statistic is interpreted similarly to Pearson’s, but can never be negative. It utilizes equal-width intervals and always exceeds |r|. However, even though r is the same whether we regress y on x or x on y, two possible values for η (eta) can be obtained.³⁸
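The formula is not given above; as a hedged sketch, the classical correlation ratio can be computed by binning x into equal-width intervals and comparing between-group variation in y to total variation. The data and the number of bins below are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=500)
y = x ** 2 + rng.normal(scale=0.5, size=500)   # strong but non-linear relationship

bins = np.linspace(x.min(), x.max(), 9)        # 8 equal-width intervals
labels = np.digitize(x, bins[1:-1])            # bin index for each observation

grand_mean = y.mean()
ss_total = ((y - grand_mean) ** 2).sum()
ss_between = sum(len(y[labels == k]) * (y[labels == k].mean() - grand_mean) ** 2
                 for k in np.unique(labels))

eta = np.sqrt(ss_between / ss_total)
r = np.corrcoef(x, y)[0, 1]
print(eta, abs(r))    # eta is large even though |r| is near zero
```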

Final notes

I hope that this article helps in learning new concepts or serves as a refreshment. I strongly believe in living documentation, thus, this article might get updated once I learn new things, want to restructure some points, or for some other reasons.

One of the difficulties I came across while writing this article was scientific typesetting. At first, I wanted to use already-written equations found online; however, the final results, in terms of quality, were disappointing. Then I came across an article by Bill Markos on “Using LaTeX on Medium”, which helped me find a way to quickly add high-quality scientific typesetting using the latex2png.com website. Also, an extremely helpful resource for working with LaTeX symbols was this cheat-sheet.

References

  1. Jim Frost. Descriptive statistics
  2. Sarang Narkhede. Understanding Descriptive Statistics
  3. Brendan Foltz. Statistics 101: Descriptive Statistics, Mean, Median, and Mode
  4. Will Kenton. Trimmed Mean: Definition, Example, Calculation, and Use
  5. Brendan Foltz. Statistics 101: Descriptive Statistics, Mean of Group Frequencies
  6. Michael Darcy. Grouped Data Standard Deviation Calculator
  7. Brendan Foltz. Statistics 101: Descriptive Statistics, Standard Deviation of Grouped Frequencies
  8. Brendan Foltz. Statistics 101: Geometric Mean and Standard Deviation
  9. GraphPad. The geometric mean and geometric SD factor
  10. Thomas B. L. Kirkwood. Geometric Means and Measures of Dispersion
  11. Brendan Foltz. Statistics 101: Descriptive Statistics, IQR and Box Plots
  12. Barbara Illowsky and Susan Dean (De Anza College) with many other contributing authors. Five Number Summary and Box Plots Part 1
  13. Matteo Courthoud. How to Compare Two or More Distributions
  14. The Pennsylvania State University. 13.4 — Box Plots
  15. Sarang Narkhede. Understanding Descriptive Statistics
  16. The National Institute of Standards and Technology. 1.3.5.11. Measures of Skewness and Kurtosis
  17. Sundaresh Chandran. Kurtosis — How to Explain to a 10-year-old
  18. Lawrence T. DeCarlo. Fordham University. On the Meaning and Use of Kurtosis
  19. MVP Programs. Skewness/Kurtosis
  20. The Pennsylvania State University. 2.2.4.1 — Skewness & Central Tendency
  21. Paul von Hippel. Ohio State University, USA. Skewness.
  22. Brendan Foltz. Statistics 101: Understanding Z-scores
  23. University of Texas-Houston Health Science Center. 1.6 Standard Scores
  24. Stephanie Glen. “Coefficient of Dispersion” From StatisticsHowTo.com: Elementary Statistics for the rest of us!
  25. Brendan Foltz. Statistics 101: Understanding Covariance
  26. Sergen Cansiz. 5 Things You Should Know About Covariance
  27. Stanford University. Will Monroe. CS 109 Covariance and Correlation
  28. New York University Center for Data Science. Probability and Statistics for Data Science Covariance matrix
  29. The National Institute of Standards and Technology. 6.5.4.1. Mean Vector and Covariance Matrix
  30. Data Skeptic. Kalman Filters
  31. William Franklin. Covariance Matrix Explained With Pictures
  32. Tim Babb. How a Kalman filter works, in pictures
  33. Anwesh Marwade. Kalman Filtering: An Intuitive Guide Based on Bayesian Approach
  34. The Pennsylvania State University. 3.4.2 — Correlation
  35. Stanford University. Chris Piech. CS109 Correlation
  36. Loughborough University Mathematics Education Centre. Dr Alun Owen. Spearman’s correlation.
  37. Juhi Ramzai. Clearly explained: Pearson V/S Spearman Correlation Coefficient
  38. Andrews University. Keith G. Calkins. Applied Statistics — Lesson 13 More Correlation Coefficients
  39. NCSS. Chapter 302 Point-Biserial and Biserial Correlations
  40. Laerd Statistics. Dr Adam Lund & Mark Lund. Point-Biserial Correlation using SPSS Statistics
  41. O. B. Chedzoy. Phi-coefficient
  42. Dr. Eric Heidel. Rank biserial Correlation between dichotomous and ordinal variables

Supplementary reads
