Baby Steps into Data Science 02 — Math&Statistics: Descriptive Statistics

Editor — Ishmael Njie & Sulayman Saleem

DataRegressed Team
DataRegressed
Mar 22, 2018


To start off the Math&Statistics chapter, we are going to go into the idea of Descriptive Statistics and its importance in the analysis of data.

First of all, what is a statistic?

A statistic is a numerical summary calculated from a set of data.

Now when we talk about Descriptive Statistics, we aim to calculate some metrics that will tell a ‘story’ about our dataset without going through all of the data.

When given a particular dataset, one may want to find a number (or set of numbers) that represents the data. The first notion to look at is the Central Tendency of the dataset.

Central Tendency (or Centrality) is often referred to as the Average of the dataset. There are many types of Averages; we will look at the 3 most commonly used:

Mean: More precisely the arithmetic mean, this is the sum of the values in a set divided by the number of values in the set.

Median: This is the value that separates the upper half and the lower half of the dataset. Essentially, it’s the middle number once the data is sorted; with an even number of values, it is the mean of the two middle numbers.

Mode: This is the value that occurs the most within a given dataset.

Now some may wonder why one of these measures is not enough to describe the central tendency of a dataset. Let us show you an example to highlight the issues in the individual averages.

Take the following arbitrary dataset:

4,4,4,4,100

Let’s look at computing the averages of this set of data:

Mean: 23.2, Median: 4, Mode: 4
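As a quick check, the three averages above can be reproduced with Python's built-in statistics module. This is only a minimal sketch; the variable names are ours and not part of the original post.

```python
from statistics import mean, median, mode

# The arbitrary dataset from the example above
data = [4, 4, 4, 4, 100]

mean_value = mean(data)      # sum of the values divided by how many there are
median_value = median(data)  # middle value of the sorted data
mode_value = mode(data)      # most frequently occurring value

print(f"Mean: {mean_value}, Median: {median_value}, Mode: {mode_value}")
```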

Looking at the averages computed, we can see that the mean value of 23.2 is not a good representation of the whole dataset. It has been pulled upwards by the large value of 100. In statistics, we call such a number an outlier: a value that does not fit in with the rest of the values in the dataset. In contrast to the mean, our median and mode are more representative of this dataset. However, the median and the mode also have their downfalls. The median does not take into account the magnitude of each value in the dataset, so it would still be 4 if the outlier were 1,000,000 instead of 100. The idea of the mode breaks down when 1) there is no mode (every value occurs equally often) or 2) there is more than one mode.
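These weaknesses can be illustrated with a short sketch. The extra datasets below (large_outlier, no_repeats, two_modes) are hypothetical examples of our own, and multimode assumes Python 3.8 or later.

```python
from statistics import mean, median, multimode

small_outlier = [4, 4, 4, 4, 100]
large_outlier = [4, 4, 4, 4, 1_000_000]  # hypothetical: a far more extreme outlier

# The mean moves with the outlier, while the median ignores its magnitude
mean_small, mean_large = mean(small_outlier), mean(large_outlier)
median_small, median_large = median(small_outlier), median(large_outlier)

# multimode returns every value tied for the highest count
no_repeats = [1, 2, 3, 4]    # every value occurs once, so there is no useful mode
two_modes = [1, 1, 2, 2, 3]  # two values tie for most frequent

modes_no_repeats = multimode(no_repeats)
modes_two_modes = multimode(two_modes)

print(mean_small, mean_large)      # the means differ greatly
print(median_small, median_large)  # the medians are identical
print(modes_no_repeats, modes_two_modes)
```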

Following the Centrality of the dataset, one may want to look at how spread out the values in a dataset are. We call this Dispersion; we will cover two measures of dispersion:

Variance: the average of the squared differences between the data points and the mean.

Standard Deviation: the square root of the variance.

Formula for variance
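The formula image is not reproduced here. Assuming it matches the definition given above (the average of the squared differences from the mean, i.e. the population variance), it can be written as follows, where N, x_i and \mu are our own notation for the number of data points, the i-th data point and the mean:

```latex
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2,
\qquad
\mu = \frac{1}{N} \sum_{i=1}^{N} x_i
```

The sample variance, which divides by N - 1 instead of N, is a common alternative when the data is a sample drawn from a larger population.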

The variance metric gives us an indication of the degree to which, on average, the data points differ from the mean of the dataset. Note that adding more data points does not by itself make the variance smaller; it depends on how spread out the values are, not on how many there are.

The differences are squared during calculation so that differences above the mean are not neutralised by differences below the mean. Without squaring, the positive and negative differences would cancel exactly and their average would always be 0.
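The cancellation is easy to see numerically. Here is a small sketch using the 4,4,4,4,100 dataset from earlier; the variable names are ours.

```python
from statistics import mean

data = [4, 4, 4, 4, 100]
mu = mean(data)

# Raw differences from the mean: four negative values and one large positive one
differences = [x - mu for x in data]
sum_of_differences = sum(differences)  # approximately 0, up to floating-point rounding

# Squaring makes every term non-negative, so nothing cancels
squared_differences = [d ** 2 for d in differences]
sum_of_squares = sum(squared_differences)

print(round(sum_of_differences, 9))
print(sum_of_squares)
```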

However, by squaring the differences, the variance metric is now no longer in the same unit of measure as the data points (the same way cm is in a different unit of measure to cm squared). This is where the standard deviation comes into play. By taking the square root of the variance metric, we now return to the original unit of measure for the data points.
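Continuing with the 4,4,4,4,100 dataset, the following is a minimal sketch of both dispersion measures. It assumes the population versions (dividing by the number of data points), in line with the "average of those squares" definition; pvariance and pstdev are the corresponding functions in Python's statistics module.

```python
import math
from statistics import mean, pstdev, pvariance

data = [4, 4, 4, 4, 100]
mu = mean(data)

# Variance by hand: the average of the squared differences from the mean
manual_variance = sum((x - mu) ** 2 for x in data) / len(data)

# Standard deviation: square root of the variance, back in the data's own units
manual_std = math.sqrt(manual_variance)

# The same quantities from the standard library
library_variance = pvariance(data)
library_std = pstdev(data)

print(manual_variance, library_variance)
print(manual_std, library_std)
```

If the data is treated as a sample from a larger population, statistics.variance and statistics.stdev divide by one less than the number of data points instead.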

Closely linked with the averages, skewness illustrates the asymmetry of a probability distribution:

Positive skew: where the tail of the distribution is on the far right. This occurs if most of the data is on the left-hand side.

Negative skew: where the tail of the distribution is on the far left. This occurs if most of the data is on the right-hand side.

Symmetrical: where the data is evenly distributed on both sides of the centre.
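The post does not give a formula for skewness. One common choice, assumed here, is the moment-based coefficient: the average cubed difference from the mean, divided by the variance raised to the power 1.5. The skewness function and the three small datasets below are hypothetical illustrations of our own.

```python
from statistics import mean

def skewness(values):
    """Moment-based skewness: third central moment / (second central moment ** 1.5)."""
    n = len(values)
    mu = mean(values)
    m2 = sum((x - mu) ** 2 for x in values) / n  # population variance
    m3 = sum((x - mu) ** 3 for x in values) / n  # keeps the sign of the differences
    if m2 == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return m3 / m2 ** 1.5

right_tailed = [1, 2, 2, 3, 10]           # most data on the left, tail to the right
left_tailed = [-x for x in right_tailed]  # mirror image: tail to the left
symmetrical = [1, 2, 3, 4, 5]

print(skewness(right_tailed))  # positive
print(skewness(left_tailed))   # negative
print(skewness(symmetrical))   # zero
```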

As mentioned above, the averages are closely linked to the skewness of a probability distribution.

From the graphic, you can see the position of the 3 averages and how they differ in each of the 3 distributions.

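The mode is always represented by the peak of the curve. The mean is pulled towards the tail, so it lies to the right of the peak in a positively skewed distribution and to the left of the peak in a negatively skewed one. The median typically falls between the mode and the mean. In a symmetrical distribution with a single peak, all three averages coincide in the middle of the curve.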
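The relative positions of the three averages can also be checked on a small, positively skewed dataset. The numbers below are a hypothetical example of our own, not data from the post.

```python
from statistics import mean, median, mode

# Hypothetical positively skewed data: most values are small, with a tail to the right
skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 14]

skewed_mode = mode(skewed)      # the peak of the distribution
skewed_median = median(skewed)  # the middle of the sorted values
skewed_mean = mean(skewed)      # pulled towards the long right tail

print(skewed_mode, skewed_median, skewed_mean)
```

For this dataset the order is mode, then median, then mean, which is the typical pattern for a positive skew; mirroring the data reverses the order.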

Summary

These introductory statistics are important to understand when analysing data and drawing conclusions about it. The central tendency metrics are used to portray a value that best describes the central position of the entire dataset. Dispersion looks at how far or close the data points are from the central position. Finally, skewness looks at whether the distribution of the data is distorted or symmetrical.

Two related ideas are not covered in this post. Kurtosis, similar to skewness, is a measure of the shape of a probability distribution; it describes how heavy the tails are, and is often loosely described as how peaked the curve is. Quantiles are another family of statistics: a quantile is the value below which a given fraction of the data falls (the median is the quantile that splits the data in half).
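As a brief sketch of the quantile idea, Python's statistics.quantiles (available from Python 3.8) returns the cut points that split the data into equal-sized groups. The dataset here is hypothetical, and different interpolation methods can give slightly different cut points.

```python
from statistics import median, quantiles

# Hypothetical dataset of our own
values = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 asks for quartiles: three cut points that split the data into four groups
quartiles = quantiles(values, n=4)
lower_quartile, middle_quartile, upper_quartile = quartiles

# The middle quartile is the median
middle_value = median(values)

print(quartiles)
print(middle_value)
```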

In analysis, use a combination of the metrics described in this post to draw conclusions about the dataset; no single metric is robust enough on its own to give a clear judgement of the overall dataset.
