Descriptive Statistics for Data Science.

Sahil Mankad
Published in Analytics Vidhya
5 min read, Dec 27, 2019

Introduction:

Statistics is a building block of data science, and a data scientist needs a firm grasp of it. Keeping that knowledge current takes effort, and it is something many data scientists struggle with. This article is intended for people who are new to data science and need an overview; for others, it can serve as a refresher on the basics. If you find it helpful, feel free to bookmark this link and come back to it whenever required. Let us get started.

Outliers:

Outliers can be defined as values that fall outside the normal range of the data. For example, in the series 4, 7, 19, 999, 8, 14, the value 999 is the outlier. Whether a value counts as an outlier in a particular dataset is a judgment the analyst makes.

Mode:

Mode can be defined as the most frequently occurring value in a distribution.

In the series 5, 10, 5, 6, 8, 3, 21, the mode is 5, since it occurs the most number of times (twice).

Not every value matters for the mode: we only check how often each value occurs, which is also why the mode is robust to outliers. If we add 900 to the series, giving 5, 10, 5, 6, 8, 3, 21, 900, the mode stays the same. Mode is generally used for categorical variables, as in the example below.

Let’s consider the distribution of 5 colored balls: Red, Red, Green, Blue, Red

Here, Red is the mode, as it occurs most frequently, i.e., 3 times.

A distribution can have one or more modes. A distribution with a single mode is unimodal, one with two modes is bimodal, and one with many modes is multimodal.
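The mode examples above can be checked with Python's standard `statistics` module. This is a minimal sketch: the variable names are made up for illustration, and the last list is a hypothetical bimodal series that does not appear in the article.

```python
from statistics import mode, multimode

numbers = [5, 10, 5, 6, 8, 3, 21]
with_outlier = numbers + [900]
colors = ["Red", "Red", "Green", "Blue", "Red"]

print(mode(numbers))       # 5
print(mode(with_outlier))  # 5, the outlier does not change the mode
print(mode(colors))        # Red, mode also works for categorical data

# Hypothetical bimodal series: multimode returns every value tied
# for the highest count, in order of first appearance.
bimodal = [1, 1, 2, 2, 3]
print(multimode(bimodal))  # [1, 2]
```

`multimode` requires Python 3.8 or later.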

Mean:

Mean is the average of numbers in a distribution, or generally speaking:

Mean = (Sum of terms)/(number of terms)

Mean is sensitive to outliers, so it is not a very robust measure.

Example: Let us consider the previous distribution.

  1. Without outlier: 5,10,5,6,8,3,21, mean = 58/7 = 8.29
  2. With outlier: 5,10,5,6,8,3,21,900, mean = 958/8 = 119.75

As seen in the example above, adding a single outlier can drastically change the mean. Mean is generally used for continuous variables.
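As a quick check of the arithmetic, the two means can be reproduced with the standard library. This is only a sketch; the variable names are illustrative.

```python
from statistics import mean

without_outlier = [5, 10, 5, 6, 8, 3, 21]
with_outlier = without_outlier + [900]

mean_without = mean(without_outlier)  # 58 / 7
mean_with = mean(with_outlier)        # 958 / 8

print(round(mean_without, 2))  # 8.29
print(mean_with)               # 119.75
```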

Median:

Median can be defined as the middle value of a numeric distribution sorted in ascending order. The median of an odd-length series is the middle element, and for an even-length series it is the mean of the two middle elements.

Examples:

  1. 3,5,5,6,8,10,21. Here the series length is odd and the middle element is 6, so 6 is the median.
  2. 3,5,5,6,8,10,21,900. Here the series length is even and the middle elements are 6 and 8, so the mean of 6 and 8, i.e., 7, is the median.

We can also observe from the above examples that adding the outlier 900 in the second example moved the median only from 6 to 7. Thus, the median can be used as a more robust alternative to the mean.

Median is also generally used for continuous variables.
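The two median examples can be verified with a short sketch using `statistics.median`, which sorts the values internally, so the input does not need to be ordered first. The variable names are illustrative.

```python
from statistics import median

odd_series = [5, 10, 5, 6, 8, 3, 21]   # 7 values, unsorted on purpose
even_series = odd_series + [900]       # 8 values, includes the outlier

median_odd = median(odd_series)    # middle element of the sorted series
median_even = median(even_series)  # mean of the two middle elements

print(median_odd)   # 6
print(median_even)  # 7.0
```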

Quantile and Quartile:

A quantile is a cut point that divides sorted data into equal-sized groups, while quartiles are the specific quantiles that divide a dataset into quarters. We will mostly deal with quartiles when working on a data problem, but it is better to understand the difference between the two and clear up the confusion.

The median divides the dataset into 2 parts. The median of the data to the left of the median is the 1st quartile, and the median of the data to the right of the median is the 3rd quartile of the distribution. This can be clearly understood with the example below:

https://www.mathsisfun.com/data/quartiles.html
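The median-of-halves rule described above can be sketched in a few lines of Python. The function name `quartiles` is made up for this illustration. Library routines such as `statistics.quantiles` or `numpy.percentile` use interpolation conventions that can give slightly different values.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves rule."""
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two values")
    half = len(data) // 2
    lower = data[:half]    # values to the left of the median
    upper = data[-half:]   # values to the right of the median
    return median(lower), median(data), median(upper)

q1, q2, q3 = quartiles([3, 5, 5, 6, 8, 10, 21])
print(q1, q2, q3)  # 5 6 10
```

For an odd-length series the middle element itself is left out of both halves, which is the convention the text above follows.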

The quartiles and other important values can be represented by a box plot as shown below:

https://www.mathsisfun.com/data/quartiles.html

Spread of Data:

While working on a data science project, we may need to check how similar or varied our set of observations is. There are several measures for this:

  1. Range: It is the difference between the maximum and minimum values. The more spread out the data, the larger the range. Range is sensitive to outliers, since it depends only on the two extreme values.
  2. Interquartile Range (IQR): It is the difference between the 3rd quartile and the 1st quartile. It is robust to outliers, since it is based on the quartiles, which as we know are derived from medians, which are robust to outliers.

Note that we calculate the quartiles using the same approach as covered before.

Example:

  1. Without outlier: 3,5,5,6,8,10,21

     Quartile 1: 5
     Quartile 2 (Median): 6
     Quartile 3: 10
     Range: 21 - 3 = 18
     IQR: 10 - 5 = 5

  2. With outlier: 3,5,5,6,8,10,21,900

     Quartile 1: (5+5)/2 = 5
     Quartile 2 (Median): (6+8)/2 = 7
     Quartile 3: (10+21)/2 = 15.5
     Range: 900 - 3 = 897
     IQR: 15.5 - 5 = 10.5

  3. Variance and standard deviation can also be used to measure the spread of the data. We’ll cover them later in this article.
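Range and IQR for both series can be computed with a small sketch. The helper name `spread` is made up for this illustration, and it uses the median-of-halves rule for the quartiles. With the outlier included, the upper half is 8, 10, 21, 900, so the 3rd quartile is (10+21)/2 = 15.5.

```python
from statistics import median

def spread(values):
    """Return (range, IQR), with quartiles taken as medians of each half."""
    data = sorted(values)
    half = len(data) // 2
    q1 = median(data[:half])    # median of the lower half
    q3 = median(data[-half:])   # median of the upper half
    return data[-1] - data[0], q3 - q1

range_a, iqr_a = spread([3, 5, 5, 6, 8, 10, 21])
range_b, iqr_b = spread([3, 5, 5, 6, 8, 10, 21, 900])

print(range_a, iqr_a)  # 18 5
print(range_b, iqr_b)  # 897 10.5
```

The single outlier takes the range from 18 to 897, while the IQR moves only from 5 to 10.5.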

Below is a summary of the measures covered so far and their sensitivity to outliers:

  1. Mode: robust
  2. Mean: sensitive
  3. Median: robust
  4. Range: sensitive
  5. IQR: robust

Variance and Standard Deviation:

Let us have a look at the Wikipedia definitions for both these terms.

Variance: The expectation of the squared deviation of a random variable from its mean. Informally, it measures how far a set of (random) numbers are spread out from their average value.

Standard Deviation: A measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range.

Below are the formulas:

https://towardsdatascience.com/intro-to-descriptive-statistics-and-probability-for-data-science-8effec826488
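For reference, the standard population forms, consistent with the definitions quoted above, can be written as follows. The symbols are assumptions of this sketch: N is the number of values, x_i is the i-th value, and \mu is the mean.

```latex
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2, \qquad
\sigma = \sqrt{\sigma^2}
```

When the data is a sample rather than the whole population, the sum of squared deviations is usually divided by N - 1 instead of N.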

We square the deviations when computing the variance to ensure that deviations above and below the mean do not cancel each other out. This can be understood with a small example in which the deviations from the mean are -5, 0 and 5.

Adding the signed deviations from the mean, we get: -5+0+5 = 0.

Adding the squared deviations from the mean, we get: 25+0+25 = 50.

The added benefit is that squaring penalizes outliers heavily. However, because of the squaring, variance is not in the same unit of measurement as the original data. This is why we generally use the standard deviation, the square root of the variance, for calculation purposes.
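The cancellation effect, and the step from variance to standard deviation, can be sketched in Python. The data values 5, 10, 15 are hypothetical, chosen only because their deviations from the mean are -5, 0 and 5. The population forms `pvariance` and `pstdev` are used here, which is an assumption about which variant is intended.

```python
from statistics import pvariance, pstdev

data = [5, 10, 15]             # hypothetical values with mean 10
mu = sum(data) / len(data)

deviations = [x - mu for x in data]   # signed deviations: -5, 0, 5
sum_signed = sum(deviations)          # positive and negative parts cancel
sum_squared = sum(d ** 2 for d in deviations)

print(sum_signed)    # 0.0
print(sum_squared)   # 50.0

variance = pvariance(data)   # sum of squared deviations / N, in squared units
std_dev = pstdev(data)       # square root of the variance, in original units
print(round(variance, 2), round(std_dev, 2))  # 16.67 4.08
```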

Conclusion:

In this article, we covered some basics of descriptive statistics; I hope you enjoyed it. I have not covered the central limit theorem and Z-scores, which I intend to cover in a later article, along with some probability concepts. Until then, Sayonara!
