Statistics — 101

Peeush Agarwal
Published in Analytics Vidhya
5 min read · Jul 25, 2021

I'm back with another story, and this time it covers the basics we need in any data analysis problem: Statistics.

What is Statistics?

Statistics is a collection of methods for designing experiments and obtaining data, then organizing, summarizing, presenting, analyzing, and interpreting that data and drawing inferences from it.

We have 2 types of Statistics:

  1. Descriptive statistics: collection, organization, summarization, and presentation of the data.
  2. Inferential statistics: analysis and interpretation of the data, and making predictions from it.

Before proceeding further, we need to understand the concepts of Population and Sample.

Population — Every possible individual element that we are interested in measuring.

Sample — It is the subset of the population on which we can do analysis and make inferences about the population.

Example:

Suppose we want to calculate the average height of all people in the state of Karnataka.

One method is to measure the height of each person in the state, but is that really possible? Next to impossible, in my opinion.

So, another method is to take a subset of the population as our sample, calculate the average height of the sample, and use it as an estimate for the population. We just need to make sure our sample is not biased and is representative of the population.
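
To make the idea concrete, here is a minimal sketch of estimating a population mean from a random sample. The population is synthetic (100,000 heights drawn from a normal distribution with an assumed mean of 165 cm and standard deviation of 10 cm); none of these numbers are real Karnataka data.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 100,000 synthetic heights in cm (assumed, not real data).
population = [random.gauss(165, 10) for _ in range(100_000)]

# Simple random sample without replacement, so every person is equally likely to be picked.
sample = random.sample(population, 1_000)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"Population mean: {population_mean:.2f} cm")
print(f"Sample mean:     {sample_mean:.2f} cm")
```

With an unbiased sample of this size, the sample mean should land close to the population mean even though only 1% of the population was measured.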

Descriptive statistics

It describes the data we’re concerned with, mainly through 2 kinds of measures:

  • Measures of Central Tendency
  • Measures of Variation

Measures of Central Tendency

A measure of central tendency describes the data with a single value: a “middle” or typical value that best represents the given data.

We have 3 types of measures:

  • Mean
  • Median
  • Mode

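Mean is the arithmetic average of the values: the sum of the values divided by the number of values.

Mean formula, for n values x_1, ..., x_n:
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$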

Example:
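
A minimal sketch of the mean calculation in Python. The numbers are made up for illustration; `statistics.mean` comes from the standard library.

```python
import statistics

values = [2, 4, 6, 8, 10]  # illustrative data, not from the article

# Mean by hand: sum of the values divided by their count.
manual_mean = sum(values) / len(values)

# The same result using the standard library.
library_mean = statistics.mean(values)

print(manual_mean)   # 6.0
print(library_mean)  # 6
```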

Note: Mean can easily be affected by outliers in the data. Consider the salaries of employees in a hypothetical company. Suppose we have 3 employees in the data: a janitor, a software developer, and the CEO, with salaries of $1K, $10K, and $1M respectively. The mean of these 3 values is about $337K: it is pulled strongly toward the CEO's $1M and is nowhere near what the other 2 employees earn.
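
The effect of the outlier can be seen in a short sketch using the 3 salaries from the note above (the variable names are only illustrative):

```python
import statistics

# Salaries from the example: janitor, software developer, CEO.
salaries = [1_000, 10_000, 1_000_000]

mean_all = statistics.mean(salaries)
mean_without_ceo = statistics.mean(salaries[:2])  # drop the outlier

print(f"Mean with CEO:    ${mean_all:,.0f}")          # $337,000
print(f"Mean without CEO: ${mean_without_ceo:,.0f}")  # $5,500
```

A single extreme value moves the mean from $5,500 to $337,000.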

To counter this issue, we often use the median.

Median is another measure that gives us the center value of the data.

To find the median, follow these steps:

  • Sort the values in ascending or descending order, and
  • Then find the central value. If the total count of values is even, the median is the arithmetic mean of the 2 central values.

Let’s see this in action:

Median calculation when count of values is odd
Median calculation when count of values is even
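
The 2 cases can be sketched as follows. `manual_median` is a hypothetical helper written for this illustration, and the data is made up; `statistics.median` from the standard library is used as a cross-check.

```python
import statistics

def manual_median(values):
    """Median by the steps above: sort, then take the central value(s)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single central value.
        return ordered[mid]
    # Even count: arithmetic mean of the 2 central values.
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_values = [7, 1, 5, 3, 9]   # sorted: 1, 3, 5, 7, 9
even_values = [7, 1, 5, 3]     # sorted: 1, 3, 5, 7

print(manual_median(odd_values))   # 5
print(manual_median(even_values))  # 4.0

# Cross-check against the standard library.
print(statistics.median(odd_values), statistics.median(even_values))
```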

In our outlier case, the median is $10K, which is the actual central value of the data.
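
As a quick check, a sketch comparing the 2 measures on the same 3 salaries:

```python
import statistics

salaries = [1_000, 10_000, 1_000_000]  # janitor, software developer, CEO

salary_mean = statistics.mean(salaries)
salary_median = statistics.median(salaries)

print(f"Mean:   ${salary_mean:,.0f}")    # $337,000
print(f"Median: ${salary_median:,.0f}")  # $10,000
```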

Mode is the most frequent value in the data.

Suppose we have collected height data for 5 randomly chosen people, as follows:

5, 5.5, 5.5, 5.2, 5.6 (all in ft.)

Here the mode is 5.5 ft., as it is the most frequent value in the data (it occurs twice, while every other value occurs once).
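
A minimal sketch of finding the mode for the height data above, using `statistics.mode` and, as a cross-check that also reports the count, `collections.Counter` from the standard library:

```python
import statistics
from collections import Counter

heights_ft = [5, 5.5, 5.5, 5.2, 5.6]  # heights from the example, in ft.

# Most frequent value in the data.
mode_height = statistics.mode(heights_ft)

# Counter reports the most common value together with how often it occurs.
most_common = Counter(heights_ft).most_common(1)

print(mode_height)  # 5.5
print(most_common)  # [(5.5, 2)]
```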

Measures of Variation

The central tendency alone does not provide complete information about the dataset. It should always be looked at alongside variation in the dataset to get the complete picture.

Measures of Central Tendency provide just the central value, while Measures of Variation describe the amount of dispersion in the dataset.

We have the following popular types of variation measures:

  1. Range
  2. Interquartile range
  3. Variance
  4. Standard deviation

Range is the difference between the maximum and minimum values in the dataset. Though it is the simplest of all measures of variation, it is not advisable for larger samples or for data that contains outliers.

Python implementation for Range
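
The original snippet is not available here, so the following is a minimal sketch with made-up values. It also shows how a single outlier changes the result.

```python
values = [12, 15, 11, 14, 13]  # illustrative data

# Range: maximum minus minimum.
value_range = max(values) - min(values)

# Adding one extreme value changes the range drastically.
with_outlier = values + [95]
range_with_outlier = max(with_outlier) - min(with_outlier)

print(value_range)         # 4
print(range_with_outlier)  # 84
```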

Range doesn’t give much information about the variation within the dataset: it does not show how tightly or loosely the data is clustered around the center. And because it depends only on the 2 extreme values, it is easily influenced by outliers.

Interquartile range is the difference between the 75th and 25th percentiles of the data. After the data is sorted in ascending order, 3 cut points called quartiles divide it into 4 equal parts:

  • Quartile 1 or Q1 is the 25th percentile: about 25% of the data lies at or below it,
  • Quartile 2 or Q2 is the 50th percentile, i.e. the median,
  • Quartile 3 or Q3 is the 75th percentile: about 75% of the data lies at or below it.

Interquartile range is nothing but Q3 - Q1, i.e. the range of the middle 50% of the data.

Python implementation for IQR
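
The original snippet is not available here, so this is a minimal sketch using `statistics.quantiles` (Python 3.8+) on made-up data. Note that several conventions exist for computing percentiles; `method="inclusive"` is an assumption of this sketch, and other methods or libraries can give slightly different quartiles.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # illustrative data

# n=4 returns the 3 cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Replace the largest value with an extreme outlier and recompute.
data_with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 900]
o1, _, o3 = statistics.quantiles(data_with_outlier, n=4, method="inclusive")
iqr_with_outlier = o3 - o1

print(q1, q2, q3)        # 3.0 5.0 7.0
print(iqr)               # 4.0
print(iqr_with_outlier)  # 4.0
```

In this example the range jumps from 8 to 899 when the outlier is introduced, while the IQR stays at 4.0.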

Compared to range, it is far less influenced by outliers, since it ignores the lowest 25% and highest 25% of the data. It is a good measure of variation for skewed distributions.

Variance summarizes how far the observations are from the mean: it is the average of the squared deviations from the mean. Unlike range and interquartile range, it takes every data point in the dataset into account.

It is calculated differently depending on whether the dataset is a whole population or a sample:

  • Population variance
$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$
Population variance (mu=population mean, N=size of population)
  • Sample variance
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
Sample variance (x-bar=sample mean, n=sample size)

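If we divided by n, the sample would tend to underestimate the population variance, because the deviations are measured from the sample mean rather than from the true population mean. Hence, we divide by n-1 to correct for this underestimation.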
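
This underestimation can be illustrated with a small simulation. This is only a sketch under assumed settings: a synthetic normal population with variance 100, samples of size 5, and 5,000 repetitions.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

TRUE_VARIANCE = 100.0  # variance of the synthetic population N(0, 10^2)
SAMPLE_SIZE = 5
TRIALS = 5_000

sum_div_n = 0.0
sum_div_n_minus_1 = 0.0
for _ in range(TRIALS):
    sample = [random.gauss(0, 10) for _ in range(SAMPLE_SIZE)]
    x_bar = sum(sample) / SAMPLE_SIZE
    squared_deviations = sum((x - x_bar) ** 2 for x in sample)
    sum_div_n += squared_deviations / SAMPLE_SIZE                 # divide by n
    sum_div_n_minus_1 += squared_deviations / (SAMPLE_SIZE - 1)   # divide by n-1

avg_div_n = sum_div_n / TRIALS
avg_div_n_minus_1 = sum_div_n_minus_1 / TRIALS

print(f"True variance:           {TRUE_VARIANCE:.1f}")
print(f"Average when using n:    {avg_div_n:.1f}")
print(f"Average when using n-1:  {avg_div_n_minus_1:.1f}")
```

On average, dividing by n gives (n-1)/n of the true variance (80 here), while dividing by n-1 averages out to the true value, so the 2 printed averages should land near 80 and 100.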

Python implementation for Variance
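
The original snippet is not available here, so this is a minimal sketch on made-up data. It computes both variances by hand and cross-checks them against `statistics.pvariance` and `statistics.variance` from the standard library.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative data, mean = 5

mean = sum(data) / len(data)
squared_deviations = [(x - mean) ** 2 for x in data]  # these sum to 32

# Population variance: divide by N.
population_variance = sum(squared_deviations) / len(data)

# Sample variance: divide by n-1.
sample_variance = sum(squared_deviations) / (len(data) - 1)

print(population_variance, statistics.pvariance(data))                # 4.0 4
print(round(sample_variance, 3), round(statistics.variance(data), 3)) # 4.571 4.571
```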

A higher variance means higher variation in the dataset, and vice versa. However, since variance is expressed in squared units, there is no intuitive way to compare it directly with the data values or the mean.

Standard Deviation is the square root of the variance. Because it is expressed in the same units as the data, it is a measure of variation that is easier to interpret and relate to the dataset.

Python implementation for Standard Deviation
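
The original snippet is not available here, so this is a minimal sketch on made-up data, using `statistics.pstdev` (population) and `statistics.stdev` (sample) from the standard library.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative data

# Square root of the population variance (divide by N).
population_std = statistics.pstdev(data)

# Square root of the sample variance (divide by n-1).
sample_std = statistics.stdev(data)

print(population_std)        # 2.0
print(round(sample_std, 3))  # 2.138

# Same result as taking the square root of the variance directly.
print(math.isclose(sample_std, math.sqrt(statistics.variance(data))))  # True
```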

That’s it for Part 1. I’ll come up with another part to continue with Statistics, as it is an important topic for any kind of analysis.

Statistics-102: Part 2 of the “Statistics” series. It covers various probability distributions, the Central Limit Theorem, and Chebyshev’s inequality.
