Gaussian Distribution and Descriptive Statistics

Published in dsnaiplusui · 6 min read · Jul 3, 2020

When working with data, collecting it is usually the first task, after which we make inferences from it. Sound inferences can only be made when we understand how the data is distributed and which parameters properly describe that distribution.

The 7 days of statistics challenge by DSN AIPlus UI is meant to explain essential theoretical statistics, including descriptive statistics and data distributions.

A data distribution describes how the values in a data set are spread out, and it is often displayed graphically to organize and summarize useful information about the data. There are several common distributions, including the Gaussian distribution, the binomial distribution, and the Poisson distribution.

This article focuses on the most important of these, the Gaussian (or normal) distribution. By the end of this article you will:

Become familiar with simple descriptive statistics
Identify Gaussian distribution and its use cases
Implement the calculation of the sample mean and standard deviation from scratch using Python.

Descriptive statistics

Descriptive statistics are used to describe the basic features of data in a study. They do this by providing simple quantitative summaries about a sample or population.

Before continuing, we need to be familiar with two terms:

Population: all the elements in a set of data, that is, the complete list of entities about which inferences can be drawn.
Sample: a smaller group of members selected from the population, usually at random.

For example, if I am studying the factors that affect relegated teams in the English Premier League, my population will be the Premier League tables from its inception in 1992 to date. I can study the tables from 2010 to 2020, which will then be the sample for the study.

We usually cannot get data about an entire population. Due to this limitation, samples are used to make statistical inferences about the population.

The mean

A good first step toward understanding a distribution is to locate where most of the entries are concentrated.

The mean is the numerical average of the entries in a data set. It describes the central location of the data.

It is calculated by dividing the sum of all the entries in a sample by the total number of entries in the sample.
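Written as a formula, and assuming the symbols x_i for the individual entries, N for the number of entries, and x̄ for the sample mean (these names are a notational choice, not taken from the article), the calculation looks like this:

```latex
\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i
```

For instance, for the entries 2, 4, 6 and 8, the sum is 20 and N is 4, so the mean is 5.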

Standard deviation

Data distributions can also be described by how far each entry lies from the mean. This is known as a measure of variation.

Standard deviation is a measure of the amount of variation or dispersion of a set of values. Roughly speaking, it represents the typical distance of each entry from the mean (more precisely, the square root of the average squared distance).

It can be calculated by these steps:

Calculate the mean of the data set.
Subtract the mean from each entry in the list.
Square each of the differences.
Find the average of the squared differences (for a sample, divide their sum by N - 1 rather than N, where N is the number of entries).
Find the square root of that average.

Here is the formula for calculating sample standard deviation.
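In standard notation, and assuming the symbols x_i for the entries, x̄ for the sample mean, N for the number of entries, and s for the sample standard deviation (the symbol names are an assumption for illustration), the formula is usually written as:

```latex
s = \sqrt{\frac{1}{N - 1} \sum_{i=1}^{N} \left( x_i - \bar{x} \right)^2}
```

For the entries 2, 4, 6 and 8, the mean is 5, the squared differences are 9, 1, 1 and 9, their sum is 20, and s is the square root of 20 / 3, which is roughly 2.58.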

Gaussian distribution

The Gaussian distribution, also called the normal distribution, is a continuous probability distribution that is symmetrical about the mean.

It is the most popular and most important distribution because it fits many natural phenomena, such as heights, blood pressure and other continuous variables.

Most of the data is clustered around the central peak, the mean, which gives the distribution its bell-shaped curve.

The curve is spread equally on both sides of the peak: two points at the same distance from the mean, one on either side, have the same height on the curve.

Gaussian distribution can be represented by two parameters: the mean and the standard deviation.
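To make the role of the two parameters concrete, here is a minimal sketch of the normal density function. The function name gaussian_pdf and its argument names are hypothetical, chosen only for this illustration.

```python
import math


def gaussian_pdf(x, mean, sdev):
    """Height of the normal curve at x for the given mean and standard deviation."""
    z = (x - mean) / sdev  # distance from the mean, in standard deviations
    return math.exp(-0.5 * z * z) / (sdev * math.sqrt(2 * math.pi))


# Points at the same distance on either side of the mean have the same height.
left = gaussian_pdf(-1.0, mean=0.0, sdev=1.0)
right = gaussian_pdf(1.0, mean=0.0, sdev=1.0)
peak = gaussian_pdf(0.0, mean=0.0, sdev=1.0)
print(left, right, peak)
```

The curve is highest at the mean and depends on nothing other than the mean and the standard deviation.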

Examples:

The distribution in this example fits real data collected from 14-year-old girls during a study. The y-axis represents the frequency density and the x-axis represents height.

The graph is concentrated around the mean, 1.512, and falls off gradually on both sides, with a standard deviation of 0.00741.

This graph shows normal distributions with different parameters, drawn in different colors. Note that a smaller standard deviation corresponds to a narrower spread and a taller peak.

Once data has been identified as normally distributed, we can make statistical generalizations using tools such as the empirical rule.

The empirical rule is a statistical rule that states that, for a normal distribution:

About 99.7% of the data falls within 3 standard deviations of the mean.
About 95% falls within 2 standard deviations of the mean.
About 68% falls within 1 standard deviation of the mean.
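These percentages can be checked numerically. The sketch below relies on the fact that, for a normal distribution, the probability of landing within k standard deviations of the mean equals erf(k / √2). The helper name fraction_within is hypothetical.

```python
import math


def fraction_within(k):
    """Probability that a normally distributed value lies within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    # Print each probability as a percentage
    print(k, fraction_within(k) * 100)
```

The results are roughly 68.3%, 95.4% and 99.7%, which the empirical rule rounds to 68%, 95% and 99.7%.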

Note that no real data set is perfectly normal. The normal distribution is an abstraction that makes the data easier to understand and to generalize about.

Implementation of mean and standard deviation from scratch with Python

Only basic knowledge of coding is required here.

Mean

To calculate the mean, I created the function calc_mean. The function takes one argument, “data_l”, which is the list of entries. It returns the mean of the list as a float.

The mean is equal to the sum of the list divided by the total number of entries.

We do not want the mean to be returned as an integer, so I multiplied by 1.0 to get a float result. (In Python 3 the / operator already returns a float, so this step only matters in Python 2.)
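A minimal sketch of calc_mean as described above. The names calc_mean and data_l come from the text, while the body is a reconstruction from the description and may differ from the original code.

```python
def calc_mean(data_l):
    """Return the mean of a list of numbers as a float."""
    # Multiplying by 1.0 forces a float result, as described in the text.
    return sum(data_l) * 1.0 / len(data_l)


print(calc_mean([2, 4, 6, 8]))  # 5.0
```

Note that this sketch raises a ZeroDivisionError for an empty list.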

Standard deviation

To calculate the standard deviation, I created a function called calc_sdev. The function takes the list of sample data as its argument and returns the standard deviation of the list as a float.

First, we need to store the basic variables that will be used in the standard deviation formula.

I used the function calc_mean to calculate the mean of the data set and stored it in a variable named “mean”.
I calculated n, which is N - 1 for a sample, where N is the number of entries.
I created the variable “sigma” and set it to 0. It accumulates the sum of squared differences used to compute the standard deviation.

Next, I implemented the formula in two steps, looping through all the elements in the data set as “s”.

In the first step, sigma becomes the sum of the squares of the differences between s and the mean, for all s.
Then, the standard deviation is the square root of sigma divided by n.
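The steps above can be sketched as follows. The names calc_sdev, calc_mean, mean, n, sigma and s come from the text, while the body is a reconstruction from the description and may differ from the original code.

```python
import math


def calc_mean(data_l):
    """Return the mean of a list of numbers as a float."""
    return sum(data_l) * 1.0 / len(data_l)


def calc_sdev(data_l):
    """Return the sample standard deviation of a list of numbers as a float."""
    mean = calc_mean(data_l)
    n = len(data_l) - 1  # N - 1, because the data is a sample
    sigma = 0.0          # running sum of squared differences
    # Step 1: add up the squared difference between each entry and the mean.
    for s in data_l:
        sigma += (s - mean) ** 2
    # Step 2: divide by n and take the square root.
    return math.sqrt(sigma / n)


print(calc_sdev([2, 4, 6, 8]))
```

Because of the division by N - 1, this sketch needs at least two entries.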

These functions can then be used to calculate the mean and sample standard deviation of any data set.

Execution

In this section, we will calculate the mean and standard deviation of 100 numbers using the functions we created.

First, I will generate a list of 100 random numbers with the following:

I imported seed to initialize the random number generator
I imported randn to generate random numbers

You can use any seed value to generate your random numbers; for example, I used seed(1).

I stored my random numbers in the list, data_list.

Finally, I used the functions calc_mean and calc_sdev to print the mean and standard deviation for data_list.
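The whole run can be sketched as below. To keep the sketch free of third-party dependencies, it swaps in the standard library's random.seed and random.gauss for the seed and randn functions named above, so the generated numbers will differ from the article's. With NumPy installed, `from numpy.random import seed, randn` followed by `randn(100)` would play the same role. The function bodies are reconstructions from the description.

```python
import math
import random


def calc_mean(data_l):
    """Return the mean of a list of numbers as a float."""
    return sum(data_l) * 1.0 / len(data_l)


def calc_sdev(data_l):
    """Return the sample standard deviation of a list of numbers as a float."""
    mean = calc_mean(data_l)
    n = len(data_l) - 1  # N - 1 for a sample
    sigma = 0.0
    for s in data_l:
        sigma += (s - mean) ** 2
    return math.sqrt(sigma / n)


# Seed the generator so the same numbers are produced on every run.
random.seed(1)

# 100 draws from a normal distribution with mean 0 and standard deviation 1.
data_list = [random.gauss(0, 1) for _ in range(100)]

print("Mean:", calc_mean(data_list))
print("Standard deviation:", calc_sdev(data_list))
```

Since the numbers are drawn from a standard normal distribution, the printed mean should be close to 0 and the standard deviation close to 1.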

These functions work on any data set, as long as it is stored as a list.

Note that the function we created calculates the standard deviation for a sample.

Here is a challenge: create a function to calculate standard deviation for population data.

Hint: Change N-1 to N.
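If you want to compare your answer, here is one possible sketch that follows the hint. The function name calc_pop_sdev is hypothetical.

```python
import math


def calc_pop_sdev(data_l):
    """Return the population standard deviation of a list of numbers as a float."""
    big_n = len(data_l)  # N, not N - 1, because the data is the whole population
    mean = sum(data_l) / big_n
    sigma = sum((s - mean) ** 2 for s in data_l)
    return math.sqrt(sigma / big_n)


print(calc_pop_sdev([2, 4, 6, 8]))
```

For the same list, this value is slightly smaller than the sample standard deviation, because the sum is divided by N instead of N - 1.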

Conclusion

The importance of understanding data distributions and the parameters used to summarize them cannot be overemphasized, and this article has discussed both.

Every distribution has specific parameters that describe it, just as the normal distribution can be summarized by the numerical values of its mean and standard deviation.

You can read about other distributions and the parameters that summarize them via this link.

Thanks for reading, and happy learning from me and DSN AIPlus UI.

Written by Giwayusuf