Normal distributions (Gauss)

Vicente P. Soloviev
Published in Analytics Vidhya · 3 min read · May 1, 2020

In data science, a variable is often assumed to follow a normal distribution. The assumption is convenient: a normal distribution is easy to work with and less computationally expensive than estimating the distribution with other methods such as kernel density estimation. It is one of the simplest ways to handle continuous variables.

A normal distribution is a theoretical model that fits a random variable's data to a density function, centred on a mean and spread according to a variance. An example of a normal distribution is shown in Figure 1. The Y axis shows the density function, while the X axis shows the values of the random variable.

Figure 1: a normal distribution with mean 0 and variance 0.1. The Y axis shows the density function, while the X axis shows the values of the random variable.
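As a sketch, the density function behind such a plot can be written directly from its formula; the function name below is illustrative, not from the original article:

```python
import math

def normal_pdf(x, mean=0.0, var=0.1):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# The density peaks at the mean and decays symmetrically on both sides.
print(normal_pdf(0.0))  # highest density, at the mean
print(normal_pdf(1.0))  # far lower density, one unit away from the mean
```

With variance 0.1 the peak is above 1 — a density can exceed 1 as long as it integrates to 1 over the whole domain.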

What characterises this distribution is the way the data are spread over the domain. The data are centred on the mean, with a spread governed by the standard deviation (the square root of the variance). Most instances of the variable lie at or near the mean. The empirical rule (often confused with the central limit theorem) says that around 68% of the data lie in the range [mean − std, mean + std] and around 95% in the range [mean − 2·std, mean + 2·std]. This is the origin of the famous Gauss bell shown in the figure: the closer to the mean, the more probable it is to find data (higher density), and the farther from the mean, the less probable (lower density).
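The empirical rule can be checked numerically by sampling; a minimal sketch using only the standard library:

```python
import random

random.seed(0)
mean, std = 0.0, 1.0  # std is the standard deviation, sqrt of the variance
samples = [random.gauss(mean, std) for _ in range(100_000)]

# Fraction of samples within one and two standard deviations of the mean.
within_1 = sum(abs(x - mean) <= 1 * std for x in samples) / len(samples)
within_2 = sum(abs(x - mean) <= 2 * std for x in samples) / len(samples)
print(f"within 1 std: {within_1:.3f}")  # close to 0.68
print(f"within 2 std: {within_2:.3f}")  # close to 0.95
```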

Figure 1 showed a normal distribution in a 1-dimensional space, but the same theory applies in higher dimensions. Figure 2 shows a normal distribution in a 2-dimensional space.

Figure 2: normal distribution in a 2-dimensional space. Both variables are centred on mean 0, with variance 5. The X axis shows the values of variable 1, the Y axis the values of variable 2, and the Z axis the density function of the data.

Figure 2 shows a normal distribution with two variables centred on the same mean and with the same variance. A two-dimensional normal distribution also specifies the relationship between the variables through their covariance; here the covariance is zero, so the variables are independent, and each one follows the empirical rule along its own axis: around 68% of the instances lie within one standard deviation of the mean (0, 0), around 95% within two, and so on. An example of the resulting distribution of instances is shown in Figure 3.

Figure 3: distribution of instances of a bivariate normal distribution centred on 0, with variance 5.
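A sketch of drawing such instances, assuming NumPy is available; the zero off-diagonal entries of the covariance matrix encode the independence of the two variables:

```python
import numpy as np

rng = np.random.default_rng(42)
mean = [0.0, 0.0]
cov = [[5.0, 0.0],   # variance 5 for each variable,
       [0.0, 5.0]]   # zero covariance: the variables are independent
points = rng.multivariate_normal(mean, cov, size=10_000)

# Distance of each point from the centre (0, 0): the density of points
# decreases as this distance grows, producing the circular cloud of Figure 3.
dist = np.linalg.norm(points, axis=1)
print(f"mean distance from centre: {dist.mean():.2f}")
```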

As shown in Figure 3, the density of instances decreases with distance from the mean (0, 0) of the distribution. Let us see what happens when the two variables are not centred on the same mean (Figure 4).

Figure 4: normal distribution in a 2-dimensional space. The first variable is centred on mean -4 with variance 25, while the second is centred on mean -2 with variance 1. The X axis shows the values of variable 1, the Y axis the values of variable 2, and the Z axis the density function of the data.

Figure 4 shows a situation in which the two variables have different means and variances. The Gauss bell is therefore not as symmetric as in the previous two-dimensional plot: the density function stretches along each axis according to that variable's own variance, so that each marginal distribution still satisfies the empirical rule. Let us see what happens to the distribution of the instances.
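The effect of the different parameters can be verified by sampling; a sketch assuming NumPy, with the means and variances of Figure 4:

```python
import numpy as np

rng = np.random.default_rng(0)
mean = [-4.0, -2.0]
cov = [[25.0, 0.0],  # variance 25 (std 5) for the first variable
       [0.0, 1.0]]   # variance 1 (std 1) for the second
points = rng.multivariate_normal(mean, cov, size=10_000)

# The spread along each axis reflects that variable's own variance,
# so the cloud of points is an ellipse rather than a circle.
print(f"std of variable 1: {points[:, 0].std():.2f}")  # close to 5
print(f"std of variable 2: {points[:, 1].std():.2f}")  # close to 1
```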

Figure 5: distribution of instances of a bivariate normal distribution whose two variables have different parameters.

As shown in Figure 5, the instances are no longer distributed in a perfect circle as they were in Figure 3; the cloud is an ellipse, wider along the axis with the larger variance.

There are other ways to handle continuous random variables, for example KDE (Kernel Density Estimation), or transforming the data into a discrete domain (discretization) so that discrete-domain techniques can be applied.
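A brief sketch of both alternatives, assuming NumPy and SciPy are available (the bin edges below are chosen arbitrarily for illustration):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
data = rng.normal(0.0, 1.0, size=1_000)

# Kernel density estimation: a non-parametric estimate of the density,
# built by placing a small Gaussian kernel on every data point.
kde = gaussian_kde(data)
print(f"KDE density at 0: {kde(0.0)[0]:.2f}")  # true value is about 0.40

# Discretization: map the continuous values into a small number of bins,
# after which discrete-domain techniques can be used on the bin labels.
bins = np.linspace(-3, 3, 7)      # 6 equal-width intervals over [-3, 3]
labels = np.digitize(data, bins)  # each value becomes a bin index
print(f"instances per bin: {np.bincount(labels)}")
```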

Many algorithms use these distributions for different purposes; the multivariate normal, for example, appears wherever several continuous variables need to be modelled jointly.

Plotting code for the two-dimensional normal distribution is the following:

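A minimal sketch, assuming NumPy, SciPy and Matplotlib, that produces a surface like the one in Figure 2:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

# Grid of (x, y) coordinates over the plane.
x = np.linspace(-10, 10, 100)
y = np.linspace(-10, 10, 100)
X, Y = np.meshgrid(x, y)
pos = np.dstack((X, Y))

# Bivariate normal with mean (0, 0) and variance 5 for each variable.
rv = multivariate_normal(mean=[0.0, 0.0], cov=[[5.0, 0.0], [0.0, 5.0]])
Z = rv.pdf(pos)  # density evaluated at every grid point

# 3-D surface plot: X and Y are the variables, Z the density function.
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_surface(X, Y, Z, cmap="viridis")
ax.set_xlabel("variable 1")
ax.set_ylabel("variable 2")
ax.set_zlabel("density")
plt.savefig("bivariate_normal.png")
```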


PhD student in Computational Intelligence Group (Universidad Politécnica de Madrid). Applying Machine Learning and Bayesian networks in industrial problems.