Normal (Gaussian) distributions
In data science, a variable is often assumed to follow a normal distribution. This is because it is easy to work with and computationally cheaper than estimating the distribution with other methods, such as kernel density estimation. It is one of the simplest ways to model continuous variables.
A normal distribution is a theoretical model that fits the data of a random variable with a density function, centred on a mean and with a given variance. An example of a normal distribution can be found in Figure 1. The Y axis shows the density function, while the X axis shows the values of the random variable.
What is characteristic of this distribution is the way the data is spread over the domain. Data is centred on the mean, with a spread given by the standard deviation. This means that most instances of the variable lie near the mean. The empirical 68-95-99.7 rule (not the central limit theorem, which is a different result) states that around 68% of the data lies in the range [mean - standard deviation, mean + standard deviation], and 95% lies in [mean - 2*standard deviation, mean + 2*standard deviation]. This is the reason for the so-called Gauss bell shape shown in the figure: as I get closer to the mean of the variable, it is more probable to find data (higher density), and as I get away from the mean, it is less probable (lower density).
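The empirical rule can be checked with a quick simulation. The following is a sketch using NumPy with illustrative parameters (a standard normal, so mean 0 and standard deviation 1); for any choice of mean and standard deviation the two fractions come out near 0.68 and 0.95:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0  # illustrative mean and standard deviation
samples = rng.normal(mu, sigma, size=100_000)

# Fraction of samples within one and within two standard deviations of the mean
within_1sd = np.mean(np.abs(samples - mu) <= 1 * sigma)
within_2sd = np.mean(np.abs(samples - mu) <= 2 * sigma)
print(within_1sd, within_2sd)
```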
Figure 1 showed a normal distribution in a one-dimensional space, but the same theory applies in more dimensions. Figure 2 shows a normal distribution in a two-dimensional space.
Figure 2 shows a normal distribution with two variables centred on the same mean and with the same variance. In a two-dimensional normal distribution, the joint density is shaped so that each variable, taken on its own, is still normally distributed and must comply with the empirical rule: about 68% of the values of each coordinate fall within one standard deviation of its mean. An example of the distribution of instances is shown in Figure 3.
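To illustrate, a two-dimensional normal like the one in the figure can be sampled as follows. This is a NumPy sketch; the mean (0, 0) and unit variances are assumptions matching the description, and the empirical rule can be verified per coordinate:

```python
import numpy as np

rng = np.random.default_rng(1)
mean = [0.0, 0.0]
cov = [[1.0, 0.0],
       [0.0, 1.0]]  # same variance for both variables, no correlation
points = rng.multivariate_normal(mean, cov, size=100_000)

# Each marginal is itself N(0, 1), so the 68% rule holds per coordinate
frac_x = np.mean(np.abs(points[:, 0]) <= 1.0)
frac_y = np.mean(np.abs(points[:, 1]) <= 1.0)
print(frac_x, frac_y)
```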
As shown in Figure 3, the density of instances decreases the further they are from the mean (0, 0) of the distribution. Let us see what happens when the two variables are not centred on the same mean (Figure 4).
Figure 4 shows a situation in which the two variables have different means and variances. The Gauss bell is therefore not as symmetric as in the previous two-dimensional plot. The density function is shaped so that each variable still follows its own normal distribution. Let us see what happens with the distribution of the instances.
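A sketch of this situation, with hypothetical means and variances (the exact parameters of the figure are not known), shows that the sample statistics recover the parameters and the cloud of points is stretched along the higher-variance axis:

```python
import numpy as np

rng = np.random.default_rng(3)
mean = [2.0, -1.0]                # different mean per variable (illustrative)
cov = [[0.5, 0.0],
       [0.0, 3.0]]                # different variance per variable (illustrative)
points = rng.multivariate_normal(mean, cov, size=100_000)

# Sample statistics approximate the parameters; the cloud is an ellipse, not a circle
sample_mean = points.mean(axis=0)
sample_var = points.var(axis=0)
print(sample_mean, sample_var)
```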
As shown in Figure 5, the density of the instances is no longer distributed as a perfect circle, as it was in Figure 3.
There are other ways to handle continuous random variables. Examples are KDE (Kernel Density Estimation), or transforming the data into a discrete domain (discretization) so that discrete-domain techniques can be applied.
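Both alternatives can be sketched in a few lines of NumPy. The Gaussian kernel, Silverman bandwidth, and bin edges below are illustrative choices, not the only options:

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(0.0, 1.0, size=5_000)

# --- Kernel density estimation (Gaussian kernel, Silverman's rule of thumb) ---
h = 1.06 * data.std() * len(data) ** (-1 / 5)  # bandwidth

def kde(x):
    # Average of Gaussian kernels centred at each data point
    return np.mean(np.exp(-0.5 * ((x - data) / h) ** 2)) / (h * np.sqrt(2 * np.pi))

density_at_mean = kde(0.0)  # true density at 0 is 1/sqrt(2*pi), about 0.399

# --- Discretization: map the continuous variable onto a few bins ---
edges = np.linspace(-3, 3, 7)        # 6 equal-width bins over [-3, 3]
labels = np.digitize(data, edges)    # integer bin index per sample
print(density_at_mean, labels[:5])
```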
Many algorithms use these distributions with different objectives. For example, a multivariate normal
Plotting code for the two-dimensional normal distribution is the following:
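The listing is not present here, so the following is a minimal sketch of such plotting code, assuming NumPy and matplotlib and a standard bivariate normal (mean (0, 0), unit variances, no correlation); the parameters of the original figure are not known:

```python
import numpy as np
import matplotlib.pyplot as plt

# Grid over the (x, y) plane
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x, y)

# Density of a standard bivariate normal: mean (0, 0), unit variance, no correlation
Z = np.exp(-0.5 * (X**2 + Y**2)) / (2 * np.pi)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_surface(X, Y, Z, cmap="viridis")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("density")
plt.show()
```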