Statistics Part 3: Gaussian Distribution - Demystifying the bell curve!

Published in kgxperience, Jul 19, 2023 (4 min read)

Hello all🙋‍♂️! In this blog post, we’ll look at the most widely used distribution in statistics: the normal distribution, also known as the Gaussian distribution, which is so common that it is sometimes treated as a universal distribution. Its bell-shaped curve shows up in many areas as a model for how the values in a dataset are distributed. But what is it? What does it represent?🤔 Let’s find out!

The normal distribution is a symmetrical curve, which means the data is spread evenly on both sides of the mean, which sits at the center. The highest point of the Gaussian curve is located at the mean of the data. Although every normal distribution is symmetrical, not every symmetrical distribution is normal (a uniform distribution, for example, is symmetrical but flat). The Gaussian distribution has two parameters, the mean and the standard deviation, and the curve is plotted from these two values:

The mean (μ) determines where the peak of the bell curve sits. It tells us the value around which the data is centered.
The standard deviation (σ) tells us the spread of the data. A higher standard deviation means the data points are more spread out from the mean, while a lower standard deviation means the data points are closer to the mean.
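
Put together, these two parameters define the curve through the normal probability density function. It is written here for reference, with x standing for any data value:

```latex
f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}
\exp\!\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right)
```

The exponent is largest (zero) when x = μ, which is why the peak sits at the mean, and the factor 1/(σ√(2π)) in front sets how tall that peak is.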

When the standard deviation is large, the distribution is wider and flatter. Conversely, when the standard deviation is small, the distribution is narrower and taller.

If you were to increase the value of the standard deviation while keeping the mean constant, the distribution would become wider, and the data points would be more spread out from the mean. This makes sense since a larger standard deviation means more variability in the data points.

On the other hand, if you were to decrease the standard deviation while keeping the mean constant, the distribution would become narrower, and the data points would be closer to the mean. This is because a smaller standard deviation means less variability in the data points.
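
To see this effect in numbers, here is a minimal sketch that evaluates the normal density for a few standard deviations while keeping the mean fixed. The helper name `normal_pdf`, the grid, and the parameter values are all hypothetical choices made for illustration only.

```python
import numpy as np

def normal_pdf(x, mean, std):
    """Evaluate the normal density at x for the given mean and standard deviation."""
    coefficient = 1.0 / (std * np.sqrt(2.0 * np.pi))
    exponent = -0.5 * ((x - mean) / std) ** 2
    return coefficient * np.exp(exponent)

# Hypothetical values chosen only for illustration.
mean = 0.0
x = np.linspace(-40.0, 40.0, 8001)  # evenly spaced grid with step 0.01
dx = x[1] - x[0]

for std in (1.0, 2.0, 4.0):
    pdf = normal_pdf(x, mean, std)
    peak = pdf.max()        # height of the curve at the mean
    area = pdf.sum() * dx   # Riemann-sum approximation of the area under the curve
    print(f"std={std}: peak={peak:.4f}, area={area:.4f}")
```

Each time the standard deviation doubles, the peak height halves, while the area under the curve stays close to 1. That is the "wider and flatter" behaviour described above: the same total probability is spread over a larger range.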

The standard deviation can be represented by the following equation:
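
For a dataset of N values x_1, ..., x_N with mean μ, and assuming the population form is the one intended (it is also what numpy's `std()` computes by default), the standard deviation can be written as:

```latex
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^{2}}
```

In words: take each value's distance from the mean, square it, average those squares, and take the square root.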

Now we will use a dataset to plot the Gaussian distribution, computing the formula ourselves with numpy. For that we will use the Titanic dataset from Kaggle.

Code representation:

The code here was executed in a Kaggle Notebook, and I would encourage you to do the same. Otherwise you would have to download the dataset, extract it, and upload it to whatever editor you are using, which is a waste of your time.😊

JUST USE KAGGLE!!

Alright, after setting up the notebook with the Titanic dataset, we’ll extract the ‘Age’ column and convert it into a numpy array.

Importing the required libraries:

import numpy as np 
import pandas as pd 
from scipy.stats import norm
import matplotlib.pyplot as plt

After importing the libraries we’ll load the dataset:

train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")

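Next, we’ll fill the null values in ‘Age’ with the mean of the column, so that every passenger has a value. Keep in mind that this leaves the mean unchanged but piles extra values onto it, which slightly shrinks the standard deviation.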

# Assign the result back; inplace=True on a selected column is unreliable in recent pandas
train_data['Age'] = train_data['Age'].fillna(train_data['Age'].mean())
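
To make the effect of mean imputation concrete, here is a small sketch on a toy column. The `ages` values are hypothetical and only stand in for the Titanic ‘Age’ column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ages standing in for the Titanic 'Age' column.
ages = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan, 54.0], name="Age")

# Fill missing entries with the mean of the observed values.
filled = ages.fillna(ages.mean())

print("missing before:", ages.isna().sum())
print("missing after: ", filled.isna().sum())
print("mean before/after:", ages.mean(), filled.mean())
print("std before/after: ", ages.std(), filled.std())
```

The mean stays the same after filling, but the standard deviation gets smaller, because the filled-in values all sit exactly at the mean and add no spread.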

Now we’ll convert the column to a numpy array. Before that, we’ll sort the values in ascending order so that the line plot is drawn from left to right instead of zigzagging between points.

# Sort the ages in ascending order and convert them straight to a 1-D numpy array
age_df = train_data['Age'].sort_values(ascending=True).to_numpy()

Now that the data is set, we’ll compute the density for each age and plot it using the plt.plot() function.

# normal distribution formula
mean = age_df.mean()
std = age_df.std()
# 1 / (sigma * sqrt(2*pi)): the whole denominator must be inside the parentheses
coefficient = 1 / (std * np.sqrt(2 * np.pi))
exponent = (-0.5) * ((age_df - mean) / std) ** 2
output = coefficient * np.exp(exponent)
plt.figure()
plt.title('plotting the pdf')
plt.xlabel('Age')
plt.ylabel('Density')
plt.plot(age_df, output)
plt.show()
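
Since scipy.stats.norm is already imported, one way to sanity-check a hand-written density is to compare it with norm.pdf on the same data. This is only a sketch: the `ages` array below is hypothetical synthetic data, not the Titanic column.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical synthetic ages; in the notebook this would be the sorted 'Age' array.
rng = np.random.default_rng(0)
ages = np.sort(rng.normal(loc=30.0, scale=13.0, size=500))

mean = ages.mean()
std = ages.std()

# Manual density: note the parentheses around the whole denominator.
manual = np.exp(-0.5 * ((ages - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Reference density from SciPy using the same mean and standard deviation.
reference = norm.pdf(ages, loc=mean, scale=std)

print("max abs difference:", np.max(np.abs(manual - reference)))
print("formulas agree:", np.allclose(manual, reference))
```

If the two arrays agree, the hand-written formula is doing what it should. A common slip is writing 1/std*np.sqrt(2*np.pi), which Python reads as (1/std) multiplied by the square root instead of 1 divided by the product.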

The normal distribution formula looks like a weird equation: it contains both π and e, which makes it seem far more complicated than a simple bell curve deserves. How do π and e affect the bell curve? Why such a complex equation for the normal distribution? We’ll see in the upcoming blogs. Thank you👋


Written by Nawin Raj Kumar S