Normal Distribution and Beta Distribution: What They Are, and How to Generate Them and Visualize Them in Python

Published in The Startup · 7 min read · Dec 2, 2020

This article describes two popular distributions, the normal distribution and the beta distribution, and shows how to generate and plot them in Python. If you find it interesting or helpful, or you simply liked it, remember to CLAP!! Please don’t hesitate to contact the author with questions, feedback, or corrections. Enjoy, and happy coding!

The normal distribution is a probability function that describes how the values of a variable are distributed. It is a symmetric distribution where most of the observations cluster around the central peak and the probabilities for values further away from the mean taper off equally in both directions. Extreme values in both tails of the distribution are similarly unlikely. [1]

A normal distribution is a common probability distribution. It has a shape often referred to as a “bell curve.” Many everyday data sets approximately follow a normal distribution: for example, the heights of adult humans, the scores on a test given to a large class, and errors in measurements. The normal distribution is always symmetrical about the mean. [2]

The standard deviation is the measure of how spread out a normally distributed set of data is. It is a statistic that tells you how closely all of the examples are gathered around the mean in a data set. The shape of a normal distribution is determined by the mean and the standard deviation. The steeper the bell curve, the smaller the standard deviation. If the examples are spread far apart, the bell curve will be much flatter, meaning the standard deviation is large. [2]
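To make the link between standard deviation and curve shape concrete, here is a small sketch that evaluates the normal density at its peak for a few standard deviations using scipy.stats.norm. The sigma values below are illustrative assumptions, not values used elsewhere in this article.

```python
import numpy as np
from scipy import stats

# illustrative standard deviations (chosen for this sketch only)
sigmas = [0.5, 1.0, 2.0]
mu = 0

# the peak of a normal density sits at the mean,
# and its height is 1 / (sigma * sqrt(2 * pi))
peak_heights = {
    sigma: stats.norm.pdf(mu, loc=mu, scale=sigma) for sigma in sigmas
}

for sigma, height in peak_heights.items():
    print(f"sigma = {sigma}: peak height = {height:.4f}")
```

The smaller the standard deviation, the taller and narrower the curve; the larger it is, the flatter the curve.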

I recommend trying out this code on your own in Python, and modifying variable values when generating the distribution, then observing how the visualization changes. :-).

# import libraries
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

We will use NumPy’s Generator.normal() method, called on a generator created with np.random.default_rng(), to generate a dataset of 1,000,000 numbers that follows a normal distribution with mean 0 and standard deviation 1.

# mean
mu = 0
# standard deviation
sigma = 1
# number of observations in our dataset
n = 1_000_000
# construct a random Generator
s = np.random.default_rng(0)
# generate an array of 1_000_000 random values that follow a normal distribution with mean mu and st.dev. sigma
r = s.normal(mu, sigma, n)

We now have an array r which contains our data points that follow a normal distribution. However, we want to look at descriptive statistics, and to do that, it would be more useful to have these data points in a dataframe object. Below is code for transforming the array into a dataframe, and then viewing the descriptive statistics on the dataset.

# transform the array into a dataframe
df = pd.DataFrame(r, columns=['value'])
# view descriptive statistics on the data
dstats = df.describe()
dstats

Interpretation of the above result:
count = total number of observations/points in our dataset
mean = sum of all values divided by the total number of values
std = standard deviation of the values (close to 1, the value we chose for sigma)
min = the minimum value in our dataset
25% = 25th percentile, 25% of the values in our dataset are less than about -0.67
50% = 50th percentile or median, half of the values are above it and half are below it (about 0 here)
75% = 75th percentile, 75% of the values in our dataset are less than about 0.67
max = the maximum value in our dataset
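As a sanity check on the percentile values above, the sketch below compares the theoretical quartiles of a standard normal distribution, obtained from scipy.stats.norm.ppf, with quartiles estimated from a random sample. The sample is regenerated here so the example runs on its own.

```python
import numpy as np
from scipy import stats

# theoretical quartiles of a standard normal distribution (mean 0, st.dev. 1)
q25, q50, q75 = stats.norm.ppf([0.25, 0.50, 0.75])

# quartiles estimated from a random sample of the same distribution
rng = np.random.default_rng(0)
sample = rng.normal(0, 1, 1_000_000)
s25, s50, s75 = np.percentile(sample, [25, 50, 75])

print(f"theoretical: {q25:.4f}, {q50:.4f}, {q75:.4f}")
print(f"sample:      {s25:.4f}, {s50:.4f}, {s75:.4f}")
```

With a sample this large, the estimated quartiles should land very close to the theoretical ones.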

# save stats above into individual variables
mean = dstats.loc['mean'].value
std = dstats.loc['std'].value
p25 = dstats.loc['25%'].value
median = dstats.loc['50%'].value
p75 = dstats.loc['75%'].value
minv = dstats.loc['min'].value
maxv = dstats.loc['max'].value

Now, let’s plot our distribution, along with vertical lines to indicate the above statistical values and what they look like with respect to our histogram.

# plot distribution
plt.figure(figsize=(10,5))
# density=True scales the bar heights so that the total bar area is 1
plt.hist(df['value'], bins=100, density=True, color='teal')
plt.axvline(x=mean, color='red', ls='--')
plt.axvline(x=mean+std, color='orange', ls='-')
plt.axvline(x=mean-std, color='salmon', ls='-')
plt.axvline(x=p25, color='purple', ls=':')
plt.axvline(x=median, color='black', ls=':')
plt.axvline(x=p75, color='lime', ls=':')
plt.axvline(x=minv, color='blue', ls='-.')
plt.axvline(x=maxv, color='navy', ls='-.')
plt.title('Normal Distribution', size=16, color='blue', pad=20)
plt.xlabel('Values', color='blue')
plt.ylabel('Frequency Density', color='blue')
plt.legend(['Mean', '+Standard Deviation',
            '-Standard Deviation', '25th Percentile',
            '50th Percentile or Median', '75th Percentile',
            'Minimum Value', 'Maximum Value'])
# apply the layout before saving so the saved image matches what is shown
plt.tight_layout()
plt.savefig('blog_fig1.jpg', dpi=200)
plt.show()

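In a density histogram, the area of each bar represents the relative frequency of its bin, so the height of the bar is that frequency divided by the bin width. This is called frequency density. Note that plt.hist() plots raw counts by default; passing density=True rescales the heights so that the bar areas sum to 1.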
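The frequency density calculation can be sketched with np.histogram, which returns the raw counts and bin edges. The sample size and variable names below are illustrative assumptions; the manual result is compared against NumPy’s own density=True normalisation.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(0, 1, 100_000)

# raw counts per bin and the bin edges
counts, edges = np.histogram(values, bins=100)
widths = np.diff(edges)

# density: relative frequency of each bin divided by its width
density_manual = counts / (counts.sum() * widths)

# NumPy's built-in density normalisation, for comparison
density_numpy, _ = np.histogram(values, bins=100, density=True)

# the bar areas (height * width) should add up to 1
total_area = np.sum(density_manual * widths)
print(f"total area under the bars: {total_area:.6f}")
```

Both computations should agree, and the bar areas should sum to 1.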

If we modify the values of the mean (mu) and standard deviation (sigma), we can get different normal distributions.

A few more observations about a normal distribution:
- the mean and median are the same
- about 68% of the values are within one standard deviation of the mean
- about 95% of the values are within two standard deviations of the mean
- about 99.7% of the values are within three standard deviations of the mean
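These percentages can be checked numerically. The sketch below computes the exact proportions from the normal cumulative distribution function (scipy.stats.norm.cdf) and compares them with the proportions observed in a random sample; the `results` dictionary is just a name chosen for this example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(0, 1, 1_000_000)

results = {}
for k in (1, 2, 3):
    # exact probability of falling within k standard deviations of the mean
    exact = stats.norm.cdf(k) - stats.norm.cdf(-k)
    # proportion of sampled values within k standard deviations of 0
    observed = np.mean(np.abs(sample) <= k)
    results[k] = (exact, observed)
    print(f"within {k} st.dev.: exact = {exact:.4f}, observed = {observed:.4f}")
```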

A Beta distribution is a type of probability distribution. This distribution represents a family of probabilities and is a versatile way to represent outcomes for percentages or proportions. For example, how likely is it that Kanye West will win the next Presidential election? You might think the probability is 0.2. Your friend might think it’s 0.15. The beta distribution gives you a way to describe this. [3]

The beta distribution also has two characteristic values, usually called alpha and beta, or more succinctly, just a and b. Each (a, b) pair determines a different beta distribution. When we sample from beta(a, b), each sampled value (a probability p) will be between 0.0 and 1.0, and if we sample many values they will average to about a / (a + b). For example, if we sample many values from beta(3, 1), each value will be between 0.0 and 1.0 and the values will average to about 3/4 = 0.75. [4]
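The beta(3, 1) example can be tried directly. This is a minimal sketch using NumPy’s Generator.beta(); the seed and sample size are arbitrary choices made for illustration.

```python
import numpy as np

a, b = 3, 1
rng = np.random.default_rng(0)

# draw many values from beta(a, b)
samples = rng.beta(a, b, 1_000_000)

expected_mean = a / (a + b)
sample_mean = samples.mean()

print(f"smallest value: {samples.min():.4f}")
print(f"largest value:  {samples.max():.4f}")
print(f"expected mean:  {expected_mean:.4f}")
print(f"sample mean:    {sample_mean:.4f}")
```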

In probability theory and statistics, the beta distribution is a family of continuous probability distributions defined on the interval [0, 1] parameterized by two positive shape parameters, denoted by α and β, that appear as exponents of the random variable and control the shape of the distribution. [5]
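To see how α and β act as exponents, recall that the beta density is f(x) = x^(α-1) (1-x)^(β-1) / B(α, β), where B is the beta function. The sketch below evaluates this formula by hand and compares it with scipy.stats.beta.pdf. The parameter values 2 and 5 are arbitrary choices for illustration, and the grid avoids the endpoints 0 and 1, where the density can be infinite for shape parameters below 1.

```python
import numpy as np
from scipy import special, stats

# illustrative shape parameters (not used elsewhere in this article)
a_shape, b_shape = 2.0, 5.0

# grid of points strictly inside the interval (0, 1)
grid = np.linspace(0.01, 0.99, 99)

# density written out from the formula, normalised by the beta function B(a, b)
pdf_manual = (
    grid ** (a_shape - 1) * (1 - grid) ** (b_shape - 1)
    / special.beta(a_shape, b_shape)
)

# the same density as computed by SciPy
pdf_scipy = stats.beta.pdf(grid, a_shape, b_shape)

max_diff = np.max(np.abs(pdf_manual - pdf_scipy))
print(f"largest difference between the two: {max_diff:.2e}")
```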

Now let’s see how we can generate and plot a beta distribution in Python. Again, I encourage you to try this code out for yourself :-).

# alpha
alpha = .5
# beta
beta = .5
# number of values
n = 1_000_000
# random seed (rvs draws from NumPy's global random state by default)
np.random.seed(0)
# x-axis
x = np.linspace(0, 1, 1001)
# beta distribution
s = stats.beta(alpha, beta)
# generate n random values which follow the above distribution
r = s.rvs(n)

Create a dataframe containing our values, and then view descriptive statistics on the data.

# create a dataframe to hold these values
df = pd.DataFrame(r, columns=['value'])
# view descriptive statistics on the data
dstats = df.describe()
dstats

Save above values into variables.

# save stats above into individual variables
mean = dstats.loc['mean'].value
std = dstats.loc['std'].value
p25 = dstats.loc['25%'].value
median = dstats.loc['50%'].value
p75 = dstats.loc['75%'].value
minv = dstats.loc['min'].value
maxv = dstats.loc['max'].value

Plot the histogram for our distribution.

plt.figure(figsize=(10,5))
# density=True scales the bar heights so that the total bar area is 1
plt.hist(r, bins=100, density=True, color='teal')
plt.axvline(x=mean, color='red', ls='--')
plt.axvline(x=mean+std, color='orange', ls='-')
plt.axvline(x=mean-std, color='salmon', ls='-')
plt.axvline(x=p25, color='purple', ls=':')
plt.axvline(x=median, color='black', ls=':')
plt.axvline(x=p75, color='lime', ls=':')
plt.axvline(x=minv, color='blue', ls='-.')
plt.axvline(x=maxv, color='navy', ls='-.')
plt.title('Beta Distribution', size=16, color='blue', pad=20)
plt.xlabel('Values', color='blue')
plt.ylabel('Frequency Density', color='blue')
plt.legend(['Mean', '+1 Standard Deviation',
            '-1 Standard Deviation', '25th Percentile',
            '50th Percentile or Median', '75th Percentile',
            'Minimum Value', 'Maximum Value'])
# apply the layout before saving so the saved image matches what is shown
plt.tight_layout()
plt.savefig('blog_fig2.jpg', dpi=200)
plt.show()

Plot the probability density function, with the sample statistics from above marked as vertical lines.

plt.figure(figsize=(10,5))
plt.plot(x, s.pdf(x), color='teal')
plt.axvline(x=mean, color='red', ls='--')
plt.axvline(x=mean+std, color='orange', ls='-')
plt.axvline(x=mean-std, color='salmon', ls='-')
plt.axvline(x=p25, color='purple', ls=':')
plt.axvline(x=median, color='black', ls=':')
plt.axvline(x=p75, color='lime', ls=':')
plt.axvline(x=minv, color='blue', ls='-.')
plt.axvline(x=maxv, color='navy', ls='-.')
plt.title('Beta Distribution', size=16, color='blue', pad=20)
plt.xlabel('Values', color='blue')
plt.ylabel('Density', color='blue')
plt.legend(['Probability Density Function', 'Mean', '+1 Standard Deviation',
            '-1 Standard Deviation', '25th Percentile',
            '50th Percentile or Median', '75th Percentile',
            'Minimum Value', 'Maximum Value'])
# apply the layout before saving so the saved image matches what is shown
plt.tight_layout()
plt.savefig('blog_fig3.jpg', dpi=200)
plt.show()

If we change the values for alpha and beta, the shape of our beta distribution, when plotted, will change. [6]

alphas = [0.5, 1.5, 3.0, 0.5]
betas = [0.5, 1.5, 3.0, 1.5]
lines = ['-', '--', ':', '-.']
x = np.linspace(0, 1, 1001)
fig, ax = plt.subplots(figsize=(10,5))
# draw one density curve per (alpha, beta) pair, each with its own line style
for a, b, l in zip(alphas, betas, lines):
    s = stats.beta(a, b)
    plt.plot(x, s.pdf(x), color='teal', label=r'$\alpha=%.1f,\ \beta=%.1f$' % (a, b), ls=l)
plt.xlim(0, 1)
plt.ylim(0, 3)
plt.title('Beta Distribution', size=16, color='blue', pad=20)
plt.xlabel('Values', color='blue')
plt.ylabel('Density', color='blue')
plt.legend(loc=0)
# apply the layout before saving so the saved image matches what is shown
plt.tight_layout()
plt.savefig('blog_fig4.jpg', dpi=200)
plt.show()

I hope this article provided a better understanding of Normal and Beta distributions. For more details, please refer to the resources below. Don’t hesitate to contact me with questions :-). If you found this article interesting, helpful, or you simply just liked it, remember to CLAP!! Thank you, and happy coding!

Resources:
[1] Statistics by Jim: Normal Distribution
[2] Varsity Tutors: Normal Distribution
[3] Statistics How To: Beta Distribution
[4] James Caffrey: Beta Distribution
[5] Wikipedia: Beta Distribution
[6] AstroML: Example of a Beta Distribution

Written by Cristina Sahoo