A basic overview of Distributions in Statistics.

Rakesh Swain

Published in

The Startup

7 min read · May 12, 2020

Topics covered

What is a distribution?
Symmetric and Asymmetric distribution
Kurtosis
Normal Distribution
Standard Normal Distribution
68–95–99.7 rule
Kernel Density Estimation (KDE)
Q-Q Plots or Quantile plots

What is a distribution?

Formally, a distribution is a function that shows how likely each of the possible values of a random variable is.

Let's say you are asked to order t-shirts for an event at your college. How do you estimate how many t-shirts of each size to order without asking every student?

You can use a distribution curve. Just by looking at the distribution of student heights, you can estimate what percentage of the t-shirts should be small, medium or XL.

In the picture above, the x-axis represents height and the y-axis represents frequency or, more precisely, density: how tightly people are clustered around a particular height. The height of the curve depends on the density. The higher the density, the higher the peak.

Note: Do not focus on the y-axis for now. Focus only on the shape of the curve and the x-axis.

With this concept, you can see that most students are between 169 and 185 cm tall and would need medium t-shirts (assuming t-shirt size depends on height).

The above plot is called a Probability Density Function or PDF.

However, it still doesn't answer the question of what percentage of the t-shirts need to be small, medium or XL.

For that estimate we can use a plot called the Cumulative Distribution Function or CDF. The PDF tells us how densely students are clustered around a particular height. The CDF tells us the proportion of students whose height is less than or equal to a particular height.

Looking at this curve, you can read off that 0.75 or 75% of students are shorter than 200 cm. Subtracting the CDF values at two heights gives the share of students between those heights, so you can now calculate the percentage of each t-shirt size you need to order.
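As a rough sketch of that arithmetic, the snippet below uses `statistics.NormalDist` from Python's standard library to turn a CDF into t-shirt shares. The mean of 177 cm, the standard deviation of 8 cm and the size cut-offs are assumptions made up for illustration; they are not read from the plots above.

```python
from statistics import NormalDist

# Assumed (hypothetical) class: heights ~ Normal(mean=177 cm, sd=8 cm).
heights = NormalDist(mu=177, sigma=8)

# Hypothetical size bands based on height alone.
share_small = heights.cdf(169)                      # shorter than 169 cm
share_medium = heights.cdf(185) - heights.cdf(169)  # between 169 and 185 cm
share_large = 1 - heights.cdf(185)                  # taller than 185 cm

for size, share in [("small", share_small),
                    ("medium", share_medium),
                    ("large", share_large)]:
    print(f"{size}: {share:.1%}")
# small: 15.9%
# medium: 68.3%
# large: 15.9%
```

Each share is just a difference between two CDF values, so the three shares always add up to 100%.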

Symmetric and Asymmetric Distributions

A symmetric distribution is a type of distribution where the left side of the distribution mirrors the right side. The above PDF plot was a symmetric distribution.

As you may have guessed, an asymmetric distribution is the opposite. An asymmetric distribution is also known as a skewed distribution.

A left-skewed or negatively skewed distribution has a long tail of extreme values to the left of the mean. Our height curve would have been left-skewed if we had a few extremely short students in our class but no equally extreme tall students.

A right-skewed or positively skewed distribution has a long tail of extreme values to the right of the mean. Our height curve would have been right-skewed if we had a few extremely tall students in our class but no equally extreme short students.
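One common way to put a number on skew is the average cubed z-score, which comes out negative for left-skewed data and positive for right-skewed data. The sketch below uses only the standard library; the height lists are invented for illustration.

```python
from statistics import mean, pstdev

def skewness(values):
    """Average cubed z-score: negative = left-skewed, positive = right-skewed."""
    m, s = mean(values), pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

# Hypothetical class heights in cm.
typical = [168, 170, 172, 174, 176, 178, 180, 182]
with_very_tall = typical + [205, 210]    # long tail on the right
with_very_short = typical + [140, 145]   # long tail on the left

print(skewness(typical))          # about 0: symmetric
print(skewness(with_very_tall))   # positive: right-skewed
print(skewness(with_very_short))  # negative: left-skewed
```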

Kurtosis

Kurtosis is a statistical measure that defines how heavily the tails of a distribution differ from the tails of a normal distribution.

Kurtosis gives an idea of how far the extreme values reach on both ends. For example, if we have extremely short students in our class, it tells us roughly how extreme their heights get, and likewise for tall students.

Using the concept of kurtosis, investors assess the risk of an investment by estimating the occasional extreme profits or losses they could see if they invest. This is called kurtosis risk.
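A minimal sketch of how kurtosis can be measured is the average fourth-power z-score minus 3, often called excess kurtosis, which is about 0 for normally distributed data. The seed, the sample size and the choice of a Laplace distribution as the heavy-tailed example are assumptions for illustration.

```python
import random
from statistics import mean, pstdev

def excess_kurtosis(values):
    """Average fourth-power z-score minus 3 (about 0 for normal data)."""
    m, s = mean(values), pstdev(values)
    return sum(((v - m) / s) ** 4 for v in values) / len(values) - 3

rng = random.Random(42)  # fixed seed so the sketch is repeatable
normal_like = [rng.gauss(0, 1) for _ in range(10_000)]
# Laplace-distributed values: an exponential magnitude with a random sign.
heavy_tailed = [rng.expovariate(1.0) * rng.choice((-1, 1))
                for _ in range(10_000)]

print(round(excess_kurtosis(normal_like), 2))   # close to 0
print(round(excess_kurtosis(heavy_tailed), 2))  # clearly above 0 (theory: 3)
```

A positive value means the tails are heavier than those of a normal distribution, so extreme values show up more often.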

Normal Distributions

The normal distribution is one of many types of distribution. In this article we will only discuss the normal distribution, as our goal is to understand distributions, not their types.

The Normal Distribution or Gaussian Distribution is the most common distribution seen in nature, for example in human height and weight. That's why it's called normal. It is a symmetric distribution where most of the values are clustered around the mean.

In this distribution mean = mode = median.

A normal distribution is denoted by: X ~ N(μ, σ²)

Where μ is the mean and σ² is the variance. If you know that your data set is normally distributed, you can work out the shape of the curve just from its mean and variance.

The area under a distribution curve represents probability, and the total area is always equal to 1 or 100%. So if you increase the variance of a curve, i.e. make it wider, the curve has to get shorter to keep the area the same.
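This trade-off can be checked numerically. The sketch below compares two normal curves with the same mean, using `statistics.NormalDist`; the `area` helper is a hypothetical name for a simple midpoint-rule sum, not a library function.

```python
from statistics import NormalDist

narrow = NormalDist(mu=0, sigma=1)  # variance 1
wide = NormalDist(mu=0, sigma=2)    # variance 4

# Peak height of each curve (the density at the mean).
print(round(narrow.pdf(0), 4))  # 0.3989
print(round(wide.pdf(0), 4))    # 0.1995

def area(dist, lo=-20.0, hi=20.0, steps=4000):
    """Approximate the area under the PDF with a midpoint-rule sum."""
    dx = (hi - lo) / steps
    return sum(dist.pdf(lo + (i + 0.5) * dx) for i in range(steps)) * dx

print(round(area(narrow), 4))  # 1.0
print(round(area(wide), 4))    # 1.0
```

Doubling the standard deviation halves the peak, while the total area stays at 1.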

Standard Normal Distribution

The standard normal distribution is a normal distribution with mean = 0 and variance = 1. (The red curve)

A variable that follows N(0, 1) is called a standard normal variate.

Standardization

Converting a normal distribution to a standard normal distribution is called standardization. To do this, we standardize each data point using the following formula.

z = (x - μ) / σ

Each data point x can be standardized using this formula, where μ is the mean and σ is the standard deviation of the data.

But why do we bother standardizing?

Let's understand this with an example. Suppose student A scores 85 marks in Maths and student B scores 70 in English. How do you know who performed better?

We can't just compare the marks, because they come from different subjects with different difficulty levels and therefore different scales. The average score in Maths might be higher than in English, or their variances may differ.

In cases like this, we can standardize both marks, i.e. bring them to the same scale, and then compare.
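Here is a small sketch of that comparison. A standardized score is the mark minus the class mean, divided by the class standard deviation. The class means and standard deviations below are invented for illustration, since the example above only gives the two marks.

```python
def z_score(x, mean, std):
    """Standardize x: how many standard deviations it lies from the mean."""
    return (x - mean) / std

# Hypothetical class statistics (not from the article).
maths_z = z_score(85, mean=80, std=10)   # student A
english_z = z_score(70, mean=60, std=5)  # student B

print(maths_z)    # 0.5
print(english_z)  # 2.0
```

Under these assumed numbers, student B is 2 standard deviations above the English average while student A is only 0.5 above the Maths average, so B did better relative to the class despite the lower raw mark.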

68–95–99.7 rule in Normal Distribution

1 standard deviation (1 std): the range from (mean - 1*std) to (mean + 1*std), where std is the standard deviation, i.e. the square root of the variance.

The 68–95–99.7 rule states that in a normal distribution about 68% of the data fall within -1 std to +1 std of the mean, about 95% fall within -2 std to +2 std, and about 99.7% fall within -3 std to +3 std.

Using this, you can now estimate the height range that covers about 68% of the students and what kind of t-shirts they need.
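The three percentages can be reproduced from the CDF of a standard normal distribution. This is a sketch using `statistics.NormalDist`; `share_within` is a hypothetical helper name, and the class mean of 177 cm and standard deviation of 8 cm are assumed for illustration.

```python
from statistics import NormalDist

standard = NormalDist()  # mean 0, standard deviation 1

def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return standard.cdf(k) - standard.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} std: {share_within(k):.4f}")
# within 1 std: 0.6827
# within 2 std: 0.9545
# within 3 std: 0.9973

# Applied to an assumed class with mean 177 cm and std 8 cm:
mean_cm, std_cm = 177, 8
print(f"about 68% of students: {mean_cm - std_cm} to {mean_cm + std_cm} cm")
# about 68% of students: 169 to 185 cm
```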

Kernel Density Estimation

Kernel density estimation is used to smooth out a histogram and convert it into a distribution curve. It replaces each sample point with a Gaussian-shaped (bell-shaped) kernel, then obtains the resulting estimate for the density by adding up these Gaussians.

Bandwidth of a kernel

The bandwidth of the Gaussian kernel is the width of the kernel that we use for KDE. The larger the bandwidth, the smoother the curve. As the bandwidth decreases, the PDF curve becomes more and more jagged.

Changing the bandwidth changes the shape of the kernel: a lower bandwidth means only points very close to the current position are given any weight, which leads to the estimate looking squiggly; a higher bandwidth means a shallow kernel where distant points can contribute.[1]

For a better understanding of this, I highly recommend visiting https://mathisonian.github.io/kde/
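The idea of adding up Gaussians can be written in a few lines. The following is a minimal sketch using only the standard library, not a production KDE; the sample heights, the two bandwidths and the `count_peaks` helper are assumptions chosen to show how bandwidth affects smoothness.

```python
from statistics import NormalDist

def kde(x, samples, bandwidth):
    """Density estimate at x: the average of Gaussian kernels centred on each sample."""
    kernel = NormalDist()  # standard normal used as the kernel shape
    total = sum(kernel.pdf((x - s) / bandwidth) for s in samples)
    return total / (len(samples) * bandwidth)

def count_peaks(samples, bandwidth, lo=100.0, hi=245.0, step=0.25):
    """Count local maxima of the estimate on a grid (a rough jaggedness measure)."""
    n = int((hi - lo) / step) + 1
    ys = [kde(lo + i * step, samples, bandwidth) for i in range(n)]
    return sum(1 for i in range(1, n - 1) if ys[i - 1] < ys[i] > ys[i + 1])

heights = [150, 160, 170, 171, 180, 195]  # hypothetical sample, in cm

peaks_narrow = count_peaks(heights, bandwidth=1)   # small bandwidth: many bumps
peaks_wide = count_peaks(heights, bandwidth=10)    # large bandwidth: smooth curve
print(peaks_narrow, peaks_wide)
```

With the small bandwidth almost every sample gets its own bump, while the large bandwidth blends them into a much smoother curve with fewer peaks.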

Q-Q Plots or Quantile plots

Now that you know how useful distributions are, how do you find out which type of distribution your data follows? There are many types of distributions in statistics, each with its own rules. We can use a very simple tool called a Q-Q plot.

A Q-Q plot is basically a scatter plot of the quantiles of two data sets against each other.

Let's break down the picture above. Suppose we have a set of data points and we want to check whether that data is normally distributed. The first step is to sort the data and compute percentiles.

The 1st percentile is the value that is greater than 1% of the data set, the 25th percentile is the value that is greater than 25% of the data set, and so on.

Once you have calculated the percentiles of your data from 1 to 100, you can create a data set with a standard normal distribution using numpy, with mean = 0 and standard deviation = 1.

import numpy as np
np.random.normal(loc=0, scale=1, size=1000)

This line of code creates 1000 data points drawn from a standard normal distribution.

Now compute the same percentiles for this generated data set and draw a scatter plot of one set of percentiles against the other. If almost all the points in the scatter plot lie on a straight line, both data sets follow the same type of distribution (here, the normal distribution). You can use this technique with other distributions as well.
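The steps above can be sketched with numpy as follows. The `heights` and `skewed` data sets, the seed, and the use of a correlation coefficient as a stand-in for "the points lie on a line" are all assumptions for illustration. Percentiles 1 to 99 are used here because the 100th percentile is simply the largest value.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is repeatable

# Hypothetical data to check: 500 heights that really are normal.
heights = rng.normal(loc=175, scale=8, size=500)
# Reference sample from the standard normal distribution.
reference = rng.normal(loc=0, scale=1, size=1000)

percentiles = np.arange(1, 100)  # 1st to 99th percentile
data_q = np.percentile(heights, percentiles)
ref_q = np.percentile(reference, percentiles)

# The pairs (ref_q[i], data_q[i]) are the points of the Q-Q plot.
# A correlation close to 1 means they lie close to a straight line.
r = np.corrcoef(ref_q, data_q)[0, 1]

# For contrast: right-skewed (exponential) data bends away from the line.
skewed = rng.exponential(scale=1.0, size=500)
skew_q = np.percentile(skewed, percentiles)
r_skewed = np.corrcoef(ref_q, skew_q)[0, 1]

print(round(r, 3), round(r_skewed, 3))
```

To see the plot itself, pass `ref_q` and `data_q` to a scatter plot function such as `matplotlib.pyplot.scatter`.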