Various Data Distributions in Statistics

Mehul Gupta
Data Science in your pocket
5 min read · Jul 25, 2019


Introduction

Knowing your data is a very important part of the whole data exploration and model-building cycle. When dealing with continuous variables, the first thing that comes to my mind is looking at the distribution of the data. In this article, I am going to focus on the different types of data distributions that a Data Science enthusiast must know!

1. NORMAL DISTRIBUTION

Personally, I believe 'Normal Distribution' is the most commonly known term of all, but very few of us really know the properties of a normal distribution. For most data professionals, life revolves around it. Though we had a rather small introduction in my previous article, we will dive deeper this time.

  • A normal distribution can have any mean and any positive standard deviation. In the special case known as the standard normal distribution, the mean is 0 and the standard deviation is 1.
  • It forms a bell-shaped structure when plotted.
  • This shape has significance. It tells us that most of the data (in terms of frequency) lies around the mean, and as the values move away from the mean, the frequency of such instances decreases.

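FORMULA

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

Here,

σ = Standard deviation

μ = Mean of the distribution

x = The value of the random variable at which the probability density is evaluated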

  • The two major parameters to know are mean & standard deviation.
  • The mean, median & mode of such a distribution are equal (for a perfectly Normal Distribution).
  • The Normal Distribution matters a lot in Data Science because of the Central Limit Theorem.
  • The area under the curve is always 1.
  • An important concept associated with the Normal Distribution is the Empirical rule, i.e.
  1. 68.3% of the population is contained within 1 standard deviation of the mean (from -1 to 1, roughly 34% on each side of the mean)
  2. 95.4% of the population is contained within 2 standard deviations of the mean (from -2 to 2)
  3. 99.7% of the population is contained within 3 standard deviations of the mean (from -3 to 3)
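
The Empirical rule can be checked numerically. The snippet below is a minimal sketch using only Python's standard library; the helper name `prob_within` is my own, and it relies on the known identity that the probability of a normal variable falling within k standard deviations of its mean is erf(k/√2).

```python
import math

def prob_within(k: float) -> float:
    """P(|X - mean| <= k standard deviations) for any normal distribution."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    # Prints roughly 68.3%, 95.4% and 99.7%
    print(f"within {k} SD: {prob_within(k) * 100:.1f}%")
```

Note that the result does not depend on the mean or the standard deviation, which is why the rule holds for every normal distribution and not only the standard one.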

2. EXPONENTIAL DISTRIBUTION

Although it is not commonly seen in Data Science competitions, the Exponential Distribution is still something you should be familiar with. It is not concerned with which event is occurring, but with the time between two occurrences of an event. Surprisingly, it is not concerned with the exact clock time at which any event occurred either! The occurrences themselves can be random.

Example: Consider a road scene. Let's say the average time between two cars passing a certain shop is 2 minutes. We do not care whether car_A passed at 5:00 P.M. or car_B at 9:00 P.M. Although the arrivals of cars (events) are randomly distributed, the average delay between two events is known and is what interests us.

  • This average delay is the parameter used for constructing such a distribution

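FORMULA

$$f(x) = \frac{1}{\lambda} \, e^{-x/\lambda}$$

Here,

Lambda (λ) = The average time between two events. Many textbooks instead parameterize by the rate, which is 1/λ in this notation.

x = The waiting time whose probability density has to be calculated

Rx = The range of x, which is [0, ∞)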
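
To make the road example concrete, here is a small sketch in plain Python. It assumes the density (1/λ)·e^(-x/λ), with λ being the average time between events as defined above; the function names `exp_pdf` and `exp_cdf` and the 1-minute question are my own illustrations.

```python
import math

def exp_pdf(x: float, mean_wait: float) -> float:
    """Density of a waiting time x when the average gap is mean_wait."""
    if x < 0:
        return 0.0  # waiting times cannot be negative
    return math.exp(-x / mean_wait) / mean_wait

def exp_cdf(x: float, mean_wait: float) -> float:
    """P(waiting time <= x)."""
    if x < 0:
        return 0.0
    return 1.0 - math.exp(-x / mean_wait)

mean_wait = 2.0  # average minutes between two cars, from the road example

# Chance that the next car shows up within 1 minute
print(round(exp_cdf(1.0, mean_wait), 4))  # 0.3935
```

If you prefer the rate convention found in many textbooks, replace `1 / mean_wait` with the rate everywhere; the numbers stay the same.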

3. POISSON DISTRIBUTION

It can be considered the flipped version of the exponential distribution. In the exponential distribution, we know the average delay between two events; here, we know the frequency of an event in a given standard time interval.

CONFUSED? Let's take an example.

Example: Let's go back to the road. This time we know that in 2 minutes, two cars pass a certain shop. Hence the frequency of the event (a car's arrival) is 2 per 2 minutes, or 1 per minute (both are the same thing).

  • This frequency is the major parameter used in such a distribution.

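FORMULA

$$P(X = x) = \frac{\mu^{x} \, e^{-\mu}}{x!}$$

Here,

μ = The average number of events in a given standard time interval

x = The number of events whose probability has to be calculated (0, 1, 2, ...)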
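
As a quick illustration, the sketch below evaluates the standard Poisson formula μ^x · e^(-μ) / x! for the road example, taking a 2-minute window so that μ = 2 cars. The function name `poisson_pmf` is my own, and only the standard library is used.

```python
import math

def poisson_pmf(x: int, mu: float) -> float:
    """Probability of seeing exactly x events when mu are expected on average."""
    return mu ** x * math.exp(-mu) / math.factorial(x)

mu = 2.0  # expected cars in a 2-minute window (1 car per minute)

for x in range(5):
    print(f"P({x} cars in 2 minutes) = {poisson_pmf(x, mu):.4f}")
```

The probabilities over all possible counts add up to 1, which is a handy sanity check whenever you write such a function yourself.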

Don’t you feel the Exponential & Poisson distributions sound similar?

YES! Both are quite similar. In fact, an Exponential Distribution problem can be converted into a Poisson problem and vice versa.

Consider a case described by an Exponential Distribution with lambda = 4 minutes. But wait a minute! Doesn’t this mean that, per minute, 0.25 events occur on average?

Hence, for lambda = 4 minutes in the Exponential Distribution, μ = 0.25 per minute in the Poisson Distribution, and the Poisson formula can be used for the same probability. For instance, the probability of waiting longer than t minutes for the next event equals the Poisson probability of zero events in t minutes: e^(-t/4) = e^(-0.25t).

Check for yourself 😀😀
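
One way to check is with a few lines of Python. This is a sketch under the article's convention that lambda is the average time between events (4 minutes), so the Poisson frequency is 0.25 per minute; the 3-minute window and the function names are hypothetical choices of mine.

```python
import math

def exp_survival(t: float, mean_wait: float) -> float:
    """P(waiting time > t) for an exponential with average gap mean_wait."""
    return math.exp(-t / mean_wait)

def poisson_pmf(x: int, mu: float) -> float:
    """P(exactly x events) when mu events are expected on average."""
    return mu ** x * math.exp(-mu) / math.factorial(x)

mean_wait = 4.0            # lambda: average minutes between two events
rate = 1.0 / mean_wait     # 0.25 events per minute
t = 3.0                    # hypothetical window length in minutes

waiting_longer_than_t = exp_survival(t, mean_wait)
zero_events_in_t = poisson_pmf(0, rate * t)

# Both lines describe the same situation, so the two numbers agree
print(waiting_longer_than_t, zero_events_in_t)
```

"No event within t minutes" and "the wait exceeds t minutes" are the same statement, which is why both formulas give the same answer.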

Do note a very important property of the Exponential Distribution, i.e.

Memorylessness Property

It refers to the idea that future outcomes don’t depend on the time that has already elapsed, i.e. if you have lost the past 10 tosses, that neither increases nor decreases your chances of winning the 11th toss!

How does this relate to the Exponential Distribution?

Example: Going back to the road, if no car has passed in the past 6 minutes and the average time between two passing cars is 10 minutes, it doesn’t mean that a car will pass in the next 4 minutes. The expected waiting time from now on still remains 10 minutes (stats for you, my friends).
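
The road example can be verified numerically. The sketch below assumes exponentially distributed gaps with an average of 10 minutes, as in the example, and compares the chance of waiting 4 more minutes after 6 empty minutes with the chance of waiting 4 minutes from a fresh start. The variable names are my own.

```python
import math

def survival(t: float, mean_wait: float) -> float:
    """P(no car passes within t minutes)."""
    return math.exp(-t / mean_wait)

mean_wait = 10.0  # average minutes between two cars
elapsed = 6.0     # minutes already spent waiting with no car
extra = 4.0       # additional minutes we are asking about

# P(no car in the next 4 minutes | no car in the past 6 minutes)
conditional = survival(elapsed + extra, mean_wait) / survival(elapsed, mean_wait)

# P(no car in 4 minutes) starting from scratch
fresh = survival(extra, mean_wait)

print(round(conditional, 4), round(fresh, 4))  # 0.6703 0.6703
```

The two values match, so the 6 minutes already spent waiting tell us nothing about the next 4.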

Thank you for reading my article. In the next article, I will be talking about different tests that help us reject or accept a hypothesis, such as the Z-Test, T-Test, ANOVA, F-Test, etc. Here is a list of my previous articles. If you have any feedback, feel free to share it in the comment section below.
