# Statistical Distributions for a Career in Data Science

As we discussed, you need to know the basics of statistics to analyze data better. And one of the most important topics of statistics in the data science industry is the *concept of distributions*.

What are you waiting for? Let's dive into this important theory and its applications in R.

But first, let's understand: what do we mean by a distribution in statistics? A distribution describes the set of values a random variable can take and how likely each value is to occur. Each distribution has certain parameters, and these make it unique from the others. Whenever we observe a sample from a certain experiment, that sample originates from the particular distribution the experiment follows, and this distribution is unique to that experiment.

Graphically, each distribution shows a different pattern as the value of the random variable changes. One distribution's probability could rise with the value of the variable, whereas another's could have a falling trend. Each such pattern is an example of a unique "distribution", as we say.

Additionally, we can find basic information about an experiment using the following statistics of its distribution.

**Binomial Distribution: X ~ Binomial(n, p)**

Let's understand this with the example of *throwing a die*. Suppose we want to model the **number of times we get an even number** on the die. Each throw is *independent of the one before it*, and the *probability of a specific number occurring in one throw is equal and constant throughout the experiment*. Hence, each single throw satisfies the basic requirements of a Bernoulli trial, and the count of successes over n such trials follows the Binomial distribution (*likewise, each distribution has its own set of assumptions, which makes it unique*).

**Probability:** Now, suppose we want the probability of getting exactly two even numbers in the next four throws. Define X as the random variable counting the number of times we observe an even number in n trials. Then, mathematically, we need the probability of X = 2 with n = 4. Hence we need: P(X = 2)!

*R code: dbinom(x, n, p) # where p is the probability of an even number in one throw. Here, it becomes > dbinom(2, 4, 1/2)*
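As a minimal worked sketch, for a fair die the probability of an even number in one throw is 1/2, so the call and the hand computation agree:

```r
# Probability of exactly 2 even numbers in 4 throws of a fair die:
# choose(4, 2) * (1/2)^2 * (1/2)^2 = 6/16 = 0.375
dbinom(2, size = 4, prob = 1/2)   # 0.375
```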

**Distribution Function:** Next, finding the probability of getting an even number in at most 5 throws out of 7. Here we are technically concerned with the CDF (*or, in layman's terms, a function that gives the probability of X ≤ x, where x is a given value, here 5*). Here, we use the built-in function shown as:

*R code: > pbinom(5, 7, 1/2)*
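A quick sketch of the CDF call, together with its complement, which is how an "at least" probability is obtained:

```r
# P(X <= 5): at most 5 even numbers in 7 throws of a fair die
pbinom(5, size = 7, prob = 1/2)       # 0.9375

# P(X >= 5) = 1 - P(X <= 4): at least 5 even numbers
1 - pbinom(4, size = 7, prob = 1/2)   # 0.2265625 (= 29/128)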

**Distribution Quantiles:** Suppose we want the number of even numbers x such that the probability of observing at most x of them is 0.25, i.e., we wish to find the first quartile of the distribution. We use the following R code:

*R code: > qbinom(0.25, n, p)*
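Continuing the die example with n = 4 and p = 1/2, a minimal sketch:

```r
# First quartile of Binomial(4, 1/2):
# the smallest x with P(X <= x) >= 0.25
qbinom(0.25, size = 4, prob = 1/2)   # 1
# (since P(X <= 0) = 0.0625 < 0.25 and P(X <= 1) = 0.3125 >= 0.25)
```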

**Normal Distribution: X ~ N(mu, sigma²)**

This is one of the most frequently used distributions in predictive modelling. After all, many of the assumptions of linear regression, multiple regression, and other forms of model building are based on it. To state one such example: when fitting a linear regression model, the residuals obtained from the fitted model should follow a normal distribution.

What is the normal distribution? It takes two parameters: mu (the population mean) and sigma² (the variance). There are infinitely many possible combinations of mu and sigma², but the general shape is always the familiar symmetric bell curve centered at mu.

Its mathematical equation, also called the "pdf", is f(x) = (1 / (sigma · √(2π))) · e^(−(x − mu)² / (2sigma²)). (*The area under the bell curve is equal to 1, because the total probability is always equal to 1.*)

Note: *Since it's a continuous distribution, we can't find exact point probabilities; P(X = x) = 0 for any single point, and the pdf gives a density instead.*

Value of the pdf at a point (the density at x): *> dnorm(x, mu, sigma) # note: the sigma argument is the standard deviation, not the variance.*

Distribution function, P(X ≤ x): *> pnorm(x, mu, sigma)*

The 80th percentile:

> qnorm(0.8, mu, sigma)
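A minimal sketch of all three calls, assuming the standard normal (mu = 0, sigma = 1) purely for illustration:

```r
mu <- 0; sigma <- 1      # standard normal, assumed values for illustration
dnorm(0, mu, sigma)      # density at the mean: 1/sqrt(2*pi), about 0.3989
pnorm(0, mu, sigma)      # P(X <= 0) = 0.5, by symmetry about the mean
qnorm(0.8, mu, sigma)    # 80th percentile, about 0.8416
```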

# Poisson Distribution: X ~ Poisson(lambda)

*The arrival of guests at a party* follows this kind of one-parameter distribution or pattern (*for simplicity*). The probability mass function is given by P(X = x) = (e^(−lambda) · lambda^x) / x!, for x = 0, 1, 2, …

Plotting it for a specific parameter value, over the entire domain of X, gives the distribution's characteristic shape.

Value of the pmf at a point, i.e., P(X = x): *> dpois(x, l) # here, l = lambda*

Distribution function, P(X ≤ x): *> ppois(x, l)*

The 80th percentile:

> qpois(0.8, l)
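A minimal sketch of the three calls, assuming a made-up rate of lambda = 4 guest arrivals per hour:

```r
l <- 4        # lambda: assumed average number of arrivals, for illustration
dpois(2, l)   # P(X = 2) = exp(-4) * 4^2 / factorial(2), about 0.1465
ppois(2, l)   # P(X <= 2), about 0.2381
qpois(0.8, l) # 80th percentile: 6
```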

# Gamma Distribution: X ~ Gamma(alpha, lambda)

This is also one of the most widely used distributions, with a broad range of applications; it forms the basis for several other distributions as well (the exponential and chi-square arise as special cases). It takes two parameters, a shape alpha and a rate lambda. The pdf is f(x) = (lambda^alpha · x^(alpha−1) · e^(−lambda·x)) / Γ(alpha), for x > 0.

Plotting it for specific parameter values, over the entire domain of X, gives the distribution's characteristic shape.

Value of the pdf at a point (the density at x): *> dgamma(x, a, l) # here, a = alpha (shape) and l = lambda (rate)*

Distribution function, P(X ≤ x): *> pgamma(x, a, l)*

The 80th percentile:

> qgamma(0.8, a,l)
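A minimal sketch, assuming illustrative values alpha = 2 and lambda = 1 (so the pdf simplifies to x·e^(−x)):

```r
a <- 2; l <- 1   # shape alpha and rate lambda, assumed for illustration
dgamma(1, a, l)  # f(1) = 1 * exp(-1), about 0.3679
pgamma(1, a, l)  # P(X <= 1) = 1 - 2*exp(-1), about 0.2642
qgamma(0.8, a, l) # 80th percentile, about 2.994
```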

# Exponential Distribution: X ~ Exp(lambda)

This is a special case of the gamma distribution with alpha = 1. It is used to model durations or the passage of time between events, for example, *the waiting time until the next earthquake.*

Here the pdf is given by f(x) = lambda · e^(−lambda·x), for x ≥ 0.

Plotting it for a specific parameter value, over the entire domain of X, gives the distribution's characteristic shape.

Value of the pdf at a point (the density at x): *> dexp(x, l) # here, l = lambda; alpha = 1 is implicit*

Distribution function, P(X ≤ x): *> pexp(x, l)*

The 80th percentile:

> qexp(0.8, l)
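A minimal sketch, assuming a made-up rate of lambda = 0.5 events per unit time:

```r
l <- 0.5      # rate lambda, assumed for illustration
dexp(1, l)    # f(1) = 0.5 * exp(-0.5), about 0.3033
pexp(1, l)    # P(X <= 1) = 1 - exp(-0.5), about 0.3935
qexp(0.8, l)  # 80th percentile = -log(0.2)/0.5, about 3.219
```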

# Chi-Square Distribution: X ~ ChiSq(df)

This is also an important distribution with a wide range of applications. It takes one parameter, the degrees of freedom df. The pdf is f(x) = (x^(df/2 − 1) · e^(−x/2)) / (2^(df/2) · Γ(df/2)), for x > 0.

Value of the pdf at a point (the density at x): *> dchisq(x, df) # here, df is the degrees of freedom (the single parameter the distribution takes)*

Distribution function, P(X ≤ x): *> pchisq(x, df)*

The 80th percentile:

> qchisq(0.8, df)
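A minimal sketch, assuming an illustrative df = 3:

```r
df <- 3          # degrees of freedom, assumed for illustration
dchisq(2, df)    # density at x = 2, about 0.2076
pchisq(2, df)    # P(X <= 2), about 0.4276
qchisq(0.8, df)  # 80th percentile, about 4.642
```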

Hope you enjoyed these very basic, yet extremely useful statistical concepts.

Happy Learning!