5 Question Series — Data Science & AI — 3

Asitdubey · Published in Analytics Vidhya · 6 min read · Jul 19, 2021

This is the 3rd article in my 5 Question Series. If you haven't read the first two, the earlier articles are linked here.

In this set of questions, I talk about different types of distributions. But why do we need so many distributions when we already have the Gaussian distribution? As we will see, the Gaussian distribution alone is not enough to describe every outcome. Real-world random variables often follow other distributions, and when different features follow different distributions, getting good model accuracy is difficult. We therefore transform these distributions towards a Gaussian, and then to the standard normal distribution, to bring every feature onto the same scale, which makes computation easier and more accurate.

Q1. What is the log-normal distribution and what are its uses?

Generally, we use the Gaussian distribution to map out the probability distribution of random variables or data points. A random variable is log-normally distributed when its logarithm is normally distributed; equivalently, applying a log transformation to log-normal data yields a normal distribution. The log-normal distribution takes only positive real values, whereas the Gaussian (or normal) distribution also allows negative values; in many situations negative outcomes for a variable make no sense and could lead to wrong decisions. In time-series analysis and stock price prediction we often use log transformations and the log-normal distribution. Income distribution across a population also tends to follow a log-normal distribution.

We use a log transformation of such data to reduce its variability. For a detailed explanation of log transformation and the log-normal distribution, read my article on time-series analysis, and do check this in-depth video on the log-normal distribution by Krish Naik Sir.
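As a quick illustrative sketch of this idea (the distribution parameters and seed here are my own choices, not from any real dataset), we can draw log-normal samples with NumPy and check that taking the log removes the heavy right skew:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Draw samples from a log-normal distribution: always strictly positive.
samples = rng.lognormal(mean=0.0, sigma=1.0, size=5000)
assert (samples > 0).all()

# Taking the log recovers a normally distributed variable.
logged = np.log(samples)

# The raw data is heavily right-skewed; the logged data is roughly symmetric.
print(f"skew before: {stats.skew(samples):.2f}")
print(f"skew after:  {stats.skew(logged):.2f}")
```

This is exactly why log transformation helps before modelling: the transformed feature behaves like a well-scaled Gaussian one.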

Q2. What are the Power Law and Pareto distributions, and what are their applications in Data Science?

A power law describes a functional relationship between two quantities in which one quantity varies as a power of the other: a relative change in one quantity produces a proportional relative change in the other, regardless of their initial sizes. For example, doubling the side length of a square quadruples its area, because area grows as the square of the side.

Y = k·Xᵃ, where k is a constant, Y and X are the two variables, and the exponent a controls how Y changes when X changes.

A few applications of the power law:

Income distribution, earthquake magnitudes, city sizes by population, stock market trading and price prediction, word frequencies, and many more. Power-law behaviour often follows the 80-20 rule (also known as the Pareto principle), in which 80% of the outcomes come from 20% of the work, or, put another way, 80% of the effect is due to 20% of the cause. Let's take some examples:

In an industry, 80% of sales revenue is generated by 20% of the products, while the remaining 20% of revenue comes from the other 80% of the products.

Consider quality inspection (where rejecting a good package is a Type 1 error, or false positive): if even one or two products in a package are damaged, the whole package may be rejected. An effect in 20% (or far fewer) of the items causes the rejection of the entire package.

In a cricket match, 80% of the runs are scored by 20% of the players.
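We can sketch the 80-20 rule with a small simulation (the shape parameter 1.16 is the textbook value that reproduces the classic 80-20 split; the "revenue" data here is purely synthetic, not from any real industry):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pareto-distributed "revenue per product"; rng.pareto samples the Lomax
# distribution, so adding 1 gives a classic Pareto with minimum value 1.
revenue = rng.pareto(1.16, size=100_000) + 1

# Sort descending and ask: what share of total revenue comes from the top 20%?
revenue_sorted = np.sort(revenue)[::-1]
top_20_pct = revenue_sorted[: len(revenue_sorted) // 5]
share = top_20_pct.sum() / revenue.sum()
print(f"top 20% of products generate {share:.0%} of revenue")
```

The exact share fluctuates between runs because the distribution is so heavy-tailed, but it hovers around 80%, which is the Pareto principle in action.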

For a detailed explanation of the power law and Pareto distribution, please check this power law distribution video by Krish Naik Sir. He explains it really well with very good examples.

Q3. Explain the Box-Cox transformation and its applications.

This is another transformation, in which we convert non-normally distributed data into normally distributed data. For a positive variable y, the transform is (y^λ − 1) / λ when λ ≠ 0, and log(y) when λ = 0; the parameter λ is chosen to make the result as close to normal as possible.

[Image: the Box-Cox transformation formula. Source: Statistics How To]

Read this article on the Box-Cox transformation written by Andrew Plummer.
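As a minimal sketch of the transformation in practice, SciPy's `scipy.stats.boxcox` both applies the transform and estimates the best λ by maximum likelihood (the exponential data below is synthetic, chosen just because it is strongly right-skewed and strictly positive, which Box-Cox requires):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Right-skewed, strictly positive data (Box-Cox requires positive values).
data = rng.exponential(scale=2.0, size=5000)

# scipy estimates the lambda that makes the result closest to normal.
transformed, fitted_lambda = stats.boxcox(data)

print(f"fitted lambda: {fitted_lambda:.2f}")
print(f"skew before: {stats.skew(data):.2f}, after: {stats.skew(transformed):.2f}")
```

The skew of the transformed data is close to zero, which is exactly what we want before feeding the feature into models that assume normality.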

*** For more important distributions like the binomial, Bernoulli, Poisson and geometric distributions, you can follow the journal “Empirical Distribution”. ***

Q4. What is the difference between correlation and covariance?

Covariance tells us the direction of the relationship between two features or variables: whether Y moves with X (positive), against it (negative), or not at all.

But there is one problem with covariance: while it tells us the direction of the relationship between two variables, its magnitude depends on the units of the data, so it cannot express the strength of the relationship, i.e., how strongly Y is related to X.

Here correlation comes in, in two common forms: Pearson's r (Pearson correlation) and Spearman's rank coefficient.

The Pearson correlation coefficient measures the linear correlation between two features. It is the covariance divided by the product of the standard deviations, r = cov(X, Y) / (σ_X · σ_Y), and it always ranges between −1 and 1, so we can read off the strength of the correlation.

Spearman's rank coefficient is the Pearson correlation coefficient computed on the ranks of X and Y. It is used when the relationship between the two variables is non-linear but monotonic.

Wikipedia gives a detailed explanation of how these formulas are derived, along with nice worked examples.
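A small sketch makes the covariance-vs-correlation point concrete (the data, slope and noise level are arbitrary choices for illustration): rescaling a variable, say from metres to centimetres, inflates the covariance by the same factor, while Pearson's r does not move at all.

```python
import numpy as np

rng = np.random.default_rng(1)

x = rng.normal(size=1000)
y = 2.0 * x + rng.normal(scale=0.5, size=1000)  # y depends strongly on x

# Covariance changes with the units of measurement...
cov_xy = np.cov(x, y)[0, 1]
cov_scaled = np.cov(x * 100, y)[0, 1]  # e.g. metres -> centimetres

# ...while Pearson's r is scale-free and bounded in [-1, 1].
r = np.corrcoef(x, y)[0, 1]
r_scaled = np.corrcoef(x * 100, y)[0, 1]

print(cov_xy, cov_scaled)   # covariance grows 100x with the unit change
print(r, r_scaled)          # correlation is unchanged
```

That scale-invariance is exactly why correlation, not covariance, is used to compare the strength of relationships across features measured in different units.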

Q5. What do you mean by correlation and causation?

They are different things, and many of us confuse them. As I said earlier, correlation tells us the relation between two variables: how one moves with respect to the other and in what direction. Causation is different: it says that a change in one variable causes the change in the other, i.e., one variable makes the other happen.

Let's take an example. In a city, the number of people swimming rises above what it normally used to be, and at the same time the number of accidents in the city also rises. Does this mean that more people swimming causes more accidents? The two events are related statistically, but one is not causing the other. There is a hidden factor (a confounder) driving both incidents: a rise in temperature pushes more people to swim, and it also puts more people out on the streets, where more accidents can take place.

We can build a similar example with ice-cream sales and a rise in deaths among the elderly. Eating ice-cream can never be the reason for the increase in deaths, but a rise in temperature can cause both. So correlation and causation might seem related, but correlation alone never proves causation. Think again…

Hope you liked it. If you want me to add or correct anything, do mention it in the comments and guide me towards more questions like this. I take most of my references from Krish Naik Sir's videos and from StatQuest: the two most productive and awesome data science channels on YouTube.
