5 Question Series — Data Science & AI — 2

Asitdubey
Published in Analytics Vidhya
5 min read · Jul 17, 2021

This is my next set of 5 questions on Probability and Statistics, a continuation of part 1 of the 5 Question Series, where you can find the earlier questions.

Q1. What is a Standard Normal Variate, and how do we calculate the value of Z?

Generally, we work with a normal (Gaussian) distribution, where mean = median = mode. It follows the Empirical Rule that I discussed in part 1 of this series: about 68% of the data lies within 1 standard deviation of the mean, 95% within 2 standard deviations, and 99.7% within 3 standard deviations. But this is not much help when we need the probability of a region that does not line up with these ranges, or the probability up to one particular point. For that we convert the normal distribution to the Standard Normal Distribution, in which mean = 0 and standard deviation = 1. A value x from a distribution with mean μ and standard deviation σ is converted with Z = (x − μ) / σ. The resulting Z (the standard normal variate, or z-score) tells us how many standard deviations x lies from the mean. Standardizing also brings data measured on different scales onto the same scale.
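Here is a minimal Python sketch of this conversion, using Z = (x − mean) / standard deviation and the standard normal CDF written with the error function. The height figures (mean 170 cm, standard deviation 10 cm) are made up for illustration and are not taken from any dataset.

```python
import math

def z_score(x, mean, std):
    """Standardize x: how many standard deviations it lies from the mean."""
    return (x - mean) / std

def standard_normal_cdf(z):
    """P(Z <= z) for a standard normal variable, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical example: heights with mean 170 cm and standard deviation 10 cm.
z = z_score(185, mean=170, std=10)   # 1.5 standard deviations above the mean
p_below = standard_normal_cdf(z)     # share of values expected below 185 cm
print(f"z = {z:.2f}, P(X <= 185) = {p_below:.4f}")

# The same CDF recovers the empirical rule, e.g. the share within 1 standard deviation.
within_one = standard_normal_cdf(1) - standard_normal_cdf(-1)
print(f"P(-1 <= Z <= 1) = {within_one:.4f}")
```

Once a value is expressed as Z, one table or one CDF function answers the question for any normal distribution, whatever its original mean and standard deviation.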

For a detailed explanation, follow Krish Naik's video and my article on Normal Distribution.

Q2. What is the Central Limit Theorem?

When we estimate the outcome of an experiment or survey, it is rarely possible to measure the whole population. For example, measuring the height of the students in one class is possible, but measuring the height of all the children in a city or state is not practical. The same goes for the number of potholes or traffic lights in a city, the number of trees in an area, or the number of heavy vehicles passing through a street in an hour. All of these can be estimated with the help of the Central Limit Theorem.

The population is the complete set of everything in the group we care about, and it is usually very large: all the people or animals in a city, all the potholes, all the schools, all the students in every school of a district, all the trees, or all the fruits on a tree. Counting or measuring every member takes far too much time. When we pick smaller subsets from the population, they are known as samples. If all the students in a district are the population, the students of a few schools form a sample; if all the potholes in a city are the population, the potholes in a few chosen areas form a sample. (To learn in detail, follow Krish Naik or my article.)

The Central Limit Theorem states that if we draw many random samples, each of a large enough size (usually n of 30 or more), from a population with finite variance, and compute the mean of each sample, then the distribution of those sample means will be approximately normal. This holds whether the population itself is normal or skewed, so the sample means follow the empirical rule.

Two more facts come with it. The average of all the sample means is approximately equal to the population mean (μ), and the standard deviation of the sample means is approximately the population standard deviation divided by the square root of the sample size (σ / √n), also called the standard error. So if we know the mean and standard deviation of the sample means, we know their distribution by the empirical rule and can estimate the mean and standard deviation of the population. The distribution of the sample means is also known as the Sampling Distribution of the Mean. To know in detail about CLT, follow the Krish Naik video on CLT.
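A small simulation can make this concrete. The sketch below assumes a made-up, strongly skewed population (exponential values with mean about 10), draws many samples of size 50, and compares the sample means with the population. The population, the sample size and the number of samples are all illustrative choices, not values from any real study.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical skewed population: exponential values with mean about 10.
population = [random.expovariate(1 / 10) for _ in range(100_000)]
pop_mean = statistics.fmean(population)
pop_std = statistics.pstdev(population)

sample_size = 50      # size of each sample (n >= 30)
num_samples = 2_000   # how many samples we draw

# Mean of each random sample drawn from the population.
sample_means = [
    statistics.fmean(random.sample(population, sample_size))
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)
std_of_means = statistics.stdev(sample_means)
standard_error = pop_std / sample_size ** 0.5

# Share of sample means within 2 standard deviations of their own mean.
within_two = sum(
    abs(m - mean_of_means) <= 2 * std_of_means for m in sample_means
) / num_samples

print(f"population mean      = {pop_mean:.2f}")
print(f"mean of sample means = {mean_of_means:.2f}")
print(f"std of sample means  = {std_of_means:.2f}")
print(f"sigma / sqrt(n)      = {standard_error:.2f}")
print(f"share within 2 std   = {within_two:.3f}")
```

Even though the population is far from normal, the mean of the sample means should land close to the population mean, their spread close to σ / √n, and roughly 95% of them within 2 standard deviations.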

Q3. What is Q-Q Plot and Quantile Normalization?

Q-Q plot stands for Quantile-Quantile plot. Before understanding the Q-Q plot, let us see what quantiles and percentiles are. Quantiles are cut points that divide sorted data into groups containing equal numbers of points. A single quantile (the median) divides the data into two equal parts, and more quantiles divide it into more equal parts. When the quantiles divide the data into 100 equal parts, they are called percentiles. A Q-Q plot helps us understand the distribution of the data, most often whether it is normally distributed or not. We sort our data to get its quantiles, compute the same quantiles of a theoretical normal distribution, and plot each pair as a point, with the theoretical quantile on one axis and the data quantile on the other. Then we compare the points with a straight reference line: if they fall close to the line the data is approximately normal, and if they curve away from it the data is skewed or heavy-tailed. For the detailed steps and concept of Q-Q plots, see the most beautiful and straightforward explanation, given by StatQuest.
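The pairing of quantiles can be sketched in a few lines of Python without any plotting library. The helper names `qq_points` and `pearson_r`, the plotting positions (i + 0.5) / n, and the two random test samples are my own assumptions for illustration; plotting libraries may use slightly different conventions.

```python
import random
from statistics import NormalDist, fmean

def qq_points(data):
    """Return (theoretical, observed) quantile lists for a normal Q-Q plot."""
    observed = sorted(data)
    n = len(observed)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Plotting positions (i + 0.5) / n keep probabilities strictly inside (0, 1).
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return theoretical, observed

def pearson_r(xs, ys):
    """Correlation of the Q-Q points; values near 1 mean a nearly straight line."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

random.seed(1)
normal_data = [random.gauss(50, 5) for _ in range(500)]      # roughly normal
skewed_data = [random.expovariate(1.0) for _ in range(500)]  # right-skewed

r_normal = pearson_r(*qq_points(normal_data))
r_skewed = pearson_r(*qq_points(skewed_data))
print(f"normal sample: r = {r_normal:.3f}")
print(f"skewed sample: r = {r_skewed:.3f}")
```

Plotting `theoretical` against `observed` gives the Q-Q plot itself. The correlation is only a rough numeric stand-in for "how straight is the line", so it should be read together with the picture and not as a formal normality test.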

Q4. What is Kernel Density Function? And why is it so useful?

Kernel Density Estimation (KDE) is used to estimate the Probability Density Function (PDF) of the data. It creates a continuous curve for the data distribution, and changes the way we see the data by laying a smooth curve over the discrete bars of a histogram.

We take the data points and place an identical kernel on each one (usually a symmetrical Gaussian curve; triangular or cosine curves can also be used), so that the peak of each kernel sits exactly at its data point. Then we add all the kernels together and scale the sum so that the area under it is 1; the resulting curve is the Kernel Density Estimation plot. The width of each kernel, called the bandwidth, controls how smooth the curve is. For a detailed mathematical explanation of how KDE works and how it is optimized, go through the beautiful in-depth article Kernel Density Function written by Niranjan Pramanik.
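To see the "one kernel per data point, then add them up" idea in code, here is a rough from-scratch sketch with a Gaussian kernel. The data points, the bandwidth (the kernel width, set to 0.5 here) and the evaluation grid are arbitrary values picked for illustration; in practice the bandwidth is usually chosen by a rule of thumb or by cross-validation.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, data, bandwidth):
    """Density estimate at x: scaled sum of kernels centred on each data point."""
    n = len(data)
    return sum(gaussian_kernel((x - xi) / bandwidth) for xi in data) / (n * bandwidth)

# Hypothetical data points and bandwidth, chosen only for illustration.
data = [1.0, 1.5, 2.0, 4.0, 4.2, 6.0]
bandwidth = 0.5

# Evaluate the estimate on a grid to get a smooth curve.
step = 0.01
grid = [-3.0 + i * step for i in range(1301)]   # from -3.0 to 10.0
density = [kde(x, data, bandwidth) for x in grid]

# A valid density integrates to about 1 (simple Riemann sum).
area = sum(density) * step
print(f"area under the KDE curve = {area:.3f}")
print(f"density near the cluster at 1.5 = {kde(1.5, data, bandwidth):.3f}")
print(f"density in the gap at 3.0       = {kde(3.0, data, bandwidth):.3f}")
```

The curve is high where data points cluster and low in the gaps between them. A larger bandwidth would smooth these bumps together, and a smaller one would make the curve spikier.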

Q5. What is discrete and continuous uniform distribution?

A uniform distribution is a probability distribution in which every outcome is equally likely. The number shown on a roll of a fair die and the card drawn at random from a shuffled deck both follow a uniform distribution. There are two types: the discrete uniform distribution, for a finite set of outcomes, and the continuous uniform distribution, where the outcome can be any value in an interval, so there are infinitely many possible outcomes. A single roll of a die or a flip of a fair coin is an example of a discrete uniform distribution. A random number generator that returns any real number between 0 and 1 is an example of a continuous uniform distribution.
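Both types can be simulated with Python's standard `random` module. The sketch below assumes a fair six-sided die for the discrete case and an arbitrary interval [0, 10] for the continuous case; the number of draws and the sub-interval [2, 5] are illustrative choices.

```python
import random
from collections import Counter

random.seed(42)
n = 60_000

# Discrete uniform: a fair six-sided die, each face has probability 1/6.
rolls = [random.randint(1, 6) for _ in range(n)]
face_freq = {face: count / n for face, count in sorted(Counter(rolls).items())}
print("die face frequencies:", face_freq)

# Continuous uniform on [a, b]: the density is 1 / (b - a) inside the interval,
# so P(c <= X <= d) = (d - c) / (b - a).
a, b = 0.0, 10.0
draws = [random.uniform(a, b) for _ in range(n)]
c, d = 2.0, 5.0
empirical = sum(c <= x <= d for x in draws) / n
theoretical = (d - c) / (b - a)
print(f"P({c} <= X <= {d}): empirical = {empirical:.3f}, theoretical = {theoretical:.3f}")
```

In the discrete case we can talk about the probability of each single outcome. In the continuous case any single exact value has probability zero, so we talk about the probability of an interval, which depends only on its length.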

Hope you liked it. If you want me to add or correct anything, please mention it in the comments, and suggest more questions like these. For most of my work I take reference from Krish Naik Sir's videos and from StatQuest. These are two of the most productive and awesome Data Science channels on YouTube.
