5 Question Series — Data Science & AI — 1

Asitdubey
Published in Analytics Vidhya
7 min read · Jul 17, 2021

Here are 5 questions from Probability and Statistics. These are basic questions that an interviewer can ask, yet many students like us skim over them. I have started this 5 Question series to cover the topics that I feel are important. This is my first set of 5 questions, from Probability and Statistics, in the 5 Question series on Data Science. I hope you'll like it, and if anything is incorrectly mentioned here, please suggest your corrections and what to change or add.

Q1. What is a Random Variable and what are its types?

Random means anything uncertain. A random outcome does not depend on, or follow from, the outcomes that come before or after it. A random variable is a variable whose value is the outcome of such a random process. For example, if we roll a die, the numbers will not come up periodically or follow some pattern; each roll can be anything from 1 to 6. Numbers can repeat, but which number repeats, and how many times, is not fixed. If we toss a coin, the set of possible outcomes is fixed (either head or tail), but which of the two we get is not. It can be head on the first toss, head or tail on the second, or the same outcome many times in a row. We don't know the outcome in advance, and it doesn't depend on the previous results.

The movement of quantum particles, changes in a stock price, the number of people on a street in a particular hour or day, an organization's annual turnover, the number of people on a vacation trip, the card or two drawn from a deck of cards: all these and many more can be treated as random in nature.

Types of Random Variable:

Random variables can be numerical or categorical in nature. A numerical RV takes numbers as its values; for example, the number of fruits in a bag, the number of students in a class, or their ages. A categorical RV doesn't give a count or a measurement; rather, it tells us which category an outcome belongs to. For example, for the same students, their gender is categorical in nature. Suppose that in each class we count the students whose names start with each letter of the alphabet; that count is a numerical RV. But if we record the gender of those same students, that is a categorical RV.

Numerical random variables are themselves of two types: discrete and continuous. A discrete RV takes separate, countable values, typically whole numbers. Counts are the usual example, and a count cannot be negative: the number of students, the count of things you possess, the count of people walking on a street. A continuous RV can take any value within a range: a stock price, an interest rate, the price of a particular commodity, a person's age or salary.

Reference: Random variable, Krish Naik.
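To make the three kinds of random variable concrete, here is a small sketch using Python's built-in `random` module. The seed, the category labels and the 5.5 to 6.0 ft height range are assumptions chosen for illustration, not something prescribed by any source.

```python
import random

rng = random.Random(42)  # fixed seed (arbitrary) so the run is repeatable

# Categorical RV: the outcome is a label, not a count or a measurement.
genders = [rng.choice(["female", "male"]) for _ in range(5)]

# Discrete numerical RV: the face shown by a fair six-sided die.
rolls = [rng.randint(1, 6) for _ in range(5)]

# Continuous numerical RV: a height in feet, anywhere between 5.5 and 6.0.
heights = [rng.uniform(5.5, 6.0) for _ in range(5)]

print(genders)
print(rolls)
print([round(h, 2) for h in heights])
```

The rolls can only ever be one of six whole numbers, while the heights can land on any value in the range.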

Q2. What are the uses of random variables? PDF and CDF?

When we talk about the uses of random variables, we mostly talk about their PDF and CDF, i.e., the Probability Distribution Function and the Cumulative Distribution Function. To learn in detail about PDF and CDF, and the different types of distribution, you can follow my article on Probability and Likelihood.

A random action can have a number of possible outcomes. The list of all the possible outcomes, together with the probability of each one, is known as the probability distribution. A function that defines the probability distribution is known as a probability distribution function. In a randomly shuffled deck of cards, suppose we draw two cards and want both of them to be aces. We might get two aces on the very first attempt, or it might take more than 100 attempts. The distribution of all the outcomes of this experiment is its probability distribution. Similarly, the running total of the probability up to and including a given outcome is the cumulative distribution, and the function that defines it is known as the cumulative distribution function.

Reference: mathtutor.
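The two-aces example can be worked out exactly. The sketch below assumes that the deck is fully reshuffled before every attempt, so the attempts are independent and the number of attempts until the first success follows a geometric distribution. The function names are made up for this illustration.

```python
from fractions import Fraction
from math import comb

# Ways to pick 2 of the 4 aces, divided by ways to pick any 2 of the 52 cards.
p_two_aces = Fraction(comb(4, 2), comb(52, 2))
print(p_two_aces)  # 1/221

def p_first_success_on(attempt):
    """Probability that the first pair of aces shows up exactly on this attempt."""
    return (1 - p_two_aces) ** (attempt - 1) * p_two_aces

def p_success_within(attempts):
    """Cumulative probability of seeing a pair of aces within this many attempts."""
    return 1 - (1 - p_two_aces) ** attempts

print(float(p_first_success_on(1)))       # success on the very first attempt
print(float(1 - p_success_within(100)))   # still no success after 100 attempts
```

`p_first_success_on` plays the role of the probability distribution function here, and `p_success_within` is the matching cumulative distribution function.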

Pic Source — ZedStatistics

We often get confused between PDF (Probability Distribution Function) and PDF (Probability Density Function). The PMF (Probability Mass Function) describes the distribution of a discrete random variable, the PDF (Probability Density Function) describes the distribution of a continuous random variable, and the CDF (Cumulative Distribution Function) gives the probability that the outcome is less than or equal to a given value. For example, if we roll a fair die, the probability of each outcome from 1 to 6 is 1/6, so PMF(4) = 1/6. The CDF at 4 does not ask for exactly 4; it asks for any outcome from 1 to 4, so it is the sum of the probabilities from 1 to 4: CDF(4) = PMF(1) + PMF(2) + PMF(3) + PMF(4) = 1/6 + 1/6 + 1/6 + 1/6 = 4/6.
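The die example translates almost directly into code. This is a minimal sketch for a fair six-sided die; the names `pmf` and `cdf` are just labels chosen here.

```python
from fractions import Fraction

# PMF of a fair die: every face from 1 to 6 has probability 1/6.
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

def cdf(x):
    """P(outcome <= x): add up the PMF of every face not larger than x."""
    return sum((p for face, p in pmf.items() if face <= x), Fraction(0))

print(pmf[4])  # 1/6
print(cdf(4))  # 2/3, i.e. 4/6 in lowest terms
print(cdf(6))  # 1
```

The CDF always ends at 1, because some face from 1 to 6 is certain to come up.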

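The PDF (Probability Density Function) describes the distribution of a continuous RV, e.g., the heights of people, waiting times, interest rates and many more. For a continuous RV the probability of any single exact value is zero, so probabilities are read off for ranges of values, as the area under the PDF over that range.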
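As a rough illustration for a continuous RV, the sketch below models height with a normal distribution. The mean of 5.75 ft and standard deviation of 0.25 ft are made-up numbers, and choosing a normal model at all is an assumption. It shows that the probability of a range is the area under the PDF, which is the same as the difference of two CDF values.

```python
from statistics import NormalDist

# Hypothetical model of height in feet (parameters are assumptions).
height = NormalDist(mu=5.75, sigma=0.25)

low, high = 5.5, 6.0

# Probability of a range, read from the CDF.
p_interval = height.cdf(high) - height.cdf(low)

# The same probability as an area: midpoint-rule sum of thin PDF slices.
slices = 1000
width = (high - low) / slices
area = sum(height.pdf(low + (i + 0.5) * width) * width for i in range(slices))

print(round(p_interval, 4))
print(round(area, 4))
```

The PDF value at a single point, such as `height.pdf(5.75)`, is a density and not a probability, and it can be larger than 1.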

Q3. What are outliers?

From the name itself we can say that an outlier is something outside the group, something that doesn't fit when compared with all the others in the group. It can be anything: when we take a group of people with heights ranging from 5.5 ft to 6 ft, a person with a height of 6.5 ft can be an outlier. If the prices of various products in an industry range from 1000 to 1500, then an item priced above 5000 or above 10,000 can be an outlier. Basically, an outlier has a value that is much higher or lower than all the others in the group. Outliers are not limited to numeric values; they can be categorical too. In a bunch of milk jars, a jar of water can be an outlier. In a group of females, a male can be an outlier. Sometimes outliers occur naturally, and sometimes they are a mistake made by the machine or the people recording the data. Outliers can be bad for decisions based on data, because they can pull results such as the mean far away from what is typical.

Different methods to detect outliers:

1. Box plot

2. Z-score or extreme value analysis.

3. PCA or LDA

4. Information Theory Model

5. High Dimensional Sparse Data.
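The first two methods in the list can be sketched with the standard library. The heights below are made-up data in the spirit of the 5.5 ft to 6 ft example. The 1.5 × IQR fences are the usual box-plot convention, while the z-score cutoff of 2 is an assumption chosen for this small sample (3 is also common for larger data sets).

```python
from statistics import mean, pstdev, quantiles

# Hypothetical heights in feet; 6.5 is the suspicious value.
heights = [5.5, 5.6, 5.6, 5.7, 5.7, 5.8, 5.8, 5.9, 5.9, 6.0, 6.5]

# Box-plot rule: flag anything beyond 1.5 * IQR from the quartiles.
q1, _, q3 = quantiles(heights, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
iqr_outliers = [h for h in heights if h < lower_fence or h > upper_fence]

# Z-score rule: flag anything more than `cutoff` standard deviations from the mean.
cutoff = 2.0  # assumed threshold for this illustration
mu = mean(heights)
sigma = pstdev(heights)
z_outliers = [h for h in heights if abs((h - mu) / sigma) > cutoff]

print(iqr_outliers)  # [6.5]
print(z_outliers)    # [6.5]
```

Note that the outlier itself inflates the mean and the standard deviation, which is one reason the quartile-based rule is often the safer first check.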

For the concepts of outliers and their detection in detail, we can read the beautiful article written by Sergio Santoyo, A Brief Overview of Outlier Detection Techniques.

Q4. What is a symmetric probability distribution? What are skewness and kurtosis?

A symmetric distribution is one in which the left side of the distribution is the mirror image of the right side. The normal distribution is the best-known example, and the uniform distribution is another: the probability of randomly choosing any particular card from a deck of cards is 1/52, and the probability of any particular number when a fair die is rolled is 1/6. Symmetric does not mean uniform, though; the uniform distribution is just one symmetric shape. In a symmetric distribution with a single peak, the mean, median and mode are all equal, and in the standard normal distribution the mean is zero and the standard deviation is 1.
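Both claims can be checked quickly on a small sample. The data below is a made-up sample that is symmetric around 4. The second half standardizes it (subtract the mean, divide by the standard deviation), which is how any data set is rescaled to mean 0 and standard deviation 1. Standardizing changes the scale only; it does not make data normal.

```python
from statistics import mean, median, mode, pstdev

# Hypothetical sample, symmetric around 4 with a single peak.
data = [2, 3, 3, 4, 4, 4, 5, 5, 6]

center = mean(data)
print(center, median(data), mode(data))  # the three measures coincide

# Standardize: z = (x - mean) / standard deviation.
spread = pstdev(data)
z_scores = [(x - center) / spread for x in data]

z_mean = mean(z_scores)
z_sd = pstdev(z_scores)
print(abs(z_mean) < 1e-9, abs(z_sd - 1) < 1e-9)
```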

We say the data is skewed when it is not symmetric: most of the data sits on one side of the distribution and a long tail stretches out on the other. It can be left skewed or negatively skewed (right modal), i.e., most of the data is on the right and the long tail is on the left, or it can be right skewed or positively skewed (left modal), where most of the data is on the left and the long tail is on the right.

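Kurtosis, on the other hand, measures how heavy the tails and how sharp the peak of a distribution are compared with normally distributed data. It is of three types: leptokurtic (heavier tails and a sharper peak than the normal), mesokurtic (similar to the normal) and platykurtic (lighter tails and a flatter peak than the normal).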
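Both quantities can be computed from central moments. The sketch below uses the simple population (moment-based) formulas; libraries often apply small-sample corrections, so their numbers can differ slightly. The samples and the tolerance `tol` used to call something mesokurtic are assumptions made for this illustration.

```python
def central_moment(data, order):
    """Average of (x - mean) ** order over the data."""
    mu = sum(data) / len(data)
    return sum((x - mu) ** order for x in data) / len(data)

def skewness(data):
    """Third central moment scaled by the standard deviation cubed."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    """Fourth central moment scaled by variance squared, minus 3 (the normal's value)."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

def kurtosis_type(data, tol=0.1):
    k = excess_kurtosis(data)
    if k > tol:
        return "leptokurtic"
    if k < -tol:
        return "platykurtic"
    return "mesokurtic"

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]           # long tail on the right
left_skewed = [-x for x in right_skewed]     # mirror image: long tail on the left

for name, sample in [("symmetric", symmetric),
                     ("right skewed", right_skewed),
                     ("left skewed", left_skewed)]:
    print(name, round(skewness(sample), 3), kurtosis_type(sample))
```

A positive skewness goes with a long right tail and a negative one with a long left tail, matching the definitions above.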

Q5. What is the three standard deviations rule of the normal distribution?

The three deviation rule says that, for normally distributed data, nearly all of the data lies within 3 standard deviations of the center, in fixed proportions: about 68% of the data lies within 1 standard deviation of the center, about 95% lies within 2 standard deviations and about 99.7% lies within 3 standard deviations.
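These percentages are not arbitrary; they come straight from the CDF of the normal distribution. A minimal check using the standard normal (mean 0, standard deviation 1) from Python's `statistics` module:

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

# Share of the distribution within k standard deviations of the mean.
shares = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
# within 1 standard deviation(s): 68.3%
# within 2 standard deviation(s): 95.4%
# within 3 standard deviation(s): 99.7%
```

The same shares hold for any normal distribution, whatever its mean and standard deviation, because the rule is stated in units of standard deviations.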

To know more about the normal distribution:

Check the StatQuest and my article on Linear Regression.

Hope you liked it. If you want me to add or correct anything, please mention it in the comments, and guide me towards more questions like these. For most of my work I take reference from Krish Naik Sir's videos and from StatQuest. These are two of the most productive and awesome Data Science channels on YouTube.
