20+ Statistics Concepts for Data Science Beginners

Mehul Gupta
Data Science in your pocket
6 min read · Jul 13, 2019


Statistics is one of the most important components of Data Science, yet it is often ignored. For that reason, I decided to start off a series of articles on stats, and I intend to cover all the important concepts one might need in a basic data science problem.

Note that this article assumes you are familiar with the common concepts in statistics like mean, median, mode, variance or standard deviation. I won’t be discussing these topics in much detail in this post.

So let's begin!!

1. Descriptive Stats - This branch of stats describes the entire population using measures like the standard deviation, mean, etc. You can look at these as numbers that represent the entire data. Hence, in very simple words, we can say that descriptive stats are calculated over the entire data.

Example- Data=1,2,3,4,5.

Mean=3, median=3.

Here the stats are calculated over the entire data.
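As a minimal sketch of how you might compute these in Python (assuming NumPy is installed):

```python
import numpy as np

# The full "population" from the example above
data = np.array([1, 2, 3, 4, 5])

print(np.mean(data))    # 3.0
print(np.median(data))  # 3.0
print(np.std(data))     # population standard deviation, ~1.414
```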

2. Inferential Stats - It refers to drawing conclusions about the entire population on the basis of sample data extracted from it. It is quite useful when you can’t take the entire population into consideration.

Example - It is not possible to take all human beings into consideration for calculating average life expectancy. Hence a sample of people (quite big though!!) is taken, and the average life expectancy of this sample is used as an estimate for the entire human race.
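Here is a toy simulation of the same idea; the population and its numbers are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "population" of life expectancies we can't fully measure
population = rng.normal(loc=72, scale=10, size=1_000_000)

# Draw a (quite big!) sample and use its mean as an estimate
sample = rng.choice(population, size=5_000, replace=False)

print(population.mean())  # the true average, ~72
print(sample.mean())      # the sample estimate, very close to it
```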

3. Statistically significant effect - An effect is considered statistically significant when we have enough evidence that it hasn’t occurred by chance. We should have enough proof to believe that the effect is not an artifact of chance or of biased choices of data, but has really occurred.

Example - It might happen that after taking some supplement, many people build up 6-pack abs. But those 6-pack abs could be down to chance/luck as well. So the effect of the supplement on the body can’t be considered significant without enough proof.

4. Probability Mass Function- Often called PMF, these functions are used for calculating the probability of discrete values in a given distribution.

Example - If the given data is 1, 1, 2, 3, let pmf() be our function. Hence:

pmf(1)=0.5, pmf(2)=0.25, pmf(3)=0.25
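A minimal sketch of computing this PMF from the data (plain Python, no extra libraries):

```python
from collections import Counter

data = [1, 1, 2, 3]
n = len(data)

# PMF: frequency of each value divided by the total count
pmf = {value: count / n for value, count in Counter(data).items()}
print(pmf)  # {1: 0.5, 2: 0.25, 3: 0.25}
```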

5. Probability Density Function - Also called PDF, it defines a probability distribution for a continuous random variable, as opposed to a discrete one (as with the PMF). When the PDF is plotted, the area under the curve over an interval gives the probability that the variable falls in that interval. For a continuous variable, the probability of any single exact value is always 0!!!! (surprisingly). You can think of the PDF as something you integrate between limits to get a probability.

Example - The probability of life expectancy falling between 60–65 is given by the area under the PDF over that interval, while the probability of it being exactly 65 is 0, since life expectancy is continuous.
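To make this concrete, here is a sketch using SciPy; the Normal(70, 10) model for life expectancy is an assumption made up for the example:

```python
from scipy.stats import norm

# Hypothetical model: life expectancy ~ Normal(mean=70, sd=10)
life = norm(loc=70, scale=10)

# P(60 <= X <= 65): the area under the PDF, computed via the CDF
print(life.cdf(65) - life.cdf(60))  # ~0.15

# The probability of exactly 65 is 0 for a continuous variable;
# pdf(65) returns a density, not a probability
print(life.pdf(65))
```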

6. Cumulative Mass/Density Function - Often called the CDF (cumulative distribution function), it is the cumulative probability calculated over the PDF/PMF, i.e. the sum of all probabilities associated with values up to and including the given value.

Example - Data = 1, 1, 2, 3

cmf(1)=0.5, cmf(2)=cmf(1)+pmf(2)=0.75, cmf(3)=cmf(2)+pmf(3)=1.0

Likewise, a CDF can be defined over a PDF as well (with integration in place of summation)!!
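Continuing the PMF example above, the CDF is just a running sum of the PMF (a sketch in plain Python):

```python
from itertools import accumulate

data = [1, 1, 2, 3]
n = len(data)

values = sorted(set(data))
pmf = [data.count(v) / n for v in values]

# CDF: running sum of the PMF over sorted values
cdf = dict(zip(values, accumulate(pmf)))
print(cdf)  # {1: 0.5, 2: 0.75, 3: 1.0}
```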

7. Percentile - It can be taken as the percentage of people who performed below you.

Example-if your percentile score is 95 in a test, it means 95% of the people who attempted the test scored below you.
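SciPy can compute this directly; the test scores below are invented for illustration:

```python
from scipy.stats import percentileofscore

scores = [40, 55, 60, 68, 72, 75, 80, 85, 90, 95]

# kind='strict' counts only scores strictly below the given one
print(percentileofscore(scores, 90, kind='strict'))  # 80.0
```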

8. Sample variance/standard deviation - You might hear this term quite often. It is quite similar to the original version, but with one difference: the denominator is N-1 instead of N (where N is the total number of items). There is no rocket science behind this. When we take a sample (and not the whole data), dividing by N tends to underestimate the true variance/standard deviation of the population. Hence we use N-1 (known as Bessel’s correction) to increase the estimate slightly!!

Example - Data = 1, 2, 3 (let it be a sample of a larger dataset)

hence sample variance = 1 (check for yourself)

but if it is the entire dataset, variance=2/3
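You can check this yourself with NumPy, where the ddof parameter controls the denominator:

```python
import numpy as np

sample = np.array([1, 2, 3])

print(np.var(sample))          # N in the denominator: 0.666...
print(np.var(sample, ddof=1))  # N-1 in the denominator: 1.0
```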

9. Data Distribution — The distribution of a statistical data set (or a population) is a listing or function showing all the possible values (or intervals) of the data and how often they occur.

10. Normal Distribution - A bell-shaped, symmetric data distribution whose total area under the curve is 1. The standard normal distribution is the special case with mean 0 and standard deviation 1. It is amongst the most important distributions, and I will be covering it in my upcoming articles as well.

11. Central Limit Theorem - When we repeatedly draw samples from a large data source and compute each sample’s mean, the distribution of those sample means approaches a Normal Distribution as the sample size grows, regardless of the shape of the original data (given a few assumptions about the sampling).
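A quick simulation sketch of the theorem: even though the population below is exponential (not normal at all), the sample means cluster into a bell shape:

```python
import numpy as np

rng = np.random.default_rng(0)

# A clearly non-normal population: exponential, mean = 2.0
population = rng.exponential(scale=2.0, size=100_000)

# Take many samples of size 50 and record each sample's mean
sample_means = [rng.choice(population, size=50).mean() for _ in range(2_000)]

print(np.mean(sample_means))  # close to the population mean, ~2.0
print(np.std(sample_means))   # close to sigma / sqrt(50), ~0.28
```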

12. Skewness - It refers to asymmetry in the data distribution: a positively skewed distribution has a longer tail on the right, while a negatively skewed one has a longer tail on the left.

13. Kurtosis - It refers to the heaviness of the tails of a distribution. Compared with a normal distribution, high kurtosis means heavier tails (more extreme values), while low kurtosis means lighter tails.
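SciPy provides both measures directly; the two synthetic datasets below just illustrate the contrast:

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(1)
normal_data = rng.normal(size=10_000)
skewed_data = rng.exponential(size=10_000)

print(skew(normal_data), kurtosis(normal_data))  # both near 0 for a normal
print(skew(skewed_data), kurtosis(skewed_data))  # ~2 and ~6: long right tail
```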

14. Hypothesis - It is an assumption we make at the very beginning of a problem, which we then test against the data.

15. Hypothesis Testing - This test is carried out to figure out whether an effect is statistically significant or not. If the test rejects the null hypothesis, we have evidence that the effect is significant; but failing to reject the null doesn’t prove the effect is absent, it only means we lack the evidence to call it significant. It involves 2 hypotheses.

Null Hypothesis-Assuming the effect is insignificant

Alternate Hypothesis- Assuming the effect is significant.

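As one concrete illustration (the article doesn’t prescribe a specific test), here is a one-sample t-test in SciPy on made-up supplement data; it also shows the test statistic from the next point:

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(7)

# Hypothetical weights (kg) after taking the supplement;
# null hypothesis: the true mean is still 70
weights = rng.normal(loc=72, scale=5, size=30)

statistic, p_value = ttest_1samp(weights, popmean=70)
print(statistic, p_value)

# Common convention: reject the null hypothesis if p < 0.05
print("significant" if p_value < 0.05 else "not significant")
```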

16. Test Statistic - It is a number calculated from the data which helps us determine whether to reject the null hypothesis or not.

17. Type I error - Rejecting the null hypothesis when it is actually true, i.e. concluding there is an effect when there isn’t (a false positive).

Example - Concluding that horses can fly when actually they can’t.

18. Type II error - Failing to reject the null hypothesis when it is actually false, i.e. missing a real effect (a false negative).

Example - Refusing to believe that a horse has 4 legs, even though it does.

19. Confidence Interval - A range of values, computed from sample data, within which an unknown parameter is expected to lie. The associated confidence level (a percentage) tells us how often such intervals would capture the true value over repeated sampling.

Example - If the confidence level is 95% & the parameter (to figure out) is the mean, this yields an interval of values for the mean. If the interval obtained (after some mathematics) is 90–100, we are 95% confident that the true mean lies within this range.
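A sketch of computing such an interval for the mean with SciPy (the data is invented, and the t-distribution is one standard choice for small samples):

```python
import numpy as np
from scipy import stats

data = np.array([92, 95, 98, 90, 100, 94, 96, 97, 93, 95])

mean = data.mean()
sem = stats.sem(data)  # standard error of the mean

# 95% confidence interval for the mean, using the t-distribution
low, high = stats.t.interval(0.95, df=len(data) - 1, loc=mean, scale=sem)
print(low, high)
```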

20. Bias & Variance in estimated values - Bias refers to estimates that land systematically far from the actual values but are not scattered; variance means the estimates are scattered widely, even if their average is close to the actual values. The classic bulls-eye picture illustrates this: biased shots cluster tightly away from the center, while high-variance shots spread all around it.
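A toy simulation of the idea, comparing a deliberately shifted (biased) estimator of a known mean with a noisy (high-variance) one; all the numbers are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
true_mean = 10.0

biased, noisy = [], []
for _ in range(1_000):
    sample = rng.normal(loc=true_mean, scale=2.0, size=5)
    biased.append(sample.mean() + 1.0)  # systematically shifted: high bias
    noisy.append(sample[0])             # a single observation: high variance

print(np.mean(biased) - true_mean, np.std(biased))  # ~1.0 off, small spread
print(np.mean(noisy) - true_mean, np.std(noisy))    # ~0 off, large spread
```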

21. Covariance & Correlation - Covariance shows how much two variables vary together. One issue with it is that its magnitude depends on the units of the variables, so we don’t have a standard scale for judging the strength of the relation.

Example - If X & Y are two variables, their covariance can be -10, 10000, 0.000002, -99999, i.e. it has no predefined range.

To put covariance on a standard scale, we use correlation, which has a defined range of -1 to +1.
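NumPy computes both in one line each (x and y here are arbitrary toy data):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

print(np.cov(x, y)[0, 1])       # covariance: unit-dependent, unbounded
print(np.corrcoef(x, y)[0, 1])  # correlation: always between -1 and +1
```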

22. Causation - It refers to the root cause of an effect. It must be noted that correlation isn’t causation, i.e.

Example - Even if Holidays and Expenses are correlated with each other, we cannot conclude from the correlation alone that it is the Holidays causing the Expenses (such is stats :P)

I guess this is more than enough to begin with. If you feel anything is missing, comment below. I will be covering some common data distributions in my next post!
