STATISTICS FOR DATA SCIENCE

ALL IMPORTANT CONCEPTS OF STATISTICS IN DATA SCIENCE

Rakib Ansari
Analytics Vidhya
12 min read · Sep 23, 2020


1. VARIABLE : It is a placeholder that stores a value.

2. RANDOM VARIABLE : A random variable is a variable whose value is determined by the outcome of a random phenomenon (e.g. the result of a die roll).

It is of two types :

A. Numerical variable : A numerical variable is one that may take on any value within a finite or infinite interval (e.g. height, weight, temperature, blood glucose, …)

A numerical variable is further divided into two types :

A.1. Continuous (floating number) : A continuous variable is one that can take decimal values. For example : 5.6, 7.8, 0.001, 846.245

A.2. Discrete (whole number) : A discrete variable takes only whole counting-number values. For example : 0, 1, 2, 3, 4, 5, 6

B. Categorical Variable : A categorical variable is a variable that can take on one of a limited, and usually fixed, number of possible values (e.g. race, sex, age group)

A categorical variable is further divided into two types :

B.1. Nominal : A nominal variable is a categorical variable whose values have no inherent order (e.g. race, sex).

B.2. Ordinal : An ordinal variable is a categorical variable for which the possible values are ordered (e.g. education level (“high school”, “BS”, “MS”, “PhD”))

[Figure: summary of random variable types]

3. MEASURE OF CENTRAL TENDENCIES :

A. MEAN : It is the sum of a collection of numbers divided by the count of numbers in the collection.

mean = sum of all values / number of values

B. MEDIAN : The “middle” of a sorted list of numbers (when there are two middle numbers, we average them).

C. MODE : The mode of a set of data values is the value that appears most often.

NOTE : the mean, median, and mode help in imputing (filling in) missing values.

4. RANGE : The Range is the difference between the lowest and highest values. Example: In {4, 6, 9, 3, 7} the lowest value is 3, and the highest is 9. So the range is 9 − 3 = 6.

[Figures: mean; median vs mode vs range]
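To make these four measures concrete, here is a minimal sketch using only Python's standard library (the numbers are invented for illustration) :

```python
# Mean, median, mode, and range on a small sample.
import statistics

data = [4, 6, 9, 3, 7, 7]

mean = statistics.mean(data)         # sum of values / count of values -> 6
median = statistics.median(data)     # middle of the sorted list -> 6.5
mode = statistics.mode(data)         # most frequent value -> 7
value_range = max(data) - min(data)  # highest minus lowest -> 6

print(mean, median, mode, value_range)
```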

5. POPULATION, SAMPLE, POPULATION MEAN, SAMPLE MEAN :

POPULATION : A population is the complete set of similar items or events of interest.

SAMPLE : A sample is a smaller collection of items drawn from the population.

Every dataset that we get for building an ML model is a sample of data.

Population vs sample use case : an exit poll during an election, where a sample of voters is used to estimate the outcome for the whole population.

POPULATION MEAN : The population mean is an average of a group characteristic.

SAMPLE MEAN : A sample mean refers to the average of the sample data.

[Figures: population vs sample; population mean vs sample mean]
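A small simulation (synthetic heights, invented for illustration) shows the sample mean tracking the population mean :

```python
# The sample mean approximates the population mean.
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=170, scale=10, size=100_000)  # e.g. heights in cm

sample = rng.choice(population, size=500, replace=False)  # draw a sample

print("population mean:", population.mean())  # ~170
print("sample mean:    ", sample.mean())      # close to the population mean
```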

6. VARIANCE :

VARIANCE : It is the expectation of the squared deviation of a random variable from its mean : Var(X) = E[(X − μ)²]. Informally, it measures how far a set of numbers is spread out from its average value.

7. STANDARD DEVIATION AND MEASURE OF DISPERSION :

Standard deviation (SD) is the most commonly used measure of dispersion. It is a measure of the spread of data about the mean. SD is the square root of the sum of squared deviations from the mean divided by the number of observations : σ = sqrt( Σ(x − μ)² / N ).

The standard deviation is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range.

[Figure: standard deviation]
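In code (a sketch with NumPy; the ddof argument switches between the population and sample formulas) :

```python
# Variance is the mean squared deviation from the mean;
# the standard deviation is its square root.
import numpy as np

x = np.array([2, 4, 4, 4, 5, 5, 7, 9])

var_pop = x.var(ddof=0)     # sum((x - mean)^2) / N -> 4.0
std_pop = x.std(ddof=0)     # sqrt(variance)        -> 2.0
var_sample = x.var(ddof=1)  # divides by N - 1 instead (sample variance)

print(var_pop, std_pop, var_sample)
```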

8. GAUSSIAN/NORMAL DISTRIBUTION :

Normal distribution, also known as the Gaussian distribution, is a probability distribution that is symmetric about the mean, showing that data near the mean are more frequent in occurrence than data far from the mean. In graph form, normal distribution will appear as a bell curve.

A Gaussian distribution can be converted to the standard normal distribution (mean = 0 and standard deviation = 1) using the z-score : z = (x − mean) / standard deviation.

[Figure: Gaussian/normal distribution]

9. STANDARD NORMAL DISTRIBUTION :

The standard normal distribution is a normal distribution with a mean of zero and standard deviation of 1.

Empirical rule (68-95-99.7) :
68.2% of values lie within 1 standard deviation of the mean
95.4% lie within 2 standard deviations
99.7% lie within 3 standard deviations

[Figure: standard normal distribution]
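A quick simulation with NumPy (as a sanity check) reproduces these percentages :

```python
# Checking the 68-95-99.7 rule on simulated standard normal data.
import numpy as np

rng = np.random.default_rng(42)
z = rng.standard_normal(1_000_000)

for k in (1, 2, 3):
    within = np.mean(np.abs(z) <= k)
    print(f"within {k} sd: {within:.3f}")  # ~0.682, ~0.954, ~0.997
```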

10. Z-SCORE :

The value of the z-score tells you how many standard deviations you are away from the mean. If a z-score is equal to 0, it is on the mean. A positive z-score indicates the raw score is higher than the mean. For example, if a z-score is equal to +1, it is 1 standard deviation above the mean.

[Figure: z-score]
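In code (a sketch; scipy.stats.zscore performs the same computation as the formula above) :

```python
# z = (x - mean) / standard deviation, by hand and via scipy.
import numpy as np
from scipy import stats

x = np.array([60, 70, 80, 90, 100])
z_manual = (x - x.mean()) / x.std()
z_scipy = stats.zscore(x)  # same computation (ddof=0 by default)

print(z_manual)  # e.g. 100 is ~1.41 sd above the mean here
```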

11. PROBABILITY DENSITY FUNCTION :

A probability density function, or density of a continuous random variable, is a function whose value at any given sample in the sample space can be interpreted as providing a relative likelihood that the value of the random variable would equal that sample.

[Figure: probability density function]

12. CUMULATIVE DISTRIBUTION FUNCTION :

The cumulative distribution function (CDF) of a real-valued random variable X is the probability that X will take a value less than or equal to x : F(x) = P(X ≤ x).

[Figure: cumulative distribution function]
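A short sketch with SciPy's standard normal illustrates both functions :

```python
# pdf gives the relative likelihood at a point; cdf gives P(X <= x).
from scipy.stats import norm

print(norm.pdf(0.0))   # density at the mean -> ~0.3989
print(norm.cdf(0.0))   # P(X <= 0) -> 0.5
print(norm.cdf(1.96))  # -> ~0.975
```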

13. HYPOTHESIS TESTING :

Hypothesis testing in statistics is a way for you to test the results of a survey or experiment to see if you have meaningful results. You’re basically testing whether your results are valid by figuring out the odds that your results have happened by chance. If your results may have happened by chance, the experiment won’t be repeatable and so has little use.

[Figure: hypothesis testing]
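As a hedged illustration (this one-sample t-test example is mine, not from the article) : we test whether a sample is consistent with a claimed mean of 50.

```python
# One-sample t-test on synthetic data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=52, scale=5, size=30)  # true mean is actually 52

t_stat, p_value = stats.ttest_1samp(sample, popmean=50)
print(t_stat, p_value)  # a small p-value -> reject the null hypothesis
```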

14. KERNEL DENSITY ESTIMATION (KDE) :

Kernel density estimation (KDE) is a non-parametric way to estimate the probability density function of a random variable.

Kernel density estimates are closely related to histograms, but can be endowed with properties such as smoothness or continuity by using a suitable kernel.

[Figure: kernel density estimation]
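A sketch with SciPy's gaussian_kde on synthetic bimodal data :

```python
# Estimating a density non-parametrically with a Gaussian kernel.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)
data = np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 0.5, 500)])

kde = gaussian_kde(data)  # bandwidth chosen automatically
grid = np.linspace(-6, 6, 200)
density = kde(grid)       # smooth estimate of the pdf on the grid

print(density.max())
```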

15. CENTRAL LIMIT THEOREM :

The central limit theorem states that if you have a population with mean μ and standard deviation σ and take sufficiently large random samples from the population with replacement, then the distribution of the sample means will be approximately normally distributed.

The central limit theorem tells us that no matter what the distribution of the population is, the shape of the sampling distribution will approach normality as the sample size (N) increases.

[Figure: central limit theorem]
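A small simulation illustrates this : means of samples drawn from a heavily skewed exponential population still cluster normally around the population mean.

```python
# Means of samples from a very non-normal (exponential) population
# still look approximately normal, as the CLT predicts.
import numpy as np

rng = np.random.default_rng(3)
sample_means = rng.exponential(scale=1.0, size=(10_000, 50)).mean(axis=1)

print(sample_means.mean())  # ~1.0 (the population mean)
print(sample_means.std())   # ~1/sqrt(50), i.e. sigma / sqrt(N)
```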

16. SKEWNESS :

Skewness refers to distortion or asymmetry relative to the symmetrical bell curve (normal distribution) in a set of data. If one of the curve’s tails is stretched out to the left or to the right, the distribution is said to be skewed. Skewness can be quantified as a representation of the extent to which a given distribution differs from a normal distribution.

[Figure: skewness]
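A quick check with scipy.stats.skew :

```python
# A right-skewed sample has positive skewness; a symmetric one is near zero.
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(4)
print(skew(rng.normal(size=10_000)))       # ~0 (symmetric)
print(skew(rng.exponential(size=10_000)))  # ~2 (long right tail)
```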

17. COVARIANCE :

Covariance is a measure of the joint variability of two random variables. If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the lesser values, the covariance is positive.

Covariance indicates only the direction of the relationship; its magnitude is unbounded and scale-dependent, so it does not measure the strength of the association.

[Figures: covariance formula; positive, negative, and zero covariance]
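A sketch with NumPy (np.cov divides by n − 1, the sample convention) :

```python
# The off-diagonal entry of the covariance matrix is cov(x, y).
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 6])

cov_xy = np.cov(x, y)[0, 1]  # sample covariance
print(cov_xy)                # positive -> the variables move together
```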

18. PEARSON CORRELATION COEFFICIENT :

Pearson’s correlation coefficient (r) is a measure of the strength of the association between the two variables.

The Pearson correlation coefficient helps in feature selection.
The Pearson correlation coefficient lies between −1 and +1.

The Pearson correlation coefficient tells both magnitude and direction (of a linear relationship).

[Figures: Pearson correlation; formula of the Pearson correlation coefficient]
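A sketch with scipy.stats.pearsonr on nearly linear data :

```python
# Pearson's r and its p-value; r always lies in [-1, 1].
import numpy as np
from scipy.stats import pearsonr

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = 2 * x + np.array([0.1, -0.2, 0.0, 0.3, -0.1])  # nearly linear

r, p = pearsonr(x, y)
print(r)  # close to +1: strong positive linear association
```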

19. SPEARMAN RANK CORRELATION :

It assesses how well the relationship between two variables can be described using a monotonic function (a function between ordered sets that preserves or reverses the given order).

Spearman’s rank correlation coefficient tells magnitude and direction even for non-linear (but monotonic) data, and it is robust to outliers.

[Figures: formula of Spearman rank correlation; Pearson and Spearman give the same result when there is no outlier; Spearman gives a better result with outliers; positive Spearman correlation; negative Spearman correlation]
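A sketch contrasting the two coefficients on data with a single outlier :

```python
# One outlier drags Pearson's r down, while Spearman's rank
# correlation (which only needs monotonicity) stays at 1.
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([1, 2, 3, 4, 5, 6, 7, 100], dtype=float)  # outlier at the end

print(pearsonr(x, y)[0])   # noticeably below 1
print(spearmanr(x, y)[0])  # exactly 1.0: ranks are perfectly monotonic
```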

20. Q-Q PLOT :

A Q-Q (quantile-quantile) plot is a probability plot, which is a graphical method for comparing two probability distributions by plotting their quantiles against each other.

A Q–Q plot is used to compare the shapes of distributions, providing a graphical view of how properties such as location, scale, and skewness are similar or different in the two distributions.
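A minimal sketch with scipy.stats.probplot, which plots sample quantiles against normal quantiles :

```python
# Points near the reference line suggest the data are roughly normal.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(5)
data = rng.normal(loc=10, scale=2, size=300)

stats.probplot(data, dist="norm", plot=plt)
plt.show()
```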

21. CHEBYSHEV’S INEQUALITY :

Chebyshev’s inequality guarantees that, for a wide class of probability distributions, no more than a certain fraction of values can be more than a certain distance from the mean.

Specifically, no more than 1/k² of the distribution’s values can be more than k standard deviations away from the mean (or equivalently, at least 1 − 1/k² of the distribution’s values are within k standard deviations of the mean).

[Figure: Chebyshev’s inequality formula, P(|X − μ| ≥ kσ) ≤ 1/k²]
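A quick empirical check of the bound on a skewed distribution :

```python
# The bound holds for any distribution; here, an exponential one.
import numpy as np

rng = np.random.default_rng(6)
x = rng.exponential(scale=1.0, size=100_000)
mu, sigma = x.mean(), x.std()

for k in (2, 3):
    frac_outside = np.mean(np.abs(x - mu) >= k * sigma)
    print(frac_outside, "<=", 1 / k**2)  # empirical fraction vs the bound
```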

22. BINOMIAL DISTRIBUTION :

A binomial distribution can be thought of as simply the probability of a SUCCESS or FAILURE outcome in an experiment or survey that is repeated multiple times. The binomial is a type of distribution that has two possible outcomes (the prefix “bi” means two, or twice). For example, a coin toss has only two possible outcomes: heads or tails and taking a test could have two possible outcomes: pass or fail.

Binomial distributions must also meet the following three criteria:

A. The number of observations or trials is fixed.

B. Each observation or trial is independent.

C. The probability of success is exactly the same from one trial to another.

Real Life Examples :

If a new drug is introduced to cure a disease, it either cures the disease (it’s successful) or it doesn’t cure the disease (it’s a failure). If you purchase a lottery ticket, you’re either going to win money, or you aren’t. Basically, anything you can think of that can only be a success or a failure can be represented by a binomial distribution.

BINOMIAL DISTRIBUTION FORMULA : P(X = k) = C(n, k) · p^k · (1 − p)^(n − k)
n stands for the number of times the experiment runs, p represents the probability of one specific outcome (success), and k is the number of successes.
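A sketch with scipy.stats.binom for a fair coin (note the n = 1 case, which is the Bernoulli distribution discussed next) :

```python
# P(exactly k heads in n fair coin tosses).
from scipy.stats import binom

n, p = 10, 0.5
print(binom.pmf(5, n, p))  # P(exactly 5 heads) -> ~0.246
print(binom.cdf(3, n, p))  # P(at most 3 heads)
print(binom.pmf(1, 1, p))  # n = 1 is the Bernoulli special case -> 0.5
```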

23. BERNOULLI DISTRIBUTION :

A Bernoulli distribution is a discrete probability distribution for a Bernoulli trial : a random experiment that has only two outcomes (usually called a “Success” or a “Failure”). For example, the probability of getting heads (a “success”) while flipping a fair coin is 0.5. The probability of failure is 1 − p (1 minus the probability of success, which also equals 0.5 for a coin toss). It is a special case of the binomial distribution for n = 1; in other words, it is a binomial distribution with a single trial (e.g. a single coin toss).

24. LOG-NORMAL DISTRIBUTION :

A log-normal distribution is a continuous probability distribution of a random variable whose logarithm is normally distributed. Thus, if the random variable X is log-normally distributed, then Y = ln(X) has a normal distribution.

[Figure: log-normal distribution]

The log-normal distribution is extremely useful when analyzing stock prices, as long as the growth factor is assumed to be normally distributed.

The log-normal distribution curve can therefore be used to help better identify the compound return that the stock can expect to achieve over a period of time. Note that log-normal distributions are positively skewed with long right tails due to low mean values and high variances in the random variables.
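A quick check with NumPy : taking the log of log-normal samples recovers the underlying normal distribution.

```python
# If X is log-normal, then log(X) is normal.
import numpy as np

rng = np.random.default_rng(7)
x = rng.lognormal(mean=0.0, sigma=0.5, size=100_000)

logs = np.log(x)
print(logs.mean(), logs.std())  # ~0.0 and ~0.5: the underlying normal
```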

25. POWER LAW :

The power law (also called the scaling law) states that a relative change in one quantity results in a proportional relative change in another. The simplest example of the law in action is a square; if you double the length of a side (say, from 2 to 4 inches) then the area will quadruple (from 4 to 16 square inches).

[Figure: power law]
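The square example as a two-line check :

```python
# area = side^2, so doubling the side multiplies the area by 2^2 = 4.
side1, side2 = 2, 4
print(side1**2, side2**2, (side2**2) / (side1**2))  # 4, 16, 4.0
```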

26. BOX-COX TRANSFORM :

A Box-Cox transformation is a transformation of a non-normal dependent variable into a normal shape. Normality is an important assumption for many statistical techniques; if your data aren’t normal, applying a Box-Cox transformation means you are able to run a broader range of tests.
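A sketch with scipy.stats.boxcox on skewed synthetic data :

```python
# Box-Cox fits a power transform that makes skewed positive data
# more normal; it returns the transformed data and the fitted lambda.
import numpy as np
from scipy.stats import boxcox

rng = np.random.default_rng(8)
skewed = rng.exponential(scale=2.0, size=1_000)  # strictly positive input

transformed, lam = boxcox(skewed)
print(lam)  # fitted lambda; a value near 0 would mean a log transform
```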

27. POISSON DISTRIBUTION :

The Poisson distribution is the discrete probability distribution of the number of events occurring in a given time period, given the average number of times the event occurs over that time period.

EXAMPLE : A certain fast-food restaurant gets an average of 3 visitors to the drive-through per minute. This is just an average, however. The actual amount can vary.

[Figure: Poisson distribution]
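The drive-through example in code, with scipy.stats.poisson :

```python
# An average of 3 visitors per minute.
from scipy.stats import poisson

lam = 3  # average events per interval
print(poisson.pmf(0, lam))      # P(no visitors in a minute) -> ~0.050
print(poisson.pmf(3, lam))      # P(exactly 3 visitors)      -> ~0.224
print(1 - poisson.cdf(5, lam))  # P(more than 5 visitors)
```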

28. NON-GAUSSIAN DISTRIBUTION :

Although the normal distribution takes center stage in statistics, many processes follow a non-normal distribution. This can be because the data naturally follow a specific type of non-normal distribution (for example, bacteria growth naturally follows an exponential distribution). In other cases, your data collection methods or other methodologies may be at fault.

Types of Non-Normal Distributions

  1. Beta Distribution.
  2. Exponential Distribution.
  3. Gamma Distribution.
  4. Inverse Gamma Distribution.
  5. Log Normal Distribution.
  6. Logistic Distribution.
  7. Maxwell-Boltzmann Distribution.
  8. Poisson Distribution.
  9. Skewed Distribution.
  10. Symmetric Distribution.
  11. Uniform Distribution.
  12. Unimodal Distribution.
  13. Weibull Distribution.

Reasons for a Non-Normal Distribution :

  1. Outliers.
  2. Multiple distributions may be combined in your data.
  3. Insufficient data.
  4. Data may be inappropriately graphed.

Dealing with Non-Normal Distributions

You have several options for handling your non-normal data. Many tests, including the one-sample Z-test, the t-test, and ANOVA, assume normality. You may still be able to run these tests if your sample size is large enough (usually over 20 items). You can also choose to transform the data with a function, forcing it to fit a normal model. However, if you have a very small sample, a sample that is skewed, or one that naturally fits another distribution type, you may want to run a non-parametric test. A non-parametric test is one that doesn’t assume the data fit a specific distribution type. Non-parametric tests include the Wilcoxon signed-rank test, the Mann-Whitney U test, and the Kruskal-Wallis test.
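A hedged sketch of this workflow (the data here are synthetic) : test normality first, then fall back to a non-parametric test.

```python
# If normality fails and the samples are small, use a
# distribution-free test such as Mann-Whitney U.
import numpy as np
from scipy.stats import shapiro, mannwhitneyu

rng = np.random.default_rng(9)
a = rng.exponential(size=25)        # non-normal group
b = rng.exponential(size=25) + 0.5  # shifted version of the same shape

print(shapiro(a).pvalue)          # small p -> normality is doubtful
print(mannwhitneyu(a, b).pvalue)  # distribution-free comparison
```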

