Data Science: Statistical Basics

Narcis Teodoroiu · Published in Analytics Vidhya · 8 min read · Aug 22, 2021

According to Wikipedia, data science is a "concept to unify statistics, data analysis, informatics, and their related methods" in order to "understand and analyze actual phenomena" with data.


If you have ever heard of Data Science, I am sure you already know that statistics are an important foundation of this beautiful field. Therefore I have decided to write this blog to present a series of basic concepts.

My mathematician's mind makes me think in a structured way, and I want my blogs to follow a similar pattern, in which you can find plenty of images and examples and understand the concepts without having to read too much verbiage. That said, let's start…

Index:

  1. Population and Sample
  2. Mean, Median, Mode and Range
  3. Distributions
     • Normal Distribution
     • Standard Normal Distribution
  4. Central Limit Theorem
  5. Variability measures
     • Variance
     • Standard Deviation
     • Covariance
     • Coefficient of correlation
  6. Outlier measures
     • Skewness
     • Kurtosis
     • IQR Method

Population and Sample

A population is the entire group that you want to draw conclusions about, whilst a sample is the specific group that you will collect data from. A sample is a subset of the population.

Source: Omniconvert.com

Mean, Median, Mode and Range

They express measures of central tendency. In different ways they each tell us what value in a data set is typical or representative of the data set.

The mean is the same as the average value of a dataset.

The median is the central number of the dataset.

The mode is the number that occurs most frequently in a dataset.

The range is the difference between the highest value and the lowest value.

Example: 7, 3, 4, 1, 7, 6

  • Mean: (7+3+4+1+7+6)/6 → ≈4.67
  • Median: 1, 3, 4, 6, 7, 7 → (4+6)/2=5
  • Mode: 7, 3, 4, 1, 7, 6 → 7
  • Range: 7–1 → 6
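If you want to double-check the example in code, here is a minimal sketch using Python's built-in statistics module (the list is just the toy dataset above):

```python
# Quick check of the example above with Python's built-in statistics module.
import statistics

data = [7, 3, 4, 1, 7, 6]

print(statistics.mean(data))    # ≈4.67 -> the average value
print(statistics.median(data))  # 5     -> average of the two middle values (4 and 6)
print(statistics.mode(data))    # 7     -> the most frequent value
print(max(data) - min(data))    # 6     -> range = highest minus lowest
```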

Distributions

Normal/Gaussian Distribution

It is a type of continuous probability distribution for a real random variable.

It can be described with just two parameters: the mean and the standard deviation.

Source: Michael Galarnyk

Properties:

  • The mean, mode and median are all equal.
  • The curve is symmetric at the center (i.e. around the mean).
  • Exactly half of the values are to the left of center and exactly half of the values are to the right.
  • The total area under the curve is 1.
  • Its skewness is 0 and its kurtosis is 3.

Application in Machine Learning:

  • Data that follows a Normal Distribution is beneficial for model building: it makes the math easier.
  • Algorithms such as Logistic Regression and Linear Regression are derived under the assumption that the data is (approximately) normally distributed. So we often need to normalize the data before applying some machine learning algorithms.
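As a small, hedged illustration of that last point, the sketch below rescales a made-up feature matrix to zero mean and unit standard deviation with scikit-learn's StandardScaler (assuming scikit-learn is available); it is only one of several ways to prepare data:

```python
# Minimal sketch: standardizing features (zero mean, unit std) before training.
# X is a made-up 2-column feature matrix (e.g., height in cm, weight in kg).
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[170.0, 65.0],
              [180.0, 85.0],
              [160.0, 55.0],
              [175.0, 75.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # each column now has mean ≈ 0 and std ≈ 1

print(X_scaled.mean(axis=0))  # ~[0, 0]
print(X_scaled.std(axis=0))   # ~[1, 1]
```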

Why is it important?

  • Found in natural phenomena: it is the most important probability distribution in statistics because it fits many natural phenomena such as age, height, test scores, IQ scores, the sum of the rolls of two dice and so on.
  • Mathematical reason: Central Limit Theorem.
  • Simplicity in mathematics: its mean, median and mode are all the same, and the entire distribution can be specified using just two parameters: mean and standard deviation.
  • Unlike many other distributions that change their nature on transformation, a Gaussian tends to remain a Gaussian (Product of two Gaussians is a Gaussian, convolution of Gaussian with another Gaussian is a Gaussian).

Normal distribution in real life:

  • Height. Most of the people in a specific population are of average height. The numbers of people taller and shorter than average are almost equal, and a very small number of people are either extremely tall or extremely short.
  • Rolling a die. In an experiment, it has been found that when a die is rolled 100 times, the chance of getting a '1' is 15–18%, and if we roll the die 1,000 times, the chance of getting a '1' is, again, roughly the same.
  • IQ. The intelligence quotient of a majority of the people in the population lies in the normal range whereas the IQ of the rest of the population lies in the deviated range.
  • Technical stock market. The changes in the log values of Forex rates, price indices and stock price returns often form a bell-shaped curve. For stock returns, the standard deviation is often called volatility. If returns are normally distributed, more than 99 percent of them are expected to fall within three standard deviations of the mean.
  • And many more (shoe size, birth weight, income distribution in an economy, etc.)

Standard Normal Distribution

The standard normal distribution is a special case of the normal distribution where the mean is 0 and the standard deviation is 1. Converting a distribution into this form is called standardization.

The normal distribution can take on any value as its mean and standard deviation. In the standard normal distribution, the mean and standard deviation are always fixed.

Every normal distribution can be converted to the standard normal distribution by turning the individual values into z-scores.

X ∼ N(μ, σ) → Z = (X − μ) / σ ∼ N(0, 1)

Source: mathisfun.com

Empirical rule: 68/95/99.7

  • 68% of observations fall within ±1 standard deviation of the mean.
  • 95% of observations fall within ±2 standard deviations of the mean.
  • 99.7% of observations fall within ±3 standard deviations of the mean.
  • Values outside ±3 standard deviations account for less than 0.3% of observations and, depending on the situation, could be considered outliers or signal noise.

We convert normal distributions into the standard normal distribution for several reasons:

  • To find the probability of observations in a distribution falling above or below a given value.
  • To find the probability that a sample mean significantly differs from a known population mean.
  • To compare scores on different distributions with different means and standard deviations.
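For example, assuming exam scores follow N(μ=70, σ=10) (made-up numbers, purely for illustration), a z-score together with scipy.stats.norm answers exactly these kinds of questions:

```python
# Sketch: converting a value to a z-score and reading probabilities off
# the standard normal distribution. The mean/std of the scores are made up.
from scipy.stats import norm

mu, sigma = 70, 10          # assumed population mean and standard deviation
x = 85

z = (x - mu) / sigma        # z-score: how many standard deviations above the mean
print(z)                    # 1.5

print(norm.cdf(z))          # P(score <= 85) ≈ 0.933
print(1 - norm.cdf(z))      # P(score >  85) ≈ 0.067

# Empirical rule check: probability within ±1, ±2, ±3 standard deviations
for k in (1, 2, 3):
    print(k, norm.cdf(k) - norm.cdf(-k))   # ≈ 0.68, 0.95, 0.997
```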

Central Limit Theorem

Introduction in context: "Suppose we want to study the average age of the whole population of China. As the population of China is very large, it would be a tedious job to get everyone's age data and the survey would take a lot of time. So instead of doing that, we can collect samples from different parts of China and try to make an inference. To work with samples we need an approximation theory which can simplify the process of calculating the mean age of the whole population. Here the Central Limit Theorem comes into the picture."

Definition: If you sample batches of data from any distribution and take the mean of each batch, then the distribution of those means will resemble a Gaussian distribution, no matter what the shape of the population distribution is (provided the batches are reasonably large).

Source: Wikipedia
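A minimal simulation sketch of this idea, assuming NumPy and a deliberately skewed (exponential) "population" of ages:

```python
# Sketch of the Central Limit Theorem: sample means taken from a skewed
# (exponential) population still end up approximately normally distributed.
import numpy as np

rng = np.random.default_rng(0)

population = rng.exponential(scale=30.0, size=1_000_000)  # skewed made-up "ages"

batch_size = 50
n_batches = 10_000
means = rng.choice(population, size=(n_batches, batch_size)).mean(axis=1)

print(population.mean(), means.mean())   # both ≈ 30
print(means.std())                       # ≈ 30 / sqrt(50) ≈ 4.2
# A histogram of `means` looks bell-shaped even though the population is skewed.
```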

Variability measures

Variance (σ²)

Definition: The average of the squared differences from the mean.

Disadvantage: It is expressed in squared units (e.g., meters squared), which are harder to interpret than the original units.

Standard Deviation (σ)

Definition: A measure of how spread out the numbers are. It indicates how much the values in the dataset typically deviate from the mean of the sample.

Advantage: Is expressed in the same units as the original values (e.g., meters)
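In code, a quick sketch with NumPy, reusing the toy dataset from the earlier example:

```python
# Sketch: variance and standard deviation with NumPy.
import numpy as np

data = np.array([7, 3, 4, 1, 7, 6])

variance = data.var()   # average squared deviation from the mean (population, ddof=0)
std_dev = data.std()    # square root of the variance, in the same units as the data

print(variance)         # ≈ 4.89
print(std_dev)          # ≈ 2.21

# For the sample (rather than population) versions, pass ddof=1:
print(data.var(ddof=1), data.std(ddof=1))
```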

Covariance

Definition: Measures the direction of the relationship between two variables.

Covariance is zero for independent variables, because then the variables do not move together (note that the converse does not hold: zero covariance does not imply independence).

Disadvantages:

  • Its range is from -∞ to +∞, so its magnitude is hard to interpret.
  • It is affected by changes of scale.

Coefficient of correlation

Definition: Measures the strength of the relationship between two variables. It is the normalized version of the covariance.

Independent movements do not contribute to the total correlation. Completely independent variables have a zero correlation.

Advantages:

  • Its range is from -1 to +1.
  • It is not influenced by scaling.
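A quick NumPy sketch of both measures (the two series are made up; y roughly follows 2x, so the variables move together):

```python
# Sketch: covariance vs. correlation with NumPy. x and y are made-up series.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # roughly y ≈ 2x

cov_xy = np.cov(x, y)[0, 1]               # covariance: its sign gives the direction
corr_xy = np.corrcoef(x, y)[0, 1]         # correlation: normalized to [-1, +1]

print(cov_xy)    # positive, but scale-dependent
print(corr_xy)   # ≈ 0.999, close to +1 and scale-free

# Rescaling x changes the covariance but not the correlation:
print(np.cov(10 * x, y)[0, 1])        # 10x larger covariance
print(np.corrcoef(10 * x, y)[0, 1])   # unchanged correlation
```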

Outlier measures

Skewness

Skewness is a measure of how much the probability distribution of a random variable deviates from the normal distribution. It is useful for checking for outliers, since it measures the lack of symmetry in a data distribution.

There are two types of skewness:

  • Positive skewness. The tail on the right side of the distribution is longer or fatter. Mode < Median < Mean.
  • Negative skewness. The tail on the left side of the distribution is longer or fatter. Mean < Median < Mode.
Image: Sigmamagic.com

Why is it important?

The tail region may act as outliers for a statistical model, and we know that outliers adversely affect a model's performance, especially in regression-based models. So there is often a need to transform skewed data until it is close enough to a Gaussian distribution, as sketched below.
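A hedged sketch of that idea: measuring skewness with scipy.stats.skew on a made-up right-skewed sample and reducing it with a log transform:

```python
# Sketch: measuring skewness and reducing it with a log transform.
# The exponential sample is made up simply because it is strongly right-skewed.
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=10_000)

print(skew(data))             # clearly positive -> long right tail (≈ 2 for an exponential)
print(skew(np.log1p(data)))   # much smaller in magnitude after the log transform
```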

Kurtosis

Kurtosis is a statistical measure that defines how heavily the tails of a distribution differ from the tails of a normal distribution. In other words, it identifies whether the tails contain extreme values in a given distribution.

There are three types of kurtosis:

  • Normal Kurtosis. A normal distribution has a kurtosis of 3.
  • High Kurtosis (>3). The distribution has longer, fatter tails. It is an indicator that the data has outliers: if there is high kurtosis, we need to investigate why we have so many outliers.
  • Low Kurtosis (<3). The distribution has shorter, thinner tails than the normal distribution. It is an indicator that the data lacks outliers: if we get low kurtosis (too good to be true), we also need to investigate and trim the dataset of unwanted results.
Source: Analystprep.com
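A quick check with scipy.stats.kurtosis on made-up samples (note that SciPy reports excess kurtosis by default, so fisher=False is passed to get the "normal = 3" convention used above):

```python
# Sketch: comparing tail heaviness with scipy.stats.kurtosis.
# fisher=False returns "plain" kurtosis, where a normal distribution is 3.
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)

normal_data = rng.normal(size=100_000)
heavy_tails = rng.standard_t(df=5, size=100_000)   # t-distribution: fatter tails
light_tails = rng.uniform(-1, 1, size=100_000)     # uniform: thinner tails

print(kurtosis(normal_data, fisher=False))  # ≈ 3
print(kurtosis(heavy_tails, fisher=False))  # > 3 -> more extreme values / outliers
print(kurtosis(light_tails, fisher=False))  # < 3 (uniform ≈ 1.8) -> hardly any outliers
```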

IQR Method

The interquartile range (IQR) is the difference between Q3 and Q1.

Properties:

  • The median is the center point of the data, also called the second quartile (this assumes the data is ordered).
  • Q1 is the first quartile of the data, i.e., 25% of the data lies between the minimum and Q1.
  • Q3 is the third quartile of the data, i.e., 75% of the data lies between the minimum and Q3.
Source: Wikipedia

To detect the outliers using this method, we define a new range and any data point lying outside this range is considered as outlier and is accordingly dealt with. The range is as given below:

  • Lower Bound: Q1 − 1.5 * IQR
  • Upper Bound: Q3 + 1.5 * IQR

Why ‘1.5’ ?

Roughly 0.3% of the whole data lies outside three standard deviations (>3σ) of the mean (μ); this part of the data is considered outliers. The first and the third quartiles, Q1 and Q3, lie at -0.675σ and +0.675σ from the mean, respectively. To get exactly 3σ we would need to take a scale of about 1.7, but 1.5 is more "symmetrical" than 1.7, and we have always been a little more inclined towards symmetry.
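Putting the rule into code, a minimal sketch with NumPy (the data array is made up and contains one obvious outlier):

```python
# Sketch: flagging outliers with the 1.5 * IQR rule.
# The data is made up, with 120 as an obvious outlier.
import numpy as np

data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 10, 10, 120])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

outliers = data[(data < lower_bound) | (data > upper_bound)]
print(lower_bound, upper_bound)   # the "whiskers" of a box plot
print(outliers)                   # [120]
```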

Thanks for reading this far!

I hope you found this insightful and that it helps you in your data science career :) If you enjoyed the content, be sure to follow me on Medium. As always, I wish you the best in your learning endeavors!

Narcis Teodoroiu

  • Did you find the article interesting? FOLLOW me on Medium.
  • If you are interested in networking, let’s CONNECT on LinkedIn.
