Statistics For Data Analyst Part-2

Suraj Gusain
Feb 4, 2023

We covered the earlier topics in Statistics for Data Science Part-1. Refer to that article if you have not read it yet.

In this article, we will discuss measures of central tendency and measures of spread.

Measures of Central Tendency

A measure of central tendency is a statistic that describes the “center”, or typical value, of a data set.

Central tendency (sometimes called the measure of location, central location, or just the center) attempts to describe an entire data set with a single figure, such as an average amount or a median price.

The three most common measures of central tendency are:

  1. Mean: It is the average of all the values in a set of data. Example: The mean of {1,2,3,4} is (1+2+3+4)/4 = 2.5
  2. Median: It is the middle value of a data set arranged in order. With an even number of values, it is the average of the two middle values. Example: The median of {1,2,3,4} is (2+3)/2 = 2.5
  3. Mode: It is the value that appears most frequently in a set of data. Example: The mode of {1,2,3,4,4} is 4.

Note: The mean, median, and mode are not always the same. Whether they coincide depends on the shape of the data's distribution.
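
As a quick check, the three examples above can be reproduced with Python's built-in statistics module. This is only a minimal sketch, and the variable names are illustrative rather than anything from the article.

```python
import statistics

values = [1, 2, 3, 4]
values_with_repeat = [1, 2, 3, 4, 4]

# Mean: sum of the values divided by how many there are.
mean_value = statistics.mean(values)

# Median: with an even count, the average of the two middle values (2 and 3).
median_value = statistics.median(values)

# Mode: the value that occurs most often (4 appears twice).
mode_value = statistics.mode(values_with_repeat)

print(mean_value, median_value, mode_value)  # 2.5 2.5 4
```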

How outliers affect the mean, with examples

A real-life example of outliers affecting the mean is the calculation of the average salary of employees in a company. If the company has a small number of highly paid executives, their salaries could significantly increase the mean salary of the entire company.

For example, if a company has 100 employees and 98 of them earn an average salary of $50,000 per year, but two executives earn $1 million per year, the mean salary for the company would be ($50,000 * 98 + $1,000,000 * 2) / 100 = $69,000.

In this case, the mean salary of $69,000 does not accurately reflect the typical salary earned by employees in the company, as the high salaries of the two executives have significantly skewed the mean. In such a situation, it would be more appropriate to use the median salary as a measure of central tendency, which would give a better indication of the “typical” salary in the company.
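
The salary example can be sketched in Python to compare the mean with the median. The sketch assumes, for simplicity, that each of the 98 non-executive employees earns exactly $50,000 (the text only gives their average), so the salary list itself is illustrative.

```python
import statistics

# Assumption: all 98 non-executive salaries are exactly $50,000.
salaries = [50_000] * 98 + [1_000_000] * 2

mean_salary = statistics.mean(salaries)      # pulled upward by the two executives
median_salary = statistics.median(salaries)  # middle of the sorted salaries

print(f"mean:   ${mean_salary:,.0f}")
print(f"median: ${median_salary:,.0f}")
```

Under this assumption the mean works out to $69,000 while the median stays at $50,000, which is what almost every employee actually earns.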

Another example is student test scores: in a class, if one student scores far higher or lower than the rest, that single score can pull the mean test score away from the typical performance of the class.

Three Kinds of Descriptive Statistics: Depending on how many variables are involved

Univariate, bivariate, and multivariate analysis are statistical approaches to describing and analyzing data, distinguished by how many variables they examine at once.

  1. Univariate Analysis: It is the simplest form of data analysis and deals with only one variable at a time. Univariate analysis involves summarizing the main features of a single variable, such as its mean, median, mode, range, and frequency distribution.
  2. Bivariate Analysis: It involves the analysis of two variables and their relationship with each other. Bivariate analysis helps to determine whether there is a relationship between two variables and how strong that relationship is. Common bivariate analysis methods include scatter plots, correlation, and regression analysis.
  3. Multivariate Analysis: It involves the analysis of three or more variables. Multivariate analysis helps to understand the relationships between multiple variables and how they affect each other. Common multivariate analysis methods include principal component analysis, factor analysis, and discriminant analysis.
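
To make the distinction concrete, here is a small sketch contrasting a univariate summary with a bivariate measure (the Pearson correlation). The data (hours studied and exam scores) are made up for illustration, and pearson_correlation is a hypothetical helper written for this sketch, not something from the article.

```python
import math
import statistics

# Made-up illustrative data: two variables measured on the same five students.
hours_studied = [1, 2, 3, 4, 5]
exam_scores = [52, 58, 65, 70, 80]

# Univariate analysis: summarize one variable on its own.
score_summary = {
    "mean": statistics.mean(exam_scores),
    "median": statistics.median(exam_scores),
    "range": max(exam_scores) - min(exam_scores),
}


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    covariance_sum = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance_sum / (spread_x * spread_y)


# Bivariate analysis: measure how two variables move together.
r = pearson_correlation(hours_studied, exam_scores)

print(score_summary)
print(f"correlation between hours and scores: {r:.3f}")
```

Multivariate methods such as principal component analysis usually rely on a dedicated library, so they are left out of this sketch.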

Variance:

Variance is a statistical measure that quantifies the spread or dispersion of a set of data points around the mean or average. It represents the average squared deviation from the mean.

Standard Deviation:

Standard deviation is the square root of the variance, and it provides a way to describe how much individual data points deviate from the mean. It gives an idea of how widely spread the data is.

Here’s an example:

Suppose you have a set of 5 numbers: 2, 4, 5, 4, and 9. The mean of these numbers is (2 + 4 + 5 + 4 + 9) / 5 = 24 / 5 = 4.8.

To find the variance, you first find the deviation of each number from the mean:

  • 2 deviates from the mean by 2 - 4.8 = -2.8
  • 4 deviates from the mean by 4 - 4.8 = -0.8
  • 5 deviates from the mean by 5 - 4.8 = 0.2
  • 4 deviates from the mean by 4 - 4.8 = -0.8
  • 9 deviates from the mean by 9 - 4.8 = 4.2

Next, square each deviation:

  • (-2.8)² = 7.84
  • (-0.8)² = 0.64
  • 0.2² = 0.04
  • (-0.8)² = 0.64
  • 4.2² = 17.64

Finally, average the squared deviations:

(7.84 + 0.64 + 0.04 + 0.64 + 17.64) / 5 = 26.8 / 5 = 5.36

So the variance is 5.36. (This is the population variance, which divides by the number of values n. The sample variance divides by n - 1 instead.)

To find the standard deviation, simply take the square root of the variance:

√5.36 ≈ 2.32

So the standard deviation is about 2.32, which indicates that the data points typically lie roughly 2.3 units from the mean.
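
The same steps can be sketched in Python and cross-checked against the standard library. The population functions pvariance and pstdev divide by n, matching the "average the squared deviations" step, while variance divides by n - 1 and is shown only for contrast.

```python
import math
import statistics

data = [2, 4, 5, 4, 9]

# Step by step, following the walkthrough.
mean = sum(data) / len(data)
squared_deviations = [(x - mean) ** 2 for x in data]
variance = sum(squared_deviations) / len(data)  # population variance (divide by n)
std_dev = math.sqrt(variance)

print(f"mean = {mean}, variance = {variance:.2f}, std dev = {std_dev:.2f}")

# Cross-check with the standard library's population versions.
print(statistics.pvariance(data), statistics.pstdev(data))

# Sample variance divides by n - 1 instead of n, so it comes out larger.
print(statistics.variance(data))
```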

Variance and Standard Deviation:

Variance and standard deviation are related statistical measures describing the spread or dispersion of data points around the mean. The main difference between them is that variance is a measure of variability, expressed in squared units, while standard deviation is the square root of the variance and is expressed in the same units as the original data.

In other words, both measures capture how far the data points lie from the mean, but variance reports that spread in squared units (for example, dollars squared), while standard deviation reports it in the original units (dollars). For this reason, the standard deviation is often used as a more intuitive and easier-to-understand measure of variability in data.

What if two datasets have the same mean? How do we differentiate them?

If the mean of two datasets is the same, it does not necessarily mean that the datasets are the same. Measures of dispersion, such as variance and standard deviation, can be used to differentiate the datasets.

For example, two datasets may have the same mean but different variances, indicating that one dataset is more spread out and has greater variability than the other. The same comparison can be made with standard deviations, which express that spread in the original units of the data and are therefore easier to interpret.
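
A short sketch makes this concrete. The two lists below are made-up datasets chosen to share a mean of 50. They are not from the article.

```python
import statistics

# Made-up illustrative datasets with the same mean but different spread.
tight = [48, 49, 50, 51, 52]
wide = [10, 30, 50, 70, 90]

for name, data in (("tight", tight), ("wide", wide)):
    print(
        f"{name}: mean = {statistics.mean(data)}, "
        f"variance = {statistics.pvariance(data)}, "
        f"std dev = {statistics.pstdev(data):.2f}"
    )
```

Both means are 50, but the population variances are 2 and 800, so the second dataset is far more spread out even though the means cannot tell the two apart.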

Additionally, other measures of central tendency and dispersion, such as the median and interquartile range, can also be used to differentiate datasets, as well as other descriptive statistics, such as skewness and kurtosis, which describe the shape and distribution of the data.

Thank you for taking the time to read the article. I appreciate your attention and feedback.

If you found this article helpful, please consider sharing it with others who might benefit from it. Your support in spreading the word is greatly appreciated.

Connect with me on LinkedIn

Connect with me on GitHub

Connect with me on Instagram

Connect with me on Kaggle
