Statistics for Data Analysts: Part 2

Shivani Dashore
9 min read · May 29, 2023


Image Source — Getty

In the first part of our series on statistics, we explored the fundamentals of statistical analysis. Now, in this second part, we will delve into the types of data, measures of central tendency, and measures of dispersion. Understanding these concepts is crucial for effectively analyzing and interpreting data. So, let’s dive in!

Before jumping into this article, take a look at the first part:

Basics of Statistics: Part 1

Importance of Data Types in Statistics

Understanding data types is crucial in statistics. When conducting an experiment or analyzing data, it’s important to know the type of data you’re working with. This information determines the appropriate statistical analysis techniques, visualizations, and prediction algorithms that can be applied. By understanding data types, you can make informed decisions about how to handle and interpret your data effectively.

Image Source- Intellispot.com

Two Types of Data

Qualitative and quantitative are the two basic types of data. As the names suggest, qualitative data describes qualities or characteristics (some statisticians call it categorical data), while quantitative data deals with numbers. The differences between the two data types are shown below.

Image Source: Author

Quantitative data comes in two types: discrete and continuous. They are explained below:

Discrete: Discrete data can be counted but not measured. It takes separate, distinct values, usually whole-number counts of something, and often comes from a finite set of possibilities.

For example, consider the number that comes up when you roll a die. There are only six possible outcomes (1, 2, 3, 4, 5, or 6); you will never get a value above 6 or a fraction such as 4.5. Likewise, a coin flip has only two outcomes, heads or tails, so the set of possible values is fixed and finite.

Continuous: Continuous data cannot be counted, only measured. It can take any value within a range, so there are infinitely many possible values.

For example, a person’s height or weight. Height can be 5.23, 5.24, 5.76, etc. Similarly, weight can be any value like 75.82 kg or 62.35 kg. Height, weight, length, speed, and temperature are all examples of continuous data.
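To make the distinction concrete, here is a minimal Python sketch. The simulated die rolls and the 4.5 to 6.5 height range are assumptions made up for illustration, not real measurements.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Discrete: a die roll can only land on one of six whole-number outcomes.
die_rolls = [random.randint(1, 6) for _ in range(1000)]
distinct_rolls = sorted(set(die_rolls))

# Continuous: a measured height can fall anywhere inside a range.
# The 4.5 to 6.5 range is an arbitrary illustration, not real data.
heights = [random.uniform(4.5, 6.5) for _ in range(1000)]
distinct_heights = len(set(heights))

print(distinct_rolls)    # never more than six distinct values
print(distinct_heights)  # far more distinct values than the die produces
```

However many times the die is rolled, the set of outcomes never grows past six values, while the simulated heights almost never repeat.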

Exploring Measures of Central Tendency and the Impact of Outliers

A measure of central tendency is a descriptive statistic that summarizes a dataset with a single typical, or central, value. Measures of central tendency are often loosely called averages.

They give us an idea of where the values concentrate in the central part of the distribution.

The following are the measures of central tendency that are in common use:

  1. Mean
  2. Median
  3. Mode

1. Mean (Average): It is the sum of the values divided by the total number of items in the set. Example: The mean of {1,2,3,4} is (1+2+3+4)/4 = 2.5.

2. Median: It is the middle value of a set of data when it is arranged in order. With an even number of values, it is the average of the two middle values. Example: The median of {1,2,3,4} is (2+3)/2 = 2.5.

3. Mode: It is the value that appears most frequently in a set of data. Example: The mode of {1,2,3,4,4} is 4.
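The three small examples above can be checked with Python's built-in `statistics` module. This is just one convenient way to do it; the variable names are my own.

```python
import statistics

values = [1, 2, 3, 4]

mean_value = statistics.mean(values)           # (1 + 2 + 3 + 4) / 4
median_value = statistics.median(values)       # average of the two middle values, 2 and 3
mode_value = statistics.mode([1, 2, 3, 4, 4])  # 4 is the only value that appears twice

print(mean_value, median_value, mode_value)    # 2.5 2.5 4
```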

How outliers affect the mean

Suppose you are conducting a study on the average income of a neighborhood. You collect data from ten individuals, and their incomes are as follows (in dollars): $30,000, $40,000, $45,000, $50,000, $55,000, $60,000, $65,000, $70,000, $75,000, and $10,000,000.

In this case, the majority of the incomes fall within a reasonable range, reflecting the typical earnings of individuals in the neighborhood. However, there is one outlier with an income of $10,000,000, which is significantly higher than the rest.

If we calculate the mean income by summing up all the incomes and dividing by the number of individuals (10), we get:

Mean = ($30,000 + $40,000 + $45,000 + $50,000 + $55,000 + $60,000 + $65,000 + $70,000 + $75,000 + $10,000,000) / 10 = $10,490,000 / 10 = $1,049,000.

As you can see, the mean income of $1,049,000 is heavily influenced by the outlier of $10,000,000. The presence of this extreme value skews the mean upward, making it appear much higher than the typical income of the neighborhood. This could lead to an inaccurate representation of the average income in the area.

In this scenario, the mean fails to capture the central tendency of the data because of the outlier. A more robust measure such as the median, which here is ($55,000 + $60,000) / 2 = $57,500, gives a far better picture of a typical income.
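Here is a short Python sketch of the same income example, using the built-in `statistics` module. The $1,000,000 cutoff used to drop the outlier is an arbitrary threshold chosen only for this illustration.

```python
import statistics

incomes = [30_000, 40_000, 45_000, 50_000, 55_000,
           60_000, 65_000, 70_000, 75_000, 10_000_000]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

# Drop the single extreme value to see how far it pulled the mean.
# The 1,000,000 cutoff is an arbitrary threshold for this illustration.
typical_incomes = [x for x in incomes if x < 1_000_000]
mean_without_outlier = statistics.mean(typical_incomes)

print(f"Mean with outlier:    ${mean_income:,.0f}")           # $1,049,000
print(f"Median with outlier:  ${median_income:,.0f}")         # $57,500
print(f"Mean without outlier: ${mean_without_outlier:,.0f}")  # $54,444
```

The median barely notices the $10,000,000 income, while the mean jumps from roughly $54,000 to over a million dollars.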

Median:

The median represents the central value in a dataset when the data is arranged in ascending or descending order. It is less influenced by outliers or skewed data. To calculate the median, let’s consider the following data:

65 55 89 56 35 14 56 55 87 45 92

First, we need to rearrange the data in order of magnitude (from smallest to largest):

14 35 45 55 55 56 56 65 87 89 92

The median is the middle value, which in this case is 56: there are 5 scores before it and 5 scores after it. This method works well when there is an odd number of scores. However, if there is an even number of scores, we need to take the average of the two middle scores. For example:

65 55 89 56 35 14 56 55 87 45

Again, we rearrange the data in order of magnitude:

14 35 45 55 55 56 56 65 87 89

Now, we need to average the 5th and 6th scores in our dataset (55 and 56), resulting in a median of (55 + 56) / 2 = 55.5.
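The sort-then-pick-the-middle procedure can be sketched in a few lines of Python. The `median` function below is a hand-written illustration of the steps described here; in practice `statistics.median` from the standard library does the same job.

```python
def median(values):
    """Return the middle value of a dataset, averaging the two middle values if needed."""
    ordered = sorted(values)  # arrange from smallest to largest
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd count: exactly one middle value.
        return ordered[mid]
    # Even count: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_scores = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]
even_scores = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45]

print(median(odd_scores))   # 56
print(median(even_scores))  # 55.5
```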

Mode:

The mode in descriptive statistics refers to the value or values that occur most frequently in a dataset. It represents the highest peak(s) or the most common observation(s) in the data. Here’s an explanation of the mode with an example:

Consider a dataset representing the ages of participants in a marathon race: 28, 35, 32, 45, 35, 28, 41, 28, 35, 37.

In this dataset, the ages 28 and 35 each appear three times, while every other age appears only once. Because 28 and 35 share the highest frequency, both are modes of the dataset. A dataset with two modes like this one is called bimodal.

The mode provides valuable information about the most common or popular value(s) within a numeric dataset.
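Counting frequencies by hand gets tedious, so here is a minimal Python sketch that finds every mode of the marathon ages with `collections.Counter`. The variable names are my own choices.

```python
from collections import Counter

ages = [28, 35, 32, 45, 35, 28, 41, 28, 35, 37]

counts = Counter(ages)                   # how many times each age occurs
highest_count = max(counts.values())     # the top frequency in the data

# Keep every age that reaches the top frequency, so ties are all reported.
modes = sorted(age for age, count in counts.items() if count == highest_count)

print(modes)          # [28, 35]
print(highest_count)  # 3
```

On Python 3.8 or later, `statistics.multimode(ages)` is a ready-made alternative that also returns all tied modes.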

Variance and Standard Deviation

Variance and standard deviation are statistical measures that quantify the dispersion or spread of data points around the mean. They provide insights into the variability or deviation from the average value. Here’s a brief explanation of variance and standard deviation, along with an example and how to calculate them:

Variance:

Variance is the average of the squared differences between each data point and the mean. In other words, it measures the average squared distance of the data points from the mean, indicating how much they vary around the average value. A higher variance indicates greater dispersion.

Example: Let’s consider a dataset representing the daily temperatures in Celsius for a week: 20, 22, 23, 19, 25, 21, 24. To calculate the variance:

  1. Find the mean: (20 + 22 + 23 + 19 + 25 + 21 + 24) / 7 = 22.
  2. Subtract the mean from each data point: (20 - 22), (22 - 22), (23 - 22), (19 - 22), (25 - 22), (21 - 22), (24 - 22).
  3. Square each difference: (-2)², (0)², (1)², (-3)², (3)², (-1)², (2)².
  4. Calculate the average of the squared differences: (4 + 0 + 1 + 9 + 9 + 1 + 4) / 7 = 28 / 7 = 4.

Therefore, the variance of the temperature dataset is 4. Because the sum is divided by the number of data points (7), this is the population variance.

Standard Deviation:

The standard deviation is the square root of the variance. It measures the typical deviation from the mean, and it is easier to interpret than the variance because it is expressed in the same unit as the original data.

To calculate the standard deviation, take the square root of the variance calculated in the previous example: √4 = 2. Thus, the standard deviation of the temperature dataset is 2 degrees Celsius.

Both variance and standard deviation are commonly used in statistics to quantify the spread or variability of data. They help in understanding the dispersion of data points around the mean and provide insights into the overall distribution of the dataset.
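The variance steps for the temperature data can be followed line by line in Python. This sketch assumes the population formulas (dividing by the number of data points), matching the hand calculation; the variable names are my own.

```python
import math
import statistics

temperatures = [20, 22, 23, 19, 25, 21, 24]

# Step 1: find the mean.
mean_temp = sum(temperatures) / len(temperatures)

# Step 2: subtract the mean from each data point.
deviations = [t - mean_temp for t in temperatures]

# Step 3: square each difference.
squared_deviations = [d ** 2 for d in deviations]

# Step 4: average the squared differences (population variance).
variance = sum(squared_deviations) / len(temperatures)

# The standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)

print(mean_temp, variance, std_dev)  # 22.0 4.0 2.0

# The standard library's population functions should agree with the manual steps.
library_variance = statistics.pvariance(temperatures)
library_std_dev = statistics.pstdev(temperatures)

# The sample variance divides by n - 1 instead, so it comes out a little larger.
sample_variance = statistics.variance(temperatures)  # 28 / 6
```

If your data is a sample drawn from a larger population, the sample versions (`statistics.variance` and `statistics.stdev`) are usually the ones to reach for.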

As the diagram above shows, when the variance is large, the data points lie far from the mean and the spread is wide. When the variance is small, the data points sit close to the mean and the spread is narrow.

How do we differentiate two datasets that have the same mean?

Here’s an example that illustrates a scenario where two datasets have the same mean but different levels of dispersion:

Consider two datasets representing the number of goals scored by a football team in two different seasons:

Dataset A: 3, 3, 3, 3, 3, 3, 3

Dataset B: 0, 1, 2, 3, 4, 5, 6

Both datasets have a mean of 3 goals per game. However, the dispersion, or spread, of the data is different between the two datasets.

To measure dispersion, we can calculate the variance for each dataset.

For Dataset A:

  1. Calculate the mean: (3 + 3 + 3 + 3 + 3 + 3 + 3) / 7 = 21 / 7 = 3
  2. Calculate the squared differences from the mean: (3 - 3)² for each of the seven values, which is 0 every time
  3. Calculate the average of the squared differences: (0 + 0 + 0 + 0 + 0 + 0 + 0) / 7 = 0

For Dataset B:

  1. Calculate the mean: (0 + 1 + 2 + 3 + 4 + 5 + 6) / 7 = 21 / 7 = 3
  2. Calculate the squared differences from the mean: (0 - 3)², (1 - 3)², (2 - 3)², (3 - 3)², (4 - 3)², (5 - 3)², (6 - 3)²
  3. Calculate the average of the squared differences: (9 + 4 + 1 + 0 + 1 + 4 + 9) / 7 = 28 / 7 = 4

Even though both datasets have the same mean of 3, Dataset B has a larger variance (4) than Dataset A (0). This indicates that the data points in Dataset B are more spread out from the mean, while Dataset A has no dispersion at all.

This example demonstrates that relying solely on the mean as a measure of central tendency can be misleading, as datasets with the same mean can exhibit different levels of dispersion. It highlights the importance of considering measures of dispersion, such as variance, to fully understand the distribution and variability of the data.

If we sketch the distribution of each dataset, Dataset A, where every value is 3, collapses to a single sharp spike at the mean, because there is no variation in the data.

Dataset B, which consists of the values 0, 1, 2, 3, 4, 5, and 6, covers a much wider range. A normal curve fitted to it would be a broad bell shape centered on the mean of 3, and its larger spread reflects the greater variability in the dataset.

In summary, Dataset A shows a narrow, concentrated distribution, while Dataset B shows a broader, more spread-out one, even though their means are identical.
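To see the idea in code, here is a minimal sketch that assumes two illustrative lists sharing a mean of 3: one with no variation at all and one running from 0 to 6. The names `steady_team` and `varied_team` are hypothetical labels chosen for this example.

```python
import statistics

# Two illustrative datasets picked so that both average exactly 3.
steady_team = [3, 3, 3, 3, 3, 3, 3]  # identical score every game
varied_team = [0, 1, 2, 3, 4, 5, 6]  # scores range widely

summary = {}
for name, goals in (("steady_team", steady_team), ("varied_team", varied_team)):
    summary[name] = {
        "mean": statistics.mean(goals),
        "variance": statistics.pvariance(goals),  # population variance
        "std_dev": statistics.pstdev(goals),      # population standard deviation
    }
    print(name, summary[name])
```

The means match, so only the variance and standard deviation reveal that the two seasons looked nothing alike.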

Connect with me on LinkedIn.

Thank you for taking the time to read the article. I appreciate your attention and feedback.

If you found this article helpful, please consider sharing it with others who might benefit from it. Your support in spreading the word is greatly appreciated.
