Descriptive Statistics Concepts Required for Data Science

Rina Mondal
8 min read · Jan 4, 2024


Descriptive statistics are methods used to summarize and describe the main features of a collection of data in a straightforward way. They help us understand and interpret the data without making any assumptions or inferences beyond the data at hand.

Common measures of descriptive statistics include:


1. Measures of Central Tendency:

Measures of central tendency are statistical metrics used to identify the central point or typical value within a set of data. They provide a summary of the data and give an indication of where most values in a distribution fall.

i. Mean (Average): The sum of all values divided by the number of values.

Suppose you have the following list representing the scores of five students
in a mathematics test:

import numpy as np

x = [85, 90, 88, 92, 87]
print(np.mean(x))  # using the numpy function

#O/t- 88.4 (average of all values in the sample)

Two other types of means are: A. Weighted Mean B. Trimmed Mean

A. Weighted Mean: The weighted mean is a measure of central tendency that takes into account the weights assigned to each data point. It is calculated by multiplying each data point by its corresponding weight, summing up these products, and then dividing by the sum of the weights.

It is useful when some values contribute more to the overall average than others.

# Example data and weights
data = [2, 3, 4, 5]
weights = [0.1, 0.2, 0.3, 0.4]

# Calculate the weighted mean
weighted_mean = sum(x * w for x, w in zip(data, weights)) / sum(weights)

# Print the result
print("Weighted Mean:", weighted_mean)

#O/t- Weighted Mean: 4.0

# By numpy function
import numpy as np

# Calculate the weighted mean
weighted_mean = np.average(data, weights=weights)
print("Weighted mean:", weighted_mean)

#O/t- Weighted mean: 4.0

B. Trimmed Mean: A measure of central tendency that involves removing a certain percentage of the smallest and largest values in a dataset and then calculating the mean of the remaining values.

from scipy.stats import trim_mean
import numpy as np

# Example dataset
data = np.array([2, 3, 4, 5, 2, 6, 4, 8, 2, 6, 4, 8, 8, 8])

# Specify the percentage to trim from both ends (e.g., 10%)
trim_percentage = 0.10

# Calculate the trimmed mean
trimmed_mean = trim_mean(data, proportiontocut=trim_percentage)

# Print the result
print("Trimmed Mean:", trimmed_mean)

#O/t- Trimmed Mean: 5.0

ii. Median: The middle value in a dataset when it is ordered. It separates the higher half from the lower half.

import numpy as np

x = [85, 90, 78, 92, 87]
print(np.median(x))

#O/t- 87.0 (numpy sorts the data first, then picks the middle value)

iii. Mode: The value that occurs most frequently in a dataset.

from scipy.stats import mode
import numpy as np

x = np.array([85, 78, 78, 92, 87])
print(mode(x))

#O/t- ModeResult(mode=78, count=2)

2. Measures of Dispersion (Variability):

Measures of dispersion, also known as measures of variability or spread, describe how much the data values in a dataset differ from each other and from the central tendency (mean, median, mode).
i. Range:
The difference between the maximum and minimum values in a dataset.

The formula for the range (R) is:
R=Maximum Value − Minimum Value

Ex: {75,82,88,92,67,94,80,85}
To calculate the range:
Find the maximum value: 94
Find the minimum value: 67
R=94−67=27
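The same calculation can be sketched in numpy, using the exam scores from the example above:

```python
import numpy as np

# Exam scores from the example above
data = np.array([75, 82, 88, 92, 67, 94, 80, 85])

# Range = maximum value - minimum value
data_range = np.max(data) - np.min(data)
print("Range:", data_range)  # Range: 27
```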

ii. Variance: Variance is a measure of how spread out a set of values is in a dataset. It quantifies the degree to which each number in a dataset differs from the mean (average) of the dataset.

import numpy as np

# Example dataset
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

# Calculate the mean
mean_value = np.mean(data)

# Calculate the squared differences from the mean
squared_diff = (data - mean_value) ** 2

# Calculate the variance
variance = np.sum(squared_diff) / len(data)

print("Variance:", variance)

#O/t- Variance: 4.0
#### Numpy provides a single line of code to calculate variance ####

import numpy as np

# Example dataset
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

# Calculate the variance
variance = np.var(data)

print("Variance:", variance)

#O/t- Variance: 4.0

iii. Standard Deviation: The square root of the variance.

import numpy as np

# Example dataset
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

# Calculate the standard deviation
std_dev = np.std(data)

print("Standard Deviation:", std_dev)

#O/t- Standard Deviation: 2.0

3. Measures of Shape:

Measures of shape in statistics refer to numerical values that describe the distribution’s symmetry, peakedness, and flatness.

i. Skewness: A measure of the asymmetry of a distribution of data. It indicates the extent and direction of skew (departure from horizontal symmetry) in a dataset.

  • Positively skewed (right-skewed): The right tail (higher values) is longer or fatter than the left tail. The mean is typically greater than the median.
  • Negatively skewed (left-skewed): The left tail (lower values) is longer or fatter than the right tail. The mean is typically less than the median.
  • Symmetric: The tails on both sides of the mean are mirror images. The mean and median are roughly equal.
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import skew

# Example skewed dataset
data_skewed = np.random.exponential(scale=2, size=1000)

# Create a kernel density plot (PDF) for the skewed data
sns.kdeplot(data_skewed)

# Add labels and title
plt.xlabel('Values')
plt.ylabel('Probability Density')
plt.title('Skewed Probability Density Function (PDF)')

# Show the plot
plt.show()

# To measure the amount of skewness
skewness = skew(data_skewed)
print("Skewness:", skewness)

#Skewness: ~2 (positive; the exact value varies with the random sample)

Here, we can see the data is skewed and we have also measured the amount of skewness.

ii. Kurtosis: A measure of the “tailedness” or sharpness of the peak of a distribution. It describes a dataset’s tails and the shape of its peak relative to the normal distribution.

  • If it has high kurtosis, it means the graph has a very tall peak and the tails (the parts at the far ends) are also relatively tall and heavy. This indicates that there are a lot of extreme values in the data.
  • If it has low kurtosis, it means the graph is more spread out and the peak is lower, with lighter tails. This suggests there are fewer extreme values and the data is more clustered around the average.

So, in simpler terms, kurtosis tells us if a distribution is more like a spike (high kurtosis) or a flatter, more spread-out shape (low kurtosis).

from scipy.stats import kurtosis
import numpy as np

# Example dataset
data = np.array([2, 3, 4, 5, 2, 6, 4, 8, 2, 6, 4, 8, 8, 8])

# Calculate kurtosis
kurtosis_value = kurtosis(data)

print("Kurtosis:", kurtosis_value)

#O/t- Kurtosis: -1.4120370370370372

4. Frequency Distribution:

In descriptive statistics, a frequency distribution is a summary of the frequencies (counts) with which values occur in a dataset. It organizes the data into groups or intervals and shows how many times each value or range of values appears.
i. Histograms: Graphical representations of the distribution of a dataset, displaying the frequencies of different ranges or bins.

import matplotlib.pyplot as plt
import numpy as np

# Example dataset
data = np.array([2, 3, 4, 5, 2, 6, 4, 8, 2, 6, 4, 8, 8, 8])

# Create a histogram
plt.hist(data, bins='auto', edgecolor='black')

# Add labels and title
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.title('Histogram')

# Show the plot
plt.show()

5. Measures of Position

Quantiles: statistical measures used to divide a set of numerical data into equal-sized groups.

  1. Quartile (4)
  2. Deciles (10)
  3. Percentiles (100)
  4. Quintiles (5)

All others can be easily derived using percentiles.
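As a sketch of that idea, deciles and quintiles can be obtained from `np.percentile` simply by passing the appropriate percentage points (the dataset here is the one used in the examples below):

```python
import numpy as np

data = np.array([2, 3, 4, 5, 2, 6, 4, 8, 2, 6, 4, 8, 8, 8])

# Deciles split the data into 10 groups; the cut points are the
# 10th, 20th, ..., 90th percentiles
deciles = np.percentile(data, np.arange(10, 100, 10))

# Quintiles split the data into 5 groups; the cut points are the
# 20th, 40th, 60th, 80th percentiles
quintiles = np.percentile(data, np.arange(20, 100, 20))

print("Deciles:", deciles)
print("Quintiles:", quintiles)
```

Note that the 5th decile and the 50th percentile are both just the median.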

i. Percentiles: The value below which a percentage of data falls. Percentiles show the relative position or rank of a particular value in a dataset.

Ex: Let’s say when a student receives their test score, they also receive a corresponding percentile. If the score falls in the 50th percentile, the score is higher than half (50%) of all test scores. This is useful for comparing results across exams with different scoring scales.

Percentiles are values that divide a dataset into 100 equal parts. The p-th percentile (e.g., p = 50 for the median, p = 25 for the first quartile, p = 75 for the third quartile) is the value below which p percent of the N observations in the dataset fall.

import numpy as np

# Example dataset
data = np.array([2, 3, 4, 5, 2, 6, 4, 8, 2, 6, 4, 8, 8, 8])

# Calculate percentiles
percentile_10 = np.percentile(data, 10)
percentile_25 = np.percentile(data, 25)
percentile_50 = np.percentile(data, 50) # Median
percentile_75 = np.percentile(data, 75)
percentile_90 = np.percentile(data, 90)

# Print the results
print("10th Percentile:", percentile_10)
print("25th Percentile:", percentile_25)
print("50th Percentile (Median):", percentile_50)
print("75th Percentile:", percentile_75)
print("90th Percentile:", percentile_90)

# O/t- 10th Percentile: 2.0
# 25th Percentile: 3.25
# 50th Percentile (Median): 4.5
# 75th Percentile: 7.5
# 90th Percentile: 8.0

ii. Quartiles: Values that divide a dataset into four equal parts. The three quartiles, denoted Q1, Q2 (the median), and Q3, are used to describe the distribution of the data.

25% of the data points are below Q1 and 75% are above it. Q1=25%

50% of the dataset are below Q2 and 50% are above it. Q2=50%

75% of the dataset are below Q3 and 25% are above it. Q3=75%

import numpy as np

# Example dataset
data = np.array([2, 3, 4, 5, 2, 6, 4, 8, 2, 6, 4, 8, 8, 8])

# Calculate quartiles
q1 = np.percentile(data, 25)
q2 = np.percentile(data, 50) # Median
q3 = np.percentile(data, 75)

# Print the results
print("Q1 (25th Percentile):", q1)
print("Q2 (Median - 50th Percentile):", q2)
print("Q3 (75th Percentile):", q3)

# O/t-
# Q1 (25th Percentile): 3.25
# Q2 (Median - 50th Percentile): 4.5
# Q3 (75th Percentile): 7.5

Interquartile Range (IQR): The middle 50 % of the data.

IQR = Q3 − Q1 = 75th percentile − 25th percentile, where the 25th percentile is the lower quartile and the 75th percentile is the upper quartile.
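Building on the quartile code above, the IQR for the same example dataset comes out directly:

```python
import numpy as np

data = np.array([2, 3, 4, 5, 2, 6, 4, 8, 2, 6, 4, 8, 8, 8])

# IQR = Q3 - Q1
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1

print("IQR:", iqr)  # IQR: 4.25
```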

Five-number summary: 1. Minimum value, 2. Q1 (first quartile), 3. Q2 (median), 4. Q3 (third quartile), 5. Maximum value.

A box plot can be used to visualize this five-number summary and to find outliers in a dataset.
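As a sketch, the five-number summary and its box plot for the example dataset can be produced with numpy and matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt

data = np.array([2, 3, 4, 5, 2, 6, 4, 8, 2, 6, 4, 8, 8, 8])

# Five-number summary
minimum = np.min(data)
q1 = np.percentile(data, 25)
median = np.median(data)
q3 = np.percentile(data, 75)
maximum = np.max(data)

print("Min:", minimum, "Q1:", q1, "Median:", median, "Q3:", q3, "Max:", maximum)

# The box plot draws the same summary; points beyond the whiskers
# (1.5 * IQR past the quartiles) appear as outliers
plt.boxplot(data)
plt.title('Box Plot')
plt.show()
```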

6. Summary Tables:

Summary tables are a concise way to present key information from a dataset, providing a structured overview of the data’s main characteristics. These tables typically include various statistics and metrics that summarize different aspects of the data, such as central tendency, dispersion, and frequency distributions.
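One common way to build such a table in one call (a sketch, assuming pandas is available; pandas is not used elsewhere in this article) is `describe()`, which reports count, mean, standard deviation, min, quartiles, and max:

```python
import pandas as pd

data = [2, 3, 4, 5, 2, 6, 4, 8, 2, 6, 4, 8, 8, 8]

# describe() returns count, mean, std, min, 25%/50%/75% percentiles, and max
summary = pd.Series(data).describe()
print(summary)
```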
i. Count: The number of observations in a dataset.

# Example dataset
data = [2, 3, 4, 5, 2, 6, 4, 8, 2, 6, 4, 8, 8, 8]

# Count the number of elements in the dataset
count = len(data)

# Print the result
print("Number of elements:", count)

# O/t- Number of elements: 14

ii. Sum: The total sum of the values in a dataset.

# Example dataset
data = [2, 3, 4, 5, 2, 6, 4, 8, 2, 6, 4, 8, 8, 8]

# Calculate the sum of elements in the dataset
total_sum = sum(data)

# Print the result
print("Sum of elements:", total_sum)

# O/t- Sum of elements: 70

Descriptive statistics provide a concise summary of the main characteristics of a dataset, facilitating a better understanding of its properties and patterns. These statistics are essential for Exploratory Data Analysis and forming the foundation for more advanced statistical analyses.

Explore Data Science Roadmap.

Explore my Channel where I explain Data Science topics.

If you found this guide helpful, why not show some love?

Give it a Clap 👏, and if you have questions or topics you’d like to explore further, drop a comment 💬 below 👇


Rina Mondal

I have 8 years of experience and have always enjoyed writing articles. If you appreciate my hard work, please follow me so I can continue my passion.