Statistics for Data Science 101 Series — Descriptive Statistics

Adith Narasimhan Kumar
Published in Analytics Vidhya
5 min read · Jul 28, 2024

Continuing from the previous article in the series, we will take a deep dive into descriptive statistics! What is it? What does it comprise? Let’s find out!

What is descriptive statistics?

Descriptive statistics is a set of computations that help us summarize a data set. This data set can either be an entire population or a part of it, called a sample. However, the formulae might differ depending on whether the data set is a population or a sample.

Descriptive statistics can be broken down into three major categories:

  1. Measures of Central Tendency
  2. Measures of Variability
  3. Frequency Distribution

1. Measures of Central Tendency

Measures of central tendency identify the value around which the data in a data set is distributed. In simple terms, they describe the center, or middle point, of the data set.

The central tendency of a dataset can be measured using Mean, Median and Mode.

Mean

The mean is the average of a data set. It is calculated as the sum of all the values in the data set divided by the number of values. Other types of mean include the geometric mean, weighted mean and harmonic mean; however, we will not go into those in this article.

Note: The mean is the preferred measure of central tendency for symmetric distributions, where it lies at the center of the data. For skewed distributions, the mean is pulled away from the center, toward the longer tail.

The formula for mean is

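$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$

where $x_1, \dots, x_N$ are the values in the data set and $N$ is the number of values.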
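
As a quick illustration, here is a minimal Python sketch that computes the mean both by hand and with the standard library's `statistics.mean`. The `values` list is made-up data used only for this example.

```python
from statistics import mean

# Hypothetical values, chosen only for illustration
values = [4, 8, 15, 16, 23, 42]

# Mean = sum of all values / number of values
manual_mean = sum(values) / len(values)

# The standard library gives the same result
library_mean = mean(values)

print(manual_mean)  # 18.0
```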

Median

The median is the middle value of a data set that has been sorted in ascending or descending order. If the data set contains an even number of values, the median is the average of the two middle values.

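Formula for the median if the number of values $N$ in the data set is odd, where $x_{(k)}$ denotes the $k$-th value of the sorted data:

$$\text{Median} = x_{\left(\frac{N+1}{2}\right)}$$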

Credits: GeeksForGeeks

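Formula for the median if the number of values $N$ in the data set is even, where $x_{(k)}$ denotes the $k$-th value of the sorted data:

$$\text{Median} = \frac{x_{\left(\frac{N}{2}\right)} + x_{\left(\frac{N}{2}+1\right)}}{2}$$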

Credits: GeeksForGeeks
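
The odd and even cases can be sketched in a few lines of Python. The helper `manual_median` and both data lists are hypothetical names and values introduced for this example; in practice `statistics.median` from the standard library does the same job.

```python
from statistics import median

def manual_median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value
        return ordered[mid]
    # Even count: average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_data = [7, 1, 5, 3, 9]        # hypothetical values
even_data = [7, 1, 5, 3, 9, 11]   # hypothetical values

print(manual_median(odd_data))    # 5
print(manual_median(even_data))   # 6.0
```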

Mode

The mode of a data set is its most frequently occurring value. It is important to note that a data set may contain multiple modes or no mode at all.

Formula for mode is

Credits: Byjus
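
For a plain list of values, the mode can be found by counting occurrences. The sketch below uses `statistics.multimode` (available in Python 3.8 and later), which returns every value tied for the highest count. The three lists are made-up examples.

```python
from statistics import multimode

# Hypothetical data sets, chosen only for illustration
one_mode = ["red", "blue", "red", "green"]
two_modes = [1, 2, 2, 3, 3]
all_unique = [1, 2, 3]

single = multimode(one_mode)
double = multimode(two_modes)
none_repeated = multimode(all_unique)

print(single)         # ['red']
print(double)         # [2, 3]
print(none_repeated)  # [1, 2, 3]
```

When every value occurs exactly once, `multimode` returns all of them; by the usual convention such a data set is described as having no mode.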

Note:

For a symmetrical distribution, all measures of central tendency yield good results, but the mean is preferred because it takes every value in the data set into account.

For a skewed data set, the median yields the best results, while the mode is best suited to categorical data.
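
To see why the median is preferred for skewed data, consider this small sketch. The `salaries` list is invented for illustration and contains one extreme value.

```python
from statistics import mean, median

# Hypothetical salaries (in thousands); the last value is an extreme outlier
salaries = [30, 32, 35, 36, 40, 400]

salary_mean = mean(salaries)
salary_median = median(salaries)

print(salary_mean)    # 95.5
print(salary_median)  # 35.5
```

The single outlier drags the mean far above what most people in the list earn, while the median stays close to the bulk of the data.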

2. Measures of Variability

Just as central tendency measures the point around which the data is distributed, measures of variability quantify the dispersion of the data points around the mean. In other words, they show us how spread out the points are from the mean of the data set.

For example, the average of a data set might sit in the middle even though most of the data points lie at the two extremes. Measures of variability show us how spread out they are.

Variability is commonly measured using Standard Deviation, Variance and Range.

Standard Deviation

Standard deviation measures how far the data points typically lie from the mean of the data set. It is calculated as the square root of the variance. A smaller standard deviation suggests less variability, while a larger one indicates greater variability.

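The formula for the standard deviation of a population is

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$

where $\mu$ is the mean of the data set and $N$ is the number of values. For a sample, the sum is divided by $N - 1$ instead of $N$.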

Credits: Khan Academy
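
The difference between the population and sample versions can be sketched as follows. The `data` list is made up for illustration; `statistics.pstdev` and `statistics.stdev` are the standard library's population and sample standard deviation functions.

```python
import math
from statistics import pstdev, stdev

# Hypothetical values, chosen only for illustration
data = [2, 4, 4, 4, 5, 5, 7, 9]

mu = sum(data) / len(data)  # mean of the data: 5.0
squared_deviations = [(x - mu) ** 2 for x in data]

# Population: divide by N. Sample: divide by N - 1.
population_sd = math.sqrt(sum(squared_deviations) / len(data))
sample_sd = math.sqrt(sum(squared_deviations) / (len(data) - 1))

print(population_sd)  # 2.0
print(sample_sd > population_sd)  # True
```

The manual results should agree with `pstdev(data)` and `stdev(data)` respectively, up to floating-point rounding.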

Variance

Variance is the average of the squared distances from the mean of the data set.

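The formula for the variance of a population is

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$

where $\mu$ is the mean of the data set and $N$ is the number of values. For a sample, the sum is divided by $N - 1$ instead of $N$.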

Credits: Byjus
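
Here is a short sketch of the variance calculation, step by step. The `data` list is invented for this example; `statistics.pvariance` and `statistics.variance` are the standard library's population and sample variance functions.

```python
from statistics import pvariance, variance

# Hypothetical values, chosen only for illustration
data = [1, 2, 3, 4, 5]

mu = sum(data) / len(data)  # mean of the data: 3.0

# Average of the squared distances from the mean
population_variance = sum((x - mu) ** 2 for x in data) / len(data)

library_population = pvariance(data)  # divides by N
library_sample = variance(data)       # divides by N - 1

print(population_variance)  # 2.0
```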

Range

Range is the simplest measure of variability. It tells us the spread of the data between the lowest and the highest value of the data set. To calculate the range, we subtract the lowest value from the highest value of the data set.

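The formula for range is

$$\text{Range} = x_{\max} - x_{\min}$$

where $x_{\max}$ and $x_{\min}$ are the highest and lowest values in the data set.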

Credits: Byjus
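
In Python, the range of a data set takes one line with the built-in `max` and `min`. The `temperatures` list is made-up data for illustration.

```python
# Hypothetical daily temperatures, chosen only for illustration
temperatures = [18, 21, 25, 19, 30, 22]

# Range = highest value - lowest value
data_range = max(temperatures) - min(temperatures)

print(data_range)  # 12
```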

3. Frequency Distribution

A frequency distribution shows us the number of times each value occurs in a data set. This information is presented either as a table or as a graph.

Example of a frequency distribution table

Credits: GeeksForGeeks
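
A frequency table can be built in Python with `collections.Counter`. This is only a sketch; the `grades` list is made-up data.

```python
from collections import Counter

# Hypothetical exam grades, chosen only for illustration
grades = ["B", "A", "C", "B", "A", "B"]

# Count how many times each grade occurs
frequency = Counter(grades)

# Print a simple two-column frequency table, sorted by grade
print("Grade\tCount")
for grade, count in sorted(frequency.items()):
    print(f"{grade}\t{count}")
```

For this list the table shows A twice, B three times and C once.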

Conclusion

To conclude, descriptive statistics give us a summary, or a “description”, of the data set we are dealing with. This summary is vital in helping us understand the nature of the data at hand and to treat the data accordingly for future model training.

Hope you liked my article! Do share your thoughts and inputs, and I’ll try to address them to the best of my knowledge in future articles!

Happy Learning!

Check out my other articles on Blockchain and Machine Learning/Deep Learning. Let me know about any other topics to cover in the future!
