Statistics for Data Science 101 Series — Descriptive Statistics
Continuing from the previous article in the series, we will take a deep dive into descriptive statistics! What is it? What does it comprise? Let’s find out!
What is descriptive statistics?
Descriptive statistics is a set of computations that help us summarize a data set. The data set can be an entire population or a part of it, called a sample. However, some formulae differ depending on whether the data set is a population or a sample.
Descriptive statistics can be broken down into three major categories:
- Measures of Central Tendency
- Measures of Variability
- Frequency Distribution
1. Measures of Central Tendency
Measures of central tendency identify the point around which the values in a data set are distributed. In simple terms, they describe the center or middle of the data set.
The central tendency of a dataset can be measured using Mean, Median and Mode.
Mean
Mean is the average of a data set. It is calculated as the sum of all the values in the data set divided by the number of values. Other types of mean include the geometric mean, weighted mean and harmonic mean; however, we will not go into those in this article.
Note: The mean is the preferred measure of central tendency for symmetric distributions, where it sits at the center of the data. For skewed distributions, extreme values pull the mean away from the center.
The formula for the mean of n values x_1, ..., x_n is
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
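As a minimal sketch of the arithmetic mean described in this section, the Python snippet below sums the values and divides by their count, then compares the result with the standard library's `statistics.mean`. The data values and the helper name `mean` are made up for illustration.

```python
import statistics

# Hypothetical data, chosen only for illustration.
values = [4, 8, 15, 16, 23, 42]

def mean(data):
    """Arithmetic mean: sum of the values divided by how many there are."""
    if not data:
        raise ValueError("mean of an empty data set is undefined")
    return sum(data) / len(data)

manual = mean(values)               # 108 / 6
library = statistics.mean(values)   # same computation from the standard library

print(manual, library)
```

In practice the library function is usually the better choice; the hand-written version is only there to mirror the definition.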
Median
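Median is the middle value in a data set that is sorted in ascending or descending order. However, if the data set contains an even number of values, the median is taken as the average of the two middle values.
Formula for median if the number of values n in the sorted data set is odd
$$\text{Median} = x_{\left(\frac{n+1}{2}\right)}$$
Formula for median if the number of values n in the sorted data set is even
$$\text{Median} = \frac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}}{2}$$
Here x_{(k)} denotes the k-th value of the sorted data set.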
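The odd and even cases of the median can be sketched in a few lines of Python. The helper name `median` and the two small data sets are illustrative assumptions; `statistics.median` from the standard library applies the same rule and is used here as a cross-check.

```python
import statistics

def median(data):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(data)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: mean of the two middle values

# Hypothetical data sets, chosen only for illustration.
odd_values = [7, 1, 5, 3, 9]   # sorted: 1 3 5 7 9
even_values = [7, 1, 5, 3]     # sorted: 1 3 5 7

print(median(odd_values), statistics.median(odd_values))
print(median(even_values), statistics.median(even_values))
```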
Mode
The mode of a data set is the value that occurs most frequently in it. It is important to note that a data set may contain multiple modes or no mode at all.
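For ungrouped data the mode has no algebraic formula: it is found by counting how often each value occurs and picking the value with the highest count.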
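A small Python sketch can make the "multiple modes or no mode" cases concrete. The helper `modes` below is hypothetical, and it assumes one common convention: if every value occurs exactly once, the data set is treated as having no mode. Other conventions exist; for example, `statistics.multimode` returns every value in that situation.

```python
from collections import Counter

def modes(data):
    """Return all most-frequent values, or [] when every value occurs only once."""
    counts = Counter(data)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1 and len(counts) > 1:
        return []  # assumption: all values unique means "no mode"
    return [value for value, count in counts.items() if count == highest]

# Hypothetical data sets, chosen only for illustration.
print(modes([1, 2, 2, 3, 3, 4]))        # two values tie for the highest count
print(modes(["red", "blue", "red"]))    # works for categorical data too
print(modes([1, 2, 3]))                 # every value occurs once
```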
Note:
For a symmetrical distribution, all measures of central tendency yield good results, but the mean is preferred because it takes every value in the data set into account.
For a skewed data set, the median yields the best results. For categorical data, the mode yields the best results.
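To see why the median is preferred for skewed data, the short Python sketch below compares the mean and median of two made-up data sets that differ only in their largest value. The numbers are illustrative assumptions, not real measurements.

```python
import statistics

# Hypothetical data sets: identical except for one extreme value.
symmetric = [10, 20, 30, 40, 50]
skewed = [10, 20, 30, 40, 500]

symmetric_mean = statistics.mean(symmetric)
symmetric_median = statistics.median(symmetric)
skewed_mean = statistics.mean(skewed)
skewed_median = statistics.median(skewed)

# In the symmetric case both measures agree; in the skewed case
# the single extreme value drags the mean upward, the median stays put.
print(symmetric_mean, symmetric_median)
print(skewed_mean, skewed_median)
```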
2. Measures of Variability
Just as central tendency measures the point around which the data is distributed, measures of variability calculate the dispersion of the data points around the mean. In other words, they show us how spread out the points are from the mean of the data set.
For example, the mean of a data set may sit in the middle even though the data points lie at both extremes. Measures of variability show us how spread out they are.
Measures of variability are calculated using Standard Deviation, Variance and Range.
Standard Deviation
Standard deviation measures how far the data points typically lie from the mean of the data set. It is the square root of the variance. A smaller standard deviation indicates less variability, while a larger one indicates greater variability.
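The formula for the standard deviation of n values x_1, ..., x_n with mean \bar{x} is
$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
This is the population form. When the data set is a sample, the sum is divided by n - 1 instead of n.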
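The following Python sketch computes the standard deviation by hand and checks it against the standard library. The helper `std_dev`, its `sample` flag and the data values are assumptions made for illustration; `statistics.pstdev` is the population version (divide by n) and `statistics.stdev` is the sample version (divide by n - 1).

```python
import math
import statistics

# Hypothetical data, chosen so the arithmetic is easy to follow by hand.
values = [2, 4, 4, 4, 5, 5, 7, 9]

def std_dev(data, sample=False):
    """Square root of the averaged squared deviations from the mean.

    sample=False divides by n (population); sample=True divides by n - 1.
    """
    n = len(data)
    if n < (2 if sample else 1):
        raise ValueError("not enough data points")
    mean = sum(data) / n
    squared = sum((x - mean) ** 2 for x in data)
    return math.sqrt(squared / (n - 1 if sample else n))

population_sd = std_dev(values)            # divides by n
sample_sd = std_dev(values, sample=True)   # divides by n - 1

print(population_sd, statistics.pstdev(values))
print(sample_sd, statistics.stdev(values))
```

The sample value is always a little larger than the population value for the same data, because the sum of squared deviations is divided by a smaller number.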
Variance
Variance is the average of the squared distances from the mean of the data set.
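The formula for the variance of n values x_1, ..., x_n with mean \bar{x} is
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
This is the population form. When the data set is a sample, the sum is divided by n - 1 instead of n.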
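As a rough illustration of variance as the average squared distance from the mean, the Python sketch below computes it step by step. The helper `variance`, its `sample` flag and the data are hypothetical; `statistics.pvariance` and `statistics.variance` are the standard library's population and sample versions.

```python
import math
import statistics

# Hypothetical data, chosen only for illustration.
values = [2, 4, 4, 4, 5, 5, 7, 9]

def variance(data, sample=False):
    """Average of the squared deviations from the mean.

    sample=False divides by n (population); sample=True divides by n - 1.
    """
    n = len(data)
    if n < (2 if sample else 1):
        raise ValueError("not enough data points")
    mean = sum(data) / n
    squared_deviations = [(x - mean) ** 2 for x in data]
    return sum(squared_deviations) / (n - 1 if sample else n)

population_var = variance(values)            # 32 / 8
sample_var = variance(values, sample=True)   # 32 / 7

print(population_var, statistics.pvariance(values))
print(sample_var, statistics.variance(values))
```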
Range
Range is the simplest measure of variability. It tells us the spread between the lowest and the highest value of the data set. To calculate the range, we subtract the lowest value from the highest value of the data set.
The formula for range is
$$\text{Range} = x_{\max} - x_{\min}$$
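The range calculation is short enough to sketch directly in Python. The helper name `value_range` and the data are illustrative assumptions.

```python
def value_range(data):
    """Range: highest value minus lowest value."""
    if not data:
        raise ValueError("range of an empty data set is undefined")
    return max(data) - min(data)

# Hypothetical data, chosen only for illustration.
values = [12, 7, 3, 18, 9]

spread = value_range(values)  # 18 - 3
print(spread)
```

Because the range depends only on the two extreme values, a single outlier can change it dramatically, which is one reason it is usually reported alongside the standard deviation or variance.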
3. Frequency Distribution
Frequency distribution shows us the number of times each data point occurs in a data set. This information is presented either in a table format or in a graph format.
Example of a frequency distribution table
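As a stand-in illustration, a frequency distribution table can be built in Python with `collections.Counter`. The ratings below are made-up values, and the relative-frequency column is an optional extra not discussed in this article.

```python
from collections import Counter

# Hypothetical ratings on a 1-5 scale, chosen only for illustration.
scores = [3, 5, 4, 3, 5, 5, 2, 4, 5, 3]

frequency = Counter(scores)   # maps each distinct value to its count
total = len(scores)

# Print a simple text table, one row per distinct value in ascending order.
print(f"{'Value':>5} {'Frequency':>9} {'Relative':>8}")
for value in sorted(frequency):
    count = frequency[value]
    print(f"{value:>5} {count:>9} {count / total:>8.2f}")
```

The same counts could be shown in graph form as a bar chart, with the values on the horizontal axis and the frequencies on the vertical axis.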
Conclusion
To conclude, descriptive statistics give us a summary or a “description” of the data set we are dealing with. This summary is vital in helping us understand the nature of the data at hand and treat the data accordingly for future model training.
Hope you liked my article! Do share your thoughts and inputs, and I’ll try to answer them to the best of my knowledge in future articles!
Happy Learning!
Check out my other articles on Blockchain and Machine Learning/Deep Learning. Let me know about any other topics to cover in the future!