What is Descriptive Statistics?

R. Gupta
Published in Geek Culture
4 min read · Nov 11, 2022

Part 2: Statistics Series

Hello, and welcome back to part 2 of the statistics series. In the last article, we looked at the different types of data to which statistics is applied. In this article, we will cover descriptive statistics, which is used to describe, present, and summarize a given dataset so that it is easier to understand.

Three kinds of measures are commonly used in descriptive statistics.

1. Measures of Central Tendency

2. Dispersion or Variability

3. Distribution or Visualization of data

Photo by Justin Morgan on Unsplash

1. Measures of Central Tendency:

Measures of central tendency are used when you need a single value that represents the most typical, most frequently occurring, or central value of your data.

For example, let me ask you a question: How much time does it take you to reach your college or office from home?

It takes me 15 minutes to reach my office from my home.

Let me ask another question: Does it always take the same time, or does it sometimes take less or more than 15 minutes?

Your answer would probably be: No, sometimes it takes more than 15 minutes and sometimes less, depending on traffic, rain, and various other factors.

But when I asked you the first question, you came up with a single duration even though the time varies somewhat from day to day. Why is that? Because you instinctively used a measure of central tendency: the answer you gave is the average time it takes to reach your college or office from home.

The measures of central tendency tell you where most of your data, or the center of your data, is located. The three measures of central tendency are the mean, median, and mode, also known as the 3 M’s.

Mean:

The mean is the arithmetic average of the data. For example, we can calculate the average monthly budget, or the average height, weight, or age in a class. The mean is calculated by summing up all data points and dividing by the number of data points.

Mean = ( sum of data points) / (number of data points)
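The formula above can be sketched in a few lines of Python. The commute times below are made-up illustrative values (they are not from any real dataset), and the function name `mean` is just a label chosen for this sketch.

```python
# Hypothetical commute times in minutes (illustrative values only).
commute_minutes = [14, 15, 16, 15, 13, 17, 15]


def mean(values):
    """Arithmetic mean: sum of data points divided by the number of data points."""
    if not values:
        raise ValueError("mean requires at least one data point")
    return sum(values) / len(values)


average_commute = mean(commute_minutes)
print(average_commute)  # 105 / 7 = 15.0
```

In practice you would probably reach for `statistics.mean` from the standard library, which computes the same quantity.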

The downside of the mean: Because the mean takes all data points into consideration, it is strongly affected by extreme values, known as outliers. If one point in your data is much smaller or much larger than the others, it pulls the mean toward itself, and the mean no longer gives a good estimate of where most of your data lies.

The advantage of the mean: It is a very popular and widely used measure of the center of numerical data. It is also widely used in machine learning algorithms.

Median:

The median is the midpoint of the data. To calculate the median, the data points must first be arranged in ascending order. If the number of data points is odd, the data point in the middle is the median. If the number of data points is even, the median is the average of the two middle data points.
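As a rough sketch, the odd and even cases described above look like this in Python. The sample values are invented for illustration, and the function name `median` is simply what this sketch calls it.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if the count is even."""
    ordered = sorted(values)  # the median is defined on ordered data
    n = len(ordered)
    if n == 0:
        raise ValueError("median requires at least one data point")
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single middle element exists.
        return ordered[mid]
    # Even count: average the two middle elements.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_sample = [17, 13, 15, 14, 16]   # sorted: 13, 14, 15, 16, 17
even_sample = [17, 13, 15, 14]      # sorted: 13, 14, 15, 17

print(median(odd_sample))   # middle value: 15
print(median(even_sample))  # (14 + 15) / 2 = 14.5
```

The standard library's `statistics.median` follows the same rule.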

The advantage of the median: It depends only on the middle of the ordered data rather than on the size of every value, so it is barely affected by extreme values or outliers. This makes it a very robust measure of the central point of the data. It is used only for quantitative data.
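To see the difference in robustness, here is a small sketch that compares the mean and the median before and after adding one extreme value. The commute times, including the 90-minute day, are hypothetical.

```python
import statistics

# Hypothetical commute times in minutes; the 90-minute day is an invented outlier.
typical_days = [14, 15, 16, 15, 13, 17, 15]
with_outlier = typical_days + [90]

summary = {}
for label, sample in (("typical days", typical_days), ("with outlier", with_outlier)):
    summary[label] = (statistics.mean(sample), statistics.median(sample))
    print(label, "mean:", summary[label][0], "median:", summary[label][1])
```

With these numbers, the single outlier moves the mean from 15 to 24.375 minutes, while the median stays at 15.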

The downside of the median: To calculate the median, the data points must be arranged in ascending order, and this sorting adds computational overhead for larger datasets. For this reason, it is used less often than the mean in machine learning algorithms.

Mode:

The mode is the most repeated value, i.e. the value with the highest frequency in your data. The mean and median are calculated for quantitative data, while the mode can be calculated for qualitative as well as quantitative data. For example, the question "Which product sells more than any other?" can be answered by counting how often each product is ordered and then selecting the product with the highest frequency. Note that more than one product can share the highest frequency, in which case there is more than one mode.

The advantage of the mode: It is the only measure of central tendency that can be used for qualitative data.

The downside of the mode: There can be more than one mode in your dataset, so it does not always point to a single center. It is not commonly used in machine learning algorithms.
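The product example from this section can be sketched with a frequency count. The order log below is invented, and `modes` is a hypothetical helper name; it returns a list because, as noted above, several values can tie for the highest frequency.

```python
from collections import Counter

# Hypothetical order log: one entry per product sold (qualitative data).
orders = ["laptop", "phone", "phone", "tablet", "laptop", "headphones"]


def modes(values):
    """Return every value that shares the highest frequency."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    # Keep all values tied for the top count, in first-seen order.
    return [value for value, count in counts.items() if count == highest]


print(modes(orders))  # "laptop" and "phone" both appear twice
```

On Python 3.8 and later, `statistics.multimode` offers the same behavior out of the box.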

Note: Since the mean is the measure most affected by outliers, we can calculate a trimmed mean instead. It is called a trimmed mean because you trim a fixed number of data points from both ends of the sorted data to remove outliers before averaging. The median, in contrast, is not influenced by outliers, so it is considered a robust metric for the estimate of location.
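A minimal sketch of a trimmed mean follows, assuming we trim a fixed count of points from each end (other conventions trim a percentage instead). The function name `trimmed_mean`, its `trim_count` parameter, and the data are all illustrative choices, not a standard API.

```python
def trimmed_mean(values, trim_count):
    """Mean after dropping the trim_count smallest and trim_count largest values."""
    if trim_count < 0:
        raise ValueError("trim_count must be non-negative")
    ordered = sorted(values)
    if 2 * trim_count >= len(ordered):
        raise ValueError("trimming would remove every data point")
    kept = ordered[trim_count:len(ordered) - trim_count]
    return sum(kept) / len(kept)


# Hypothetical commute times in minutes, with one invented outlier (90).
commute_minutes = [14, 15, 16, 15, 13, 17, 15, 90]

plain = trimmed_mean(commute_minutes, 0)    # no trimming: the ordinary mean
trimmed = trimmed_mean(commute_minutes, 1)  # drops 13 and 90 before averaging
print(plain, trimmed)
```

Here the ordinary mean is 24.375, while trimming one point from each end brings the estimate back to roughly 15.3 minutes. If SciPy is available, `scipy.stats.trim_mean` provides a percentage-based version.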

2. Measures of Spread:

Measures of spread tell you how your data is spread around the measures of central tendency. The most commonly used measures of spread are as follows:

Range: The range is the difference between your dataset’s maximum and minimum data points. Each quantitative feature of the dataset can have a different range.

Standard Deviation: The standard deviation tells you how your data is distributed around the mean by measuring the typical distance of the data points from the mean. The larger the standard deviation, the greater your dataset’s variability.

Variance: The variance is the square of the standard deviation. In practice, the standard deviation is used more often because it is expressed in the same units as the data.

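Interquartile range (IQR): The IQR describes the spread of the middle 50% of values when they are ordered from lowest to highest. It is the difference between the third quartile (Q3) and the first quartile (Q1).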
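The four measures of spread above can be sketched with Python's standard `statistics` module. The data is hypothetical. Two assumptions are worth stating: this sketch uses the population formulas (`pvariance`, `pstdev`), whereas `variance` and `stdev` would give the sample versions, and it uses the default quartile method of `statistics.quantiles`; other quartile conventions can give slightly different IQR values.

```python
import statistics

# Hypothetical commute times in minutes (illustrative values only).
data = [13, 14, 15, 15, 15, 16, 17]

# Range: difference between the maximum and minimum.
data_range = max(data) - min(data)

# Population variance and standard deviation (assumes data is the whole population).
variance = statistics.pvariance(data)
std_dev = statistics.pstdev(data)

# Quartiles with the default ("exclusive") method; IQR = Q3 - Q1.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print("range:", data_range)
print("variance:", variance)
print("standard deviation:", std_dev)
print("IQR:", iqr)
```

For this data the range is 4 and the IQR is 2.0 (Q1 = 14, Q3 = 16), and squaring the standard deviation gives back the variance.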

3. Distribution:

To describe the frequency of each potential value of a variable, expressed in percentages or numbers, statisticians employ graphs and tables. For instance, if you were conducting a poll to find out which Beatle people preferred, you would set up two columns: one listing all conceivable options (John, Paul, George, and Ringo), and the other listing the total number of votes.

Statistics professionals display frequency distributions as a table or a graph.
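The Beatles poll described above can be sketched as a small frequency table. The votes are invented purely for illustration, and `frequency_table` is a hypothetical helper that returns each option with its count and its percentage of all votes.

```python
from collections import Counter

# Hypothetical poll responses (made-up votes, not real survey data).
votes = ["John", "Paul", "George", "Paul", "Ringo",
         "John", "Paul", "George", "John", "Paul"]


def frequency_table(responses):
    """Return (option, count, percentage) rows, most frequent option first."""
    counts = Counter(responses)
    total = sum(counts.values())
    return [(option, count, 100 * count / total)
            for option, count in counts.most_common()]


table = frequency_table(votes)
print(f"{'Beatle':<8}{'Votes':>6}{'Percent':>9}")
for option, count, percent in table:
    print(f"{option:<8}{count:>6}{percent:>8.1f}%")
```

The same counts could be drawn as a bar chart to show the distribution as a graph instead of a table.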

Thank you for taking the time to read this article. I hope you enjoyed it. If you did, don’t forget to click the “clap” and “follow” buttons. Keep checking back for new posts.
