DESCRIPTIVE STATISTICS

We’ve already discussed variables and data to some extent. Now, let’s build some intuition for Descriptive Statistics.

Harsh Nadar
The Business Club, IIT (BHU) Varanasi
4 min read · May 30, 2020


In layman’s terms, descriptive stats help you understand a data set by summarising it with a few numbers, such as its typical value and how spread out it is. According to the book ‘Naked Statistics,’ descriptive stats can be like online dating profiles: technically correct and yet pretty darn misleading! (We’ll see that :P)

According to Prof. William M.K. Trochim, “Descriptive statistics are typically distinguished from inferential statistics. With descriptive statistics, you are simply describing what is or what the data shows. With inferential statistics, you are trying to reach conclusions that extend beyond the immediate data alone. For instance, we use inferential statistics to try to infer from the sample data what the population might think.”

But, for now, let’s just focus on the descriptive part. Descriptive statistics are broken down into two categories:

  1. Measures of Central Tendency: the idea that there is one number that best summarises an entire set of measurements. (Mean, Median, Mode)
  2. Measures of Dispersion: how spread out the data is. (Standard Deviation, Variance)

Now, let’s cover each measure one by one, and we’ll go through some real-life examples as well.

MEAN

The mean is simply the average of the distribution. One of its important properties is that it minimizes prediction error in a specific sense: if you had to stand in for every value in your data set with a single number, the mean is the number that gives the lowest total squared error.

Let’s take the case of the placement scenario. Suppose your batch has ten students. Nine students get placed at 10 LPA (CTC), and the extraordinary one gets a package of 1.5 Cr (CTC), i.e., 150 LPA. The mean of the batch turns out to be (9 × 10 + 150) / 10 = 24 LPA. Sounds pretty good, huh? A new entrant might think that the college has a pretty good placement scene. But you know the reality and would comment that this mean is deceptive. (1.5 Cr acts as an outlier in your data, or in simple words, it stands out.)

That’s the thing with mean: it’s sensitive to outliers. And the outliers may alter the mean in a significant way.
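To see this sensitivity in numbers, here is a minimal Python sketch using the batch from the example above. The variable names are only illustrative, and the packages are written in LPA (so 1.5 Cr becomes 150).

```python
from statistics import mean

# Packages in LPA: nine students at 10 LPA and one at 1.5 Cr (150 LPA).
packages = [10] * 9 + [150]

# Mean of the full batch, outlier included.
batch_mean = mean(packages)

# Mean of the same batch with the single 1.5 Cr package left out.
mean_without_outlier = mean(packages[:-1])

print("Mean with outlier:", batch_mean)
print("Mean without outlier:", mean_without_outlier)
```

One package out of ten is enough to pull the mean from 10 LPA up to 24 LPA.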

MEDIAN

The median signals the “middle” of the distribution. In simple words, the median divides the distribution into two halves, i.e., half of the observations lie above the median and half below it. In the above placement case, the median package turns out to be 10 LPA. The median gives the new entrant a better intuition of your batch’s placements, and they may seek a better college with a higher median package.

Suppose the company offering 1.5 Cr thinks that you are also capable and offers you the same package. In this case, the mean package will rise heavily, but the median will remain the same.
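A short Python sketch of this scenario, assuming you were one of the nine students at 10 LPA before the second 1.5 Cr offer (the list names are made up for illustration):

```python
from statistics import mean, median

# Packages in LPA. Before: nine students at 10 and one at 150 (1.5 Cr).
before = [10] * 9 + [150]

# After: you move from 10 LPA to 150 LPA, so eight students stay at 10.
after = [10] * 8 + [150, 150]

mean_before, mean_after = mean(before), mean(after)
median_before, median_after = median(before), median(after)

print("Mean:", mean_before, "->", mean_after)
print("Median:", median_before, "->", median_after)
```

The mean jumps from 24 LPA to 38 LPA, while the median stays at 10 LPA in both cases.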

Note: For a symmetrical distribution, or one without serious outliers, the mean and median will be approximately equal.

So, the key here is to determine which measure is more representative in a particular situation.

MODE

The mode is the value or category that occurs most frequently in the data. However, it is not a very useful descriptive stat when the data is continuous, because exact values rarely repeat.

On the other hand, it is the only measure of central tendency that can be used for purely categorical (unordered) data. E.g., for Gender, you can simply say whether the mode is female/male/other.
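As a small illustration, the mode of a categorical column can be found in Python like this. The list of responses below is hypothetical data invented for the example, not something from a real survey.

```python
from collections import Counter
from statistics import mode

# Hypothetical survey responses (made-up data for illustration only).
genders = ["female", "male", "female", "other", "male", "female"]

# The mode is the most frequent category.
most_common_gender = mode(genders)

# Counter shows the full frequency table behind that answer.
frequencies = Counter(genders)

print("Mode:", most_common_gender)
print("Counts:", dict(frequencies))
```

Note that a mean or median of these labels would be meaningless, which is why the mode is the natural summary here.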

VARIANCE

It is the average of the squared differences from the mean. However, because of this squaring, the variance is no longer in the same unit of measurement as the original data. Taking the square root of the variance gives the standard deviation, which is back in the original unit of measurement and, therefore, much easier to interpret.

STANDARD DEVIATION

It is a measure of how much the data is dispersed about its mean. It reflects how tightly the observations cluster around the mean. It is the square root of the variance.
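Here is a minimal Python sketch of both measures on the placement batch used earlier. It assumes the "average of squared differences" definition given above, i.e., the population versions (`pvariance` and `pstdev`), which divide by the number of observations; the sample versions (`variance` and `stdev`) divide by one less.

```python
from statistics import pstdev, pvariance

# Packages in LPA: nine students at 10 LPA and one at 1.5 Cr (150 LPA).
packages = [10] * 9 + [150]

# Population variance: average squared difference from the mean (unit: LPA squared).
package_variance = pvariance(packages)

# Population standard deviation: square root of the variance (unit: LPA).
package_sd = pstdev(packages)

print("Variance:", package_variance)
print("Standard deviation:", package_sd)
```

The variance comes out as 1764 (in squared LPA) and the standard deviation as 42 LPA, which is far easier to relate to packages of 10 and 150 LPA.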

Mathematical Formula for Standard Deviation
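The formula image from the original post is not reproduced here. As a sketch consistent with the definition above (variance as the average squared difference from the mean), the population form can be written as follows. The symbols are the usual textbook ones and are an assumption about what the original figure showed: x_i are the observations, N is their count, and μ is their mean.

```latex
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i,
\qquad
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2,
\qquad
\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2}
```

When the data is a sample rather than the whole population, the sum is usually divided by N - 1 instead of N.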

Standard deviation and variance play an essential role in the risk analysis of stocks. The higher the SD of a particular stock, the higher the risk, i.e., the more the stock price fluctuates, the riskier it is.

One important thing to note is that for a normal distribution (data that is symmetrically distributed around the mean in a bell shape), about 68.3% of the data lies within one SD of the mean, about 95.4% within two SDs of the mean, and so on. Please note that this property is the foundation on which much of statistics is built.
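These percentages can be checked with Python’s standard library. The sketch below uses `statistics.NormalDist` (available from Python 3.8); the helper name `share_within` is made up for this example.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Fraction of a normal distribution lying within k SDs of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"Within {k} SD: {share_within(k) * 100:.1f}%")
```

The results are roughly 68.3%, 95.4% and 99.7% for one, two and three SDs.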

Image credits: https://www.spss-tutorials.com/normal-distribution/

Descriptive statistics are often used to compare two groups or entities:

1. If you compare two batsmen (in cricket), you may look at their mean scores, i.e., their batting averages.

2. In order to compare two colleges, one of the metrics can be the median package that students get.

3. Suppose we consider two cases: (A) a boys’ hostel having 100 students and (B) a colony having 100 residents. Now, assume that the average age in both cases is 20 years. The hostel will have students of approximately the same age group, whereas the colony may have infants, youngsters, and older people. Here, you may get the intuition that the age data of the colony’s residents is more spread out than that of the hostel. Or, you may say that the SD of age in case B is higher than the SD of age in case A.
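The hostel versus colony comparison can be sketched in Python with a few invented ages. Both lists below are hypothetical and much shorter than 100 entries; they are chosen only so that the two groups share a mean of 20 years.

```python
from statistics import mean, pstdev

# Hypothetical ages (made-up data): same mean, very different spread.
hostel_ages = [19, 20, 21, 20, 20]   # students of roughly the same age
colony_ages = [2, 5, 18, 30, 45]     # infants, youngsters, and older people

hostel_mean, colony_mean = mean(hostel_ages), mean(colony_ages)
hostel_sd, colony_sd = pstdev(hostel_ages), pstdev(colony_ages)

print("Mean age:", hostel_mean, "vs", colony_mean)
print("SD of age:", round(hostel_sd, 2), "vs", round(colony_sd, 2))
```

The means are identical, so only the SD reveals that the colony’s ages are far more spread out than the hostel’s.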

Stay tuned for more such articles!
