Rushabh Mehta
Sep 1, 2018 · 4 min read

Here are some interesting questions that came to my mind when I started learning descriptive statistics. After looking for the answers on numerous platforms, I thought I’d compile them into a short blog post!

1. Difference between the formulas for calculating the standard deviation from a sample and from a population.

[Image: formulas for the sample standard deviation and the population standard deviation]
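
For reference, the two standard textbook formulas can be written as below. The notation is an assumption on my part (N and μ for the population size and mean, n and x̄ for the sample size and mean) and may not match the symbols used in the images.

```latex
% Population standard deviation: divide by N, deviations taken from the population mean
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2}

% Sample standard deviation: divide by n - 1, deviations taken from the sample mean
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}
```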

Notice the difference in formula?

Reason why we divide by ‘n-1’ instead of ‘n’ when calculating the standard deviation from a sample rather than from the population:

Generally the population mean is unknown, so we draw a small sample from the population and calculate its mean (known as the sample mean). The sample mean is usually not exactly equal to the population mean.

To find the standard deviation we use this sample mean along with the data points of the sample. Since this mean is itself calculated from the same data points, we get a biased result, i.e. a smaller than expected value of the standard deviation. It can be proved mathematically that the sum of squared deviations of the sample points from the sample mean is never larger than the sum of their squared deviations from the population mean.

Thus, dividing by ‘n’ gives a value of the standard deviation that tends to be smaller than the actual value. To compensate, we reduce the denominator from ‘n’ to ‘n-1’, which increases the result.
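
A quick simulation can make this bias visible. The sketch below is only an illustration under assumed settings (a normal population with standard deviation 10, samples of size 5, and the helper name `average_variance_estimates` are all my own choices). It averages the variance estimates over many samples, because the ‘n-1’ correction is easiest to see on the variance, i.e. the squared standard deviation.

```python
import random
import statistics


def average_variance_estimates(pop_sd=10.0, n=5, trials=20000, seed=0):
    """Average the 'divide by n' and 'divide by n-1' variance estimates
    over many small samples drawn from a normal population."""
    rng = random.Random(seed)
    total_div_n = 0.0
    total_div_n_minus_1 = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, pop_sd) for _ in range(n)]
        total_div_n += statistics.pvariance(sample)          # divides by n
        total_div_n_minus_1 += statistics.variance(sample)   # divides by n - 1
    return total_div_n / trials, total_div_n_minus_1 / trials


biased, corrected = average_variance_estimates()
print("true variance:        ", 10.0 ** 2)
print("average, divide by n:  ", round(biased, 1))
print("average, divide by n-1:", round(corrected, 1))
```

With these settings the ‘divide by n’ average should land noticeably below the true variance of 100, while the ‘n-1’ average should land close to it.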

2. Why do we use the root mean square deviation (RMSE) instead of the absolute deviation?

[Image: root mean square deviation formula]
  • RMSE is built on the square function, which is differentiable everywhere, so it can be differentiated to find its minimum; the absolute value function is not differentiable at zero.
  • RMSE emphasizes the larger differences.
  • Example: Consider the data set A = {10, 20, 30, 50, 100}.
  • The mean is 42.
  • The absolute deviation |100 - 42| = 58 is much smaller than the squared deviation (100 - 42)² = 3364, so RMSE punishes outlying values more than the absolute deviation does.
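
The example above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the variable names are my own. It computes the mean absolute deviation and the root mean square deviation for A, and how much of each total comes from the outlying value 100.

```python
import math
import statistics

data = [10, 20, 30, 50, 100]
mean = statistics.mean(data)  # 42

abs_devs = [abs(x - mean) for x in data]   # 32, 22, 12, 8, 58
sq_devs = [(x - mean) ** 2 for x in data]  # 1024, 484, 144, 64, 3364

mean_abs_dev = sum(abs_devs) / len(data)
rms_dev = math.sqrt(sum(sq_devs) / len(data))

# Share of the total deviation that is due to the value 100
outlier_share_abs = abs_devs[-1] / sum(abs_devs)
outlier_share_sq = sq_devs[-1] / sum(sq_devs)

print("mean absolute deviation:", mean_abs_dev)
print("root mean square deviation:", round(rms_dev, 2))
print("outlier share (absolute):", round(outlier_share_abs, 2))
print("outlier share (squared): ", round(outlier_share_sq, 2))
```

The value 100 accounts for a larger share of the squared total than of the absolute total, which is what "emphasizes the larger differences" means in practice.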

3. Histograms are based on area, not on height!

Bar heights represent frequency only when all bins have the same width. For histograms with unequal bin widths, we need to plot frequency density (frequency divided by bin width) instead of frequency, so that the area of each bar represents the frequency.
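
Here is a small sketch of that conversion. The bins and counts are made-up numbers for illustration, and `frequency_density` is a hypothetical helper name.

```python
# Hypothetical bins with unequal widths: (lower edge, upper edge, frequency)
bins = [(0, 10, 20), (10, 20, 30), (20, 50, 30)]


def frequency_density(bins):
    """Return frequency / bin width for each bin (the bar heights to plot)."""
    return [count / (upper - lower) for lower, upper, count in bins]


densities = frequency_density(bins)

# The area of each bar (height * width) gives back the original frequency
areas = [d * (upper - lower) for d, (lower, upper, _) in zip(densities, bins)]

print("bar heights:", densities)
print("bar areas:  ", areas)
```

The last two bins hold the same number of observations, but the wider one gets a bar one third as tall, so the two bars cover equal area.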

4. Why do we use standard deviation instead of variance as a measure of the spread of data?

Standard Deviation is in the same units as the data, while Variance, which is the square of the Standard Deviation, is in squared units. For understanding the spread of data, it makes more sense if the value that quantifies this spread is in the same units as the data.

Example: When we have a dataset of heights in meters, the S.D. is also in meters, while the Variance is in m², a unit of area. If we use the variance here, it is hard to get an idea of the variability, since it is expressed as an area (m²). To avoid this we take the square root of the variance, which is the standard deviation, and get back the original units. Using the Standard Deviation, one can then get an idea of the variability.
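
One way to see the units at work is to express the same heights in meters and in centimeters. The heights below are made-up values, and the population formulas (`pstdev`, `pvariance`) are used only for simplicity.

```python
import statistics

heights_m = [1.50, 1.60, 1.70, 1.80, 1.90]  # made-up heights in meters
heights_cm = [h * 100 for h in heights_m]   # the same heights in centimeters

sd_m = statistics.pstdev(heights_m)
var_m = statistics.pvariance(heights_m)
sd_cm = statistics.pstdev(heights_cm)
var_cm = statistics.pvariance(heights_cm)

# The S.D. scales like a length (x100); the variance scales like an area (x10000)
print("S.D. ratio cm/m:    ", round(sd_cm / sd_m, 6))
print("variance ratio cm/m:", round(var_cm / var_m, 6))
```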

5. How to determine whether the standard deviation is high or low?

  • The standard deviation of a variable depends on the values that the variable takes. If these values are large, the resulting standard deviation is larger than for the same variable with all its values divided by ten.
  • This means that one cannot compare the variability of two datasets just by looking at their standard deviations (i.e. one cannot conclude that a dataset is more variable than another just because its standard deviation is higher).
  • This is because standard deviation is an absolute measure of variability.

Thus, to determine which dataset is more variable, we calculate the coefficient of variation (C.V.), which is the standard deviation divided by the mean: C.V. = standard deviation / mean.

A higher C.V. signifies higher variability.

Now we can compare the coefficients of variation of the two datasets and determine which is more variable.
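
As a rough sketch of this comparison, the snippet below reuses the "divide by ten" idea from above. The datasets are invented, `coefficient_of_variation` is a hypothetical helper, and using the population standard deviation is my assumption (the sample version works the same way).

```python
import statistics


def coefficient_of_variation(values):
    """Standard deviation divided by the mean (undefined when the mean is 0)."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("C.V. is undefined when the mean is zero")
    return statistics.pstdev(values) / mean


a = [100, 200, 300, 400, 500]
b = [x / 10 for x in a]    # same data with every value divided by ten
c = [28, 29, 30, 31, 32]   # similar mean to b, but much less spread

sd_a, sd_b = statistics.pstdev(a), statistics.pstdev(b)
cv_a = coefficient_of_variation(a)
cv_b = coefficient_of_variation(b)
cv_c = coefficient_of_variation(c)

print("S.D. of a and b:", round(sd_a, 2), round(sd_b, 2))
print("C.V. of a and b:", round(cv_a, 4), round(cv_b, 4))
print("C.V. of c:      ", round(cv_c, 4))
```

The standard deviations of a and b differ by a factor of ten, yet their C.V. values match, while c gets a clearly lower C.V. because its values sit much closer to their mean.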

GreyAtom