Descriptive statistics summary for Data science

Neha Kushwaha
Published in Analytics Vidhya
7 min read · Jan 1, 2020

Descriptive statistics does exactly what the name suggests: it "describes" the raw data, summarizing it with graphs and overall summary figures that are easy for humans to interpret. In short, it helps us understand "What has happened?"

This article gives, for each statistic, a short definition and formula followed by its advantages and disadvantages, to give a sense of which statistic to use in which situation.

Population vs sample

Population : A data set that contains all members of a specified group (the entire list of data values).

Example: The population may be all people living in India.

Sample : A sample data set contains a part, or a subset, of a population. The size of a sample is always less than the size of the population from which it is taken.

Example: The sample may be some people living in India.
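As a rough illustration of the distinction, the sketch below builds a made-up population and draws a sample from it. The ages, the population size of 1,000 and the sample size of 50 are all assumptions chosen for the example, not values from this article.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: ages of 1,000 people (made-up values).
population = [random.randint(1, 90) for _ in range(1000)]

# A sample is a subset of the population, drawn here without replacement.
sample = random.sample(population, k=50)

print(len(population), len(sample))
```

Drawing without replacement keeps the sample strictly smaller than the population, which matches the definition above.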

Descriptive data types

Summarizing Data

Measure of Central tendencies

Mean

Most commonly called the average. The mean of a set of data values is the sum of all the data values divided by the total number of data values.

Formula : x̄ = (x₁ + x₂ + … + xₙ) / n, where n is the total number of data values.

Advantages :

  • The mean does not require the data to be sorted, and sorting can be costly.
  • If data are not available at all points, the mode and median will not give a correct representation of the data.
  • It can be used for both continuous and discrete numeric data.

Disadvantages :

  • The mean can be badly affected by outliers (data points with extreme values unlike the rest).
  • The mean cannot be calculated for categorical data, as the values cannot be summed.
  • It cannot be found by graphical inspection.
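A minimal sketch of the mean and its sensitivity to outliers, using Python's standard `statistics` module. The salary figures are hypothetical values invented for the example.

```python
from statistics import mean

salaries = [30, 32, 35, 31, 33]    # hypothetical values
with_outlier = salaries + [300]    # the same data plus one extreme value

print(mean(salaries))       # 161 / 5 = 32.2
print(mean(with_outlier))   # 461 / 6, roughly 76.83
```

A single extreme value more than doubles the mean here, even though five of the six values are close to 32.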

Median

The median of a set of data values is the middle value of the data set once it has been arranged in ascending order. For an odd number of values, the middle value is the median; for an even number of values, the mean of the two middle values is the median.

When the data are listed in order, the median is the point at which 50% of the cases lie above it and 50% below; it is also known as the 50th percentile.

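Formula : with the n values sorted in ascending order, median = the ((n + 1) / 2)-th value if n is odd, or the mean of the (n / 2)-th and (n / 2 + 1)-th values if n is even.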

The median is largely unaffected by outliers, and for a symmetric distribution the mean and median are identical.

In skewed data, the mean lies further towards the skew (the long tail) than the median.

Advantages :

  • Less affected by outliers and skewed data
  • Can represent data graphically
  • Can be calculated even when the number series is incomplete

Disadvantages :

  • It cannot be identified for nominal categorical data, as the values cannot be logically ordered.
  • Sorting the data can sometimes be costly.
  • It doesn't account for all the observations.
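The odd and even cases described above can be sketched with the standard `statistics.median` function. The input lists are made-up values.

```python
from statistics import mean, median

odd_values = [7, 1, 5, 3, 9]    # sorted: 1 3 5 7 9, middle value is 5
even_values = [7, 1, 5, 3]      # sorted: 1 3 5 7, mean of 3 and 5 is 4.0

print(median(odd_values))       # 5
print(median(even_values))      # 4.0

# Replacing the largest value with an outlier moves the mean, not the median.
skewed = [1, 3, 5, 7, 900]
print(median(skewed), mean(skewed))   # 5 and 183.2
```

`median` sorts a copy of the data internally, which is the sorting cost mentioned in the disadvantages.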

Mode

The mode is the most popular value in a given data set or population, that is, the value that occurs most frequently in a set of observations. A data set can be multimodal (have more than one mode), which means more than one value shares the highest frequency.

It may give the most likely experience rather than the “typical” or “central” experience. For example, which shirt size a store should stock can be decided from the mode of previous shirt sales.

Formula : mode = the value with the highest frequency (count) in the data set.

Advantages :

  • It can be obtained for both numerical and categorical data
  • Can be graphically represented with a histogram.

Disadvantages :

  • A data set can have one mode, more than one mode, or no mode at all.
  • For floating-point (continuous) data the mode is difficult to calculate, because exact values rarely repeat.
  • Could be an inaccurate representation of data as it is not based on all the values.
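A small sketch of finding the mode for categorical and numeric data. It uses `statistics.multimode` (available from Python 3.8) and `collections.Counter`; the shirt sizes and numbers are hypothetical.

```python
from collections import Counter
from statistics import multimode

shirt_sizes = ["M", "L", "M", "S", "L", "M", "XL"]   # hypothetical sales
print(multimode(shirt_sizes))               # ['M']

# Counter exposes the frequencies behind the mode.
print(Counter(shirt_sizes).most_common(1))  # [('M', 3)]

# A multimodal data set: 2 and 3 both occur twice.
bimodal = [1, 2, 2, 3, 3, 4]
print(multimode(bimodal))                   # [2, 3]
```

`multimode` returns a list, so the one-mode and several-modes cases are handled the same way.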

Measure of Variations

Range

The range is the spread, or distance, between the lowest and highest values of a data set (variable).

Formula : range = maximum value − minimum value

Advantages :

  • The prime advantage of this measure of dispersion is that it is easy to calculate.

Disadvantages :

  • It is very sensitive to outliers and does not use all the observations in a data set.
  • It is more informative to provide the minimum and the maximum values rather than providing the range.
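The range needs only the two extreme values, as this short sketch with made-up numbers shows. It also shows how one outlier changes the result.

```python
values = [4, 8, 15, 16, 23, 42]        # hypothetical data

lowest, highest = min(values), max(values)
value_range = highest - lowest
print(lowest, highest, value_range)    # 4 42 38

# A single outlier dominates the range.
with_outlier = values + [400]
print(max(with_outlier) - min(with_outlier))   # 396
```

Printing the minimum and maximum alongside the range keeps the extra information that the range alone discards.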

Interquartile Range

The interquartile range (IQR) is defined as the difference between the 75th percentile (Q3, the third quartile) and the 25th percentile (Q1, the first quartile). Hence the interquartile range describes the middle 50% of observations.

If the interquartile range is large it means that the middle 50% of observations are spaced wide apart.

Formula : IQR = Q3 − Q1

Advantages :

  • It can be used as a measure of variability if the extreme values are not being recorded exactly (as in case of open-ended class intervals in the frequency distribution).
  • It is not affected by extreme values.

Disadvantages :

  • The main disadvantage in using interquartile range as a measure of dispersion is that it is not amenable to mathematical manipulation.
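A hedged sketch of the interquartile range using `statistics.quantiles` (Python 3.8+). Quartile values depend on the interpolation method; this example assumes the function's default "exclusive" method, and other tools may give slightly different quartiles for the same data. The data values are made up.

```python
from statistics import quantiles

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]   # hypothetical data

# n=4 splits the data into quartiles and returns the three cut points.
q1, q2, q3 = quantiles(values, n=4)
iqr = q3 - q1
print(q1, q3, iqr)    # 3.0 9.0 6.0

# Replacing the largest value with an extreme one leaves the IQR unchanged.
with_outlier = values[:-1] + [1000]
o1, _, o3 = quantiles(with_outlier, n=4)
print(o3 - o1)        # 6.0
```

The second part illustrates why the IQR is described as unaffected by extreme values: only the middle 50% of the sorted data determines it.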

Variance

Variance (σ²) in statistics is a measurement of the spread between numbers in a data set. That is, it measures how far each number in the set is from the mean and therefore from every other number in the set.

Formula : σ² = Σ(xᵢ − μ)² / N, where μ is the mean and N is the number of data values.

Statisticians use variance to see how individual numbers relate to each other within a data set, rather than using broader mathematical techniques such as arranging numbers into quartiles.

Advantages :

  • The advantage of variance is that it treats all deviations from the mean the same regardless of their direction. Because the deviations are squared, positive and negative deviations cannot cancel out and give the appearance of no variability at all in the data.

Disadvantages :

  • It gives added weight to outliers, the numbers that are far from the mean. Squaring these numbers can skew the data.
  • It is not easily interpreted, because squaring the deviations changes the units from those of the original data.
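The sketch below computes the variance step by step and compares it with `statistics.pvariance`. It assumes the population form (dividing by the number of values); the sample form, `statistics.variance`, divides by one less than that. The data are made-up values.

```python
from statistics import mean, pvariance

values = [2, 4, 4, 4, 5, 5, 7, 9]    # hypothetical data

mu = mean(values)                                   # 40 / 8 = 5
squared_deviations = [(x - mu) ** 2 for x in values]
manual_variance = sum(squared_deviations) / len(values)   # 32 / 8 = 4.0

print(manual_variance)
print(pvariance(values))   # same value from the standard library
```

Note that the result is in squared units: if the values were in metres, the variance would be in square metres.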

Standard deviation (SD)

The problem with variance is that it is hard to read as a deviation: the result is squared and so is in a different unit from the original data. To overcome this problem we calculate the SD.

Standard deviation (SD) is the most commonly used measure of dispersion. It is a measure of the spread of data about the mean. SD is the square root of the sum of squared deviations from the mean divided by the number of observations.

Formula : σ = √( Σ(xᵢ − μ)² / N ), where μ is the mean and N is the number of observations.

Advantages :

  • SD is a very useful measure of dispersion because, if the observations come from a normal distribution, about 68% of observations lie within mean ± 1 SD, about 95% within mean ± 2 SD, and about 99.7% within mean ± 3 SD.
  • The other advantage of SD is that, along with the mean, it can be used to detect skewness.

Disadvantages :

  • It is an inappropriate measure of dispersion for skewed data.
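As a rough check of the 68 / 95 / 99.7 rule, the sketch below simulates normally distributed data and counts how many observations fall within 1, 2 and 3 SD of the mean. The mean of 50, SD of 10, sample size and seed are assumptions picked for the example, and the printed shares will only approximate the theoretical figures.

```python
import random
from statistics import mean, pstdev

random.seed(42)  # fixed seed so the sketch is repeatable

# Simulated observations from a normal distribution (assumed mean 50, SD 10).
data = [random.gauss(50, 10) for _ in range(10_000)]
mu = mean(data)
sd = pstdev(data)   # population SD: divides by the number of observations


def share_within(k):
    """Fraction of observations within k standard deviations of the mean."""
    inside = [x for x in data if abs(x - mu) <= k * sd]
    return len(inside) / len(data)


for k in (1, 2, 3):
    print(k, round(share_within(k), 3))
```

For data that are strongly skewed, the same counts can differ noticeably from these percentages, which is one reason SD is a poor summary of skewed data.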

Box plot

A box plot helps us depict descriptive statistics graphically.

Always draw a box plot against a labelled scale, so that the values it depicts can be read off.
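A box plot is usually drawn from the five-number summary: minimum, first quartile, median, third quartile and maximum. The sketch below computes those five values with the standard library. It assumes the default "exclusive" method of `statistics.quantiles`, so plotting libraries may place the quartiles slightly differently; the data are made up.

```python
from statistics import quantiles

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]   # hypothetical data

q1, med, q3 = quantiles(values, n=4)   # the three quartile cut points

five_number_summary = {
    "min": min(values),
    "q1": q1,
    "median": med,
    "q3": q3,
    "max": max(values),
}
print(five_number_summary)
```

The box spans Q1 to Q3 with a line at the median, and the whiskers extend towards the minimum and maximum.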


A very happy and prosperous new year to all Medium readers. Thank you for reading the article. Happy learning!
