Descriptive Statistics All In One Cheat Sheet (Part-1)

About Measure of Frequency, Measure Of Central Tendency, Measure Of Dispersion, Measure Of Position and Measure Of Shape

Data Striver
7 min read · Mar 14, 2023
Photo by Naser Tamimi on Unsplash

Introduction

Statistics as a whole deals with data collection, organization, analysis, interpretation, and presentation. Descriptive statistics is the branch that summarizes and describes the main features of a data set, without making inferences or predictions about the larger population.

Population Vs Sample

  • Population refers to the entire group of individuals or objects that we are interested in studying. For example, the population might be all the students in a particular school or all the cars in a particular city.
  • A sample, on the other hand, is a subset of the population. It is a smaller group of individuals or objects that we select from the population to study. For example, we might randomly select 100 students from a particular school.
  • When selecting a sample from the population, two things matter:
    - first, it should be random
    - second, it should be representative of the population
  • A parameter is a characteristic of a population, while a statistic is a characteristic of a sample.
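To make the population/sample and parameter/statistic distinction concrete, here is a minimal sketch. The school, the marks and the seed are all made up for illustration; only the idea of drawing 100 students at random comes from the text above.

```python
import random

rng = random.Random(42)  # fixed seed, only so the sketch is reproducible

# Hypothetical population: marks (0-100) of all 1,200 students in a school.
population = [rng.randint(0, 100) for _ in range(1200)]

# Simple random sample of 100 students, drawn without replacement.
sample = rng.sample(population, k=100)

population_mean = sum(population) / len(population)  # a parameter
sample_mean = sum(sample) / len(sample)              # a statistic

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
```

The two means will usually be close but not identical, which is exactly why the sample has to be random and representative.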

Types Of Data

  • Quantitative data takes on numeric values that allow us to perform mathematical operations.
    - Continuous data can always be split into smaller units (for example, age can be measured in years, months, days, hours, or seconds, and still smaller units exist).
    - Discrete data only takes on countable values.
  • Categorical data are used to label a group or set of items.
    - Categorical Ordinal: data that take on a ranked ordering (for example, sizes on a scale of small, medium, and large).
    - Categorical Nominal: data that do not have an order or ranking.

Types Of Study

Descriptive statistics mainly relies on five types of measures:

  1. Measure Of Frequency
  2. Measure Of Central Tendency
  3. Measure Of Dispersion
  4. Measure Of Position
  5. Measure Of Shape

1. Measure of Frequency

  • percentage
  • frequency

Percentage:

Percentage means per hundred and is represented by the symbol %. One per cent is one-hundredth of a value and is calculated by dividing the value by 100. For example, 1% of 123 is 123/100, i.e. 1.23, so 10% of 123 is 12.3.

A pie chart is a common way to display percentages of a whole.

Frequency:

Frequency is the number of times a particular value of a variable has been observed to occur.

Frequency can be expressed in three different ways:

  • Absolute frequency — describes the number of times a particular value of a variable has been observed to occur.
  • Relative frequency — describes the number of times a particular value of a variable has been observed to occur in relation to the total number of values for that particular variable. Ratios, rates and proportions are used to describe Relative frequency
  • Cumulative frequency — describes the sum of all previous frequencies up to the current value.

Histograms and bar plots are used to plot frequency.
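The three kinds of frequency can be computed in a few lines. This is only a sketch: the shirt-size observations and the S, M, L ordering are invented for the example.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical observations of one variable (shirt sizes).
sizes = ["S", "M", "M", "L", "M", "S", "L", "M"]

# Absolute frequency: how many times each value occurs.
absolute = Counter(sizes)

# Relative frequency: each count divided by the total number of values.
total = sum(absolute.values())
relative = {value: count / total for value, count in absolute.items()}

# Cumulative frequency: running total, following a chosen order of values.
order = ["S", "M", "L"]
cumulative = dict(zip(order, accumulate(absolute[v] for v in order)))

for v in order:
    print(v, absolute[v], f"{relative[v]:.0%}", cumulative[v])
```

Multiplying a relative frequency by 100 gives the percentage for that value.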

2. Measure of Central Tendency

A measure of central tendency is a statistical measure that represents a typical or central value for a dataset. It provides a summary of the data by identifying a single value that is most representative of the dataset as a whole.

  • Mean
  • Median
  • Mode
  • Weighted Mean
  • Trimmed Mean

Mean :

Mean is the sum of the values of all observations divided by the total number of observations. The symbol “µ” (pronounced mu) is used for the mean of a population, and x̄ (pronounced x-bar) is used for the mean of a sample.

µ = (x1 + x2 + ... + xN) / N
x̄ = (x1 + x2 + ... + xn) / n

Here “N” is the number of items in the population and “n” is the number of items in the sample.

The mean has one disadvantage: it is strongly influenced by outliers.

For example, consider the marks of 5 students — 34,56,67,45,34.

mean = (34+56+67+45+34)/5 = 47.2

Now let's insert an outlier, i.e. 150, as the marks of a sixth student.

new_mean = (34+56+67+45+34+150)/6 = 64.33

As you can see, the mean shifted from 47.2 to 64.33 because of a single outlier (150).
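The same arithmetic can be checked with Python's standard `statistics` module. This is just a quick sketch using the marks from the example.

```python
from statistics import mean

marks = [34, 56, 67, 45, 34]
marks_with_outlier = marks + [150]  # sixth student scores 150

print(mean(marks))                         # 47.2
print(round(mean(marks_with_outlier), 2))  # 64.33
```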

Median :

The median is the middle value of a sorted list of numbers.

For an odd number of values, the median is the ((n + 1)/2)th value of the sorted list; for an even number of values, it is the average of the (n/2)th and (n/2 + 1)th values. Here n is the number of values.

Advantage of the median: The median is less affected by outliers and skewed data than the mean, and is usually the preferred measure of central tendency when the distribution is not symmetrical.

In the above example, sorting the data gives 34, 34, 45, 56, 67, and the median is 45. After adding the outlier 150, the data becomes 34, 34, 45, 56, 67, 150 and the median becomes (45+56)/2 = 50.5. So the median shifts only slightly (from 45 to 50.5), while the mean jumped from 47.2 to 64.33.
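Here is a small sketch of the odd/even rule, written by hand and then compared with `statistics.median`. The helper name `median_by_position` is made up for this example.

```python
from statistics import median

def median_by_position(values):
    """Median using the (n + 1)/2 rule for odd n and the two middle values for even n."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]  # -1 because Python lists start at index 0
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

marks = [34, 56, 67, 45, 34]
print(median_by_position(marks), median(marks))                  # 45 45
print(median_by_position(marks + [150]), median(marks + [150]))  # 50.5 50.5
```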

Limitation of the median: The median cannot be identified for categorical nominal data, because such data cannot be logically ordered.

Mode:

Mode is the most frequent value in the data.

Advantage of the mode: The mode has an advantage over the median and the mean as it can be found for both numerical and categorical (non-numerical) data.

Limitations of the mode: There are some limitations to using the mode. In some distributions, the mode may not reflect the centre of the distribution very well.
It is also possible for there to be more than one mode for the same distribution of data (bi-modal or multi-modal). The presence of more than one mode can limit the ability of the mode to describe the centre or typical value of the distribution, because a single value to describe the centre cannot be identified.
In some cases, particularly where the data are continuous, the distribution may have no mode at all (i.e. if all values are different).

In cases such as these, it may be better to consider using the median or mean, or group the data into appropriate intervals, and find the modal class.
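The cases above (one mode, several modes, and all values different) can be seen with `statistics.mode` and `statistics.multimode` (available in Python 3.8 and later). The colour and height lists are invented for illustration.

```python
from statistics import mode, multimode

marks = [34, 56, 67, 45, 34]
print(mode(marks))          # 34, the only value that appears twice

# Works for categorical data too; here two values tie, so the data is bi-modal.
colours = ["red", "blue", "red", "blue", "green"]
print(multimode(colours))   # ['red', 'blue']

# Continuous-looking data where every value is different:
# every value is tied at one occurrence, so no value stands out as a mode.
heights = [150.2, 161.7, 158.4]
print(multimode(heights))   # [150.2, 161.7, 158.4]
```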

Weighted Mean:

The weighted mean is the sum of the products of each value and its weight, divided by the sum of the weights. It is used to calculate a mean when the values in the dataset have different importance or frequency.
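A minimal sketch of that definition follows. The function name, the subject scores and the credit weights are all hypothetical.

```python
def weighted_mean(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical scores in three subjects, weighted by credit hours.
scores = [80, 90, 70]
credits = [3, 2, 1]

print(round(weighted_mean(scores, credits), 2))  # 81.67
print(sum(scores) / len(scores))                 # 80.0, the plain mean
```

With equal weights the weighted mean reduces to the ordinary mean.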

Trimmed Mean:

A trimmed mean is a method of finding a more realistic average value by getting rid of certain erratic observations. It is calculated by removing a certain percentage of the smallest and largest values from the dataset and then taking the mean of the remaining values. The percentage of values removed is called the trimming percentage.
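Here is one way this could be sketched. It assumes the trimming percentage is applied to each end of the sorted data (conventions differ, and some sources quote the total percentage removed), and the function name is made up.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from EACH end (assumed convention)."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)      # how many values to drop per tail
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

marks = [34, 56, 67, 45, 34, 150]           # marks from the mean example, with the outlier

print(round(trimmed_mean(marks, 0.0), 2))   # 64.33, nothing trimmed
print(trimmed_mean(marks, 0.2))             # 50.5, drops one 34 and the 150
```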

By WallStreetMojo

3. Measure Of Dispersion

The measures of central tendency are not adequate on their own to describe data. Two data sets can have the same mean and still be entirely different. Thus, to describe data, one also needs to know the extent of variability. This is given by the measures of dispersion, which provide information about how the data is distributed around the central tendency (mean, median or mode) of the dataset. Commonly used measures of dispersion are as follows:

  • Range
  • Variance
  • Standard Deviation
  • Coefficient Of Variation

Range:

The range is the difference between the maximum and minimum values in the dataset. It is a simple measure of dispersion that is easy to calculate but can be affected by outliers.
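A quick sketch with the marks from the mean example shows how much a single outlier moves the range. The helper name `value_range` is made up, chosen to avoid shadowing Python's built-in `range`.

```python
def value_range(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)

marks = [34, 56, 67, 45, 34]

print(value_range(marks))          # 33  (67 - 34)
print(value_range(marks + [150]))  # 116 (150 - 34)
```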

Variance:

The variance is the average of the squared differences between each data point and the mean. It measures how far, on average, the data points lie from the mean, expressed in squared units of the data. Population variance is written as σ² and sample variance is written as s².

σ² = Σ (xi − µ)² / N
s² = Σ (xi − x̄)² / (n − 1)

Here “µ” represents the population mean and x̄ represents the sample mean. “N” is the number of items in the population and “n” is the number of items in the sample. Note that the usual sample variance divides by n − 1 rather than n.
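As a sketch, both versions can be computed by hand and compared with `statistics.pvariance` (population) and `statistics.variance` (sample, dividing by n − 1). The marks are reused from the mean example.

```python
from statistics import mean, pvariance, variance

marks = [34, 56, 67, 45, 34]
n = len(marks)
centre = mean(marks)  # 47.2

# Sum of squared differences from the mean.
squared_diffs = sum((x - centre) ** 2 for x in marks)

population_var = squared_diffs / n        # treat the marks as the whole population
sample_var = squared_diffs / (n - 1)      # treat the marks as a sample

print(round(population_var, 2), round(pvariance(marks), 2))  # 164.56 164.56
print(round(sample_var, 2), round(variance(marks), 2))       # 205.7 205.7
```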

Standard Deviation:

The standard deviation is the square root of the variance. It is a widely used measure of dispersion that describes how spread out a distribution is. Population standard deviation is written as σ and sample standard deviation is written as s.

Because the standard deviation is in the same unit as the data, it is easy to interpret.
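A short sketch with the same marks, using `statistics.pstdev` and `statistics.stdev`, confirms that each is the square root of the corresponding variance.

```python
from math import isclose, sqrt
from statistics import pstdev, pvariance, stdev, variance

marks = [34, 56, 67, 45, 34]

sigma = pstdev(marks)  # population standard deviation
s = stdev(marks)       # sample standard deviation (n - 1 in the denominator)

print(round(sigma, 2), round(s, 2))  # 12.83 14.34

# Each one is the square root of the matching variance.
print(isclose(sigma, sqrt(pvariance(marks))), isclose(s, sqrt(variance(marks))))
```

Both results are in marks, the same unit as the data, whereas the variances are in squared marks.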

Coefficient Of Variation:

The coefficient of variation (CV) is a statistical measure that expresses the amount of variability in a dataset relative to the mean. It is a dimensionless quantity that is expressed as a percentage.

CV = (standard deviation / mean) x 100%
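The formula can be sketched as below. The function name is made up, and using the sample standard deviation is an assumption; the population version works the same way.

```python
from math import isclose
from statistics import mean, stdev

def coefficient_of_variation(values):
    """CV as a percentage: (sample standard deviation / mean) * 100."""
    return stdev(values) / mean(values) * 100

marks = [34, 56, 67, 45, 34]
print(round(coefficient_of_variation(marks), 1))  # 30.4

# Rescaling the data (e.g. a change of unit) leaves the CV unchanged,
# which is what "dimensionless" means in practice.
rescaled = [x * 10 for x in marks]
print(isclose(coefficient_of_variation(marks), coefficient_of_variation(rescaled)))
```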

This is the end of part 1 of the Descriptive Statistics Cheat Sheet All in One.
In the second part, I am going to discuss the Measure of Position and Measure of Shape which consist of topics like quantiles, five-number summary, boxplot, covariance, correlation, skewness, kurtosis etc. Follow this link.

Thanks for Reading!

If you like this post, follow me on Medium and connect with me on LinkedIn.

Data Striver

Hey, my name is Tarun Kumar Mohapatra. I strongly believe that knowledge multiplies when it is shared with others.