Statistics 101: Grouped and Ungrouped Data - Let’s talk with data!

Rohan Bali
Published in Analytics Vidhya
5 min read · Feb 24, 2020

Data can be classified in various ways. One such distinction is between grouped and ungrouped data.

Everyone got DATA!

What is ungrouped data?

When data has not been placed into any categories and no aggregation or summarization has been applied to it, it is known as ungrouped data. Ungrouped data is also known as raw data.

What is grouped data?

When raw data has been organized into classes (for example, class intervals with a frequency for each), it is said to be grouped data.

For example, consider the following:

Height of students (raw/ungrouped data): 171, 161, 155, 155, 183, 191, 185, 170, 172, 177, 183, 190, 139, 149, 150, 150, 152, 158, 159, 174, 178, 179, 190, 170, 143, 165, 167, 187, 169, 182, 163, 149, 174, 174, 177, 181, 170, 182, 170, 145, 143

The following table shows the grouped data obtained from the raw data above.

NOTE: The grouped-data mean is explained later in this blog.
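To make the grouping step concrete, here is a minimal Python sketch that bins the raw heights above into class intervals and counts the frequency of each class. The class width of 10 and the starting point of 130 are assumptions chosen for illustration; they are not necessarily the intervals used in the table.

```python
from itertools import accumulate

heights = [171, 161, 155, 155, 183, 191, 185, 170, 172, 177, 183, 190, 139,
           149, 150, 150, 152, 158, 159, 174, 178, 179, 190, 170, 143, 165,
           167, 187, 169, 182, 163, 149, 174, 174, 177, 181, 170, 182, 170,
           145, 143]

CLASS_WIDTH = 10  # assumed class width
START = 130       # assumed lower limit of the first class


def group_data(values, start, width):
    """Return (lower, upper, frequency) rows for classes [lower, upper)."""
    if min(values) < start:
        raise ValueError("start must not exceed the smallest value")
    n_classes = (max(values) - start) // width + 1
    freqs = [0] * n_classes
    for v in values:
        freqs[(v - start) // width] += 1
    return [(start + k * width, start + (k + 1) * width, f)
            for k, f in enumerate(freqs)]


table = group_data(heights, START, CLASS_WIDTH)
frequencies = [f for _, _, f in table]
cumulative = list(accumulate(frequencies))  # running total of frequencies

for (lower, upper, f), cf in zip(table, cumulative):
    print(f"{lower} to under {upper}: frequency={f}, cumulative={cf}")
```

The last cumulative frequency always equals the number of raw observations, which is a quick check that no value was dropped while grouping.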

Before we study grouped and ungrouped data further, it is important to understand what we mean by “central tendencies”.

As the name suggests, central tendency has to do with the center: a measure of central tendency describes the central or typical value of a distribution. Common measures include the mean, median, mode, geometric mean, and harmonic mean. Percentiles and quartiles are strictly measures of position rather than of center, and the interquartile range is a measure of spread, but they are usually introduced alongside. The most commonly used measures are discussed below.

Understanding the measures of central tendencies of ungrouped data.

(i) MODE: The most frequently occurring value in a dataset is called the mode. A dataset is bimodal when two values tie for the highest frequency, and multimodal when more than two values tie.

eg 7, 11, 14.25, 15, 15, 15, 15, 15, 19, 19, 29, 81. The mode is 15.
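The example above can be checked with Python's standard library. This is a small sketch using `statistics.multimode`, which returns every value tied for the highest frequency, so it also covers the bimodal case. The second list is made up purely to show a tie.

```python
from statistics import multimode

data = [7, 11, 14.25, 15, 15, 15, 15, 15, 19, 19, 29, 81]
modes = multimode(data)
print(modes)  # [15]

# Hypothetical data with a tie between two values (bimodal).
tied = [1, 2, 2, 3, 3]
tied_modes = multimode(tied)
print(tied_modes)  # [2, 3]
```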

(ii) MEDIAN: The median of a dataset is the middle value when the values are arranged in order.

NOTE: For a dataset with an odd number of values, the median is the middle value. For a dataset with an even number of values, the median is the average of the two middle values.

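eg 15, 11, 14, 3, 21, 17, 22, 16, 19, 16, 5, 7, 9, 20, 4

Let’s arrange this data in ascending order:

3, 4, 5, 7, 9, 11, 14, 15, 16, 16, 17, 19, 20, 21, 22. There are n = 15 values, so the median sits at position (n+1)/2 = (15+1)/2 = 16/2 = 8. The 8th value is 15, so the median is 15.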

Advantage of the median: It is not influenced by extremely large or extremely small values, so it is robust to outliers.

“The data must be at least ordinal for the median to be meaningful”
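The odd/even rule and the robustness to outliers can be sketched in a few lines of Python. The function name `median_by_position` is just an illustrative helper, and the outlier value of 2200 is made up to show the effect.

```python
def median_by_position(values):
    """Median via the middle position(s) of the sorted values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value.
        return ordered[mid]
    # Even count: average of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


data = [15, 11, 14, 3, 21, 17, 22, 16, 19, 16, 5, 7, 9, 20, 4]
print(median_by_position(data))  # 15

# Replace the largest value with a made-up extreme outlier.
with_outlier = [2200 if v == 22 else v for v in data]
print(median_by_position(with_outlier))  # still 15

print(median_by_position([1, 2, 3, 4]))  # 2.5
```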

(iii) MEAN: Also known as the arithmetic average. It is calculated as the sum of all values divided by the number of values.

eg, The mean of “15,11,14,3,21,17,22,16,19,16,5,7,9,20,4” is 199/15 = 13.26667.
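As a quick check of the figure above, the same calculation in Python (a minimal sketch; `statistics.fmean` is shown only as the library equivalent of the manual formula):

```python
from statistics import fmean

data = [15, 11, 14, 3, 21, 17, 22, 16, 19, 16, 5, 7, 9, 20, 4]

# Sum of all values divided by the number of values.
mean_manual = sum(data) / len(data)
mean_library = fmean(data)

print(round(mean_manual, 5))  # 13.26667
```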

(iv) PERCENTILE: Percentiles are measures of position that divide a dataset into 100 parts. The nth percentile is the value such that at least n percent of the data lie at or below it and at most (100-n) percent lie above it.

Now, let’s see how to calculate percentiles.

STEP 1: Arrange the data in ascending order.

STEP 2: Compute the location of the percentile:

i = (P/100)*N

i: percentile location

N: total number of values in the dataset

P: the percentile of interest

STEP 3: Determine the percentile value using either (a) or (b):

(a) If ‘i’ is a whole number, the percentile is the average of the values at positions ‘i’ and ‘i+1’.

(b) If ‘i’ is not a whole number, the percentile is the value at the position given by the whole-number part of ‘i’, plus 1.

eg. Suppose we want to determine the 70th percentile of 1450 numbers.

i = (70/100)*1450

i = 1015

Since i is a whole number, the 70th percentile = (1015th number + 1016th number)/2
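The three steps can be sketched as a small Python function. The name `percentile_by_location` is hypothetical, and the sketch assumes P is a whole number between 1 and 99. It uses integer arithmetic for i so that the whole-number test is not affected by floating-point rounding. The eight-value list is made up to exercise case (b).

```python
def percentile_by_location(values, p):
    """P-th percentile via the location rule; p is an integer from 1 to 99."""
    if not values:
        raise ValueError("values must not be empty")
    if not 1 <= p <= 99:
        raise ValueError("p must be between 1 and 99")
    ordered = sorted(values)                        # STEP 1
    whole, remainder = divmod(p * len(ordered), 100)  # STEP 2: i = (P/100)*N
    if remainder == 0:
        # (a) i is whole: average the values at positions i and i+1 (1-based).
        return (ordered[whole - 1] + ordered[whole]) / 2
    # (b) i is not whole: take the value at position (whole part of i) + 1.
    return ordered[whole]


# 70th percentile of the numbers 1..1450: i = 1015, a whole number.
numbers = list(range(1, 1451))
print(percentile_by_location(numbers, 70))  # 1015.5

# 30th percentile of 8 values: i = 2.4, so the 3rd sorted value is used.
small = [14, 12, 19, 23, 5, 13, 28, 17]
print(percentile_by_location(small, 30))  # 13
```

Note that statistical libraries often interpolate between neighbouring values instead, so their results can differ slightly from this rule.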

(v) QUARTILE: Quartiles are measures of position that divide a dataset into four parts.

First Quartile = 25th percentile

Second Quartile = 50th percentile

Third Quartile = 75th percentile

Fourth Quartile = 100th percentile (the maximum value).

NOTE: The second quartile is equal to the median of the data.
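Quartiles can be computed with `statistics.quantiles` from Python's standard library, as in the sketch below. Be aware that the library interpolates between neighbouring values, so in general its quartiles may differ slightly from the location rule described above. The data is the list from the median example.

```python
from statistics import median, quantiles

data = [15, 11, 14, 3, 21, 17, 22, 16, 19, 16, 5, 7, 9, 20, 4]

# n=4 asks for the three cut points that split the data into four parts.
q1, q2, q3 = quantiles(data, n=4)

print("Q1:", q1)
print("Q2:", q2)
print("Q3:", q3)
print("Q2 equals the median:", q2 == median(data))
```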

Understanding the measures of variability of ungrouped data.

The measure of variability describes the spread or scatter of the dataset.

NOTE: Measuring variability alongside central tendency gives a more complete description of the data.

Both curves have the same mean but their scatter is different.

(i) RANGE: The difference between the largest value and the smallest value in a dataset is called the range of the dataset. Because it depends only on the two extreme values, it is strongly affected by outliers.

Range helps in the construction of control charts on the data.

(ii) INTERQUARTILE RANGE: The interquartile range is the difference between the third and first quartiles (Q3 - Q1). It covers the middle 50% of the data.

It comes in handy when we are more interested in the middle values than in the extreme ends.

(iii) MEAN ABSOLUTE DEVIATION: It is the average of the absolute values of deviations around the mean of the dataset.

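(iv) VARIANCE: It is the average of the squared deviations about the arithmetic mean for a set of numbers. For a population the sum of squared deviations is divided by N; for a sample it is divided by n-1.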

NOTE: The final result is expressed in terms of the squared unit of measurement.

(v) STANDARD DEVIATION: It is the square root of the variance.

eg, the standard deviation of the data in the above example is 6.086

NOTE: Standard deviations are used in computing confidence intervals and hypothesis testing. The standard deviation has the same unit as the raw data.

“The real usage of standard deviation can be understood through the Empirical rule and Chebyshev’s Theorem. Both will be discussed in detail in upcoming blogs”

(vi) COEFFICIENT OF VARIATION: It is the ratio of the standard deviation to the mean of the data, expressed as a percentage.

eg The coefficient of variation in the above example is (6.086/9.4)*100 = 64.7%.
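The six measures above can be computed together in a short Python sketch. It reuses the list from the median example, which is not the dataset behind the 6.086 figure, so the numbers it prints will differ from those quoted. It also assumes population formulas (dividing by N); for a sample, the variance would divide by n-1 instead.

```python
from math import sqrt
from statistics import fmean, quantiles

data = [15, 11, 14, 3, 21, 17, 22, 16, 19, 16, 5, 7, 9, 20, 4]
n = len(data)
mean = fmean(data)

# (i) Range: largest value minus smallest value.
data_range = max(data) - min(data)

# (ii) Interquartile range: Q3 - Q1 (library quartiles use interpolation).
q1, _, q3 = quantiles(data, n=4)
iqr = q3 - q1

# (iii) Mean absolute deviation: average absolute distance from the mean.
mad = sum(abs(x - mean) for x in data) / n

# (iv) Variance: average squared deviation (population form, assumed here).
variance = sum((x - mean) ** 2 for x in data) / n

# (v) Standard deviation: square root of the variance.
std_dev = sqrt(variance)

# (vi) Coefficient of variation: standard deviation relative to the mean, in %.
cv = std_dev / mean * 100

print(f"range={data_range}, IQR={iqr}, MAD={mad:.3f}")
print(f"variance={variance:.3f}, std dev={std_dev:.3f}, CV={cv:.1f}%")
```

The variance is in squared units while the standard deviation is back in the units of the raw data, which is why the latter is usually the one reported.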

Calculating measures of central tendencies of grouped data.

Consider the following data:

Mean = ∑fx/N = 6.93

Median = i + ((N/2 - cf)/MED) * CW = 7 + ((30 - 29)/19) * 2 = 7.105

Mode = The mode of grouped data is the midpoint of the modal class, i.e. the class with the highest frequency. The highest frequency in the above example is 19, for the interval 7 to 9. Hence, the mode is 8.

Abbreviations:

f: frequency of a class

x: midpoint of a class

N: total frequency

CW: class width (2 for the class “7 to 9”)

cf: cumulative frequency of the classes before the median class

i: lower limit of the median class. N/2 gives the location of the median value, i.e. 30 in the above example. 29 entries fall below the class interval “7 to 9”, so the 30th value lies in that class. Hence, i = 7 and cf = 29.

MED: the frequency of the class where the median lies. For the above example, MED = 19.
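The grouped-data calculations can be sketched in Python as follows. The frequency table in the code is an assumption: the class frequencies were chosen to agree with the figures quoted above (a total of 60, 29 entries before the class “7 to 9”, and a frequency of 19 in that class), and may not match the original table exactly. The function names are illustrative.

```python
# Assumed frequency table: (lower limit, upper limit, frequency).
classes = [(1, 3, 4), (3, 5, 12), (5, 7, 13), (7, 9, 19), (9, 11, 7), (11, 13, 5)]

total = sum(f for _, _, f in classes)

# Mean: each class is represented by its midpoint, weighted by its frequency.
grouped_mean = sum(f * (lower + upper) / 2 for lower, upper, f in classes) / total


def grouped_median(rows):
    """Lower limit of the median class plus the interpolated offset into it."""
    n = sum(f for _, _, f in rows)
    cumulative = 0  # frequency of all classes before the current one
    for lower, upper, freq in rows:
        if cumulative + freq >= n / 2:
            return lower + (n / 2 - cumulative) / freq * (upper - lower)
        cumulative += freq
    raise ValueError("frequency table is empty")


def grouped_mode(rows):
    """Midpoint of the class with the highest frequency."""
    lower, upper, _ = max(rows, key=lambda row: row[2])
    return (lower + upper) / 2


print(round(grouped_mean, 2))             # 6.93
print(round(grouped_median(classes), 3))  # 7.105
print(grouped_mode(classes))              # 8.0
```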

That’s all for this blog.

Coming up: Statistics 101: Hypothesis Testing and p-value - What’s the fuss about that!

Previous Blog: Statistics 101: Basics Visualization- Its good to be ‘seen’!

Rohan Bali
Analytics Vidhya

Data Analytics professional with majors in Computer Science Engineering. Enjoys problem-solving and propelling data-driven decisions.