Statistics — Deep dive — Part2

Pavan Ebbadi
Published in Analytics Vidhya
4 min read · Aug 31, 2020

In this part we will look at:
1. Identifying types of data
2. Frequency distribution of categorical and numerical data
3. Histogram
4. Types of histogram

Link to part1
Link to part3

Associating a type with data may look trivial, but it is a very important skill in statistics.
You could blindly assume that every numerical attribute is continuous and every text attribute is categorical, but without proper knowledge of the data this can be misleading.
E.g., prize money of competitors in a competition: the range of prize money may vary between competitions in different years, so the rank of the competitors (ordinal) is a more reliable measure than prize money treated as a continuous variable.
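
To make the prize money example concrete, here is a minimal sketch that turns prize money into ordinal ranks. The names, amounts and the `to_ranks` helper are all made up for illustration, and ties are not handled.

```python
# Hypothetical prize money for the same competition in two different years.
prizes_2019 = {"Asha": 5000, "Ben": 3000, "Chen": 1000}
prizes_2020 = {"Dana": 20000, "Eli": 12000, "Femi": 4000}


def to_ranks(prizes):
    """Map each competitor to an ordinal rank (1 = highest prize). Ties are not handled."""
    ordered = sorted(prizes, key=prizes.get, reverse=True)
    return {name: rank for rank, name in enumerate(ordered, start=1)}


ranks_2019 = to_ranks(prizes_2019)
ranks_2020 = to_ranks(prizes_2020)

print(ranks_2019)  # {'Asha': 1, 'Ben': 2, 'Chen': 3}
print(ranks_2020)
```

The raw amounts differ by a factor of four between the two years, but the ranks are on the same 1 to 3 scale, so they can be compared across years.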

A frequency distribution is the organization of raw data in table form, using classes and frequencies.

  1. Categorical Frequency Distributions
  2. Grouped Frequency Distribution

Categorical frequency distribution is used for data that can be placed in specific categories, such as nominal or ordinal level data. For example, data such as political affiliation, religious affiliation, or major field of study would use categorical frequency distributions.
Below is a survey of the gender and employment status of 12 people from a university.

M-Male, F-Female, U-Unemployed, E-Employed
Percentage = (f / N) × 100, where f is the frequency of the category and N is the total number of observations
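
A categorical frequency distribution like this can be sketched in a few lines of Python. The 12 responses below are hypothetical (they are not the survey's actual values), and `frequency_table` is an illustrative helper. The relative frequency f/N is multiplied by 100 to express it as a percentage.

```python
from collections import Counter

# Hypothetical responses for 12 people: (gender, employment status).
# M = Male, F = Female, U = Unemployed, E = Employed.
survey = [
    ("M", "E"), ("F", "U"), ("F", "E"), ("M", "U"),
    ("M", "E"), ("F", "E"), ("F", "U"), ("M", "E"),
    ("F", "E"), ("M", "U"), ("F", "E"), ("M", "E"),
]


def frequency_table(values):
    """Return {category: (frequency, percentage)} for a list of categories."""
    counts = Counter(values)
    n = len(values)
    return {category: (f, 100 * f / n) for category, f in sorted(counts.items())}


status_table = frequency_table([status for _, status in survey])
for category, (f, pct) in status_table.items():
    print(f"{category}: frequency={f}, percent={pct:.1f}%")
```

The same helper works for the gender column, or for the (gender, status) pairs if you want a combined table.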

A grouped frequency distribution is used when numerical data spans a large range. The data is grouped into classes that are more than 1 unit wide, and the minimum and maximum values of the data act as the outer boundaries.
E.g., suppose that in the above survey the age of each person was recorded along with employment status.

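Let’s calculate the max, min and range; then we will decide how many classes we want to divide Age into. The class width is the range divided by the number of classes, rounded up.

Five classes are created, starting from the lower limit and going up to the upper limit, as shown below, using the class width calculated above.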
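
The steps above can be sketched as follows. The ages are hypothetical, not the survey's real values. The width rule used here, rounding (range + 1) / classes up to a whole number, is one convention, chosen so that integer data with inclusive class limits is always fully covered; other textbooks round slightly differently.

```python
import math

# Hypothetical ages for 12 respondents (not the survey's real values).
ages = [18, 21, 22, 25, 27, 30, 33, 36, 41, 45, 52, 58]
num_classes = 5

low, high = min(ages), max(ages)
data_range = high - low

# Assumption: integer data with inclusive class limits, so the classes must
# cover (range + 1) distinct values; round the width up to a whole number.
width = math.ceil((data_range + 1) / num_classes)

# Each class is (lower limit, upper limit), both inclusive.
classes = [(low + i * width, low + (i + 1) * width - 1) for i in range(num_classes)]
frequencies = [sum(lo <= a <= hi for a in ages) for lo, hi in classes]

print(f"min={low}, max={high}, range={data_range}, width={width}")
for (lo, hi), f in zip(classes, frequencies):
    print(f"{lo}-{hi}: {f}")
```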

A couple of things I want to highlight before we jump into histograms.
Class boundaries: these are the actual cut-off values we choose around each class limit above. E.g., the class limit 16–32 can actually be used to record all ages between 15.5 and 32.5.
Open-ended distribution: this is used when there are outliers or when you are interested in a specific range of values. E.g., if one person in the above survey were 100 years old, the class width for 5 classes would make the distribution look skewed. In that case it would be better to make the last class “48 and above”.
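
Both ideas can be sketched briefly. The `class_boundaries` helper assumes the data is recorded to the nearest whole unit, so each boundary sits half a unit outside the class limit. The ages and the two closed classes are hypothetical.

```python
def class_boundaries(lower_limit, upper_limit, unit=1):
    """Extend class limits by half a unit on each side (assumes data recorded to `unit`)."""
    half = unit / 2
    return (lower_limit - half, upper_limit + half)


print(class_boundaries(16, 32))  # (15.5, 32.5)

# Open-ended last class: hypothetical ages including one outlier of 100.
ages = [18, 22, 25, 30, 33, 36, 41, 45, 52, 100]
closed_classes = [(16, 31), (32, 47)]  # hypothetical class limits
open_lower = 48                        # lower limit of "48 and above"

table = {f"{lo}-{hi}": sum(lo <= a <= hi for a in ages) for lo, hi in closed_classes}
table[f"{open_lower} and above"] = sum(a >= open_lower for a in ages)

for label, f in table.items():
    print(f"{label}: {f}")
```

The outlier is counted in the last class without stretching the width of every other class.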

Histograms

The histogram is a graph that displays the data by using contiguous vertical bars (unless the frequency of a class is 0) of various heights to represent the frequencies of the classes.

Figure: histogram of the record high temperatures for each of the 50 states (see Example 2–2).

Frequency is calculated for classes of equal width.
Each class is drawn on the X axis with the same width, and its frequency is plotted on the Y axis.
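
As a rough illustration, here is a text-only sketch that draws vertical bars of equal width, with classes along the X axis and frequency on the Y axis. The classes and frequencies are hypothetical, and `text_histogram` is an illustrative helper; in practice you would more likely use a plotting library.

```python
def text_histogram(labels, freqs):
    """Render equal-width classes as contiguous vertical bars made of '#'."""
    col = max(len(label) for label in labels) + 1  # same width for every class
    rows = []
    for level in range(max(freqs), 0, -1):
        bars = "".join("#" * col if f >= level else " " * col for f in freqs)
        rows.append(f"{level:>3} | {bars}".rstrip())
    rows.append("    +-" + "-" * (col * len(freqs)))
    rows.append("      " + "".join(label.ljust(col) for label in labels).rstrip())
    return "\n".join(rows)


# Hypothetical age classes and their frequencies.
labels = ["18-26", "27-35", "36-44", "45-53", "54-62"]
freqs = [4, 3, 2, 2, 1]

chart = text_histogram(labels, freqs)
print(chart)
```

Because every bar fills its whole column, adjacent bars touch, which is what distinguishes a histogram from a bar chart with gaps.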

Histogram Shapes

  1. Bell-shaped - has a single peak and tapers off at either end.
    E.g., any normal distribution without outliers, with values concentrated around the mean, such as the miles covered by a particular car for each gallon of gas.
  2. Uniform - is basically flat or rectangular.
    E.g., values that follow a perfect linear trend (evenly spaced) produce a uniform histogram.
  3. Right-skewed - when the peak of a distribution is to the left and the data values taper off to the right, the distribution is said to be positively or right-skewed.
  4. Left-skewed - when the data values are clustered to the right and taper off to the left, the distribution is said to be negatively or left-skewed.
  5. Bimodal - when a distribution has two peaks of the same height, it is said to be bimodal.
  6. U-shaped - peaks near the minimum and maximum values of the distribution, with lower frequency near the median.

Distributions can have other shapes in addition to the ones shown here; however, these are some of the more common ones that you will encounter in analyzing data.
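
The direction of skew can also be checked numerically. The sketch below uses the moment coefficient of skewness (third central moment divided by the 1.5 power of the second), which is positive for a right tail and negative for a left tail. The data and the 0.5 tolerance are arbitrary assumptions for illustration, and the function fails on data with zero variance.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2 ** 1.5 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


def describe_skew(values, tolerance=0.5):
    """Label the direction of skew; `tolerance` is an arbitrary cut-off."""
    g = skewness(values)
    if g > tolerance:
        return "right-skewed"
    if g < -tolerance:
        return "left-skewed"
    return "roughly symmetric"


# Hypothetical data sets.
right_tail = [1, 1, 1, 2, 2, 3, 4, 6, 9, 15]
left_tail = [16 - x for x in right_tail]  # mirror image of right_tail
symmetric = [1, 2, 3, 4, 5]

for name, data in [("right_tail", right_tail), ("left_tail", left_tail), ("symmetric", symmetric)]:
    print(name, describe_skew(data))
```

Note that bell-shaped, uniform, bimodal and U-shaped distributions can all be symmetric, so this number only separates the skewed shapes from the rest; you still need the histogram to tell the symmetric shapes apart.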

Reference: Elementary Statistics — Bluman

Pavan Ebbadi
Analytics Vidhya

Senior Advisor of Analytics at CVS Caremark. Leading a team to build Personalization engine for CVS customers with stats, machine learning and deep learning.