Descriptive Statistics with Python — Learning Day 1

Data Types and Frequency Distributions

Gianpiero Andrenacci
Data Bistrot
15 min read · Jul 10, 2024


Understanding Data Types

In data analysis, it is fundamental to comprehend the different types of data you are dealing with. This understanding influences the choice of statistical methods and the measurement scales you employ.

Data can be broadly classified into three categories: qualitative data, ranked data, and quantitative data. Each type necessitates distinct handling and analytical techniques to yield meaningful insights.

Three Types of Data

Qualitative Data

Qualitative data, also known as categorical data, encompasses attributes or characteristics that cannot be quantified numerically. Instead, these data points are classified into distinct categories or groups. For instance, consider data on gender (male, female), colors (red, blue, green), or brands (Apple, Samsung, Google).

These examples illustrate that qualitative data are essentially about naming or labeling without implying any quantitative measurement. The primary statistical operations with qualitative data involve counting the occurrences within each category and determining the mode, or the most frequent category.
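As a quick illustration (using a made-up list of colors), counting category occurrences and finding the mode takes only a couple of pandas calls:

```python
import pandas as pd

# Hypothetical qualitative data: observed colors
colors = pd.Series(['red', 'blue', 'red', 'green', 'blue', 'red'])

# Count the occurrences within each category
print(colors.value_counts())

# The mode: the most frequent category
print(colors.mode()[0])
```

Here `value_counts()` tallies each category and `mode()` returns the most frequent one ('red' in this made-up sample).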

Ranked Data

Ranked data, or ordinal data, share similarities with qualitative data but include an inherent order or ranking among the categories. Despite this order, the intervals between ranks are not necessarily equal or known. For example, survey responses might be ranked as strongly agree, agree, neutral, disagree, and strongly disagree.

Similarly, class rankings (first, second, third) or levels of education (high school, bachelor’s, master’s, doctorate) fall into this category. The operations that can be performed on ordinal data include counting and determining the mode and median, but not mean or other arithmetic calculations due to the unequal intervals between ranks.
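To make this concrete, here is a small sketch (with invented survey responses) using a pandas ordered categorical, which preserves the ranking so the mode and the median rank can be found, while the mean remains undefined:

```python
import pandas as pd

# Hypothetical ordinal data: survey responses with an inherent order
levels = ['strongly disagree', 'disagree', 'neutral', 'agree', 'strongly agree']
responses = pd.Categorical(
    ['agree', 'neutral', 'strongly agree', 'agree', 'disagree'],
    categories=levels, ordered=True
)
s = pd.Series(responses)

# Mode and median rank are valid operations for ordinal data
print(s.mode()[0])                         # most frequent rank
print(levels[int(s.cat.codes.median())])   # median rank via the ordered codes
```

The median is taken on the ordered category codes, not on any numeric distance between ranks, which is exactly what ordinal data permit.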

Quantitative Data

Quantitative data represent numerical values that can be measured and ordered. This type of data can be further classified into discrete and continuous data. Discrete data consist of distinct, separate values, such as the number of students in a class. On the other hand, continuous data can take any value within a given range, such as height or weight.

Examples of quantitative data include age (years), salary (dollars), and temperature (Celsius). The numerical nature of quantitative data allows for a full range of arithmetic operations, including addition, subtraction, multiplication, and division, making it highly versatile for statistical analysis.
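For example, with a small hypothetical list of ages, the full range of arithmetic summaries is directly available:

```python
import numpy as np

# Hypothetical quantitative data: ages in years
ages = np.array([23, 35, 42, 29, 51])

print(ages.mean())              # arithmetic mean
print(ages.sum())               # total
print(ages.max() - ages.min())  # range (difference is meaningful for quantitative data)
```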

Measurement Scales

Measurement scales play a crucial role in data analysis as they specify the extent to which a number, word, or letter represents an attribute. This, in turn, determines the appropriateness of various arithmetic operations and statistical procedures.

Nominal Measurement

Nominal measurement involves categorical data. With nominal data, the primary operations are counting and identifying the categories. For instance, you might count how many males and females are in a sample, or which hair color is most common. Statistical procedures suitable for nominal data include the Chi-square test, which assesses the association between categorical variables.
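As a brief sketch of such a test, `scipy.stats.chi2_contingency` takes an observed contingency table (the counts below are made up) and returns the test statistic, p-value, and degrees of freedom:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table: hair color (rows) by gender (columns)
observed = np.array([[30, 10],
                     [20, 40]])

# Chi-square test of association between the two categorical variables
chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p-value = {p:.4f}, dof = {dof}")
```

A small p-value suggests that the two categorical variables are associated rather than independent.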

Ordinal Measurement

Ordinal measurement pertains to the ordinal data type. Examples include class rankings or survey responses. With ordinal data, you can count, find the mode, and determine the median, but not the mean, as the distances between ranks are not uniform.

Interval/Ratio Measurement

Interval and ratio measurements involve numerical data with equal intervals between values. The distinction between the two lies in the presence of a true zero point in ratio data, whereas interval data do not have a true zero.

Examples of interval data include temperature (Celsius, Fahrenheit), where zero does not represent the absence of temperature. Ratio data include height, weight, and age, where zero signifies the absence of the attribute. Interval and ratio data support a full range of arithmetic operations and statistical procedures, such as t-tests and ANOVA, due to their numerical nature and equal intervals.
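As an illustrative sketch (with made-up height measurements), an independent-samples t-test compares two group means, and a one-way ANOVA extends the comparison to three or more groups:

```python
from scipy.stats import ttest_ind, f_oneway

# Hypothetical ratio data: heights (cm) for three groups
group_a = [170, 172, 168, 175, 171]
group_b = [165, 167, 163, 166, 164]
group_c = [180, 182, 179, 181, 183]

# Independent-samples t-test on two groups
t_stat, p_value = ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# One-way ANOVA on three groups
f_stat, p_anova = f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_anova:.4f}")
```

These tests rely on the equal-interval, numerical nature of the data; they would not be meaningful for nominal or ordinal measurements.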

Frequency Distributions with Python

A frequency distribution is a way to organize and present data, either graphically or in a table, to show how often each value occurs within specified intervals. The frequency indicates how many times a value appears, while the distribution shows the pattern of these frequencies across the variable’s range.

The size of each interval is determined by the nature of the data and the analyst’s objectives. It is essential that intervals are mutually exclusive and collectively exhaustive, ensuring every data point fits into one and only one interval. Frequency distributions are commonly used in statistics to understand data patterns, often associated with normal distribution charts.

A frequency distribution is a valuable statistical tool that visually represents how observations are distributed within a dataset. Analysts frequently employ frequency distributions to visualize or illustrate data from a sample.

Key considerations when gathering data for a frequency distribution include ensuring that the intervals do not overlap and that they encompass all possible observations.

A frequency distribution helps us detect patterns in data by superimposing order on the variability among observations.

For instance, when examining the reaction times of airline pilots to a cockpit alarm, these reaction times can be categorized into various ranges.

When measuring the reaction times of 50 airline pilots, some will have very quick reactions, while others might be slower. However, it is highly likely that the majority will fall within a middle range.

Graphs of frequency distributions further aid our effort to detect data patterns and make sense out of the data.

For example, the appearance of a bell-shaped pattern in the frequency distribution of reaction times of airline pilots to a cockpit alarm suggests the presence of many small chance factors whose collective effect must be considered in pilot retraining or cockpit redesign.

Frequency Distributions for Quantitative Data

Constructing Frequency Distributions

Let’s construct frequency distributions using Python.

Step-by-Step Guide

  1. Collect Data: We will use a sample dataset for demonstration purposes. You can replace this with your own dataset.
  2. Create Classes: Define the intervals or classes for the data.
  3. Calculate Frequencies: Count the number of observations in each class.
  4. Visualize the Frequency Distribution: Use graphs to visualize the distribution.

Example: Constructing a Frequency Distribution

Let’s consider a dataset of reaction times (in milliseconds) of airline pilots to a cockpit alarm:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Sample data: Reaction times of airline pilots to a cockpit alarm (in milliseconds)
reaction_times = [450, 512, 480, 490, 500, 505, 520, 530, 480, 460, 470, 475, 485, 495, 515, 525, 535, 540, 550, 560]

# Create a DataFrame
df = pd.DataFrame(reaction_times, columns=['ReactionTime'])

# Display the first few rows of the DataFrame
df.head()

   ReactionTime
0           450
1           512
2           480
3           490
4           500
# Define class intervals (bins)
bins = [450, 470, 490, 510, 530, 550, 570]

# Assign each observation to a class interval
df['Class Interval'] = pd.cut(df['ReactionTime'], bins=bins, right=False)

# Calculate the frequency for each class
frequency_distribution = df['Class Interval'].value_counts().sort_index()

# Display the frequency distribution
print(frequency_distribution)

Class Interval
[450, 470)    2
[470, 490)    5
[490, 510)    4
[510, 530)    4
[530, 550)    3
[550, 570)    2
Name: count, dtype: int64
# Plot the frequency distribution
plt.figure(figsize=(10, 6))
plt.bar(frequency_distribution.index.astype(str), frequency_distribution.values, color='skyblue')
plt.xlabel('Class Interval')
plt.ylabel('Frequency')
plt.title('Frequency Distribution of Reaction Times')
plt.xticks(rotation=45)
plt.show()

Constructing frequency distributions is a fundamental step in data analysis. It helps in detecting patterns and examining the distribution of data. By following the guidelines and using Python for implementation, you can easily create and visualize frequency distributions for any quantitative data.

Guidelines for Frequency Distributions

Essential guidelines include:

  1. Each observation should be included in one, and only one, class.
  2. List all classes, even those with zero frequencies.
  3. All classes should have equal intervals.
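A quick sketch of checking guideline 3 programmatically, using the bin edges from the reaction-time example above:

```python
import numpy as np

# Bin edges from the reaction-time example
bins = [450, 470, 490, 510, 530, 550, 570]

# Widths of the class intervals
widths = np.diff(bins)
print(widths)

# All classes have equal intervals if every width matches the first
print(bool(np.all(widths == widths[0])))
```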

Relative and Cumulative Frequency Distributions

Relative Frequency Distributions

Relative Frequency Distributions represent the proportion of the total number of observations that fall within each class interval. This helps in understanding the distribution of data in terms of percentages, making it easier to compare datasets of different sizes.

Calculation of Relative Frequency:

Relative Frequency = Class Frequency / Total Number of Observations

Example:

If we have a dataset of reaction times with a total of 20 observations, and 5 of those fall within a specific class interval, the relative frequency for that class would be:

Relative Frequency = 5 / 20 = 0.25

The relative frequency is 25%.

Cumulative Frequency Distributions

Cumulative Frequency Distributions show the cumulative total of frequencies up to the upper boundary of each class interval.

This type of distribution helps in understanding the running total of frequencies, which is useful for determining percentiles and medians.

Calculation of Cumulative Frequency:

Cumulative Frequency = Frequency of the Class + Cumulative Frequency of the Preceding Class

Example:

Consider a dataset with the following class frequencies:

  • Class 1: 3
  • Class 2: 5
  • Class 3: 7

The cumulative frequencies would be:

  • Cumulative Frequency for Class 1: 3
  • Cumulative Frequency for Class 2: 3 + 5 = 8
  • Cumulative Frequency for Class 3: 8 + 7 = 15

By understanding both relative and cumulative frequency distributions, you can gain deeper insights into your data, such as the proportion of data within certain ranges and the accumulation of frequencies across class intervals.
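The running totals from the example above can be computed directly with `numpy.cumsum`:

```python
import numpy as np

# Class frequencies from the worked example above
frequencies = [3, 5, 7]

# Cumulative frequency: running total up to each class
cumulative = np.cumsum(frequencies)
print(cumulative.tolist())  # [3, 8, 15]
```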

Example: Reaction Times Dataset

We will use the same dataset of reaction times (in milliseconds) of airline pilots to a cockpit alarm to demonstrate how to calculate relative and cumulative frequency distributions.

Step-by-Step Guide

1. Collect Data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Sample data: Reaction times of airline pilots to a cockpit alarm (in milliseconds)
reaction_times = [450, 512, 480, 490, 500, 505, 520, 530, 480, 460, 470, 475, 485, 495, 515, 525, 535, 540, 550, 560]

# Create a DataFrame
df = pd.DataFrame(reaction_times, columns=['ReactionTime'])

2. Create Classes and Calculate Frequencies

# Define class intervals (bins)
bins = [450, 470, 490, 510, 530, 550, 570]

# Create a frequency distribution
df['Class Interval'] = pd.cut(df['ReactionTime'], bins=bins, right=False)

# Calculate the frequency for each class
frequency_distribution = df['Class Interval'].value_counts().sort_index()

# Display the frequency distribution
print("Frequency Distribution:\n", frequency_distribution)

3. Calculate Relative Frequencies

# Calculate relative frequency
relative_frequency = frequency_distribution / frequency_distribution.sum()

# Display the relative frequency distribution
print("\nRelative Frequency Distribution:\n", relative_frequency)

4. Calculate Cumulative Frequencies

# Calculate cumulative frequency
cumulative_frequency = frequency_distribution.cumsum()

# Display the cumulative frequency distribution
print("\nCumulative Frequency Distribution:\n", cumulative_frequency)

5. Combine All Distributions into a DataFrame

# Combine all distributions into a single DataFrame
frequency_table = pd.DataFrame({
    'Frequency': frequency_distribution,
    'Relative Frequency': relative_frequency,
    'Cumulative Frequency': cumulative_frequency
})

# Display the combined frequency table
print("\nFrequency Table:\n", frequency_table)
Frequency Table:
                Frequency  Relative Frequency  Cumulative Frequency
Class Interval
[450, 470)              2                0.10                     2
[470, 490)              5                0.25                     7
[490, 510)              4                0.20                    11
[510, 530)              4                0.20                    15
[530, 550)              3                0.15                    18
[550, 570)              2                0.10                    20

6. Visualize the Distributions

# Plot the frequency distribution
plt.figure(figsize=(12, 6))

# Frequency
plt.subplot(1, 3, 1)
plt.bar(frequency_table.index.astype(str), frequency_table['Frequency'], color='skyblue')
plt.xlabel('Class Interval')
plt.ylabel('Frequency')
plt.title('Frequency Distribution')
plt.xticks(rotation=45)

# Relative Frequency
plt.subplot(1, 3, 2)
plt.bar(frequency_table.index.astype(str), frequency_table['Relative Frequency'], color='lightgreen')
plt.xlabel('Class Interval')
plt.ylabel('Relative Frequency')
plt.title('Relative Frequency Distribution')
plt.xticks(rotation=45)

# Cumulative Frequency
plt.subplot(1, 3, 3)
plt.bar(frequency_table.index.astype(str), frequency_table['Cumulative Frequency'], color='salmon')
plt.xlabel('Class Interval')
plt.ylabel('Cumulative Frequency')
plt.title('Cumulative Frequency Distribution')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

By following these steps, we have successfully calculated and visualized the frequency, relative frequency, and cumulative frequency distributions for a dataset. This approach helps in understanding the distribution of data and gaining insights into patterns and trends within the data.

Key Points:

  • Relative Frequency shows the proportion of observations in each class.
  • Cumulative Frequency shows the running total of frequencies up to each class.
  • Visualization helps in better understanding and interpreting the distributions.

Using Python, you can easily calculate and visualize these distributions, providing valuable insights for data analysis.

Cumulative Percentages — Percentile Ranks

Cumulative percentages are a way to describe the relative position of any score within its parent distribution. When used in this context, they are referred to as percentile ranks. The percentile rank of a score indicates the percentage of scores in the entire distribution that have values equal to or smaller than that score.

For example, if a weight has a percentile rank of 80, it means that 80% of the weights in the entire distribution are equal to or lighter than that particular weight.

Understanding Percentile Ranks

Percentile Rank of an Observation: The percentile rank of an observation is the percentage of scores in the entire distribution with values that are equal to or smaller than that observation.

Calculation

To calculate the percentile rank of a score within a distribution:

  1. Sort the Data: Arrange the data in ascending order.
  2. Identify the Score: Locate the score of interest.
  3. Count the Values: Count the number of values that are equal to or smaller than the score.
  4. Calculate the Percentile Rank: divide that count by the total number of observations and multiply by 100.

import numpy as np
import pandas as pd

# Sample data: Weights of individuals in kg
weights = [50, 55, 60, 65, 70, 75, 80, 85, 90, 95]

# Convert to a pandas Series
weights_series = pd.Series(weights)

# Calculate the percentile rank for a specific weight
score = 75
percentile_rank = (weights_series <= score).mean() * 100
print(f"The percentile rank of the weight {score} kg is {percentile_rank:.2f}")

The percentile rank of the weight 75 kg is 60.00

This means that 60% of the weights in the distribution are 75 kg or lighter.

Explanation:

  1. Data Preparation: We have a list of weights of individuals.
  2. Conversion to Series: Convert the list to a pandas Series for easier manipulation.
  3. Percentile Rank Calculation: Apply the percentile formula for 75 kg: 6 of the 10 weights are less than or equal to 75 kg, so the percentile rank is 6/10 × 100 = 60%.

Key Points:

  • Percentile Ranks describe the relative position of a score within its distribution.
  • The percentile rank indicates the percentage of scores that are equal to or smaller than the given score.
  • Calculation involves sorting the data, locating the score, counting the number of values equal to or smaller than the score, and then converting this count into a percentage.

Frequency Distributions for Qualitative (Nominal) Data

Frequency distributions for qualitative (nominal) data summarize the count of observations for each category within a dataset. Unlike quantitative data, which can be measured and ordered numerically, nominal data represent categories that do not have an intrinsic order. Nominal data are used to label variables without providing any quantitative value. Examples of nominal data include gender, color, brand names, types of cuisine, etc.

Purpose and Use

The primary purpose of a frequency distribution for nominal data is to provide a clear picture of how often each category appears in the dataset. This helps in understanding the composition and structure of the data, which is particularly useful in fields such as market research, social sciences, and any domain where categorical data are prevalent. For example, in a survey studying consumer preferences for different smartphone brands, the frequency distribution would show how many respondents prefer each brand.

Creating Frequency Distributions

To create a frequency distribution for nominal data, follow these steps:

  1. Collect the Data: Gather the nominal data you want to analyze. This could be through surveys, observations, or any other data collection method.
  2. Count the Frequencies: Count the number of observations for each category. This is done by tallying how many times each category appears in the dataset.
  3. Visualize the Distribution: Use graphical representations such as bar charts or pie charts to visualize the frequency distribution. This helps in easily interpreting the data and identifying the most or least common categories.

Example: Favorite Fruit Survey

Consider a survey where participants are asked about their favorite fruit. The data collected might look like this:

import pandas as pd
import matplotlib.pyplot as plt

# Sample data: Favorite fruit survey results
favorite_fruit = [
'Apple', 'Banana', 'Apple', 'Orange', 'Banana', 'Apple',
'Grapes', 'Apple', 'Orange', 'Banana', 'Grapes', 'Apple',
'Orange', 'Apple', 'Banana', 'Grapes', 'Banana', 'Apple'
]

# Create a DataFrame
df = pd.DataFrame(favorite_fruit, columns=['FavoriteFruit'])

# Calculate the frequency distribution
frequency_distribution = df['FavoriteFruit'].value_counts()

# Display the frequency distribution
print("Frequency Distribution:\n", frequency_distribution)

The output is this:

Frequency Distribution:
FavoriteFruit
Apple     7
Banana    5
Orange    3
Grapes    3
Name: count, dtype: int64

Visualizing the Frequency Distribution

To better understand and communicate the distribution of favorite fruits, visualizations can be created.

Bar Chart:

# Plot the frequency distribution
plt.figure(figsize=(10, 6))
plt.bar(frequency_distribution.index, frequency_distribution.values, color='skyblue')
plt.xlabel('Favorite Fruit')
plt.ylabel('Frequency')
plt.title('Frequency Distribution of Favorite Fruits')
plt.show()

Pie Chart:

# Plot the frequency distribution as a pie chart
plt.figure(figsize=(8, 8))
plt.pie(frequency_distribution, labels=frequency_distribution.index, autopct='%1.1f%%', colors=['red', 'yellow', 'orange', 'purple'])
plt.title('Favorite Fruits Distribution')
plt.show()

Importance of Frequency Distributions for Nominal Data

Frequency distributions for nominal data are important for several reasons:

  • Data Summarization: They provide a clear summary of how data are distributed across different categories.
  • Pattern Recognition: They help identify patterns and trends within the data. For instance, in the favorite fruit survey, it is easy to see that ‘Apple’ is the most preferred fruit.
  • Decision Making: They aid in decision-making processes by highlighting which categories are more or less common. This is particularly useful in market research and product development.
  • Communication: Visual representations of frequency distributions make it easier to communicate findings to stakeholders, who may not be familiar with raw data.

Understanding and utilizing frequency distributions for qualitative (nominal) data is a fundamental skill in data analysis. By summarizing the counts of each category and visualizing them effectively, analysts can glean valuable insights and make informed decisions based on the composition and structure of the data. This approach is essential in many fields, including market research, social sciences, and any domain where categorical data play a crucial role.

Frequency Distribution with Ranked Data

Frequency distributions with ranked data (ordinal data) summarize the count of observations within each rank or category. Ranked data share characteristics with nominal data in that they categorize observations, but they also have an intrinsic order. However, the intervals between ranks are not necessarily equal or known.

Examples of ranked data include survey responses (e.g., “Strongly Agree” to “Strongly Disagree”), class rankings, and levels of education.

Purpose and Use of a frequency distribution for ranked data

The main purpose of a frequency distribution for ranked data is to understand how observations are distributed across different ranks. This type of distribution helps identify patterns, trends, and the central tendency within the data.

For instance, in a customer satisfaction survey, a frequency distribution can reveal how many customers fall into each satisfaction category.

Creating Frequency Distributions for Ranked Data

To create a frequency distribution for ranked data, follow these steps:

  1. Collect the Data: Gather the ranked data you want to analyze. This could be through surveys, questionnaires, or any other data collection method.
  2. Count the Frequencies: Count the number of observations for each rank. This involves tallying how many times each rank appears in the dataset.
  3. Visualize the Distribution: Use graphical representations such as bar charts to visualize the frequency distribution. This helps in easily interpreting the data and identifying the most or least common ranks.

Example: Customer Satisfaction Survey

Consider a survey where customers are asked to rate their satisfaction with a product on a scale from “Very Unsatisfied” to “Very Satisfied.” The data might look like this:

import pandas as pd
import matplotlib.pyplot as plt

# Sample data: Customer satisfaction survey results
customer_satisfaction = [
'Very Satisfied', 'Satisfied', 'Neutral', 'Dissatisfied', 'Very Satisfied',
'Satisfied', 'Satisfied', 'Neutral', 'Very Satisfied', 'Very Unsatisfied',
'Neutral', 'Satisfied', 'Very Satisfied', 'Dissatisfied', 'Neutral',
'Satisfied', 'Very Satisfied', 'Very Unsatisfied', 'Satisfied', 'Neutral'
]

# Create a DataFrame
df = pd.DataFrame(customer_satisfaction, columns=['Satisfaction'])

# Define the logical order of the satisfaction levels
satisfaction_levels = ['Very Unsatisfied', 'Dissatisfied', 'Neutral', 'Satisfied', 'Very Satisfied']
df['Satisfaction'] = pd.Categorical(df['Satisfaction'], categories=satisfaction_levels, ordered=True)

# Calculate the frequency distribution (sorted by rank rather than alphabetically)
frequency_distribution = df['Satisfaction'].value_counts().sort_index()

# Display the frequency distribution
print("Frequency Distribution:\n", frequency_distribution)

The output is this:

Frequency Distribution:
Satisfaction
Very Unsatisfied    2
Dissatisfied        2
Neutral             5
Satisfied           6
Very Satisfied      5
Name: count, dtype: int64

Visualizing the Frequency Distribution

To better understand and communicate the distribution of customer satisfaction, visualizations can be created.

Bar Chart:

# Plot the frequency distribution
plt.figure(figsize=(10, 6))
plt.bar(frequency_distribution.index.astype(str), frequency_distribution.values, color='#28579B')
plt.xlabel('Satisfaction Level')
plt.ylabel('Frequency')
plt.title('Frequency Distribution of Customer Satisfaction')
plt.show()

Importance of Frequency Distributions for Ranked Data

Frequency distributions for ranked data are important for the same reasons noted above for nominal data: data summarization, pattern recognition, decision making, and communication.

By summarizing the counts of each rank and visualizing them effectively, analysts can glean valuable insights and make informed decisions based on the composition and structure of the data.

Visual representation of the data types and levels of measurement

Bonus: This diagram has been created with the following python code

from graphviz import Digraph

# Create a new Digraph
dot = Digraph()

# Add nodes with appropriate labels
dot.node('A', 'Types of Data')
dot.node('B', 'Categorical or Qualitative Data')
dot.node('C', 'Numerical or Quantitative Data')
dot.node('D', 'Nominal Data')
dot.node('E', 'Ordinal Data')
dot.node('F', 'Discrete Data')
dot.node('G', 'Continuous Data')

# Add edges to represent the hierarchy
dot.edges(['AB', 'AC', 'BD', 'BE', 'CF', 'CG'])

# Render the diagram to a file
diagram_path = 'types_of_data_diagram'
dot.render(diagram_path, format='png')

diagram_path

