Descriptive Statistics with Python — Learning Day 2

Types of Variables and Visualization With Python

Gianpiero Andrenacci
Data Bistrot
12 min read · Jul 15, 2024



Discrete and Continuous Variables

In statistics and data analysis, a variable is a characteristic or property that can take on different values. Variables are the building blocks of data, and understanding their types is essential for analyzing data correctly and effectively.

In contrast to variables, a constant is a characteristic or property that can take on only one value, remaining fixed throughout an analysis.

This article explores the different types of variables, including discrete and continuous variables, as well as independent and dependent variables in the context of experiments and observational studies.

Discrete Variables

A discrete variable is one that consists of isolated numbers separated by gaps. These variables take on specific, distinct values and are often counted rather than measured. Examples of discrete variables include the number of students in a class, the number of cars in a parking lot, or the number of pets owned. Since discrete variables can only take specific values, there are gaps between the possible values.

Example:

  • Number of students in a class: 20, 21, 22 (cannot be 20.5)
  • Number of cars in a parking lot: 5, 10, 15 (cannot be 7.5)

Example: Visualizing Discrete Variables with Python

A bar chart uses rectangular bars to represent the frequency or count of each category. The length of each bar is proportional to the value it represents.

In the following Python code, we visualize the number of students in different classes using a horizontal bar chart.

import matplotlib.pyplot as plt

# Sample data: Number of students in different classes
classes = ['Class A', 'Class B', 'Class C', 'Class D', 'Class E', 'Class F']
students = [20, 25, 22, 27, 24, 30]

# Create a horizontal bar chart
plt.figure(figsize=(10, 6))
plt.barh(classes, students, color='#B03A2E')
plt.xlabel('Number of Students')
plt.ylabel('Class')
plt.title('Number of Students in Different Classes')
plt.show()

Why Use Bar Charts for Discrete Data?

  1. Distinct Categories: Bar charts are ideal for discrete data because they represent distinct, separate categories. Each bar corresponds to a unique category, making it easy to compare frequencies or counts across categories.
  2. Clear Visualization: The gaps between the bars in a bar chart highlight that the categories are separate and not continuous. This visual distinction helps to emphasize the discrete nature of the data.
  3. Versatility: Bar charts can be used both vertically and horizontally, accommodating various types of categorical data and making them flexible for different presentation needs (a vertical variant is sketched below).
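As a quick illustration of that flexibility, here is a minimal sketch that draws the same class data as a vertical bar chart with plt.bar (the data are reused from the example above):

import matplotlib.pyplot as plt

# Same sample data as the horizontal example above
classes = ['Class A', 'Class B', 'Class C', 'Class D', 'Class E', 'Class F']
students = [20, 25, 22, 27, 24, 30]

# Vertical variant: categories on the x-axis, counts on the y-axis
plt.figure(figsize=(10, 6))
plt.bar(classes, students, color='#B03A2E')
plt.xlabel('Class')
plt.ylabel('Number of Students')
plt.title('Number of Students in Different Classes')
plt.show()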

Continuous Variables

A continuous variable consists of numbers whose values, at least in theory, have no restrictions. These variables can take on any value within a given range and are typically measured rather than counted. Examples of continuous variables include height, weight, and temperature. Because continuous variables can take on any value within a range, they do not have gaps between possible values.

Example:

  • Height: 160.5 cm, 170.2 cm, 180.8 cm (can be any value within a range)
  • Temperature: 23.1°C, 23.2°C, 23.3°C (can be any value within a range)

Example: Visualizing a Continuous Variable with a Histogram

Visualize the distribution of a continuous variable (e.g., heights) using a histogram.

import numpy as np
import matplotlib.pyplot as plt

# Sample data: Heights in centimeters
heights = np.random.normal(170, 10, 1000) # Generate 1000 data points with a mean of 170 cm and a standard deviation of 10 cm

# Create a histogram
plt.figure(figsize=(10, 6))
plt.hist(heights, bins=30, color='skyblue', edgecolor='black')
plt.xlabel('Height (cm)')
plt.ylabel('Frequency')
plt.title('Distribution of Heights')
plt.grid(True)
plt.show()

Why Use Histograms for Continuous Data?

  • Continuous Range: Histograms are ideal for continuous data because they represent data points that fall within continuous intervals or ranges. Each bar in a histogram represents the frequency of data points within a specific range.
  • Distribution Insight: Histograms effectively show the distribution of continuous data, revealing patterns such as normal distribution, skewness, and the presence of outliers (see later).
  • Density Representation: Unlike bar charts, histograms have bars that touch each other, indicating the continuous nature of the data and the absence of gaps between the values.

Experiments, Independent and Dependent Variables

In the field of scientific research, an experiment is a structured investigation where the investigator decides who receives a special treatment or intervention. This carefully controlled manipulation, often referred to as the intervention, is essential for determining causal relationships between variables. By deliberately altering one aspect of the study, researchers can observe how it impacts other aspects, thus uncovering cause-and-effect dynamics.

At the heart of any experiment lies the independent variable. This is the treatment or condition that the investigator manipulates. It represents the presumed cause in the study. For instance, in a clinical trial testing a new drug, the independent variable might be the dosage of the drug administered to participants. By varying the dosage, researchers can explore how different levels of the drug influence outcomes.

The dependent variable, on the other hand, is the outcome or response that is measured to assess the effect of the independent variable. It represents the presumed effect in the study. Continuing with the clinical trial example, the dependent variable could be the improvement in patients’ health status after receiving the drug. By measuring health improvements, researchers can evaluate the efficacy of the drug and its potential benefits.

Imagine a clinical trial designed to test a new medication aimed at reducing blood pressure. Here, the independent variable is the dosage of the medication given to participants. Researchers might administer different dosages to different groups to see how varying levels impact blood pressure reduction.

The dependent variable in this scenario would be the actual change in blood pressure observed in the participants. By analyzing the data, researchers can determine whether higher doses lead to greater reductions in blood pressure, thus establishing a causal link between the drug dosage and health outcomes.

Experiments are a powerful tool in scientific research, enabling investigators to unravel the complexities of cause and effect. By manipulating the independent variable and measuring the dependent variable, researchers can gain valuable insights into the relationships between different factors. Whether in clinical trials, psychological studies, or any other field of inquiry, the role of independent and dependent variables is pivotal for designing effective experiments and drawing meaningful conclusions.

Example: Clinical Trial with Causal Effect

Objective: Show the correlation between different dosages of a drug and the corresponding improvement in health status.

Dataset: Simulated data for drug dosage and corresponding health improvement scores.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Simulated data: Dosages of a drug (independent variable)
dosages = [10, 20, 30, 40, 50] # mg

# Simulated data: Improvement in health status (dependent variable)
# Assuming a simple linear relationship for demonstration purposes
health_improvement = [8, 12, 18, 25, 30] # Health improvement score

# Creating a DataFrame to organize the data
data = pd.DataFrame({
    'Dosage (mg)': dosages,
    'Health Improvement Score': health_improvement
})

# Display the DataFrame
print(data)

# Plotting the data to visualize the causal effect
plt.figure(figsize=(10, 6))
plt.plot(dosages, health_improvement, marker='o', linestyle='-', color='b')
plt.xlabel('Dosage (mg)')
plt.ylabel('Health Improvement Score')
plt.title('Effect of Drug Dosage on Health Improvement')
plt.grid(True)
plt.show()

# Calculating the correlation coefficient
correlation = np.corrcoef(dosages, health_improvement)[0, 1]
print(f"Correlation between Dosage and Health Improvement: {correlation:.2f}")

# Linear regression to demonstrate the relationship
from sklearn.linear_model import LinearRegression

# Reshaping data for sklearn
X = np.array(dosages).reshape(-1, 1)
y = np.array(health_improvement)

# Creating and fitting the model
model = LinearRegression()
model.fit(X, y)

# Predicting values
y_pred = model.predict(X)

# Plotting the linear regression line
plt.figure(figsize=(10, 6))
plt.scatter(dosages, health_improvement, color='blue', label='Actual data')
plt.plot(dosages, y_pred, color='red', linestyle='--', label='Linear fit')
plt.xlabel('Dosage (mg)')
plt.ylabel('Health Improvement Score')
plt.title('Linear Regression: Drug Dosage vs Health Improvement')
plt.legend()
plt.grid(True)
plt.show()

# Displaying the regression coefficient (slope)
print(f"Regression coefficient (slope): {model.coef_[0]:.2f}")
print(f"Intercept: {model.intercept_:.2f}")
   Dosage (mg)  Health Improvement Score
0           10                         8
1           20                        12
2           30                        18
3           40                        25
4           50                        30

Explanation:

  1. Data Simulation: We simulate data for different dosages of a drug and the corresponding health improvement scores, assuming a simple linear relationship for demonstration purposes.
  2. DataFrame Creation: We create a Pandas DataFrame to organize and display the data.
  3. Data Visualization: We plot the data using a line plot to visualize the relationship between drug dosage and health improvement.
  4. Correlation Calculation: We calculate the correlation coefficient to quantify the strength and direction of the relationship between dosage and health improvement.
  5. Linear Regression: We use sklearn.linear_model.LinearRegression to fit a linear regression model to the data. This helps to demonstrate the correlation more clearly by providing a mathematical model of the relationship (we'll see linear regression in detail later in this series dedicated to descriptive statistics).
  6. Visualization of Linear Fit: We plot the actual data points and the linear regression line to visualize the fit of the model.
  7. Regression Coefficient: We display the regression coefficient (slope) and the intercept, which quantify the linear relationship between the independent and dependent variables.

This example demonstrates a correlation in which increasing the dosage of the drug is associated with greater health improvement, as evidenced by the correlation coefficient and the linear regression model.

Correlation is Not Causation

When analyzing data, it is essential to understand that correlation does not imply causation. Just because two variables show a correlation, it does not mean that changes in one variable cause changes in the other. This principle is critical when interpreting results, such as the relationship between drug dosage and health improvement scores in clinical trials.

In the article Descriptive Statistics with Python — Learning Day 5: Correlation and Causation, we will delve deeper into this concept, using simulated data from a clinical trial. We will use statistical tools, including the correlation coefficient and linear regression models, to demonstrate this relationship and discuss its limitations.

For instance, in our clinical trial example, we may find a strong correlation between drug dosage and health improvement scores. While this suggests that higher dosages are associated with better health outcomes, it does not prove that the drug dosage directly causes the improvement. There might be other factors such as patient age, overall health, or lifestyle choices that contribute to the observed effect.
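As a minimal sketch of this point (with simulated, entirely hypothetical numbers), the code below builds a dataset in which a hidden factor, patient age, drives both the dosage a patient receives and the improvement score. The two observed variables end up strongly correlated even though, by construction, neither causes the other:

import numpy as np

np.random.seed(42)  # arbitrary seed, for reproducibility only
n = 200

# Hidden factor: patient age (never measured in this hypothetical study)
age = np.random.uniform(30, 70, size=n)

# Age drives both variables; dosage has no direct effect on improvement
dosage = 0.5 * age + np.random.normal(0, 2, size=n)  # mg
improvement = 0.8 * age + np.random.normal(0, 3, size=n)  # score

# A strong correlation appears anyway, created entirely by the
# shared dependence on age
print(f"corr(dosage, improvement): {np.corrcoef(dosage, improvement)[0, 1]:.2f}")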

Confounding Variables

In the context of data analysis and experimental design, a confounding variable is an uncontrolled variable that compromises the interpretation of a study. It is an extraneous factor that can influence both the independent and dependent variables, potentially leading to incorrect conclusions about the relationship between them. Confounding variables can introduce bias and make it difficult to determine whether the observed effect is due to the independent variable or the confounder.

Example:

Consider a study examining the effect of exercise on weight loss. If the study does not account for participants’ diets, diet becomes a confounding variable because it can also affect weight loss. As a result, it would be unclear whether the observed weight loss is due to exercise, diet, or a combination of both.
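As a minimal sketch of this scenario (again with simulated, hypothetical numbers), the code below lets diet quality drive both exercise hours and weight loss. The raw correlation between exercise and weight loss then overstates the effect of exercise alone, while correlating the residuals after removing the diet effect isolates it:

import numpy as np

np.random.seed(0)  # arbitrary seed, for reproducibility only
n = 300

# Hypothetical confounder: diet quality score (0 = poor, 10 = excellent)
diet = np.random.uniform(0, 10, size=n)

# Diet influences both how much people exercise and how much weight they lose
exercise_hours = 0.4 * diet + np.random.normal(0, 1, size=n)
weight_loss = 0.2 * exercise_hours + 0.6 * diet + np.random.normal(0, 1, size=n)

# The raw correlation mixes the exercise effect with the diet effect
print(f"corr(exercise, weight loss): {np.corrcoef(exercise_hours, weight_loss)[0, 1]:.2f}")

# Controlling for diet: correlate the residuals after regressing each
# variable on diet (a simple form of partial correlation)
ex_resid = exercise_hours - np.poly1d(np.polyfit(diet, exercise_hours, 1))(diet)
wl_resid = weight_loss - np.poly1d(np.polyfit(diet, weight_loss, 1))(diet)
print(f"partial corr (diet removed): {np.corrcoef(ex_resid, wl_resid)[0, 1]:.2f}")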

Observational Study

An observational study is a study that focuses on detecting relationships between variables not manipulated by the investigator. Unlike experiments, observational studies do not involve intervention or manipulation of variables by the researcher. Instead, the researcher observes and analyzes existing conditions or behaviors to find correlations or associations.

Example:

A study examining the relationship between smoking and lung cancer rates is observational because the researcher does not control or manipulate the smoking behavior of the participants.

import pandas as pd
# Example of an observational study: Relationship between smoking and lung cancer rates
data = {
    'Smoking_Habit': ['Smoker', 'Non-Smoker', 'Smoker', 'Non-Smoker', 'Smoker'],
    'Lung_Cancer_Rate': [90, 10, 85, 15, 88]
}
df = pd.DataFrame(data)
print("Observational study data:\n", df)

# Analyze the relationship: encode the habit as 0/1 (point-biserial correlation)
smoker_indicator = df['Smoking_Habit'].apply(lambda x: 1 if x == 'Smoker' else 0)
correlation = smoker_indicator.corr(df['Lung_Cancer_Rate'])
print(f"Correlation between smoking habit and lung cancer rate: {correlation:.2f}")
Observational study data:
   Smoking_Habit  Lung_Cancer_Rate
0         Smoker                90
1     Non-Smoker                10
2         Smoker                85
3     Non-Smoker                15
4         Smoker                88
Correlation between smoking habit and lung cancer rate: 1.00

Independent and dependent variables are key concepts in experimental research, where manipulation of the independent variable allows researchers to observe effects on the dependent variable. In contrast, observational studies focus on detecting relationships without manipulation.

Typical Shapes of Frequency Distributions

Frequency distributions can take on various shapes, which reveal important characteristics about the data. Whether expressed as a histogram, a frequency polygon, or a stem-and-leaf display, the shape of a frequency distribution provides insights into the underlying patterns and trends within the dataset.
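Before looking at specific shapes, here is a minimal sketch of one of those alternative formats: a frequency polygon, built by plotting the bin midpoints of a histogram against the bin counts (the sample data and number of bins are arbitrary choices for illustration):

import numpy as np
import matplotlib.pyplot as plt

# Arbitrary sample data for illustration
data = np.random.normal(loc=170, scale=10, size=1000)

# Compute histogram counts, then join the bin midpoints with line segments
counts, edges = np.histogram(data, bins=20)
midpoints = (edges[:-1] + edges[1:]) / 2

plt.figure(figsize=(10, 6))
plt.plot(midpoints, counts, marker='o', linestyle='-')
plt.title('Frequency Polygon of Heights')
plt.xlabel('Height (cm)')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()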

Normal Distribution

A normal distribution, often referred to as the bell curve, is symmetric and has a characteristic bell-shaped silhouette. It is one of the most common distributions and can be analyzed using the well-documented normal curve. Examples of data that typically follow a normal distribution include:

  • Heights of adults
  • IQ scores

import numpy as np
import matplotlib.pyplot as plt

# Generating normal distribution data
data = np.random.normal(loc=170, scale=10, size=1000) # Heights of adults in cm

# Plotting the histogram
plt.figure(figsize=(10, 6))
plt.hist(data, bins=30, color='skyblue', edgecolor='black', density=True)
plt.title('Normal Distribution of Heights')
plt.xlabel('Height (cm)')
plt.ylabel('Density')
plt.grid(True)
plt.show()

Bimodal Distribution

A bimodal distribution has two distinct peaks, indicating the presence of two different types of observations within the same dataset. This type of distribution might be observed in:

  • Exam scores of two different classes
  • Daily temperatures in a city with two distinct climate patterns

import numpy as np
import matplotlib.pyplot as plt

# Generating bimodal distribution data
data1 = np.random.normal(loc=70, scale=5, size=500)  # Exam scores of class 1
data2 = np.random.normal(loc=85, scale=5, size=500)  # Exam scores of class 2
data = np.concatenate([data1, data2])

# Plotting the histogram
plt.figure(figsize=(10, 6))
plt.hist(data, bins=30, color='skyblue', edgecolor='black', density=True)
plt.title('Bimodal Distribution of Exam Scores')
plt.xlabel('Score')
plt.ylabel('Density')
plt.grid(True)
plt.show()

Positively Skewed Distribution (right-skewed)

A positively skewed distribution has a long tail extending to the right. This occurs due to a few extreme observations in the positive direction. Examples include:

  • The distribution of daily rainfall in a desert
  • The distribution of test scores in a very difficult exam

import numpy as np
import matplotlib.pyplot as plt

# Generating positively skewed distribution data
data = np.random.exponential(scale=10, size=1000)  # Daily rainfall in mm

# Plotting the histogram
plt.figure(figsize=(10, 6))
plt.hist(data, bins=30, color='skyblue', edgecolor='black', density=True)
plt.title('Positively Skewed Distribution of Daily Rainfall')
plt.xlabel('Rainfall (mm)')
plt.ylabel('Density')
plt.grid(True)
plt.show()

Negatively Skewed Distribution (left-skewed)

A negatively skewed distribution, also known as a left-skewed distribution, is a distribution in which more values are concentrated on the left side of the mean than on the right side.

In finance, a left-skewed distribution means there will likely be frequent small gains and few large losses. An investor would not want a negatively skewed return distribution because the large losses will cancel out the small gains.

It is important to remember that a distribution may be presented in other formats, such as a bar chart or histogram. Below is a left-skewed distribution presented as a histogram.

import numpy as np
import matplotlib.pyplot as plt

# Generating negatively skewed distribution data
data = np.random.beta(a=5, b=2, size=1000) * 100 # Adjusting parameters for left skew

# Plotting the histogram
plt.figure(figsize=(10, 6))
plt.hist(data, bins=30, color='skyblue', edgecolor='black', density=True)
plt.title('Negatively Skewed Distribution of Values')
plt.xlabel('Value')
plt.ylabel('Density')
plt.grid(True)
plt.show()

Identifying Skewness

Remembering whether a distribution is positively or negatively skewed can be challenging. The key is to focus on the direction of the few extreme observations rather than the majority of the data points.

  • Positively Skewed: Look for a longer tail on the right side.
  • Negatively Skewed: Look for a longer tail on the left side.

By referring to the shapes and examples provided, one can better identify and understand the skewness in various datasets.
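For a numeric check to complement the visual one, here is a minimal sketch using scipy.stats.skew (this assumes SciPy is installed; it is not used elsewhere in this article). The function returns a positive value for right-skewed data and a negative value for left-skewed data:

import numpy as np
from scipy.stats import skew

# Right-skewed sample (exponential) and left-skewed sample (Beta(5, 2) rescaled)
right_skewed = np.random.exponential(scale=10, size=1000)
left_skewed = np.random.beta(a=5, b=2, size=1000) * 100

print(f"Skewness of right-skewed data: {skew(right_skewed):.2f}")  # positive
print(f"Skewness of left-skewed data: {skew(left_skewed):.2f}")  # negative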

Brief Explanations of mathematical functions used in the article

np.random.normal

The np.random.normal function in NumPy generates random samples from a normal (Gaussian) distribution.

  • Syntax: np.random.normal(loc=0.0, scale=1.0, size=None)
  • loc: Mean (“centre”) of the distribution.
  • scale: Standard deviation (spread or “width”) of the distribution.
  • size: Output shape. If size is None, a single value is returned.

np.random.exponential

The np.random.exponential function in NumPy generates random samples from an exponential distribution.

  • Syntax: np.random.exponential(scale=1.0, size=None)
  • scale: The inverse of the rate parameter (1/λ). It is the mean of the distribution.
  • size: Output shape. If size is None, a single value is returned.

np.random.beta

The np.random.beta function in NumPy generates random samples from a Beta distribution.

  • Syntax: np.random.beta(a, b, size=None)
  • a: Alpha parameter of the Beta distribution (must be > 0).
  • b: Beta parameter of the Beta distribution (must be > 0).
  • size: Output shape. If size is None, a single value is returned.
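As a quick usage sketch of these three functions (the seed is arbitrary and only makes the output reproducible):

import numpy as np

np.random.seed(0)  # arbitrary seed, for reproducibility only

print(np.random.normal(loc=170, scale=10))  # single value, since size=None
print(np.random.normal(loc=170, scale=10, size=3))  # array of 3 samples
print(np.random.exponential(scale=10, size=3))  # mean 10, rate 1/10
print(np.random.beta(a=5, b=2, size=3))  # values in (0, 1)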
