Essential Math for Machine Learning: PDF and CDF
This article is part of the series Essential Math for Machine Learning.
Introduction
In the world of probability and statistics, understanding how events unfold is paramount. When dealing with continuous random variables — quantities that can take on an infinite range of values (like time, temperature, or distance) — two key concepts emerge as invaluable tools: the Probability Density Function (PDF) and the Cumulative Distribution Function (CDF).
Probability Density Function (PDF)
What it is
The PDF describes the relative likelihood of a continuous random variable taking on a specific value. For a continuous variable, the probability of hitting any exact value is zero (there are infinitely many possibilities). Instead, probabilities come from areas: the probability that the variable falls within a particular range is the area under the PDF curve over that range.
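To make that concrete, here is a minimal sketch. It assumes an illustrative standard normal distribution (my choice for the example, not anything specified in this article) and shows that the density at a single point is not a probability, while the area under the PDF over an interval is:

import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Illustrative choice: a normal distribution with mean 0 and standard deviation 1
mu, sigma = 0.0, 1.0

# The density at a single point is not a probability...
print(norm.pdf(0.0, mu, sigma))   # ~0.3989, a density value, not a probability

# ...but the area under the PDF over an interval is a probability.
area, _ = quad(lambda t: norm.pdf(t, mu, sigma), -1.0, 1.0)
print(area)                       # ~0.6827 = P(-1 <= X <= 1)

The same area can also be read off as a difference of CDF values, which is where the CDF (covered below) comes in.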
What to look for
- Shape: The shape of the PDF curve reveals the distribution’s nature (e.g., normal, uniform, exponential); the sketch after this list compares these three shapes.
- Peak(s): The peak(s) indicate the most likely value(s) or the mode(s) of the distribution.
- Spread: The spread indicates the variability or how much the values deviate from the central tendency.
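As a quick illustration of how shape, peaks, and spread differ across distributions, the following sketch plots the three PDFs named above. The parameter choices (standard normal, uniform on [0, 2], exponential with rate 1) are arbitrary and only meant for comparison:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, uniform, expon

x = np.linspace(-3, 5, 400)

# Illustrative parameters: standard normal, uniform on [0, 2], exponential with rate 1
plt.plot(x, norm.pdf(x, loc=0, scale=1), label='Normal(0, 1)')
plt.plot(x, uniform.pdf(x, loc=0, scale=2), label='Uniform(0, 2)')
plt.plot(x, expon.pdf(x, scale=1), label='Exponential(rate=1)')

plt.title('PDF shapes of common distributions')
plt.xlabel('x')
plt.ylabel('Probability Density')
plt.legend()
plt.show()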
Cumulative Distribution Function (CDF)
What it is
The CDF gives the probability that a random variable will take on a value less than or equal to a specific value. Equivalently, it is the accumulated area under the PDF from the far left up to that value.
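Here is a minimal sketch, again assuming an illustrative standard normal distribution, showing that the CDF at a point equals the accumulated area under the PDF, and that interval probabilities are differences of CDF values:

import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

mu, sigma = 0.0, 1.0   # illustrative standard normal

# CDF directly: F(1.0) = P(X <= 1.0)
print(norm.cdf(1.0, mu, sigma))                       # ~0.8413

# Same value obtained by accumulating area under the PDF up to 1.0
area, _ = quad(lambda t: norm.pdf(t, mu, sigma), -np.inf, 1.0)
print(area)                                           # ~0.8413

# Probability of an interval as a difference of CDF values
print(norm.cdf(1.0, mu, sigma) - norm.cdf(-1.0, mu, sigma))   # ~0.6827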
What to look for
- Monotonicity: The CDF always increases (or stays flat) as we move from left to right on the x-axis.
- Values: The CDF ranges from 0 to 1.
- Steepness: Steeper sections of the curve indicate regions where the probability density is higher; the sketch after this list makes this concrete.
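One way to see the steepness property is that the PDF is the derivative (slope) of the CDF. A small sketch, assuming an illustrative standard normal, that recovers the PDF by numerically differentiating the CDF:

import numpy as np
from scipy.stats import norm

x = np.linspace(-4, 4, 401)
cdf = norm.cdf(x)              # illustrative standard normal CDF

# Numerical slope of the CDF at each grid point
slope = np.gradient(cdf, x)

# The slope matches the PDF closely: steep CDF <=> high density
print(np.max(np.abs(slope - norm.pdf(x))))   # tiny, limited only by the grid spacing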
Target Shooting Example
Let’s imagine shooting arrows at a circular target. We’ll assume the aiming errors in the horizontal and vertical directions follow a normal (Gaussian) distribution centered on the bullseye. Because the two errors are independent Gaussians, each shot’s distance from the bullseye follows a Rayleigh distribution, which gives us a theoretical curve to compare against. We’ll use Python to simulate this and visualize the results.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import rayleigh
# Parameters
num_shots = 1000
target_radius = 1 # Radius of the target
# Simulate shots (assuming normally distributed errors)
mean_error = 0
std_dev_error = 0.2 # Standard deviation of errors (controls spread)
x_errors = np.random.normal(mean_error, std_dev_error, num_shots)
y_errors = np.random.normal(mean_error, std_dev_error, num_shots)
distances = np.sqrt(x_errors**2 + y_errors**2) # Distances from center
# PDF Visualization
plt.figure(figsize=(12, 5))
# Histogram (PDF approximation)
plt.subplot(1, 2, 1)
plt.hist(distances, bins=30, density=True, alpha=0.7, label='Simulated Shots')
plt.title('PDF of Distances from Center')
plt.xlabel('Distance from Center')
plt.ylabel('Probability Density')
# Theoretical PDF of the distance from the center: for independent Gaussian
# x/y errors, the distance follows a Rayleigh distribution with scale = std_dev_error
x = np.linspace(0, 1.5, 100)
pdf = rayleigh.pdf(x, scale=std_dev_error)
plt.plot(x, pdf, 'r-', label='Theoretical PDF')
plt.legend()
# CDF Visualization
plt.subplot(1, 2, 2)
cdf = rayleigh.cdf(x, scale=std_dev_error)
plt.plot(x, cdf, 'b-', label='CDF')
plt.title('CDF of Distances from Center')
plt.xlabel('Distance from Center')
plt.ylabel('Cumulative Probability')
plt.legend()
plt.show()
Explanation:
- We simulate normally distributed errors in the x and y coordinates.
- We use the Pythagorean theorem to compute each shot’s distance from the center.
- We plot a histogram of those distances as an approximation of the PDF.
- We overlay the theoretical PDF of the distance (a Rayleigh distribution whose scale equals the error standard deviation).
- Finally, we plot the corresponding theoretical CDF.
Output: The code will generate two plots:
- PDF: A histogram showing the distribution of distances, overlaid with the theoretical Rayleigh curve. The density peaks at a distance roughly equal to the spread of the errors (about 0.2 here) and tails off for larger distances, indicating a decreasing probability of hitting far from the center.
- CDF: An increasing curve. You can read off probabilities directly from the CDF. For example, the probability of hitting within a distance of 0.5 is the value of the CDF at x = 0.5; the short check after this list confirms it numerically.
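Here is a small, self-contained check of that last point, under the same assumptions as the simulation above (independent Gaussian errors with standard deviation 0.2, so the distance follows a Rayleigh distribution); the larger shot count here is only to tighten the empirical estimate:

import numpy as np
from scipy.stats import rayleigh

std_dev_error = 0.2
num_shots = 100_000  # more shots than above, just for a tighter empirical estimate

x_errors = np.random.normal(0, std_dev_error, num_shots)
y_errors = np.random.normal(0, std_dev_error, num_shots)
distances = np.sqrt(x_errors**2 + y_errors**2)

# Empirical probability of landing within 0.5 of the center
print(np.mean(distances <= 0.5))

# Theoretical value: the CDF of the Rayleigh distribution evaluated at 0.5
print(rayleigh.cdf(0.5, scale=std_dev_error))  # ~0.956

The two numbers should agree to about two decimal places, which is exactly what “reading a probability off the CDF” means in practice.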