Stories by Xinzhe Li, PhD in Language Intelligence on Medium

Efficiently Track and Cache Your LLM API Usage with LLMTrack

Xinzhe Li, PhD in Language Intelligence — Sun, 16 Mar 2025 07:34:40 GMT

Are you tired of repeatedly incurring costs and facing latency issues due to redundant API calls when working with Large Language Models (LLMs)? Do you want detailed insights into how different models consume tokens for your projects? Meet LLMTrack, the simple yet powerful Python package designed to streamline your workflow by caching results and precisely tracking token usage per model.

Why LLMTrack?

As developers, data scientists, or AI enthusiasts, we frequently interact with various LLMs such as OpenAI’s GPT models, Azure OpenAI, MoonShot, and Groq. Repeatedly calling these APIs for identical prompts wastes both time and money. Moreover, tracking token consumption helps optimize costs and ensures efficient usage of LLM resources.

LLMTrack addresses these exact pain points:

Caching: Avoids repeated API calls by caching responses for each LLM model.
Token Tracking: Records token usage separately for each LLM model.
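
To make the two ideas concrete, here is a minimal sketch of prompt caching combined with token counting. It is not LLMTrack's implementation: the `CachedLLM` class, the `fake_api` stand-in, and the word-count "tokens" are all hypothetical names and assumptions used only for illustration.

```python
import hashlib
import json


class CachedLLM:
    """Hypothetical wrapper: caches responses and sums token usage for one model."""

    def __init__(self, model_name, call_api):
        self.model_name = model_name
        self.call_api = call_api  # function: prompt -> (text, tokens_used)
        self.cache = {}
        self.token_usage = 0

    def cache_key(self, prompt):
        # Key on both model and prompt so different models never share entries.
        payload = json.dumps({"model": self.model_name, "prompt": prompt})
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def respond(self, prompt):
        key = self.cache_key(prompt)
        if key not in self.cache:
            # Only a cache miss reaches the API and consumes tokens.
            text, tokens = self.call_api(prompt)
            self.cache[key] = text
            self.token_usage += tokens
        return self.cache[key]


def fake_api(prompt):
    """Stand-in for a real API call; counts whitespace-separated words as tokens."""
    fake_api.calls += 1
    return "wonderful", len(prompt.split())


fake_api.calls = 0

llm = CachedLLM("demo-model", fake_api)
first = llm.respond("ONLY generate a positive word")
second = llm.respond("ONLY generate a positive word")  # served from the cache
print("API calls:", fake_api.calls, "| tokens used:", llm.token_usage)
```

The second call never reaches the API, which is where the savings in cost and latency come from.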

Quick Installation

Getting started with LLMTrack is straightforward:

pip install llmtrack

Simplified Workflow

Here’s how easy it is to integrate LLMTrack into your project:

from llmtrack import set_root_dir, get_llm

# Set up a custom root directory (optional)
set_root_dir("~/my_project/llmtrack")

# Obtain an LLM instance with caching and token tracking enabled
client_name = "openai"
model_name = "gpt-4o-mini"
llm = get_llm(f"{client_name}/{model_name}", cache=True, token_usage=True)

# Generate a response
usr_message = "ONLY generate a positive word"
client_response = llm.respond(usr_message, verbal=True)

The cache and token usage details are neatly organized in:

~/my_project/llmtrack/openai/gpt-4o-mini

Easily Monitor Your Usage

LLMTrack makes it easy to inspect your token usage and verify cache contents:

# Token Usage
usage = llm.token_usage
print('Token Usage:', usage)

# Check Cache
cache_key = llm.get_cache_key(usr_message)
print('Cache:', llm.cache[cache_key])

Supported Models & Clients

LLMTrack seamlessly integrates with popular LLM APIs:

OpenAI: Supports models like gpt-4o-mini, gpt-3.5-turbo, and more (all models).
Azure OpenAI
MoonShot: e.g., moonshot/moonshot-v1-8k
Groq: Supports models listed in Groq’s documentation.

Just set the appropriate environment variables, and you’re good to go!

Enhance Productivity, Reduce Costs

Hopefully, you can stop worrying about repetitive API costs and start efficiently tracking your LLM interactions.

Efficiently Track and Cache Your LLM API Usage with LLMTrack was originally published in Level Up Coding on Medium, where people are continuing the conversation by highlighting and responding to this story.

Programming LLMs Step-by-Step(Part 1): Pre-Training Large Language Models in PyTorch

Xinzhe Li, PhD in Language Intelligence — Sun, 19 May 2024 11:09:04 GMT

A Step-by-Step Tutorial on Transformer Architecture, Data Preparation, and Practical Training Techniques from Scratch

Continue reading on Level Up Coding »

Implementation of LLM Agents: Should You Opt for LangChain?

Xinzhe Li, PhD in Language Intelligence — Mon, 29 Apr 2024 21:52:28 GMT

Comparing LangChain and From-Scratch Implementations

Continue reading on Level Up Coding »

From Theory to Code in Machine Learning (Part 2): Maximum Likelihood Estimation for Classification

Xinzhe Li, PhD in Language Intelligence — Wed, 24 Apr 2024 03:45:33 GMT

Implementing Logistic Regression from Scratch in Python

Continue reading on Level Up Coding »

Principal Component Analysis: A Comprehensive Explanation

Xinzhe Li, PhD in Language Intelligence — Tue, 26 Mar 2024 03:57:42 GMT

An Intuitive, Mathematical, and Step-by-Step Coding Guide

Photo by Joshua Sortino on Unsplash

In the ever-expanding world of data, understanding the underlying patterns and extracting meaningful insights is crucial for making informed decisions. One of the powerful tools at our disposal for simplifying complex data sets into understandable formats is Principal Component Analysis (PCA). This blog post aims to unravel the intuition and comprehensive methodology behind PCA. We will explore the concepts of decorrelation and variance maximization. These concepts are pivotal in transforming raw data into a reduced dimensional space, where the essence of the data is preserved while minimizing redundancy. Join us on this journey to demystify PCA, a cornerstone technique in the field of machine learning and data science, making it accessible and intuitive for enthusiasts and professionals alike.

Section 1: Why Should We Reduce Data Dimension?

One Running Example — Book Review: Imagine you’re trying to organize a set of book reviews based on whether they speak positively or negatively about the books. Each review mentions words like “bad,” “good,” or “superb” a certain number of times. You decide to keep track by creating a chart: for each review, you note how often each of these words appears. This chart is your document-word matrix, where each cell represents the frequency of a word in a document. For simplicity, we will use hypothetical frequencies that reflect the sentiment of each document.

import numpy as np

data = np.array([
    # good superb
    [0,    0],  # D1: Neutral or negative
    [3,    0],  # D2: Moderately positive
    [5,    2],  # D3: Positive
    [2,    5]   # D4: Highly positive
])

D1 shows no mention of either “good” or “superb”, suggesting a neutral or negative review.
D2 features “good” without “superb”, indicating moderate positivity.
D3 and D4 show increasing usage of “superb” alongside “good”, moving from positive to highly positive sentiments.

Redundancy and Risk of Using Correlated Features — “good” and “superb”: When we consider both “good” and “superb” together in our analysis, especially using simpler prediction models that rely on word counting, we might run into issues with correlation. This is because if “good” and “superb” are often used together in the same reviews, just adding up their counts might not give us the full picture. For example, a review that mentions “good” several times and “superb” a few times might end up being treated the same as one that only mentions “good” many times, even though the presence of “superb” suggests a stronger positive sentiment. This correlation between “good” and “superb” usage — where they increase together — can lead us to oversimplify our understanding of the reviews. In other words, by just counting words without considering the strength of sentiment they represent, we could miss out on the nuances between different levels of positivity. This approach might cause us to overlook the subtle but important distinctions that separate moderately positive reviews from those that are highly positive, leading to a less accurate or nuanced model of sentiment prediction.

Real-world Issues: In summary, real-world machine learning models, similar to the oversimplified word counting approach, can suffer from issues related to feature correlation and oversimplification across various domains, including regression, recommendation systems, and computer vision. These challenges stem from models not accounting for the nuanced relationships between features, leading to potential misinterpretations and inaccuracies. Specifically, in linear regression models, multicollinearity arises when independent variables are highly correlated. In this condition, the coefficients (or weights) assigned to the independent variables may not be accurately estimated. Note that the model might still predict well on average, but it does indicate specific problems related to the interpretation and stability of the coefficient estimates. For example, in real estate pricing models, features like the number of bedrooms and the size of the house might be correlated, complicating the interpretation of their individual contributions to the house price.
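
The real estate example can be illustrated with a small simulation. The sketch below uses synthetic data (the variable names, value ranges, and noise level are assumptions, not real housing data) and computes the variance inflation factor (VIF), which for two predictors equals 1 / (1 - r²), where r is their correlation. A large VIF signals unstable coefficient estimates.

```python
import numpy as np

# Synthetic data: bedrooms is mostly determined by house size (assumed relationship).
rng = np.random.default_rng(0)
size_m2 = rng.uniform(50, 250, 200)
bedrooms = size_m2 / 50 + rng.normal(0, 0.2, 200)

# Pearson correlation between the two predictors.
r = np.corrcoef(size_m2, bedrooms)[0, 1]

# With two predictors, the VIF of either one is 1 / (1 - r^2).
vif = 1.0 / (1.0 - r ** 2)
print(f"correlation = {r:.3f}, VIF = {vif:.1f}")
```

A common rule of thumb treats a VIF above 5 or 10 as a sign of problematic multicollinearity.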

Hence, here is Claim 1: We want decorrelated data.

Section 2: Why Should We Maximize Variance?

Another Example — Friends’ Hangout Frequencies

Here, we have two features across three different data points (people), representing their hangout frequency with two friends (Friend A and Friend B) in a week.

# Weekly frequency of hangouts with Friend A and Friend B.
hangout = np.array([
   #Friend A     Friend B
        [2,       3],    # Introverted Person (Me)
        [10,      4],    # Outgoing Person  (Person 1)
        [20,      5]     # Extremely Outgoing Person  (Person 2)
])

To determine which feature is more useful, you can think of it in terms of consistency and variability among the given data points and how these aspects can help predict or understand new data points (new persons).

Variability: Friend A shows a wide range of hangout frequencies: from 2 to 20 times a week. This variability means that how often someone hangs out with Friend A could tell us a lot about their social habits or preferences.
Consistency: Friend B, on the other hand, shows less variability (from 3 to 5 times a week), suggesting that people’s interactions with Friend B are more uniform.

One point deserves emphasis. Because the hangout features uniquely identify each data point, a model trained to predict each individual person would overfit: it would learn the specifics of the training data too closely, impairing its ability to generalize to new, unseen data. Exploring patterns is a different task. It focuses on uncovering broader trends and relationships rather than identifying individuals, so overfitting is not a concern here: we care about the general pattern underlying the data, not about each data point.

Therefore, here is Claim 2: Features with greater variance often have more predictive power, so we want to maximize variance.

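Someone might argue that Friend A is simply friendlier and hangs out more often, and that this is why we should choose A. So let’s look at an updated example with a third friend. Everyone hangs out with Friend C 12 times a week. That is a high frequency, yet the feature has zero variance, so it tells us nothing about how the people differ:

# Weekly frequency of hangouts with Friend A, Friend B, and Friend C.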
hangout = np.array([
   #Friend A     Friend B  Friend C
        [2,       3,        12],    # Introverted Person
        [10,      4,        12],    # Outgoing Person 
        [20,      5,        12]     # Extremely Outgoing Person 
])
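
The point of the three-friend table can be checked numerically. The following sketch simply computes the variance of each column; the `names` list and the `most_informative` variable are illustrative additions.

```python
import numpy as np

# Same three-friend table as above (rows: people, columns: friends).
hangout = np.array([
    [2, 3, 12],
    [10, 4, 12],
    [20, 5, 12],
])
names = ["Friend A", "Friend B", "Friend C"]

# Variance of each feature (column) across the three people.
variances = hangout.var(axis=0)
for name, v in zip(names, variances):
    print(f"{name}: variance = {v:.2f}")

# The feature with the largest variance separates the people best.
most_informative = names[int(np.argmax(variances))]
print("Highest variance:", most_informative)
```

Friend C has the highest average frequency but zero variance, so it cannot distinguish the three people, while Friend A spreads them out the most.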

Let’s go back to the book review example.

Selecting Features According to Variation (Variance): Using the code snippet below, we can clearly see the variance of both features, as shown in Figure 1.

import matplotlib.pyplot as plt

good = data[:,0]
superb = data[:,1]
fig, ax = plt.subplots(figsize=(5, 3))
ax.hist(good, bins=15, alpha=1, label=f'Var(good)={np.var(good):.2f}')
ax.hist(superb, bins=15, alpha=0.5, label=f'Var(superb)= {np.var(superb):.2f}')
ax.set_title('Variance Demonstration')
ax.legend()
plt.show()

Figure 1: Variance of Word Frequency. The vertical axis shows the number of documents in which each word frequency occurs.

According to Claim 2, we should choose the “superb” feature to represent the data when reducing its dimension. However, focusing solely on “superb” risks losing information about moderately positive reviews, like D2, where “good” is present but “superb” is not. This exclusion could lead to an incomplete understanding of the sentiment spectrum represented in the data.

Therefore, here is Claim 3: New features should be generated by considering all the existing features.

Section 4: Principal Component Analysis

PCA is designed to generate new dimensions (features) that satisfy all three claims above. Specifically, it finds projection vectors, each of which meets the two desired properties.

Maximizing variance: Data is projected onto the directions that spread out the data.
Orthogonality and Decorrelation: The directions are orthogonal to each other.

This is what PCA does via the following processes:

Computing the covariance matrix of the centered data (the data shifted so that each feature has mean 0).
Performing Eigenvalue Decomposition on the covariance matrix to obtain the eigenvectors and eigenvalues.
Ordering the eigenvectors by decreasing eigenvalues and choosing a subset to form a matrix of principal components.
Transforming the original data through this matrix to obtain the reduced dataset.
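
The four steps above can be sketched as one compact function. This is a minimal illustration rather than a reference implementation: the function name `pca` is hypothetical, and it uses `np.cov` and `np.linalg.eigh` (suited to symmetric matrices) instead of the hand-written helpers developed later in this post.

```python
import numpy as np


def pca(X, n_components):
    """Minimal PCA sketch following the four steps listed above."""
    # Step 1: center the data and compute the covariance matrix (divides by n-1).
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    # Step 2: eigendecomposition; eigh returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Step 3: sort by decreasing eigenvalue and keep the leading eigenvectors.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :n_components]
    # Step 4: project the centered data onto the chosen components.
    return X_centered @ components, eigvals, components


# Book review data: columns are the counts of "good" and "superb".
data = np.array([[0, 0], [3, 0], [5, 2], [2, 5]], dtype=float)
reduced, eigvals, components = pca(data, n_components=1)
print("Eigenvalues:", eigvals)
print("Reduced data shape:", reduced.shape)
```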

Generally, PCA identifies directions (eigenvectors p) in the feature space where the data (as represented by the covariance matrix A) shows the most spread or variance. To fully understand how these steps lead to eigenvectors satisfying the above two properties, let’s delve into the two important mathematical concepts: Covariance and Eigenvector (and corresponding eigenvalues).

Section 4.1: Covariance

Covariance is the average of the products of the deviations of two features/variables from their respective sample means, taken over all data points. The covariance matrix is simply a square matrix that describes the covariance between each pair of variables in a dataset; the covariance of a variable with itself is its variance. Below are the calculations in Python.

def covar(x, y):
    """ Covariance of two variables.
    Args:
        x, y (np.ndarray): Two NumPy arrays of the same length."""
    return np.sum((x - x.mean())*(y - y.mean()))/len(x)

def covar_matrix(X):
    """ Covariance matrix of a dataset.
    Args:
        X (np.ndarray): A NumPy array of shape (n_samples, n_features)."""
    n_samples, n_features = X.shape
    covar_matrix = np.zeros((n_features, n_features))
    for i in range(n_features):
        for j in range(n_features):
            covar_matrix[i, j] = covar(X[:, i], X[:, j])
    return covar_matrix
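
As a sanity check on the definition, the same matrix can be obtained in two other ways: as a matrix product of the centered data, and with NumPy's built-in `np.cov`. The sketch below assumes the book review data from Section 1; `bias=True` makes `np.cov` divide by n, matching the functions above.

```python
import numpy as np

# Book review data: columns are the counts of "good" and "superb".
data = np.array([[0, 0], [3, 0], [5, 2], [2, 5]], dtype=float)

# Covariance as a matrix product of the centered data, divided by n.
centered = data - data.mean(axis=0)
manual = centered.T @ centered / len(data)

# NumPy's built-in version; rowvar=False treats columns as variables.
builtin = np.cov(data, rowvar=False, bias=True)

print(manual)
print(np.allclose(manual, builtin))
```

The diagonal entries are the variances of “good” and “superb”, and the off-diagonal entry is their covariance.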

No worries if the mathematical expression does not give you an intuitive understanding. What is covariance? Why does the covariance matrix help achieve the above goals? Let’s see it in examples.

Covariance: Tracking the Harmony Between Two Variables. With the hangout data, you’re trying to figure out which friends tend to be visited together. The covariance matrix is our way of keeping track of these appearances. It tells us if “Friend A” and “Friend B” often show up together (they’re positively related), if one shows up when the other doesn’t (negatively related), or if they don’t have much to do with each other at all. One point of confusion may be why the calculation of covariance requires subtracting the sample means. Think of the sample mean of a variable as a baseline, i.e., the number of occurrences without any preference. Only the co-occurrences that remain after removing this average reflect a true preference for hanging out with a particular friend, or whether two words are highly correlated. Otherwise, a large number of co-occurrences may arise simply because the person is very outgoing or the word is very common, e.g., “the”, “a”.

Why does the covariance matrix help achieve our goals? Tracking the underlying correlation structure of the data (e.g., which words move together) is a good way to explain the global relationships between features in our data with as few explanations as possible. In other words, we can start to see our data in simpler, clearer ways, like finding the main reasons people come to the party or the semantic groups of words.

Section 4.2: EigenDecomposition on Covariance Matrix

As defined in Wikipedia, a (nonzero) vector p of dimension N is an eigenvector of a square N × N matrix A if it satisfies a linear equation of the form Ap=λp for some scalar λ, which is called the eigenvalue corresponding to p.

Why does p in Ap=λp, where A is a covariance matrix, define an axis along which the transformed data reflects the dataset’s variance? In the context of a covariance matrix, p, being an eigenvector, serves as a special axis along which the covariance matrix’s action reduces to simple scaling. This scaling, quantified by λ, corresponds directly to the variance of the data along p: for a unit-length p, the variance of the centered data projected onto p is pᵀAp = λpᵀp = λ. In essence, p isolates a specific “mode” or pattern of variance within the dataset, with λ quantifying the magnitude of this variance. This connection between λ and the variance along p is what lets us construct a projection matrix, consisting of eigenvectors sorted by their corresponding eigenvalues, to transform the original data.
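
The relationship between λ and the variance along p can be verified numerically. The sketch below assumes the book review data and uses `np.cov` (dividing by n-1) together with `np.linalg.eigh`; the variable names are illustrative.

```python
import numpy as np

# Book review data: columns are the counts of "good" and "superb".
data = np.array([[0, 0], [3, 0], [5, 2], [2, 5]], dtype=float)
centered = data - data.mean(axis=0)

# Sample covariance matrix A (divides by n-1) and its eigenpairs.
A = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(A)

checks = []
for lam, p in zip(eigvals, eigvecs.T):  # eigenvectors are the columns of eigvecs
    # 1) A only scales p: A p equals lambda p.
    scaling_ok = np.allclose(A @ p, lam * p)
    # 2) The sample variance of the data projected onto p equals lambda.
    projected_var = (centered @ p).var(ddof=1)
    checks.append((scaling_ok, np.isclose(projected_var, lam)))
    print(f"lambda = {lam:.4f}, variance along p = {projected_var:.4f}")
```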

Let’s look at the eigenvalues and eigenvectors in the example of the document-word matrix. The code below follows the first three steps of PCA:

Computing covariance matrix
Performing Eigenvalue Decomposition
Ordering the eigenvalues and eigenvectors

The resulting eigenvectors are the components used to transform the original data so that both goals are met: maximizing variance and keeping orthogonality. They are shown in Figure 3 below.

def covar(x, y):
    """ Covariance of two variables.
    It measures how two variables change together by 
     the average of the product of the differences of each data point from the sample mean.
    Equation: cov(x, y) = Σ (x_i - mean(x)) * (y_i - mean(y)) / n
    Args:
        x, y (np.ndarray): Two NumPy arrays of the same length."""
    return np.sum((x - x.mean())*(y - y.mean()))/len(x)

def covar_matrix(X):
    """ Covariance matrix of a dataset.
    It is a square matrix that describes the covariance between two or more variables in a dataset.
    Args:
        X (np.ndarray): A NumPy array of shape (n_samples, n_features)."""
    n_samples, n_features = X.shape
    covar_matrix = np.zeros((n_features, n_features))
    for i in range(n_features):
        for j in range(n_features):
            covar_matrix[i, j] = covar(X[:, i], X[:, j])
    return covar_matrix

# plot eigenvectors on the data plot marked by components
def plot_eigenvectors(data, eigenvector):
    fig, ax = plt.subplots()
    ax.scatter(data[:,0], data[:,1])
    # add label on the data points (D1 to D4, matching the data definition)
    labels = ['Neutral or Negative', 'Moderately Positive', 'Positive', 'Highly Positive']
    for i in range(data.shape[0]):
        ax.text(data[i,0], data[i,1], labels[i], fontsize=6, ha='right')
    # np.linalg.eig returns eigenvectors as columns, so component i is eigenvector[:, i]
    for i in range(2):
        ax.quiver(0, 0, eigenvector[0,i], eigenvector[1,i], angles='xy', scale_units='xy', scale=1, color='red')
        ax.text(eigenvector[0,i], eigenvector[1,i], f'Component {i+1}', fontsize=6)
    ax.set_aspect('equal')
    ax.set_xlim(-1, 6)
    ax.set_ylim(-1, 6)
    plt.show()

# Step1: Computing covariance matrix (rescaled to divide by n-1, i.e., the sample covariance)
covmat=covar_matrix(data)* len(data) / (len(data) - 1)
print('Covariance matrix: {}'.format(covmat))
# Step2: Performing Eigenvalue Decomposition
lambd, eigenvector= np.linalg.eig(covmat)
# Step3: Ordering the eigenvalues and eigenvectors
idx = lambd.argsort()[::-1]
lambd = lambd[idx]
eigenvector = eigenvector[:,idx]
print('Eigen values: {}'.format(lambd))
print('Eigen vectors: {}'.format(eigenvector))
plot_eigenvectors(data, eigenvector)
# Covariance matrix: 
# [[4.33333333 0.83333333]
#  [0.83333333 5.58333333]]
# Eigen values: [6.         3.91666667]
# Eigen vectors: [[-0.4472136  -0.89442719]
#  [-0.89442719  0.4472136 ]]

Figure 3: The first two eigenvectors/components. The graph is returned via the above code.

Section 4.3: Transforming Data with Eigenvectors

From the visualization, it’s evident that each eigenvector serves as a means to project high-dimensional data onto a line, thus creating a transformed feature that encapsulates a significant aspect of the data’s variance. But a question arises: “How could we calculate these transformed values?” The process involves using the dot product between the original data and each eigenvector, allowing us to transform the data into a new dimensional space defined by the principal components.

# Step4: Projecting the data onto the new feature space
# matrix multiplication of the centered data with the eigenvector matrix
data_pca = np.dot(data-data.mean(axis=0), eigenvector) 
print(data_pca)

Section 4.4: Proving The Above Claims

Claim 1: We want decorrelated data. By running the code below, we can see that the covariance between the two transformed features is approximately zero, indicating that the features are decorrelated.

covar(data_pca[:,0], data_pca[:,1])
# 2.220446049250313e-16-> approximating to 0

Claim 2: Features with greater variance often have more predictive power, i.e., maximizing variance. If you compare the variances printed by the code below with the variances of the original features (4.19 for “superb” and 3.25 for “good”, as shown in Figure 1), the variance on the 1st component (4.5) is larger than both. Although the variance on the 2nd component (2.94) is smaller, the key is that it is orthogonal to the first, i.e., it removes the correlation.

# Step4: Projecting the data onto each component separately
# (centering is skipped here because it does not change the variance)
data_pca1 = np.dot(data, eigenvector[:,0]) 
data_pca2 = np.dot(data, eigenvector[:,1])
print('Variance of PCA transformed data: {}, {}'.format(np.var(data_pca1), np.var(data_pca2)))
print('Explained Variance Ratio: {}'.format(lambd/np.sum(lambd)))
# Variance of PCA transformed data: 4.5, 2.9375000000000004
# Explained Variance Ratio: [0.60504202 0.39495798]

We can visualize the original feature with the maximum variance (“superb”) and the one projected on the 1st component.

Figure 1: Feature with maximum variance vs. projection on the 1st component. Both are centered for comparison.

Finally, let’s verify that the sample variances (dividing by n-1) of the transformed data are indeed the eigenvalues λ, as shown below.

print('Variance of PCA transformed data: {}, {}'.format(np.var(data_pca1)*len(data)/(len(data)-1), np.var(data_pca2)*len(data)/(len(data)-1)))
# Variance of PCA transformed data: 6.0, 3.9166666666666674

Claim 3: New features should be generated by considering all the existing features. This follows directly from the construction: the dot product between an eigenvector and each data point is a linear combination of both features (“good” and “superb”). This also points to a limitation of PCA: because the transformation is linear, the transformed data can hardly capture non-linear relationships between features.

Conclusion

Principal Component Analysis stands as a monumental technique in the realm of data analysis and machine learning, offering a window into the underlying structure of complex datasets. By embarking on a journey through sentiment analysis in book reviews, we have witnessed firsthand the transformative power of PCA in isolating key features that embody the most significant variations within data.

Principal Component Analysis: A Comprehensive Explanation was originally published in Level Up Coding on Medium, where people are continuing the conversation by highlighting and responding to this story.

Statistical Inference v.s. Statistical Estimation v.s. ML Inference

Xinzhe Li, PhD in Language Intelligence — Mon, 18 Mar 2024 22:52:35 GMT

This blog post is meant to remind my students (and myself) of some rough arguments that come up during the initial learning of statistical concepts.

Before providing my arguments, let’s look at the definitions from the PSU’s lecture material:

A statistical inference aims at learning characteristics of the population from a sample; the population characteristics are parameters and sample characteristics are statistics.

Estimation represents ways or a process of learning and determining the population parameter based on the model fitted to the data.

Argument 1: Statistical Inference and Estimation are Essentially the Same

This argument is rooted in the observation that both statistical inference and estimation aim to learn about population parameters using sample data. Indeed, the core objective of both processes is to generalize findings from a sample to the broader population, thereby making them seem fundamentally similar.

However, statistical inference encompasses a broader set of activities than estimation alone. While estimation focuses on determining the values of population parameters (like means or proportions), inference also includes hypothesis testing, confidence interval construction, prediction, and more. Estimation is a component of inference, specifically concerned with the calculation of an estimate of a parameter. So, while they share a common goal of understanding population parameters, inference covers a wider array of statistical processes. Below are specific perspectives:

Hypothesis Testing: Evaluating hypotheses about the population parameters, using sample data. This involves determining the likelihood of observing the sample data under different assumptions (null vs. alternative hypotheses) and making decisions about which hypothesis is more consistent with the observed data.
Model Selection and Comparison: Inference involves comparing different statistical or machine learning models to determine which best describes the observed data, often using criteria that balance fit and complexity (e.g., Akaike Information Criterion, Bayesian Information Criterion).
Predictions and Generalization: Making predictions or decisions based on the data and models, and understanding the reliability and applicability of these predictions to the broader population.

Argument 2: The Difference Lies in the Requirement of a Model for Estimation

Both statistical inference and estimation can involve models, but the emphasis on models may vary depending on the specific approach or context.

Estimation often directly involves fitting models to data to derive parameter estimates: It indeed often involves the use of statistical models to determine the value of population parameters. These models help to account for randomness and uncertainty in the data, enabling the derivation of point estimates (single values) and interval estimates (ranges of values) for parameters. The process of estimation can be as simple as calculating a sample mean to estimate a population mean or as complex as fitting a regression model to estimate relationships between variables.
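
To make the difference between a point estimate and an interval estimate concrete, here is a minimal sketch using only the standard library. The sample values are made up for illustration, and the interval uses the normal approximation (z = 1.96 for roughly 95% coverage), which is an assumption; for a sample this small, a t-based interval would be somewhat wider.

```python
import math
import statistics

# Hypothetical sample of eight measurements.
sample = [4.8, 5.1, 5.5, 4.9, 5.3, 5.0, 5.2, 4.7]
n = len(sample)

# Point estimate of the population mean: the sample mean.
point_estimate = statistics.mean(sample)

# Standard error of the mean, from the sample standard deviation (n-1 denominator).
std_err = statistics.stdev(sample) / math.sqrt(n)

# Interval estimate: approximate 95% confidence interval (normal approximation).
z = 1.96
interval = (point_estimate - z * std_err, point_estimate + z * std_err)

print(f"Point estimate: {point_estimate:.4f}")
print(f"Approx. 95% CI: ({interval[0]:.4f}, {interval[1]:.4f})")
```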

Statistical Inference can either rely on model-based approaches or employ model-free methods: While it can involve model-based approaches, statistical inference also includes model-free methods, such as non-parametric tests or basic hypothesis testing that does not explicitly rely on a parametric model of the data distribution. Inference is about making judgments or drawing conclusions about population parameters, which can involve estimation but also goes beyond it to include testing hypotheses about those parameters.
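
A permutation test is one example of such a model-free method: it tests whether two groups differ without assuming any parametric distribution. The sketch below is illustrative only; the two groups are made-up numbers, and the number of permutations and the random seed are arbitrary choices.

```python
import numpy as np

# Hypothetical measurements for two groups.
group_a = np.array([12, 15, 14, 16, 13, 17])
group_b = np.array([9, 11, 10, 12, 8, 10])

# Test statistic: difference between the group means.
observed = group_a.mean() - group_b.mean()

# Under the null hypothesis the group labels are exchangeable,
# so we shuffle the pooled values and recompute the statistic.
rng = np.random.default_rng(0)
pooled = np.concatenate([group_a, group_b])
n_perm = 5000
count = 0
for _ in range(n_perm):
    shuffled = rng.permutation(pooled)
    diff = shuffled[:len(group_a)].mean() - shuffled[len(group_a):].mean()
    if abs(diff) >= abs(observed):
        count += 1

# Two-sided p-value; the +1 terms count the observed labeling itself.
p_value = (count + 1) / (n_perm + 1)
print(f"Observed difference: {observed}, p-value: {p_value:.4f}")
```

No parameter is estimated here and no distribution is fitted, yet the procedure still supports a conclusion about the population, which is what separates inference from estimation in this argument.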

Argument 3: Machine Learning Inference Prioritizes Predictions, a Component of Statistical Inference

Overlap with Statistical Inference: 1) Machine learning and statistical inference share common ground in that they both employ mathematical models to draw conclusions from data. 2) ML models prioritize the ability to generalize from learned data to make predictions about unseen data. This focus aligns with the prediction aspect of statistical inference, where the goal is to use sample data to make inferences about future or unknown outcomes.

Distinction: Statistical inference focuses on understanding relationships between variables, estimating parameters, and testing hypotheses, with a strong emphasis on model assumptions and theoretical underpinnings. ML models often do not share this focus. Instead, machine learning inference prioritizes the performance and accuracy of predictions over strict adherence to statistical assumptions, emphasizing computational methods and algorithmic innovation.

Data Wrangling and Preprocessing in Python: A Practical Guide

Xinzhe Li, PhD in Language Intelligence — Mon, 11 Mar 2024 05:41:09 GMT

Consider a dataset intended for a rating system, which spans a range from 1 to 10 but includes an outlier (100) and a missing value (NA): [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100, NA]. Addressing the peculiarities of this dataset involves several critical steps:

Understanding the Dataset Through Sample Statistics (noting the small size of this example compared to typically larger datasets)
Handling Missing or Incomplete Values
Identifying and Removing Outliers

For preprocessing multivariate data in preparation for machine learning applications, it’s crucial to:

Normalize the data, utilizing techniques such as Min-Max Scaling.

Section 1: Understanding the Dataset Through Sample Statistics

Sample statistics provide a snapshot of the data’s characteristics, offering insights into its central tendency, spread, and shape. Among the most common sample statistics are the mean, median, variance, standard deviation, and skewness, each serving a unique purpose in data analysis.

Sample Mean (Average):

The sample mean is the sum of all observations divided by the number of observations. However, the mean is sensitive to outliers, which can significantly skew its value away from the center of the majority of the data. We will analyze this effect further in the next section on handling missing values.

mean = sum(data) / len(data)
print(mean)

Median:

The median is the middle value when the dataset is ordered from smallest to largest. For an even number of observations, it is the average of the two middle numbers. The median is less sensitive to outliers than the mean and often provides a better measure of central tendency for skewed data.

data_sorted = sorted(data)
n = len(data_sorted)
median = data_sorted[n//2] if n % 2 != 0 else (data_sorted[n//2 - 1] + data_sorted[n//2]) / 2
print(median)

Variance:

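Variance measures the dispersion of a dataset. It is calculated as the average of the squared differences from the mean; for the sample variance, as in the code below, the sum of squared differences is divided by n - 1 rather than n. A higher variance indicates that the data points are more spread out from the mean.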

variance = sum((x - mean) ** 2 for x in data) / (len(data) - 1)

Standard Deviation:

The standard deviation is the square root of the variance. It provides a measure of the spread of the data points around the mean in the same units as the data.

standard_deviation = variance ** 0.5

Skewness:

Skewness measures the asymmetry of the distribution of values about the mean. A positive skewness indicates a right-skewed distribution with a long tail on the right, while a negative skewness indicates a left-skewed distribution with a long tail on the left.

from scipy.stats import skew
skewness = skew(data)
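
For intuition about what this number measures, skewness can also be computed by hand. The sketch below implements the biased Fisher-Pearson coefficient, m3 / m2^1.5, where m2 and m3 are the second and third central moments; this is what `scipy.stats.skew` computes with its default settings. The function name `sample_skewness` is hypothetical.

```python
import numpy as np


def sample_skewness(values):
    """Biased Fisher-Pearson skewness: third central moment over m2 ** 1.5."""
    x = np.asarray(values, dtype=float)
    deviations = x - x.mean()
    m2 = np.mean(deviations ** 2)  # second central moment (variance with n denominator)
    m3 = np.mean(deviations ** 3)  # third central moment
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100]

print("Symmetric data:   ", sample_skewness(symmetric))
print("Right-skewed data:", sample_skewness(right_skewed))
```

Cubing the deviations keeps their sign, so a single large value on the right dominates the sum and pushes the result positive.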

Understanding these sample statistics is crucial for effectively analyzing and interpreting data. They provide the foundational knowledge needed to assess the distribution’s characteristics and make informed decisions about handling missing values and other data preprocessing steps.

Section 2: Handling Missing or Incomplete Values

When dealing with skewed datasets, particularly those with a right-skewness, selecting the appropriate method for filling in missing values is crucial. The presence of outliers can significantly distort measures like the mean, making it unsuitable for representing the central tendency of the majority of the data.

Understanding the Skewness:

A dataset with extreme right-skewness, such as [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100, NA], is characterized by the majority of values clustering at the lower end and a long tail extending towards higher values. This can be quantitatively confirmed using the skew function from scipy.stats:

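from scipy.stats import skew

# The rating data without the missing value (NA)
right_skewed = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100]
print("Skewness: ", skew(right_skewed))
# Skewness:  2.7937162913590363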

The Problem with the Mean:

In right-skewed distributions, the mean is inflated by the outliers (which lead to positive skewness) and does not effectively represent the central tendency for the bulk of the data:

print("Mean: ", sum(right_skewed)/len(right_skewed))
# Mean:  14.090909090909092

Median: The Better Alternative:

The median, being the middle value of a sorted dataset, is less affected by outliers and provides a more accurate representation of the dataset’s central tendency:

right_skewed_sorted = sorted(right_skewed)
n = len(right_skewed_sorted)
median = right_skewed_sorted[n//2] if n % 2 != 0 else (right_skewed_sorted[n//2 - 1] + right_skewed_sorted[n//2]) / 2
print("Median: ", median)
# Median:  6
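
With the median in hand, the missing value can be filled in. The sketch below represents NA as `np.nan`, which is an assumption about how the data is stored, and uses `np.nanmedian` to ignore the missing entry when computing the median.

```python
import numpy as np

# The rating data with the missing value represented as NaN.
ratings = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100, np.nan])

# Median of the observed values only (NaN entries are ignored).
median = np.nanmedian(ratings)

# Replace every missing entry with the median.
filled = np.where(np.isnan(ratings), median, ratings)

print("Median used for imputation:", median)
print("Filled data:", filled)
```

Filling with the mean instead would insert a value of about 14.09, which lies above every rating in the valid 1 to 10 range.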

Section 3: Identifying and Removing Outliers

To avoid the distorting effect of outliers described above, we can detect and remove them.

Visualizing Data Distribution

A histogram plot helps visualize the distribution (e.g., variance and skewness), supporting the choice of median over mean for filling missing values:

from matplotlib import pyplot as plt
plt.hist(right_skewed, bins=5, color='c', edgecolor='k')
plt.show()

Visualizing IQR Statistics

A boxplot (or box-and-whisker plot) graphically depicts the distribution of a dataset based on a five-number summary: minimum, first quartile (Q1), median (second quartile, Q2), third quartile (Q3), and maximum. Interquartile Range (IQR) is calculated as the difference between the third quartile (Q3) and the first quartile (Q1). It represents the range within which the middle 50% of the data points lie.

plt.boxplot(right_skewed)
plt.show()

To identify potential outliers on the lower end of the data, 1.5 times the IQR is subtracted from the first quartile (Q1). Any data points that fall below this value are considered outliers. Mathematically, the lower bound for outliers is calculated as Q1−1.5×IQR.

To identify potential outliers on the upper end of the data, 1.5 times the IQR is added to the third quartile (Q3). Any data points that fall above this value are considered outliers. Mathematically, the upper bound for outliers is calculated as Q3+1.5×IQR.

import numpy as np

# Example dataset
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100])

# Calculate the first quartile (Q1) and third quartile (Q3)
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)

# Calculate the Interquartile Range (IQR)
IQR = Q3 - Q1

# Calculate the lower and upper bounds for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Print the results
print(f"First Quartile (Q1): {Q1}")
print(f"Third Quartile (Q3): {Q3}")
print(f"Interquartile Range (IQR): {IQR}")
print(f"Lower Bound for Outliers: {lower_bound}")
print(f"Upper Bound for Outliers: {upper_bound}")
# First Quartile (Q1): 3.5
# Third Quartile (Q3): 8.5
# Interquartile Range (IQR): 5.0
# Lower Bound for Outliers: -4.0
# Upper Bound for Outliers: 16.0
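Once the bounds are known, removing the outliers is a matter of masking. The following is a small sketch that recomputes the same fences and filters the data; the variable names inlier_mask, outliers, and cleaned are introduced here for illustration.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100])

# Same IQR fences as computed above
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Boolean mask: True for points that lie inside the fences
inlier_mask = (data >= lower_bound) & (data <= upper_bound)

outliers = data[~inlier_mask]
cleaned = data[inlier_mask]

print("Outliers:", outliers)
print("Cleaned data:", cleaned)
print("Mean after removal:", cleaned.mean())
```

After the value 100 is dropped, the mean of the remaining data falls back in line with the median region of the distribution.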

Section 4: Normalizing Data

Data normalization is a crucial preprocessing step in machine learning, especially when dealing with features that vary in scale, distribution, or range. Specifically, if one feature (e.g., weight in kilograms) has a smaller numerical range than another feature (e.g., height in centimeters), it might contribute less to the model’s decision process simply because its values are lower in magnitude, not because it is less informative.

Normalization techniques, such as Min-Max Scaling or Z-Score Normalization, are used to ensure that each feature contributes appropriately to the model by scaling them to a common range or distribution.

import numpy as np

# Sample height and weight data
heights = np.array([160, 175, 180, 190, 200])  # in cm
weights = np.array([55, 65, 75, 85, 95])  # in kg

# Min-Max Scaling
heights_scaled = (heights - heights.min()) / (heights.max() - heights.min())
weights_scaled = (weights - weights.min()) / (weights.max() - weights.min())
print("Scaled Heights:", heights_scaled)
print("Scaled Weights:", weights_scaled)

# Z-score normalization
heights_scaled = (heights - np.mean(heights)) / np.std(heights)
weights_scaled = (weights - np.mean(weights)) / np.std(weights)
print("Scaled Heights:", heights_scaled)
print("Scaled Weights:", weights_scaled)
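A quick way to sanity-check either technique is to verify the property it promises: Min-Max Scaling should map each feature onto [0, 1], and Z-Score Normalization should give mean 0 and standard deviation 1. The sketch below wraps both formulas in helper functions (min_max_scale and z_score are names chosen here, not from a library) and checks those properties.

```python
import numpy as np


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    values = np.asarray(values, dtype=float)
    return (values - values.min()) / (values.max() - values.min())


def z_score(values):
    """Center values at 0 and scale to unit (population) standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()


heights = np.array([160, 175, 180, 190, 200])  # in cm
weights = np.array([55, 65, 75, 85, 95])       # in kg

for name, feature in [("heights", heights), ("weights", weights)]:
    mm = min_max_scale(feature)
    z = z_score(feature)
    print(name, "min-max range:", mm.min(), "to", mm.max())
    print(name, "z-score mean/std:", round(z.mean(), 6), round(z.std(), 6))
```

Both helpers divide by the feature's spread, so a constant feature (zero range or zero standard deviation) would need special handling that this sketch leaves out.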

Summary

Through understanding sample statistics, managing missing values, eliminating outliers, and normalizing data for machine learning applications, this guide offers a comprehensive approach to preparing datasets for effective analysis and model training.

From Theory to Code in Machine Learning (Part 1): Maximum Likelihood Estimation in Regression

Xinzhe Li, PhD in Language Intelligence — Thu, 07 Mar 2024 06:54:13 GMT

Implementing Linear Regression from Scratch in Python

Photo by Fabio on Unsplash

The Maximum Likelihood Estimation (MLE) principle provides a general starting point for deriving optimization objectives (criteria), whose optima give the estimated values of parameters (characteristics of the population). This article introduces it from scratch and uses it to estimate the parameters of linear regression.

Section 2: When Do We Need Statistical Estimation?

Section 2.1: No Statistical Estimation. First-principle, Physical Formulas

Firstly, let’s consider the case without statistical estimation. Normally, when we want to model something, we “reason” over the case and develop professional domain knowledge through the study of underlying first principles for both physical formulas and parameters. This method relies heavily on a deep understanding of the theoretical foundations of a problem. The major advantage of this approach is its reliance on established scientific laws and principles, which can provide accurate and explainable models. However, this method might be limited in scenarios where the underlying principles are not well understood or too complex to model directly.

An Example: The concept of mass-energy equivalence is encapsulated in the famous equation formulated by Albert Einstein, E=m c². The equation directly relates the mass of an object (m) to its energy equivalence (E) with c representing the speed of light in a vacuum, approximately 3.00×10⁸ meters per second. The insight of relating m to E was not directly derived from experimental data but from Einstein’s deep reflections on the nature of space, time, and energy.

Limitation: Although first-principle derivation excels in areas with well-understood physical laws and offers models with high interpretability and theoretical justification, complex or poorly understood phenomena require leveraging data to uncover patterns and relationships that may not be apparent from theoretical analysis alone. That’s where we need statistical estimation.

Sometimes, only function forms can be determined by first-principle reasoning. Hence, statistical estimation is still required as a complementary approach to estimate parameters. Often, this first-principle approach may involve statistical analysis: Understanding Underlying Data Distribution, as discussed in the beginning.

Section 2.2: Statistical Estimation for Parameters

When the exact physical formulas or principles are unknown or too complex, a general universal approximator (e.g., a neural network) or a hypothesized formula is commonly used instead. Statistical estimation methods can then approximate its parameters from existing data.

Pros & Cons: Statistical estimation offers flexibility and is particularly powerful in handling large datasets and complex relationships that are not easily modeled using first principles alone. However, it may lack the interpretability and theoretical grounding of models derived from first principles, and its accuracy is heavily dependent on the quality and quantity of available data.

An Example: Let’s take an intuitive example from the Natural Language Processing (NLP) domain to illustrate the concept of statistical estimation: sentiment analysis.
Imagine you’re tasked with developing a model that can automatically determine whether a piece of text (like a product review) expresses a positive, negative, or neutral sentiment. The first-principle derivation approach would require understanding and modeling the complex rules of language, including grammar, idioms, sarcasm, and other nuanced features that contribute to sentiment. This is highly challenging due to the vast complexity and variability of natural language.
So we need to apply statistical estimation. Instead of trying to manually codify these rules, you opt for a statistical estimation approach. You collect a large dataset of text reviews, each labeled with its corresponding sentiment (positive, negative, or neutral) by human annotators. You then use this dataset to train a machine learning model, such as a neural network, to estimate the function fθ that maps input text to sentiment labels.

Section 3: Objective — Maximizing Likelihood

The intuitive objective is to choose the set of model parameters θ that is most probable given the observed data (features x and labels y), i.e., to maximize the posterior probability P(θ∣x,y). It is nearly impossible to know this quantity directly. However, according to Bayes’ rule, we know P(θ∣x,y) = P(y,x∣θ)⋅P(θ) / P(x,y).

P(θ∣x,y)-> P(y,x∣θ): Utilizing Bayes’ theorem allows us to express the posterior probability P(θ∣x,y) in terms of the likelihood P(y,x∣θ), the prior probability P(θ), and the marginal likelihood P(x,y). When the goal is to find the parameter set θ that maximizes the posterior probability given the data (features x and labels y), we can simplify the optimization problem by focusing on maximizing the numerator of the Bayes’ formula because the denominator P(x,y) is independent of θ and acts as a constant normalization factor. Specifically, in the context of parameter estimation, the focus often shifts to maximizing P(y,x∣θ)⋅P(θ) because it directly influences the value of P(θ∣x,y). However, when assuming a uniform prior (where P(θ) is constant) or in situations where the prior knowledge about θ is not specifically leveraged, the problem simplifies further to maximizing the likelihood P(y,x∣θ) alone. This underlies the principle of Maximum Likelihood Estimation (MLE), where P(θ) is effectively considered constant and does not influence the optimization of θ.

P(y,x∣θ) -> P(y∣x, θ): Since x is known and fixed in supervised learning tasks, we often consider x as a given context rather than a variable we seek to model the distribution of. Thus, the model focuses on y’s distribution conditioned on x and θ .

The total likelihood L on the sample (X, Y) is the product of P(y∣x, θ) for all n examples (x, y) in (X, Y) .

Note that the likelihood is not a probability distribution over θ (it does not integrate to 1 over θ), but it is used to compare different values of θ in terms of how plausible they make the observed data appear.

Section 4: MLE for Regression Tasks

Here is the function for linear regression: the predicted value is ŷi = β0 + β1xi. To get the true yi, we add the error term ϵi:

yi = β0 + β1xi + ϵi, where (xi, yi) is an example in the training data.

In the context of linear regression, ϵi is typically assumed to be normally distributed around 0. It is intuitive considering that the random error is unlikely to be big if the linear regression model is a good choice.

Now, we can derive the distribution of P(y∣x, θ) :

Now, we can specify P(yi∣xi, β0, β1) for the calculation of the likelihood, where the right-hand side of the equation below is the PDF of the normal distribution:

The likelihood function for the entire dataset of n observations is the product of the individual probabilities for each data point, given the parameters:

To simplify the optimization, it’s common to take the logarithm of the likelihood function, converting the product into a sum:

Maximizing the above function is equivalent to minimizing:

Essentially, this process turns out to be equivalent to minimizing the sum of squared residuals, which is the Ordinary Least Squares (OLS) objective when the variance is assumed to be constant.

With this MLE objective for linear regression (or OLS objective), we can easily find the optimal values of β0 and β1 because the objective is convex in both. Using basic calculus, we set the partial derivatives with respect to β0 and β1 to 0, which gives two equations to solve. Below are the results:
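Written out under the assumptions stated above (ϵi normally distributed around 0 with constant variance σ²), the chain from the model to the closed-form solution is the standard one sketched here. The symbol S for the sum of squared residuals and the bars for sample means are notation introduced only for this summary.

```latex
\begin{aligned}
y_i &= \beta_0 + \beta_1 x_i + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, \sigma^2) \\[4pt]
P(y_i \mid x_i, \beta_0, \beta_1) &= \frac{1}{\sqrt{2\pi\sigma^2}}
  \exp\!\left(-\frac{(y_i - \beta_0 - \beta_1 x_i)^2}{2\sigma^2}\right) \\[4pt]
L(\beta_0, \beta_1) &= \prod_{i=1}^{n} P(y_i \mid x_i, \beta_0, \beta_1) \\[4pt]
\log L(\beta_0, \beta_1) &= -\frac{n}{2}\log(2\pi\sigma^2)
  - \frac{1}{2\sigma^2}\sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 \\[4pt]
S(\beta_0, \beta_1) &= \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2
  \quad \text{(to be minimized)} \\[4pt]
\frac{\partial S}{\partial \beta_0} &= -2\sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0 \\[4pt]
\frac{\partial S}{\partial \beta_1} &= -2\sum_{i=1}^{n} x_i\,(y_i - \beta_0 - \beta_1 x_i) = 0 \\[4pt]
\hat{\beta}_1 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
\end{aligned}
```

The first term of the log-likelihood does not depend on β0 or β1, which is why maximizing it reduces to minimizing S.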

You can watch my YouTube video for the step-by-step derivation of the results.

https://medium.com/media/ec5877f1bab7d334e32dcd36fceae477/href

Section 5: Python Implementation

Below is the code:

import numpy as np

def mle_estimate_for_linear_regression(X, Y):
    """
    Parameters:
    - X: numpy array of independent variable values.
    - Y: numpy array of dependent variable values.

    Returns:
    - beta_0: Estimated intercept of the regression line.
    - beta_1: Estimated slope of the regression line.
    """

    # Mean of X and Y
    X_mean = np.mean(X)
    Y_mean = np.mean(Y)

    # Calculate beta_1 (slope)
    beta_1 = np.sum((X - X_mean) * (Y - Y_mean)) / np.sum((X - X_mean) ** 2)

    # Calculate beta_0 (intercept)
    beta_0 = Y_mean - beta_1 * X_mean

    return beta_0, beta_1

Final Verification

Below is the test code to verify the result.

true_beta_1 = 10
true_beta_0 = 5
x = np.array([1, 2, 3, 4, 5])
y = true_beta_0 + true_beta_1 * x
# plot_regression_line is a plotting helper that is not shown in this article
plot_regression_line(true_beta_1, true_beta_0, x, y)

beta_0, beta_1 = mle_estimate_for_linear_regression(x, y)
print(f"beta_0: {beta_0}, beta_1: {beta_1}")

Here is the output: beta_0: 5.0, beta_1: 10.0, the same as true_beta_0 and true_beta_1. So, our closed-form solution is mathematically sound and works as expected.
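The check above uses noise-free data, so it may also be worth trying the estimator on noisy data and comparing it against an independent least-squares routine. The sketch below restates the closed-form formulas in a helper named closed_form_fit (a name chosen here) and cross-checks it with np.polyfit; the seed, sample size, and noise level are arbitrary choices.

```python
import numpy as np


def closed_form_fit(x, y):
    """Closed-form estimates (intercept, slope) for simple linear regression."""
    x_mean, y_mean = np.mean(x), np.mean(y)
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Synthetic data: y = 5 + 10x plus Gaussian noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = 5 + 10 * x + rng.normal(loc=0.0, scale=2.0, size=x.shape)

intercept, slope = closed_form_fit(x, y)

# np.polyfit with deg=1 returns [slope, intercept] from a least-squares fit
ref_slope, ref_intercept = np.polyfit(x, y, deg=1)

print(f"closed form: beta_0={intercept:.4f}, beta_1={slope:.4f}")
print(f"np.polyfit : beta_0={ref_intercept:.4f}, beta_1={ref_slope:.4f}")
```

The two fits should agree up to floating-point error, while both differ slightly from the true values 5 and 10 because of the noise.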

From Theory to Code in Machine Learning (Part 1): Maximum Likelihood Estimation in Regression was originally published in Level Up Coding on Medium, where people are continuing the conversation by highlighting and responding to this story.

Python: Scripts, Modules and Packages

Xinzhe Li, PhD in Language Intelligence — Fri, 01 Mar 2024 12:50:42 GMT

Demystifying Python Scripts, Modules, and Packages

Photo by Rubaitul Azad on Unsplash

Section 1: Scripts vs. Modules

They look alike because:

Both are Python files with the .py extension.
Both contain Python definitions (classes, functions, variables) and statements.

They are essentially different, however. If you search online, many sources define their use as follows: a script is a file intended to be run directly, while a module is a file intended to be imported and reused by other code.

Think: Is hello.py below a Python module or script?

print("hello, you are running a python code")

Section 2: Python Modules

Why do we need modules? Simply put, they enable code reuse with better readability and maintainability.

For example, as a machine learning engineer, you commonly implement many models (e.g., Model-A, Model-B, Model-C) and metrics (e.g., Metric-A, Metric-B, Metric-C). Without Python modules, you would have to re-implement all the metrics in every model file.

Readability: The duplication makes the code harder to read.
Maintainability: It also makes the code harder to maintain.
For example, if you put the incorrect code below in the 3 model files to calculate the mean square error (MSE), you have to go to 3 files to correct the mistake.

difference = actual - predicted
mse = difference.mean()

In contrast, if you save the mean square error (MSE) function in a Python module, you only need to change the function once.

To do this, you can create a module named metrics.py that contains the MSE function. This way, you can simply import the MSE function from your metrics module whenever you need to evaluate a regression model. Here is how you could define the MSE function within the metrics.py module:

import numpy as np

def mean_square_error(actual, predicted):
    """
    Calculate the mean square error between actual and predicted values.
    
    Parameters:
    actual (list or numpy array): The actual values.
    predicted (list or numpy array): The predicted values.
    
    Returns:
    float: The mean square error.
    """
    # Convert to arrays so that plain Python lists are supported as well
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    difference = actual - predicted
    squared_difference = difference ** 2
    mse = squared_difference.mean()
    return mse

Then, in your script or any other module where you want to use the MSE function, you can import it and use it as follows:

from metrics import mean_square_error

# Example actual and predicted values
actual = [3, 5, 2.5, 6]
predicted = [2.5, 5.0, 2.0, 8]

# Calculate MSE
mse = mean_square_error(actual, predicted)
print(f"Mean Square Error: {mse}")

(Optional) What if different packages provide functions with the same or similar names? Python allows the same function names to be used in different modules without conflict. For instance, the mean_square_error function you've defined in your metrics.py module can coexist with the mean_squared_error function available in the popular machine learning library, Scikit-learn (sklearn.metrics). Thanks to Python's namespace management, when you import these functions into your scripts, you can distinguish between them by prefixing them with the module name or by importing them under different aliases. For example, you could use your custom mean_square_error for certain tasks and sklearn.metrics.mean_squared_error for others, without any naming conflict. This is achieved by importing them like so:

from metrics import mean_square_error as mse_custom
from sklearn.metrics import mean_squared_error as mse_sklearn

# Example usage of your custom MSE
actual = [3, 5, 2.5, 6]
predicted = [2.5, 5.0, 2.0, 8]
mse_custom_result = mse_custom(actual, predicted)
print(f"Custom MSE: {mse_custom_result}")

# Example usage of Scikit-learn's MSE
mse_sklearn_result = mse_sklearn(actual, predicted)
print(f"Scikit-learn MSE: {mse_sklearn_result}")

Section 3: Python Package

A package organizes Python modules into a directory structure. Normally, the presence of an __init__.py file in each directory of a Python package is what distinguishes a package from a mere folder or directory.

Note: Before Python 3.3, an __init__.py file was required for the Python interpreter to recognize a directory as a package and allow its modules to be imported. Since Python 3.3, a directory without __init__.py can still be imported as a namespace package, so the file is no longer strictly necessary.

A Practical Example

Here is a practical machine learning package I devised called ml. Within this package, we have two subpackages: models and evaluation. The tree below shows the models subpackage:

ml/
  models/
    __init__.py
    fully_connected_layer.py
    regression.py
python_script.py

Here is the content of the module fully_connected_layer.py .

class FullyConnectedLayer:
    """Fully connected layer class for neural network model"""
    def __init__(self, learning_rate=0.01):
        self.learning_rate = learning_rate
        self.weights = None
        self.bias = None
        self.input_shape = None
        self.output_shape = None

    def forward(self, inputs):
        """Forward pass of the fully connected layer"""
        return inputs @ self.weights + self.bias

Accessing Python Definitions in a Module:

We use “dotted module names” to access Python’s module namespace by chaining the package and subpackages/modules (from package.module import item, where item can be a function, a class, a variable, or a subpackage), e.g., from ml.models.fully_connected_layer import FullyConnectedLayer as below.

# code in the python_script.py
from ml.models.fully_connected_layer import FullyConnectedLayer

model = FullyConnectedLayer()

Or we can use the pattern like from package import module and then use module to access the classes, functions and variables.

# code in python_script.py
from ml.models import fully_connected_layer
model = fully_connected_layer.FullyConnectedLayer()

Modifying __init__.py Files for Direct Access: To access the FullyConnectedLayer class directly from the ml package, add the following import to the package's top-level __init__.py file:
ml/__init__.py:

from .models.fully_connected_layer import FullyConnectedLayer

Now, users of your ml package can import the FullyConnectedLayer class directly from the package, without having to navigate through the subpackage structure:

# code in python_script.py
from ml import FullyConnectedLayer

Section 4: Using Public Packages

To leverage the vast ecosystem of Python for data science and machine learning projects, you’ll likely need to install several key public packages. These packages provide a wide range of functionalities, from matrix manipulation to advanced machine learning algorithms and data visualization.

pip install numpy         # NumPy for array and matrix manipulation
pip install scipy         # SciPy for statistical functions and optimization
pip install pandas        # pandas for reading and handling tabular data (built on NumPy arrays)
pip install scikit-learn  # scikit-learn for ready-made ML algorithms
pip install matplotlib    # Matplotlib for visualization

Instead of executing each pip install command separately, you can streamline the process by specifying all required packages in a single file named requirements.txt. This file lists each package (and optionally, the desired version) on a separate line. To install all the packages listed in requirements.txt, run the following command:

pip install -r requirements.txt

This approach not only saves time but also ensures consistency across different environments, making it easier to manage dependencies for projects.

Section 5: Can We … ?

Can We Directly Run A Module?

If we directly run the code in regression.py, an import error occurs: ImportError: attempted relative import with no known parent package. This happens because regression.py is being run as a standalone Python script, whereas relative imports via dots only work between modules within a package.

# code in ml/models/regression.py
import numpy as np
from ..metrics import mean_squared_error, r2_score, variance_in_cv_scores

def ols_estimate_for_linear_regression(X, Y):

    # Mean of X and Y
    X_mean = np.mean(X)
    Y_mean = np.mean(Y)

    # Calculate beta_1 (slope)
    beta_1 = np.sum((X - X_mean) * (Y - Y_mean)) / np.sum((X - X_mean) ** 2)

    # Calculate beta_0 (intercept)
    beta_0 = Y_mean - beta_1 * X_mean

    return beta_0, beta_1

if __name__ == "__main__":
    import pandas as pd
    df = pd.read_csv("data/house-prices/train.csv") # from https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data
    x = df['GrLivArea'] # Above grade (ground) living area square feet
    y = df['SalePrice']
    beta_0, beta_1 = ols_estimate_for_linear_regression(x, y)
    predictions = beta_0 + beta_1 * x
    error  = mean_squared_error(y, predictions)

To run this successfully, we have to tell the Python interpreter that the file works as a module within the ml package. You can use the python -m command from the directory that contains the package, e.g., python -m ml.models.regression.

Can We Run A Notebook inside A Python Package?

Yes. But you have to manually add the parent directory of the package into sys.path so that the Python import system can find the package.
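The mechanism can be sketched in a self-contained way. The example below builds a throwaway package in a temporary directory (demo_pkg and greet are hypothetical names used only for illustration), then adds the package's parent directory to sys.path before importing it, which is the same step a notebook inside a package would need.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk so the example is self-contained.
# "demo_pkg" and "greet" are hypothetical names used only for illustration.
project_root = Path(tempfile.mkdtemp())
package_dir = project_root / "demo_pkg"
package_dir.mkdir()
(package_dir / "__init__.py").write_text(
    "def greet():\n    return 'hello from demo_pkg'\n"
)

# A notebook stored inside the package runs with a directory inside the package
# as its working directory, so the directory that must be on sys.path is the
# package's parent.
parent_of_package = str(package_dir.parent)
if parent_of_package not in sys.path:
    sys.path.insert(0, parent_of_package)

# Make sure the import system notices the newly created files
importlib.invalidate_caches()
demo_pkg = importlib.import_module("demo_pkg")
print(demo_pkg.greet())
```

In a real notebook, the equivalent is to insert the path of the directory that contains ml (for example, computed relative to Path.cwd()) before running import ml.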

Python: Scripts, Modules and Packages was originally published in Level Up Coding on Medium, where people are continuing the conversation by highlighting and responding to this story.

From Simple to Complex: A Complete Overview of Reinforcement Learning

Xinzhe Li, PhD in Language Intelligence — Thu, 29 Feb 2024 05:24:54 GMT

A Comprehensive Overview of Reinforcement Learning

Photo by Alexander Mils on Unsplash

This article aims to comprehensively demonstrate RL-related content.

Work in Progress

Section 1: Unraveling the Complexity of Reinforcement Learning

The exploration of Reinforcement Learning (RL) within the machine learning spectrum reveals a sophisticated landscape where traditional learning paradigms undergo a significant transformation.

This complexity is further nuanced by the shift from static data representations to dynamic states and actions, where the objective is not merely to predict the correct outcomes but to maximize rewards through sequential decision-making.

Section 1.1: Reviewing Machine Learning

In the realm of machine learning, various paradigms are utilized to teach machines how to interpret data and make predictions or decisions. At the core of these paradigms, symbols x and y are often used to delineate the process: x represents the input, while y symbolizes the ground-truth output we desire our models to generate. The objective is to maximize the probability of outputting the ground truth given the input, i.e., p(y|x).

Supervised learning is a paradigm that directly targets the optimization of p(y|x) by leveraging datasets where both x (input features) and y (target outcomes) are provided. The goal is to explicitly learn the mapping from x to y, thereby enabling the model to make predictions or decisions when presented with new, unseen instances of x.

Unsupervised learning comes into play when y is not provided. This paradigm focuses on discerning patterns, structures, or insights from data where the outcome, or y, is unknown. The learning process revolves around the analysis of x alone, without direct guidance on what the output should be.

Section 1.2: RL vs Supervised Learning

An RL model aims to

learn how to map situations (states) to actions — so as to maximize a numerical reward signal.

From Sutton & Barto’s book (the bible of the RL field).

To achieve the goal, we need to learn one or more of the following:

A model mapping a given state s to an action a, denoted as the policy π(s).
A model telling what the next state is after executing a on s. Since the transition between s and s’ is not deterministic, this is actually a density estimation problem, where the transition function T(s’|s, a) outputs a probability distribution over the state space. s’ is the next state after executing a on s.
A model evaluating whether the action a on s is good or bad, denoted as the reward model R(s'|s, a). Note that the output of a policy can be either a probability distribution over all the possible actions or a single action sampled from that distribution.

In contrast, in typical supervised learning tasks, the task is simple (but universal): modeling the probability distribution p(y|x) so that we directly choose y based on p(y|x). In RL, we no longer have a single simple but general mapping from x to y. Instead, the state s is always part of the input to all three models above, while the next state s’ appears in both R and T.

This universal form can be used to model policy, reward function and transition function.

Section 1.3: Fundamental Difference: Dynamic Learning

Firstly, “dynamic” comes from the requirement of making sequences of decisions. RL models are tasked with generating a sequence of decisions. During this process, rather than a static mapping from an input to one or more ground-truth labels, both inputs and outputs change dynamically.

Dynamically changed states and actions: The original input is the perceived state from the environment. Obviously, for one example/trajectory, a sequence of different inputs could be injected into the model. Correspondingly, outputs can be different.

Secondly, “dynamic” comes from the reward models being unknown during learning. Ideally, the agent knows the reward models and transition models, so it only needs to adjust its actions (policy output) conditioned on a state (policy input) based on the weighted feedback/reward it receives.
However, reward models are often unknown. Hence, agents have to learn reward models through a sequence of trials and errors (or experience) by interacting with an environment. This leads to:

Dynamically changed objective function: It is determined dynamically throughout the learning process.
Dynamically changed policy and optimal actions: The policy and the best action for the same input can change throughout the learning process.

Section 1.4: Fundamental Difference: Independent vs Dependent

Independent Assumption in Unsupervised/Supervised Learning Algorithms: This assumption exists in most supervised and unsupervised learning algorithms. In supervised learning, the independent assumption is crucial for the training process. It posits that each example in the dataset is independent of the others, meaning that the occurrence of one example does not influence the occurrence of another. This assumption simplifies the model training by allowing the algorithm to treat each example as an isolated instance, which is essential for the effectiveness of statistical inference and the reliability of predictive performance. However, this assumption can sometimes be violated in real-world scenarios where data points may be related, leading to challenges in model accuracy and generalization. For example, if the model assumes that each day’s stock price is independent of the previous day’s, it overlooks the critical temporal dependencies that often drive financial markets.

“Identically distributed” is another important assumption in common unsupervised/supervised Learning algorithms to ensure the training and future unseen data come from the same distribution, facilitating the model’s ability to generalize well from the training set to the test set or real-world scenarios. For example, the model might associate the word “avenger” with positive sentiments due to its positive connotation in the context of movie reviews, particularly for the “Avengers” movie series which is widely appreciated. When applied to a product review that uses “avenger” in a completely different context, e.g., “Total disappointment. This product turned me into an avenger, hunting for a refund. Save your money!”, the model may incorrectly predict the sentiment as positive due to its reliance on the learned association.

Dependent in Sequential Decision Making in RL: In the realm of Reinforcement Learning (RL), the scenario is markedly different due to the sequential and dependent nature of decisions, as opposed to the independent examples typically found in Supervised and Unsupervised Learning. The concept of dependency is pivotal, where the outcome of a previous action (a) directly influences the subsequent state or observation (s) encountered by the agent.

The 1st challenge it leads to is exploration and exploitation. This intertwined relationship necessitates a nuanced approach in RL, known as the balance between exploration and exploitation. Exploration involves trying new actions (a) to gauge their outcomes, essential for uncovering potentially superior strategies. On the other hand, exploitation leverages the best-known actions to maximize immediate rewards. The art of RL lies in effectively managing this balance, as it directly impacts the learning efficiency and the ability of the agent to adapt to dynamic environments, thereby achieving optimal decision-making over time.

The 2nd challenge it leads to is the iterative calculation of the objective. The objective in Supervised Learning is easy: p(y|x). However, this dependency in Reinforcement Learning (RL) naturally leads to an iterative process for calculating the objective, which is typically defined as the sum of future rewards. To make it solvable, the concept of value is devised: a value under a policy Vπ(s) is the expected return (cumulative discounted reward) from state s.

Thinking: Why do I use the term “expected return” rather than “expected reward”? The “return” in RL contexts refers to the sum of all rewards an agent receives, often discounted by a factor at each time step to account for the uncertainty or diminishing value of future rewards. This includes not just the immediate reward for the next action, but all subsequent rewards the agent will collect. The concept of return encompasses the entire sequence of rewards that follow a state or a state-action pair, reflecting the long-term consequences of actions.

Specifically, following a policy at the state s yields a random path. The utility (or return) of a policy is the (discounted) sum of the rewards on the path.

The return from the timestep t is a random quantity. A random variable cannot serve as an objective for optimizing the policy. Hence, we need the expectation over the paths that start from the current state s and end at a terminal state or when a finite time horizon is reached.
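The difference between a single random return and its expectation can be illustrated with a small simulation. The sketch below assumes a toy process invented for this purpose: every step pays a fixed reward, after which the episode continues with probability p_continue. The function names, the discount factor gamma = 0.9, and the seed are all illustrative choices.

```python
import random


def discounted_return(rewards, gamma):
    """Sum of rewards along one path, with the reward at step t weighted by gamma**t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def sample_rewards(rng, p_continue=0.5, reward=1.0, horizon=50):
    """Sample one reward path from a toy process (an assumption for illustration):
    each step pays `reward`, then the episode continues with probability p_continue,
    up to a finite horizon."""
    rewards = []
    for _ in range(horizon):
        rewards.append(reward)
        if rng.random() >= p_continue:
            break
    return rewards


rng = random.Random(0)
gamma = 0.9

# A single path gives one random return...
one_path = sample_rewards(rng)
print("One sampled return:", discounted_return(one_path, gamma))

# ...while the value is the expectation, estimated here by averaging many paths.
returns = [discounted_return(sample_rewards(rng), gamma) for _ in range(20000)]
estimate = sum(returns) / len(returns)
print("Monte Carlo estimate of expected return:", round(estimate, 3))
```

For this toy process the expectation satisfies V = 1 + gamma * p_continue * V, so the estimate should land close to 1 / (1 - 0.45), up to sampling noise and the negligible effect of the finite horizon.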

(Optional) Section 1.5: An Example Using All Three Learning Paradigms: ChatGPT-like LLMs

Since I often describe myself as a language intelligence researcher, let me show off my expertise. You all know ChatGPT, a Large Language Model. What you may not know is that ChatGPT is trained using all three learning paradigms:

Unsupervised Learning: Self-supervised learning (SSL), a particular type of unsupervised learning, is used to let language models learn huge amounts of world knowledge. For details, here is a very early but typical paper by Yoshua Bengio. For a more intuitive one, I will release my own work later to demonstrate this.
Supervised Learning: It is used to tune the naive ChatGPT to behave like a human. In the leftmost part, you can see, in this stage, the ChatGPT model is called policy (A good time for you to refresh the terminology).
Reinforcement Learning: The middle section talks about training a reward model (Again, a good time for you to refresh the terminology) for optimizing the less naive ChatGPT.

Section 3: Modeling Dynamic World

The model underlying RL is a decision process where an action y taken in a state x at time t leads to a nondeterministic next state at time t+1, i.e., the environment is dynamic. Here is a simple running example.

For each round r = 1, 2, …
- You choose stay or quit.
- If quit, you get $10 and we end the game.
- If stay, you get $4 and then I roll a 6-sided die.
- If the die shows 1 or 2, we end the game.
- Otherwise, continue to the next round.

The reason we want to model this as an MDP problem is that the action “stay”, performed in the state “in”, has a nondeterministic resulting state. This is represented by the chance node (in, stay).

Action: In the toy example, the action space A = {stay, quit}.
States: In the toy example, the state space S = {in, end}.
Transition models: Commonly, the transition model is represented by the distribution over possible next states and their rewards, Pr(s′, r | s, a), for each state s and each action a.
For the current state “in”: (the total number of transitions = the number of actions * the number of states)
For the action “stay”, p(end|in, stay)=1/3, p(in|in, stay)=2/3;
For the action “quit”, p(end|in, quit)=1, p(in|in, quit)=0.
If we use each state as the current state, the total number of transitions = the number of actions |A|* the number of states |S|²
Rewards: r(in, stay->end) = $4; r(in, stay->in) = $4; r(in, quit->end) = $10
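To make the components above concrete, here is a minimal sketch of the toy game written as a transition table. The dictionary layout and the helper names (`TRANSITIONS`, `check_transition_model`, `expected_immediate_reward`) are my own assumptions for illustration, not part of any library.

```python
from fractions import Fraction

# Hypothetical encoding of the dice game:
# TRANSITIONS[state][action] is a list of (probability, next_state, reward).
TRANSITIONS = {
    "in": {
        "stay": [(Fraction(2, 3), "in", 4), (Fraction(1, 3), "end", 4)],
        "quit": [(Fraction(1), "end", 10)],
    },
    "end": {},  # terminal state: no actions available
}


def check_transition_model(transitions):
    """Return True if the outcome probabilities of every (state, action) sum to 1."""
    return all(
        sum(prob for prob, _, _ in outcomes) == 1
        for actions in transitions.values()
        for outcomes in actions.values()
    )


def expected_immediate_reward(transitions, state, action):
    """Probability-weighted reward of taking `action` in `state`."""
    return sum(prob * reward for prob, _, reward in transitions[state][action])


print(check_transition_model(TRANSITIONS))
print(expected_immediate_reward(TRANSITIONS, "in", "stay"))
print(expected_immediate_reward(TRANSITIONS, "in", "quit"))
```

Exact fractions are used only so that the probability check does not suffer from floating-point rounding.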

Section 4: Objective and Policy Evaluation

As defined previously, “Policy” is a mapping from states of the environment to actions to be taken. Formally, it is a state-dependent distribution over the action space π(a|s). The number of deterministic policies = |A| to the power of |S|, which grows exponentially with the number of states. If there are 11 states and 4 actions, it equals 4¹¹ = 4,194,304.

Objective: Maximizing Utilities of Following A Policy π

The Objective in Supervised Learning is easy: p(y|x) . The objective for RL is the cumulative reward of following a policy. Specifically, following a policy yields a random path. The utility (or return) of a policy is the (discounted) sum of the rewards on the path.

The return from the timestep t, G_t, is a random quantity. A random variable cannot itself be an objective for optimizing the policy. Hence, we need the expectation.
Note that the path starts from the current state s and ends at a terminal state or at a finite time horizon. Below are four paths and their utilities for the toy example.

Terminology: “Horizon” refers to the number of future steps or decisions a model considers when calculating values like rewards.

However, these utilities can give us an intuitive sense of how the policy performs if we simulate many of them and average them, which is the idea of the Monte Carlo method. The method estimates the expected utility by sampling a set of trajectories and averaging their returns.
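The averaging idea can be sketched in a few lines for the dice game, assuming the policy “always stay” and no discounting. The function names and the number of episodes are hypothetical choices. For this policy the expected utility can also be worked out by hand from V = 4 + (2/3)V, which gives V = 12, so the estimate should land close to 12.

```python
import random


def play_always_stay(rng):
    """Simulate one episode under the policy 'always stay' and return its utility."""
    total = 0
    while True:
        total += 4                    # reward for staying
        if rng.randint(1, 6) <= 2:    # the die shows 1 or 2: the game ends
            return total


def monte_carlo_value(num_episodes=100_000, seed=0):
    """Average the utilities of many sampled episodes."""
    rng = random.Random(seed)
    returns = [play_always_stay(rng) for _ in range(num_episodes)]
    return sum(returns) / num_episodes


print(monte_carlo_value())  # should be close to the analytic value 12
```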

Objective: Maximizing Value of Following A Policy π

The value refers to the expected utility (or return) received by following policy π.

Why do I use the term “expected return” rather than “expected reward”? Here is the difference between utility (or return) and reward: The “return” in RL contexts refers to the sum of all rewards an agent receives, often discounted by a factor at each time step to account for the uncertainty or diminishing value of future rewards. This includes not just the immediate reward for the next action, but all subsequent rewards the agent will collect. The concept of return encompasses the entire sequence of rewards that follow a state or a state-action pair, reflecting the long-term consequences of actions.

Each value is tied to a particular state (the so-called state-value function). What we have been missing so far is the state: we can only follow a policy from a given state. Let’s rephrase more specifically by including the state: the value is the expected return from each individual state while adhering to the policy. Why do I emphasize this? Because policy evaluation computes the value of each state under the current policy; it does not sum the values of all states.

The key point is that when evaluating a policy, you don’t add up these individual state values into one aggregate sum. Each state’s value is kept distinct because it serves a specific purpose in the context of policy evaluation and improvement.
For example, in algorithms like Policy Iteration or Value Iteration, these individual state values are used to inform decisions about which actions to take in each state (policy improvement) and to check for convergence to an optimal policy.
The calculation of Vπ(s) for the particular state s takes into account the other states that may occur in the future under policy π. This expected return is not just about the immediate rewards but also includes all the future rewards the agent can expect to receive, appropriately discounted.

Objective: Defining Value Function for Maximization

Formally, the value function Vπ(s) is defined to map from a state s to the expected return. Intuitively, it answers the question: what is the expected return by starting in state s and following policy π?
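In symbols, one common way to write this definition is the following. The discount factor γ and the indexing of rewards as r_{t+k+1} are notational assumptions on my side; other texts index the rewards differently.

```latex
G_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1},
\qquad
V^{\pi}(s) = \mathbb{E}_{\pi}\left[\, G_t \mid s_t = s \,\right]
```

For an episodic task the sum simply stops at the terminal state, and γ = 1 recovers the plain (undiscounted) sum of rewards.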

Section 5: Using Model-based Approaches with Reward Model and Transition Model

Due to the probabilistic nature of state transitions, the value is calculated as a probability-weighted sum over all possible one-timestep future states: for each of them, we take the immediate reward of the action chosen by the policy plus the (discounted) value of that future state. Rewards further in the future are discounted more heavily.
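For a deterministic policy π(s), this weighted sum is usually written as below. This is my reconstruction in standard notation (p for the transition model, r for the reward model, γ for the discount factor), so treat the exact symbols as assumptions.

```latex
V^{\pi}(s)
  = \mathbb{E}_{\pi}\left[\, G_t \mid s_t = s \,\right]
  = \sum_{s'} p\left(s' \mid s, \pi(s)\right)
    \left[\, r\left(s, \pi(s), s'\right) + \gamma\, V^{\pi}(s') \,\right]
```

As a quick check on the toy game with the policy “always stay” and γ = 1: V(in) = (2/3)(4 + V(in)) + (1/3)(4 + 0), which solves to V(in) = 12.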

The latter part of the expression is a recursive decomposition for the value function of a policy in an MDP, which is known as the Bellman equation.

IMPORTANT: The Bellman equation is a fundamental concept in dynamic programming and reinforcement learning.

This recursive relationship provides a powerful tool for solving MDPs, especially when combined with algorithms like value iteration and policy iteration.

Generally, the Bellman equation is the backbone of algorithms like Q-learning, Value Iteration, and Policy Iteration

Below is a less common recursive form, which simplifies the immediate reward as a direct consequence of being in state s.

Stochastic policy: We need two levels of expectation for a stochastic policy, one over the actions and one over the resulting states, where π(a∣s) gives the probability of taking action a when in state s.
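The two levels of expectation can be seen directly in code. Below is a minimal sketch for the dice game, assuming no discounting and a hypothetical policy that stays with probability 0.5; the table layout and function name are illustrative only. By hand, V = 0.5 · 10 + 0.5 · (4 + (2/3)V) gives V = 10.5.

```python
GAMMA = 1.0  # assumption: the toy game is undiscounted

# p(s', r | in, a) as lists of (probability, next_state, reward)
TRANSITIONS = {
    "stay": [(2 / 3, "in", 4.0), (1 / 3, "end", 4.0)],
    "quit": [(1.0, "end", 10.0)],
}


def evaluate_stochastic_policy(policy, sweeps=200):
    """policy maps each action to pi(action | in); returns the value of state 'in'."""
    value = {"in": 0.0, "end": 0.0}
    for _ in range(sweeps):
        new_value = 0.0
        for action, action_prob in policy.items():                 # expectation over actions
            for prob, next_state, reward in TRANSITIONS[action]:   # expectation over outcomes
                new_value += action_prob * prob * (reward + GAMMA * value[next_state])
        value["in"] = new_value
    return value["in"]


print(evaluate_stochastic_policy({"stay": 0.5, "quit": 0.5}))  # approaches 10.5
```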

The sections below give a more detailed introduction to the model-based method, Dynamic Programming, and to the model-free methods based on sampled experience.

Why Do the Expressions Make Sense? Why is V(s) always stationary?

Markov Means “Memoryless”: The formulas and algorithms described here are often (not always) used for Markov Decision Processes (MDPs). A process is “Markovian” because we assume that

State: The next state is probabilistically determined by only the current state and action.
Value: No matter how we got to the current state, the value of the state is identical;
Action: The optimal decision at any point depends only on the current state and not on the path taken to reach that state (how the state was reached). Hence, future optimal decisions (actions) are independent of past decisions (actions).

In summary, the sequential decision-making process is thus simplified to a series of independent decisions, each based solely on the current state. This is why the above recursive forms make sense, and algorithms like Value Iteration and Policy Iteration can systematically improve policies by considering the optimality of actions at each state independently of the actions that led to that state.

An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision. (See Bellman, 1957, Chap. III.3.)

Further Thought: What if it is not stationary?

Section 6: Model-based Approach: Dynamic Programming

Dynamic programming is used for policy evaluation via Bootstrapping and Recursion.

Policy Iteration and Value Iteration are two distinct Dynamic Programming (DP) methods for finding the optimal policy in Markov Decision Processes (MDPs).

In DP, the solution to a larger problem (finding V(s) for all s) is built upon the solutions to smaller problems (finding V_{k−1}(s′) for all s′). This decomposition into subproblems is made possible by the model’s structure (the transition probabilities and rewards) and the recursive nature of the Bellman equations.
Model’s structure: In DP for MDPs, the model of the environment (specifically, the transition probabilities and reward structure) is known. This model tells us the probability of transitioning to various future states given the current state and action, as well as the rewards associated with these transitions.

Policy Iteration

Note: while value-based methods derive policies from value functions, policy-based methods optimize the policy directly, and actor-critic methods use a hybrid approach.

Under the assumption of an infinite horizon
1. Stationary policy, i.e., the policy does not change over time: In policy iteration, a policy is evaluated and improved iteratively. During the evaluation phase, the policy is considered stationary. That is, it’s assumed that the policy does not change while its value function is being calculated. Just note that the Bellman equation is typically formulated under a stationary policy π. (I am not sure whether it can be applied to non-stationary policies.)
2. Convergence of Value Function, i.e., V(s) becomes constant over time: When the value function under this stationary policy is computed (policy evaluation) over an infinite horizon, this value function converges to a constant as long as the policy remains unchanged. Specifically, the value function becomes stable and stops changing after sufficient iterations. Once the value function stabilizes, the policy is then improved based on the current value function, and the process repeats.

Policy Iteration involves two iterative steps: policy evaluation and policy improvement, iterated until convergence.
1. Policy Evaluation:

Thought: Why does the iterative update of the value function finally converge?

In this step, the value function for a given policy is calculated until it stabilizes. This involves solving a system of linear equations to find the expected returns from all states under the current policy.

The subscripts k and (k-1): Using just V(s) and V(s′) without specifying the iteration might imply that we’re referring to a static value function or the final, converged value function. By using the subscripts, we explicitly acknowledge that we’re working with the value function as it was estimated in a certain iteration, not some static or final value function. Here, (k-1) denotes the iteration just before k.

Bootstrapping refers to using the bootstrapping term below as a substitute for actually rolling out all the future steps.

Specifically, the essence of DP is to estimate the expectation over all possible futures by using only the values one timestep ahead. In other words, what would it be like if I started from each possible state one time step into the future, as denoted below? This is called

We now have the estimate of values for all the possible states. However, our goal is to estimate the expectation over all possible futures. So we must know the distribution (probabilities) of reaching potential future states given the current state and action, as denoted below.

Now would be the best time to explain the word “Dynamic”. It actually refers to the dynamics of the environment given by the probability function above and the reward model that returns the immediate reward.

Note that the dynamic model (i.e., the probability distribution) and reward model above are the components of the Markov Decision Process Model.
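Here is a minimal sketch of iterative policy evaluation with bootstrapping on the dice game, assuming the policy “always stay” and no discounting. The function name is hypothetical. Each V_k(in) is computed from V_{k−1}(in) using the known dynamics (2/3 to stay in, 1/3 to end) and the known reward of $4.

```python
def evaluate_always_stay(num_iterations=60):
    """Return the sequence V_0(in), V_1(in), ..., V_k(in) for the policy 'always stay'."""
    history = [0.0]          # V_0(in) = 0; the terminal state 'end' always has value 0
    for _ in range(num_iterations):
        v_prev = history[-1]
        # Bootstrapping: the value of the next state is taken from iteration k-1
        v_k = (2 / 3) * (4 + v_prev) + (1 / 3) * (4 + 0.0)
        history.append(v_k)
    return history


values = evaluate_always_stay()
print(values[:4])    # the first few iterates
print(values[-1])    # approaches the fixed point 12
```

This also hints at why the updates converge here: the distance to the fixed point 12 shrinks by a factor of 2/3 in every iteration, because the game continues with probability 2/3.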

2. Policy Improvement: Once the value function is stable, the policy is greedily updated for each state s with respect to its value function V(s) under the policy improvement theorem. This means for the state s, the policy chooses the action on s that leads to the highest expected return from the next state according to the current value function.

3. Convergence to Optimal Policy: This process of evaluation and improvement is repeated until the policy no longer changes, at which point it is considered optimal.
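The three steps above can be sketched for the dice game as follows. This is a toy illustration under my own assumptions (no discounting, a single non-terminal state, hypothetical function names), not a general implementation.

```python
GAMMA = 1.0  # assumption: the toy game is undiscounted

# p(s', r | in, a) as lists of (probability, next_state, reward)
TRANSITIONS = {
    "stay": [(2 / 3, "in", 4.0), (1 / 3, "end", 4.0)],
    "quit": [(1.0, "end", 10.0)],
}


def q_value(action, v_in):
    """Expected return of taking `action` in state 'in', given an estimate of V(in)."""
    return sum(
        prob * (reward + GAMMA * (v_in if next_state == "in" else 0.0))
        for prob, next_state, reward in TRANSITIONS[action]
    )


def evaluate(action, sweeps=200):
    """Policy evaluation: iterate the Bellman update for the fixed policy `action`."""
    v_in = 0.0
    for _ in range(sweeps):
        v_in = q_value(action, v_in)
    return v_in


def policy_iteration(initial_action="quit"):
    action = initial_action
    while True:
        v_in = evaluate(action)                                      # 1. evaluation
        greedy = max(TRANSITIONS, key=lambda a: q_value(a, v_in))    # 2. improvement
        if greedy == action:                                         # 3. policy is stable
            return action, v_in
        action = greedy


print(policy_iteration())
```

Starting from “quit” (value 10), the improvement step switches to “stay” because 4 + (2/3) · 10 is larger than 10, and the next evaluation gives a value close to 12, after which the policy no longer changes.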

Value Iteration

Single Step Process: Value Iteration simplifies the process by combining policy evaluation and improvement into a single step.
Value Function Update: Instead of fully evaluating a policy, Value Iteration continuously updates the value function for each state. It does this by choosing the action at each state that maximizes the expected return, based on the current estimate of the value function.

Implicit Policy Improvement: The policy is implicitly improved in each iteration as the value function gets updated. The optimal policy can be easily derived from the converged value function by choosing the best action at each state.
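Cheaper Iterations: Each iteration of Value Iteration is cheaper than an iteration of Policy Iteration because it doesn’t require the value function to stabilize under a particular policy before improving it. Policy Iteration usually needs fewer iterations, but each one is more expensive, so which method is faster overall depends on the problem.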
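For comparison, here is a minimal Value Iteration sketch on the dice game, again assuming no discounting and using hypothetical names. The max over actions inside the update is what merges evaluation and improvement into one step.

```python
GAMMA = 1.0  # assumption: the toy game is undiscounted

# p(s', r | in, a) as lists of (probability, next_state, reward)
TRANSITIONS = {
    "stay": [(2 / 3, "in", 4.0), (1 / 3, "end", 4.0)],
    "quit": [(1.0, "end", 10.0)],
}


def q_value(action, v_in):
    """Expected return of taking `action` in state 'in', given an estimate of V(in)."""
    return sum(
        prob * (reward + GAMMA * (v_in if next_state == "in" else 0.0))
        for prob, next_state, reward in TRANSITIONS[action]
    )


def value_iteration(tolerance=1e-10):
    v_in = 0.0
    while True:
        new_v = max(q_value(action, v_in) for action in TRANSITIONS)  # update with max
        if abs(new_v - v_in) < tolerance:
            break
        v_in = new_v
    # The policy is read off the converged values only at the end
    policy = max(TRANSITIONS, key=lambda action: q_value(action, new_v))
    return policy, new_v


print(value_iteration())
```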

Simpler Methods

When the state and action spaces are small and the transition probabilities and reward structures are relatively simple, it might not be necessary to use complex algorithms like Value Iteration or Policy Iteration. In such cases, simpler methods might suffice.

Imagine a simple MDP with three states (S1, S2, S3) and two actions (A1, A2), where actions lead to different states with certain probabilities:

Transitions and Rewards:

From S1, taking A1 leads to S2 with a probability of 0.7 and to S3 with a probability of 0.3. The reward is +1 in either case.
From S2, taking any action leads back to S1 with a reward of 0.
From S3, taking any action leads back to S1 with a reward of -1.

Policy:

Consider a policy where:

In S1, the policy chooses A1.
In S2 and S3, the action choice doesn’t matter (as all actions lead to the same outcome).

Objective:

To evaluate this policy, we calculate the expected return from each state under the policy.

Setting Up the Equations:

Assuming a discount factor γ = 0.9, the Bellman equations for our policy become:
For S1 under A1: V(S1) = 1 + 0.9 [0.7 V(S2) + 0.3 V(S3)]
For S2: V(S2) = 0 + 0.9 V(S1)
For S3: V(S3) = -1 + 0.9 V(S1)

Solving the System:

This system of linear equations needs to be solved to find the values of V(S1) , V(S2) , and V(S3) .
This can be done using matrix algebra methods or iterative techniques like Gauss-Seidel iteration.
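As a sketch of the iterative route, the three equations can be swept Gauss-Seidel style in plain Python, where each update immediately reuses the newest values. The function name and the number of sweeps are arbitrary choices of mine.

```python
def solve_three_state_mdp(gamma=0.9, sweeps=500):
    """Gauss-Seidel sweeps over the three Bellman equations of the example policy."""
    v1 = v2 = v3 = 0.0
    for _ in range(sweeps):
        v1 = 1 + gamma * (0.7 * v2 + 0.3 * v3)   # S1 under A1
        v2 = 0 + gamma * v1                      # S2: back to S1 with reward 0
        v3 = -1 + gamma * v1                     # S3: back to S1 with reward -1
    return v1, v2, v3


v1, v2, v3 = solve_three_state_mdp()
print(round(v1, 3), round(v2, 3), round(v3, 3))
```

Substituting the last two equations into the first gives V(S1) = 0.73 + 0.81 V(S1), so V(S1) = 0.73 / 0.19 ≈ 3.842, V(S2) ≈ 3.458, and V(S3) ≈ 2.458, which the sweeps should reproduce.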

Section 7: Roll-out

In a rollout algorithm, one simulates or “rolls out” an episode from a given state until a terminal state or a predefined horizon is reached according to a policy. This approach is used to estimate the value function or the quality of an action at a state.

It can be considered as a form of Monte Carlo simulation, as it relies on averaging over multiple random samples of trajectories (or episodes).
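A minimal sketch of using rollouts to estimate the quality of an action in the dice game: take a first action in state “in”, then follow a base policy until the game ends, and average over many rollouts. The base policy “always quit”, the horizon, and the function names are assumptions for illustration.

```python
import random


def rollout(first_action, rng, horizon=1000):
    """Take `first_action` in state 'in', then follow the base policy 'always quit'."""
    total = 0.0
    action = first_action
    for _ in range(horizon):
        if action == "quit":
            return total + 10          # quitting pays $10 and ends the game
        total += 4                     # staying pays $4
        if rng.randint(1, 6) <= 2:     # the die shows 1 or 2: the game ends
            return total
        action = "quit"                # base policy from the second step on
    return total                       # horizon reached (not expected in this game)


def estimate_action_values(num_rollouts=50_000, seed=0):
    """Average the returns of many rollouts for each possible first action."""
    rng = random.Random(seed)
    return {
        action: sum(rollout(action, rng) for _ in range(num_rollouts)) / num_rollouts
        for action in ("stay", "quit")
    }


print(estimate_action_values())
```

Under these assumptions the estimate for “stay” should be close to 4 + (2/3) · 10 ≈ 10.67, while “quit” is exactly 10.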

Section 8: Model-free Approach: Monte-Carlo Methods — Learning from Experience

If we do not know the dynamics model and the reward model, we use other methods such as Monte Carlo and Temporal Difference methods.

Policy Evaluation (Converged Value Function)

The method calculates values (estimates the value function) by sampling a set of trajectories (episodes) and averaging their returns.

It only suits episodic MDPs (e.g., we cannot use it for a robot that runs continuously without ever ending an episode), but the good thing is that it does not rely on the Markov assumption for states.

It suits episodic tasks where episodes are guaranteed to terminate after a finite number of steps, allowing the calculation of the return for each state visited in the episode. Specifically, we sample actions from π, which leads to the sample episode: s_1, a_1, r_1, s_2, a_2, r_2, …, s_T
On-policy: Monte Carlo policy iteration, where a policy is evaluated using the returns generated by episodes produced under the same policy, and then the policy is improved based on the evaluation.

Off-policy: Off-policy Monte Carlo methods evaluate or improve a policy different from the one generating the data. An example of this would be using importance sampling to re-weight the returns generated under a different policy (behavior policy) to evaluate or improve a target policy.

Section 9: Model-free Approach: Temporal Difference

Now, I can easily understand the following expression, where we simply estimate the future return G_t using the bootstrapping term from DP.

Compared to the Monte-Carlo method, it has the following advantages:

Faster Learning: TD methods update the value estimate after each step rather than waiting for the end of an episode as in Monte Carlo methods. This means that TD learning can learn from incomplete sequences and update its estimates partway through an episode. This leads to potentially faster learning since the algorithm doesn’t have to wait for the episode to finish before making value updates.
Handling of Continuous Tasks: Monte Carlo methods are primarily suited for episodic tasks where the episodes terminate. In contrast, TD methods can handle both episodic and continuous (non-terminating) tasks efficiently. This is because TD methods do not require the final outcome of an episode to update the value estimates.
Reduced Variance: The updates in TD learning are based on the difference between estimated values of successive states (TD error), which often leads to lower variance in the updates compared to Monte Carlo methods. Monte Carlo updates are based on full returns, which can have high variance especially in stochastic environments.
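To illustrate the step-by-step updates, here is a minimal TD(0) sketch for the dice game under the policy “always stay”. It assumes the standard update V(s) ← V(s) + α [r + γ V(s′) − V(s)]; the step size α, the number of steps, and the function name are my own choices. Note that the agent only sees sampled rewards and die rolls, never the transition probabilities.

```python
import random


def td0_always_stay(num_steps=400_000, alpha=0.0005, gamma=1.0, seed=0):
    """Estimate V(in) for the policy 'always stay' with one-step TD updates."""
    rng = random.Random(seed)
    v_in = 0.0                                   # estimate of V(in); V(end) is fixed at 0
    for _ in range(num_steps):
        reward = 4.0                             # reward for staying
        game_over = rng.randint(1, 6) <= 2       # the die shows 1 or 2
        v_next = 0.0 if game_over else v_in      # bootstrapped value of the next state
        td_error = reward + gamma * v_next - v_in
        v_in += alpha * td_error                 # update after every single step
        # If the game is over, the next step simply starts a new episode in 'in'.
    return v_in


print(td0_always_stay())  # should hover around 12
```

With a constant step size the estimate keeps fluctuating slightly around the true value instead of settling exactly; a smaller α reduces the fluctuation at the cost of slower learning.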

~~~~ Uncompleted Article (TBC) ~~~~

~~~~ Welcome to correct me if I made any mistake here ~~~~~

How do we optimize a policy model to maximize this objective?

1. Getting **a reward model** to calculate the reward?

2. Getting a value function (or an action-value function)?

3. Getting a dynamics model, i.e., the distribution over the next state $x_{t+1}$ and reward $r$ that the state $x_t$ transitions to under action $y$ ($Pr(x_{t+1}, r|x_t, y)$)?

Summary

From Simple to Complex: A Complete Overview of Reinforcement Learning was originally published in Artificial Corner on Medium, where people are continuing the conversation by highlighting and responding to this story.