Statistical Learning Theory Part 3

Approaches to the Learning Problem Part 1

Siladittya Manna

Published in

The Owl

4 min readJul 27, 2020

Recap

In the previous post in the Statistical Learning Theory series, we introduced the general model of learning from examples and also discussed two approaches to the Learning Problem.

Statistical Learning Theory Part 2

Approaches to the Learning Problem Part 0

medium.com

In this post, we are going to discuss the learning problem with the problem of Imitation.

The problem of Minimizing the Risk Functional from Empirical Data

As said in the previous post, the learning process is a process of choosing an appropriate function from a given set of functions. This function should satisfy a certain quality criterion in the best possible manner.

Basic elements

χ be a subset of Rⁿ and X be a random variable in the set χ. A particular realization of X is denoted by x. P(X) be the cumulative probability distribution function of X and the probability density function be given as p(x). Thus if X is continuous, then

and if X is discrete, then

If f(x) be a function, then the expectation of the function f(x) over X, is defined as

Instead of dealing with two different expressions for E[f(x)], we can re-define the above equations to

In the same way we can write

A superscript denotes a random variable and a subscript dnotes parameter values at which the expectation is to be taken.

The symbol π(x) will be used to represent a prior density of x.

Problem formulation

On the subset Z of the vector space Rⁿ, a set of admissible functions {g(z)}, z∈Z, is given and a functional R = R(g(z)) is defined which is the criterion of quality of the chosen function. We are required to find g*(z) from the set {g(z)} which minimizes the functional R, provided it exists in the set {g(z)}. This functional R is called the risk functional of the decision rule g(z) and is defined by

where L(z,g(z)) is integrable for any g(z) ∈ {g(z)}. F(z) is the probability distribution function defined on Z.

For a no-data problem, R(g(z)) = L(z,g(z)).

The main objective is to minimize this functional when F(z) is unknown but a few samples of observations z₁, z₂,…, zₗ drawn randomly and independently according to F(z) is available.

When we intend to directly minimize R(g(z)) by choosing a function g*(z) from the set {g(z)}, the question that arises is, how can we obtain the minimum of the functional R(g(z)) in the given set of functions {g(z)}?

Whereas, when our objective is to find the minimum of the functional R(g(z)) from the empirical data, we need to find answer to the question, what should be minimized in order to select from the set {g(z)} a function which will ensure that the functional R(g(z)) is minimized?

The minimization problem can be re-formulated to give a parametric form, such that the set of functions {g(z)} are now represented as {g(z,α), α ∈ Λ}. Here, α is used as a parameter from the set Λ, such that the vaue α = α* defines the specific function g(z,α*) in the set g(z,α). In the new notation,

The function Q(z,α) is called the Loss function.

For a specific value of α = α*, α ∈ Λ, each function Q(z,α*) represents the loss for the vector z ∈ Z. Thus the expected loss for the function Q(z,α*) with respect to z, is given by