Statistical Learning Theory Part 3
Approaches to the Learning Problem Part 1
Recap
In the previous post in the Statistical Learning Theory series, we introduced the general model of learning from examples and also discussed two approaches to the Learning Problem.
In this post, we are going to discuss the learning problem with the problem of Imitation.
The problem of Minimizing the Risk Functional from Empirical Data
As said in the previous post, the learning process is a process of choosing an appropriate function from a given set of functions. This function should satisfy a certain quality criterion in the best possible manner.
Basic elements
χ be a subset of Rⁿ and X be a random variable in the set χ. A particular realization of X is denoted by x. P(X) be the cumulative probability distribution function of X and the probability density function be given as p(x). Thus if X is continuous, then
and if X is discrete, then
If f(x) be a function, then the expectation of the function f(x) over X, is defined as
Instead of dealing with two different expressions for E[f(x)], we can re-define the above equations to
In the same way we can write
A superscript denotes a random variable and a subscript dnotes parameter values at which the expectation is to be taken.
The symbol π(x) will be used to represent a prior density of x.
Problem formulation
On the subset Z of the vector space Rⁿ, a set of admissible functions {g(z)}, z∈Z, is given and a functional R = R(g(z)) is defined which is the criterion of quality of the chosen function. We are required to find g*(z) from the set {g(z)} which minimizes the functional R, provided it exists in the set {g(z)}. This functional R is called the risk functional of the decision rule g(z) and is defined by
where L(z,g(z)) is integrable for any g(z) ∈ {g(z)}. F(z) is the probability distribution function defined on Z.
For a no-data problem, R(g(z)) = L(z,g(z)).
The main objective is to minimize this functional when F(z) is unknown but a few samples of observations z₁, z₂,…, zₗ drawn randomly and independently according to F(z) is available.
When we intend to directly minimize R(g(z)) by choosing a function g*(z) from the set {g(z)}, the question that arises is, how can we obtain the minimum of the functional R(g(z)) in the given set of functions {g(z)}?
Whereas, when our objective is to find the minimum of the functional R(g(z)) from the empirical data, we need to find answer to the question, what should be minimized in order to select from the set {g(z)} a function which will ensure that the functional R(g(z)) is minimized?
The minimization problem can be re-formulated to give a parametric form, such that the set of functions {g(z)} are now represented as {g(z,α), α ∈ Λ}. Here, α is used as a parameter from the set Λ, such that the vaue α = α* defines the specific function g(z,α*) in the set g(z,α). In the new notation,
The function Q(z,α) is called the Loss function.
For a specific value of α = α*, α ∈ Λ, each function Q(z,α*) represents the loss for the vector z ∈ Z. Thus the expected loss for the function Q(z,α*) with respect to z, is given by
This is also called the Risk functional or the risk.
The problem of minimizing of the risk functional on the basis of empirical data includes three statistical problems:
- The problem of Pattern Recognition
- The problem of Regression Estimation
- The problem of Density Estimation
In the next part, we will discuss about the problem of Pattern Recognition and Regression Estimation.
Please clap if you like the post or feel that it will help other learners.
Comment your feedback to help us improve!!
Reference
Statistical Learning Theory by Vladimir N. Vapnik