Statistical Learning Theory — 4
THE PROBLEM OF PATTERN RECOGNITION
Suppose we have data distributed according to a probability distribution function P(x). An element x appears randomly and independently according to the distribution P(x).
Again, suppose you are the supervisor. You classify and label each such element x into one of k classes. We denote the distribution according to which the supervisor, that is, you, carries out the classification by P(ω|x), where ω ∈ {0, 1, …, k-1}.
In a pattern recognition problem, neither the data distribution P(x) nor the conditional probability distribution P(ω|x) is known. But we do have the empirical data, and if P(x) and P(ω|x) exist, then the joint distribution P(ω, x) = P(ω|x)·P(x) also exists.
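This generating process can be sketched in a few lines of code. The specific distributions below (a standard normal P(x) and a sigmoid P(ω=1|x) in a two-class setting) are assumptions for illustration only; in a real pattern recognition problem they are unknown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical generator (all distributions assumed for illustration):
# x ~ P(x) is standard normal, and the supervisor labels with
# P(omega = 1 | x) = sigmoid(x) in a two-class (k = 2) setting.
def sample_pairs(n):
    x = rng.normal(size=n)                    # x drawn from P(x)
    p1 = 1.0 / (1.0 + np.exp(-x))             # P(omega = 1 | x)
    omega = (rng.random(n) < p1).astype(int)  # omega drawn from P(omega | x)
    return x, omega

x, omega = sample_pairs(5)
```

The learner sees only the resulting pairs (x, ω), never the distributions that produced them.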
From the previous article in this series, we know that in the problem of risk minimization from empirical data, the objective is to select, based on a criterion, a function g*(z) from a set of functions {g(z)} such that the RISK
R(g(z)) = ∫ L(z,g(z)) dF(z)
is minimized (where F(z) is the probability distribution function on Z).
Reformulating the risk functional in the parametric form, we get
R(α) = ∫ Q(z,α) dF(z), α ∈ Λ
where,
Q(z,α) = L(z,g(z,α))
is called the Loss function.
Now, for pattern recognition problems, the set of functions g(x,α) will be such that they take values only from the set {0, 1, …, k-1}. Also, note that the empirical data, consisting of a random independent sample of pairs (ω₁, x₁), …, (ω_l, x_l), is given.
Consider the simplest loss function, where L(ω, g(x,α)) = 1 if ω ≠ g(x,α) and 0 otherwise. This loss function contributes nothing to the total risk when the decision given by g(x,α) is correct, but adds 1 for every incorrect decision made.
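The 0-1 loss just described is a one-liner; the sketch below (function name chosen here for illustration) implements it directly.

```python
def zero_one_loss(omega, decision):
    """Simplest loss: 1 for a wrong decision, 0 for a correct one."""
    return 0 if omega == decision else 1
```

For example, zero_one_loss(1, 1) is 0, while zero_one_loss(1, 0) is 1, regardless of which of the k classes was confused with which.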
The problem of pattern recognition is to minimize the risk functional
R(α) = ∫ L(ω, g(x,α)) dF(ω, x)
over the set of functions g(x,α), α ∈ Λ, using the given empirical data.
The above expression for the risk functional actually denotes the probability of classification error for any decision rule g(x,α).
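We can check this interpretation numerically: averaging the 0-1 loss over a large sample estimates R(α), which should match the misclassification probability of the rule. The particular distributions and the threshold rule below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed setup for illustration: x ~ N(0, 1), the supervisor uses
# P(omega = 1 | x) = sigmoid(2x), and the decision rule is a simple
# threshold g(x) = 1 if x > 0 else 0.
n = 100_000
x = rng.normal(size=n)
p1 = 1.0 / (1.0 + np.exp(-2.0 * x))
omega = (rng.random(n) < p1).astype(int)
g = (x > 0).astype(int)

# The mean 0-1 loss over the sample estimates R(alpha),
# i.e. the probability of classification error of this rule.
error_rate = float(np.mean(omega != g))
```

Since this threshold rule agrees with the more probable class at every x, its error probability is strictly below 1/2 but above 0 (the labels are noisy), and the empirical average converges to it as n grows.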
Explanation
Since ω takes only discrete values from the set {0, 1, …, k-1}, the cumulative joint distribution function F(ω, x) is discrete in its first argument. So, we can write dF(ω, x) as

dF(ω, x) = P(ω|x) dF(x)

Thus, the risk functional R(α) can be written as

R(α) = Σ_{ω=0}^{k-1} ∫ L(ω, g(x,α)) P(ω|x) dF(x)
For a chosen α, g(x,α) gives the decisions on the random independent samples of empirical data. If all the decisions of g(x,α) are wrong, the loss equals 1 everywhere and the risk functional attains F(k-1, ∞), i.e. 1. Otherwise, R(α) ≤ 1, as the rules of probability require.
For a binary classification problem and a single sample x with a deterministic label (let's say this is a positive sample, i.e. ω = 1, so that f(1,x) = 1 and f(0,x) = 0), if the sample is misclassified, then

R(α) = L(1, g(x,α))·f(1,x) + L(0, g(x,α))·f(0,x) = f(1,x) = 1

Had the decision rule correctly classified the sample x, then

R(α) = L(1, g(x,α))·f(1,x) + L(0, g(x,α))·f(0,x) = f(0,x) = 0
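This degenerate single-sample case can be verified in a few lines (the function name is chosen here for illustration):

```python
# Degenerate single-sample case: the label is deterministic, omega = 1,
# so f(1, x) = 1 and f(0, x) = 0 at this point.
f = {1: 1.0, 0: 0.0}

def risk_single(decision):
    # R = L(1, g)*f(1, x) + L(0, g)*f(0, x) with the 0-1 loss
    return sum((1 if w != decision else 0) * f[w] for w in (0, 1))

wrong = risk_single(0)  # misclassified: only the f(1, x) = 1 term survives
right = risk_single(1)  # correct: only the f(0, x) = 0 term survives
```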
However, joint distributions are not this simple in real-life data and often have multivariate properties. In general, the 0/1 values above are replaced by genuine probability values.
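To make this concrete, the risk can be computed exactly on a toy discrete joint distribution, summing the loss weighted by P(ω|x)·P(x). All the probability values below are made up for illustration.

```python
# Toy discrete setting: x in {0, 1}, omega in {0, 1}.
# P(x) and P(omega | x) are made-up numbers for illustration.
P_x = {0: 0.6, 1: 0.4}
P_w_given_x = {0: {0: 0.9, 1: 0.1},   # P(omega | x = 0)
               1: {0: 0.2, 1: 0.8}}   # P(omega | x = 1)

def g(x):
    # Decision rule: predict the more probable class at each x
    return max(P_w_given_x[x], key=P_w_given_x[x].get)

# R = sum over x and omega of L(omega, g(x)) * P(omega | x) * P(x)
risk = sum((1 if w != g(x) else 0) * P_w_given_x[x][w] * P_x[x]
           for x in P_x for w in (0, 1))
# risk = 0.1 * 0.6 + 0.2 * 0.4 = 0.14
```

Here the only contributions come from the minority class at each x, so even the best possible rule carries a nonzero risk of 0.14.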
The problem of pattern recognition can thus be reduced to a problem of risk minimization based on the empirical data even though the Cumulative Joint Distribution Function F(ω,x) is unknown.
The set of loss functions Q(z,α) is not arbitrary; it must satisfy the following restrictions:
- The vector z consists of n+1 coordinates: the coordinate ω, which takes on only a finite number of values, and n coordinates which form the n-dimensional vector x, so that z = [ω, x].
- The set of functions Q(z,α), α ∈ Λ, is given by L(ω, g(x,α)) and takes only finite values.
This specific feature of the risk minimization problem characterizes the pattern recognition problem.
Next, we will look into the problem of regression estimation.
Reference:
Vapnik, V. N. (1998). Statistical Learning Theory. Wiley-Interscience.
Please clap on this story if you like it!!
For the previous articles in this series click the links below