Chapter 2 — Inductive bias — Part 3

Pralhad Teggi
9 min read · Feb 1, 2020


Every machine learning algorithm with any ability to generalize beyond the training data that it sees has, by definition, some type of inductive bias. That is, there is some fundamental assumption or set of assumptions that the learner makes about the target function that enables it to generalize beyond the training data.

[Chart: inductive biases of various machine learning algorithms]

In the earlier story, we saw that the candidate elimination algorithm converges towards the true target concept provided it is given accurate training examples and its initial hypothesis space contains the target concept.

  • What if the target concept is not contained in the hypothesis space?
  • Can we avoid this difficulty by using a hypothesis space that includes every possible hypothesis?
  • How does the size of this hypothesis space influence the ability of the algorithm to generalize to unobserved instances?
  • How does the size of the hypothesis space influence the number of training examples that must be observed?

These are fundamental questions for inductive inference in general. Here we examine them in the context of the candidate elimination algorithm.

1. A Biased Hypothesis Space

Recall the definition of consistency: a hypothesis h is consistent with a set of training examples D if and only if h(x) = c(x) for each example (x, c(x)) in D.

Suppose the target concept c is not contained in the hypothesis space H. Then no hypothesis in H will be consistent with the set of training examples D. An obvious solution would be to enrich the hypothesis space to include every possible hypothesis.

Consider the EnjoySport example in which the hypothesis space is restricted to include only conjunctions of attribute values.

Because of this restriction, the hypothesis space is unable to represent even simple disjunctive target concepts such as “Sky = Sunny or Sky = Cloudy”.

In the following three training examples, the target concept is "Sky = Sunny or Sky = Cloudy":

  1. <Sunny, Warm, Normal, Strong, Cool, Change> → EnjoySport = Yes
  2. <Cloudy, Warm, Normal, Strong, Cool, Change> → EnjoySport = Yes
  3. <Rainy, Warm, Normal, Strong, Cool, Change> → EnjoySport = No

Now we will check whether the candidate elimination algorithm can learn this concept.

After processing the second training example, the algorithm will produce G2 = <?, ?, ?, ?, ?, ?> and S2 = <?, Warm, Normal, Strong, Cool, Change>.
This hypothesis, although it is the maximally specific hypothesis from H that is consistent with the first two examples, is already overly general: it incorrectly covers the third (negative) training example. The problem is that we have biased the learner to consider only conjunctive hypotheses. In this case we require a more expressive hypothesis space.
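
To see this concretely, here is a minimal sketch in Python (the matches helper and the tuple encoding are illustrative conventions of my own, not code from the book): a conjunctive hypothesis is a 6-tuple of constraints, and S2 also covers the third, negative example.

```python
# A conjunctive hypothesis is a tuple of constraints: '?' accepts any value,
# otherwise the constraint must equal the instance's attribute value.
def matches(h, x):
    return all(hc == '?' or hc == xc for hc, xc in zip(h, x))

# The three training examples above: (Sky, AirTemp, Humidity, Wind, Water, Forecast)
examples = [
    (('Sunny',  'Warm', 'Normal', 'Strong', 'Cool', 'Change'), True),
    (('Cloudy', 'Warm', 'Normal', 'Strong', 'Cool', 'Change'), True),
    (('Rainy',  'Warm', 'Normal', 'Strong', 'Cool', 'Change'), False),
]

# S2: the maximally specific conjunction consistent with the first two examples.
S2 = ('?', 'Warm', 'Normal', 'Strong', 'Cool', 'Change')

for x, label in examples:
    print(x[0], 'predicted:', matches(S2, x), 'actual:', label)
# The last line prints "Rainy predicted: True actual: False": S2 already
# covers the negative example, so no conjunctive hypothesis fits all three.
```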

2. An Unbiased Learner

Picture two sets: on the left, the set X of all instances; on the right, the set H of all hypotheses. Each hypothesis corresponds to some subset of X, namely the subset of instances that it classifies as positive.

Suppose we have 3 instances; then there are 2^3 = 8 possible subsets. Each subset corresponds to one hypothesis in the hypothesis space, and each hypothesis represents the concept defined by that subset of instances. Such a hypothesis space can represent every teachable concept, that is, every possible subset of the instances X.

Note — out of these 8 hypotheses, only 1 is a simple conjunction; the remaining 7 can only be expressed using combinations of disjunctions, conjunctions, and negations.

In the EnjoySport learning task, the instance space X of days described by the six attributes contains 3·2·2·2·2·2 = 96 distinct instances. In total, there are 2^96 possible subsets (i.e., possible target concepts) over these 96 distinct instances.

The solution to the problem of assuring that the target concept is in the hypothesis space H is to provide a hypothesis space capable of representing every teachable concept, that is, every possible subset of the instances X.
The set of all subsets of a set X is called the power set of X.

Note that the conjunctive hypothesis space is able to represent only 1 + 4·3·3·3·3·3 = 973 of these (every hypothesis containing one or more ∅ constraints represents the same empty concept) — a biased hypothesis space indeed.
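
These numbers are easy to verify with a few lines of Python. The sketch below is only an illustration; it assumes the standard EnjoySport attribute sizes and counts a conjunctive hypothesis as one constraint ('?', '∅', or a specific value) per attribute, with every hypothesis containing '∅' collapsing to a single empty concept.

```python
from math import prod

# Number of values per attribute: Sky has 3, the other five have 2 each.
attribute_sizes = [3, 2, 2, 2, 2, 2]

num_instances = prod(attribute_sizes)        # |X| = 96 distinct days
num_target_concepts = 2 ** num_instances     # |power set of X| = 2**96 possible concepts

# Semantically distinct conjunctions: per attribute, '?' or one of its values
# (size + 1 choices each), plus 1 for the empty concept (any hypothesis with '∅').
num_conjunctive = 1 + prod(n + 1 for n in attribute_sizes)

print(num_instances, num_conjunctive)        # 96 973
print(num_target_concepts)                   # 79228162514264337593543950336
```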

Let us reformulate the EnjoySport learning task in an unbiased way by defining a new hypothesis space H' that can represent every subset of instances; that is, let H' correspond to the power set of X. One way to define such an H' is to allow arbitrary disjunctions, conjunctions, and negations of our earlier hypotheses.

For instance, the target concept "Sky = Sunny or Sky = Cloudy" could then be described as (Sunny, ?, ?, ?, ?, ?) ∨ (Cloudy, ?, ?, ?, ?, ?).
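
As a tiny illustrative sketch (reusing the same matching convention as the sketch above, with helper names of my own choosing), such a disjunctive hypothesis is simply the logical OR of two conjunctive ones:

```python
def matches(h, x):
    return all(hc == '?' or hc == xc for hc, xc in zip(h, x))

h_sunny  = ('Sunny',  '?', '?', '?', '?', '?')
h_cloudy = ('Cloudy', '?', '?', '?', '?', '?')

# The disjunctive concept "Sky = Sunny or Sky = Cloudy" as a member of H'.
def target(x):
    return matches(h_sunny, x) or matches(h_cloudy, x)

print(target(('Sunny', 'Warm', 'Normal', 'Strong', 'Cool', 'Change')))  # True
print(target(('Rainy', 'Warm', 'Normal', 'Strong', 'Cool', 'Change')))  # False
```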

However, while this hypothesis space eliminates any problems of expressibility, it unfortunately raises a new, equally difficult problem: our concept learning algorithm is now completely unable to generalize beyond the observed examples!

For example, let the positive examples be x1, x2, x3 and the negative examples be x4, x5. The S and G boundaries will then be

S: { (x1 ∨ x2 ∨ x3) },  G: { ¬(x4 ∨ x5) }

The S boundary will always be simply the disjunction of the observed positive examples, while the G boundary will always be the negated disjunction of the observed negative examples. The only examples that will be unambiguously classified by S and G are the observed training examples themselves. In order to converge to a single, final target concept, we will have to present every single instance in X as a training example!
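
The sketch below makes this concrete on a toy instance space (the instance names x1…x6 and the helper functions are my own illustrative choices, not the book's notation): each hypothesis in the unbiased space is just a subset of X, and the unanimous vote can only reproduce the training labels.

```python
from itertools import combinations

X = ['x1', 'x2', 'x3', 'x4', 'x5', 'x6']   # a toy instance space

def power_set(xs):
    return [frozenset(c) for r in range(len(xs) + 1) for c in combinations(xs, r)]

H_prime = power_set(X)                      # the unbiased hypothesis space H'

positives, negatives = {'x1', 'x2', 'x3'}, {'x4', 'x5'}
# Version space: hypotheses containing all positives and none of the negatives.
version_space = [h for h in H_prime if positives <= h and not (h & negatives)]

def classify(xi):
    votes = {xi in h for h in version_space}
    return votes.pop() if len(votes) == 1 else 'no unanimous vote'

print(classify('x1'))   # True, an observed positive example
print(classify('x4'))   # False, an observed negative example
print(classify('x6'))   # 'no unanimous vote': the unseen x6 cannot be classified
```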

3. The Futility of Bias-Free Learning

The fundamental property of inductive inference —

“a learner that makes no a priori assumptions regarding the identity of the target concept has no rational basis for classifying any unseen instances”

In fact, the only reason that the candidate elimination algorithm was able to generalize beyond the observed training examples in our original formulation of the EnjoySport task is that it was biased by the implicit assumption that the target concept could be represented by a conjunction of attribute values. In cases where this assumption is correct (and the training examples are error-free), its classification of new instances will also be correct. If this assumption is incorrect, however, it is certain that the candidate elimination algorithm will misclassify at least some instances from X.

Let us define this notion of inductive bias more precisely.

  • Consider a concept learning algorithm L for the set of instances X. Let c be an arbitrary concept defined over X, and let Dc = {<x,c(x)>} be an arbitrary set of training examples of c.
  • After training, L is asked to classify a new instance xi. Let L(xi, Dc) denote the classification (e.g., positive or negative) that L assigns to xi after learning from the training data Dc.
  • We can describe this inductive inference step performed by L as follows: (Dc ∧ xi) ≻ L(xi, Dc)
  • Where the notation y ≻ z indicates that z is inductively inferred from y. For example, if we take L to be the candidate elimination algorithm, Dc to be the EnjoySport training data, and xi to be a new instance to classify, then the inductive inference performed in this case concludes that L(xi, Dc) = (EnjoySport = yes).

Definition:

Consider a concept learning algorithm L for the set of instances X. Let c be an arbitrary concept defined over X, and let Dc = {<x, c(x)>} be an arbitrary set of training examples of c. Let L(xi, Dc) denote the classification assigned to the instance xi by L after training on the data Dc. The inductive bias of L is any minimal set of assertions B such that for any target concept c and corresponding training examples Dc

(∀ xi ∈ X) [ (B ∧ Dc ∧ xi) ⊢ L(xi, Dc) ]

Here y ⊢ z indicates that z follows deductively from y (i.e., that z is provable from y).

4. Inductive Bias for Candidate Elimination

  • Assume a training set Dc. The candidate elimination algorithm computes the version space VS_H,Dc.
  • Classify the new instance xi by a vote among hypotheses in this version space. Here let us assume that it will output a classification for xi only if this vote among version space hypotheses is unanimously positive or negative and that it will not output a classification otherwise.
  • Conjecture: B = { c ∈ H } is the inductive bias ('the underlying target concept c is in H').
  • From c ∈ H it follows that c is a member of the version space.
  • L(xi, Dc) = k implies that all members of VS_H,Dc, including c, vote for class k (unanimous voting). Therefore c(xi) = k = L(xi, Dc).
  • This means that the output of the learner, L(xi, Dc), can be logically deduced from B ∧ Dc ∧ xi, as the sketch after this list illustrates.
  • → The inductive bias of the Candidate Elimination Algorithm is: “c is in H”
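
This can be checked with a brute-force sketch. The code below is only an illustration, assuming the usual EnjoySport attribute domains and the four training examples used in the earlier parts of this chapter: it enumerates every conjunctive hypothesis, keeps those consistent with Dc (that set is the version space), and classifies by unanimous vote. Thanks to the bias "c is in H", the vote now generalizes to instances that were never observed.

```python
from itertools import product

# The usual EnjoySport attribute domains.
domains = [
    ['Sunny', 'Cloudy', 'Rainy'],      # Sky
    ['Warm', 'Cold'],                  # AirTemp
    ['Normal', 'High'],                # Humidity
    ['Strong', 'Weak'],                # Wind
    ['Warm', 'Cool'],                  # Water
    ['Same', 'Change'],                # Forecast
]

def matches(h, x):
    # '0' is the empty constraint: such a hypothesis matches no instance at all.
    return all(hc != '0' and (hc == '?' or hc == xc) for hc, xc in zip(h, x))

# Every syntactically distinct conjunctive hypothesis: '?', '0', or one value.
H = list(product(*[['?', '0'] + vals for vals in domains]))

# Dc: the four EnjoySport training examples from the earlier parts.
Dc = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'),   True),
    (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'),   True),
    (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), False),
    (('Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'), True),
]

# The version space: all conjunctive hypotheses consistent with Dc.
version_space = [h for h in H if all(matches(h, x) == y for x, y in Dc)]

def classify(xi):
    votes = {matches(h, xi) for h in version_space}
    return votes.pop() if len(votes) == 1 else 'no unanimous vote'

# Two instances that never appeared in Dc:
print(classify(('Sunny', 'Warm', 'Normal', 'Strong', 'Cool', 'Change')))  # True
print(classify(('Rainy', 'Cold', 'Normal', 'Weak',  'Warm', 'Same')))     # False
```

Here the version space contains six hypotheses, and it confidently labels instances it has never seen, precisely because of the assumption that the target concept is a conjunction of attribute values.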

5. Inductive Systems and Equivalent Deductive Systems

We have used two terms — inductive and deductive. What is the difference between inductive and deductive logic?
Deductive logic uses given information, premises, or accepted general rules to reach a proven conclusion. That is why we say that the output of the learner, L(xi, Dc), can be logically deduced from B ∧ Dc ∧ xi.

On the other hand, inductive logic or reasoning involves making generalizations based on behavior observed in specific cases. That is why we say that L(xi, Dc) is inductively inferred from (Dc ∧ xi); the inference depends on the choice of learning algorithm and on the assumptions it makes.

The key idea is modelling inductive systems by equivalent deductive systems:

  • The input-output behavior of the candidate elimination algorithm using a hypothesis space H is identical to that of a deductive theorem prover utilizing the assertion "H contains the target concept." This assertion is therefore called the inductive bias of the candidate elimination algorithm.
  • Characterizing inductive systems by their inductive bias allows modelling them by their equivalent deductive systems. This provides a way to compare inductive systems according to their policies for generalizing beyond the observed training data.

One advantage of viewing inductive inference systems in terms of their inductive bias is that it provides a non-procedural means of characterizing their policy for generalizing beyond the observed data. A second advantage is that it allows comparison of different learners according to the strength of the inductive bias they employ.

The following three learning algorithms are listed from weakest to strongest bias.

1. Rote learning: store each observed training example in memory. Subsequent instances are classified by looking them up in memory; if the instance is found, the stored classification is returned, otherwise the system refuses to classify it.

Inductive bias: none — the weakest bias.

2. Candidate elimination algorithm: new instances are classified only when all members of the current version space agree on the classification.

Inductive bias: the target concept can be represented in its hypothesis space.

3. Find-S: finds the most specific hypothesis consistent with the training examples and uses this hypothesis to classify all subsequent instances.

Inductive bias: the target concept can be represented in its hypothesis space, plus all instances are negative unless the opposite is entailed by its other knowledge — the strongest bias.

More strongly biased methods make more inductive leaps, classifying a greater proportion of unseen instances!
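
To close, here is a small illustrative comparison of the weakest and the strongest of these learners on the same EnjoySport training set (the helper names are my own, not the book's code): the rote learner refuses to classify anything it has not stored, while Find-S takes the inductive leap.

```python
# Rote learning vs. Find-S on the four EnjoySport training examples.

def matches(h, x):
    return all(hc != '0' and (hc == '?' or hc == xc) for hc, xc in zip(h, x))

Dc = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'),   True),
    (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'),   True),
    (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), False),
    (('Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'), True),
]

# 1. Rote learner: no bias, so nothing unseen can be classified.
memory = dict(Dc)
def rote_classify(x):
    return memory.get(x, 'refuses to classify')

# 2. Find-S: generalize the most specific conjunctive hypothesis over the
#    positive examples, then call everything it does not cover negative.
h = ['0'] * 6                              # most specific hypothesis: matches nothing
for x, y in Dc:
    if y:                                  # Find-S ignores negative examples
        h = [xc if hc == '0' else (hc if hc == xc else '?') for hc, xc in zip(h, x)]

unseen = ('Sunny', 'Warm', 'Normal', 'Strong', 'Cool', 'Change')
print(rote_classify(unseen))   # 'refuses to classify'
print(matches(h, unseen))      # True: Find-S takes the inductive leap
print(h)                       # ['Sunny', 'Warm', '?', 'Strong', '?', '?']
```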

References —

  1. Tom M. Mitchell, Machine Learning, McGraw-Hill, 1997 (Chapter 2).
