Chapter 2 — Concept Learning — Part 1

Pralhad Teggi
Feb 1, 2020


Here I am writing notes for the second chapter, "Concept Learning and the General-to-Specific Ordering", from the book "Machine Learning" by Tom Mitchell. I hope this story will help those who are studying machine learning at university or who want to understand the theoretical fundamentals of machine learning.

We learn about our surroundings through the five senses — eyes, ears, nose, tongue and skin. We learn a lot of things over the course of our lives; some of that learning is based on experience and some on memorization. On this basis we can divide learning methods into five types:

  • Rote Learning (memorization): Memorizing things without knowing the concept/logic behind them.
  • Passive Learning (instructions): Learning from a teacher/expert.
  • Analogy (experience): Learning new things from our past experience.
  • Inductive Learning (experience): On the basis of past experience, formulating a generalized concept.
  • Deductive Learning: Deriving new facts from past facts.

1. Let us begin

In machine learning, our interest is in inductive learning, which is based on formulating a generalized concept after observing a number of example instances of that concept.

Tom Mitchell defines concept learning as the "problem of searching through a predefined space of potential hypotheses for the hypothesis that best fits the training examples".

Before getting into the theory of concept learning, please read my earlier "Introduction to machine learning" story, where I explained the learning process: the learning algorithm searches a hypothesis space for a potential hypothesis that best fits the training examples.

Assume that we have collected data for some attributes/features of the day, such as Sky, Air Temperature, Humidity, Wind, Water and Forecast. Let this set of instances be denoted by X; many concepts can be defined over X. For example, the concepts could be:
- Days on which my friend Sachin enjoys his favorite water sport.
- Days on which my friend Sachin will not go outside of his house.
- Days on which my friend Sachin will have a night drink.
Target concept — The concept or function to be learned is called the target concept and is denoted by c. It can be seen as a boolean-valued function defined over X and can be represented as c: X → {0, 1}.

For the target concept c, "days on which my friend Sachin enjoys his favorite water sport", an attribute EnjoySport is included in the dataset X below, and it indicates whether or not my friend Sachin enjoys his favorite water sport on that day.

Now the target concept is EnjoySport: X → {0, 1}. With this, the learner's task is to predict the value of EnjoySport for an arbitrary day, based on the values of its attributes. When a new sample with values for the attributes <Sky, Air Temperature, Humidity, Wind, Water, Forecast> is given, the value of EnjoySport (i.e. 0 or 1) is predicted based on the previous learning.

Let H denote the set of all possible hypotheses that the learner may consider regarding the identity of the target concept. So H = {h1, h2, …}.

What hypothesis representation shall we provide to the learner in this case? Let us begin by considering a simple representation in which each hypothesis consists of a conjunction of constraints on the instance attributes. Let each hypothesis be a vector of six constraints, specifying the values of the six attributes <Sky, Air Temperature, Humidity, Wind, Water, Forecast>. In this hypothesis representation, the constraint for each attribute can be one of the following (a small code sketch of this representation appears after the examples below):

  • "?" — any value is acceptable for this attribute,
  • a single required value (e.g., Warm) — the attribute must have exactly this value, or
  • "Ø" — no value is acceptable.

For example:

  • the hypothesis that my friend Sachin enjoys his favorite sport only on cold days with high humidity (independent of the values of the other attributes) is represented by the expression <?, Cold, High, ?, ?, ?>
  • the most general hypothesis — <?, ?, ?, ?, ?, ?> — represents that every day is a positive example
  • the most specific hypothesis — <Ø, Ø, Ø, Ø, Ø, Ø> — represents that no day is a positive example.
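
To make this representation concrete, here is a minimal Python sketch showing how such a conjunction of constraints classifies a day. The function name `classify` is my own choice, and I write the empty constraint as "Ø"; this is an illustration, not code from the book.

```python
# Hypotheses and instances are tuples of values for
# <Sky, AirTemp, Humidity, Wind, Water, Forecast>.

def classify(hypothesis, instance):
    """Return 1 if the conjunction of constraints accepts the instance, else 0.

    "?"   -> any value is acceptable for this attribute
    "Ø"   -> no value is acceptable (the hypothesis rejects every instance)
    other -> the instance must have exactly this value
    """
    for constraint, value in zip(hypothesis, instance):
        if constraint == "Ø" or (constraint != "?" and constraint != value):
            return 0
    return 1

# "Sachin enjoys his sport only on cold days with high humidity"
h = ("?", "Cold", "High", "?", "?", "?")
x = ("Sunny", "Cold", "High", "Strong", "Warm", "Same")

print(classify(h, x))             # 1 -> positive
print(classify(("?",) * 6, x))    # 1 -> the most general hypothesis accepts every day
print(classify(("Ø",) * 6, x))    # 0 -> the most specific hypothesis accepts no day
```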

So far we have looked into what a concept, a target concept and concept learning are. We also extended these definitions to an example where my friend Sachin enjoys a water sport on certain days, and we looked into the hypothesis representation. With this knowledge we can say that the EnjoySport concept learning task requires:

  • learning the set of days for which EnjoySport = Yes, and then
  • describing this set by a conjunction of constraints over the instance attributes.

2. EnjoySport Concept Learning Task

In my last story, we looked into defining a learning task for various examples like spam mail detection, credit approval and the checkers game. Now let's define the learning task for the EnjoySport concept.

The points below give a simplified overview (a short code sketch of this notation follows the list).

  1. X — The set of items over which the concept is defined is called the set of instances, which we denote by X. In the current example, X is the set of all possible days, each represented by the attributes Sky, AirTemp, Humidity, Wind, Water, and Forecast.
  2. c — The concept or function to be learned is called the target concept, which we denote by c. In general, c can be any boolean-valued function defined over the instances X; that is, c: X → {0, 1}. In the current example, the target concept corresponds to the value of the attribute EnjoySport (i.e., c(x) = 1 if EnjoySport = Yes, and c(x) = 0 if EnjoySport = No).
  3. (x, c(x)) — When learning the target concept, the learner is presented with a set of training examples, each consisting of an instance x from X, along with its target concept value c(x). Instances for which c(x) = 1 are called positive examples and instances for which c(x) = 0 are called negative examples. We will often write the ordered pair (x, c(x)) to describe the training example consisting of the instance x and its target concept value c(x).
  4. D — We use the symbol D to denote the set of available training examples.
  5. H — Given a set of training examples of the target concept c, the problem faced by the learner is to hypothesize, or estimate, c. We use the symbol H to denote the set of all possible hypotheses that the learner may consider regarding the identity of the target concept.
  6. h(x) — In general, each hypothesis h in H represents a Boolean-valued function defined over X; that is, h : X →{0, 1}. The goal of the learner is to find a hypothesis h such that h(x) = c(x) for all x in X.
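
To make this notation concrete, here is a small sketch (variable and function names are my own) that writes the EnjoySport training set used later in this chapter as a list of (x, c(x)) pairs and checks whether a hypothesis is consistent with it:

```python
# The four EnjoySport training examples as (x, c(x)) pairs.
# Each x is a tuple of values for <Sky, AirTemp, Humidity, Wind, Water, Forecast>;
# c(x) = 1 marks a positive example (EnjoySport = Yes), c(x) = 0 a negative one.
D = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"),   1),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"),   1),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), 0),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), 1),
]

def h_of(h, x):
    """h(x): 1 iff every constraint in h is '?' or matches the attribute value in x."""
    return int(all(c == "?" or c == v for c, v in zip(h, x)))

def consistent(h, D):
    """h is consistent with D when h(x) = c(x) for every training example."""
    return all(h_of(h, x) == cx for x, cx in D)

print(consistent(("Sunny", "Warm", "?", "Strong", "?", "?"), D))   # True
print(consistent(("?", "?", "?", "?", "?", "?"), D))               # False: accepts the negative example
```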

Notice that although this goal is stated over all of X, the learning algorithm can only work toward finding a hypothesis h in H such that h(x) = c(x) for the examples x in D it has actually observed.

We know that an inductive learning algorithm tries to induce a "general rule" from a set of observed instances. The case above is exactly that: a learning algorithm trying to find a hypothesis h (a general rule) in H such that h(x) = c(x) for all x in D. In reality, for a given collection of examples, the learning algorithm returns a function h (hypothesis) that only approximates c (the target concept). The expectation, however, is that the learning algorithm returns a hypothesis h that equals c, i.e. h(x) = c(x) for all x in D.

This leads to the inductive learning hypothesis: any hypothesis found to approximate the target function well over a sufficiently large set of training examples will also approximate the target function well over other unobserved examples.

3. Hypothesis Space

Let X denote the set of instances and H the set of hypotheses in the EnjoySport learning task. Let's compute the number of distinct instances in X and distinct hypotheses in H. Sky can take 3 values and each of the other five attributes can take 2 values, so X contains 3×2×2×2×2×2 = 96 distinct instances. Our hypothesis h is a vector of six constraints on the attributes <Sky, Air Temperature, Humidity, Wind, Water, Forecast>, and in this representation each attribute constraint can also be "?" or "Ø" in addition to the attribute's defined values. So the hypothesis space H contains 5120 syntactically distinct hypotheses.

The number of combinations is 5×4×4×4×4×4 = 5120 syntactically distinct hypotheses. These hypotheses are syntactically distinct but not all semantically distinct. For example, the two hypotheses below have the same meaning even though they look different.

h1 = <Sky=Ø AND Temp=warm AND Humidity=? AND Wind=strong AND Water=warm AND Forecast=same>

h2 = <Sky=sunny AND Temp=warm AND Humidity=? AND Wind=strong AND Water=Ø AND Forecast=same>

Neither of these hypotheses classifies any day as positive, so they are semantically the same. All hypotheses containing one or more "Ø" constraints share this empty meaning and are therefore counted as a single hypothesis. So the total number of combinations is:

1 (the single hypothesis with one or more Ø constraints) + 4×3×3×3×3×3 (each attribute can take any of its values or "?") = 973 semantically distinct hypotheses
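
As a quick sanity check on these counts, here is a short sketch. It assumes the attribute value sets of Mitchell's EnjoySport example (three values for Sky and two for each of the other attributes), which is what the 5×4×4×4×4×4 count above implies.

```python
# Number of possible values per attribute: Sky, AirTemp, Humidity, Wind, Water, Forecast.
values_per_attribute = [3, 2, 2, 2, 2, 2]

# Syntactically distinct hypotheses: each attribute may take any of its values, "?", or "Ø".
syntactic = 1
for n in values_per_attribute:
    syntactic *= n + 2
print(syntactic)   # 5 * 4 * 4 * 4 * 4 * 4 = 5120

# Semantically distinct hypotheses: every hypothesis containing a "Ø" classifies no
# instance as positive, so they all collapse into one; otherwise each attribute may
# take any of its values or "?".
semantic = 1
for n in values_per_attribute:
    semantic *= n + 1
semantic += 1
print(semantic)    # 4 * 3 * 3 * 3 * 3 * 3 + 1 = 973

# The instance space itself: each attribute takes exactly one of its values.
instances = 1
for n in values_per_attribute:
    instances *= n
print(instances)   # 3 * 2 * 2 * 2 * 2 * 2 = 96
```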

We have already seen that concept learning can be viewed as the task of searching through a large space of hypotheses. The goal of this search is to find the hypothesis that best fits the training examples. By selecting a particular hypothesis representation, the designer of the learning algorithm implicitly defines the space of all hypotheses that the program can ever represent and therefore can ever learn.

General-to-Specific Ordering of Hypotheses

Consider two hypotheses:
- h1 = <Sunny, ?, ?, Strong, ?, ?>
- h2 = <Sunny, ?, ?, ?, ?, ?>
Looking at the two representations, hypothesis h2 places fewer constraints on the attributes than hypothesis h1. Because h2 imposes fewer constraints, the set of instances classified positive by h1 is a subset of the set of instances classified positive by h2.
In other words, any instance classified positive by h1 will also be classified positive by h2.
Therefore, we say that h2 is more general than h1, or equivalently, that h1 is more specific than h2.

Let us formalize this concept. If hypotheses h1 and h2 both classify an instance x as positive, we write h1(x) = 1 and h2(x) = 1. To say that h2 is more general than h1 is to say that h1(x) = 1 implies h2(x) = 1 for every instance x. In symbols: (∀x ∈ X)[(h1(x) = 1) → (h2(x) = 1)].

Now let us state the definition of more-general-than-or-equal-to: given hypotheses hj and hk (boolean-valued functions defined over X), hj is more_general_than_or_equal_to hk (written hj ≥g hk) if and only if (∀x ∈ X)[(hk(x) = 1) → (hj(x) = 1)].

In a similar way, hj is (strictly) more_general_than hk (written hj >g hk) if and only if hj ≥g hk and hk is not ≥g hj; equivalently, we then say that hk is more_specific_than hj.
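
For this conjunctive representation, the relation can be checked attribute by attribute. Here is a small sketch of the more-general-than-or-equal-to test (the function name is my own choice):

```python
def more_general_or_equal(hj, hk):
    """hj >=g hk: every instance classified positive by hk is also classified positive by hj."""
    # If hk contains "Ø" it classifies no instance as positive,
    # so any hypothesis is at least as general as it.
    if "Ø" in hk:
        return True
    # Otherwise, attribute by attribute, hj must either accept anything ("?")
    # or require exactly the same value that hk requires.
    return all(cj == "?" or cj == ck for cj, ck in zip(hj, hk))

h1 = ("Sunny", "?", "?", "Strong", "?", "?")
h2 = ("Sunny", "?", "?", "?", "?", "?")

print(more_general_or_equal(h2, h1))   # True  -> h2 >=g h1
print(more_general_or_equal(h1, h2))   # False -> h1 is strictly more specific than h2
```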

By now we understand the kinds of relationships that can exist between two hypotheses, such as more-general-than-or-equal-to and more-general-than. Let's illustrate these relationships using the figure below.

  • In the figure, the box on the left represents the set X of all instances, and the box on the right represents the set H of all hypotheses.
  • Each hypothesis corresponds to some subset of X, that is, the subset of instances that it classifies positive. In the figure, there are two instances, x1 and x2, and three hypotheses, h1, h2 and h3.
  • h1 classifies x1 positive; h2 classifies x1 and x2 positive; h3 classifies x1 positive. This indicates that h2 is more-general-than h1 and h3.
  • The arrows connecting hypotheses represent the more-general-than relation, with the arrow pointing toward the less general hypothesis.
  • Note that the subset of instances characterized by h2 subsumes the subset characterized by h1, hence h2 is more-general-than h1.

So far we have ordered the hypotheses in the hypothesis space from specific to general. Many learning algorithms for concept learning organize the search through the hypothesis space by relying on a general-to-specific ordering of hypotheses. By taking advantage of this naturally occurring structure over the hypothesis space, we can design learning algorithms that exhaustively search even infinite hypothesis spaces without explicitly enumerating every hypothesis.

4. FIND-S: Finding a Maximally Specific Hypothesis

Having learned the concept of the general-to-specific ordering of hypotheses, it is now time to use this partial ordering to organize the search for a hypothesis that is consistent with the observed training examples. One way is to begin with the most specific possible hypothesis in H, then generalize this hypothesis each time it fails to cover an observed positive training example.

The FIND-S algorithm does exactly this. Its steps are:

  1. Initialize h to the most specific hypothesis in H.
  2. For each positive training instance x: for each attribute constraint a_i in h, if the constraint is satisfied by x, do nothing; otherwise replace a_i in h by the next more general constraint that is satisfied by x.
  3. Output hypothesis h.

To illustrate this algorithm, assume the learner is given the sequence of training examples from the EnjoySport task:

  1. The first step of FIND-S is to initialize h to the most specific hypothesis in H: h0 = <Ø, Ø, Ø, Ø, Ø, Ø>.
  2. First training example x1 = <Sunny, Warm, Normal, Strong, Warm, Same>, EnjoySport = +ve. Observing the first training example, it is clear that hypothesis h is too specific. None of the "Ø" constraints in h are satisfied by this example, so each is replaced by the next more general constraint that fits the example: h1 = <Sunny, Warm, Normal, Strong, Warm, Same>.
  3. Consider the second training example x2 = <Sunny, Warm, High, Strong, Warm, Same>, EnjoySport = +ve. The second training example forces the algorithm to further generalize h, this time substituting a "?" in place of any attribute value in h that is not satisfied by the new example. Now h2 = <Sunny, Warm, ?, Strong, Warm, Same>.
  4. Consider the third training example x3 = <Rainy, Cold, High, Strong, Warm, Change>, EnjoySport = -ve. The FIND-S algorithm simply ignores every negative example, so the hypothesis remains as before: h3 = <Sunny, Warm, ?, Strong, Warm, Same>.
  5. Consider the fourth training example x4 = <Sunny, Warm, High, Strong, Cool, Change>, EnjoySport = +ve. The fourth example leads to a further generalization of h: h4 = <Sunny, Warm, ?, Strong, ?, ?>.
  6. So the final hypothesis is <Sunny, Warm, ?, Strong, ?, ?>.

The above example can be illustrated with the figure below.

The search begins with the most specific hypothesis in H (h0), then considers increasingly general hypotheses (h1 through h4) as mandated by the training examples. The search moves from hypothesis to hypothesis along one chain of the partial ordering, from the most specific to progressively more general hypotheses. At each step, the hypothesis is generalized only as far as necessary to cover the new positive example. Therefore, at each stage the hypothesis is the most specific hypothesis consistent with the training examples observed up to that point.
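
Here is a minimal Python sketch of FIND-S for this representation (variable and function names are mine; the training data is the EnjoySport set from the walkthrough above). It prints the hypothesis after each positive example, reproducing the trace h1, h2, h4.

```python
def find_s(training_examples):
    """FIND-S for conjunctive hypotheses over six attributes."""
    # Step 1: start from the most specific hypothesis <Ø, Ø, Ø, Ø, Ø, Ø>.
    h = ["Ø"] * 6
    for x, positive in training_examples:
        if not positive:
            continue                  # FIND-S simply ignores negative examples
        # Step 2: minimally generalize h so that it covers this positive example.
        for i, value in enumerate(x):
            if h[i] == "Ø":
                h[i] = value          # Ø -> the example's attribute value
            elif h[i] != value:
                h[i] = "?"            # conflicting values -> "?"
        print(tuple(h))               # hypothesis after this positive example
    return tuple(h)

D = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"),   True),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"),   True),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), True),
]

print("final:", find_s(D))   # ('Sunny', 'Warm', '?', 'Strong', '?', '?')
```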

The key properties of the FIND-S algorithm are:

  • FIND-S is guaranteed to output the most specific hypothesis within H that is consistent with the positive training examples.
  • The FIND-S algorithm's final hypothesis will also be consistent with the negative examples, provided the correct target concept is contained in H and the training examples are correct.

5. Unanswered questions by FIND-S

There are several questions still left unanswered, such as:

  1. Has the learner converged to the correct target concept? Although FIND-S will find a hypothesis consistent with the training data, it has no way to determine whether it has found the only hypothesis in H consistent with the data (i.e., the correct target concept), or whether there are many other consistent hypotheses as well.
  2. Why prefer the most specific hypothesis? In case there are multiple hypotheses consistent with the training examples, FIND-S will find the most specific. It is unclear whether we should prefer this hypothesis over, say, the most general, or some other hypothesis of intermediate generality.
  3. Are the training examples consistent? In most practical learning problems there is some chance that the training examples will contain at least some errors or noise. Such inconsistent sets of training examples can severely mislead FIND-S, given that it ignores negative examples.
  4. What if there are several maximally specific consistent hypotheses? There can be several maximally specific hypotheses consistent with the data, but FIND-S finds only one.

NEXT → Chapter 2 — Concept Learning — Part 2

References —

  1. Tom M. Mitchell, Machine Learning, McGraw-Hill, 1997.
