From the Archives: Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms (Best Paper Award, EMNLP 2002)

Robert L. Logan IV
Published in UCI NLP
Oct 24, 2018 · 4 min read

Author: Michael Collins

Paper link

Sequence tagging is a fundamental task in NLP. For instance, part-of-speech tagging and named entity recognition are both sequence tagging problems, and both are important in their own right as well as crucial intermediate stages in pipelines for more complicated tasks such as syntactic parsing and relation extraction. Around the time this paper was written (i.e., before LSTMs and skip-gram embeddings ruled the land), the predominant approach to sequence tagging was to hand-engineer a large number of features and use them to train some kind of maximum-entropy model (e.g., logistic regression)[1] or conditional random field (CRF)[2]. In this paper, Michael Collins introduces a surprisingly effective method for training these kinds of feature-based tagging models using the perceptron algorithm.

The key observation is this. Suppose we have a dataset consisting of sequences of inputs x=[x⁽¹⁾,…,x⁽ᴺ⁾] along with corresponding tags t=[t⁽¹⁾,…,t⁽ᴺ⁾]. One of the simplest ways to model the joint probability of these sequences is with a trigram Hidden Markov model (HMM):

p(x, t) = ∏ᵢ α(t⁽ⁱ⁾ | t⁽ⁱ⁻²⁾, t⁽ⁱ⁻¹⁾) · α(x⁽ⁱ⁾ | t⁽ⁱ⁾)

where the first α term is the probability of the current tag given the previous two tags, and the second α term is the probability of the current word given the current tag. The common way to estimate these parameters is by counting the relative frequencies of tag trigrams and word-tag pairs in our data set.
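
To make the frequency-based estimation concrete, here is a minimal Python sketch. The function name, the `"*"` padding convention, and the input format are my own illustrative choices, not from the paper:

```python
from collections import defaultdict

def estimate_hmm(tagged_sentences):
    """Relative-frequency estimates for the trigram HMM parameters.

    `tagged_sentences` is a list of (words, tags) pairs; "*" pads the
    start of each tag sequence so every position has two predecessors.
    """
    trigram_counts = defaultdict(int)   # (t'', t', t) occurrences
    bigram_counts = defaultdict(int)    # (t'', t') occurrences
    emission_counts = defaultdict(int)  # (t, x) occurrences
    tag_counts = defaultdict(int)       # t occurrences

    for words, tags in tagged_sentences:
        padded = ["*", "*"] + list(tags)
        for i, (word, tag) in enumerate(zip(words, tags)):
            trigram_counts[(padded[i], padded[i + 1], tag)] += 1
            bigram_counts[(padded[i], padded[i + 1])] += 1
            emission_counts[(tag, word)] += 1
            tag_counts[tag] += 1

    # alpha(t | t'', t') and alpha(x | t) as relative frequencies
    transitions = {k: v / bigram_counts[k[:2]] for k, v in trigram_counts.items()}
    emissions = {k: v / tag_counts[k[0]] for k, v in emission_counts.items()}
    return transitions, emissions
```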

Collins notes that there is an alternative way to produce these estimates using the perceptron algorithm. For each iteration of the algorithm, we begin by using the Viterbi algorithm to decode the best possible tag sequence z=[z⁽¹⁾,…,z⁽ᴺ⁾] for each observed sequence x under the current model. Then we update our parameters by comparing the number of times a tag trigram / word-tag pair occurs in the true tag sequence t to the number of times it occurs in the predicted tag sequence z. For every tag trigram (a, b, c), the update is:

α(c | a, b) ← α(c | a, b) + count((a, b, c) in t) − count((a, b, c) in z)

(Note: the update rule for the other α term, the word-tag probabilities, is defined analogously.)

Note that, when performed over the entire data set, this update rule is driven by exactly the same statistics as the standard frequency-based approach: each parameter accumulates the difference between how often its tag trigram (or word-tag pair) occurs in the true sequences and how often it occurs in the predicted ones.
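
Here is a minimal Python sketch of this count-based perceptron training loop. The `viterbi_decode` helper is assumed (any implementation that returns the highest-scoring tag sequence under the current weights will do), and all names are illustrative rather than from the paper:

```python
from collections import defaultdict

def local_counts(words, tags):
    """Tag-trigram and word-tag counts for one tagged sentence."""
    counts = defaultdict(int)
    padded = ["*", "*"] + list(tags)
    for i, (word, tag) in enumerate(zip(words, tags)):
        counts[(padded[i], padded[i + 1], tag)] += 1  # tag trigram
        counts[(tag, word)] += 1                      # word-tag pair
    return counts

def perceptron_train(data, viterbi_decode, epochs=5):
    """Count-based perceptron training over (words, tags) pairs.

    `viterbi_decode(words, weights)` is assumed to return the
    highest-scoring tag sequence under the current weights.
    """
    weights = defaultdict(float)  # keyed by trigrams and word-tag pairs
    for _ in range(epochs):
        for words, gold_tags in data:
            pred_tags = viterbi_decode(words, weights)
            if pred_tags != gold_tags:
                gold = local_counts(words, gold_tags)
                pred = local_counts(words, pred_tags)
                for key in set(gold) | set(pred):
                    weights[key] += gold[key] - pred[key]
    return weights
```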

Interestingly, this approach can also be applied to estimate the parameters of more complicated models, such as a feature-based maximum-entropy (e.g., logistic regression) model:

p(t⁽ⁱ⁾ | h⁽ⁱ⁾) = exp(∑ₛ αₛ ϕₛ(h⁽ⁱ⁾, t⁽ⁱ⁾)) / Z(h⁽ⁱ⁾)

where we are using h⁽ⁱ⁾ to denote all of the contextual variables (e.g., the previous two tags as well as the observed words) and the ϕₛ's to denote features. The perceptron algorithm works almost exactly the same way. Each iteration again consists of using Viterbi to decode the predicted tag sequence, then updating the parameters by comparing differences between the true and predicted sequences. The only change we need to make is to compare features instead of trigram / word-tag pair counts. That is:

αₛ ← αₛ + Φₛ(x, t) − Φₛ(x, z)

where Φₛ is just the sum of ϕₛ over the whole sequence (e.g., if ϕₛ is an indicator function measuring whether the previous tag was NOUN, then Φₛ counts the number of times this happens over the sequence). Collins also notes that this approach works better in practice if the final parameters are taken to be the average of the weight vectors produced over the course of training, rather than the final weight vector.
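
Here is one way the feature-based, averaged variant might look in Python, again assuming hypothetical `features` and `viterbi_decode` helpers. This naive version keeps a running sum of the weights after every sentence; real implementations usually average lazily to avoid the full sweep:

```python
from collections import defaultdict

def averaged_perceptron(data, features, viterbi_decode, epochs=5):
    """Feature-based perceptron with weight averaging.

    `features(words, tags)` is assumed to return a dict of global
    feature counts (the Phi values, i.e. local phi's summed over the
    sequence); `viterbi_decode(words, weights)` returns the best tag
    sequence under the current weights. Both helpers are hypothetical.
    """
    weights = defaultdict(float)
    totals = defaultdict(float)  # running sum of intermediate weights
    n_steps = 0

    for _ in range(epochs):
        for words, gold_tags in data:
            pred_tags = viterbi_decode(words, weights)
            gold_phi = features(words, gold_tags)
            pred_phi = features(words, pred_tags)
            for key in set(gold_phi) | set(pred_phi):
                weights[key] += gold_phi.get(key, 0) - pred_phi.get(key, 0)
            # accumulate for averaging (naive; real code averages lazily)
            for key, value in weights.items():
                totals[key] += value
            n_steps += 1

    # final parameters: average over all intermediate weight vectors
    return {key: value / n_steps for key, value in totals.items()}
```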

The remainder of the paper provides theorems proving that this idea in fact works (including a mistake bound when the data is separable), and describes experimental results on two tasks: part-of-speech tagging and noun-phrase chunking. Compared to a maximum-entropy model trained using the standard approach, the perceptron-based model:

  1. Converges more quickly (the standard approach requires up to 100× as many training iterations).
  2. Has a lower error rate on POS tagging (an absolute decrease of ~0.5%) and a higher F1 score on chunking (an absolute increase of ~1.0), provided the aforementioned averaging trick is used.

Having recently spent an eternity training a CRF (on modern hardware) for a research project, I'd be willing to wager that the first result was what pushed reviewers to give this the best paper award. However, I find the second result more surprising: given that the model being compared is the same (i.e., a maximum-entropy model), it seems counter-intuitive to me that the training algorithm that actually tries to maximize entropy produces a model that does not generalize as well on test data as the perceptron-based training algorithm. If you have any thoughts on why this would be the case, I would love to hear them in the comments!

References

[1] Ratnaparkhi, Adwait. "A Maximum Entropy Model for Part-of-Speech Tagging." Proceedings of EMNLP. 1996.

[2] Lafferty, John, Andrew McCallum, and Fernando C. N. Pereira. "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data." Proceedings of ICML. 2001.
