Language Models & Lexicons

A statistical language model is a probability distribution over sequences of words. If we can create a language model from a text corpus, we can use it to perform tasks like text classification or text generation in the style of that corpus, because the model captures the essence of the corpus. In this post I cover:

  • Generative vs Discriminative Text Classifiers: Naive Bayes, Logistic Regression, Linear SVM
  • Dependency Parsing, Context-Free Grammar (Phrase Structure)
  • Hidden Markov Models (HMMs): I use PyTorch to build an HMM and implement the Viterbi, Forward-Backward, and Baum-Welch algorithms.
  • Text Classification, Precision, Recall, Cross-validation, comparing classifiers (hypothesis testing), TF-IDF

Lecture videos 2–6 from the University of Washington’s NLP course are here.

Language Models (Video lecture 2)

Language models are a predictive theory of what word will come next.

Log-Linear Models

Log-linear models represent the word and the word’s history as a feature vector. Usually the features are things we’ve engineered (found to work well). In NLP the features tend to be binary, but they don’t have to be. Estimation/learning is carried out by an iterative algorithm like Stochastic Gradient Descent. We really focus on using linguistic patterns to come up with features!

There are several estimation/learning algorithms, but here are the two most popular convex optimization methods for maximizing likelihood today:

  • Batch methods (L-BFGS is popular). A batch method evaluates the whole objective over all of the data, calculates its gradient, and takes a step. L-BFGS does some things more cleverly than plain Gradient Descent: in addition to calculating the gradient, it keeps an approximation to the second derivative in its memory.
  • Stochastic Gradient Descent/Ascent is most widely used. The nice thing about SGD, compared with a batch method, is that we start making progress right away: every time we look at an example we immediately make an update. In practice, SGD gets to a pretty good solution faster than a batch method (in relatively few passes over the data). It’s a good choice when you have a budget/deadline. A sketch of one SGD update follows this list.
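To make the stochastic update concrete, here is a minimal sketch (my own illustration, not from the lecture) of one SGD step for a multiclass log-linear classifier, including the quadratic regularization penalty mentioned in the take-aways below; the features, classes, and learning rate are made up:

```python
import numpy as np

def sgd_step(weights, features, label, lr=0.1, l2=0.01):
    """One stochastic gradient ascent step on the regularized log-likelihood of a single example."""
    scores = weights @ features                # unnormalized log-probabilities, one per class
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                       # softmax over classes

    grad = -np.outer(probs, features)          # minus the expected feature counts under the model
    grad[label] += features                    # plus the observed feature counts for the gold label
    grad -= l2 * weights                       # quadratic (L2) regularization penalty

    return weights + lr * grad                 # gradient *ascent*: one step per example

# toy usage: 3 classes, 4 binary features
w = np.zeros((3, 4))
x = np.array([1.0, 0.0, 1.0, 0.0])
w = sgd_step(w, x, label=2)
```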

I’ll summarize the most important take-aways to remember:

  1. We probably aren’t going to implement a log-linear model ourselves, but we will definitely run into them again.
  2. We have to use an iterative method like Stochastic Gradient Descent because there is no closed form; we can’t set the weights by simply implementing a function that counts, normalizes, or transforms the data.
  3. Log-linear language models are expensive. They never really took off because computing the normalizing summation (over the whole vocabulary) for every prediction is just too expensive, so they’re unrealistic to use in real time.
  4. Regularization is very important! We subtract a quadratic term at the end of our objective function and suffer a penalty when the weights get big, so we don’t actually do pure maximum likelihood.

Text Classification (Video lecture 3)

Before jumping directly into text classification, let’s first look at how to measure performance.

There are some issues with Test-Set accuracy. For example:

  1. Class imbalance: You could get 99% accuracy by simply labelling all emails as “not spam”.
  2. Variance due to the test data: Your test data is really just an approximation of the true distribution of the universe of data, and changing your test data a little bit might lead to some variance. So it’s easy to get lucky or unlucky.

Let’s look at a different way to evaluate performance.

Evaluation, in a Two-Class Case

Side note:

If you’d like deeper explanations on performance metrics (with visualizations) Andrew Long wrote a very helpful blog on Recall, Specificity, Accuracy and Precision.

If you would like to read an article that uses equations for performance metrics, check out this article by William Koehrsen.

Suppose we have two classes, one of which is the “target” class and the other the non-target class. If we use a search engine as an example, you can think of someone entering a query, and the target class is the set of documents relevant to that query. Let’s look at two complementary measures of performance called precision and recall.

Precision — Using the search engine as an example, precision measures the fraction of the documents the classifier returns that are actually relevant. Precision measures the goal of returning a “pure” set of target instances.

Recall — Out of every document that I should have returned, how many did I actually return correctly? Recall measures the goal of capturing all of the targeted instances.

Venn Diagram for Precision and Recall

Every document on the web will be either inside or outside of these circles.

Instances outside of the circle are not relevant to the query, so we won’t see them and don’t care about them.

The documents in circle A are the documents that are actually in the “target” class. These are the relevant documents that you want to be returned.

Circle B contains the set of documents that your classifier believes to be a relevant document. A.k.a. what your classifier will show you.

And circle C contains all of the documents that are correct. In other words, it’s the set of documents that are both in the target class and returned by your classifier.

F-measure, a.k.a. F1 — the harmonic mean of precision and recall: a single number that balances, combines, and considers both.

Formulas

P(classify) = |C| / |B|
R(classify) = |C| / |A|
F1(classify) = 2 * (P * R) / (P + R)
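A quick sketch of these formulas in code, using made-up document sets A (relevant), B (returned), and C = A ∩ B:

```python
# A = documents actually in the target class, B = documents the classifier returned
A = {"doc1", "doc2", "doc3", "doc4"}
B = {"doc2", "doc3", "doc5"}
C = A & B                        # documents that are both relevant and returned

precision = len(C) / len(B)      # |C| / |B|  -> 2/3
recall    = len(C) / len(A)      # |C| / |A|  -> 2/4
f1 = 2 * precision * recall / (precision + recall)
```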

It’s important to know that oftentimes in NLP there will be more than two classes. For example, maybe we want to classify one article as sports, another as news, and another as business, etc. We can still use precision and recall, either macroaveraged or microaveraged.

Evaluation with > 2 Classes

Macroaveraged precision and recall — Let each class be the target in turn and report the average precision and recall across all classes. Basically this treats every class as having an equal weight.

Microaveraged precision and recall — Pool all one-vs.-rest decisions into a single contingency table and calculate precision and recall from that. Basically this gives heavier weight to the more dominant classes (see the sketch below).
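A small sketch of the difference, using made-up per-class counts of true positives, false positives, and false negatives:

```python
# Hypothetical per-class counts: (true positives, false positives, false negatives)
counts = {"sports":   (50, 10, 5),
          "news":     (30, 15, 10),
          "business": (5,  2,  8)}

# Macroaveraged precision: compute precision per class, then average (every class weighted equally)
macro_p = sum(tp / (tp + fp) for tp, fp, fn in counts.values()) / len(counts)

# Microaveraged precision: pool all decisions into one table first (dominant classes weigh more)
total_tp = sum(tp for tp, fp, fn in counts.values())
total_fp = sum(fp for tp, fp, fn in counts.values())
micro_p = total_tp / (total_tp + total_fp)
```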

Cross-Validation

It’s important to remember that accuracy, precision, recall, and F1 are just estimates of how the classifier would perform under the true data distribution. The estimates are noisy because we only have a finite sample.

K-fold cross-validation

Partition the training set into K non-overlapping “folds” (10 is most commonly used, but it can be another value). Then do a series of training runs, each on all of the data except one fold: hold out one fold on each turn and use it as the development set, estimate the quality on that ith development set, and repeat this K times. We end up with K measures of accuracy, K measures of precision, K measures of recall, etc., and finally we report the average. You might also want to report the standard error, which gives a sense of how much the estimate wobbles around based on splits of the data.
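As a concrete sketch, 10-fold cross-validation with a Naïve Bayes classifier might look like the following (scikit-learn isn’t part of the lecture, it’s just one convenient way to do this, and the count data here is random):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# X: document feature matrix (e.g. term counts), y: labels -- random toy data for illustration
rng = np.random.RandomState(0)
X = rng.randint(0, 5, size=(100, 20))
y = rng.randint(0, 2, size=100)

scores = cross_val_score(MultinomialNB(), X, y, cv=10, scoring="accuracy")
print(scores.mean())                        # average accuracy over the 10 folds
print(scores.std() / np.sqrt(len(scores)))  # standard error: how much it wobbles across splits
```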

Hypothesis Testing — Statistical Significance

Suppose we have two classifiers and want to know whether classifier1 is significantly better than classifier2. If classifier1 has better accuracy, we can’t automatically assume it’s really better. It might have just gotten lucky.

In our experiment, we can think of the “null hypothesis (H0)” as your skeptical friend who says no, classifier1 is not better.

But if accuracy1 >> accuracy2 we are tempted to believe otherwise. So that brings us to a great question, how much larger must accuracy1 be than accuracy2 to reject the null hypothesis?

McNemar’s (1947) Hypothesis Test for Text Classifiers

McNemar’s is one clean, simple, transparent statistical test that is relevant to NLP.

  1. The null hypothesis: true accuracy1 = true accuracy2
  2. Pick a significance level alpha, an “acceptably” high probability of incorrectly rejecting the null hypothesis. Think about how bad it would be to have to retract your claim later if the null hypothesis turns out to be true, then set your alpha low enough that you’re comfortable with that risk. In scientific literature it’s common to set alpha = 5%, meaning 1 experiment in 20 will incorrectly reject the null hypothesis. This is subjective.
  3. Calculate the test statistic, k (explained below)
  4. Calculate the probability of an as extreme or more extreme value of k, assuming the null hypothesis is true; this is the p-value. In other words, how probable is it that we would observe k or something more extreme if the null hypothesis is true. How surprising is what we’ve seen?
  5. Reject the null hypothesis if the p-value is less than alpha.

Be careful. The p-value is the probability of this observation (or one more extreme) given that the null hypothesis is true.

The p-value is p(this observation | H0 is true), not the other way around!

Details of the McNemar Test

McNemar’s Test Details

The contingency table shows whether each classifier got an example right or wrong. We are interested in the cells where one classifier is right and the other is wrong, i.e. c01 and c10.

We want to know how surprising the smaller of c01 and c10 is, so k is the minimum of the two.

Here’s how to think about it: imagine a binomial distribution where we toss c01 + c10 fair coins. What’s the probability of getting k or fewer heads?
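A minimal sketch of the test, assuming hypothetical disagreement counts c01 and c10 and using SciPy’s binomial distribution:

```python
from scipy.stats import binom

c01 = 12   # examples classifier1 got right and classifier2 got wrong (hypothetical counts)
c10 = 4    # examples classifier2 got right and classifier1 got wrong

k, n = min(c01, c10), c01 + c10
p_value = binom.cdf(k, n, 0.5)   # P(k or fewer heads in n fair coin tosses)
# (for a two-sided test you would typically double this value)

alpha = 0.05
reject_null = p_value < alpha
```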

Features in Text Classification

Oftentimes features are simply term frequencies. For example, from the sentence the vodka was great, but don’t touch the hamburger, the term frequencies would be a vector containing counts:

hamburger = 1
the = 2
delicious = 0
don’t touch = 1
...

Transformations on Word Frequencies

Logarithm — When word counts get really big, they start to have a disproportionate influence on the category, so it’s better to damp down the frequencies. Taking the log tones them down, so the difference between 1,000 and 10,000 becomes less significant.

Idf weighting, a.k.a. TF-IDF — Document frequency is a different way of counting: it checks how many documents a word occurred in. If a word like “the” occurs in every single document, it’s not very helpful, so before we even build our classifier let’s take that into account in our features. Conversely, if a word occurs in relatively few documents, it’s more likely to be useful.

Formula is: log(total number of documents / number of documents a word occurred in)

So for a word like “the” we would ignore it because log(n/n) = log(1) = 0.

But if a word like “hamburger” occurs in only 5% of documents, then log(1/0.05) = log(20) ≈ 1.30, so we multiply the frequency of “hamburger” by log(20), or about 1.3.
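Here’s that idf formula as a tiny sketch (base-10 log, to match the log(20) ≈ 1.30 example above):

```python
import math

def idf(total_docs, docs_containing_word):
    """log(total number of documents / number of documents the word occurred in)"""
    return math.log10(total_docs / docs_containing_word)

idf(1000, 1000)   # "the" occurs in every document   -> log(1)  = 0.0
idf(1000, 50)     # "hamburger" in 5% of documents   -> log(20) ≈ 1.30

# tf-idf weight for a term in one document:
# weight = term_frequency_in_doc * idf(total_docs, docs_containing_word)
```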

Disjunction of terms; Clustering: You can run clustering algorithms to get words that behave similarly. Suppose that Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, and Saturday all land in a cluster together. We could have a feature that indicates whether there’s a day of the week in the document. The feature fires if any of those days of the week appears; it doesn’t matter which one.
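Such a disjunction feature might look like this (a made-up sketch, not from the lecture):

```python
WEEKDAYS = {"sunday", "monday", "tuesday", "wednesday", "thursday", "friday", "saturday"}

def has_weekday(tokens):
    """Binary feature: fires if any day of the week appears in the document."""
    return int(any(token.lower() in WEEKDAYS for token in tokens))

has_weekday(["See", "you", "on", "Friday", "!"])   # -> 1
```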

Generative vs. Discriminative Classification

Generative (or Joint) models place probabilities over both the observed data and the hidden classes (generates the observed data from hidden stuff).

Language models, Naïve Bayes, N-gram models, and hidden Markov models are all generative.

Graphical model diagram. The circles are random variables. The lines are direct dependencies between them. Each node is a classifier based on its incoming arcs. For Naïve Bayes, at classification time we observe the various words of the document (d1, d2, d3) and we want to predict the class based on that. There’s the prior probability of the class P(c) and the probability of each word in the document given the class, e.g. P(d1 | c).

Naïve Bayes Classifier — Generative

Naïve Bayes is a generative classifier because it tells a story about how the data is generated: first you pick the label, then for each feature you pick the feature value conditioned on the label. That gives us a distribution that tells us how probable our data is. In other words, we have a generative direction where the words are “generated” from the class, so you have a prior probability P(class) and P(data | class).

Naïve Bayes calculates the probability of all of the features together with a label. So it gives us a distribution of documents with their labels together. Here’s another way to think about how Naïve Bayes works, step by step. Let’s imagine that we have a dice that has our various labels on it:

  1. Roll the dice to pick a label.
  2. For this particular label, walk though the features from 1 to d, and get a distribution of the feature values for each of the features.
  3. Let’s imagine that we also have a dice for each label and each feature. So the total number of different dice = d*number of labels. And then we roll each dice.
  4. Don’t forget to multiply in the prior probability, Π, of that label.
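Here is a minimal sketch of that generative story for a bag-of-words (multinomial) Naïve Bayes model; the labels, vocabulary, and probabilities below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

labels = ["spam", "not_spam"]
prior = np.array([0.3, 0.7])            # the prior "label die": P(label)

vocab = ["viagra", "meeting", "lottery"]
emission = np.array([[0.6, 0.1, 0.3],   # the "word die" for spam:     P(word | spam)
                     [0.1, 0.8, 0.1]])  # the "word die" for not_spam: P(word | not_spam)

def generate_document(length=5):
    """The Naive Bayes generative story: roll the label die, then roll a word die per position."""
    label = rng.choice(len(labels), p=prior)
    words = rng.choice(vocab, size=length, p=emission[label])
    return labels[label], list(words)

generate_document()   # e.g. ('not_spam', ['meeting', 'meeting', 'lottery', ...])
```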

Naïve Bayes is a probabilistic classifier. Here are some pros and cons:

Pros

  • Often works quite well. Old/studied a lot.
  • Simplest. Easy to build so it’s a good starting point.
  • Fast.
  • No special optimization required.

Cons

  • The features are all treated as independent (that’s why it’s called naïve). Because we separately estimate each feature’s distribution given each label, the model doesn’t take into account two features colluding. If two features co-occur very often, there’s no joint consideration of how much each feature should influence the classifier’s decision.

Discriminative Classifiers

Sometimes we don’t need to spend all of this time building a (generative) model that tells us how probable our data is. For example, we might really be interested in the probability of the label given the text (output given the input).

In a discriminative model (also called conditional model) such as logistic regression we’ve observed the words of the document and want to predict the class. However, this time we are directly putting a probability over the class given all of the data we’ve observed P(c | d1, d2, d3).

Discriminative models focus on optimizing a performance measure like accuracy or something similar. Benefits include:

  • They give high accuracy performance
  • They make it easy to incorporate lots of linguistically important features
  • They allow automatic building of language independent, retargetable NLP modules

The rule of thumb/conventional wisdom is to use discriminative models when you have a large amount of data; they will typically perform better than generative models. However, with a small dataset, discriminative models have a bigger problem with over-fitting.

There are many more discriminative classifiers, but I’ll focus on multinomial logistic regression and linear SVM.

Logistic Regression vs. Linear SVM

In Logistic Regression we seek to maximize conditional likelihood. We will choose the lambda (λ “weights”) parameters that maximize the conditional likelihood of the data according to this model. LR constructs a probability distribution over classifications.

A Support Vector Machine (SVM), also known as a max-margin classifier, is an algorithm used for classification problems. An SVM is also a good way of discriminating classes; however, it does not provide a probability distribution over classes.

If you look at the optimization problems of linear SVM and (regularized) Logistic Regression, they are very similar. They only differ in the loss function. SVM minimizes hinge loss while logistic regression minimizes log loss.

Purple is hinge loss. Blue is log loss. Red is zero-one loss (error).

There are 2 differences to note:

  • Logistic loss diverges faster than hinge loss. So, in general, it will be more sensitive to outliers.
  • Logistic loss does not go to zero even if the point is classified sufficiently confidently. This might lead to minor degradation in accuracy.
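To see the two loss functions side by side, here is a tiny sketch that evaluates them on the margin y·f(x); the margin values are just illustrative:

```python
import numpy as np

def hinge_loss(margin):
    """SVM loss: exactly zero once a point is classified with margin >= 1."""
    return np.maximum(0.0, 1.0 - margin)

def log_loss(margin):
    """Logistic loss: smooth, and never exactly zero, even for confidently correct points."""
    return np.log(1.0 + np.exp(-margin))

margins = np.array([-2.0, 0.0, 1.0, 3.0])
hinge_loss(margins)   # [3.  , 1.  , 0.  , 0.  ]
log_loss(margins)     # [2.13, 0.69, 0.31, 0.05]  (approximately)
```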

Rumor: random forests are widely used in industry when performance matters more than interpretability.

Sequence Models (Video lecture 4)

Hidden Markov Model

The first 10 minutes of this video does a great job of explaining Hidden Markov Models using a very simple example.

A Hidden Markov Model (HMM) is a sequence classifier. Like other machine learning algorithms, it can be trained: given labeled sequences of observations, it learns parameters and then uses them to assign a sequence of labels to a new sequence of observations. Let’s define an HMM framework containing the following components:

  • States (e.g., labels): T=t1, t2, …, tN
  • Observations (e.g., words) : W=w1, w2, …, wN
  • Two special states: t_start and t_end which are not associated with the observation

Definitions:

  • Transition probability: A matrix A with the probabilities of transitioning from one state to the next state, over time.
  • Emission probability: A matrix B with the probabilities of an observation being generated from a state.
  • Initial probability: An initial probability distribution over states
  • Final probability: A final probability distribution over states

Graphical representation of Hidden Markov Models. Grey are elements that I know. T’s are labels. W’s are words. The arrows tell you which node is conditioning the current node.

Intuition and Assumptions

In an HMM, if we already knew the label of a node, we would know more about the nodes to its left and right. For example, if I knew t3, it would give me some evidence about t2.

The probability of a particular state is dependent only on the previous state. HMM does not have a direct way to look at all of the other states. So if we’re making a decision about t3, the other words in the sentence aren’t immediately available. So that is a weakness.

Some Use Cases

  1. HMMs can be used in supervised training for part-of-speech (POS) tagging.
  2. Another example is named-entity recognition.
  3. Unsupervised learning

Sequence Models Takeaways

Sequence models are extremely useful:

Syntax:

  • part-of-speech tags,
  • base noun phrase chunking

Semantics:

  • supersense tags
  • named entity recognition
  • multiword expressions

All of the above are called “shallow” methods. Why? Because if we want to represent the full semantics of a sentence (who did it, why, and where), none of those methods will get us there. They are seen as “pre-processing” for something else (another method) that is richer or deeper.

Syntactic Structure

Constituency (phrase structure)

Constituents are units of a sentence, such as Noun Phrase (NP) or Verb Phrase (VP). E.g. with the sentence Fed raises interest rates we have their parts of speech. Fed is a proper noun. Raises is a verb. Interest is a noun. And rates is a noun. Those roll up into constituents. The verb phrases and noun phrases are constituents. So interest rates is a constituent.

Verb Phrase and Noun Phrase are constituents of the sentence.

How do we know what is a constituent? We can look at distribution because a constituent behaves as a unit that can appear in different places, e.g. the constituents in the sentence John talked [to the children][about drugs] can be rewritten as John talked [about drugs][to the children].

Christopher Manning of Stanford’s NLP group made a great video that explains this in a simple way.

Dependency Structure (Linguistic Structure)

Dependency structure shows which words depend on which other words.

The “head” of the entire sentence is put. Put has 3 arguments itself: “boy”, “tortoise”, and “on the rug”. It represents who…put what…where. Next, we ask if “boy” has a modifier. Yes, its modifier is “The”. Does “tortoise” have any modifiers or arguments? Yes, its modifier is “the”. Then “on”, being a preposition, takes a complement, which is “rug”. And lastly, “the” is the modifier for “rug”.

CFGs and PCFGs

CFG and PCFG stand for Context-Free Grammar and Probabilistic Context-Free Grammar, respectively. Phrase structure grammar = context-free grammar. In a context-free grammar, a category rewrites as a sequence of other categories, eventually rewriting all the way down to the terminals (words).

A phrase structure grammar a.k.a. a context-free grammar.

It can be represented, formally, as a 5-tuple, G = (T, N, S, R, P).

T is a set of terminal symbols (words, at the end)
N is a set of nonterminal symbols
S is the start symbol
R is a set of rules
P is a probability function

Here’s an example of a PCFG.

Example of a PCFG. Next to each rule there’s a probability. There’s only one possible rewrite for S so the probability is 1.0. However, there are 3 possible rewrites for NP so if we add up 0.1, 0.2, and 0.7 we get a total probability of 1.0.

Grammar Transforms (Chomsky Normal Form)

Chomsky Normal Form says that all rules must be of the form X → Y Z or X → w, where X, Y, and Z are nonterminals and w is a terminal. In other words, a nonterminal rewrites either as exactly two nonterminals or as a single terminal.

Internally, when the system is doing Chomsky transformation steps it’s getting rid of any unary rules and empties.

You should think of this as a transformation that allows for more efficient parsing. Binarization is crucial for cubic time CFG parsing.

CKY Parsing Algorithm

CKY parses (P)CFGs exactly in polynomial time. In particular, it gives us the means of finding the most probable parse of a sentence according to a PCFG in polynomial time (rather than doing an exponential amount of work). The CKY algorithm is a cubic-time algorithm.

Visual example of the CKY triangle and probability math for “factory payrolls”.
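Here is a minimal sketch of probabilistic CKY for a tiny grammar in Chomsky Normal Form; the grammar, words, and probabilities below are invented for illustration (not the lecture’s example):

```python
from collections import defaultdict

# Tiny made-up PCFG in Chomsky Normal Form: binary rules X -> Y Z and lexical rules X -> w
binary_rules = {               # (Y, Z) -> list of (X, probability of X -> Y Z)
    ("NP", "VP"): [("S", 1.0)],
    ("Det", "N"): [("NP", 0.7)],
    ("V", "NP"):  [("VP", 0.8)],
}
lexical_rules = {              # word -> list of (X, probability of X -> word)
    "the":    [("Det", 1.0)],
    "dog":    [("N", 0.5)],
    "cat":    [("N", 0.5)],
    "chased": [("V", 1.0)],
}

def cky(words):
    """Probabilistic CKY: chart[i][j][X] is the best probability of X spanning words[i:j]."""
    n = len(words)
    chart = [[defaultdict(float) for _ in range(n + 1)] for _ in range(n + 1)]

    for i, w in enumerate(words):                  # length-1 spans: apply lexical rules
        for X, p in lexical_rules.get(w, []):
            chart[i][i + 1][X] = p

    for span in range(2, n + 1):                   # widen spans; cubic time overall
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):              # split point
                for Y, pY in chart[i][k].items():
                    for Z, pZ in chart[k][j].items():
                        for X, pR in binary_rules.get((Y, Z), []):
                            p = pR * pY * pZ
                            if p > chart[i][j][X]:
                                chart[i][j][X] = p
    return chart[0][n].get("S", 0.0)               # best probability of a full parse

cky("the dog chased the cat".split())              # -> 0.098
```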

Dependency Parsing

A dependency grammar has a notion of a head. Modern statistical parsers like the Stanford parser have “head rules”. E.g.

  • The head of a Noun Phrase is a noun/number/adj/…
  • The head of a Verb Phrase is a verb/modal/…

Methods of Dependency Parsing

There are several methods of dependency parsing:

  1. Dynamic programming (like in the CKY algorithm)
  2. Graph algorithms
  3. Constraint Satisfaction
  4. Deterministic parsing

Deterministic parsing is where you go from left to right through the sentence and make greedy decisions based on machine learning classifiers, as to which words to connect to other words as dependents. The MaltParser (transition-based) is one of the most well known methods for doing dependency parsing and it works very well (accurately and exceedingly quickly). So I’ll focus on that.

  • The parser does a sequence of bottom up actions — kind of like “shift” or “reduce” in a shift-reduce parser, but the “reduce” actions are specialized to create dependencies with head on left or right.
  • In order to choose the next action, each action is predicted by a discriminative classifier (often SVM, could be maxent classifier).
  • The model’s accuracy is slightly below the best LPCFGs (evaluated on dependencies), but
  • It provides close to state of the art parsing performance
  • It provides VERY fast linear time parsing

Evaluating Dependency Parsers

The standard way is to look at accuracy. Since we choose a “head” for each word, we can look at how many of those we got right.

Acc = (# correct dependencies)/(total # of dependencies). There are two ways we can do that.

  1. Unlabeled Attachment Score (UAS): ignore the labels and score the heads only.
  2. Labeled Attachment Score (LAS): require both the head and the label to be correct.
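A small sketch of both scores on a made-up three-word sentence, where each word index maps to a (head index, label) pair:

```python
# Hypothetical gold and predicted dependencies: word index -> (head index, label); 0 is the root
gold = {1: (2, "det"), 2: (3, "nsubj"), 3: (0, "root")}
pred = {1: (2, "det"), 2: (3, "dobj"), 3: (0, "root")}

# UAS: heads only -> 1.0
uas = sum(pred[w][0] == head for w, (head, label) in gold.items()) / len(gold)

# LAS: head and label must both match -> 2/3
las = sum(pred[w] == (head, label) for w, (head, label) in gold.items()) / len(gold)
```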

Latent Dirichlet Allocation (LDA)

This blog is an excellent (and interactive!) guide to LDA for topic modeling.

Build a Hidden Markov Model in PyTorch

[View my code] [or GitHub]

I use PyTorch to build a Hidden Markov Model, both for weather prediction and for predicting whether a person is healthy or feverish. PyTorch is a deep learning/neural network package for Python. Here’s a video of PyTorch Explained.

The following algorithms are implemented:

  • Viterbi
  • Forward-Backward
  • Baum Welch

I added my own personal enhancements (for example, clearer code documentation/interpretation) and modifications to this original project put together by TreB1eN.

A dynamic programming table is also used to make the repetitive recursive computations more efficient.
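To give a flavor of that dynamic programming, here is a minimal Viterbi sketch in PyTorch. This is not the project’s actual code; the two-state Healthy/Fever numbers below are the standard textbook toy example:

```python
import torch

def viterbi(pi, A, B, observations):
    """Most likely state sequence for an HMM, computed in log space.

    pi: (N,) initial state probabilities
    A:  (N, N) transition probabilities, A[i, j] = P(next state j | current state i)
    B:  (N, V) emission probabilities,   B[i, w] = P(observation w | state i)
    observations: list of observation indices
    """
    log_pi, log_A, log_B = pi.log(), A.log(), B.log()
    scores = log_pi + log_B[:, observations[0]]   # best log-probability of each state at time 0
    backpointers = []

    for obs in observations[1:]:
        step = scores.unsqueeze(1) + log_A        # (N, N): previous state x current state
        backpointers.append(step.argmax(dim=0))   # best previous state for each current state
        scores = step.max(dim=0).values + log_B[:, obs]

    state = scores.argmax().item()                # follow the backpointers from the best final state
    path = [state]
    for bp in reversed(backpointers):
        state = bp[state].item()
        path.append(state)
    return list(reversed(path))

# toy HMM: states 0=Healthy, 1=Fever; observations 0=normal, 1=cold, 2=dizzy
pi = torch.tensor([0.6, 0.4])
A  = torch.tensor([[0.7, 0.3], [0.4, 0.6]])
B  = torch.tensor([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
viterbi(pi, A, B, [0, 1, 2])   # -> [0, 0, 1]  (Healthy, Healthy, Fever)
```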
