Conditional Random Fields Explained
Conditional Random Fields (CRFs) are a class of discriminative models best suited to prediction tasks where contextual information or the state of neighboring variables affects the current prediction. CRFs find applications in named entity recognition, part-of-speech tagging, gene prediction, noise reduction and object detection problems, to name a few.
In this article, I will first introduce the basic math and jargon related to Markov Random Fields, the abstraction CRFs are built upon. I will then introduce and explain a simple Conditional Random Field model in detail, which will show why CRFs are well suited to sequential prediction problems. After that, I will go over the likelihood maximization problem and the related derivations in the context of that CRF model. Finally, I will demonstrate the CRF model by training it and running inference on a handwriting recognition task.
Markov Random Fields
A Markov Random Field or Markov Network is a class of graphical models with an undirected graph between random variables. The structure of this graph decides the dependence or independence between the random variables.
- A Markov Network is represented by a graph G = (V, E) with the vertices or nodes representing random variables and the edges collectively representing the dependencies between those variables.
- The graph can be factorized into J different cliques or factors, each governed by a factor function ϕⱼ with its scope being a subset of random variables Dⱼ. ϕⱼ(dⱼ) should be strictly positive for all possible values of dⱼ.
- For a subset of random variables to be represented as a factor or clique, all of them should be connected to each other in the graph. Also, the union of scopes of all the cliques should be equal to all the nodes present in the graph.
The unnormalized joint probability of the variables is the product of all the factor functions, i.e. for the MRF shown above with V = {A, B, C, D}, the joint probability can be written as :-
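The original equation is embedded as an image and the specific cliques depend on the figure, so the following is the general form implied by the definitions above rather than the exact factorization of that graph:

$$
P(A, B, C, D) \;=\; \frac{1}{Z}\prod_{j=1}^{J} \phi_j(D_j),
\qquad
Z \;=\; \sum_{A,\,B,\,C,\,D}\;\prod_{j=1}^{J} \phi_j(D_j)
$$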
The denominator is a summation of the product of factors over all possible values that the random variables may take. It is a constant, also known as the partition function and is commonly denoted by Z.
Gibbs Notation
We can also represent the joint as a Gibbs distribution by operating on the factor functions in log space. Using βⱼ(dⱼ) = log(ϕⱼ(dⱼ)), we can express the joint in Gibbs notation as shown below. Note here that X is the set of all the random variables in the graph. The β functions are also known as factor potentials.
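The equation itself is an image in the original; reconstructed from the definitions above, the Gibbs form reads:

$$
P(X) \;=\; \frac{1}{Z}\,\exp\!\left(\sum_{j=1}^{J} \beta_j(d_j)\right),
\qquad
Z \;=\; \sum_{X} \exp\!\left(\sum_{j=1}^{J} \beta_j(d_j)\right)
$$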
We will use the Gibbs notation for our likelihood maximization problem derivations, later in this article.
Conditional Random Field Model
For a moment, let us take a Markov Random Field and divide its random variables into two sets, Y and X.
A Conditional Random Field is a special case of a Markov Random Field wherein the graph satisfies the property: “When we condition the graph on X globally, i.e. when the values of the random variables in X are fixed or given, all the random variables in set Y follow the Markov property p(Yᵤ | X, Yᵥ, v≠u) = p(Yᵤ | X, Yᵥ, Yᵤ~Yᵥ), where Yᵤ~Yᵥ signifies that Yᵤ and Yᵥ are neighbors in the graph.” A variable’s neighboring nodes or variables are also called the Markov Blanket of that variable.
One such graph that satisfies the above property is the chain-structured graph shared below :-
A CRF is a discriminative model, i.e. it models the conditional probability P(Y | X), where X is always given or observed. Therefore, the graph effectively reduces to a simple chain.
As we condition upon X and we are trying to find the corresponding Yᵢ for every Xᵢ , X and Y are also called the evidence and label variables respectively.
We can verify that the “factor reduced” CRF model shown above follows the Markov property, as demonstrated for variable Y₂ below: the conditional probability of Y₂ given all the other variables ultimately depends only on its neighboring nodes.
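The original derivation is an image; a sketch of the same verification, written using the pairwise potentials β between neighboring labels and β' between a label and its observation (both defined precisely in the next section), looks like:

$$
p(Y_2 \mid X, Y_1, Y_3, \dots, Y_L)
\;=\; \frac{\exp\big(\beta_1(Y_1, Y_2) + \beta_2(Y_2, Y_3) + \beta'_2(Y_2, X_2)\big)}
           {\sum_{y_2}\exp\big(\beta_1(Y_1, y_2) + \beta_2(y_2, Y_3) + \beta'_2(y_2, X_2)\big)}
$$

Every factor that does not involve Y₂ cancels between the numerator and the denominator, so the conditional depends only on Y₁, Y₃ and X₂, i.e. on the Markov blanket of Y₂.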
CRF Theory and likelihood optimization
Let us first define the parameters and then build up the equations for joint(and conditional) probabilities using the Gibbs notation.
- Label domain : Assume that random variables in set Y have a domain : {m ϵ ℕ | 1≤m ≤M} i.e. first M natural numbers.
- Evidence structure and domain : Assume that random variables in set X are real valued vectors of size S i.e ∀ Xᵢ ϵ X, Xᵢ ϵ Rˢ.
- Let the length of CRF chain be L i.e. L labels and L evidence variables.
- Let βᵢ(Yᵢ, Yⱼ) = Wcc’ if Yᵢ = c, Yⱼ = c’ and j = i+1, 0 otherwise.
- Let β’ᵢ(Yᵢ, Xᵢ) = W’c . Xᵢ, if Yᵢ = c and 0 otherwise, where . represents the dot product i.e. W’c ϵ Rˢ.
- Notice that the total number of parameters is M x M + M x S, i.e. there is one parameter for each label transition (M x M possible label transitions) and S parameters for every label (M possible labels) that are multiplied with the observation vector (of size S) at that position. For instance, with M = 26 lowercase letters and S = 128 pixel features (the setting of the handwriting demo later in this article), that is 26 x 26 + 26 x 128 = 4,004 parameters.
- Let D = {(xⁿ, yⁿ)}, n = 1…N, be the training data consisting of N examples.
Keeping the above in mind, the energy and likelihood can be expressed as follows :-
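The original equations are embedded as images; a reconstruction consistent with the definitions above is shown below, where the energy of a labeling y for an observation sequence x sums the transition and observation potentials, and the log likelihood sums the log conditional probabilities over the training set:

$$
E(y, x) \;=\; \sum_{i=1}^{L-1} W_{y_i\,y_{i+1}} \;+\; \sum_{i=1}^{L} W'_{y_i}\cdot x_i
$$

$$
P(y \mid x) \;=\; \frac{\exp\big(E(y, x)\big)}{\sum_{y'} \exp\big(E(y', x)\big)},
\qquad
\mathcal{L} \;=\; \sum_{n=1}^{N} \log P(y^n \mid x^n)
$$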
Therefore, the training problem reduces to maximizing the log likelihood wrt all model parameters Wcc’ and W’cs.
The gradient of the log likelihood with respect to W’cs has been derived below :-
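The derivation image is not reproduced here; the resulting gradient, reconstructed to match the description that follows, is:

$$
\frac{\partial \mathcal{L}}{\partial W'_{cs}}
\;=\; \sum_{n=1}^{N} \sum_{i=1}^{L}
\Big( \mathbb{1}[y^n_i = c]\; x^n_{is}
\;-\; \sum_{y'} P(y' \mid x^n)\,\mathbb{1}[y'_i = c]\; x^n_{is} \Big)
\;=\; \sum_{n=1}^{N} \sum_{i=1}^{L}
\Big( \mathbb{1}[y^n_i = c] \;-\; P(y_i = c \mid x^n) \Big)\, x^n_{is}
$$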
Note that the second term in the equation above denotes the sum (over all the possible values that y' can take) of the probability of y'ᵢ being equal to c, weighted by xⁿᵢₛ; summing out all positions other than i leaves just the marginal probability of yᵢ being equal to c. y'₋ᵢ here denotes the set of label/y variables at every position except position i.
A similar derivation can be worked out for dL/dWcc’ that I am directly sharing below :-
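Again reconstructing from the definitions above, the gradient with respect to the transition parameters should take the form:

$$
\frac{\partial \mathcal{L}}{\partial W_{cc'}}
\;=\; \sum_{n=1}^{N} \sum_{i=1}^{L-1}
\Big( \mathbb{1}[y^n_i = c,\; y^n_{i+1} = c']
\;-\; P(y_i = c,\; y_{i+1} = c' \mid x^n) \Big)
$$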
Now that we have the expressions for the derivatives as well as the log likelihood, we actually have all that we need to code a CRF model from scratch. We can code the likelihood using equations mentioned above, use belief propagation to calculate the marginals and work out the derivatives and then optimize the likelihood using off-the-shelf optimization algorithms like L-BFGS.
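To make that concrete, below is a minimal, illustrative sketch (not the library code used later) of the negative log likelihood and its gradient for a single chain, using forward-backward message passing in log space to obtain log Z and the marginals. The function name and the numpy/scipy dependencies are my own choices, not something from this article:

```python
import numpy as np
from scipy.special import logsumexp

def neg_log_likelihood_and_grad(params, X, Y, M, S):
    """Negative log likelihood and gradient for one linear-chain CRF sequence.
    X: (L, S) array of observation vectors, Y: length-L array of labels in [0, M)."""
    W = params[:M * M].reshape(M, M)      # transition parameters W[c, c']
    Wp = params[M * M:].reshape(M, S)     # observation weights W'_c
    L = len(Y)
    node = X @ Wp.T                       # node[i, c] = W'_c . x_i

    # Forward-backward (belief propagation on a chain) in log space.
    alpha = np.zeros((L, M))
    beta = np.zeros((L, M))
    alpha[0] = node[0]
    for i in range(1, L):
        alpha[i] = node[i] + logsumexp(alpha[i - 1][:, None] + W, axis=0)
    for i in range(L - 2, -1, -1):
        beta[i] = logsumexp(W + (node[i + 1] + beta[i + 1])[None, :], axis=1)
    logZ = logsumexp(alpha[-1])

    # Marginals needed by the gradient expressions derived above.
    unary = np.exp(alpha + beta - logZ)   # unary[i, c] = P(y_i = c | x)
    gW = np.zeros_like(W)
    gWp = np.zeros_like(Wp)
    for i in range(L - 1):
        pair = np.exp(alpha[i][:, None] + W + (node[i + 1] + beta[i + 1])[None, :] - logZ)
        gW += pair                        # expected transition counts
        gW[Y[i], Y[i + 1]] -= 1.0         # minus empirical transition counts
    for i in range(L):
        gWp += unary[i][:, None] * X[i][None, :]   # expected feature counts
        gWp[Y[i]] -= X[i]                          # minus empirical feature counts

    score = node[np.arange(L), Y].sum() + W[Y[:-1], Y[1:]].sum()
    return logZ - score, np.concatenate([gW.ravel(), gWp.ravel()])
```

Summing this objective/gradient pair over all training sequences gives exactly what an off-the-shelf optimizer needs, e.g. scipy.optimize.minimize with method='L-BFGS-B' and jac=True.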
However, for the sake of simplicity and sanity, we will not re-invent the wheel and will instead use the existing CRFSuite library for our demo.
Demo — Handwriting Recognition
By now it should be fairly clear why and how the structure of CRFs makes them ideal for tasks that capture sequential relations, like POS tagging a sentence, named entity recognition, etc. For this demo, I will be using a CRF for a handwriting recognition task.
To prepare the dataset for this demo, I used a combination of Stanford OCR dataset as well as the Gutenberg project archive.
Dataset preparation
The Stanford OCR dataset contains a total of 6,877 handwritten words divided into 9 different folds. The first character of each word has been removed so that every character in the dataset is lowercase. For every handwritten character in each word, the dataset contains a flattened binary array of length 128, which can be reshaped into an image of size 16x8. Some word visualizations from the dataset are shared below :-
Upon analysis, I found that the number of unique words in this entire dataset is just 24.
Our hope is that the CRF model will learn to label observations (xᵢ), which in our case are character pixel vectors, by exploiting how characters co-occur. Although there are 6,877 unique examples in the dataset as far as character pixel vectors are concerned, having just 24 unique words behind them all is too few to probabilistically capture character co-occurrence in English literature in general and to build a good handwriting/word recognizer.
To work around this scarcity, I decided to create a dataset of my own using all the character vector representations. I captured all the different character pixel vector variants available in the dataset for each character in a map/dictionary. Having done that, I imported all the words present in the famous book Moby Dick, filtered out all tokens shorter than 3 characters or containing anything other than alphabetic characters, and converted the remaining tokens to lower case. A total of 18,859 words were extracted this way, which were then split 80–20 into training and test sets respectively, stratified on word length.
To compose the actual training and test sets for the CRF model, I used the character-to-pixel-vector map I had created in the beginning. To create the image/x for a word, I went over its characters sequentially and picked a pixel array vector variant for each character from the dictionary using uniform sampling. A few samples of the dataset thus created are shared below :-
Also sharing the data preparation script for anyone wanting to play around with it :-
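The original script is embedded as a gist and not reproduced here; the following is a minimal sketch of the same idea. All function names are my own, and it assumes the raw OCR rows have already been parsed into (character, 128-dim pixel vector) pairs:

```python
import random
import re
from collections import defaultdict

def build_char_map(ocr_rows):
    """Collect every pixel-vector variant seen for each character.
    `ocr_rows` is assumed to be an iterable of (char, pixel_vector) pairs."""
    char_map = defaultdict(list)
    for char, pixels in ocr_rows:
        char_map[char].append(pixels)
    return char_map

def extract_words(text, min_len=3):
    """Lower-cased alphabetic tokens of length >= min_len from raw text."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return [t.lower() for t in tokens if len(t) >= min_len]

def word_to_example(word, char_map):
    """Compose one (x, y) pair: x is a list of uniformly sampled pixel
    vectors (one per character), y is the list of character labels."""
    x = [random.choice(char_map[c]) for c in word]
    return x, list(word)
```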
With the training and test datasets ready, it was time for training the model and evaluating it on the task.
Model training and evaluation
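The training script is likewise embedded as a gist; below is a minimal sketch of what training with the CRFSuite library (via the sklearn-crfsuite wrapper) could look like. The per-pixel feature encoding is an assumption on my part, not necessarily what the original script used:

```python
import sklearn_crfsuite

def to_features(x):
    """Encode each 128-dim binary pixel vector as a CRFSuite feature dict."""
    return [{f"p{k}": int(bit) for k, bit in enumerate(pixels)} for pixels in x]

def train_and_evaluate(train_set, test_set):
    """train_set / test_set: lists of (x, y) pairs built as described above."""
    X_train = [to_features(x) for x, _ in train_set]
    y_train = [list(y) for _, y in train_set]
    X_test = [to_features(x) for x, _ in test_set]
    y_test = [list(y) for _, y in test_set]

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
    crf.fit(X_train, y_train)

    # Character-level accuracy over the test words.
    y_pred = crf.predict(X_test)
    correct = sum(p == t for pred, true in zip(y_pred, y_test)
                  for p, t in zip(pred, true))
    total = sum(len(true) for true in y_test)
    return crf, correct / total
```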
Using the script above, I trained a CRF model on the training set consisting of 15,088 words and achieved nearly 85% accuracy on the test set.
Not bad for a simple experiment. Let me know what you think!