NLP: Text Segmentation Using Conditional Random Fields

Phylypo Tum
6 min readDec 18, 2019

From the previous article on MEMM, we see that there is one issue with label bias. This causes the path to prefer state transition with the fewer number of possible transitions. We will see how the Conditional Random Fields (CRF) algorithm solves this issue. CRF is a special case of the log-Linear model that we have seen earlier similar to MEMM.

Without going into detail yet on CRF, we can see a general overview of different categories.

For a single class, Naive Bayes (NB), Maximum Entropy (ME) and Logistic Regression which only predict non-sequential data assuming each pair of input values and labels are conditionally independent of each other. For sequential data, the Hidden Markov Model (HMM) and the Maximum Entropy Markov Model (MEMM) and CRF are well suited in performing the prediction.

The other distinction is the between generative model vs discriminant model. The generative models are Naive Bayes and HMM. While ME, Logistic Regression, MEMM, and CRF are discriminant models using the conditional probability rather than joint probability.

Below is an illustration extending to general graphs and shows different types of CRFs (linear-chain and general).

Suton et. al [1]

Conditional Random Fields (CRF)

CRF is a discriminant model for sequences data similar to MEMM. It models the dependency between each state and the entire input sequences. Unlike MEMM, CRF overcomes the label bias issue by using global normalizer.

In this article, we are focusing on linear-chain CRF which is a special type of CRF that models the output variables as a sequence. This fits our use case of having sequential inputs.

Let x be inputs vector, y is the label vector, and w is the weight vector. In MEMM, we define P(y|x) earlier as:


In contrast, Conditional Random Fields is described as:

with Z(x) defined as:

  • The summation of j=1 to n is the sum of all data points. This is needed in comparison to the Maximum Entropy Model. The whole label sequence is considered in the prediction instead of a single label. The variable j specifies the position of the input sequence x.
  • The summation of i=1 to m is the sum of all feature functions.
  • The summation of y is the sum of all possible label sequences. It is performed to get the feasible probability.
  • f_i is the feature function with detail below.
  • Z(x) will be discussed next.

Feature Function

Similar to MEMM, f(y_j-1,y_j,x,j) is the feature function. For example, if index j=2 and current state M as denoted as y_j=M and the previous state is B as denoted as y_j-1=B and x is character ‘e’, then this is 1 else 0.

Example function j=2 for letter ‘e’ in ‘test’

The feature function can be any real value but often just has the value of 0 or 1 where 1 is for specific feature and 0 otherwise. The feature function can overlap in many ways. In NLP, it can be if the word is capitalized, punctuation or prefix or suffix. Feature function has access to all of the observation x. So we can also look at the word to the left or right. In our segmentation case, it can be one letter to the left or right or type of that letter. In Khmer text, the type can be consonant, vowel, independent vowel, diacritic, etc.

Partitioning Function

To overcome the label bias problem, CRF uses the global normalizer Z(x) instead of local normalizer as in the MEMM. So Z(x) in CRF takes a sum of all the possible sequences of tag y ∈ Y. As shown earlier:

Note that y here is not the same as y in the numerator. This y is local to this calculation and generally notated as y’.


The goal of inferences is to predict the most likely sequences of the entire label y for given input values x such that the sum of the probabilities is maximized. We want to maximize the number of correct labels in the entire sequences, not to maximize individual state. We can Viterbi algorithm to find the best sequences of labels.

Here we can ignore the Z(x) since it is a normalizer and not a function of y. We also remove the exponent since it does not affect the argmax function.

We can again solve this similarly to MEMM earlier by using the Viterbi algorithm which makes this problem tractable with the running time of O(n*k²) where n is the length of the input sequence and k is the number of possible states.

To perform the inference, we first need to find the weight vector used above. This is what we will discuss next, the training.


The goal of the training is to find the parameter vector w that gives the best possible prediction that satisfy the conditional probability of the training data. Our parameter w tells us the relevance between the different features. The higher the positive value of an element of the vector w, the higher correlated the corresponding features.

In linear chain CRF, training is to find w^ for training data x as:

In P(y|x;w), we can ignore the denominator Z(x) and exponent function in the numerator which results in:

Uses of CRF

In addition to using CRF on word segmentation, this algorithm has many other uses especially for sequence prediction such as:

  • Identify gene and protein, gene prediction
  • NLP Part of Speech (POS) Tagging
  • NLP Named Entity Recognition (NER)

For word segmentation, this algorithm showed great performance for word segmentation on Khmer documents. We will go into detail implementation in the next article.


CRF overcomes the label bias problem found in MEMM since CRF uses the global normalizer. Linear Chain CRF has shown to give a great performance on sequence data like our use case of segmentation Khmer text. The state of the art result by Vichet Chea et. al. [2] in 2015 showed a result of 98.5% accuracy. We are going to implement a similar approach using a Khmer news corpus. See the next article on the implementation and you can also run the code which included in the Python notebook and corpus on Github.