NLP: Text Segmentation Using Maximum Entropy Markov Model (MEMM)

Phylypo Tum
10 min readDec 18, 2019

In an earlier Hidden Markov Model (HMM) approach, we see that it can capture dependencies between each state better than Naive Bayes (NB). NB assumes input values are conditionally independent, while HMM captures dependencies between each state and only its corresponding feature.

MEMM is a combination of HMM with the Maximum Entropy (MaxEnt) model. MaxEnt model is a type of log-linear model which we will discuss next.

Maximum Likelihood

Before going into the log-linear model, we will start with the principle of maximum likelihood which log-linear model is based on.

Assume we have a random sample with a training set of n examples x_1 to x_n. These input values are assumed to be independent so the probability of the function f(x;w) is the product of the probabilities of each input. So we have:

The likelihood function considers the training data as fixed but varies the parameter values w. The principle of maximum likelihood says that we want to find the parameter values w such that it models the training data x with the maximum probability.

The variable w is a vector of weights or the embedding vector. The goal is to find the weight parameters that will maximize the probability of the training data.

Similar to maximum likelihood, the maximum conditional likelihood says that we choose a parameter estimate w^ that maximizes the product f(y_i|x_i; w). In this conditional probability, we do not need to assume that x_i are independent. We only need to assume y_i are independent conditionally on x_i. For a specific value of x, we estimate w^ as:

To find w^, we can use gradient-based solutions like gradient descent. This involves starting with some random weights w, then loop through the training dataset and calculate the gradient. At each iteration, you update the weights w by moving some distance in the direction of the gradient. You repeat this until it converges or completes its iteration count.

Log-linear model

Logistic regression is one of the popular classification algorithms in scikit-learn that used the sigmoid or logistic function to classify the data into 2 different classes.

The log-linear model is an extension of logistic regression. It uses a linear combination of features and weights to find the predicted label with maximum log-likelihood. Log-likelihood is the logarithm of the likelihood function. Since logarithm function is a monotonic increasing function, maximizing the log-likelihood is maximizing the likelihood.

We describe the conditional probability of the log-linear model as:

The function f(x,y) is a feature function that can take account of relations between both data and label. It expresses some kind of characteristic of the data point. It results in value 0 or 1 depending on the absence or presence of a feature. This is created beforehand to supply the feature for the algorithm. Here is an example of feature function (B is a class ‘Begin letter of a word’, ‘t’ is the alphabet letter from the input observation):

The w_j is a weight of the feature function that captures how closely a given feature is related to a given label. In the training process, w_j is randomly initialized then it will learn the weight during training through gradient descent with some sophisticated optimization method. We will discuss how to train this weight vector later.

So w_j * f_j(x,y) measures the probability of label y given input x. For a given input x, we can calculate the probability of each possible label y.

To create a well-formed distribution P(y|x), we want this product to be strictly positive by taking a log. Then we divide it by the denominator. The denominator is a normalization term that helps normalizes the output making it a valid probability between 0 and 1. In the denominator, y′ is a local variable different than the y on the numerator.


In the training phase, we want to find vector w. Let’s start with the log-likelihood function:

This function L(w) for a given w measures how well w explains the labeled examples. A high value of p(y|x ; w) gives a high value for L(w).

The maximum-likelihood estimates use the argmax to find the best values for the parameter w that best fit the training data:

So in the training phase, we want to find the maximum-likelihood parameter estimates w^. This can be done using gradient descent. This process involved iterating through training data many iterations.

  1. First, initialize the w to some random values,
  2. Then iterate through each input and each iteration you update the weight by finding the derivative of L(w) with respect to w_j. See the derivative of L(w) below.
  3. Updating vector w as below and repeat until converge

In this equation, the 𝛼 is the learning rate that determine how much to offset. The w^t is the weight w at current time step t in the iteration and t-1 is the previous time step. So at each iteration, we move w to some distance 𝛼 in the direction of the gradient.

The partial derivative of L(w) is:

where the first sum is the sum of the feature function of j position of labels data and the second is the sum over training example of the probability distribution of p(y|x;w).


After finding the weight vector, we now can make a prediction. For the Log-Linear model, we make inferences by taking argmax to find the best-predicted label for a given input data. The inference is done using a linear combination of weights and features by searching all the label space as follow:

We will go over the detail of inference in the MEMM model.

Maximum Entropy Model

Similar to logistic regression, the maximum entropy (MaxEnt) model is also a type of log-linear model. The MaxEnt model is more general than logistic regression. It handles multinomial distribution where logistic regression is for binary classification.

The maximum entropy principle is defined as modeling a given set of data by finding the highest entropy to satisfy the constraints of our prior knowledge.

The feature function of MaxEnt model would be multi-classes. For example, given (x,y), the feature function returns 0,1, or 2.

The maximum entropy model is a conditional probability model p(y|x) that allows us to predict class labels given a set of features for a given data point. It does the inference by taking trained weights and performs linear combinations to find the tag with the highest probability by finding the highest score for each tag.

To find the probability for each tag/class, MaxEnt defined as:

We define f_i as a feature function and w_i as the weight vector. The summation of i=1 to m is summing of all feature functions where m is the number of unique states. The denominator Z(x) helped normalize the probability as:

The MaxEnt model makes uses of the log-linear model approach with the feature function but does not take into account the sequential data.

Maximum Entropy Markov Model (MEMM)

From the Maximum Entropy model, we can extend into the Maximum Entropy Markov Model (MEMM). This approach allows us to use HMM that takes into account the sequence of data and to combine it with the Maximum Entropy model for features and normalization.

The Maximum Entropy Markov Model (MEMM) has dependencies between each state and the full observation sequence explicitly. This is more expressive than HMMs.

In the HMM model, we saw that it uses two probabilities matrice (state transition and emission probability). We need to predict a tag given an observation, but HMM predicts the probability of a tag producing a certain observation. This is due to its generative approach. Instead of the transition and observation matrices in HMM, MEMM has only one transition probability matrix. This matrix encapsulates all combinations of previous states y_i−1 and current observation x_i pairs in the training data to the current state y_i.

Our goal is to find the p(y_1,y_2,…,y_n|x_1,x_2,…x_n). This is:

Since HMM only depends on the previous state, we can limit the condition of y_n given y_n-1. This is the Markov independence assumption.

So the Maximum Entropy Markov Models (MEMM) defines using Log-linear model as:

where x is a full sequence of inputs of x_1 to x_n. Let y be corresponding labels or sequence of tags (0 and1 in our case). The variable i is the position to be tagged and n is the length of the sentence. The denominator Z(y_i-1,x) is the normalizer that defines as

MEMM can incorporate more features from its feature function as input while HMM required the likelihood of each of the features to be computed since it is a likelihood-based. The feature function of MEMM also has dependencies on previous tag y_i-1. As an example:

Example function for letter ‘e’ in ‘test’ where the current tag is M and the previous tag is B.

The MEMM has a richer set of observation features that can describe observations in terms of many overlapping features. For example in our word segmentation, we could have features like capitalization, vowel or consonant, or type of the character.


An inference is to find the best value of Y (sequence label that yields the highest probability). In HMM, we use argmax of the product of P(x|y) and P(y_i|y_i-1) as:

But in MEMM, we find the inference by taking the product of the conditional probability of P(y_i|y_i-1,x).

In log linear model format, the inference of MEMM is:

To find the best sequence, we will need to search through the different possible states. There are k^n possible state sequences. So brute-force search would be intractable for a large n. But we can see in HMM, we can use the Viterbi algorithm. This is similar for MEMM and running-time would be in the order of O(n*k²) where n is the length of sentence, k is the number of possible tags which 2 in our case (0 and 1).

MEMM models the dependencies between each state and the full observation sequences x. It is more expressive than HMM. In addition, we can see its learning objective function is consistent with predictive function p(y|x).

MEMM is a discriminant model since it uses conditional probability which models the decision boundary between classes. In contrast to the generative model like HMM uses the joint probability of the inputs and its labels.

Shortcoming of MEMM

Although it seems like this algorithm solves all the issues from previous ones, it has a shortcoming. It is called the label bias problem. The transitions from a given state are competing against each other (where the sum of the transition is equal to 1). This creates a bias where the states with fewer arcs are preferred. Here is an example where state 1 has only 2 transitional states to 1 and 2 but state 2 has the transition to all other states. In this case, the transition to state 1 is always preferred. (Eric Xing)


We can see the MEMM take the good thing about HMM and combine it with MaxEnt. But there is still one shortcoming with MEMM with the label bias problem due to local normalization. Next, we can take a look at our final algorithm Conditional Random Fields that solves the label bias problem.