Named Entity Recognition (NER) using Conditional Random Fields (CRFs) in NLP

Mehul Gupta
Data Science in your pocket
May 18, 2020


It’s time to jump into Information Extraction in NLP after a thorough discussion of algorithms in NLP for POS tagging, parsing, etc.

My debut book “LangChain in your Pocket” is out now !!

Not interested in theory? You can jump straight to training a custom NER in my 2nd post.

Also, if you are a vlog person:

What is Information Extraction?

It is basically extracting important information from text based on the semantic (logical) meaning of the sentence. As text data is unstructured in nature, Information Extraction plays a very crucial role in understanding its logical meaning.

By unstructured, I mean it doesn’t have a defined format like SQL tables. For example: text files, YouTube videos, songs (mp3), etc.

Information Extraction can take various forms like Named Entity Recognition (NER), relationship extraction, event extraction, template filling, temporal expressions, etc., which I will cover one by one in my upcoming posts.

For now, let's focus on NER:

Named Entity Recognition

It refers to extracting ‘named entities’ from the text. Named entities denote words in a sentence representing real-world objects with proper names like:

  • Person’s name (Ramu, Raja, Seeta, etc.),
  • Countries (India, Sri Lanka, etc),
  • Organization (Google, Facebook, etc.)
  • or anything that has been given a specific name.

‘Ramesh works at Google’.

Here, we wish to extract ‘Ramesh’ & ‘Google’, along with tags describing what they represent: ‘Ramesh’: Person, ‘Google’: Organization.
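Just to make the expected output concrete, here is a minimal sketch using spaCy’s pretrained small English model. This is an off-the-shelf library used purely for illustration; it is not the CRF approach built in this post, and the exact labels it prints may vary:

```python
# Quick illustration of NER output with an off-the-shelf model (spaCy),
# just to show the kind of (entity, tag) pairs we are after.
import spacy

nlp = spacy.load("en_core_web_sm")   # small pretrained English pipeline
doc = nlp("Ramesh works at Google")

for ent in doc.ents:
    print(ent.text, ent.label_)      # expected: Ramesh PERSON / Google ORG
```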

Below is a table giving an idea of which tags are used for which types of Named Entities (for example, PER for persons, ORG for organizations, LOC for locations, VEH for vehicles, and O for a word that isn’t a Named Entity):

Problems in NER:

There exist some issues that need to be addressed.

  • Segmentation (detection) ‘ambiguity’.

Consider ‘New York’. It can be detected as ‘New’ & ‘York’, two separate entities. Hence, deciding the boundary is crucial: should ‘New York’ be treated as a single entity, or ‘New’ and ‘York’ as 2 different entities?

  • Tag assignment (recognition) ‘ambiguity’.

We have ‘Nirma’ as a girl’s name (Person) & as a detergent brand in India (Organization). There are many more such ambiguities: for example, ‘Washington’ can appear as different Named Entities in different sentences, such as a person (‘Washington was the first US president’), a location (‘She moved to Washington’) or a government/organization (‘Washington imposed new sanctions’).

Before understanding how NER is done, we must have a basic understanding of sequence labeling.

Sequence Labeling:

Sequence labeling refers to assigning a label/tag to each element of a sequence passed as input, using an algorithm or machine learning model. The sequence can be the words of a sentence, passed in the same order as in the sentence. An LSTM, for example, can be used as a sequence labeler.
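For NER, the sequence labeler maps each word of the sentence to exactly one tag, in the same order. A tiny sketch of what that input/output looks like (the tags here are hand-written purely for illustration):

```python
words = ["Ramesh", "works", "at", "Google"]
tags  = ["PER",    "O",     "O",  "ORG"]   # one tag per word, in the same order

# a sequence labeler learns the mapping words -> tags
for word, tag in zip(words, tags):
    print(word, "->", tag)
```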

NER can be done using a number of sequence labeling methods, listed below, alongside rule-based methods:

  1. Linear Chain Conditional Random Fields (Linear Chain CRF)
  2. Maximum Entropy Markov Models
  3. Bi-LSTM

I will be following Linear Chain CRFs in this post.

Linear Chain Conditional Random Fields

CRFs are amongst the most prominent approaches used for NER.

A linear chain CRF corresponds to a labeler in which the tag assignment for the present word (denoted as yᵢ) depends only on the tag of the previous word (denoted by yᵢ₋₁).

Moving on quickly, let’s figure out all the steps to follow:

Deciding Feature Functions:

We need to extract multiple features per word of the sentence that will assist in recognizing a Named Entity. For this, we need functions called feature functions that generate these features. A feature function can use any logic (depending on the programmer), but the output has to be either True:1 or False:0. A few examples of feature functions are given below. Do remember that we may need to add some conditions (some of the functions below won’t return True or False directly) so as to get an output of only True:1 or False:0.

Notes:

  1. wᵢ: the iᵗʰ word of a sentence.

  2. Embeddings refer to the numerical (vector) representation of a word. More can be explored here.

  3. Gazetteer: a list of place names (India, Agra, etc.) with their geographical & political information. It has millions of entries.

  4. Word shape: a notation in which the letters of a word are denoted in the following way:

  • Small letters: ‘x’
  • Capital letters: ‘X’
  • Digits: ‘d’
  • Punctuation & other symbols are left untouched.

Hence, if I get the word ‘Delhi%123%DD’, using Word shape, it can be transformed into ‘Xxxxx%ddd%XX’

  5. Short word shape: similar notation to word shape with a slight change. Here, consecutive letters of the same type are collapsed into one.

If we encounter consecutive small letters, they would be represented by a single ‘x’ & not multiple ‘x’.

Short word shape for the above example ‘Delhi%123%DD’= ‘Xx%d%X’.

It can be observed that we replaced ‘xxxx’ with ‘x’. Similarly, ‘ddd’ got replaced by ‘d’. But do remember, we use this shorthand only when symbols of the same type are consecutive, as here.
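To make the notation concrete, below is a small Python sketch of both transformations (the function names are my own, not from any library):

```python
import re

def word_shape(word: str) -> str:
    """Map capitals to 'X', small letters to 'x', digits to 'd';
    leave punctuation & other symbols untouched."""
    shape = []
    for ch in word:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)
    return "".join(shape)

def short_word_shape(word: str) -> str:
    """Same as word_shape, but consecutive symbols of the same kind
    are collapsed into a single one."""
    return re.sub(r"(.)\1+", r"\1", word_shape(word))

print(word_shape("Delhi%123%DD"))        # Xxxxx%ddd%XX
print(short_word_shape("Delhi%123%DD"))  # Xx%d%X
```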

It must be noted that every feature function takes the below parameters:

  • Index of the current word: ‘i’
  • Label of the current word: ‘yᵢ’
  • Label of the previous word: ‘yᵢ₋₁’
  • Sentence: ‘x’

Consider ‘Ram is cool’

with Named Entity labels [PER O O], where we have

Ram: PER, is: O, cool: O

It must be noted that the O tag denotes that the word isn’t a Named Entity.

Consider a feature function (Fⱼ(x, yᵢ, yᵢ₋₁, i)) with the definition:

if the iᵗʰ word in ‘x’ is capitalized, return 1, else 0

If i=2 (considering indexing from 1 & not 0), we are calculating the feature for ‘is’, and the above feature function is demonstrated below:

Fⱼ(‘Ram is cool’, ‘O’, ‘PER’, 2): returns 0 as ‘is’ isn’t capitalized.

  • It must be noted that some feature functions may use all the parameters being passed & some may not (like this function, which uses only the index ‘i’ & the sentence ‘x’: ‘Ram is cool’)
  • The subscript ‘j’ refers to the jᵗʰ feature function, where j goes from 1 → total number of feature functions
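Putting this together, here is a rough Python sketch of what such binary feature functions might look like (the function names, the whitespace tokenization, and the 1-based indexing follow the running example and are purely illustrative):

```python
# Two illustrative binary feature functions. Each takes the sentence x,
# the current label y_i, the previous label y_prev, and the position i
# (1-indexed, as in the running example), and returns 1 or 0.

def f_is_capitalized(x, y_i, y_prev, i):
    """1 if the i-th word starts with a capital letter, else 0."""
    word = x.split()[i - 1]
    return 1 if word[0].isupper() else 0

def f_capitalized_and_per(x, y_i, y_prev, i):
    """1 if the i-th word is capitalized AND is tagged PER, else 0."""
    word = x.split()[i - 1]
    return 1 if word[0].isupper() and y_i == "PER" else 0

print(f_is_capitalized("Ram is cool", "O", "PER", 2))        # 0, 'is' isn't capitalized
print(f_capitalized_and_per("Ram is cool", "PER", None, 1))  # 1, 'Ram' is capitalized & PER
```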

Mathematics Ahead!!

Consider the below formula:

pθ(y|x) = exp(Σⱼ wⱼ Fⱼ(x, y)) / Σ_y′ exp(Σⱼ wⱼ Fⱼ(x, y′))

Here,

pθ(y|x) refers to the probability of a label sequence (y) given a word sequence (x).

exp() refers to the exponential function in the above formula.

Fⱼ(x, y) = the summation of the jᵗʰ feature function’s values over all words of the sentence. The numerator can hence be written as:

exp (Σⱼ wⱼ Σᵢ feature_functionⱼ(x, yᵢ, yᵢ₋₁, i))

Understanding each term one by one:

  • The inner summation goes from i=1 to i=L, the length of the sentence. Hence we are summing the value of a feature function over all words of the sentence.

If we have the sentence ‘Ram is cool’, the inner summation adds the outputs of the jᵗʰ feature function for all 3 words of the sentence.

  • The outer summation goes from j=1 to the total number of feature functions. It is doing something like this: w₁*Σfeature_function₁ + w₂*Σfeature_function₂ ……
  • wⱼ refers to the weight assigned to feature_functionⱼ.

The denominator is referred to as the normalizing constant. It looks quite similar to the numerator, with a couple of changes in the formula:

  • y′ in place of y
  • an additional (third) summation, over y′

Let us understand this.

y′ refers to all the possible label sequences that can be assigned to the word sequence (sentence).

Bringing back ‘Ram is cool’. If you remember the named Entity Tag table mentioned above, we have a number of possible tags like PER, LOC, VEH, ORG, etc. Hence the assigned label sequence can be [PER, PER, LOC], [O, O, O], [VEH, ORG, O], and many many more!!!

Now, if I wish to calculate P([PER, PER, LOC] | ‘Ram is cool’):

Numerator = exp (Σⱼ wⱼ Fⱼ(‘Ram is cool’, [PER PER LOC]))

Denominator = exp (Σⱼ wⱼ Fⱼ(‘Ram is cool’, [O O O])) + exp (Σⱼ wⱼ Fⱼ(‘Ram is cool’, [VEH ORG O])) + exp (Σⱼ wⱼ Fⱼ(‘Ram is cool’, [PER ORG ORG])) + … one term for every possible label sequence, including [PER PER LOC] itself.

Hence, the 3rd summation in the denominator refers to the summation for all label sequences possible for the sentence, and y’ refers to these possible combinations.
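To tie the formula together, here is a tiny brute-force Python sketch that computes pθ(y|x) for ‘Ram is cool’ by enumerating every possible label sequence for the denominator. The two feature functions, the tag set, and the weights are toy choices made up for illustration; nothing here is learned yet:

```python
from itertools import product
from math import exp

TAGS = ["PER", "ORG", "LOC", "VEH", "O"]   # a toy tag set for this example

# Toy feature functions: each sees (x, y_i, y_prev, i) and returns 1 or 0.
def f1(x, y_i, y_prev, i):
    # current word starts with a capital letter and is tagged PER
    word = x.split()[i - 1]
    return 1 if word[0].isupper() and y_i == "PER" else 0

def f2(x, y_i, y_prev, i):
    # current word is lowercase and is tagged O
    word = x.split()[i - 1]
    return 1 if word.islower() and y_i == "O" else 0

FEATURES = [f1, f2]
WEIGHTS = [2.0, 1.0]                       # hand-picked, not trained

def score(x, y):
    """exp(Σⱼ wⱼ Σᵢ fⱼ(x, yᵢ, yᵢ₋₁, i)) for one label sequence y."""
    n = len(x.split())
    total = 0.0
    for w, f in zip(WEIGHTS, FEATURES):
        total += w * sum(f(x, y[i - 1], y[i - 2] if i > 1 else None, i)
                         for i in range(1, n + 1))
    return exp(total)

def probability(x, y):
    """pθ(y|x) = score(x, y) / Σ over all possible y′ of score(x, y′)."""
    n = len(x.split())
    z = sum(score(x, list(y_prime)) for y_prime in product(TAGS, repeat=n))
    return score(x, y) / z

print(probability("Ram is cool", ["PER", "O", "O"]))    # relatively high
print(probability("Ram is cool", ["VEH", "ORG", "O"]))  # much lower
```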

A couple of points to note:

  • As we know, the correct label sequence for ‘Ram is cool’ is [PER O O] (mentioned above when I introduced the problem). Hence,

P([PER O O ]|’Ram is cool’) should be highest amongst all other possible sequences for ‘Ram is cool’ if our CRF is trained well.

  • Linear CRF appears quite similar to logistic regression. Read this post for more.

Are we done?

We didn’t talk about how we got weights wⱼ!!

How are the values for these weights corresponding to each function to be set?

Before moving on, what should we do to optimize the weights of any Machine Learning model?

Training!!

Training CRFs

CRFs can be trained like any other machine learning model.

  • Training dataset

For preparing the dataset, you can follow this post, which suggests some online tools for preparing a training dataset.

  • Weight updation

We will be using Gradient Descent to update our weights

Reconsider the conditional probability formula and take its log:

log pθ(y|x) = Σⱼ wⱼ Fⱼ(x, y) − log Σ_y′ exp(Σⱼ wⱼ Fⱼ(x, y′))

Calculating the derivative with respect to a weight wⱼ:

∂ log pθ(y|x) / ∂wⱼ = Fⱼ(x, y) − Σ_y′ pθ(y′|x) Fⱼ(x, y′)

i.e. the observed value of the jᵗʰ feature minus its expected value under the current model. Updating wⱼ, where alpha is the learning rate (a step of gradient ascent on the log-probability, equivalently gradient descent on the negative log-probability):

wⱼ = wⱼ + alpha * (Fⱼ(x, y) − Σ_y′ pθ(y′|x) Fⱼ(x, y′))

(A small code sketch of this update appears right after this list.)

  • Loss function

A number of loss functions can be considered which can be explored here.
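As promised above, here is a minimal sketch of one such weight update. It reuses FEATURES, WEIGHTS and TAGS from the earlier probability snippet and simply applies “observed minus expected feature counts” as the gradient; a real implementation would compute the expectation with dynamic programming rather than brute-force enumeration:

```python
from itertools import product
from math import exp

def feature_sum(f, x, y):
    """Σᵢ f(x, yᵢ, yᵢ₋₁, i) for one label sequence y."""
    n = len(x.split())
    return sum(f(x, y[i - 1], y[i - 2] if i > 1 else None, i) for i in range(1, n + 1))

def update_weights(weights, features, x, y_true, tags, alpha=0.1):
    """One gradient step on log p(y_true|x): observed minus expected feature counts."""
    n = len(x.split())
    candidates = [list(s) for s in product(tags, repeat=n)]
    # unnormalized score of every candidate sequence under the current weights
    scores = [exp(sum(w * feature_sum(f, x, s) for w, f in zip(weights, features)))
              for s in candidates]
    z = sum(scores)
    probs = [s / z for s in scores]
    new_weights = []
    for w, f in zip(weights, features):
        observed = feature_sum(f, x, y_true)                 # Fⱼ(x, y)
        expected = sum(p * feature_sum(f, x, s)              # Σ_y′ p(y′|x) Fⱼ(x, y′)
                       for p, s in zip(probs, candidates))
        new_weights.append(w + alpha * (observed - expected))
    return new_weights

# one training step on our single example, reusing the toy setup from above
WEIGHTS = update_weights(WEIGHTS, FEATURES, "Ram is cool", ["PER", "O", "O"], TAGS)
```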

Now, as we are done with the training, how can the trained CRF be used to derive the Named Entity tag sequence given a sentence?

Testing CRFs

We need to score all possible Named Entity tag sequences for the sentence and extract the sequence with the maximum probability.

In our example ‘Ram is cool’, we should get the highest probability for [PER O O] as compared to all other possible tag sequences like [O O O], [ORG O PER], etc.
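Continuing the same toy setup (the score() function and TAGS from the probability snippet above), extracting the best tag sequence is just an argmax over all candidates. Real CRF libraries use the Viterbi algorithm instead of enumerating every sequence, but the idea is the same:

```python
from itertools import product

def predict(x, tags):
    """Return the label sequence with the highest unnormalized score.
    The normalizer is the same for every candidate, so it can be ignored."""
    n = len(x.split())
    candidates = [list(s) for s in product(tags, repeat=n)]
    return max(candidates, key=lambda y: score(x, y))  # score() from the earlier snippet

print(predict("Ram is cool", TAGS))  # ideally ['PER', 'O', 'O'] after training
```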

Wish to train a custom NER using CRFs? Follow my 2nd post.

Before signing off, do remember that NER can be done using many other methods apart from CRFs. Similarly, CRFs can be very handy for many other NLP problems like POS Tagging.
