Training custom NER system using CRFs

Using sklearn-crfsuite from scratch in Python, with code

Mehul Gupta
Data Science in your pocket
9 min read · May 2, 2022


Image source: https://pixabay.com/

In my previous post, we discussed the theoretical explanation of Conditional Random Fields (aka CRFs) used for developing Named Entity Recognition (NER) systems.

My debut book “LangChain in your Pocket” is out now

This post deals with coding a custom information extraction system using the sklearn-crfsuite library in Python for detecting NP chunks (we will discuss what those actually are).

NER systems can be very handy for detecting any sort of entity in text, be it person names, organization names or cities in ordinary text, or even medicine names in prescriptions (I have tried this). The major requirements, though, are 1) labeled data and 2) some knowledge base around the entity we wish to extract, such as a list of city names or medicine names. As we lack these resources for now, we will go with NP chunk extraction using a CRF. The methodology we follow remains the same for building any custom NER; the labels generated in today's exercise simply come from some rules (just for demo purposes). If you already have labeled data, you can skip the chunking part.

Before we move ahead, let's pen down the steps we will be following. They majorly include: importing the required libraries, loading & splitting the dataset, sentence segmentation, word tokenization & POS tagging, chunking (for label generation), IOB tagging, feature engineering, and finally training & evaluating the CRF.

We will be discussing each of them in detail. But let's first pick up a dataset on which we wish to train the NER system. The one that I picked is Wikipedia Movie Plots from Kaggle. Do have a look at some samples before moving ahead.

We will be using the ‘Plot’ column to train our NER system. Let’s have a closer look at one of the plots:
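
If you want to peek at the data yourself, a couple of lines like these will do (a minimal sketch; 'Plot' is the column we use throughout):

# Quick look at the raw data
df = pd.read_csv('wiki_movie_plots_deduped.csv')
print(df.shape)
print(df['Plot'].iloc[0][:300])   # first 300 characters of the first plot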

Let’s get started !!

1. Import all required libraries

import sklearn_crfsuite
import nltk
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
from highlight_text import HighlightText, ax_text, fig_text
import random
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

Do remember to downgrade scikit-learn to a version below 0.24, as sklearn_crfsuite is only compatible with versions older than 0.24. Also, you might need to download certain internal nltk packages, as shown above.
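
A setup along these lines should work in a fresh environment (the exact version pin & package names are assumptions; do verify them on PyPI):

pip install sklearn-crfsuite "scikit-learn<0.24" nltk pandas matplotlib highlight-text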

2. Load the dataset, split it into train & test, and combine each into a single paragraph

# taking 1k samples as test & 10k samples for training, just as a demo
df = pd.read_csv('wiki_movie_plots_deduped.csv')
df_train = df[:10000]
df_test = df[-1000:]
#Combining all plots into one single paragraph for both train & test
training = ' . '.join(df_train['Plot'])
test = ' . '.join(df_test['Plot'])

3. Perform Sentence segmentation, Word tokenization & POS Tagging

Before jumping into this, let's have a one-line description of each.

Sentence Segmentation: Separating sentences from a given chunk of text.

Tokenization: Breaking a sentence into smaller parts (words or even smaller tokens). To know about different tokenization algorithms, do check out here

POS Tagging: We can divide all words in a sentence into some pre-defined categories depending upon their role in the sentence. These categories are called Parts Of Speech, aka POS. Verb, noun, adjective, etc. are some common POS tags. How is POS tagging done? Do check out here.

def pos_tags(document):
    sentences = nltk.sent_tokenize(document)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    return sentences

training = pos_tags(training)
test = pos_tags(test)

Let’s observe the output at different stages for a sample text:

‘That boy is very naughty. He failed last year in high school’

The output of the above function is a POS tag associated with each token. A few common abbreviations are 'DT': determiner, 'NN': noun, 'VB': verb, etc.
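
For our dummy text, the output of pos_tags() looks roughly like this (the exact tags can vary slightly with the nltk tagger version):

[[('That', 'DT'), ('boy', 'NN'), ('is', 'VBZ'), ('very', 'RB'), ('naughty', 'JJ'), ('.', '.')],
 [('He', 'PRP'), ('failed', 'VBD'), ('last', 'JJ'), ('year', 'NN'), ('in', 'IN'), ('high', 'JJ'), ('school', 'NN')]]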

4. Chunking

Note: Only for label generation. If you already have some labeled data, skip this

Chunks can be directly related to phrases: they aren't complete sentences, but they are bigger units than tokens. So what we aim to do is take the POS-tagged tokens (from the last step) as input & group them into chunks. The below example can be helpful.

Chunks can be of multiple types: Noun Phrases (NP, used above), Adjective Phrases, etc.

How is chunking done? Usually using tag patterns.

Different types of chunks (be it NP, VP, etc.) can be detected using different tag patterns, which are nothing but regex patterns over POS tags. Even to detect all chunks of the same category (say NP), we may need multiple regexes.

As we are interested in NP chunks (to get labels), we will be using a single regex for detecting NP chunks for now: <DT>?<JJ>*<NN>, which should cover most noun phrases. We could declare multiple regex patterns for the same category (NP here), or even for multiple categories (not just NP), to cover all possible cases.

grammar = "NP: {<DT>?<JJ>*<NN>}"
chunk_parser = nltk.RegexpParser(grammar)
training = [chunk_parser.parse(x) for x in training]
test = [chunk_parser.parse(x) for x in test]
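
To inspect what the chunker produced for any single sentence, you can print the parsed Tree object or draw it (a quick aside; tree.draw() opens a separate window):

# Inspect the parse of the first training sentence
print(training[0])       # bracketed tree, with NP subtrees grouping the chunked tokens
# training[0].draw()     # optional: renders the same tree graphically in a pop-up window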

For the sentence ‘The boy is very naughty.’, we get the below output after chunking. Observe that it is a Tree object & not a plain list.

Or, even better, drawn as a tree:

Here, the regex is able to detect 2 NP chunks as shown in the tree.

5. IOB Tagging

This is the last step before our training data is ready: we need to go for IOB tagging.

What is this?

IOB stands for Inside-Outside-Beginning & the logic exactly replicates the name. Consider the sentence: ‘I went to New York’. Here, we know that ‘New York’ is an entity (with multiple words) where ‘New’ is the beginning word of the entity, and ‘York’ is also a part of an entity but not the beginning word. Words like ‘went’ and ‘to’ are part of no entity. Hence

‘New’: gets a ‘B’ tag

‘York’: gets an I tag

‘went’: gets an ‘O’ tag

How is it helpful?

IOB Tagging is helpful when a single entity name has multiple words. If tagged as it is, the system will learn ‘New’ & ‘York’ as 2 different entities & not one !!

Note: A combination of the IOB tag and the chunk type will form the final label for us, i.e. if a word is part of an NP chunk & gets a ‘B’ tag from IOB tagging, the final label for it becomes ‘B-NP’.

Again to the rescue, NLTK has a function tree2conlltags which takes in a Tree object & outputs the final label as discussed above.

training = [nltk.chunk.tree2conlltags(x) for x in training]
test = [nltk.chunk.tree2conlltags(x) for x in test]

For the dummy sentence ‘That boy is very naughty. He failed last year in high school’, we get the below output after passing it through tree2conlltags().
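
For the first sentence of the dummy text, the output looks roughly like this (exact POS tags may differ slightly, but note the IOB-prefixed chunk label as the third element of every tuple):

[('That', 'DT', 'B-NP'), ('boy', 'NN', 'I-NP'), ('is', 'VBZ', 'O'), ('very', 'RB', 'O'), ('naughty', 'JJ', 'O'), ('.', '.', 'O')]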

As visible, for each token, we now have its POS tag & combination of IOB+NP Chunk tag in the tuple.

Our labels for each word of a sentence are ready !!

6. Feature Engineering

If you remember from my previous post on CRFs, they take features built for each word as input (similar to any other usual classifier) & give a predicted tag as output. So, let's jump onto this part as well.

def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True
    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True
    return features

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

X_train = [sent2features(s) for s in training]
X_test = [sent2features(s) for s in test]

Not going into much detail (it's pretty self-explanatory), here is an overview of what is happening in the above word2features function:

  • The above code builds 3 sets of features per token, based on the current token (for which the features are being generated), the previous token & the next token. All feature names starting with ‘-1’ are for the previous token, while those starting with ‘+1’ are for the next token.
  • Flags like BeginningOfSentence (BOS) & EndOfSentence (EOS) are attached to the first & last tokens of a sentence.
  • Most features are pretty simple in terms of logical complexity, though one can add as many features as one wishes, for example word shape & short word shape.

What are word shape & short word shape? Refer here.
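
As a quick illustration (my own sketch, these features are not part of the feature set used below): word shape replaces every character with a class placeholder, and short word shape additionally collapses consecutive repeats.

import re

def word_shape(word):
    # 'London' -> 'Xxxxxx', 'DC10-30' -> 'XXdd-dd'
    shape = re.sub(r'[A-Z]', 'X', word)
    shape = re.sub(r'[a-z]', 'x', shape)
    return re.sub(r'[0-9]', 'd', shape)

def short_word_shape(word):
    # 'London' -> 'Xx', 'DC10-30' -> 'Xd-d'
    return re.sub(r'(.)\1+', r'\1', word_shape(word))

Such values could simply be added as extra keys in the features dict inside word2features.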

Let’s have a look at the features generated for the word ‘very’ in ‘That boy is very naughty’:

features generated for ‘very’
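
Running word2features() on that sentence at the index of ‘very’ gives a dictionary roughly like this (POS tags may vary slightly with the tagger version):

{'bias': 1.0, 'word.lower()': 'very', 'word[-3:]': 'ery', 'word[-2:]': 'ry',
 'word.isupper()': False, 'word.istitle()': False, 'word.isdigit()': False,
 'postag': 'RB', 'postag[:2]': 'RB',
 '-1:word.lower()': 'is', '-1:word.istitle()': False, '-1:word.isupper()': False,
 '-1:postag': 'VBZ', '-1:postag[:2]': 'VB',
 '+1:word.lower()': 'naughty', '+1:word.istitle()': False, '+1:word.isupper()': False,
 '+1:postag': 'JJ', '+1:postag[:2]': 'JJ'}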

For each sentence, we have a list of dictionaries, with each dictionary comprising the word features shown above. This format is then fed to the CRF. Let’s give our labels their final format for input to crfsuite.

def sent2labels(sent):
    return [label for token, postag, label in sent]

y_train = [sent2labels(s) for s in training]
y_test = [sent2labels(s) for s in test]

Finally, time to train up a CRF

crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True
)
crf.fit(X_train, y_train)

The hyperparameters used are (there can be many)

  • algorithm: the training algorithm to be used, here ‘lbfgs’
  • c1 & c2 are L1 & L2 regularization coefficients
  • max_iterations: the maximum number of iterations for the lbfgs optimization
  • all_possible_transitions: if True, the CRF also generates transition features for label pairs that never occur together in the training data.

What are transition features?

As per my understanding, they are numerical features that encourage transitions between some states (labels) while discouraging others. Let’s observe the transition features our model generated after training.
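
The trained model exposes them through its transition_features_ attribute, a dict mapping (label_from, label_to) pairs to learned weights. A quick way to list the strongest & weakest transitions (a small sketch of my own, not from the original post):

from collections import Counter

# transition_features_ maps (label_from, label_to) -> learned weight
transitions = Counter(crf.transition_features_)
print("Most likely transitions:")
for (label_from, label_to), weight in transitions.most_common(5):
    print('%-7s -> %-7s %0.4f' % (label_from, label_to, weight))
print("Least likely transitions:")
for (label_from, label_to), weight in transitions.most_common()[-5:]:
    print('%-7s -> %-7s %0.4f' % (label_from, label_to, weight))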

Note how the transition B-NP → I-NP has a positive weight while the reverse is negative, depicting a very unlikely transition from I-NP → B-NP. Similarly, the O → O transition weight is the highest, as most words are ‘O’ (non-entities).

After training, let’s see what it predicts for some of the test samples.


fig, ax = plt.subplots(figsize=(30, 10))
font = {'family': 'normal', 'size': 16}
matplotlib.rc('font', **font)

final_text = []
color = []
samples = 10
integer = random.randint(0, 500)
prediction = crf.predict(X_test[integer:integer+samples])
for x, y in zip(test[integer:integer+samples], prediction):
    for x1, y1 in zip(x, y):
        if y1 != 'O':
            # wrap predicted chunk tokens in <> so HighlightText colors them
            final_text.append('<{}>'.format(x1[0]))
            if y1[0] == 'I':
                # continuation of the current chunk keeps the same color
                color.append(color[-1])
            else:
                color.append({'color': random.choice(['blue', 'green', 'red', 'magenta'])})
        else:
            final_text.append(x1[0])
    final_text.append('\n')
# You can either create a HighlightText object
HighlightText(x=0, y=0.75,
              s=' '.join(final_text),
              highlight_textprops=color,
              ax=ax)
plt.axis('off')

We have tried generating results for 10 test samples.
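
Beyond eyeballing a few samples, we can also get quantitative numbers with the metrics helpers shipped with sklearn-crfsuite. A minimal sketch (dropping the dominant ‘O’ class so the score reflects actual NP chunks; the averaging choice is my own):

from sklearn_crfsuite import metrics

y_pred = crf.predict(X_test)
labels = [l for l in crf.classes_ if l != 'O']   # ignore the 'O' tag
print(metrics.flat_f1_score(y_test, y_pred, average='weighted', labels=labels))
print(metrics.flat_classification_report(y_test, y_pred, labels=labels, digits=3))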

On first observation, the results look decent, with proper nouns going completely missing (as we trained the CRF only on NP chunks matched by our simple pattern, which covers common nouns but not proper nouns). A few of the errors observed can be due to:

  • The way we generated labels is very hacky. These labels should have been reviewed/verified by a human before training the CRF, as a single regex pattern can't cover all NP chunks.
  • As this was just for demo purposes, the complete dataset wasn’t used to save time.
  • The results should surely improve if we 1) did some hyperparameter tuning (see the sketch after this list), 2) used the complete dataset, 3) used a better label generation methodology, and 4) maybe engineered some more features.
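
For point 1, a randomized search over c1 & c2, following the pattern from the sklearn-crfsuite documentation, could look like the sketch below (the parameter distributions & search budget are assumptions for illustration):

import scipy.stats
from sklearn.metrics import make_scorer
from sklearn.model_selection import RandomizedSearchCV
from sklearn_crfsuite import metrics

labels = [l for l in crf.classes_ if l != 'O']
f1_scorer = make_scorer(metrics.flat_f1_score, average='weighted', labels=labels)
params_space = {
    'c1': scipy.stats.expon(scale=0.5),   # L1 regularization strength
    'c2': scipy.stats.expon(scale=0.05),  # L2 regularization strength
}
rs = RandomizedSearchCV(crf, params_space, cv=3, n_iter=20, scoring=f1_scorer, verbose=1)
rs.fit(X_train, y_train)
print(rs.best_params_, rs.best_score_)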

Nonetheless, we have trained a custom CRF finally !!

As we observed, CRFs can be a big gun for extracting not just person or city names, but any custom category given we have some labels. Also, as we saw in my last post, it's no black box & works pretty easily with just some feature engineering required from the programmer's end.
