Custom-trained NLTK Stanford NER tagger and spaCy NER tagger comparison

Subodh Shakya
7 min read · Aug 22, 2019


Introduction to NER (Named Entity Recognition)

In Natural Language Processing (NLP), NER is the process of tagging the named entities present in text in order to recognize them as person, date, organization, etc.

For example:

Shushant (person) is marrying Dolly (person) on 2019-8-10 (date)

An NER tagger is very useful in many NLP tasks. Two very popular libraries for NLP are the Natural Language Toolkit (NLTK) and spaCy, both of which provide their own NER tagger. These come pre-trained and can be used directly, or they can be custom trained to make them more useful for a specific task.

This blog is all about custom training the NER taggers of NLTK and spaCy and comparing them.

Step 1: Implementing NER using NLTK

The Stanford NER tagger used through NLTK is written in Java, so we need to set up a proper JVM (Java Virtual Machine) on our computer.

a. Install Java JRE 8 or higher. You can also install the Java JDK, as it contains the JRE. Please follow the official Java documentation to install it. Linux users can find instructions in this guide: Install Java JRE and Java JDK

Check your installation with the command:

echo $JAVA_HOME
  • My path for Java is /usr/lib/jvm/java-8-oracle
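
You can also check the Java runtime directly:

java -version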

b. Install NLTK with the following command:

pip install nltk

c. Get the Stanford NER tagger. Download the zip file stanford-ner-xxxx-xx-xx.zip: see the 'Download' section of The Stanford NLP website.

Unzip it and move the NER engine stanford-ner.jar and the zipped English model english.all.3class.distsim.crf.ser.gz to your application folder:

cd /home/scs/Downloads/
unzip stanford-ner-2017-06-09.zip
mv stanford-ner-2017-06-09/stanford-ner.jar {yourAppFolder}/stanford-ner-tagger/stanford-ner.jar
mv stanford-ner-2017-06-09/classifiers/english.all.3class.distsim.crf.ser.gz {yourAppFolder}/stanford-ner-tagger/ner-model-english.ser.gz

There are two files in our stanford-ner-tagger folder:

  1. stanford-ner.jar (NER engine)
  2. ner-model-english.ser.gz (pretrained model)

Create a new Python file (say decide_ner.py) and copy the following code into it:

import os
import nltk
from nltk.tag.stanford import StanfordNERTagger
sentence = u"John went to Herald University. There he read Python Programming Language."
jar_engine = os.getcwd() + '/stanford-ner-tagger/stanford-ner.jar'
model = os.getcwd() + '/stanford-ner-tagger/ner-model-english.ser.gz'

# load the NER tagger with the English model
entity_tagger = StanfordNERTagger(model, jar_engine, encoding='utf8')

# tokenize the sentence
words = nltk.word_tokenize(sentence)

# run the NER tagger on the tokens
tagged_entity = entity_tagger.tag(words)
print(tagged_entity)

In the terminal, run:

python decide_ner.py

output

[('John','PER'),('went','O'),('to','O'),('Herald','ORG'),('University','ORG'),('.','O'),('There','O'),('he','O'),('read','O'),('Python','O'),('Programming','O'),('Language','O'),('.','O')]

The output is decent, but what if I want to tag Python as a skill for certain use cases? We need to train the model on our own dataset where Python is labeled as SKILL. Let's see how to do it.

Step 2: Custom train Stanford NER tagger

You first need to create a data.tsv file in the {yourAppFolder}/stanford-ner-tagger/train folder, with each token and its label separated by a tab. The data should look like this:

John PER
went O
to O
Herald ORG
University ORG
. O
There O
he O
read O
python SKILL
Programming SKILL
Language SKILL

Note: for more documents, keep adding data in the same format, but separate the documents with a blank line.

e.g.

token1_of_doc1 label
token2_of_doc1 label

token1_of_doc2 label
token2_of_doc2 label

After that, create a prop.txt file in the same folder. The map line tells the trainer that column 0 of data.tsv holds the word and column 1 the answer (the label). Your prop file should look like this:

trainFile = train/data.tsv
serializeTo = own_ner_model.ser.gz
map = word=0,answer=1

useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
useDisjunctive=true

To train it, use the following command in the terminal:

cd stanford-ner-tagger/
java -cp "stanford-ner.jar:lib/*" -mx4g edu.stanford.nlp.ie.crf.CRFClassifier -prop train/prop.txt
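
Training writes the serialized model to own_ner_model.ser.gz, as set by serializeTo in prop.txt. As a quick sanity check, you can run the trained classifier from the command line on a plain-text file (sample.txt here is a hypothetical test file):

java -cp "stanford-ner.jar:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier own_ner_model.ser.gz -textFile sample.txt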

Now change the decide_ner.py file above to load the custom trained model as:

import os
import nltk
from nltk.tag.stanford import StanfordNERTagger
sentence = u"John went to Herald University. There he read Python Programming Language."
jar_engine = os.getcwd() + '/stanford-ner-tagger/stanford-ner.jar'

# load your own NER model
model = os.getcwd() + '/stanford-ner-tagger/own_ner_model.ser.gz'
entity_tagger = StanfordNERTagger(model, jar_engine, encoding='utf8')

# tokenize the sentence
tokens = nltk.word_tokenize(sentence)

# run the NER tagger on the tokens
tagged_entity = entity_tagger.tag(tokens)
print(tagged_entity)

output

[('John','PER'),('went','O'),('to','O'),('Herald','ORG'),('University','ORG'),('.','O'),('There','O'),('he','O'),('read','O'),('Python','SKILL'),('Programming','SKILL'),('Language','SKILL'),('.','O')]

This is how you train and use Stanford NER tagger.

Step 3: Implementing NER using spaCy

Follow the spaCy official documentation for NER tagging. It is simple compared to the Stanford NER for NLTK, as the tagger comes built into the spaCy package.
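
For reference, a minimal sketch using a pretrained spaCy model (this assumes the small English model has been installed with python -m spacy download en_core_web_sm):

import spacy

# load the pretrained small English model
nlp = spacy.load('en_core_web_sm')
doc = nlp(u"John went to Herald University. There he read Python Programming Language.")

# print each recognized entity with its label
for ent in doc.ents:
    print(ent.text, ent.label_)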

Step 4: Custom NER tagging in spaCy

We will now go through the process of creating our own NER model in spaCy. Create a list of tokens in a data.tsv file and, in the tsv file, label the tokens as per our requirements:

John PER
went O
to O
Herald ORG
University ORG
. O
There O
he O
read O
python SKILL
Programming SKILL
Language SKILL

The data is the same as the data.tsv prepared for the Stanford NER of NLTK.

Now convert the tsv file to the Dataturks JSON format. There are also many other formats which can be used to create training data for spaCy; here, we will use the Dataturks JSON format.

import json
import logging

def tsv_to_json_format(input_path, output_path, unknown_label):
    try:
        f = open(input_path, 'r')    # input file (tsv)
        fp = open(output_path, 'w')  # output file (json)
        data_dict = {}
        annotations = []
        label_dict = {}
        s = ''
        start = 0
        for line in f:
            # a '.' line marks the end of the current document
            if line[0:len(line) - 1] != '.\tO':
                word, entity = line.split('\t')
                s += word + " "
                entity = entity[:len(entity) - 1]  # strip the trailing newline
                if entity != unknown_label and len(entity) != 1:
                    d = {'text': word, 'start': start, 'end': start + len(word) - 1}
                    try:
                        label_dict[entity].append(d)
                    except KeyError:
                        label_dict[entity] = [d]
                start += len(word) + 1
            else:
                data_dict['content'] = s
                s = ''
                label_list = []
                for ents in list(label_dict.keys()):
                    for i in range(len(label_dict[ents])):
                        if label_dict[ents][i]['text'] != '':
                            l = [ents, label_dict[ents][i]]
                            # merge repeated occurrences of the same text under one entry
                            for j in range(i + 1, len(label_dict[ents])):
                                if label_dict[ents][i]['text'] == label_dict[ents][j]['text']:
                                    di = {'start': label_dict[ents][j]['start'],
                                          'end': label_dict[ents][j]['end'],
                                          'text': label_dict[ents][i]['text']}
                                    l.append(di)
                                    label_dict[ents][j]['text'] = ''
                            label_list.append(l)

                for entities in label_list:
                    label = {'label': [entities[0]], 'points': entities[1:]}
                    annotations.append(label)
                data_dict['annotation'] = annotations
                annotations = []
                json.dump(data_dict, fp)
                fp.write('\n')
                data_dict = {}
                start = 0
                label_dict = {}
    except Exception:
        logging.exception("unable to process file")
        return None

tsv_to_json_format("data.tsv", "spacydataset.json", 'O')

The above code converts the tsv file to the required JSON format. Here, we pass the location of our tsv file along with the location where we want to store the created JSON file. The unknown_label argument ('O') is used to ignore labels which we do not want to include as entities. For example, if a token is labeled 'O', it will not be counted as an entity.
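
For the sample data above, the first line written to spacydataset.json should look roughly like this (one JSON object per document, with character offsets computed by the code above):

{"content": "John went to Herald University ", "annotation": [{"label": ["PER"], "points": [{"text": "John", "start": 0, "end": 3}]}, {"label": ["ORG"], "points": [{"text": "Herald", "start": 13, "end": 18}]}, {"label": ["ORG"], "points": [{"text": "University", "start": 20, "end": 29}]}]}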

Now, after creating the JSON file, we will convert it to the format which spaCy requires for training the model.

import json
import pickle
import logging

def create_data(input_file='spacydataset.json', output_file='spacydatapckl'):
    try:
        training_data = []
        with open(input_file, 'r') as f:
            lines = f.readlines()
        for line in lines:
            data = json.loads(line)
            text = data['content']
            entities = []
            for annotation in data['annotation']:
                point = annotation['points'][0]
                labels = annotation['label']
                if not isinstance(labels, list):
                    labels = [labels]
                for label in labels:
                    # spaCy expects (start, end, label) with an exclusive end offset
                    entities.append((point['start'], point['end'] + 1, label))
            training_data.append((text, {"entities": entities}))
        # serialize the training data with pickle
        with open(output_file, 'wb') as fp:
            pickle.dump(training_data, fp)
    except Exception:
        logging.exception('unable to process file')
        return None

create_data()

Here, we serialize the training data using pickle. In the above function, we give the path where our JSON file lies and the location where we want to store the serialized file.
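
To verify the serialized file, you can load it back and inspect the first record:

import pickle

with open('spacydatapckl', 'rb') as fp:
    TRAIN_DATA = pickle.load(fp)
print(TRAIN_DATA[0])  # (text, {'entities': [(start, end, label), ...]})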

import pickle
import random
from pathlib import Path
import spacy
from spacy.util import minibatch, compounding

# load the training data serialized in the previous step
with open('spacydatapckl', 'rb') as fp:
    TRAIN_DATA = pickle.load(fp)

def train(model=None, new_model_name='new_model', output_dir="spacy/spacyNER", n_iter=100):
    if model is not None:
        nlp = spacy.load(model)
        print("Loaded model %s" % model)
    else:
        nlp = spacy.blank('en')
        print("Created blank 'en' model")
    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
    else:
        ner = nlp.get_pipe('ner')
    # add the labels found in the training data
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])
    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):
        if model is None:
            nlp.begin_training()
        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            losses = {}
            batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, drop=0.5, losses=losses)
            print("Losses", losses)
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir(parents=True)
        nlp.meta['name'] = new_model_name  # rename the model
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)
        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        for text, _ in TRAIN_DATA:
            doc = nlp2(text)
            print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
            print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])

train()

With the help of the above code, we train our model in spaCy. The major steps in the above process are:

  • Load the pickle file
  • Create a new spaCy model
  • Create a new pipeline and add 'ner' to it
  • Select only the entities from the training data (leaving out the tokens)
  • Disable all the other pipes besides 'ner'
  • Train the model
  • After training completes, save the model to the path passed to the function

Now we have successfully created our own spaCy model. We can load it as follows:

model = spacy.load("spacy/spacyNER")

Inside load, we give the path where our model is saved.

Testing the Model on new data

Now that we have completed creating our own spaCy model, we will test it on new data. Here, we tag the text:

import spacy

model = spacy.load("spacy/spacyNER")
spacyResult = []
text = "John went to Oxford University. There he studied Java Programming Language."
doc = model(text)
for token in doc:
    if not token.is_space:
        # tokens with no entity get the default label 'O'
        label = token.ent_type_ if token.ent_type != 0 else 'O'
        spacyResult.append((token.text, label))
print(spacyResult)

For the tokens which have no entity, we set a default label 'O'. Then we append each token and its entity value to a list.

[('John','PER'),('went','O'),('to','O'),('Oxford','ORG'),('University','ORG'),('.','O'),('There','O'),('he','O'),('studied','O'),('Java','SKILL'),('Programming','SKILL'),('Language','SKILL'),('.','O')]

Confusion matrices for the custom trained Stanford NLTK NER tagger and the spaCy NER tagger, and their comparison:

We successfully built the custom NER taggers for both NLTK and spaCy, this time using four resume documents. We then computed the confusion matrix for both:

fig: code snippet for constructing the confusion matrix for spaCy NER

fig: code snippet for computing the confusion matrix from the spaCy outputs

fig: confusion matrix for spaCy NER
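
The snippets above are shown only as images; a minimal sketch of the idea using scikit-learn, assuming true_labels and pred_labels are parallel per-token lists of gold and predicted tags, could look like this:

from sklearn.metrics import confusion_matrix, classification_report

# dummy parallel lists of gold and predicted per-token tags, for illustration
true_labels = ['PER', 'O', 'O', 'ORG', 'ORG', 'O']
pred_labels = ['PER', 'O', 'O', 'ORG', 'O', 'O']
print(confusion_matrix(true_labels, pred_labels))
print(classification_report(true_labels, pred_labels))  # precision, recall, f1-score, support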

Similarly, we computed the confusion matrix for the Stanford NER of NLTK and got the following result:

fig: confusion matrix for Stanford NER

In the confusion matrix tables that we obtained from spaCy and NLTK, we can view the model's accuracy, precision, recall and F1-score.

Comparing the two confusion matrices, we can see some differences. In the support column, the spaCy model identified 26 labels as ORG and 4 as UNI, whereas the NLTK model failed to identify any labels as ORG or UNI. There are also differences in the counts for the other labels.

The gap between the average precision and recall is much smaller for spaCy (0.96, 0.97) than for NLTK (1, 0.96).

Also, we can optimize the model in spaCy by:

  • changing the number of iterations
  • changing the batch size
  • changing the dropout rate value
  • parameter averaging, and others
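
For example, the iteration count is an argument of the train function above, while the batch size and dropout rate are set inside its training loop (the values below are illustrative, not tuned):

# more training iterations via the n_iter argument
train(n_iter=200)

# inside train(), larger batches and a lower dropout rate would look like:
# batches = minibatch(TRAIN_DATA, size=compounding(8.0, 64.0, 1.001))
# nlp.update(texts, annotations, drop=0.35, losses=losses)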

So, we believe it would be more effective to use spaCy NER tagging over NLTK.
