Training a Swedish NER model for Stanford CoreNLP, part 2

Andreas Klintberg
Nov 2, 2015


This is the second part of creating a NER model for Swedish. Some time has passed since I wrote the first part; we have now prepared training data for Swedish and will start training the model using Stanford CoreNLP.

Now that we have our training data (see part 1), we will train our classifier. One important thing I searched for was how the trainer handles multi-word entities, for instance “Barack Obama” and “United States of America”: would it treat them as the separate tokens “Barack” and “Obama”, or as the full name? It turns out the trainer is quite smart: any consecutive tokens with the same label are treated as one entity, as explained by Christopher Manning here.
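For example, a two-token name would appear in the training data as two consecutive rows with the same label (a made-up snippet; the token and label columns are tab-separated):

Barack	PER
Obama	PER
besökte	O
Stockholm	LOC

Because “Barack” and “Obama” are adjacent and both labeled PER, the classifier treats them as a single PER entity.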

Most of this tutorial is based on Stanford’s own tutorial, found here: http://nlp.stanford.edu/software/crf-faq.shtml#b

Getting started, the properties file

As with POS training, we first create a properties file:

trainFile = output_clean_training.txt
serializeTo = se-ner-model.ser.gz
map = word=0,answer=1

useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
useDisjunctive=true

What are all of these? I will explain the most important ones below.

trainFile = output_clean_training.txt

This is the text file containing your training data.

serializeTo = se-ner-model.ser.gz

This is the filename that the trained model will be serialized to.

map = word=0,answer=1

This is a rather important one: it describes the structure of your training file, i.e. which column holds the word and which holds the label. The number is the column index (starting at 0) of each field. If you have other features in other columns, you define those here as well.
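As a sketch: with map = word=0,answer=1, each line of the training file holds the token in column 0 and the label in column 1. If your file also contained, say, a part-of-speech tag in a middle column (a hypothetical layout; the tokens and tags below are just illustrative), the mapping could be word=0,tag=1,answer=2:

Stockholm	NN	LOC
ligger	VB	O
i	PP	O
Sverige	NN	LOC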

useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
useDisjunctive=true

Stanford CoreNLP NER uses a CRF (conditional random field), which is related to Markov models; basically, it’s a sequence classifier. The remaining lines of the properties file (shown above) select which features to use during training.

Training time!

Next, it’s time for training:

java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop swedish-ner.prop

This is the command to run the training. First, make sure to download the latest stanford-ner, found here: http://nlp.stanford.edu/software/CRF-NER.shtml

Copy your training file and properties file into the same folder as the downloaded stanford-ner.jar, and run the command from that folder (unless you add the jar’s location to your classpath).

I had to add

-mx8g 

to increase the heap size, since I ran out of memory. Whether you need this depends on the size of your training data.
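Putting it together, the full training command with the increased heap looks like this:

java -mx8g -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop swedish-ner.prop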

Now we wait!

It will probably take a couple of hours to finish. It’s hard to say exactly how much training data you need, but there are some general rules of thumb:

  • Try to keep the labels/categories (PER, LOC, ORG, MISC) somewhat balanced.
  • Around ~35,000 sentences is a good starting point to aim for; if each sentence is about 10 words, that gives you 350,000 rows in your training data file. This is something you will have to experiment with, though; it depends on the language, the accuracy you’re aiming for, etc.
  • Don’t include too many sentences that contain no entities at all.
  • Stanford CoreNLP treats a blank line as a document boundary, so make sure your training data has some. My scrambled sentences had no blank lines at all, so the first runs did not work; I ended up putting a blank line between every sentence (see the snippet after this list). Maybe not optimal, but better than none.
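To make that concrete, here is what two sentences separated by a blank line look like in the training file (made-up tokens, tab-separated):

Kalle	PER
bor	O
i	O
Malmö	LOC
.	O

Hon	O
besökte	O
Sverige	LOC
.	O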

Testing the model

As you’ve splitted your data into one training part and one testing part, now it is time to test your model using the testing data. Usually 1/10th of the data should be testing data, so about in the case of above, around 3500 sentences.

java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier se-ner-model.ser.gz -testFile output_clean_test.txt

Results

  • P, precision
  • R, recall
  • F1, F-score (the harmonic mean of P and R: F1 = 2PR / (P + R))
  • TP, true positives: the entities it found that are correctly classified
  • FP, false positives: the tokens it thought were entities, but were not
  • FN, false negatives: the tokens it didn’t think were entities, but actually were

The results looked almost too good; if they were accurate, the model would be VERY good. But given the quality of the training data (automatically annotated using a gazetteer), the real performance is probably not as good as it seems.

That is it. You will probably need to tweak the parameters quite a lot to get better results. Even more important is your training data: it has a huge impact, and if it is low quality, performance will be poor. I have an idea of running a sort of Mechanical Turk job to annotate Swedish data and releasing it on GitHub for others to use, but we will see when that may happen.

The model is available for download here: https://bitbucket.org/klintan/ner-model-swedish/ (initial model; will hopefully be updated). The Swedish NER dataset is available for download here: https://github.com/klintan/swedish-ner-corpus
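If you want to use the serialized model from Java rather than from the command line, a minimal sketch could look like the following (the class name and example sentence are mine; it assumes stanford-ner.jar is on the classpath):

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class SwedishNerDemo {
    public static void main(String[] args) throws Exception {
        // Load the CRF model serialized by the training step above
        CRFClassifier<CoreLabel> classifier = CRFClassifier.getClassifier("se-ner-model.ser.gz");
        // classifyToString tags each token inline, e.g. Barack/PER Obama/PER
        System.out.println(classifier.classifyToString("Barack Obama besökte Stockholm."));
    }
}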

I work for Meltwater, where we do things like this every day; feel free to check it out: www.meltwater.com
