Training a Swedish POS-tagger for Stanford CoreNLP
This is a very short tutorial on how to train a CoreNLP POS model for Swedish, since no Swedish model ships with the CoreNLP package and I haven’t found an open-source one yet. The model will be available at https://github.com/klintan/corenlp-swedish-pos-model
Introduction
From Wikipedia: “part-of-speech tagging (POS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition, as well as its context”
It is also sometimes loosely called shallow parsing, since it does not build a deeper structure over the different parts of the sentence. A simple form of POS tagging is what you are taught in the first years of school: identifying words as nouns, verbs, adjectives and adverbs.
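To make the idea concrete, tagged text is often written as word_tag tokens, with a separator between each word and its tag. The following is only a minimal sketch of how such a string can be split back into (word, tag) pairs. The function name `split_tagged`, the example sentence and the tag labels are all made up for illustration and are not taken from Talbanken.

```python
# Split "word_tag" tokens into (word, tag) pairs.
# The sentence and the tag labels below are invented for illustration.
def split_tagged(text, sep="_"):
    pairs = []
    for token in text.split():
        # rpartition splits on the LAST separator, so a word that
        # itself contains the separator stays in one piece.
        word, _, tag = token.rpartition(sep)
        pairs.append((word, tag))
    return pairs


tagged = "Hunden_NN springer_VB fort_AB ._MAD"
for word, tag in split_tagged(tagged):
    print(f"{word}\t{tag}")
```

Splitting on the last separator rather than the first is a deliberate choice here, since the tag is always the final part of the token.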
Training
First we need some training data for our Swedish POS-tagger. I’ve used the Talbanken part of the Swedish Treebank (http://stp.lingfil.uu.se/~nivre/swedish_treebank/), which also comes with a conversion to Stanford dependencies.
After downloading it, we get two files:
- talbanken-stanford-train.conll
- talbanken-stanford-test.conll
Javadoc for the training class: http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/tagger/maxent/MaxentTagger.html
Now it’s time to download the Stanford CoreNLP library and its dependencies: http://nlp.stanford.edu/software/corenlp.shtml
After you’ve downloaded the POS-tagger part (use the -full version to get all the models, German, French etc.), it’s time to create your .props file, which contains all the settings for training your model.
This is the props file I used for the first training iteration (I will probably add things to improve performance, but it is hopefully a good starting point):
## tagger training invoked at Tue Jul 08 16:08:39 PDT 2014 with arguments:
model = swedish-pos-tagger-model
arch = words(-1,1),unicodeshapes(-1,1),order(2),suffix(4)
wordFunction =
trainFile = format=TSV,wordColumn=1,tagColumn=4,talbanken-stanford-train.conll
closedClassTags =
closedClassTagThreshold = 40
curWordMinFeatureThresh = 2
debug = false
debugPrefix =
tagSeparator = _
encoding = iso-8859-1
iterations = 100
lang =
learnClosedClassTags = false
minFeatureThresh = 2
openClassTags =
rareWordMinFeatureThresh = 10
rareWordThresh = 5
search = qn
sgml = false
sigmaSquared = 0.0
regL1 = 0.75
tagInside =
tokenize = false
tokenizerFactory =
tokenizerOptions = asciiQuotes
verbose = true
verboseResults = true
veryCommonWordThresh = 250
xmlInput = null
outputFile =
outputFormat = slashTags
outputFormatOptions =
nthreads = 1
There are a number of important lines:
1. The training file:
trainFile = format=TSV,wordColumn=1,tagColumn=4,talbanken-stanford-train.conll
“format=TSV” tells the tagger that the training file is tab-separated. wordColumn is the column that contains the word and tagColumn is the column that contains the POS tag, both counted from zero. The last part is the training file name.
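The way this specification maps onto the file can be sketched as follows. This is not how CoreNLP parses it internally, only an illustration under the assumption that columns are zero-based and fields are separated by tabs. The helper names `parse_file_spec` and `read_word_tag_pairs` are hypothetical, and the two sample rows are invented CoNLL-style lines, not real Talbanken data.

```python
def parse_file_spec(spec):
    """Split 'key=value,...,filename' into (options, filename)."""
    *option_parts, filename = spec.split(",")
    options = dict(part.split("=", 1) for part in option_parts)
    return options, filename


def read_word_tag_pairs(lines, word_column, tag_column):
    """Yield (word, tag) from tab-separated lines, skipping the blank
    lines that separate sentences."""
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        fields = line.split("\t")
        yield fields[word_column], fields[tag_column]


spec = "format=TSV,wordColumn=1,tagColumn=4,talbanken-stanford-train.conll"
options, filename = parse_file_spec(spec)

# Invented sample rows in a CoNLL-like layout (column 0 is the token id).
sample = [
    "1\tHunden\t_\tNN\tNN\t_\t2\tnsubj\t_\t_\n",
    "2\tspringer\t_\tVB\tVB\t_\t0\troot\t_\t_\n",
    "\n",
]
pairs = list(read_word_tag_pairs(
    sample, int(options["wordColumn"]), int(options["tagColumn"])))
print(filename, pairs)
```

Running something like this over the first few lines of the real training file is a quick way to check that the chosen column numbers really point at the word and the tag.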
2. The name of the model, which is the file the trained model is written to:
model = swedish-pos-tagger-model
When the props file is saved under an appropriate name, for instance swedish-pos.props, you are ready to go!
Run this command in the directory of your downloaded “stanford-corenlp-pos-full” folder (mine was something like “swedish/pos/stanford-postagger-full-2015-01-30”):
java -classpath stanford-postagger-3.5.1.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -prop swedish-pos.props
You will soon have a trained model.
Testing
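Testing is also described in the MaxentTagger Javadoc: http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/tagger/maxent/MaxentTagger.html
To run the test, use:
java -classpath stanford-postagger-3.5.1.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -prop swedish-pos.props -model swedish-pos-tagger-model -testFile format=TSV,wordColumn=1,tagColumn=4,talbanken-stanford-test.conll
This uses the same props file as at training time. The model name must match the one set in the props file, and the test file takes the same format specification as the training file so that it is read as tab-separated columns.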
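For reference, the kind of figures a tagger evaluation typically reports, per-token accuracy and per-sentence accuracy, can be computed as in the sketch below. This is not CoreNLP’s own evaluation code. The function name `tagging_accuracy` is hypothetical and the gold and predicted tags are invented.

```python
def tagging_accuracy(gold_sentences, predicted_sentences):
    """Return (token_accuracy, sentence_accuracy) for two parallel
    lists of tag sequences."""
    total_tokens = correct_tokens = 0
    total_sentences = correct_sentences = 0
    for gold, predicted in zip(gold_sentences, predicted_sentences):
        if len(gold) != len(predicted):
            raise ValueError("gold and predicted sentences differ in length")
        matches = sum(g == p for g, p in zip(gold, predicted))
        total_tokens += len(gold)
        correct_tokens += matches
        total_sentences += 1
        # A sentence counts as correct only if every tag matches.
        correct_sentences += int(matches == len(gold))
    token_acc = correct_tokens / total_tokens if total_tokens else 0.0
    sentence_acc = correct_sentences / total_sentences if total_sentences else 0.0
    return token_acc, sentence_acc


# Invented example: one wrong tag in the second sentence.
gold = [["NN", "VB"], ["PN", "VB", "NN"]]
predicted = [["NN", "VB"], ["PN", "NN", "NN"]]
token_acc, sentence_acc = tagging_accuracy(gold, predicted)
print(f"token accuracy: {token_acc:.2f}, sentence accuracy: {sentence_acc:.2f}")
```

Sentence accuracy is always the stricter of the two numbers, since a single wrong tag makes the whole sentence count as wrong.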