Training a Swedish POS-tagger for Stanford CoreNLP

Andreas Klintberg
Apr 3, 2015

This is a very short tutorial on how to train a CoreNLP POS model for Swedish, since none ships with the CoreNLP package and I haven’t found an open-source one yet. The model will be available at https://github.com/klintan/corenlp-swedish-pos-model

Introduction

From Wikipedia: “part-of-speech tagging (POS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition, as well as its context”

It is sometimes loosely described as shallow parsing, since it does not build a deeper structure over the different parts of the sentence. A simple form of POS tagging is what you are taught in the first years of school: identifying words as nouns, verbs, adjectives, and adverbs.

Training

First we need some training data for our Swedish POS tagger. I used the Talbanken part of the Swedish Treebank (http://stp.lingfil.uu.se/~nivre/swedish_treebank/), which also comes with a conversion to Stanford dependencies.

After downloading it, we have two files:

  1. talbanken-stanford-train.conll
  2. talbanken-stanford-test.conll
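
Before writing any configuration, it helps to check which columns of these files hold the word and the POS tag. The sketch below assumes the files follow the usual 10-column, tab-separated CoNLL-X layout. It runs on a tiny made-up sample (the file name sample.conll, the tokens, and the tags are illustrative, not taken from Talbanken), so check the real files in the same way.

```shell
# Build a tiny, made-up sample in CoNLL-X layout (10 tab-separated columns).
# The tokens and tags are illustrative only, not copied from Talbanken.
printf '1\tHon\t_\tPN\tPN\t_\t2\tnsubj\t_\t_\n'  > sample.conll
printf '2\tsover\t_\tVB\tVB\t_\t0\troot\t_\t_\n' >> sample.conll

# awk fields are 1-based: $2 is the word form, $5 the fine-grained POS tag.
# Blank lines (NF == 0), which separate sentences in CoNLL files, are skipped.
awk -F'\t' 'NF { print $2 "\t" $5 }' sample.conll
```

On the real data, replace sample.conll with talbanken-stanford-train.conll and pipe the output through head. If the tagger counts columns from zero, as I believe it does, these two fields correspond to columns 1 and 4.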

Javadoc for the training class: http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/tagger/maxent/MaxentTagger.html

Now it’s time to download the Stanford CoreNLP library and its dependencies: http://nlp.stanford.edu/software/corenlp.shtml

After you’ve downloaded the POS tagger part (use the -full download to get all the models, German, French, etc.), it’s time to create your .props file, which contains all the settings for training your model.

This is the props file I used for the first training iteration (I will probably add things to improve performance, but it is hopefully a good starting point):
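
As a minimal sketch of what such a props file can look like (not necessarily identical to the one used for the published model), the snippet below writes a small swedish-pos.props. The model file name, the arch value, and the column indices are assumptions on my part. Check the column indices against your own copy of the treebank, and see the MaxentTagger Javadoc for the full list of properties.

```shell
# Write a minimal props file. Every value here is an assumption to adjust:
#   model     - hypothetical output file name for the trained model
#   arch      - "generic" selects language-independent features
#   trainFile - zero-based column indices assumed for CoNLL-X style files
cat > swedish-pos.props <<'EOF'
# Where the trained model is written (hypothetical name)
model = swedish-pos.tagger
# Feature architecture (assumed starting point)
arch = generic
# format, word column, tag column, then the training file name
trainFile = format=TSV,wordColumn=1,tagColumn=4,talbanken-stanford-train.conll
encoding = UTF-8
EOF

# Show the result
cat swedish-pos.props
```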

There are a number of important lines:

1. The “format=TSV” part of the trainFile property tells the tagger that the training file is tab-separated; wordColumn is the column in the file that contains the word, and tagColumn is the column that contains the POS tag. The last part is the training file name.

2. The other important line is the model property, which sets the name of the model file.

Once this file is saved under an appropriate name, for instance swedish-pos.props, you are ready to go!

Run the training command in the directory of your downloaded “stanford-corenlp-pos-full” folder, and you will soon have a trained model (mine was something like: “swedish/pos/stanford-postagger-full-2015-01-30”).
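
A training invocation can be sketched roughly as follows. The jar name stanford-postagger.jar, the heap size, and the props file name are assumptions; adjust them to match your download. The class name is the one documented in the Javadoc linked above.

```shell
# Assumed file names; adjust to match your download and props file.
JAR="stanford-postagger.jar"
PROPS="swedish-pos.props"
MAIN="edu.stanford.nlp.tagger.maxent.MaxentTagger"

if [ -f "$JAR" ] && [ -f "$PROPS" ]; then
    # -Xmx2g raises the Java heap size; the amount is a guess.
    java -Xmx2g -cp "$JAR" "$MAIN" -props "$PROPS"
else
    echo "Expected $JAR and $PROPS in $(pwd)" >&2
fi
```

The check around the java call is only there so the snippet fails with a readable message when it is run from the wrong directory.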

Testing

Testing is also described in the MaxentTagger Javadoc: http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/tagger/maxent/MaxentTagger.html

To run the test, use the same props file that was used at training time.
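
A test run can be sketched like this, under the same assumptions about file names as for training. The -testFile value uses the same format=TSV description as the training file; the column indices shown are assumed and must match whatever your props file uses.

```shell
# Assumed file names; adjust to match your download and props file.
JAR="stanford-postagger.jar"
PROPS="swedish-pos.props"
MAIN="edu.stanford.nlp.tagger.maxent.MaxentTagger"
# Same TSV description as in training, pointing at the held-out test file.
TEST_FILE="format=TSV,wordColumn=1,tagColumn=4,talbanken-stanford-test.conll"

if [ -f "$JAR" ] && [ -f "$PROPS" ]; then
    # Reuses the training props (including the model path) and scores the test file.
    java -Xmx2g -cp "$JAR" "$MAIN" -props "$PROPS" -testFile "$TEST_FILE"
else
    echo "Expected $JAR and $PROPS in $(pwd)" >&2
fi
```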
